Speech QC Queries


This is a list of queries to be run from time to time to analyze the consistency of our data, i.e. to correct errors and standardize formatting. Notice that the SELECT, FROM, and ORDER BY clauses are all roughly the same. The goal is to display specific information in a specific order so that we can systematically identify and correct public statements containing errors.

Bear in mind some of these queries will return intimidatingly large numbers. Beware of false positives. Out of almost 900,000 speeches, errors are to be expected. Depending on the error, it may be possible to give the query to IT who can fix some errors automatically. Others will have to be done manually.

IT can perform global changes if we provide a .csv with zero false positives or other errors. If IT does a global change and we identify errors later, they will be next to impossible to isolate and fix so it's better to be 100% sure before hand.

Titles - A query that displays all speeches created after a certain date (with title and speech text)
SELECT s.speech_id, c.candidate_id, s.title, s.created, s.speechtext 
FROM speech s JOIN speech_candidate sc USING (speech_id) JOIN candidate c USING (candidate_id) 
WHERE s.created >= '06-15-2015' 
ORDER BY s.created DESC


Pledges - A query that finds speeches with the type: "pledge":
SELECT s.speech_id, s.speechdate, s.title, s.url, s.speechtype_id, s.location
FROM speech s
WHERE s.speechtype_id = 11
GROUP BY s.speech_id
ORDER BY s.title


Thomas - A query that finds speeches with Page or Time breaks:
SELECT c.candidate_id, s.speech_id, s.speechdate, s.title, s.url, s.speechtype_id, s.location
FROM speech s JOIN speech_candidate sc using (speech_id) JOIN candidate c using (candidate_id)
WHERE s.speechtext LIKE '%[Page:%' OR s.speechtext LIKE '%[Time:%'
ORDER BY c.candidate_id, s.speechdate


Thomas - A query that finds the thomas speeches with incorrect URLs:
SELECT c.candidate_id, s.speech_id, s.speechdate, s.title, s.url, s.speechtype_id, s.location
FROM speech s JOIN speech_candidate sc USING (speech_id) JOIN candidate c USING (candidate_id)
WHERE s.url LIKE '%homas.%' AND s.url NOT LIKE '%vote%' AND s.url NOT LIKE '%juan%' AND s.url NOT LIKE '%governor%' AND s.url != 'http://thomas.loc.gov'
ORDER BY s.url, s.speechdate


Thomas - A query that finds the speeches with thomas.loc.gov URLs, but do not have the type Floor Speech
SELECT c.candidate_id, s.speech_id, s.speechdate, s.title, s.url, s.speechtype_id, s.location
FROM speech s JOIN speech_candidate sc USING (speech_id) JOIN candidate c USING (candidate_id)
WHERE s.url LIKE '%homas.%' AND s.url NOT LIKE '%vote%' AND s.url NOT LIKE '%juan%' AND s.url NOT LIKE '%governor%' AND s.speechtype_id != 14
ORDER BY s.url


Thomas - A query that finds the speeches with thomas.loc.gov URLs, but do not have Washington, DC as the location:
SELECT c.candidate_id, s.speech_id, s.speechdate, s.title, s.url, s.speechtype_id, s.location
FROM speech s JOIN speech_candidate sc USING (speech_id) JOIN candidate c USING (candidate_id)
WHERE s.url LIKE '%thomas.loc.gov%' AND s.location != 'Washington, DC'
ORDER BY s.location


Hearing - A query that finds speeches of type hearing, but the title does not reflect our formatting rules:
SELECT c.candidate_id, s.speech_id, s.speechdate, s.title, s.url, s.speechtype_id, s.location
FROM speech s JOIN speech_candidate sc USING (speech_id) JOIN candidate c USING (candidate_id)
WHERE title NOT LIKE '%Hearing of the%' AND speechtype_id = 13
ORDER BY c.candidate_id, s.speechdate


Hearing - A query that finds speeches with Hearing of the in the title, but the type is not hearing:
SELECT c.candidate_id, s.speech_id, s.speechdate, s.title, s.url, s.speechtype_id, s.location
FROM speech s JOIN speech_candidate sc USING (speech_id) JOIN candidate c USING (candidate_id)
WHERE title LIKE '%Hearing of the%' AND speechtype_id != 13
ORDER BY c.candidate_id, s.speechdate


Interview - A query that finds the speeches of type "interview", but the title does not reflect our formatting rules:
SELECT c.candidate_id, s.speech_id, s.speechdate, s.title, s.url, s.speechtype_id, s.location
FROM speech s JOIN speech_candidate sc USING (speech_id) JOIN candidate c USING (candidate_id)
WHERE s.title NOT LIKE '%Transcript%' AND s.speechtype_id = 3
ORDER BY c.candidate_id, s.speechdate


Interview - A query that finds the speeches with Transcript in the title, but the type is not interview:
SELECT c.candidate_id, s.speech_id, s.speechdate, s.title, s.url, s.speechtype_id, s.location
FROM speech s JOIN speech_candidate sc USING (speech_id) JOIN candidate c USING (candidate_id)
WHERE s.title LIKE '%Transcript%' AND s.speechtype_id != 3
ORDER BY c.candidate_id, s.speechdate


Interview - A query that finds speeches from a long time ago with the location as the news show on which they appeared. It's likely these errors will have to be corrected manually. Most are marked as speeches as well:
SELECT speech_id, title, location, speechdate
FROM speech
WHERE location LIKE '%Meet the Press%' OR location LIKE '%60%' OR location LIKE '%FOX%' OR location LIKE '%Hardball%' OR location LIKE '%Morning%' OR location LIKE '%ABC%' OR location LIKE '%CNN%' OR location LIKE '%Larry King%' OR location LIKE '%Late Edition%' OR location LIKE '%This Week%' OR location LIKE '%Hannity%' OR location LIKE '%MSNBC%' OR location LIKE '%CBS%'


Letter - A query that finds speeches of type letter, but the the title does not reflect our formatting rules:
SELECT c.candidate_id, s.speech_id, s.speechdate, s.title, s.url, s.speechtype_id, s.location
FROM speech s JOIN speech_candidate sc USING (speech_id) JOIN candidate c USING (candidate_id)
WHERE s.title NOT LIKE 'Letter to%' AND s.speechtype_id = 2
ORDER BY c.candidate_id, s.speechdate


Letter - A query that finds speeches where the title includes the words letter to, but the type is not letter:
SELECT c.candidate_id, s.speech_id, s.speechdate, s.title, s.url, s.speechtype_id, s.location
FROM speech s JOIN speech_candidate sc USING (speech_id) JOIN candidate c USING (candidate_id)
WHERE s.title LIKE 'Letter to%' AND s.speechtype_id != 2
ORDER BY c.candidate_id, s.speechdate


Issue Position - A query that finds speeches of type Issue Position, but the title does not reflect our formatting rules:
SELECT c.candidate_id, s.speech_id, s.speechdate, s.title, s.url, s.speechtype_id, s.location
FROM speech s JOIN speech_candidate sc USING (speech_id) JOIN candidate c USING (candidate_id)
WHERE s.speechtype_id = 7 AND s.title NOT LIKE '%Issue Position:%'
ORDER BY c.candidate_id, s.speechdate


Issue Position - A query that finds speeches where the title includes the words Issue Position, but the type is not "Issue Position":
SELECT c.candidate_id, s.speech_id, s.speechdate, s.title, s.url, s.speechtype_id, s.location
FROM speech s JOIN speech_candidate sc USING (speech_id) JOIN candidate c USING (candidate_id)
WHERE s.speechtype_id != 7 AND s.title LIKE '%Issue Position:%'
ORDER BY c.candidate_id, s.speechdate


Washington, DC - A query that finds some of the incorrect variations of Washington, DC in the speeches table:
SELECT c.candidate_id, s.speech_id, s.speechdate, s.title, s.url, s.speechtype_id, s.location
FROM speech s JOIN speech_candidate sc USING (speech_id) JOIN candidate c USING (candidate_id)
WHERE s.location LIKE '%ashingto%' AND s.location LIKE '%D%' AND s.location != 'Washington, DC'
ORDER BY s.location, c.candidate_id, s.speechdate


Spanish - A query that finds speeches which include the words prensa, declaracion, discurso that is, probably written in Spanish: Alex Coffman went through the results and fixed 95% of the errors found by this query.
SELECT c.candidate_id, s.speech_id, s.speechdate, s.title, s.url, s.speechtype_id, s.location
FROM speech s JOIN speech_candidate sc USING (speech_id) JOIN candidate c USING (candidate_id)
WHERE s.speechtext LIKE '%discurso%' OR s.speechtext LIKE '%prensa%' OR s.speechtext LIKE '%declaracion%'
ORDER BY c.candidate_id, s.speechdate


Text - A query that finds speeches where click here (regardless of capitalization) appears in the speechtext:
SELECT c.candidate_id, s.speech_id, s.speechdate, s.title, s.url, s.speechtype_id, s.location
FROM speech s JOIN speech_candidate sc USING (speech_id) JOIN candidate c USING (candidate_id)
WHERE s.speechtext ILIKE '%click here%'
ORDER BY c.candidate_id, s.speechdate


URL - A query that finds speeches lacking a URL:
SELECT c.candidate_id, s.speech_id, s.speechdate, s.title, s.url, s.speechtype_id, s.location
FROM speech s JOIN speech_candidate sc USING (speech_id) JOIN candidate c USING (candidate_id)
WHERE s.url = ''
ORDER BY s.speechdate, c.candidate_id


URL - A query that finds speeches lacking a URL, but has one mentioned in the speech text (could be the source):
SELECT s.speechdate, c.candidate_id, s.speech_id, s.speechdate, s.title, s.url, s.speechtype_id, s.location
FROM speech s JOIN speech_candidate sc USING (speech_id) JOIN candidate c USING (candidate_id)
WHERE s.url = '' AND s.speechtext LIKE '%http://%'
ORDER BY s.speechdate


Dates - A query that finds speeches lacking a speechdate:
SELECT c.candidate_id, s.speech_id, s.speechdate, s.title, s.url, s.speechtype_id, s.location
FROM speech s JOIN speech_candidate sc USING (speech_id) JOIN candidate c USING (candidate_id)
WHERE s.speechdate IS NULL AND s.speechtype_id != 7
ORDER BY c.candidate_id


Dates - A query that displays all speeches created in the future (with title and speech text)
SELECT s.speech_id, c.candidate_id, s.title, s.created, s.speechtext 
FROM speech s JOIN speech_candidate sc USING (speech_id) JOIN candidate c USING (candidate_id) 
WHERE s.created > NOW()
ORDER BY s.created DESC


Missing State of the States - A query that finds speeches by governors between Jan 01 YYYY and March 15 YYYY (in search of State of the State speeches):
SELECT c.candidate_id, s.speech_id, s.speechdate, s.title, s.url, s.speechtype_id, s.location
FROM speech s JOIN speech_candidate sc USING (speech_id) JOIN candidate c USING (candidate_id) JOIN office_candidate oc USING (candidate_id) JOIN office o USING (office_id)
WHERE o.officetype_id = 'G' AND s.speechtype_id = 1 AND s.speechdate BETWEEN 'YYYY-01-01' AND 'YYYY-03-15'
ORDER BY s.location


Cats and Tags - A query that finds speeches with a certain tag, but not a certain category (originally used to find speeches with 'sanctions-on-x' without Trade as a category):
SELECT DISTINCT a.speech_id, a.candidate_id, a.title, a.speechdate
FROM

(Select s.speech_id, scan.candidate_id, s.title, s.speechdate
FROM speech_category sc JOIN speech s USING (speech_id) JOIN speech_tag st USING (speech_id) JOIN speech_candidate scan USING (speech_id)
WHERE st.tag_id = ########) a

WHERE a.speech_id NOT IN

(Select s.speech_id
FROM speech_category sc JOIN speech s USING (speech_id) JOIN speech_tag st USING (speech_id)
WHERE sc.category_id = ####### AND st.tag_id = #######)
ORDER BY a.speech_id


SpeechesGuide
Useful Research Queries
Speech Research Queries
There are 2 comments on this page. [Show comments]
Valid XHTML :: Valid CSS: :: Powered by WikkaWiki