Revision [7625]
This is an old revision of ResearchSQLTutorial made by JasonChow on 2013-02-05 13:32:26.
First off, a brief description of how information is stored in an SQL database. You should be familiar with how spreadsheets are laid out at this point. if not, please ask a director to provide you access to, 'The_Long_Spreadsheet_Guide' in Google Docs and examine it before proceeding.
Spreadsheets store information in what looks like a two-dimensional table with an x and y axis (example below).
| A | B | C | D | |
|---|---|---|---|---|
| 1 | A1 | B1 | C1 | D1 |
| 2 | A2 | B2 | C2 | The Mighty Ducks |
| 3 | A3 | B3 | C3 | D3 |
| 4 | D4 | B4 | C4 | D4 |
Cells in spreadsheets are referenced by their position in the grid of the table. In the example above, 'The Mighty Ducks' is displayed in cell D2. Using a spreadsheet to store information is a quick and easy solution for MOST data, however we run into issues with repeating data, and large files sizes among others. Also it may be more difficult to standardize data entry and display if data is directly entered into a spreadsheet by different people with different ideas of what "standards" are. What if we had a large historical spreadsheet for candidates running for elections in a particular state? We may have something that looks like the table below, and imagine if we have even more data associated with each row (like biographical information), there could be an infinite number of columns.
| A | B | C | D | |
|---|---|---|---|---|
| 1 | Schumer, Chuck | NY | U.S. Senate | 2010 |
| 2 | Charles Schumer | NY | U.S. Senate | 2004 |
| 3 | Chuck Ellis Schumer | New York | U.S. Senate | 1998 |
| 4 | Schumer | New York | U.S. House District 9 | 1992 |
Using a relational database model with SQL, we are better able to standardize data, and efficiently store it. In looking at the table above, we can see that the official's name not only has a lot of variation, but given the different character lengths it will not be as easy to know how much storage space will be used. Now let's say our data DOES include biographical information in the table above. This bio information would not change from row to row, and we'd basically have biographical information in columns repeated row after row until the new candidate is listed.
SQL allows us to break this information down into tables where each row of data is unique, store each distinct instance of this data only once, and relate the information stored in tables to each other. Since each row is unique, we can associate ids to every row, and use these ids to reference the data stored in the row. Thinking about how this information can be broken down so each table holds distinct information, we can have a candidate table with all biographical information for candidates, and a state table with information about each state. Can you think of any other tables we could break down the the table above into? This may be a difficult concept to grasp, so feel free to take a look at the examples below.
candidate table
| candidate_id | lastname | firstname | middlename | nickname | birthplace | birthdate | homestate | family | education | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 26976 | Schumer | Charles | Ellis | Chuck | Brooklyn, NY | 11/23/1950 | NY | Wife: Iris; 2 Children: Jessica, Alison | JD, Harvard University, 1974 BA, Harvard University, 1971 |
| 2 | 27105 | Boehner | John | A. | West Chester, OH | 11/17/1949 | OH | Wife: Debbie; 2 Children: Lindsay, Tricia | BA, Business, Xavier University, 1977 | |
| 3 | some_id... | some_lastname... | some_firstname... | some_middlename... | some_nickname... | some_birthplace... | some_birthdate... | some_homestate... | some_family... | some_education... |
office table
| office_id | officetype_id | name | title | |
|---|---|---|---|---|
| 1 | 6 | C | U.S. Senate | Senator |
| 2 | 5 | C | U.S. House | Representative |
| 3 | 9 | L | State Senate | Senator |
| 4 | 8 | L | State House | Representative |
| 5 | 7 | L | State Assembly | Assembly Member |
| 6 | some_officeid... | some_officetypeid... | some_name... | some_title... |
As you can see using this method, we can store unique candidate information in one table, and distinct office information in another. This is fine and all, but these tables are only good for information about candidates and offices separately. What happens if we want to find out which offices a candidate has held? If we had a table (junction table) sitting between the candidate and office tables that included candidate_id's as they RELATE to office_id's, we'd be able to bring these two pieces of information together (the RELATIONAL database model comes from these relationships between tables). The table below shows an office_candidate junction table.
office_candidate table
| office_id | candidate_id | |
|---|---|---|
| 1 | 6 | 26976 |
| 2 | 5 | 26976 |
| 3 | some_officeid... | some_candidateid... |
Since we know that Chuck Schumer has served in both the U.S. House and U.S. Senate, our junction table should show candidate_id of 26976 associated with office_id's of 5 AND 6, which equal the U.S. House and U.S. Senate respectively. You can reference the prior office and candidate tables to verify the id's. In our office_candidate table, which offices would Speaker Boehner's candidate_id be associated with?
So let's say we want to display candidate bio information along with their offices, using our junction table, we can join our candidate bio information with our office information. Now that we've briefly gone over some simple concepts about how relational databases work, let's get into the syntax. Hopefully the above concepts will make more sense as you gain some hands-on experience. Feel free to come back and look over the concepts above at any point.
You will need to have OpenOffice.org's Base (OOo Base) program setup to access our database. Please see your director for login information and to get you setup.
The basic SQL command to retrieve information from an SQL database is the SELECT command. You will have to know the table you want to pull information from, the fields in the table you want to pull information from. The SELECT command will always need to be coupled with a FROM specifying where we are selecting information from.
The syntax is as follows: SELECT *fields_from_table* FROM *table_of_interest* ;
Notes:
- We can select all fields in a table, if we had no particular fields of interest by using the asterisk. So if we wanted our query to return all information in all fields in our table of interest, our command would be: SELECT * FROM *table_of_interest* ;
- When selecting fields we're interested in, we need to separate the field names with commas ','.
Notice the semi-colon at the end. Each SQL query is ended by a semi-colon. This tells the program that we are done composing our query. While the OOo Base SQL editor will automatically add the semi-colon if you've forgotten when constructing queries, it's still good form to include them as it's standardized across SQL implementations. This will allow you to construct working queries through Microsoft's Access and SQL Server, MySQL, PostgreSQL, IBM's DB2, Oracle, and SQLite, among others.
To start off, let's construct a query on the office table. We want to know all the office information we have in our database. Before scrolling down, using the SELECT command, how would you construct a query for this?
SELECT * FROM office ;
The office table is relatively small compared to some of the other tables we have in our database. We may run into speed issues when working with data from our speeches, or even candidate tables (our candidate table has 135,547 candidates with all related biographical information). To get a sense of all the fields and data the candidate table stores, lets apply some conditions to limit the data returned in our next query. To limit information returned, we can use the WHERE clause in our query. It follows this syntax: SELECT *fields_from_table* FROM *table_of_interest* WHERE *conditions_we_specify* ;
Knowing the field names are able to apply the specific conditions we're interested in. The following fields are a selection of all fields from our candidate table: candidate_id, lastname, firstname, middlename, nickname, suffix, state_id, family, religion. Knowing these field names, let's say we wanted to show all candidates information in our candidate table who have a home state of Hawaii. How would you construct this query?
SELECT * FROM candidate WHERE state_id = 'HI' ;
Notes:
- Given that the state_id field stores data in text form, we must include single quotes.
- If we were referencing numerical fields such as id's, we would not need these single quotes.
The resulting dataset should include all fields in our candidate table of those with a home state of Hawaii. Now suppose we wanted to display only the candidate_id, firstname, and lastname fields of all candidates in our database in Hawaii. How would you construct this query?
SELECT candidate_id, firstname, lastname FROM candidate WHERE state_id = 'HI' ;
Now let's find all the people with the firstname of 'Joe' from Hawaii, and display their candidate_id, firstname, and lastname. We can use logical operators such as 'AND' and 'OR' in our WHERE clause to limit the results returned.
SELECT candidate_id, firstname, lastname FROM candidate WHERE state_id = 'HI' AND firstname = 'Joe' ;
Now let's explore how we can sort our data. Using ORDER BY, we are able to sort our dataset returned. In our SELECT command, ORDER BY should come after our WHERE clause, and we will just tell it the fields we want our resulting dataset to be ordered by. Let's order (sort) our returned list of Joe's in Hawaii by lastname in ascending order.
SELECT candidate_id, firstname, lastname FROM candidate WHERE state_id = 'HI' AND firstname = 'Joe' ORDER BY lastname ;
By default, all sorts are in ascending order. To specify a descending sort, we need to include DESC after the field we want to sort in descending order. Additionally we can specify multiple fields on our sort by separating the field names with commas.
SELECT candidate_id, firstname, lastname FROM candidate WHERE state_id = 'HI' AND firstname = 'Joe' ORDER BY lastname DESC ;
Now let's construct a query that will show the candidate_id's, firstname, lastname, and state_id's of all people in our candidate table from Hawaii and Idaho with firstname of 'Joe'. Let's then sort the state_id in descending order first, then sort the lastname returned in ascending order. How would you construct this query?
SELECT candidate_id, firstname, lastname, state_id FROM candidate WHERE (state_id = 'HI' OR state_id = 'ID') AND firstname = 'Joe' ORDER BY state_id DESC, lastname ;
Using IN, we can specify limits contained within our WHERE clause that's easier to look at. When using IN we basically list the values we are interested in inside parenthesis '()' of a particular field name. The above example would look like this:
SELECT candidate_id, firstname, lastname, state_id FROM candidate WHERE state_id IN ('HI', 'ID') AND firstname = 'Joe' ORDER BY state_id DESC, lastname ;
Now, let's start returning data from multiple tables. We can use JOIN to bring together information from multiple tables as long as there's a field we can relate the information stored. Let's think about this step by step. Look back up at the office table and office_candidate junction table shown prior to us beginning our exploration of SQL syntax. Can you find a field contained in both tables? The office table has fields of office_id, officetype_id, name, and title. The office_candidate table has fields of office_id, and candidate_id.
We can see that office_id is contained in both tables, so we're able to use that field as our joining field.
Let's display only candidate_id's and their associated office names for our list of candidates from Hawaii and Idaho. The syntax for JOIN in our SELECT statement is as follows:
SELECT office_candidate.candidate_id, office.name FROM office JOIN office_candidate ON (office.office_id = office_candidate.office_id) WHERE (office_candidate.state_id = 'HI' OR office_candidate.state_id = 'ID') ;
Notes:
- When joining tables, we need to explicitly specify how the tables will be joined. This can be done using ON after the JOIN.
Go through the query above, and try to understand how it was constructed. Notice how we have not only built the field names we're interested in into the query, but we have also specified what table these fields are in. This is necessary as larger joins are used, and the same field names may be used even if the tables are different. Think about a candidate who may have moved and possibly ran for office in different states. It's possible that the candidate table includes state_id information for them in their current home state, while the election_candidate table, which includes their election history may also include a state_id field for an election they ran for prior to moving. SQL allows us to rename fields and tables, which helps us better make sense of the syntax we use, and the resultant dataset. Renaming tables in our JOIN will help make our syntax less cluttered.
Let's trying using new names for our tables in the JOIN.
SELECT oc.candidate_id, o.name FROM office o JOIN office_candidate oc ON (o.office_id = oc.office_id) WHERE (oc.state_id = 'HI' OR oc.state_id = 'ID') ;
You can see from the above query, that after listing which tables we are selecting our data from, we've renamed the office table to o, and the office_candidate table to oc.
Something else we can use in PostgreSQL, which may not be available in other SQL implementations is USING with our JOIN. This replaces the ON portion of the JOIN, and allows us to quickly specify what our joining field is. An example is below.
SELECT oc.candidate_id, o.name FROM office o JOIN office_candidate oc USING (office_id) WHERE oc.state_id IN ('HI', 'ID') ;
For lengthy queries, USING may help us not only make our code easier to look at, but also save some typing. However, as with the above explanation of sometimes having the same fields in multiple tables, we may need to use ON instead if we need to explicitly state what our joining field is going to be.
Now that you've had a chance to look over how JOIN works, let's construct a query that returns the some candidate names along with their offices for all people named 'Joe' from 'ID' and 'HI'. Let's then sort our resulting dataset by state_id in descending order, then by lastname in ascending order.
Here are some hints to get you going, if you're wondering where to start:
- Think about what tables you need
- Think about what fields you need
- Think about how you would JOIN tables together
- Think about how to limit the data returned to only data you're interested in
- Think about how we're going to sort our data
This is not the easiest stuff to wrap your head around, so feel free to ask if you need help. Once you've finished going over this, feel free to try out some of the TrainingPracticeQueries on the wiki. Try out this tutorial if you are interested in further developing your SQL skills.