Background for SQL Tutorial: PVSWiki

Background for SQL Tutorial

SQL stands for Structured Query Language, and is used as a way to manipulate, store, and retrieve information in many relational databases. To learn more, you can read the link to the Wikipedia page above. At PVS we use the PostgreSQL implementation of SQL. In this tutorial we will focus on the SELECT command to retrieve information stored in our SQL database. A good tutorial for further information is located here.

A brief description of how information is stored in an SQL database
You should be familiar with how spreadsheets are laid out at this point. Spreadsheets store information in what looks like a two-dimensional table with an x and y axis (example below).

	A	B	C	D
1	A1	B1	C1	D1
2	A2	B2	C2	The Mighty Ducks
3	A3	B3	C3	D3
4	A4	B4	C4	D4

Cells in spreadsheets are referenced by their position in the grid of the table. In the example above, 'The Mighty Ducks' is displayed in cell D2. Using a spreadsheet to store information is a quick and easy solution for MOST data, however we run into issues with repeating data, and large files sizes among others. Also it may be more difficult to standardize data entry and display if data is directly entered into a spreadsheet by different people with different ideas of what "standards" are. What if we had a large historical spreadsheet for candidates running for elections in a particular state? We may have something that looks like the table below, and imagine if we have even more data associated with each row (like biographical information), there could be any number of columns.

	A	B	C	D
1	Schumer, Chuck	NY	U.S. Senate	2010
2	Charles Schumer	NY	U.S. Senate	2004
3	Chuck Ellis Schumer	New York	U.S. Senate	1998
4	Schumer	New York	U.S. House District 9	1992

Using a relational database model with SQL, we are better able to standardize data, and efficiently store it. In looking at the table above, we can see that the official's name not only has a lot of variation, but given the different character lengths it will not be as easy to know how much storage space will be used. Now let's say our data DOES include biographical information in the table above. This bio information would not change from row to row, and we'd basically have biographical information in columns repeated row after row until a candidate is listed.

SQL allows us to break this information down into tables where each row of data is unique, store each distinct instance of this data only once, and relate the information stored in tables to each other. Since each row is unique, we can associate IDs (typically numerical, but not always) to every row, and use these IDs to reference the data stored in the row. Thinking about how this information can be broken down so each table holds distinct information, we can have a candidate table with all biographical information for candidates, and an office table with information about each office. Can you think of any other tables we could break down the table above into? This may be a difficult concept to grasp, so feel free to take a look at the examples below.

candidate table

	candidate_id	lastname	firstname	middlename	nickname	birthplace	birthdate	homestate	family	education
1	26976	Schumer	Charles	Ellis	Chuck	Brooklyn, NY	11/23/1950	NY	Wife: Iris; 2 Children: Jessica, Alison	JD, Harvard University, 1974, BA, Harvard University, 1971
2	27105	Boehner	John	A.	West Chester, OH	11/17/1949	OH	Wife: Debbie; 2 Children: Lindsay, Tricia		BA, Business, Xavier University, 1977
3	some_id...	some_lastname...	some_firstname...	some_middlename...	some_nickname...	some_birthplace...	some_birthdate...	some_homestate...	some_family...	some_education...

office table

	office_id	officetype_id	name	title
1	6	C	U.S. Senate	Senator
2	5	C	U.S. House	Representative
3	9	L	State Senate	Senator
4	8	L	State House	Representative
5	7	L	State Assembly	Assembly Member
6	some_officeid...	some_officetypeid...	some_name...	some_title...

As you can see using this method, we can store unique candidate information in one table, and distinct office information in another. This is fine and all, but these tables are only good for information about candidates and offices separately. What happens if we want to find out which offices a candidate has held? If we had a third table (often called a junction table) sitting between the candidate and office tables that included candidate_id's as they RELATE to office_id's, we'd be able to bring these two pieces of information together (the RELATIONAL database model comes from these relationships between tables). The table below shows an office_candidate junction table. We can refer to the distinct id fields created for each row of data as a primary key. Ideally there should be one, and only one unique data record associated with the primary key. It would be confusing to have the primary key refer to two rows with the same values. This may happen at times when one person enters in candidate bio information, and a different individual enters in the same bio information under a new candidate_id. In these instances, we must pick one candidate_id, so that only one id is associated with speeches and statements the candidate has made, ratings interest groups have given the candidate, key votes taken by the official, and associated offices held by the official. When we have these same primary keys in other tables used to relate our information, we refer the these fields as foreign keys. Think of it like this, if the key is used to identify an individual row of data within a particular table, it is likely a primary key, specific to that table. If the key is used to reference another table's specific row of data it is likely that it is a foreign key. Knowing this, would the candidate_id field in the candidate table be a primary or foreign key? Would the candidate_id field in the office_candidate junction table be a primary or foreign key?

office_candidate table

	office_id	candidate_id
1	6	26976
2	5	26976
3	some_officeid...	some_candidateid...

Since we know that Chuck Schumer has served in both the U.S. House and U.S. Senate, our junction table should show candidate_id of 26976 associated with office_id's of 5 AND 6, which equal the U.S. House and U.S. Senate respectively. You can reference the prior office and candidate tables to verify the id's. In our office_candidate table, which offices would Speaker Boehner's candidate_id be associated with?

So let's say we want to display candidate bio information along with their offices. Using our junction table, we can join our candidate bio information with our office information. Now that we've briefly gone over some simple concepts about how relational databases work, let's get into some code and syntax. Hopefully the above concepts will make more sense as you gain some hands-on experience. Feel free to come back and look over the concepts above at any point.

The Language
You will need to have the pgAdmin program setup to access our database. The SQL Specialist will help you with this after your SQL training.

The basic SQL command to retrieve information from an SQL database is the SELECT command. You will have to know the table you want to pull information from, the fields in the table you want to pull information from. The SELECT command will always need to be coupled with a FROM specifying where we are selecting information from. The syntax is as follows: SELECT [fields_from_table] FROM [table_of_interest] ;

Notes:
We can select all fields in a table, if we had no particular fields of interest by using the asterisk. So if we wanted our query to return all information in all fields in our table of interest, our command would be: SELECT * FROM [table_of_interest]
When selecting fields we're interested in, we need to separate the field names with commas ','.

To start off, let's construct a query on the office table. We want to know all the office information we have in our database. Before scrolling down, using the SELECT command, how would you construct a query for this?

SELECT *
FROM office

The office table is relatively small compared to some of the other tables we have in our database. We may run into speed issues when working with data from our speeches, or even candidate tables (our candidate table has nearly 150k candidates with all related biographical information). To get a sense of all the fields and data the candidate table stores, lets apply some conditions to limit the data returned in our next query. To limit information returned, we can use the WHERE clause in our query. It follows this syntax: SELECT [fields_from_table] FROM [table_of_interest] WHERE [conditions_we_specify]

Knowing the field names, we are able to apply the specific conditions we're interested in. The following fields are a selection of all fields from our candidate table: candidate_id, lastname, firstname, middlename, nickname, suffix, state_id, family, religion. Let's say we wanted to show all candidate's information in our candidate table who have a home state of Hawaii. How would you construct this query?

SELECT *
FROM candidate
WHERE state_id = 'HI'

Notes:
Given that the state_id field stores data in text form, we must include single quotes.
If we were referencing numerical fields such as IDs, we would not need these single quotes but it is a good habit.

The resulting dataset should include all fields in our candidate table of those with a home state of Hawaii. Now suppose we wanted to display only the candidate_id, firstname, and lastname fields of all candidates in our database in Hawaii. How would you construct this query?

SELECT candidate_id, firstname, lastname
FROM candidate
WHERE state_id = 'HI'

Now let's find all the people with the firstname of 'Joe' from Hawaii, and display their candidate_id, firstname, and lastname. We can use logical operators such as 'AND' and 'OR' in our WHERE clause to limit the results returned.

SELECT candidate_id, firstname, lastname
FROM candidate
WHERE state_id = 'HI' AND firstname = 'Joe'

Now let's explore how we can sort our data. Using ORDER BY, we are able to sort our dataset returned. In our SELECT command, ORDER BY should come after our WHERE clause, and we will just tell it the fields we want our resulting dataset to be ordered by. Let's order (sort) our returned list of Joe's in Hawaii by lastname in ascending order.

SELECT candidate_id, firstname, lastname
FROM candidate
WHERE state_id = 'HI' AND firstname = 'Joe'
ORDER BY lastname

By default, all sorts are in ascending order. To specify a descending sort, we need to include DESC after the field we want to sort in descending order. Additionally we can specify multiple fields on our sort by separating the field names with commas. When you want to order by multiple fields, the order you list those fields matters.

SELECT candidate_id, firstname, lastname
FROM candidate
WHERE state_id = 'HI' AND firstname = 'Joe'
ORDER BY lastname DESC

Now let's construct a query that will show the candidate_id's, firstname, lastname, and state_id's of all people in our candidate table from Hawaii and Idaho with firstname of 'Joe'. Let's then sort the state_id in descending order first, then sort the lastname returned in ascending order. How would you construct this query?

SELECT candidate_id, firstname, lastname, state_id
FROM candidate
WHERE (state_id = 'HI' OR state_id = 'ID') AND firstname = 'Joe'
ORDER BY state_id DESC, lastname

Using IN, we can specify limits contained within our WHERE clause that's easier to look at. When using IN we basically list the values we are interested in inside parenthesis '()' of a particular field name. The above example would look like this:

SELECT candidate_id, firstname, lastname, state_id
FROM candidate
WHERE state_id IN ('HI', 'ID') AND firstname = 'Joe'
ORDER BY state_id DESC, lastname

Now, let's start returning data from multiple tables. We can use JOIN to bring together information from multiple tables as long as there's a field we can relate the information stored. Let's think about this step by step. Look back up at the office table and office_candidate junction table shown prior to us beginning our exploration of SQL syntax. Can you find a field contained in both tables? The office table has fields of office_id, officetype_id, name, and title. The office_candidate table has fields of office_id, and candidate_id. We can see that office_id is contained in both tables, so we're able to use that field as our joining field.

Let's display only candidate_id's and their associated office names for our list of candidates from Hawaii and Idaho. The syntax for JOIN in our SELECT statement is as follows:

SELECT office_candidate.candidate_id, office.name
FROM office
JOIN office_candidate ON (office.office_id = office_candidate.office_id)
WHERE (office_candidate.state_id = 'HI' OR office_candidate.state_id = 'ID')

More Notes:
When joining tables, we need to explicitly specify how the tables will be joined. This can be done using ON after the JOIN.

SELECT oc.candidate_id, o.name
FROM office o JOIN office_candidate oc ON (o.office_id = oc.office_id)
WHERE (oc.state_id = 'HI' OR oc.state_id = 'ID')

Go through the query above, and try to understand how it was constructed. Notice how we have not only built the field names we're interested in into the query, but we have also specified what table these fields are in. This is necessary as larger joins are used, and the same field names may be used even if the tables are different. Think about a candidate who may have moved and possibly ran for office in different states. It's possible that the candidate table includes state_id information for them in their current home state, while the election_candidate table, which includes their election history may also include a state_id field for an election they ran for prior to moving. SQL allows us to rename fields and tables, which helps us better make sense of the syntax we use, and the resultant dataset. Renaming tables in our JOIN will help make our syntax less cluttered. Let's trying using new names for our tables in the JOIN.

You can see from the above query, that after listing which tables we are selecting our data from, we've renamed the office table to "o", and the office_candidate table to "oc". So now our reference to the candidate_id field is typed as oc.candidate_id. Likewise our reference to the office name is listed as o.name.

Now let's say we further wanted to give the actual office name a better column descriptor. We can easily rename fields by specifying the new field name after we refer to it. In the exacmple about, we can use:

SELECT oc.candidate_id, o.name office
FROM office o
JOIN office_candidate oc ON (o.office_id = oc.office_id)
WHERE (oc.state_id = 'HI' OR oc.state_id = 'ID')

Something else we can use in PostgreSQL, which may not be available in other SQL implementations is USING with our JOIN. This replaces the ON portion of the JOIN, and allows us to quickly specify what our joining field is. Here is an example:

SELECT oc.candidate_id, o.name
FROM office o
JOIN office_candidate oc USING (office_id)
WHERE oc.state_id IN ('HI', 'ID')

For lengthy queries, USING may help us not only make our code easier to look at, but also save some typing. However, as with the above explanation of sometimes having the same fields in multiple tables, we may need to use ON instead if we need to explicitly state what our joining field is going to be.

If you are hoping to return results that might not have entries in a particular field, use LEFT JOIN. You may want to figure out which candidates are from a particular city/state and what speeches they've given, if any. The list returned to you will likely be too short if you simply JOIN to speech_candidate: there are probably candidates from that city who have never given a speech and therefore never appear in the speech_candidate table. They won't appear in the results. If you want results even when a certain field is blank, use LEFT JOIN. It functions the same as JOIN and it's always safer.

Now that you've had a chance to look over how JOIN works, let's construct a query that returns some candidate names along with their offices for all people named 'Joe' from 'ID' and 'HI'. Let's then sort our resulting dataset by state_id in descending order, then by lastname in ascending order.

Here are some hints to get you going, if you're wondering where to start:
Think about what tables you need
Think about what fields you need
Think about how you would JOIN tables together
Think about how to limit the data returned to only data you're interested in
Think about how we're going to sort our data

CategoryTechnicalSkills
CategoryResearch

PVSWiki : ResearchSQLTutorial

Background for SQL Tutorial