SQL at PVS: PVSWiki

SQL at PVS

New to SQL

Useful Links
ResearchSQLTutorial
TrainingPracticeQueries
PracticeQueriesAnswers
PvsadminSchema
Database Map
PgAdmin

Please note: interacting with our database through SQL is a privilege for high-performing interns and staff. Its use and practice should not interfere with the accomplishment of your normal responsibilities and duties as previously described in any way.

We use PostgreSQL and documentation is located here: https://www.postgresql.org/

Before being trained, please read this page. It will give you some background for the upcoming training:
http://wiki.votesmart.org/ResearchSQLTutorial

During training, we will practice with these queries. It's unlikely we'll get through all 7, so please try the others in your spare time.
http://wiki.votesmart.org/TrainingPracticeQueries

This map of our database may help when you're looking for information about the tables: https://dl.dropboxusercontent.com/u/50470502/pvsadmin.public.png
Or, if you learn better with tables and text, check out the PvsadminSchema

Once you become comfortable with these basic concepts, we can show you how to use different types of JOINs, as well as subqueries. More info on joins can be found on the SQLJoins page.

Understand that this is not the easiest stuff to wrap your head around, so feel free to ask if you need help. Try out this tutorial if you are interested in further developing your SQL skills.

How to Train

The Vote Smart SQL training routine has two main sections and three overarching knowledge areas. The sections are addressed in order and the different knowledge areas are emphasized as they come up within each section. The whole training session usually lasts from 2-4 hours depending on how quickly the trainees pick it up and the number of questions. Before the training session, I have them read over this wiki page: http://wiki.votesmart.org/ResearchSQLTutorial (this is mentioned at the top of this page) just to get a handle on some of the vocabulary.

In order of lowest to highest types of knowledge, those areas are:
1. knowledge about databases - general database structures (tables, fields, cells, data types, etc.) and relationships as well as the organization specific data as seen on https://dl.dropboxusercontent.com/u/50470502/pvsadmin.public.png (while still conceptually useful, this map may be outdated).
2. knowledge about the SQL language - it is important to stress that this is a language in that it requires an understanding of vocabulary and rules of grammar not to mention style and structure.
3. knowledge about how to ask a question - ask for the data that will answer the question and build in checks to avoid duplicates, false positives, and other errors.

Part 1

The first main section is most closely related to knowledge area # 1 and it covers basic ideas about relational databases, tables, fields, datatypes, along with some details about our tables and fields that they should already know through their work in admin. I make sure to define the various terms starting with the most basic:

A piece of data is the fundamental unit in a database. These items are saved as they arrived, whole and intact, in a unique location within the larger database structure, without being broken down, duplicated or otherwise altered.

One example is a birthdate, another is a first name, another is a last name. Notice that the individual letters or numbers of each piece of data are not pieces of data themselves (at least not in the context of Vote Smart's database).
However, not all units are equal. There are different types and each type can have different characteristics not to mention sizes. We will talk more about the different data types shortly.
This may be obvious or not, sometimes it depends on seemingly arbitrary database specific conventions that may not be conventions in another database or evem in another table in the same database.
Imagine an entire database as a large cloud of millions, billions, trillions of individual pieces of data. When we consider it like this the image is immense, unwieldy, incomprehensible and not at all user friendly. Hold on to that thought for a moment.

A table is a flat file, basically like a spreadsheet, with rows and columns where the intersection of one column and one row is a cell.

A Table is a mid level organizational unit of a database. Some databases can have many tables, some only a few. Some have so many tables that they are then organized into “schemas”. For the purposes of this training and 99% of the SQL work you will do at PVS, we will only concern ourselves with the tables in Vote Smart's “Public” schema.
In a table, a row (the horizontal one) is properly called a “record”, and a column (the vertical one) is properly called a “field”.

A record is a series of horizontally adjacent cells that are all related to one another because of their common relationship to a “primary key”. The primary key is what keeps a record together and ensures the first name “Barack” and the last name “Obama” appear together instead of “Barack” and “Romney”, for example. If you did find a discrepency like this, that would be a research error.
A field is a series of vertically adjacent cells that are all related to one another by their data type. For example, in the “Birthdate” field, you would not find the first name “Barack”. If you did find a discrepency like this, that could be a research error, although many of our tables and fields have various restrictions and rules to prevent this kind of thing.

Each piece of data occupies a cell in a Table. This is the location where it is saved.
There are several types of table: storage, junction, and definition.

A storage table is large and basically stores large amounts of data. It's the place you go to look for something. The PVS database has a number of storage tables.

One example is our Candidate table where all sorts of data on each candidate is stored. We'll take a look at this shortly.
Another is our speech table where all of our public statements are stored.

A junction table is one that joins two other tables in order to establish relationships between them and the data within each. This is where the “Relational” part of relational database comes from.

We have a junction table that joins the primary keys from the candidate table to the speech table in order to indicate that Barack Obama gave the 2015 State of the Union Address and not someone else.

A definition table is another efficiency structure. It allows us to create a relationship between two terms, one of which is drastically shorter or easier to reproduce and the other would be a longer term that might appear frequently throughout the database.

The gender table is one example, where male is related to Y and female is related to X. Now across the hundreds of thousands of times where we might otherwise have used 4 characters in one cell to indicate that a politician is a male and 6 characters to indicate that she is female, we only have one character. You can do the math for how much storage space this saves and therefore how much more quickly various pieces of data can be accessed.

Go back to our image of a cloud of data, snap your fingers and imagine that it is organized into tables. Oh look, there it is: https://dl.dropboxusercontent.com/u/50470502/pvsadmin.public.png
With this organizational model, it becomes fairly straightforward to identify an individual cell or ranges of cells by specifying a table, a row, and a cell, or ranges thereof.

It is worth mentioning that dividing the database into a series of tables enhances computational efficiency by allowing us to access select data with a lower burden on our individual computer's memory. The difference is comparable to trying to read Moby Dick by looking at all 1200 pages all at once versus looking at one page at a time.

There will be some tables that you will want to avoid opening using these computers, because the tables are so large they may crash the computer.

Let's look at the “Candidate” table as an example. BE SURE TO EMPHASIZE THAT THEY ARE NOT TO MAKE AND DEFINITELY NOT SAVE CHANGES WHEN VIEWING A TABLE. I believe this permission has been restricted by the IT department, but it's better to be safe than sorry.
If you want to think of a heirarchy of units in a database, from the largest it is: “THE DATABASE”, then schemas (optional), then tables (including records and fields), then individual pieces of data.

At this point, we've given the trainees a decent introduction to databases, structures, vocabulary, and some of our data. That should form the foundation of knowledge area #1.

Part 2

The second main section really covers the basics of knowledge area #2 and some key parts of #3. Here are some definitions and notes for those basic terms:

'SELECT' command - tells the computer that you want to see data from the listed fields.

mention how to call a field by it's full name/location - [table.field]
in a given query fields can be renamed with 'AS' - [table.field AS newname]
this is where aggregate functions like COUNT are used

'FROM' clause - tells the computer what tables to use or "where to look".

mention how to rename tables (generally to something shorter that makes sense) - [tablename newname]

'JOIN' using 'ON' and 'USING' - combines tables in order to accommodate more complex queries and produce more complex datasets.

required when you want to use data from more than one table.
think of the hinge of a door - you want to use a field (or combination of fields in more advanced queries) that is common to both tables but uniquely identifies each record. For this reason, "State_ID" is rarely a valuable joining field.

'WHERE' clause and limiting factors

must be an expression that the computer can evaluate for veracity. "WHERE religion = christian" is good because the computer can identify records where this is true or false. "WHERE christian" is not because it is not an expression.
other operators include but are not limited to: >, >=, <, <=, =!, LIKE, ILIKE, BETWEEN () AND (), NULL, NOT NULL
in simple queries, combine limiting factors with 'AND' and 'OR'

'GROUP BY' - required when using aggregate functions like COUNT in your SELECT clause
'ORDER BY' - orders the query results

ASC is default and you have to specify DESC
separate multiple ordering fields with commas

It's good form to end all queries with a semi-colon ;

Have the trainees take turns working through the other practice queries. As the trainer, you should help them with direct assistance, leading questions, as well as misleading questions to emphasize certain points. This is especially useful with items related to knowledge area #3. They can help each other as well.

We never get through all 7 of them, so I tell the trainees to work on the others in their spare time. Before ending the training I like to show them some interesting queries we've saved to the wiki (see below). This last step gives the new trainees an idea of what's possible with slightly more advanced queries.

Useful Research Queries

These pages include queries to find errors (pyquals and others) as well as measures of productivity and generally interesting queries.

Answers to practice queries with notes
Queries for the Categorization project
Queries for the Politicians project
Queries for the Public Statements project
Queries for the Special Interest Groups project
Queries for Political Resources and associated projects
Queries for Campaign Finance data

ResearchQueries

Returns candidate table showing preferred name

Returns all currently active officials in state legislatures with some biographical information

Queries For Candidates (Office Seeking)

Queries For Committees

Queries For Endorsements

Queries for Former Office Holders - To display lists of state and federal officials from prior years.

PVSWiki : UsefulResearchQueries