What is it?

Web scraping is the process of getting data from the web through automated programs. Think of it like copy-paste, only much faster. It is generally a more efficient and accurate process than manual entry. Many other groups and individuals take third-party information to clean, aggregate, and/or display in an original format (see attached file below for some legal context). What makes scraping unique at PVS is that we take data from thousands of sources rather than a handful. This makes prioritizing efficient automation key. At PVS, scraping speeds up our collection processes, which allows us to focus on the quality and display of the data.


What tools do we need?

R, an open-source statistical programming language, has proven flexible, efficient, and approachable. It can scrape and clean data as well as interact with our database through PostgreSQL, all in the same script, which makes R scraping a holistic approach. The software also provides powerful graphical tools that may prove useful.

RStudio is another piece of open-source software that provides a graphical interface for working in R. This makes viewing and working with data easier.

Python is also used for some scripts. As websites have become more modern, several Python scraping libraries have proven slightly more stable on those sites.

Text editors like Atom and Visual Studio Code work well for coding in Python.

Packages and libraries that have proven useful in the past are listed here.

Note: All automated script processes are currently stopped.

How do we use it?

There are five main steps in our automated collection process: scraping, cleaning, merging, importing, and QC. Ideally, the first four are all executed by the same program. The process will vary depending on department and task.

The first step is to scrape the data from the primary source. This requires the researcher to understand the format of the source (HTML, PDF, XLS, etc.) and how to extract the data. A script may require a loop that scrapes multiple URLs or downloads multiple PDFs, and it may have to account for irregular spacing or graphics. Past scripts can be instructive as to which packages and approach to use. Each source requires a slightly different scrape; the research programmer must tweak their script to fit the source.
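As a rough illustration, the sketch below loops over a few scorecard pages and collects table rows with the requests and BeautifulSoup libraries. The URL pattern, table structure, and column order are placeholders rather than a real source.

    # Scraping sketch: loop over several rating pages and collect table rows.
    # The URL pattern and the table's CSS class are hypothetical placeholders.
    import csv
    import requests
    from bs4 import BeautifulSoup

    years = [2014, 2015, 2016]
    rows = []

    for year in years:
        url = f"https://example.org/scorecard/{year}"  # placeholder URL pattern
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        # Each row of the (assumed) ratings table becomes one record.
        for tr in soup.select("table.ratings tr")[1:]:  # skip the header row
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:
                rows.append([year] + cells)

    with open("raw_scrape.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)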

The second step is to clean the data. Percentage symbols, string entries, and other extraneous information that does not fit PVS data standards should be deleted or translated. The ideal product is a clean table that fits into our database.
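A minimal cleaning sketch with pandas, continuing the hypothetical columns from the scrape above; the column names and the handling of percentage signs are illustrative assumptions.

    # Cleaning sketch: strip symbols and normalize values so the table fits PVS standards.
    import pandas as pd

    # Column names follow the hypothetical scrape above (raw_scrape.csv has no header row).
    df = pd.read_csv("raw_scrape.csv", names=["year", "name", "party", "district", "rating"])

    # Drop percentage symbols and stray whitespace, then coerce ratings to numbers.
    # String entries such as "N/A" or "Not Rated" become missing values here.
    df["rating"] = df["rating"].astype(str).str.replace("%", "", regex=False).str.strip()
    df["rating"] = pd.to_numeric(df["rating"], errors="coerce")

    # Drop rows where no usable rating could be recovered.
    df = df.dropna(subset=["rating"])

    df.to_csv("clean_scrape.csv", index=False)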

Third, the data is merged with identifying information from our database. This usually involves pulling the relevant candidates and matching them by last name, first initial, state ID, party, or other fields. A candidate ID (or another key, for non-candidate data) should be matched to each record taken by the scrape. The result is a CSV file that can be added to our database using the key identifiers.
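A simplified sketch of the matching step, assuming a hypothetical candidates.csv extract from the database; real scripts usually match on more fields (state ID, party, district) than shown here, as described above.

    # Merging sketch: attach candidate IDs from a database extract to the cleaned scrape.
    import pandas as pd

    scrape = pd.read_csv("clean_scrape.csv")
    # Hypothetical extract: candidate_id, first_name, last_name, party, ...
    candidates = pd.read_csv("candidates.csv")

    # Build simple matching keys: last name plus first initial.
    scrape["last_name"] = scrape["name"].str.split().str[-1].str.lower()
    scrape["first_initial"] = scrape["name"].str[0].str.lower()
    candidates["last_name"] = candidates["last_name"].str.lower()
    candidates["first_initial"] = candidates["first_name"].str[0].str.lower()

    merged = scrape.merge(
        candidates[["candidate_id", "last_name", "first_initial"]],
        on=["last_name", "first_initial"],
        how="left",
    )

    # Matched records go to the import file; unmatched records go to an Errors file for QC.
    merged[merged["candidate_id"].notna()].to_csv("import_ready.csv", index=False)
    merged[merged["candidate_id"].isna()].to_csv("errors.csv", index=False)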

The fourth step is for IT to bulk import the CSV file into the database. This involves a program that consumes, standardizes, and imports the CSV.
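The import program itself belongs to IT, but the core idea can be sketched with PostgreSQL's COPY command via psycopg2; the connection settings and the ratings_import staging table below are assumptions for illustration only.

    # Bulk-import sketch using PostgreSQL COPY via psycopg2.
    # The connection settings and the ratings_import staging table are assumptions;
    # the real import is handled by IT's own program.
    import psycopg2

    conn = psycopg2.connect(dbname="pvs", user="research")
    try:
        with conn:  # commits on success, rolls back on error
            with conn.cursor() as cur, open("import_ready.csv") as f:
                cur.copy_expert(
                    "COPY ratings_import FROM STDIN WITH (FORMAT csv, HEADER true)",
                    f,
                )
    finally:
        conn.close()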

The fifth and final step is to check the information that has been imported. Like any other QC process, the goal is to ensure high-quality content with human eyes. The difference with scraping QC is that the researcher should take careful note of any error patterns and communicate them to the programmer. The QC researcher should also check for an "Errors" CSV file that contains unmatched or problematic data from the scraped source. These problem entries need to be added manually on admin.
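A small helper along these lines can make the Errors file easier to summarize before reporting patterns back to the programmer; the file and column names follow the hypothetical merge sketch above.

    # QC helper sketch: summarize the Errors file produced by the merge step.
    import pandas as pd

    errors = pd.read_csv("errors.csv")
    print(f"{len(errors)} unmatched records")

    # A quick look at which entries failed to match helps describe error patterns.
    print(errors[["name", "party", "district"]].head(20))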

Managing Scripts

All active scripts are located on the shared drive (Next Cloud) under /research/scripts. That directory is a direct reflection of our GitHub Repository.

There are currently six broad categories for research scripts:
1. Fully automated processes, including both scraping and importing (NRA, Numbers USA, and Planned Parenthood)
2. Large data scrapes for particular groups (American Conservative Union and National Federation of Independent Business state legislature scrapes)
3. SIG specific scripts that scrape a specified legislature from a "static" website
4. A script to handle any SIG that uses Voter Voice to publish ratings
5. Campaign finance scripts that merge data from CRP and NIMSP with Vote Smart's database
6. Quality control scripts that have a specific purpose (removing duplicate entries from a CSV file)

Voter Voice

Voter Voice is a service that some groups use to publish political information relevant to their organization, such as advocacy campaigns, ratings, etc. The following is a description of how to manage the central Voter Voice script.

The process for scraping Voter Voice can be generally divided into four parts:
1) Adding the SIG to VoterVoice_SIGs.csv in the /Special Interest Groups/Crontabs/ directory.
2) Running VoterVoice.py in /research/scripts/.
3) Checking the CSV that shows how officials' names were matched with Vote Smart's candidate names (matched using fuzzymatcher rather than an exact match; see the sketch after this list).
4) Harvesting the CSV file found in /Special Interest Groups/Crontabs/ for the given group.
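The matching in step 3 can be illustrated with a short sketch. The real script uses the fuzzymatcher library; the example below substitutes the standard library's difflib and made-up names purely to show the idea of approximate name matching.

    # Fuzzy-match sketch: approximate matching of Voter Voice officials to Vote Smart names.
    # difflib stands in for the fuzzymatcher library used by the real script;
    # the names and the 0.8 cutoff are illustrative only.
    import difflib

    votervoice_names = ["Jon Smith", "Maria Garcia-Lopez"]
    votesmart_names = ["John Smith", "Maria Garcia Lopez", "Mary Garland"]

    for official in votervoice_names:
        matches = difflib.get_close_matches(official, votesmart_names, n=1, cutoff=0.8)
        best = matches[0] if matches else "NO MATCH - check manually"
        print(f"{official} -> {best}")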

There is no set schedule for running the Voter Voice script, since ratings may update at different times for any given set of SIGs. Running the script too frequently creates extra checking and harvesting work. A recommended frequency is 5-6 times a year.

Note: If no staff member is available to run the VoterVoice.py script, ratings from Voter Voice can be saved manually as PDFs and entered like any other rating.

What can we do with it?

Scraping can be used to accomplish tasks in multiple departments. In SIGs, scorecards have been scraped to produce ratings data as well as some endorsements. PCT has used a scraping procedure on candidate lists for Election Monitoring purposes. Other sub-departments could also benefit from automated collection (e.g., speeches from Thomas, official rosters). These projects should be approached with an eye for efficiency: manual entry may prove better in some instances. For large sources such as Thomas, SIG databases, APIs, and so forth, automated collection permits a better use of our human capital.

Attachments
Legal Issues with Web Scraping.doc (last modified 2016-07-25 09:50, 38 KB)