Revision history for ResearchScraping
Additions:
=={{color hex="#DD0000" text="Note: All automated script processes have been put to a stop."}}==
Deletions:
Additions:
=={{color hex="#DD0000" text="Note: All automated script processes have been put to a stop."}}==
Additions:
Note: If staff does not have someone who can run the VoterVoice.py script manually, ratings from Voter Voice can be saved manually as a PDF and entered like any other rating.
Additions:
There is no set time for running the Voter Voice script, as the ratings may update at different times for any set of SIGs. Running the script too frequently creates extra checking and harvesting work each time. Running it 5-6 times a year is recommended.
Deletions:
Additions:
===**Voter Voice**===
Voter Voice is a service used by some groups to publish political information relevant to their organization, such as advocacy campaigns and ratings. The following describes how to manage the central Voter Voice script.
The process for scraping Voter Voice can be generally divided into four parts:
1) Adding the SIG to VoterVoice_SIGs.csv in the /Special Interest Groups/Crontabs/ directory.
2) Running VoterVoice.py in /research/scripts/
3) Checking the csv that records how officials' names were matched with Vote Smart's candidate names (matched with fuzzymatcher, not by exact match).
4) Harvesting the csv file found in /Special Interest Groups/Crontabs/ for the given group.
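The name matching in step 3 can be illustrated with a short sketch. The production script uses the fuzzymatcher package; the stand-in below uses Python's built-in difflib to show the same idea, and every name in it is invented for illustration:

```python
from difflib import get_close_matches

# Hypothetical data: officials' names as published on Voter Voice,
# and the corresponding Vote Smart candidate names they must map to.
votesmart_names = ["Jon A. Tester", "Steven Daines", "Ryan K. Zinke"]
votervoice_names = ["Jon Tester", "Steve Daines", "Ryan Zinke"]

matches = {}
for name in votervoice_names:
    # cutoff controls how lenient the match is; 0.6 is difflib's default
    close = get_close_matches(name, votesmart_names, n=1, cutoff=0.6)
    matches[name] = close[0] if close else None
```

Any name that maps to None (or to the wrong candidate) is exactly what step 3's checking pass is meant to catch and resolve by hand.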
Additions:
Packages and Libraries that have proven to be useful in the past are listed in [[ScriptLibraries here]].
Deletions:
Additions:
Packages and Libraries that have proven to be useful in the past are listed in ScriptLibraries.
Deletions:
Additions:
[[https://www.python.org/ Python]] is also used for some scripts. As websites have become more modern, several Python libraries used to scrape them have proven slightly more stable.
Text editors like [[https://atom.io/ Atom]] and [[https://code.visualstudio.com/ Visual Studio Code]] work well for writing Python.
Packages and Libraries that have proven to be useful in the past are listed here.
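As a minimal illustration of what these Python scraping scripts do, here is a hedged sketch using Beautiful Soup (a third-party library, installed with `pip install beautifulsoup4`). The scorecard HTML below is an invented stand-in for a SIG's ratings page:

```python
from bs4 import BeautifulSoup

# Invented HTML standing in for a SIG's scorecard page.
html = """
<table id="scorecard">
  <tr><th>Official</th><th>Rating</th></tr>
  <tr><td>Jon Tester</td><td>85%</td></tr>
  <tr><td>Steve Daines</td><td>40%</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
# Skip the header row, then pull name and rating out of each table row.
for tr in soup.find("table", id="scorecard").find_all("tr")[1:]:
    name, rating = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append((name, rating.rstrip("%")))  # strip "%" to fit data standards
```

A real scrape would fetch the page (e.g., with the requests library) and adapt the selectors to that site's markup; each source requires its own tweaks.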
Additions:
All active scripts are located on the shared drive (Next Cloud) under /research/scripts. That directory is a direct reflection of our [[https://github.com/votesmart/research-scripts/ GitHub Repository]].
1. Fully automated processes, including both scraping and importing ([[NRARatings NRA]], Numbers USA, and Planned Parenthood)
Deletions:
1. Fully automated processes, including both scraping and importing ([[NRARatings NRA]], NumbersUSA, and Planned Parenthood)
Additions:
1. Fully automated processes, including both scraping and importing ([[NRARatings NRA]], NumbersUSA, and Planned Parenthood)
Deletions:
Additions:
1. Fully automated processes, including both scraping and importing ([[NRAQuirks NRA]], NumbersUSA, and Planned Parenthood)
Deletions:
Additions:
3. SIG-specific scripts that scrape a specified legislature from a "static" website
4. A script to handle any SIG that uses Voter Voice to publish ratings
5. Campaign finance scripts that merge data from CRP and NIMSP with Vote Smart's database
6. Quality control scripts that have a specific purpose (removing duplicate entries from a csv file)
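A quality-control script in category 6 can be only a few lines. Here is a hedged sketch of the duplicate-removal example, with hypothetical file paths and no assumptions about column layout:

```python
import csv

def dedupe_csv(in_path, out_path):
    """Copy a CSV file, keeping only the first occurrence of each row."""
    seen = set()
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for row in reader:
            key = tuple(row)
            if key not in seen:
                seen.add(key)
                writer.writerow(row)
```

A real script might instead dedupe on a key column (e.g., candidate ID) rather than the whole row, depending on the data.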
Deletions:
4. A script to handle any SIG that uses Voter Voice to publish grades
5. Campaign Finance scripts that merge data from CRP and NIMSP
6. Quality Control scripts that have a specific purpose (removing duplicate entries from a csv file)
Additions:
2. Large data scrapes for particular groups (American Conservative Union and National Federation of Independent Business state legislature scrapes)
Deletions:
Additions:
All active scripts are located on the shared drive (NextCloud) under /research/scripts. That directory is a direct reflection of our [[https://github.com/votesmart/research-scripts/ GitHub Repository]].
There are currently six broad categories for research scripts:
1. Fully automated processes, including both scraping and importing (NRA, NumbersUSA, and Planned Parenthood)
2. Large data scrapes for particular groups (American Conservative Union and National Federation of Independent Business State Legislature scrapes)
3. SIG specific scrapes that scrape a specified legislature from a "static" website
4. A script to handle any SIG that uses Voter Voice to publish grades
5. Campaign Finance scripts that merge data from CRP and NIMSP
6. Quality Control scripts that have a specific purpose (removing duplicate entries from a csv file)
Deletions:
Additions:
All active scripts are located on the shared drive (NextCloud) under /research/scripts. That directory is a direct reflection of our [[https://github.com/votesmart/research-scripts/ GitHub Repository]].
Deletions:
Additions:
All active scripts are located on the shared drive NextCloud under /research/scripts. That directory is a direct reflection of our [[https://github.com/votesmart/research-scripts/ GitHub Repository]].
Deletions:
Additions:
====**Managing Scripts**====
All active scripts are located on the shared drive (NextCloud) under /research/scripts. That directory is a direct reflection of our [[https://github.com/votesmart/research-scripts/ GitHub Repository]].
Deletions:
Additions:
Scraping can be used to accomplish tasks in multiple departments. In SIGs, scorecards have been scraped to produce ratings data as well as some endorsements. PCT has used a scraping procedure on candidate lists for Election Monitoring purposes. Other sub-departments could also benefit from automated collection (e.g., speeches from Thomas, official rosters). These projects should be approached with an eye for efficiency: manual entry may prove better in some instances. For large sources such as Thomas, SIG databases, APIs, and so forth, automated collection permits a better use of our human capital.
Deletions:
Revision [18124]
Edited on 2016-07-25 09:50:11 by walker [added document that includes legal context]
Additions:
Web scraping is the process of getting data from the web through automated programs. Think of it like copy-paste, only much faster. It is generally a more efficient and accurate process than manual entry. Many other groups and individuals take third-party information to clean, aggregate, and/or display in an original format (see attached file below for some legal context). What makes scraping unique at PVS is that we take data from thousands of sources rather than a handful. This makes prioritizing efficient automation key. At PVS, scraping speeds up our collection processes, which allows us to focus on the quality and display of the data.
{{files}}
Deletions:
[[file]]
Revision [18123]
Edited on 2016-07-25 09:48:28 by walker [added document that includes legal context]
Additions:
[[file]]
Revision [16104]
Edited on 2015-06-18 11:56:05 by walker [added document that includes legal context]
Additions:
Web scraping is the process of getting data from the web through automated programs. Think of it like copy-paste, only much faster. It is generally a more efficient and accurate process than manual entry. Many other groups and individuals take third-party information to clean, aggregate, and/or display in an original format. What makes scraping unique at PVS is that we take data from thousands of sources rather than a handful. This makes prioritizing efficient automation key. At PVS, scraping speeds up our collection processes, which allows us to focus on the quality and display of the data.
[[http://www.r-project.org/about.html R]], an open-source statistical programming language, has demonstrated flexibility, efficiency, and approachability. It is a tool that can scrape and clean data as well as interact with our database using PostgreSQL, all in the same script. [[https://github.com/rdpeng/courses/tree/master/03_GettingData/02_03_readingFromTheWeb R scraping]] is a holistic approach. The software also provides powerful graphical tools that may prove useful.
[[http://www.rstudio.com/about/ RStudio]] is another piece of open-source software that provides a graphical interface for working in R. This makes viewing and working with data easier.
The product of these scrapes needs to be imported into the database using a Python script. IT is in charge of all bulk entry; therefore, programming the scraping and import processes separately is sensible.
====**How do we use it?**====
There are **five main steps** in our automated collection process: **scraping**, **cleaning**, **merging**, **importing**, and **QC**. Ideally, the first four are all executed by the same program. The process will vary depending on department and task.
The **first** step is to scrape the data from the primary source. This will require the researcher to understand the format of the source (HTML, PDF, XLS, etc.) and how to extract the data. A script may require a loop that scrapes over multiple URLs or downloads multiple PDFs. It may have to account for irregular spacing or graphics. Past scripts may be instructive as to which packages and what approach to use. Each source will require a slightly different scrape; the research programmer must tweak their script to fit the source.
The **second** is to clean the data. Percentage symbols, string entries, and other extraneous information that do not fit PVS data standards should be deleted or translated. The ideal product is a clean table that fits in our database.
**Third,** the data will be merged with identifying information from our database. This usually involves pulling out the relevant candidates to be matched by last name, first initial, state ID, party, or other fields. A candidate ID (or other key, for non-candidate data) should match up with each record taken by the scrape. The result will be a CSV file that can be added to our database using the key identifiers.
The **fourth** step is for IT to [[RatingsBulkImport bulk import]] the CSV file into the database. This involves a program that consumes, standardizes, and imports the CSV.
The **fifth** and final step is to check the information that has been imported. Like any other QC process, the goal is to ensure a high quality of content with human eyes. The difference with scraping QC is that the researcher should take careful note of any error patterns, which will be communicated to the programmer. The QC researcher should also check for an "Errors" CSV file that has unmatched/problematic data from the scraped source. These problem entries need to be manually added on admin.
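The cleaning and merging steps above (two and three) can be sketched as follows; the field names, candidate IDs, and matching rule here are all invented for illustration:

```python
# Hypothetical scraped records and a slice of the database side.
scraped = [
    {"name": "Tester, Jon", "state": "MT", "score": "85%"},
    {"name": "Daines, Steve", "state": "MT", "score": "40%"},
]
candidates = [
    {"candidate_id": 101, "last": "Tester", "first": "Jon", "state": "MT"},
    {"candidate_id": 102, "last": "Daines", "first": "Steven", "state": "MT"},
]

merged, errors = [], []
for rec in scraped:
    last, first = [p.strip() for p in rec["name"].split(",")]
    score = rec["score"].rstrip("%")          # clean: drop the percentage symbol
    match = [c for c in candidates
             if c["last"] == last
             and c["first"][0] == first[0]    # match on first initial, not full name
             and c["state"] == rec["state"]]
    if len(match) == 1:
        merged.append({"candidate_id": match[0]["candidate_id"], "score": score})
    else:
        errors.append(rec)                    # destined for the "Errors" CSV
```

In a real script, `merged` would be written out as the CSV for bulk import, and `errors` would become the "Errors" file whose entries are added manually on admin.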
Scraping can be used to accomplish tasks in multiple departments. In SIGs, scorecards have been scraped to produce ratings data as well as some endorsements. PCT has used a scraping procedure on candidate lists for QC purposes. Other sub-departments could also benefit from automated collection (e.g., speeches from Thomas, official rosters). These projects should be approached with an eye for efficiency: manual entry may prove better in some instances. For large sources such as Thomas, SIG databases, APIs, and so forth, automated collection permits a better use of our human capital.
Deletions:
R, an open-source statistical programming language, has demonstrated flexibility, efficiency, and approachability for scraping. It is a tool that can scrape and clean data as well as interact with our database using PostgreSQL, all in the same script. [[https://github.com/rdpeng/courses/tree/master/03_GettingData/02_03_readingFromTheWeb R scraping]] is a holistic approach.
The product of these scrapes still need to be imported into the database using a python script. Traditionally and at present, IT is in charge of all bulk entry, therefore separating the scraping and import procedures is sensible.
====**How to use it?**====
There are **five main steps** in our automated collection process: **scraping**, **cleaning**, **merging**, **importing**, and **QC**. Ideally, the first four are all executed by the program. The process will vary depending on department and task.
The **first** is to scrape the data from the primary source. This will require the researcher to understand the format of the source (HTML, PDF, XLS, etc.) and how to extract the data. A script may require a loop that scrapes over multiple URL's or downloads multiple PDF's. Past scripts may be instructive as to which packages to use and what approach to take. Each source will require a slightly different scrape; the research programmer must tweak their script to fit the source.
The **second** step is to clean the data. Percentage symbols, irregular string entries, and other extraneous information that does not fit PVS data standards should be deleted or translated. The ideal product is a clean table that would fit in our database.
**Third,** the data will be merged with identifying information from our database. This usually involves pulling out the relevant candidates to be matched by last name, first initial, state id, party, or other factors. A candidate ID (or other key, for non-candidate data) should match up with every record taken by the scrape. The result will be a CSV file that can be added to our database using the key identifiers.
The **fourth** step is for IT to [[RatingsBulkImport bulk import]] the CSV file into the database. This involves a program that consumes and standardizes the CSV.
The **fifth** and final step is to check for quality control the information that has been imported. Like any other QC process, the goal is to ensure a high quality of content with human eyes. The difference with scraping QC is that the researcher should take careful note of any error patterns, which will be communicated with the programmer to be fixed. The QC researcher should also check for an "Errors" CSV file that has collected unmatched/problematic data from the scraped source. These problem entries need to be manually added on admin.
Scraping can be used to accomplish tasks in multiple departments. In SIGs, scorecards have been scraped to produce ratings data as well as some endorsements. PCT has used a scraping procedure for QC purposes on candidate lists. Other sub-departments could also benefit from automated collection (i.e., speeches from Thomas, official rosters, etc.). These projects should be approached with an eye for efficiency: manual entry may prove better in some instances. For large sources such as Thomas, SIG databases, API's, and so forth, automated collection permits a better use of our human capital.
Revision [16101]
Edited on 2015-06-18 11:44:30 by walker [added document that includes legal context]
Additions:
Scraping can be used to accomplish tasks in multiple departments. In SIGs, scorecards have been scraped to produce ratings data as well as some endorsements. PCT has used a scraping procedure for QC purposes on candidate lists. Other sub-departments could also benefit from automated collection (e.g., speeches from Thomas, official rosters). These projects should be approached with an eye for efficiency: manual entry may prove better in some instances. For large sources such as Thomas, SIG databases, APIs, and so forth, automated collection permits a better use of our human capital.
Deletions:
scorecards, endorsements, PCT candidate lists (QC)
speeches
Scraping can be used to accomplish tasks in multiple departments
Revision [16100]
Edited on 2015-06-18 10:58:41 by walker [added document that includes legal context]
Additions:
====**What tools do we need?**====
R, an open-source statistical programming language, has demonstrated flexibility, efficiency, and approachability for scraping. It is a tool that can scrape and clean data as well as interact with our database using PostgreSQL, all in the same script. [[https://github.com/rdpeng/courses/tree/master/03_GettingData/02_03_readingFromTheWeb R scraping]] is a holistic approach.
The product of these scrapes still needs to be imported into the database using a Python script. Traditionally and at present, IT is in charge of all bulk entry; therefore, separating the scraping and import procedures is sensible.
There are **five main steps** in our automated collection process: **scraping**, **cleaning**, **merging**, **importing**, and **QC**. Ideally, the first four are all executed by the program. The process will vary depending on department and task.
Scraping can be used to accomplish tasks in multiple departments
Deletions:
There are **three main processes** to using the Firefox Update Scanner. The **first** is to create a Firefox Profile (similar to a Google+ account). Creating a Firefox Profile with a general email (i.e. ratings@votesmart.org, city@votesmart.org, county1@votesmart.org, etc..) ensures that each "scanner" can be saved over time and passed down to your predecessor. In addition, using a FF Profile, and one that is under a general email, allows multiple people to use the FF Scanner. In case you are confused (FF is weird), a FF Profile allows you to sync your bookmarks and other stuff (history, tabs etc.) to the profile and allow you to access the same information on a different computer.
The **second** process involves the initial setup of and uploading of webpages into the FF Scanner. Please go to the [[FirefoxUpdateScannerAddBookmarks Adding Bookmarks]] page for more details.
The **third** process is completing scans of the already uploaded pages in coordination with database updates. Please go to the [[FirefoxUpdateScannerCheckBookmarks Scanning Bookmarks]] page for more details.
scrape, clean, merge, import, webchecks
There are **five main steps** in our automated collection process: **scraping**, **cleaning**, **merging**, **importing**, and **QC**. Ideally, the first four are all executed by the program.
Revision [16099]
Edited on 2015-06-18 10:39:51 by walker [added document that includes legal context]
Additions:
There are **five main steps** in our automated collection process: **scraping**, **cleaning**, **merging**, **importing**, and **QC**. Ideally, the first four are all executed by the program.
The **first** is to scrape the data from the primary source. This will require the researcher to understand the format of the source (HTML, PDF, XLS, etc.) and how to extract the data. A script may require a loop that scrapes over multiple URLs or downloads multiple PDFs. Past scripts may be instructive as to which packages to use and what approach to take. Each source will require a slightly different scrape; the research programmer must tweak their script to fit the source.
**Third,** the data will be merged with identifying information from our database. This usually involves pulling out the relevant candidates to be matched by last name, first initial, state id, party, or other factors. A candidate ID (or other key, for non-candidate data) should match up with every record taken by the scrape. The result will be a CSV file that can be added to our database using the key identifiers.
The **fourth** step is for IT to [[RatingsBulkImport bulk import]] the CSV file into the database. This involves a program that consumes and standardizes the CSV.
The **fifth** and final step is to check for quality control the information that has been imported. Like any other QC process, the goal is to ensure a high quality of content with human eyes. The difference with scraping QC is that the researcher should take careful note of any error patterns, which will be communicated with the programmer to be fixed. The QC researcher should also check for an "Errors" CSV file that has collected unmatched/problematic data from the scraped source. These problem entries need to be manually added on admin.
Deletions:
The **first** is to scrape the data from the primary source. This will require the researcher to understand the format of the source (HTML, PDF, XLS, etc.) and how to extract the data. A script may require a loop that scrapes over multiple URL's or downloads multiple PDF's. Past scripts may be instructive as to which packages to use and what approach to take. Each source will require a slightly different scrape; the researcher must tweak their script to fit the source.
**Third,** the data will be merged with identifying information from our database. This usually involves pulling out the relevant candidates to be matched by last name, first initial, state id, party, or other factors. A candidate ID should match up with every piece of data.
Revision [16098]
Edited on 2015-06-18 10:26:23 by walker [added document that includes legal context]
Additions:
Web scraping is the process of getting data from the web through automated programs. Think of it like copy-paste, only much faster. It is generally a more efficient and accurate process than manual entry. Many groups and individuals take third-party information to clean, aggregate, and/or display this data in an original format. At PVS, scraping speeds up our collection processes, which allows us to focus on the quality and display of the data. What makes scraping unique at PVS is that we take data from thousands of sources rather than a handful. This makes prioritizing efficient automation key.
scorecards, endorsements, PCT candidate lists (QC)
speeches
scrape, clean, merge, import, webchecks
There are **five main steps** in our automated collection process: **scraping**, **cleaning**, **merging**, **importing**, and **webchecking**. Ideally, the first four are all executed by the program.
The **first** is to scrape the data from the primary source. This will require the researcher to understand the format of the source (HTML, PDF, XLS, etc.) and how to extract the data. A script may require a loop that scrapes over multiple URLs or downloads multiple PDFs. Past scripts may be instructive as to which packages to use and what approach to take. Each source will require a slightly different scrape; the researcher must tweak their script to fit the source.
The **second** step is to clean the data. Percentage symbols, irregular string entries, and other extraneous information that does not fit PVS data standards should be deleted or translated. The ideal product is a clean table that would fit in our database.
**Third,** the data will be merged with identifying information from our database. This usually involves pulling out the relevant candidates to be matched by last name, first initial, state id, party, or other factors. A candidate ID should match up with every piece of data.