Bulk Import via Harvester: PVSWiki

Bulk Import via Harvester

Bulk import is a method of entry where whole scorecards can be uploaded to the database via a spreadsheet, thus bypassing one-by-one entry through admin. This method allows us to post more ratings to our site in a shorter amount of time.

The Harvester module on pypvsadmin is what allows us to upload ratings directly into the database without tying up IT resources. In order to use the Harvester, however, the data must be precisely formatted.

There are two critical steps in harvesting ratings. First, a comma-separated values file (or CSV – think basic spreadsheet) must be created with the data to be imported into admin. Second, the uploaded scorecard must have additional information added to it (such as name, cats, tags, etc.) through admin.

Tabula is a helpful resource for converting PDFs to spreadsheets. Follow the instructions on its website and in its README.

Whether scraping or index-matching, the following protocols must be followed in order for the Harvester to work.

Formatting

The Harvester will only work if the CSV is properly formatted. To be uploaded, the CSV needs these 8 columns:

candidate_id
sig_rating
our_rating
span
sig_id
usesigrating
ratingsession
ratingformat_id
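
Because typos in the header are a frequent cause of failed harvests, it can help to check the header row before uploading. The following is a minimal sketch, not part of the Harvester itself: the function name check_header is hypothetical, and it assumes the column names must match the list above exactly (it does not check column order, since this page does not say whether order matters).

```python
import csv
import io

# Column names as listed on this page; the Harvester's own check may differ.
REQUIRED_COLUMNS = [
    "candidate_id", "sig_rating", "our_rating", "span",
    "sig_id", "usesigrating", "ratingsession", "ratingformat_id",
]

def check_header(csv_text):
    """Return a list of problems found in the header row (empty if none)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # first row, or [] for an empty file
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    extra = [c for c in header if c not in REQUIRED_COLUMNS]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
    if extra:
        problems.append("unexpected columns: " + ", ".join(extra))
    return problems
```

To use it on a saved file, read the file's text first, for example check_header(open(path, newline="").read()), and fix anything it reports before uploading.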

candidate_id: The first column should hold the candidate_ids. The candidate or official associated with each id should match the original scorecard. The Harvester will not work if there are duplicate candidate_ids.
sig_rating: This column is used to enter the original rating from the SIG. sig_rating does not need to be modified for our_rating to appear correctly. Enter scores as is or as close to the original as possible.
our_rating: Use this column to enter the translation of the sig_rating. If a formula was used to translate, be sure to convert the formula results to plain text or numeric values.
span: Enter the year of the scorecard. There cannot be blank cells in this or any of the following fields.
sig_id: In order for the Harvester to know what SIG to associate with the rating, the sig_id must be included. No blank cells.
usesigrating: Usesigrating is how the Harvester knows whether to display sig_rating or our_rating on the website. usesigrating = 't' displays the sig_rating. usesigrating = 'f' displays our_rating. In almost all cases usesigrating should be set as 'f'.
ratingsession: This column indicates whether the session being rated is the first, second, or full session.

First session = 1
Second session = 2
Full session = 3
Unknown = -1

ratingformat_id: Ratingformat_id tells us the format of sig_rating. Formats include numeric, string, open, and grade scale. This affects how a scorecard is displayed on our site.

Numeric = 1
String = 3
Open = -1
Grade Scale = 2
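
The field rules above (no blank cells from span onward, 't' or 'f' for usesigrating, and the listed codes for ratingsession and ratingformat_id) can be checked row by row before uploading. This is only a sketch under stated assumptions: check_rows is a hypothetical helper, the values are assumed to appear in the CSV without quotation marks, and the Harvester may enforce more or different rules.

```python
import csv
import io

# Allowed values taken from the field descriptions on this page.
ALLOWED = {
    "usesigrating": {"t", "f"},
    "ratingsession": {"1", "2", "3", "-1"},
    "ratingformat_id": {"1", "2", "3", "-1"},
}
# Fields this page says may not contain blank cells.
NO_BLANKS = ["span", "sig_id", "usesigrating", "ratingsession", "ratingformat_id"]

def check_rows(csv_text):
    """Return a list of problems found in the data rows (empty if none)."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # start=2 because line 1 of the file is the header
    # (assumes one record per line, i.e. no multi-line quoted cells).
    for line_no, row in enumerate(reader, start=2):
        for col in NO_BLANKS:
            value = (row.get(col) or "").strip()
            if not value:
                problems.append(f"line {line_no}: {col} is blank")
            elif col in ALLOWED and value not in ALLOWED[col]:
                problems.append(
                    f"line {line_no}: {col} has unexpected value {value!r}"
                )
    return problems
```

An empty list means none of these particular rules were broken; it does not guarantee that the harvest will succeed.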

Now, the scorecard is ready to be harvested. Save this document in the 'Ratings' folder for the SIG as a .csv file using the following naming format:

span_sig.name_harvester.csv
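
As an illustration of the naming format, the sketch below builds a file name from a span and a SIG name. The helper harvest_filename is hypothetical, and it assumes that "sig.name" means the SIG's name with spaces replaced by underscores; check existing files in the 'Ratings' folder if unsure.

```python
def harvest_filename(span, sig_name):
    """Build a file name of the form span_sig.name_harvester.csv."""
    # Assumption: runs of whitespace in the SIG name become single underscores.
    safe_name = "_".join(sig_name.split())
    return f"{span}_{safe_name}_harvester.csv"
```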

An example harvest file can be found on the drive in the national group folder "National Federation of Independent Business": '2013-2014_National_Federation_of_Independent_Business_Scorecard.csv'.

See: FormattingDataInExcel

Uploading a scorecard using the Harvester

Go to pyadmin Ratings Harvester. Follow the instructions to select and upload your CSV file(s).
Refresh the page. If the most recent job is marked 'COMPLETE' under 'Completed Harvests', then the scorecard has been imported into the database. If a job shows an error, read the error message and fix the CSV. Errors are frequently due to typos in the header or duplicate candidate_ids.

Check your work! Be sure that there are no duplicates, all 8 fields are present and correctly named, all fields have the correct data type, etc.
If you still cannot locate the problem, ask your supervisor.
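
Duplicate candidate_ids can be hard to spot by eye in a long scorecard. The sketch below, with the hypothetical helper find_duplicate_candidates, lists each repeated id together with the file lines it appears on. It assumes the file has a candidate_id header and one record per line.

```python
import csv
import io
from collections import defaultdict

def find_duplicate_candidates(csv_text):
    """Map each repeated candidate_id to the file lines it appears on."""
    lines_by_id = defaultdict(list)
    reader = csv.DictReader(io.StringIO(csv_text))
    # start=2 because line 1 of the file is the header.
    for line_no, row in enumerate(reader, start=2):
        candidate_id = (row.get("candidate_id") or "").strip()
        lines_by_id[candidate_id].append(line_no)
    # Keep only the ids that occur more than once.
    return {cid: lines for cid, lines in lines_by_id.items() if len(lines) > 1}
```

An empty result means no candidate_id is repeated. Otherwise, compare the reported lines against the original scorecard to decide which row to keep.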

Once successfully uploaded, go to the CEC tracking sheet on the Google Drive and fill out the corresponding entry cells. Make sure to denote the type of entry as 'harvest' or 'Scrape'.

Webchecking a Bulk Import Scorecard

Webchecking a scorecard that has been imported into the database via the Harvester requires a few more steps than checking a scorecard that was manually added. This is because the Harvester only needs the data points listed above to upload a file. The rest of the information associated with a rating needs to be completed using admin. This should include:

Name
Rating description
Rating text
Categories

These should be copied from the SIG's cats unless the scorecard is on a separate/additional issue.

If the harvest is from a scrape, candidates that appear in the '...Errors.csv' file need to be manually added to admin. Refer to the original scorecard if necessary.

The same rules apply to manual entry here as elsewhere: do not enter + or – grades, or negative numeric scores, manually on admin.
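
When working through a list of candidates to add by hand, a quick filter can flag the ratings this rule covers. This is a rough sketch only: the function name is hypothetical, and it treats any value ending in a plus, hyphen, en dash, or minus sign as a + or – grade, which is an assumption about how such grades are written.

```python
def is_restricted_for_manual_entry(rating):
    """Return True for a +/- grade or a negative numeric score."""
    text = rating.strip()
    if not text:
        return False
    try:
        # Numeric score: restricted only if negative.
        return float(text) < 0
    except ValueError:
        pass  # not a number, so check for a +/- grade below
    # Grade ending in plus, hyphen, en dash, or minus sign (e.g. "B+").
    return text[-1] in "+-\u2013\u2212"
```

Ratings for which this returns True should not be typed into admin directly; ask your supervisor how they should be handled.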

Once all the appropriate information is added to the rating, start a webcheck as usual. Make sure to look out for any patterns of errors.

Release the scorecard to the Internal web.
Check the ratings on skittles against the primary scorecard.
If there are errors:

If they fit a common pattern, make a note on the CEC tracking sheet and tag the person who entered the scorecard.
This will help us to improve our processes. Examples of common errors that can be improved are mismatches based on nicknames, hyphenated last names, etc.
Correct the errors on admin.

When complete, release the scorecard to live web.

Quickly scan the scorecard on the live web to be sure it looks accurate.

Keep Track of your Progress
Update the Google Drive tracking sheets accordingly.

Attachments
File	Last modified	Size
2014_NA_ACU_Scorecard_conversion.ods	2015-09-08 15:18	105Kb
