Dupes Clean-Up
Documentation
Jamie Willett
Database Specialist
Project Start: late-July 2016
Projected Project End Date: February 2017, earliest
Introduction
This is the duplicate candidates project; in this document and elsewhere it will be referred to as the Dupes Project, or just Dupes. Dupes have been created in our database for as long as it has existed. Right now we simply don't have enough checks and balances in place to prevent duplicate creation, but hopefully between now and the project's end we will have worked out some ways to help us stop creating them.
Project Development
This project developed after scouring the Drive for instances of a similar project. We all know dupes happen, and they happen frequently enough that there had to have been a similar project at some point in Vote Smart's history. Lo and behold, in research/Confirmed Inactive by Mowery/Profiles (old)/Dupes Queries/Duplicate Bios/ there is some documentation on a project from 2004-2005. (Feel free to read any of that documentation.) That project had a scope of 1,034 sets of possible duplicate candidates, and it started during an off-year so the candidates would be good to go before the 2006 primary season. They estimated that a staff member could get through all 1,034 potential sets in a little over a month. Of course, Vote Smart in 2004-2005 was a completely different organization than it is in 2016.
I have yet to find documentation of any other project revolving around dupes, which means the dupes haven't been cleaned up since the 2004-2005 project. This is alarming, and it also leaves me with more than a decade of dupes to handle. That's a daunting task for anybody, even with my years of experience with bios.
There has not yet been a clear step-by-step process for dealing with dupes. Originally the work centered on the Google doc Duplicate Candidates - to research, where we have been documenting dupes since November 2015. However, upon writing the query in Appendix A, I changed my course of action. The results of that query are candidates with absolutely no information attached to them: no campaign finance, no elections, no offices, no education, etc. Working through those results has become my step 1.
Step 1: Deal with any dupes that pop up in the query from Appendix A.
These are the easiest dupes to deal with because their candidate_id usually sits only a few numbers away from the correct candidate_id, and because I can handle them myself. The process is simple: a) identify the record as a dupe, b) give the original more information for future identification purposes, c) document the bad candidate_id and the person's name, and d) delete the bad candidate_id.
There are 1,160 potential bad candidates within this query. Yesterday (7/28/2016), in just the afternoon, I got through 36 people: 25 were dupes, 5 were not dupes, and 6 were either purely bad (a former intern creating a candidate for themselves as a joke) or held an office we don't cover (like a state circuit court judge). At that pace I can get through roughly 80 people a day, so this step should take roughly two to three working weeks (1,160 ÷ 80 ≈ 14.5 days).
Query A, minus the portion of the WHERE clause that reads “ AND oc.office_candidate_id IS NULL”, also surfaces a lot of potential dupes. I have combed through those results in their entirety and identified roughly 185 dupes and 115 potential dupes. I started working through them on 15 September 2016, averaging 80 per week, so they should take me about four weeks.
Step 2: Deal with the dupes in the 'Duplicate Candidates – to research' document, first focusing on the ones that are higher priority because of the 2016 election.
Most of the higher-priority 2016 dupes were dealt with before the election by both IT and me. A few have surfaced since election day among the “newly” elected, but between PCT QC and the bios sub-department these are being found quickly and will go into the next Mantis ticket to IT.
As the project developed, more sheets were added to that Google doc...
To fully explain the sheets of the Google Doc:
* To Research: These are dupes found through everyday research by the database specialist, PCT, political resources, or bios.
* Manually Merged: All of these have been merged by hand: they have either nothing attached to them or just a local office attached, so we don't need to bother IT with the merging.
* local dupes: The easiest of all dupes. These dupes have only local offices attached to them and are therefore easier to manually merge, though they can be harder to research externally because some localities and their people don't have a good web presence.
* even more potential dupes: These are very tentative. The ones in yellow are the ones researchers were most sure about. Going through the yellow ones, however, I have come across some that are definitely not dupes, so everything in this sheet may need a second round of research from only very good researchers.
* Dupe Hunting: Everybody with the same lastname and firstname within the same state_id. See Appendix B for the queries. I used two queries, one based around office_candidate and the other around election_candidate. There is probably a way to combine them in the future so that both queries aren't needed. The queries may look familiar to anybody who deals with PyQuals, as they're a modified version of the duplicate candidates one.
* For IT: Dupes with data that I can't merge by myself, so they need IT's assistance. These dupes have all of their biographical information from the bio page of admin transferred over.
* Mantis Tickets: Just keeping track of all dupes that have been merged by IT. The database specialist still has to go through the candidate_id deletions. Note: Only put highly trusted people on this project who know what they're doing.
Project Thoughts:
Local dupes are the easiest to go through. It's easy to confirm the dupes, and there is no reason to go through IT because there's nothing (speeches, campaign finance, ratings, endorsements, etc.) to merge between the two candidate_ids.
For these, if a picture is attached to one of the ids, that's the id I keep. If there aren't pictures, I keep the lower candidate_id, unless one candidate_id has more data (offices) attached, in which case that's the id I keep.
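The keep/delete rule above can be written as a single ORDER BY. This is only a hedged sketch, not a query used in the project: the photo table and its photo_id column are assumed names (I'd adjust them to wherever pictures actually live in the schema), and the two ids in the IN list are placeholders for a real dupe set.

```sql
-- Sketch: pick the survivor from a set of duplicate candidate_ids.
-- NOTE: "photo" / "photo_id" are assumed names; the ids are placeholders.
SELECT c.candidate_id
FROM candidate c
LEFT JOIN photo p ON (p.candidate_id = c.candidate_id)
LEFT JOIN office_candidate oc ON (oc.candidate_id = c.candidate_id)
WHERE c.candidate_id IN (11111, 22222)                 -- the dupe set
GROUP BY c.candidate_id
ORDER BY COUNT(DISTINCT p.photo_id) > 0 DESC,          -- 1) keep an id with a picture
         COUNT(DISTINCT oc.office_candidate_id) DESC,  -- 2) then the id with more offices
         c.candidate_id ASC                            -- 3) then the lower id
LIMIT 1;
```

The DISTINCT counts matter here: joining both photo and office_candidate multiplies rows, so plain COUNT would inflate the office tally for candidates with several pictures.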
I only delete a candidate_id if I can 100% confirm that it's a dupe. This is done using social media, old historical county/city information, and old archived news articles. It's a very in-depth research process, more akin to what I originally thought a “Research Associate” would spend their time doing (compared to what a Vote Smart “Research Associate” actually does). I hope it's more of the future of Vote Smart, now that we may be incorporating interns who are keen on the research side of things.
The thought that this project would be simple enough to end in October 2016 is laughable. As of November 2016, the project is ROUGHLY 24% complete (725/3,062), and hopefully more, since my estimates do not account for how many of those 3,062 are not dupes (3,062 is potential dupes, not 100% confirmed dupes).
Final stats:
Dupes Merged:

Appendix A: candidates with all information IS NULL
SELECT c.candidate_id, c.firstname, c.lastname, c.family,
       ea.election_address_id, ew.election_webaddress_id, ed.endorse_candidate_id,
       rc.rating_candidate_id, sc.speech_candidate_id, fc.finsource_candidate_id,
       cc.contact_candidate_id, fcc.factcheck_candidate_id, e.experience_id,
       d.education_id, oc.office_candidate_id, ec.election_candidate_id, c.created
FROM candidate c
LEFT JOIN biotext_candidate bc ON (bc.candidate_id = c.candidate_id)
LEFT JOIN election_candidate ec ON (ec.candidate_id = c.candidate_id)
LEFT JOIN election_address ea ON (ea.election_candidate_id = ec.election_candidate_id)
LEFT JOIN election_webaddress ew ON (ew.election_candidate_id = ec.election_candidate_id)
LEFT JOIN endorse_candidate ed ON (ed.election_candidate_id = ec.election_candidate_id)
LEFT JOIN rating_candidate rc ON (rc.candidate_id = c.candidate_id)
LEFT JOIN speech_candidate sc ON (sc.candidate_id = c.candidate_id)
LEFT JOIN finsource_candidate fc ON (fc.candidate_id = c.candidate_id)
LEFT JOIN contact_candidate cc ON (cc.candidate_id = c.candidate_id)
LEFT JOIN factcheck_candidate fcc ON (fcc.candidate_id = c.candidate_id)
LEFT JOIN experience e ON (e.candidate_id = c.candidate_id)
LEFT JOIN education d ON (d.candidate_id = c.candidate_id)
LEFT JOIN office_candidate oc ON (oc.candidate_id = c.candidate_id)
WHERE d.education_id IS NULL
  AND e.experience_id IS NULL
  AND fcc.factcheck_candidate_id IS NULL
  AND cc.contact_candidate_id IS NULL
  AND fc.finsource_candidate_id IS NULL
  AND sc.speech_candidate_id IS NULL
  AND rc.rating_candidate_id IS NULL
  AND ed.endorse_candidate_id IS NULL
  AND ew.election_webaddress_id IS NULL
  AND ea.election_address_id IS NULL
  AND ec.election_candidate_id IS NULL
  AND oc.office_candidate_id IS NULL
ORDER BY c.created
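For reference, the same "no information attached" filter could probably be written with NOT EXISTS instead of a 14-way LEFT JOIN, which avoids row fan-out from the joins. This is a hedged sketch using only the table names from the query above; since election_address, election_webaddress, and endorse_candidate all hang off election_candidate, requiring no election_candidate row subsumes those three checks. Results should be verified against the original before switching.

```sql
-- Sketch: Appendix A rewritten with NOT EXISTS (verify before relying on it).
SELECT c.candidate_id, c.firstname, c.lastname, c.family, c.created
FROM candidate c
WHERE NOT EXISTS (SELECT 1 FROM education d           WHERE d.candidate_id = c.candidate_id)
  AND NOT EXISTS (SELECT 1 FROM experience e          WHERE e.candidate_id = c.candidate_id)
  AND NOT EXISTS (SELECT 1 FROM factcheck_candidate f WHERE f.candidate_id = c.candidate_id)
  AND NOT EXISTS (SELECT 1 FROM contact_candidate cc  WHERE cc.candidate_id = c.candidate_id)
  AND NOT EXISTS (SELECT 1 FROM finsource_candidate fc WHERE fc.candidate_id = c.candidate_id)
  AND NOT EXISTS (SELECT 1 FROM speech_candidate sc   WHERE sc.candidate_id = c.candidate_id)
  AND NOT EXISTS (SELECT 1 FROM rating_candidate rc   WHERE rc.candidate_id = c.candidate_id)
  -- no election_candidate row also rules out election_address,
  -- election_webaddress, and endorse_candidate rows:
  AND NOT EXISTS (SELECT 1 FROM election_candidate ec WHERE ec.candidate_id = c.candidate_id)
  AND NOT EXISTS (SELECT 1 FROM office_candidate oc   WHERE oc.candidate_id = c.candidate_id)
ORDER BY c.created;
```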
Appendix B: Dupe Hunting sheet queries
election_candidate
SELECT split_part(z.candidates, '~@~', 1) AS lastname, split_part(z.candidates, '~@~', 2) AS firstname, split_part(z.candidates, '~@~', 3) AS suffix, split_part(z.candidates, '~@~', 4) AS state_id, split_part(z.candidates, '~@~', 5) AS office_id, z.count
FROM
(SELECT x.candidates, COUNT(x.candidates)
FROM
(SELECT candidate_id, CONCAT_WS('~@~', lastname, firstname, suffix, c.state_id, oc.office_id) AS candidates
FROM candidate c
LEFT JOIN election_candidate oc
USING (candidate_id)
GROUP BY candidate_id, CONCAT_WS('~@~', lastname, firstname, suffix, c.state_id, oc.office_id)) y
JOIN
(SELECT CONCAT_WS('~@~', lastname, firstname, suffix, c.state_id, oc.office_id) AS candidates, COUNT(CONCAT_WS('~@~', lastname, firstname, suffix, c.state_id, oc.office_id))
FROM candidate c
LEFT JOIN election_candidate oc
USING (candidate_id)
GROUP BY CONCAT_WS('~@~', lastname, firstname, suffix, c.state_id, oc.office_id)) x
USING (candidates)
WHERE x.count > 1
GROUP BY x.candidates) z
WHERE z.count > 1
ORDER BY office_id DESC

office_candidate
SELECT split_part(z.candidates, '~@~', 1) AS lastname, split_part(z.candidates, '~@~', 2) AS firstname, split_part(z.candidates, '~@~', 3) AS suffix, split_part(z.candidates, '~@~', 4) AS state_id, split_part(z.candidates, '~@~', 5) AS office_id, split_part(z.candidates, '~@~', 6) AS local_id, z.count
FROM
(SELECT x.candidates, COUNT(x.candidates)
FROM
(SELECT candidate_id, CONCAT_WS('~@~', lastname, firstname, suffix, c.state_id, oc.office_id, local_id) AS candidates
FROM candidate c
LEFT JOIN office_candidate oc
USING (candidate_id)
GROUP BY candidate_id, CONCAT_WS('~@~', lastname, firstname, suffix, c.state_id, oc.office_id, local_id)) y
JOIN
(SELECT CONCAT_WS('~@~', lastname, firstname, suffix, c.state_id, oc.office_id, local_id) AS candidates, COUNT(CONCAT_WS('~@~', lastname, firstname, suffix, c.state_id, oc.office_id, local_id))
FROM candidate c
LEFT JOIN office_candidate oc
USING (candidate_id)
GROUP BY CONCAT_WS('~@~', lastname, firstname, suffix, c.state_id, oc.office_id, local_id)) x
USING (candidates)
WHERE x.count > 1
GROUP BY x.candidates) z
WHERE z.count > 1
ORDER BY office_id DESC
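As noted in the Dupe Hunting sheet description, the two queries above could probably be combined so only one pass is needed. This is a hedged sketch of one way to do it with a UNION ALL: it drops local_id for simplicity and adds an illustrative source column (not a real schema column) so you can see which table produced each collision. It has not been run against the production database.

```sql
-- Sketch: combined Dupe Hunting query (unverified; "source" is illustrative).
SELECT lastname, firstname, suffix, state_id, office_id, source,
       COUNT(DISTINCT candidate_id) AS num_candidates
FROM (
    SELECT c.candidate_id, c.lastname, c.firstname, c.suffix, c.state_id,
           oc.office_id, 'office_candidate' AS source
    FROM candidate c
    JOIN office_candidate oc USING (candidate_id)
    UNION ALL
    SELECT c.candidate_id, c.lastname, c.firstname, c.suffix, c.state_id,
           ec.office_id, 'election_candidate' AS source
    FROM candidate c
    JOIN election_candidate ec USING (candidate_id)
) u
GROUP BY lastname, firstname, suffix, state_id, office_id, source
HAVING COUNT(DISTINCT candidate_id) > 1
ORDER BY office_id DESC;
```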