TagsCleanUpDocumentation2016: PVSWiki

Tags Clean-Up
Documentation
Jamie Willett
Database Specialist
Project Start: March 2016
Project End: July 2016

Introduction
Tags were first implemented as a way to give our data some substance. If a speech was about the ACA, we would tag Affordable-Care-Act; if a degree was a BA, we would tag it bachelors-degree. Tags came about that in the future of the project, they would eventually become a part of our website. However, we needed to populate the tags before the project got to the public, similar to Public Statement’s Key Speeches program. Now, while I still believe that tags will become another portion of our website, Vote Smart has decided to start developing a mobile app that will use tags.

With this thought in mind - this revolutionary idea of being able to have a Vote Smart app for cell phones - and a few other thoughts, the idea of a new position was born: Database Specialist. The role of Database Specialist has to be filled by somebody who has a large grasp of Vote Smart’s data and all the little intricacies of said data.

Tags have never had a massive clean up since the program first started in 2013. Thus, it was a perfect project. Vote Smart never had the resources or the time to put into such a massive project as well as nobody had ever wanted to be submersed 100% into tags for the majority of two months. Vote Smart as a whole is always looking at the present, nothing further and nothing in the past. We are an organization based wholly on elections and helping the voter out. However, the tags project is going to benefit the future of the organization as well as help out with old, historic data. It’s the best of both worlds.

From Abortion to Women: A Final Report of the Category Project by Ben Raker, Project Manager, September 2013 gives good insights to the overall development of the Categorization and Tagging Project. (If you would like to see it, the Research Director holds on to the printed copy). Tags first started out as being broad and not super specific. Church-of-St-Mary-in-Pulaski-Virginia would be a bad tag because it is too specific, and would instead get merged with the tag Christianity. As I've come to look through the tags these past two months, tags as a whole have gotten quite specific – and maybe a way to combat the creation of new tags is to tell people to stop being so incredibly specific.

Project Development
I decided to handle the tags project in a series of steps. This makes it easier for me to handle as well as having great documentation on everything that happens during the course of the project.

All Database Specialist information on the Drive is located at research/Active/Department Projects/Database Specialist. On the Google Drive, the folder is called Database Specialist Projects. For the tags project, all documentation will be kept in the research drive on the computer, unless otherwise noted.

Step 1: Research & the Terrible Tags
The first order of business was to clean up the Research and the Terrible Tags Google doc. Then, the tags editor would take over it afterwards when it was more of a manageable document. The tags editor is different from the database specialist, and is usually the person in speeches with the most experience. As of writing this, that person is Britton Fraser. The main reason why the database specialist is not the tags editor: he/she is outside the realm of keeping up on the news like the speeches department - speeches deals day after day after day with current events via the speeches they are collecting from Congresspeople. Another reason is that speeches is still the sub-department of the organization that handles training cats and tags (technically not true because the research director at the time, Bill Carrigan does the training, but it's still very much a speeches thing).

This part of the project was quite easy and was done quite quickly. The tags editor has now taken over the document and will work on it on his own pace, whenever he has the free time to do so. I will occasionally help him with these.

Step 2: Cut-Offs
The amount of tags that had been used once at the beginning of this project numbered above 39,000. The objective was to go through these tags and see which tags should be merged, which to keep, and which to have IT mass delete. However, we didn’t want to include tags that had been recently created - those tags still need to have the opportunity to develop and form relationships/connections with other pieces of data.

Between the database specialist, the research director, and the national director, we decided that there should be a cut-off somewhere and revolving around a specific event in American politics that had happened in December 2014. In December 2014, the only real big thing was when President Obama opened up our relations with Cuba - this was when the tag Cuba-US-relations was created, tag_id 99329.

Tags that have a tag_id that is higher than 99329 are not to be touched unless they’re misspelled. Tags that have an id lower than 99329 that had been used once numbered a little bit below 35,500.

The idea was that this project would take two months of 100% complete time of the database specialist. I think we grossly underestimated how long it would take to do and how hard it would be. Overall, it’s easy work, but spending 8 hours merging tags does give a certain strain on an individual.

Luckily, next year in 2017 when the database specialist does this same project, they will have a lot less tags to go through. For the 2017 project, the tag_id selected as their cut-off point will involve something huge that happened in American politics that was first tagged in December 2015. When the 2018 project happens, the tag_id selected as the cut-off point will deal with something that happened in American politics that was first tagged in December 2016, and so on for the duration of Vote Smart as an organization that has a database specialist.

Step 3: The Actual Project Begins
The project began with a giant SQL query that is in Appendix A. From there, this data was transferred into LibreOffice. There are two different tracking documents that will be used for the project. One for tags that were only used once and the other is what will eventually be sent to IT and is called 'tags to be deleted.ods'.

This first tracking sheet is called ‘tags_used_once.ods’ and is located in research/Active/Department Projects/Database Specialist/Tags Project on the drive. From here, the document is split into multiple sheets to keep it a more organized project:

‘Tags not gone through’ is the main task at hand. All the work gets done in that one and then transferred to the other sheets or to the ‘tags to be deleted.ods’ document.
‘Tags that NEED to be merged’ is simply tags that I didn’t want to merge immediately. I have found it is easier to merge tags 200 at a time and not after every single one I come across. I believe it saves time, too.
‘Merged tags’ includes all the tags that have been merged. As I work through ‘tags that NEED to be merged,’ they join this sheet.
‘Tags to keep’ are all the tags I believe we should keep around.
‘Ask OTHERS if tag is important, if it’s not delete’ are all the tags that I think could pertain to another researcher as a tag worth keeping or at least discussing reasons to keep a tag around.
‘Tags already deleted’ are all tags that have deleted = TRUE, which means that the tag was already deleted.

There is also a key to understanding the ‘tags_used_once.ods’ document:
Red background = this tag has already been deleted
Cyan background = this tag should be deleted
Green background = this tag should be kept
Neon green background = this tag needs to be merged
Blue background = this tag has been merged
Pink background = ask others if this tag is important.

All colors should change accordingly if any changes get made. Say, you have a tag that needs to be merged and then you merge it, the background color would change from neon green to blue. Honestly, this was just good for when I was looking at a few thousand tags and wanted to sit further away from my computer.

After I developed a way of tracking things, I had to think about things that I knew definitely mattered, definitely could just be merged with something less specific, and those that could definitely just be deleted. This chart is in Appendix B.

Originally, we had the idea to delete all of the tags that had been used just once. This idea turned out to be a dud because of tags like Jeff-Duncan - he’s a current US Representative and will no doubt be tagged again. Thus, there are some tags that are tagged once that should definitely still be kept.

One would think that the final number of all the sheets will equal the original starting number, but I feel like it probably isn't going to be this way. While I am fantastic at keeping track of my work, there are some that can just slip right through the cracks. Another thing that I don't take into consideration: if a tag that had been used once, but when I finally get around to checking about its category, or something like that of its nature, if it has been now been used 10 or so times, I delete it from the spreadsheet – because if I do do a hunt for bad tags that have been tagged more than once, I will find those tags then.

Step 4: Categorizing Tags
When it comes time to categorize where a tag_id is used (be it biotext, measure, speech, sig, etc.), there are multiple ways to go about doing it.

The first is manually; every tag in the 'tags to be deleted.ods' doc gets run through the tags editor, one by one. This method is extremely time consuming and should only be used as a last resort. This method is also flawed because sometimes the tag editor doesn't know how to categorize a tag and thus leaves the field blank (see mantis ticket 7580).

The second method uses SQL. The query is in Appendix C; it's a variation of the v_tag_usage table in the public folder in SQL. With this second method, you copy about 500-1,000 tag_ids from the 'tags to be deleted.ods' doc and put them into a website that can remove the line between them and adds a comma instead (I used sortmylist.com). You copy these results and stick them into the WHERE clause (WHERE t.tag_id IN (tag_id1, tag_id2, tag_idn, etc.). Run the query and save both it and SQL. Open a blank document and open up your data sources (F4). Copy the returns from the query you made into B1. In column A, copy the original list from 'tags to be deleted.ods.' You're going to then compare the list in column A to the list in column B because there are going to be tag_id's in column A that aren't in column B (because of tag merging) and there are going to be tag_id's in B that are going to be listed twice because somebody may have recently used it in a different category than it was originally used in. We want the category to match up correctly with its tag. While this can get annoying, it's not nearly as annoying as the first option. -- I know because I've extensively done both.

The latter option takes a lot less time: I catted 11,000+ tags in a day and half. The former option took about 4 days or so to do about 5,000.

Step 5: Other eyes on ‘tags to be deleted.ods’
While I do have a really good grasp on tags from working in both bios and speeches, I could miss things from spending 8-5, 5 days a week, week after week on this project. From here, I would like to have a member of each sub-department and a person from PCT (if they decide to assist) go over the tags and make any notes to the tracking doc: whether they think we should instead keep a tag or if they know of any tags that a tag could be merged with instead of being deleted.

This tracking doc was moved into a Google doc so it would be easier to have more than one pair of eyes on it at a time. The Google doc is then split between the different categories: speech, biotext, rating, sig, measure, npatanswer, cong, and committee. There is also a label for People and this sheet is made up of people (roughly 90% of all the people in the 23,000 tags – I'm sure I missed a few that are still in speech, biotext, etc.) and it's also a combination of all the categories mixed together. One person from Profiles should look at the biotext tags; one person from SIGS should look at both the rating and the sig tags; PCT would look at npatanswer; a member of the TX office would look at cong; and at least two people from speeches would look at speech. The speech category, as one could probably guess, accounts for at least 80% of all tags in the 'tags to be deleted.ods' document.

As people in other departments go through this doc, there will also be a notes section, and there they can put their thoughts on if they would like to do something other than delete that tag.

This was a fairly easy step. Speeches had the most amount of tags but they had their interns go over some tags, as well. All sub-departments managed to go over their tags in under ten days.

Step 6: What type of deletion (as you can tell completed much before this step)
We have the opportunity to go with one of two options of deletion: delete-regular or a full delete.

Delete-regular will delete the tag from us using it, but on any items this tags is attached to, it will stay there. This is the better option (and the current option in existence) because once you delete a tag using this regular method, you cannot re-add it - it stays deleted. On the query itself (Appendix A), tags that are deleted-regularly have a check mark in the column that is labeled ‘deleted.’

Full delete will delete the tag_id from not only the tag table but also from other tags tables it is associated with (speech_tag, experience_tag, biotext_tag, etc.) This option would make it so that the tag had never existed in the first place, which means that the tag could possibly be made again further down the road with a new tag_id associated with it. This second option is a larger request but is more than likely probably doable by IT.

We decided option delete-regular was the better decision. Delete-regular will make it so that if a person tried to recreate that tag, it won’t reshow up in the tags manager. It will continue showing up in the tags query as being deleted, and still assigned the same tag_id number. You can check this by using a recently deleted tag with the tags editor.
The idea is to minimize the effort done by IT. They are busy with many different projects.

Step 7: Involving IT
After everyone has given the document a once-over, and I have made all the necessary changes, it is time to get IT involved. This will be done via a mantis ticket and changing ‘tags to be deleted.ods’ into a .csv file. From here, the waiting period until they have the time to handle it.

Mantis ticket = 8037, submitted 8/15/2016, worked on by IT Director Mike Shultz, and completed 8/18/2016.

Step 8: Project Complete!

Final Thoughts on the Project:
While I am a very optimistic person both with how much I can accomplish in a given amount of time and how much I enjoy doing projects that others aren't working on, I didn't realize at the start of the project how long it was going to take to get through a 30,000+ rowed document on tags that were used once and whether or not we should keep that tag. Even with 2+ years of being super familiar with the tags system, I am discovering a world of tags beyond even me. And it gets kind of annoying every day to play with tags. But, I did this to myself – I just highly don't recommend it to anybody else for the future of the organization – though maybe keeping up with it year after year will be better for the project and not as annoying. The person would only be going through maybe 5,000 tags and that is a hell of a lot easier to handle than 35,000 tags.

As it goes, at this point in time I'm not sure how useful the tags system is to the future of the organization. Yes, we have constantly talked about it's implementation but even with these changes I am afraid that it (the tags project in general) will still be too large to be useful to the public world. This is why I think the project needs to go further and have a check through the rest of the tags that have been tagged more than once. That is a much larger project than this one, but it would be worth it in the end.

Anybody can make a tag, and often times the tag created is either misspelled or not properly formatted or gets entirely too specific. Maybe for the future of the project, it would be good to introduce a system in which interns can't create tags – we (Britton and I) have talked about adding a permission to admin that makes it so you can only tag if you have that permission added to you. Then, after they've proven themselves after a week or two, the interns could have the permission added to them. This could also work for new staff members. I know it's not just interns that are making new sloppy tags, but a lot of the new bad tags created are created during a slew of new arrivals to Vote Smart.

In the binder called From Abortion to Women: A Final Report of the Category Project by Ben Raker, Project Manager, September 2013: tags should not get too specific. Right now, we have a case of being extremely overly specific. Tags should be more broad. Mr. Raker said that tags was the hardest part for him and other associates in the project in terms of accuracy and consistency (page 11). I highly agree with this point and refer back to the last paragraph: an associate/intern should not be able to create tags unless they've proved themselves capable of the tags that are already at their disposal. This would combat all the tags that get created when an intern first starts.

It's also hard to keep track of all the tags that you're merging. Yes, you go through your tracking doc, but if you come across something in the tags editor while in the search for the tag from the tracking doc, you'll fix those tags at the same time. So, the pace of the project is only getting reflected from the tracking doc, but you're also merging a lot more tags than you think you are. So, on one hand you think you're behind on the project, but on the other hand, you also think that you're ahead of schedule because you're merging other tags that either have been used once or have been used more than once. Either way, thinking mid-June for getting this project done was a gross underestimation by myself.

Some tags were so messy that it was better to just do all of them all at once, regardless of how many times they had been tagged. For example, two areas I cleaned up, with the consultation of the tags editor, were [state]-Department-of-[whatever] and subcommittee names. For these, I would run a query with WHERE tag.name LIKE (or ILIKE) '%%'. This would grab all possible tags for me and I could transfer them into LibreOffice Calc to further play with the data. The first of these two aforementioned projects inside this project, just became State-Department-of-[whatever] across the board. The second is a little trickier in that I have to do a Google search to figure out what subcommittee the tag is referring to and then change that tag into a properly formatted tag that was discussed between the tags editor and I. [House/Senate]-[Committee]-[Subcommittee] ie: House-Armed-Services-Subcommittee-on-Readiness.

All in all, while I did get annoyed with the project a few times (and LibreOffice is the worst program ever to have a 35,000+ rowed document), it was a great project that would not have been done any time soon otherwise. I do highly recommend another project starting up maybe roughly around October-November to do a sweep of the tags tagged more than once. I also think that Vote Smart should have more than one database specialist. I cover a lot of different information, but of the main goal of cleaning up the database – we have so much data, and a decent amount of errors. This is not a one person job, no matter how much I think I am capable of huge, seemingly unending projects.

This project also identified a lesser problem that is already in a mantis ticket: tags that refuse to merge. Ticket = 7294. It doesn't seem like there's a pattern as of yet for it.

Final stats:

Tags Merged: 7,359
Tags Kept: 6,707
Tags that Won't Merge: 22
Tags to be Deleted: 21,778

Appendix A: Query Used

SELECT t.tag_id, t.deleted, t.name, count (t.tag_id) AS usage
FROM
(( SELECT ac.tag_id
FROM (
SELECT biotext_tag.tag_id
FROM biotext_tag
UNION ALL
SELECT additional_tag.tag_id
FROM additional_tag
UNION ALL
SELECT additional_candidate_tag.tag_id
FROM additional_candidate_tag
UNION ALL
SELECT education_tag.tag_id
FROM education_tag
UNION ALL
SELECT experience_tag.tag_id
FROM experience_tag) ac
UNION ALL
SELECT committee_tag.tag_id
FROM committee_tag
UNION ALL
SELECT congcategory_tag.tag_id
FROM congcategory_tag
UNION ALL
SELECT measure_tag.tag_id
FROM measure_tag
UNION ALL
SELECT npatanswer_tag.tag_id
FROM npatanswer_tag
UNION ALL
SELECT rating_tag.tag_id
FROM rating_tag
UNION ALL
SELECT sig_tag.tag_id
FROM sig_tag
UNION ALL
SELECT speech_tag.tag_id
FROM speech_tag
UNION ALL
SELECT party_tag.tag_id
FROM party_tag ) at
JOIN tag t
USING (tag_id))
GROUP BY t.tag_id, t.name
ORDER BY count (t.tag_id), t.tag_id asc, t.deleted;

Appendix B: Chart on General Rules

Tags Generally Kept *Unless misspelled	Tags Generally Merged * Unless misspelled	Tags Generally Deleted
Colleges	College Sports Teams ex: Penn-State-Nittany-Lions becomes Pennsylvania-State-University	[name]-Month
Tribes (Native Americans)	Awards/medals Becomes awards-and-medals	[name]-Day
Reservations (Native Americans)	Churches Becomes Christianity	Lesser known acronyms
Boats (USS-[name])	Random roads/bridges Becomes roads-bridges-and-highways	[name]-Week
Acts on the federal level	[name]-hospitals Becomes hospitals	Acts on state level
Countries	[name]-Fire-Company Becomes fire-companies	General-[name]
Unions	[name]-Watershed Becomes watersheds	[name]-Township
Rivers	[name]-Medical-Center Becomes medical-centers	[name]-confirmation
Political Action Committees	[name]-Newspaper Becomes newspapers	[name]-nomination
Hurricanes	[name]-museum Becomes museums	small banks
Battlefields	[name]-Public-Library Becomes public-libraries	Dr-[name]
Air Force Bases	Ambassador-to-[name] Becomes [name]-US-relations	Reverend-[name]
Naval Bases	[small town]-[state] Becomes [state]	Rev-[name]
Court cases	Port-of-[name] Becomes ports-harbors-and-waterways	Random irrelevant person's name
Pro sports teams	Minor league sports teams Becomes whatever sport it is	Small towns
National Wildlife Refuges	Random committees Becomes the properly formatted name	Creeks
Animals	Random subcommittees Becomes the properly formatted name	[name]-Task-Force
sanctions-on-[name]	Random sporting place ex: Mile-High-Stadium merges into Denver-Broncos	[name]-Law
National forests	[state]-Department-of[name] Becomes State-Department-of-[name]	Shootings (tagged once)
[name]-Party	[name]-fire Becomes wildfires	Operations
[country]-elections	[name]-Courthouse Becomes courthouses	Mountains
I-[#] (interstate tags)	[name]-road/highway/bridge Becomes roads-bridges-and-highways
Subcommittees	US-[#] Becomes roads-bridges-and-highways
	[name]-Beach Becomes beaches
	[name]-State-Park Becomes state-parks
	[name]-Zoo Becomes zoos
	[name]-Trail Becomes recreational-trails
	[name]-Mine Becomes mines
	[name]-airport (small airports) Becomes airports
	Random airplane Becomes airplanes

For a complete list of Appendix B with notes, please see research/Active/Department Projects/Database Specialist/Tags Project/Tags Clean-Up Documentation.ods.

Appendix C: Categorization of Tags

SELECT t.tag_id, t.name, at.sect, count
(t.tag_id) AS count   
FROM ((
SELECT ac.tag_id, 'biotext'::text AS sect
FROM ( SELECT biotext_tag.tag_id
 FROM biotext_tag
 UNION ALL
SELECT additional_tag.tag_id
FROM additional_tag
 UNION ALL
SELECT additional_candidate_tag.tag_id
FROM additional_candidate_tag
UNION ALL
SELECT education_tag.tag_id
FROM education_tag
 UNION ALL
SELECT religion_tag.tag_id
FROM religion_tag
 UNION ALL
SELECT experience_tag.tag_id
 FROM experience_tag ) ac
 UNION ALL
SELECT committee_tag.tag_id, 'committee'::text AS sect
FROM committee_tag
UNION ALL
SELECT congcategory_tag.tag_id, 'cong'::text AS sect
 FROM congcategory_tag
UNION ALL
SELECT measure_tag.tag_id, 'measure'::text AS sect
 FROM measure_tag
UNION ALL
SELECT npatanswer_tag.tag_id ,'npatanswer'::text AS sect
 FROM npatanswer_tag
UNION ALL
 SELECT rating_tag.tag_id ,'rating'::text AS sect
FROM rating_tag
 UNION ALL
 SELECT sig_tag.tag_id, 'sig'::text AS sect
FROM sig_tag
 UNION ALL
SELECT speech_tag.tag_id, 'speech'::text AS sect
FROM speech_tag
UNION ALL
 SELECT party_tag.tag_id, 'party'::text AS sect
 FROM party_tag) at
JOIN tag t 
USING (tag_id))
WHERE t.tag_id IN (86471)
GROUP BY t.tag_id, t.name, at.sect
ORDER BY t.tag_id;
(WHERE t.tag_id IN (t.tag_id1, t.tag_id2, t.tag_id3, t.tag_id4, t.tag_id5, t.tag_id6, t.tag_idn)

< Back to the Database Specialist Data Standards

PVSWiki : TagsCleanUpDocumentation2016

Final stats: