I’m back again with my annual festival of statistics for the TripleJ Hottest 100! I had even less time to muck about with it this year, so I’ve used the same process as last year with a few minor tweaks.
I sampled data by grabbing posts to Instagram tagged with #hottest100 from when the voting opened until it closed.
I also grabbed tweets on Twitter in the same time period, filtering for ones that had a ‘media’ file attached, which is usually an image of some sort.
I filtered the raw post data down to a list of unique URLs, so that duplicates from retweets, reblogs, likes, etc. were removed as far as possible. I ended up with 6,182 unique URLs from Instagram, and only 635 from Twitter.
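For anyone wanting to try something similar, here’s a minimal sketch of the de-duplication step. It isn’t the script used for this post: the function name `unique_urls` is made up, and it assumes duplicates are exact string matches once surrounding whitespace is trimmed.

```python
def unique_urls(urls):
    """Return URLs in first-seen order with exact duplicates removed.

    Assumption: two posts point at the same image only if their URL
    strings are identical after trimming whitespace.
    """
    seen = set()
    result = []
    for url in urls:
        cleaned = url.strip()
        # Skip blank entries and anything already collected.
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result
```

Keeping first-seen order isn’t needed for counting, but it makes the download list reproducible from run to run.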
I’m not sure why there are so few image URLs from Twitter, and I didn’t have much time to look into it in detail. I did notice a lot, and I mean a lot, of retweets of a certain betting firm’s tweet about Justin Bieber. A cursory look suggests that it’s just a bunch of Bieber bots and rabid fans who’ll retweet anything mildly positive about His Bieberishness.
I ignored the Twitter results and concentrated on the Instagram images. I downloaded them using wget (and a slow random sleep between fetches to keep the bandwidth load down) and then ran them through the OCR program tesseract to get a corpus of text.
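A rough sketch of that download-then-OCR loop, driven from Python, might look like the following. It is an illustration rather than the actual script: the function name, file naming scheme and wait times are all assumptions, and it expects the `wget` and `tesseract` command-line tools to be installed. Tesseract is called as `tesseract IMAGE OUTPUTBASE`, which writes its text to `OUTPUTBASE.txt`.

```python
import random
import subprocess
import time
from pathlib import Path


def fetch_and_ocr(urls, out_dir, run=subprocess.run, sleep=time.sleep,
                  min_wait=1.0, max_wait=5.0):
    """Download each image, OCR it, and return the paths of the text files.

    `run` and `sleep` are injectable so the loop can be dry-run without
    touching the network. The wait bounds are illustrative guesses.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    text_files = []
    for index, url in enumerate(urls):
        image_path = out_dir / f"image_{index:05d}.jpg"
        text_base = out_dir / f"image_{index:05d}"
        # Fetch the image quietly to a known filename.
        run(["wget", "--quiet", "-O", str(image_path), url], check=True)
        # tesseract appends ".txt" to the output base name itself.
        run(["tesseract", str(image_path), str(text_base)], check=True)
        text_files.append(text_base.with_suffix(".txt"))
        # Random pause between fetches to keep the bandwidth load down.
        sleep(random.uniform(min_wait, max_wait))
    return text_files
```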
I also grabbed the list of songs TripleJ included on their voting website and matched only against that, so my method will have ignored any write-in songs that people added themselves, even ones that got a significant number of votes. I haven’t had time to fix up my scripts to take write-ins into account.
I then processed the text with a custom script that uses Levenshtein distance to figure out if the OCRed text is a close enough match to a song in the song list. Every match is a vote. The biggest change from last year is a move to using a pool of processes instead of carving up the corpus and submitting one subset to each of a fixed set of processes. This smooths out the processing load and is a bit more efficient overall than what I did last time.
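To give a feel for the approach, here’s a small sketch of Levenshtein-based matching fed through a process pool. It is not the custom script itself: the function names, the 20% edit-distance threshold and the whole-line comparison are all assumptions made for illustration (the real matcher also allows partial matches, which this doesn’t attempt).

```python
from collections import Counter
from functools import partial
from multiprocessing import Pool


def levenshtein(a, b):
    """Edit distance between two strings, via the two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def votes_in_text(text, songs, max_ratio=0.2):
    """Return the songs with at least one OCR line within the edit budget.

    Assumption: a line counts as a vote when its distance to the song
    string is at most `max_ratio` of the song string's length.
    """
    lines = [line.strip().lower() for line in text.splitlines() if line.strip()]
    votes = []
    for song in songs:
        target = song.lower()
        limit = int(len(target) * max_ratio)
        if any(levenshtein(line, target) <= limit for line in lines):
            votes.append(song)
    return votes


def tally(texts, songs, processes=None):
    """Count votes across all OCR texts using a pool of worker processes.

    Each text is handed to whichever worker is free next, rather than
    pre-splitting the corpus into fixed chunks. On platforms that spawn
    workers, call this from under an `if __name__ == "__main__":` guard.
    """
    counts = Counter()
    worker = partial(votes_in_text, songs=songs)
    with Pool(processes=processes) as pool:
        for votes in pool.imap_unordered(worker, texts):
            counts.update(votes)
    return counts
```

Handing out one text at a time means a worker that gets a run of short files just picks up more of them, which is where the smoother load comes from.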
The last step is to print out the vote tallies for each song in CSV format so it’s easier to load into spreadsheets and stuff.
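As a sketch of that final step, assuming the tallies live in a `collections.Counter` keyed by song (an assumption, as is the `song,votes` header), Python’s `csv` module handles the quoting for titles that contain commas:

```python
import csv
import io
from collections import Counter


def tallies_to_csv(counts):
    """Render vote tallies as CSV text, highest vote count first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["song", "votes"])  # hypothetical column names
    # Sort by votes descending, then by title so ties are stable.
    for song, votes in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
        writer.writerow([song, votes])
    return buffer.getvalue()
```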
I ended up with 29,805 votes for individual songs. I heard on TripleJ yesterday that there were a bit over two million votes this year, so this is a sample size of just under 1.5% of the total vote, which is enough to be very confident of who will win. Over 95% confident, in fact.
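The arithmetic behind those figures can be sketched as below. The sample fraction uses the numbers above, with two million as a round stand-in for “a bit over two million”. The margin-of-error function is the textbook normal approximation for a proportion and is an assumption added here, not a calculation from this post: it only holds for a simple random sample, which Instagram posts are not, so treat it as a best case.

```python
import math


def sample_fraction(sample_votes, total_votes):
    """Fraction of the total vote that the sample covers."""
    return sample_votes / total_votes


def margin_of_error(share, sample_votes, z=1.96):
    """Approximate 95% margin of error for one song's vote share.

    Assumes a simple random sample; z=1.96 is the usual 95% value.
    """
    return z * math.sqrt(share * (1.0 - share) / sample_votes)


# 29,805 sampled votes out of roughly 2,000,000.
print(f"{sample_fraction(29805, 2_000_000):.2%}")
# Worst case (a 50% share) for a sample of this size.
print(f"{margin_of_error(0.5, 29805):.2%}")
```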
I’ve uploaded the results for this year into the same Google Sheets format as last year. Click here to check it out.
Tomorrow I’ll hopefully be tracking the results throughout the day and we’ll see how well the method holds up with predicting songs. I’m not expecting it to be any better than last year, though hopefully it won’t be much worse.
I’ve done some mucking about since I hit publish, thanks to some errors pointed out by people I’ve shared my data with.
The matcher is the biggest source of error, since it’s trying to find matches in centre-justified OCR text that sometimes runs across multiple lines. It tries to compensate by allowing partial matches, but it doesn’t deal well with songs that have long titles or long artist names. For example: Downtown by Macklemore, Lewis et al is undercounted because it’s got a really long name when you include all the ‘featuring’ bits.
I tried stripping off all the parenthetical stuff at the end of song names for the matcher, but then it may not match as well when the full, long title does appear in the OCR text. I’d have to muck about with it a lot more to make it better, and I’ve only had a couple of hours to play with this today, so the results are going to be off.
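For reference, stripping trailing bracketed bits can be sketched with a regular expression like this. The function name and the choice to handle both round and square brackets are assumptions for illustration, not what the matcher actually does.

```python
import re

# One bracketed group at the very end of the string, e.g. " (feat. X)".
_TRAILING_GROUP = re.compile(r"\s*[\(\[][^\)\]]*[\)\]]\s*$")


def strip_trailing_parens(title):
    """Remove bracketed groups from the end of a title, repeatedly.

    Bracketed text at the start or in the middle is left alone.
    """
    previous = None
    while previous != title:
        previous = title
        title = _TRAILING_GROUP.sub("", title)
    return title.strip()
```

One way around the trade-off would be to try matching both the full and the stripped title and count a vote if either is close enough, though that hasn’t been tested here.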
I’m expecting to see a few songs wildly out of place in the predictions the matcher gives for this, and other, reasons.