The goal of this thesis was to identify algorithms that can determine whether two songs are the same based on their artist, title, and duration metadata. I based my research on metadata for more than 95,000 songs from the game SongArc. From this large data set I selected three smaller sample sets to focus on. The first sample set contained three popular songs, their variants, and songs that resembled them in either title or artist. The second sample set contained songs with similar titles. The third sample set contained songs whose metadata included non-Latin characters.
These three sample sets were used to measure how well algorithms perform at detecting similar songs. I selected various approximate string matching algorithms to examine, based on two distinct approaches: edit distance and tokens. My final goal was to improve the accuracy of the algorithms by transforming their input. I found that removing the text starting at the first bracket character improved the results significantly. The edit-distance-based algorithms benefited the most, especially when the goal was to keep the number of false positives low. Ignoring the artist metadata field when it contained the word “unknown” improved the accuracy of all algorithms, with the greatest effect on token-based algorithms.
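The two input transformations described above can be illustrated with a minimal sketch. The function names, the set of bracket characters, and the choice to strip brackets from both the artist and the title are assumptions made for illustration; they are not taken from the thesis implementation.

```python
# Hypothetical helper names; the set of bracket characters is an assumption.
OPENING_BRACKETS = "([{"


def strip_from_first_bracket(text):
    """Drop everything from the first opening bracket onward."""
    positions = [i for i in (text.find(b) for b in OPENING_BRACKETS) if i != -1]
    if positions:
        text = text[:min(positions)]
    return text.strip()


def normalize_song(artist, title):
    """Apply both transformations before the strings are compared."""
    # Assumption: bracketed suffixes are removed from both fields.
    artist = strip_from_first_bracket(artist)
    title = strip_from_first_bracket(title)
    # An artist field containing "unknown" carries no information, so ignore it.
    if "unknown" in artist.lower():
        artist = ""
    return artist, title
```

For example, under these assumptions `normalize_song("Unknown Artist", "Yesterday (Remastered 2009)")` yields an empty artist and the title `"Yesterday"`.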
In conclusion, I found that, with my improvements applied, the cosine similarity metric is the best choice for detecting whether two songs are similar based on their artist and title metadata. When high accuracy is required, it detects 87.27% of the similar songs at a sensitivity of 0.55. When high accuracy is not required, it detects 96.78% of the similar songs at a sensitivity of 0.42.
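As a rough illustration of how a cosine similarity comparison of this kind might look, the sketch below builds word-count vectors from the combined artist and title and compares them. Tokenizing on whitespace, lowercasing, concatenating the two fields, and treating the sensitivity value as a minimum similarity threshold are all assumptions; the function names are hypothetical and the thesis implementation may differ.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two strings using word-count vectors."""
    # Assumption: tokens are lowercase, whitespace-separated words.
    va = Counter(a.lower().split())
    vb = Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty string shares no tokens with anything
    return dot / (norm_a * norm_b)


def is_same_song(song_a, song_b, threshold=0.55):
    """Compare two (artist, title) pairs against a similarity threshold.

    Assumption: the threshold plays the role of the sensitivity value,
    e.g. 0.55 for the strict setting and 0.42 for the lenient one.
    """
    text_a = " ".join(song_a)
    text_b = " ".join(song_b)
    return cosine_similarity(text_a, text_b) >= threshold
```

With these assumptions, `is_same_song(("The Beatles", "Yesterday"), ("Beatles", "Yesterday"))` compares three tokens against two, shares two of them, and scores about 0.82, which passes both thresholds.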