Improving search results at 7digital


Developing the search & catalogue infrastructure for the 7digital API

Technological Objectives

The biggest problem with the old search platform was speed: in January 2014, the average response time for a track search was 4000 milliseconds. In addition to being slow, the search results were often wrong or out of date, or the search would return errors. Customer feedback was that the user experience was poor, and customers were often irritated by the constant feed of error messages.

The metadata for the tracks was stored in a search index that was 660 GB in size and contained 660 million documents, which is extremely large for a search index. Various tweaks were made to the JVM and memory settings, but none of them produced a permanent improvement. An extensive investigation of the search platform followed. It found that the schema in production was indexing fields, such as track price, which were never actually searched upon. A prototype was created with a much smaller search index of around 10 GB. Benchmarked against the original size of 660 GB, this is a clear improvement.
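As a rough illustration of that slimming step, the sketch below projects a full track record down to the fields the index needs. The field names and the `INDEXED_FIELDS` set are assumptions made for this example, not 7digital's actual schema; the only detail taken from the text is that track price was indexed without ever being searched.

```python
# Hypothetical field names; the real schema is not shown in this post.
INDEXED_FIELDS = {"track_id", "track_title", "release_title", "artist_name"}


def to_index_document(track_record: dict) -> dict:
    """Keep only the fields the search index needs; drop everything else."""
    return {k: v for k, v in track_record.items() if k in INDEXED_FIELDS}


full_record = {
    "track_id": 123,
    "track_title": "Thriller",
    "release_title": "Thriller",
    "artist_name": "Michael Jackson",
    "price": 0.99,  # never searched upon, so it does not belong in the index
}

slim_document = to_index_document(full_record)
print(slim_document)
```

Anything dropped here, such as the price, would then have to be read from the catalogue when results are displayed.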

Technological Advancements

This new smaller search index has created a number of technological advances:

being reliant on the ~/track/details endpoint meant we always returned current results, and we were 100% consistent with the rest of the API, which eliminated the catalogue inconsistencies problem;

we could create a brand new index within an hour, which kept the data up to date;

much faster average response times for ~/track/search: the response time fell by 88%, from around 2600ms to around 350ms;

no more deleted documents bloating the index, thus reducing the search space;

longer document cache and filter cache durations, which helped performance, as the caches would only be purged at the next full index every 12 hours;

quicker to reflect catalogue updates, as the track details were served from the SQL database which could be updated very rapidly.
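The pattern behind several of these points, where the slim index only finds matching tracks and the details come from the catalogue, can be sketched as follows. This is a minimal illustration under stated assumptions: `search_tracks`, the two callables and the in-memory stand-ins are hypothetical names, and skipping tracks that are missing from the catalogue is an assumed behaviour rather than something the text confirms.

```python
from typing import Callable, List, Optional


def search_tracks(
    query: str,
    search_index: Callable[[str], List[int]],
    get_track_details: Callable[[int], Optional[dict]],
) -> List[dict]:
    """Look up matching track IDs in the slim index, then fetch full details."""
    results = []
    for track_id in search_index(query):
        details = get_track_details(track_id)
        # Assumption: a track no longer in the catalogue is skipped,
        # so the response never contains details the catalogue lacks.
        if details is not None:
            results.append(details)
    return results


# In-memory stand-ins for the search index and the catalogue database.
index = {"thriller": [1, 2]}
catalogue = {1: {"id": 1, "title": "Thriller", "price": 0.99}}  # track 2 is gone

found = search_tracks("thriller", lambda q: index.get(q, []), catalogue.get)
print(found)
```

Because the details are read at request time, a price change in the catalogue would show up in search results without waiting for the next full index.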

Technological Uncertainties

Would the existing hardware cope with current levels of traffic with the new architecture?
Would the new architecture provide consistent responses across different endpoints, i.e. search results, catalogue results and chart endpoints?

Innovations

We created a Git repository on GitHub with public access; anyone can submit a pull request to add new synonyms for our search platform. We can choose to accept the synonym into our platform, and the change will take effect on our search API within 12 hours of our accepting it. The repository is here: https://github.com/7digital/synonym-list

Some labels deliberately publish tribute, sound-alike and karaoke tracks with very similar names to popular tracks, in the hope that some clients mistakenly purchase them. These tracks are then ingested into our platform, and 7digital’s contract with those labels means we are obliged to make them available. At the same time, consumers of our search services complain that the karaoke and sound-alike artists appear in the search results above the genuine artists, mostly because of the repeated keywords in their track and release titles.

In order to satisfy both parties, we decided to override the default Lucene implementation of search and exclude tracks, releases and artists that contain certain words in their titles, unless the user specifically enters those words as a search term. For example, searching for “We are the champions” now returns the tracks by the band Queen, which is what customers expect. To achieve this we tweaked the search algorithm so that, by default, every search purposefully excludes tracks with the text “tribute to” anywhere in their textual description, be it the track title, track version name, release title, release version name or artist name.
The results look like this: https://www.7digital.com/search/track?q=we%20are%20the%20champions%20queen

Prior to the change, all tribute acts would appear in the search results above tracks by the band Queen. To allow tribute acts to still be found, the exclusion rule will not apply if you include the term “tribute to” in your search terms, as evidenced by the results here: https://www.7digital.com/search/track?q=we%20are%20the%20champions%20tribute%20to%20queen
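The exclusion rule described above can be sketched as a filter over candidate results. This is only an illustration of the logic, not the Lucene override itself: the field names, the function names and the sample documents are hypothetical, and “tribute to” is the only excluded phrase the text names.

```python
from typing import List

EXCLUDED_PHRASES = ["tribute to"]  # the only phrase named in the text

# Hypothetical field names for the textual description of a track.
TEXT_FIELDS = [
    "track_title",
    "track_version",
    "release_title",
    "release_version",
    "artist_name",
]


def active_exclusions(query: str) -> List[str]:
    """A phrase stops being excluded once the user types it in the query."""
    lowered = query.lower()
    return [p for p in EXCLUDED_PHRASES if p not in lowered]


def is_excluded(doc: dict, query: str) -> bool:
    """True if any text field contains a phrase that is still excluded."""
    phrases = active_exclusions(query)
    return any(
        phrase in str(doc.get(field, "")).lower()
        for field in TEXT_FIELDS
        for phrase in phrases
    )


def filter_results(docs: List[dict], query: str) -> List[dict]:
    return [d for d in docs if not is_excluded(d, query)]


docs = [
    {"track_title": "We Are the Champions", "artist_name": "Queen"},
    {
        "track_title": "We Are the Champions",
        "release_title": "A Tribute to Queen",
        "artist_name": "Hypothetical Tribute Band",
    },
]

default_results = filter_results(docs, "we are the champions")
tribute_results = filter_results(docs, "we are the champions tribute to queen")
print(len(default_results), len(tribute_results))
```

In a real deployment the same rule would more likely be applied inside the query itself, so that excluded documents are never scored at all.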

Other music labels send 7digital a sound-alike recording of a popular track, and name it so that its release title and track title duplicate the title of a well-known track. This would mean that searching for “Umbrella” by Rihanna

Search: Dumb similarity modification of Lucene

Lucene is a capable search engine which specialises in fast full-text searches; however, it works best when the documents it searches are paragraph-length natural prose, such as newspaper articles. The documents that 7digital adds to Lucene are models of the metadata of a music track in our catalogue, with short fields such as track title, release title and artist name.

The standard implementation of Lucene gives documents containing repeated terms a higher score than those that contain a single match. This means that for the search term “Michael”, a result such as “The Best of Michael Jackson” by “Michael Jackson” will score higher than “Thriller” by “Michael Jackson”, because the term “Michael” appears twice in the first document but only once in the second.

In terms of matching text values this makes sense, but for a music search we want to factor in the popularity of our releases, based on sales and streams of their tracks.

Ignoring popularity leads to a poor user experience: the “Best of Michael Jackson” release is ranked as the first result, despite being much less popular than “Thriller”, which is ranked lower in the search results.

We addressed this by modifying Lucene’s term frequency weighting in the similarity algorithm.
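To show why term frequency weighting matters here, the sketch below compares the term-frequency factor of Lucene's classic TF-IDF similarity, which is the square root of the number of occurrences, with a flat variant that counts any match once. The flat variant is an assumption about what a "dumb" similarity might look like; the text does not give the exact modification. Counting terms across the whole document is also a simplification, since Lucene scores each field separately.

```python
import math


def classic_tf(freq: int) -> float:
    """Term-frequency factor of Lucene's classic TF-IDF similarity."""
    return math.sqrt(freq)


def flat_tf(freq: int) -> float:
    """Assumed 'dumb' variant: repeated matches count no more than one."""
    return 1.0 if freq > 0 else 0.0


def term_count(doc: dict, term: str) -> int:
    """Count occurrences of a term across all fields (a simplification)."""
    words = " ".join(doc.values()).lower().split()
    return words.count(term.lower())


best_of = {
    "release_title": "The Best of Michael Jackson",
    "artist_name": "Michael Jackson",
}
thriller = {"release_title": "Thriller", "artist_name": "Michael Jackson"}

for name, doc in [("Best of", best_of), ("Thriller", thriller)]:
    freq = term_count(doc, "Michael")
    print(name, freq, round(classic_tf(freq), 3), flat_tf(freq))
```

With the classic factor the "Best of" release gets a higher weight for “Michael”; with the flat factor both releases tie on text matching, which leaves room for a popularity signal to decide the order.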
