
R&D Kafka Catalogue Cloud write-up

 

Avoiding the move of a monolithic database into the cloud

Music labels send 7digital not just the audio recordings, but also the data pertaining to the audio, such as the artwork, track listings, performing artists, release dates, prices, and the specific rights to stream or download the music in various territories around the world. Approximately 250,000 tracks per week are received by 7digital and added to its catalogue, which is stored in a database. This process is called ingestion.

At the outset of the project there was a single database that stored both catalogue and sales data and was used for multiple, unrelated purposes. This database was written to with new albums sent to 7digital by music labels, along with the licensing rules governing who could access the music.

Slow queries caused the web applications which used the database to time out and fail, returning errors to end users. Changing the database structure would help to resolve these errors and failures; however, because the web applications were tightly coupled to the database, it would necessitate rewriting nearly every other web application that 7digital owned. The database was also very large and used proprietary, licensed technology that could not easily be moved to data centres around the world.

By separating the catalogue data from the other data in our system, it would become possible not only to write a much more efficient database schema that failed less often, but also to move this database onto a cloud provider's platform. With the database in the cloud, we could build part of 7digital's Web API platform on the same provider's platform and therefore deploy it nearer to our customers in Asia.

Creating a separate catalogue database in London and moving the applications to AWS in Asia might help solve the problems with concurrent reads and writes to the database, but it would not help reduce the latency experienced by customers in Asia.

The key issue, then, was figuring out how to transport the relevant catalogue data out of the ingestion process in London and into an AWS region.

Moving the ingestion process itself into AWS was a far larger piece of work and would not, by itself, yield performance improvements for customers in Asia, so it was decided not to move it out of the London data centre in 2015.
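The transport chosen for this was Kafka: the ingestion process in London would publish catalogue changes to a topic, and a consumer in an AWS region would read them. As a rough sketch, not the production code, a producer using the kafka-python client might look like this, with the broker address, topic name and message shape all hypothetical:

import json

from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(
    bootstrap_servers=["kafka-london.example.com:9092"],   # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_catalogue_update(track):
    # Publish one catalogue change so a consumer in AWS can pick it up.
    producer.send("catalogue-updates", value=track)        # hypothetical topic

publish_catalogue_update({
    "track_id": 12345,
    "title": "Example Track",
    "territories": ["GB", "JP"],
    "action": "upsert",
})
producer.flush()  # block until the message has actually been sent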

Failures of the project

During 2015 we did not complete the final link in the chain: the application which could read messages from the Kafka service and persist their contents into the AWS database. We could not deploy a fully fledged version of the London-based Gateway API service, as it was too complex; instead we made a naive implementation of this Gateway using nginx.
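For illustration only, a minimal version of that missing persister, again using the kafka-python client and with SQLite standing in for the AWS database, might have looked something like this; the topic name, broker address and schema are hypothetical:

import json
import sqlite3

from kafka import KafkaConsumer  # kafka-python client

# SQLite stands in here for the AWS-hosted catalogue database.
db = sqlite3.connect("catalogue.db")
db.execute("CREATE TABLE IF NOT EXISTS tracks (track_id INTEGER PRIMARY KEY, title TEXT)")

consumer = KafkaConsumer(
    "catalogue-updates",                                   # hypothetical topic
    bootstrap_servers=["kafka-london.example.com:9092"],   # hypothetical broker
    group_id="catalogue-persister",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

# Read messages as they arrive and persist each one into the database.
for message in consumer:
    track = message.value
    db.execute(
        "INSERT OR REPLACE INTO tracks (track_id, title) VALUES (?, ?)",
        (track["track_id"], track["title"]),
    )
    db.commit()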

An instance of Kafka was built in late 2015 and the catalogue consumer was due to be started in early 2016, but the Catalogue Persister service was eventually abandoned.


Improving search results at 7digital

Developing the search & catalogue infrastructure for the 7digital API

Technological Objectives

The biggest problem with the old search platform was speed: in January 2014, the average response time for a track search was 4,000 milliseconds. In addition to being slow, the search results were often wrong, out of date, or returned errors. Customer feedback was that the user experience was poor, and customers were often irritated by the constant feed of error messages.

The metadata of the tracks was stored in a search index that was 660 GB in size and contained 660 million documents, which is extremely large compared with typical search indexes. Various tweaks were made to the JVM and memory settings, but these brought no permanent improvement. An extensive investigation of the search platform discovered that the schema then in production was indexing fields, such as track price, which were never actually searched upon. A prototype was created with a much smaller search index of around 10 GB; benchmarked against the original 660 GB, this was a clear improvement.

Technological Advancements

This new, smaller search index enabled a number of technological advances:

relying on the ~/track/details endpoint meant we always returned current results and were 100% consistent with the rest of the API, which eliminated the catalogue inconsistency problem (see the sketch after this list);

we could create a brand new index within an hour, meaning up-to-date data;

much faster average response times for ~/track/search: around 350 ms, down from around 2,600 ms, a reduction of roughly 87%;

no more deleted documents bloating the index, thus reducing the search space;

longer document cache and filter cache durations, which would only be purged at the next full index every 12 hours; this helped performance;

quicker to reflect catalogue updates, as the track details were served from the SQL database, which could be updated very rapidly.
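To make the first advance above concrete: the slim index answers only "which track IDs match this query", and everything else is hydrated from the SQL-backed details data at request time. A toy, in-memory sketch of that flow, with all names and data invented for illustration:

# Toy in-memory stand-ins for the slim search index and the SQL details store.
SEARCH_INDEX = [
    {"track_id": 1, "searchable_text": "thriller michael jackson"},
    {"track_id": 2, "searchable_text": "umbrella rihanna"},
]
TRACK_DETAILS = {  # the SQL store holds the full, always-current record
    1: {"title": "Thriller", "artist": "Michael Jackson", "price": "0.99"},
    2: {"title": "Umbrella", "artist": "Rihanna", "price": "0.99"},
}

def search_track_ids(query):
    # The slim index stores only searchable text plus the track ID.
    terms = query.lower().split()
    return [doc["track_id"] for doc in SEARCH_INDEX
            if all(term in doc["searchable_text"] for term in terms)]

def search(query):
    # Hydrate IDs from the details store, so prices and titles are never stale.
    return [TRACK_DETAILS[track_id] for track_id in search_track_ids(query)]

print(search("michael jackson"))  # [{'title': 'Thriller', ...}]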

Technological Uncertainties

Would the existing hardware cope with current levels of traffic under the new architecture?
Would the new architecture provide consistent responses across different endpoints, i.e. search results, catalogue results and chart endpoints?

Innovations

We created a Git repository on GitHub with public access; anyone can submit a pull request to add new synonyms for our search platform. We can choose to accept the synonym, and the change will take effect on our search API within 12 hours of our accepting it. The repository is here: https://github.com/7digital/synonym-list

Some labels deliberately publish tribute, sound-alike and karaoke tracks with very similar names to popular tracks, in the hope that some clients mistakenly purchase them. These tracks are then ingested into our platform, and 7digital’s contract with those labels means we are obliged to make them available. At the same time, consumers of our search services complain that the karaoke and sound-alike artists are returning in the search results above the genuine artists, mostly because of the repeated keywords in their track and release titles.

In order to satisfy both parties, we decided to override the default Lucene search implementation and exclude tracks, releases and artists that contained certain words in their titles, unless the user specifically entered those words as a search term. For example, searching for “We are the champions” now returns the tracks by the band Queen, which is what customers expect. To achieve this we tweaked the search algorithm so that, by default, all searches purposefully exclude tracks with the text “tribute to” anywhere in their textual description, be it the track title, track version name, release title, release version name or artist name.
The results look like this: https://www.7digital.com/search/track?q=we%20are%20the%20champions%20queen

Prior to the change, all tribute acts would appear in the search results above tracks by the band Queen. To allow tribute acts to still be found, the exclusion rule will not apply if you include the term “tribute to” in your search terms, as evidenced by the results here: https://www.7digital.com/search/track?q=we%20are%20the%20champions%20tribute%20to%20queen
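The rule amounts to a query rewrite applied before the search runs. A toy illustration of the logic, using Lucene-style phrase negation for the exclusion (not 7digital's actual code):

def rewrite_query(user_query):
    # By default, exclude sound-alike noise; lift the exclusion when the
    # user explicitly asks for tribute acts.
    if "tribute to" in user_query.lower():
        return user_query  # the user wants tribute acts: no exclusion
    return f'{user_query} -"tribute to"'  # Lucene-style phrase negation

print(rewrite_query("we are the champions queen"))
# we are the champions queen -"tribute to"
print(rewrite_query("we are the champions tribute to queen"))
# we are the champions tribute to queen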

Other music labels send 7digital a sound-alike recording of a popular track and name it so that its release title and track title duplicate the title of a well-known track. This would mean that searching for “Umbrella” by Rihanna could return the sound-alike recording alongside, or even above, the genuine track.

Search: “dumb similarity” modification of Lucene. Lucene is a capable search engine which specialises in fast full-text searches; however, the documents it is designed to search across work best when they are paragraph-length natural prose, such as newspaper articles. The documents that 7digital adds to Lucene are models of the metadata of a music track in our catalogue: short, repetitive fields such as track title, track version name, release title, release version name and artist name.

The standard implementation of Lucene gives documents containing the same repeated term a higher-scoring match than those that contain a single occurrence. This means that with the search term “Michael”, results such as “The Best of Michael Jackson” by “Michael Jackson” will score higher than “Thriller” by “Michael Jackson”, because the term “Michael Jackson” is repeated in the first document but not the second.

In terms of matching text values this makes sense, but for a music search we want to factor in the popularity of our releases, based on sales and streams of their tracks.

Ignoring popularity leads to a poor user experience: the “Best of Michael Jackson” release is ranked as the first result, despite being much less popular than “Thriller”, which is ranked lower in the search results.

This was achieved by modifying Lucene's term frequency weighting in the similarity algorithm, so that repeated terms no longer inflate a document's score.
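A toy model of the idea, not Lucene's actual scoring code: flattening the term frequency factor to a constant makes the two documents above tie on textual relevance, leaving popularity free to break the tie. (Lucene's classic similarity uses the square root of the term frequency.)

import math

def score(query_terms, doc_terms, flatten_tf=True):
    # Toy scorer: one tf factor per matching query term; idf is omitted
    # because it is the same for every document in this example.
    total = 0.0
    for term in query_terms:
        frequency = doc_terms.count(term)
        if frequency == 0:
            continue
        # Lucene's classic tf factor is sqrt(frequency); flattening it to 1
        # stops repeated terms from dominating the ranking.
        total += 1.0 if flatten_tf else math.sqrt(frequency)
    return total

best_of = "the best of michael jackson michael jackson".split()
thriller = "thriller michael jackson".split()

print(score(["michael"], best_of, flatten_tf=False))   # ~1.41: repetition wins
print(score(["michael"], thriller, flatten_tf=False))  # 1.0
print(score(["michael"], best_of, flatten_tf=True))    # 1.0: a tie, so popularity can decide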


Solving problems of scale using Kafka

At 7digital I was in a team tasked with solving a problem created by taking on a large client capable of pushing the 7digital API to its limits. The client had many users and expected their numbers to increase exponentially. Whilst 7digital's streaming infrastructure could scale very well, the client also wanted to send the logs of the streams back to 7digital via the API, and this log data would be proportional to the number of users. 7digital had no facility for logging such data sent from a client, so this was a new problem to solve.

We needed to build a Web API which exposed an endpoint for a 7digital client to send large amounts of JSON-formatted data, and to generate periodic reports based on that data. The expected volume of data was thought to be much higher than the infrastructure in the London data centre could support, and scaling the data centre up to meet the traffic requirements would have been slow, costly and difficult. Building the API in AWS and transporting the data back to the data centre asynchronously was deemed the best approach.

Kafka was used to decouple the AWS-hosted web service accepting incoming data from the London database storing it, acting as a message bus to transfer the data from an AWS region back to the London data centre. It was already operational within the 7digital platform at this time, for non-real-time reporting purposes.

Since there was no need to use the London data centre, there was no advantage in writing another C# application hosted on the existing Windows web servers running IIS. Given the much faster boot times of Linux EC2 instances and the greater ease of using Docker on Linux, we elected to write the web application in Python and use Docker to speed up development.

The application used the Flask web framework. It was deployed in an AWS ECS cluster, along with an nginx container to proxy requests to the Python API container and a Datadog container used for monitoring the application.

The API was very simple: once an inbound POST request was validated, the application would write the JSON to a topic on a Kafka cluster. The topic was later consumed and its contents written into a relational database, from which reports could be generated. Decoupling the POST requests from the process that writes to the database meant we could avoid locking the database, by consuming the data from the topic at a rate the database could sustain.
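A minimal sketch of such an endpoint using Flask and the kafka-python client; the route, topic name, broker address and validation rule are all hypothetical:

import json

from flask import Flask, jsonify, request
from kafka import KafkaProducer  # kafka-python client

app = Flask(__name__)
producer = KafkaProducer(
    bootstrap_servers=["kafka-london.example.com:9092"],   # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

@app.route("/stream-logs", methods=["POST"])  # hypothetical route
def accept_stream_logs():
    payload = request.get_json(silent=True)
    if payload is None or "events" not in payload:  # minimal validation
        return jsonify({"error": "invalid payload"}), 400
    producer.send("stream-logs", value=payload)     # hypothetical topic
    return jsonify({"status": "accepted"}), 202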

Since reports were only generated once a day, covering the data received during the previous day, a backlog of data on the Kafka topic was acceptable; there was no requirement for near-real-time data.

In line with the usual techniques of software development at 7digital, we used tests to drive the design of this feature and were able to achieve continuous delivery. By creating the build pipeline early on, we could build the product in small increments and deploy them frequently.

We could run a makefile on a developer machine which built a Docker container running the Python web app, then use the Python test framework unittest to run unit, integration and smoke tests for the application. The integration tests checked that the app could write to Kafka topics, and the smoke tests were end-to-end tests which ran after a deploy to UAT or Production to verify a successful deployment.
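As an illustration, unittest tests for the endpoint sketched above might look like this, assuming the Flask app lives in a hypothetical module called stream_api:

import unittest
from unittest import mock

import stream_api  # hypothetical module containing the Flask app above

class AcceptStreamLogsTests(unittest.TestCase):
    def setUp(self):
        self.client = stream_api.app.test_client()

    def test_rejects_invalid_payload(self):
        response = self.client.post("/stream-logs", json={})
        self.assertEqual(response.status_code, 400)

    @mock.patch.object(stream_api, "producer")
    def test_valid_payload_is_written_to_kafka(self, producer):
        response = self.client.post("/stream-logs", json={"events": []})
        self.assertEqual(response.status_code, 202)
        producer.send.assert_called_once()

if __name__ == "__main__":
    unittest.main()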

We successfully completed the project, and the web application coped very well with the inbound traffic from the client. Since it was hosted on a cluster of EC2 instances, we could scale both their number and their resources. The database was able to cope with the import of the voluminous user data too. The project served as a good example of how to develop a scalable web API which communicates with a database located in a data centre; it was 7digital's first application capable of doing so and remains in use today.

 
