Searching through billions of documents with a response time of under 1 second.
Clipit accumulated more than 2.3 billion news articles and posts from various sources over more than 15 years. They wanted to expose that rich dataset to their clients for analysis through an easy-to-use, familiar web solution, but could not achieve satisfactory performance with their existing technology. They were looking for a big data solution that would provide the needed storage capacity and performance, along with a custom search language offering all the options their end users need.
The goal was to build a stable, reliable search platform that provides real-time search and analysis on a large dataset, serves a large number of clients simultaneously, provides statistics, reporting, and charts, and can scale for the next several years.
Search with various filters across all messages in under 1 second
Provide a scalable solution with uptime of more than 99%
Provide analytics together with the search results
Large volume of data
High number of requests
The initial phase of the project was to select the right technology to meet the requirements. Our research and demo phase showed that Elasticsearch was best suited for the job, so we proceeded to the design and sizing of the cluster, which was not an easy task given the huge amount of data we had to organize and store.
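As a rough illustration of the sizing arithmetic involved, the sketch below estimates a primary shard count from a document count and a target shard size. Only the 2.3 billion document figure comes from this case study; the average document size and the target shard size are hypothetical values chosen for the example, and the function name is an assumption. It is a back-of-the-envelope calculation, not the method used in the project.

```python
def estimate_primary_shards(doc_count: int, avg_doc_bytes: int, target_shard_bytes: int) -> int:
    """Return the number of primary shards needed to keep each shard near the target size."""
    total_bytes = doc_count * avg_doc_bytes
    # Integer ceiling division, so a partly filled last shard still counts as a shard.
    shards = -(-total_bytes // target_shard_bytes)
    return max(1, shards)


# 2.3 billion documents is taken from the case study.
DOC_COUNT = 2_300_000_000
# Hypothetical: the average indexed size per document is not stated in the source.
AVG_DOC_BYTES = 2_000
# Hypothetical target; a commonly cited rule of thumb is tens of gigabytes per shard.
TARGET_SHARD_BYTES = 40 * 10**9

shard_count = estimate_primary_shards(DOC_COUNT, AVG_DOC_BYTES, TARGET_SHARD_BYTES)
```

A real sizing exercise would also account for replicas, index overhead, and growth over the following years, which is why measuring with a representative sample of the data usually gives better numbers than an estimate like this.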
Processing news articles and messages was also a big part of the project: they were stored in XML archives and in the database, while new ones kept arriving every second. That's why we created a custom solution for parsing the data, enriching it with sentiment, language, and gender analysis, and indexing both the existing archive and the live data. Automated tests guarantee the quality of the solution, and its high performance lets us keep up with the speed of the producers.
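A minimal sketch of such a parse, enrich, and index flow is shown below. The XML layout, the field names, the index name, and the enrichment functions are all assumptions made for illustration; the real enrichment used dedicated sentiment, language, and gender analysis, which the placeholders here only stand in for. The output follows the newline-delimited format that the Elasticsearch bulk API accepts (an action line followed by a source line per document).

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical word lists; a stand-in for a real sentiment model.
POSITIVE_WORDS = {"good", "great"}
NEGATIVE_WORDS = {"bad", "poor"}


def parse_article(xml_text: str) -> dict:
    """Parse one article from an assumed <article><id/><title/><body/></article> layout."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.findtext("id"),
        "title": root.findtext("title", default=""),
        "body": root.findtext("body", default=""),
    }


def detect_language(text: str) -> str:
    """Placeholder: a real pipeline would call a language detection component here."""
    return "unknown"


def score_sentiment(text: str) -> str:
    """Placeholder sentiment: compare counts of distinct positive and negative words."""
    words = set(text.lower().split())
    positive = len(words & POSITIVE_WORDS)
    negative = len(words & NEGATIVE_WORDS)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"


def enrich(doc: dict) -> dict:
    """Return a copy of the document with the enrichment fields added."""
    enriched = dict(doc)
    enriched["language"] = detect_language(doc["body"])
    enriched["sentiment"] = score_sentiment(doc["body"])
    return enriched


def to_bulk_lines(docs: list, index_name: str) -> str:
    """Serialize documents as bulk API input: one action line, then one source line."""
    lines = []
    for doc in docs:
        source = {key: value for key, value in doc.items() if key != "id"}
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["id"]}}))
        lines.append(json.dumps(source))
    # The bulk API expects the payload to end with a newline.
    return "\n".join(lines) + "\n"
```

The same three functions can serve both the archive and the live feed: the archive is read in batches and passed through `to_bulk_lines`, while live messages go through the same `parse_article` and `enrich` steps as they arrive. Gender analysis would slot into `enrich` in the same way as the two placeholder fields.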
To provide access to the search, we built a .NET API on top of the Elasticsearch cluster, with specific authorization and account settings, validation rules, and the various options needed for every search request. The deployment of this API and of the services running behind it was automated using TeamCity. The whole module has its own monitoring metrics, which we consult when something is off or when we want to improve it. With the API up and running, we continued developing the frontend part of the solution, which had a lifecycle of its own, with several iterations and improvements until every customer got what they needed.
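To make the role of the validation rules more concrete, here is a small sketch of how an API layer might validate a search request and translate it into an Elasticsearch query body that returns both hits and analytics in one round trip. The actual API was written in .NET; Python is used here purely for illustration. The function name, the field names (`body`, `language`, `sentiment`, `published_at`), and the page size limit are all hypothetical.

```python
from datetime import date
from typing import Optional

# Hypothetical limit; the real validation rules are not described in the source.
MAX_PAGE_SIZE = 100


def build_search_body(
    text: str,
    language: Optional[str] = None,
    date_from: Optional[date] = None,
    date_to: Optional[date] = None,
    size: int = 20,
) -> dict:
    """Validate the request and return an Elasticsearch query body as a dict."""
    if not text or not text.strip():
        raise ValueError("text must not be empty")
    if not 1 <= size <= MAX_PAGE_SIZE:
        raise ValueError(f"size must be between 1 and {MAX_PAGE_SIZE}")
    if date_from and date_to and date_from > date_to:
        raise ValueError("date_from must not be after date_to")

    # Filters narrow the result set without affecting relevance scoring.
    filters = []
    if language:
        filters.append({"term": {"language": language}})
    if date_from or date_to:
        bounds = {}
        if date_from:
            bounds["gte"] = date_from.isoformat()
        if date_to:
            bounds["lte"] = date_to.isoformat()
        filters.append({"range": {"published_at": bounds}})

    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match": {"body": text}}],
                "filter": filters,
            }
        },
        # Aggregations are computed over the same matching documents as the hits.
        "aggs": {
            "by_sentiment": {"terms": {"field": "sentiment"}},
            "per_day": {
                "date_histogram": {"field": "published_at", "calendar_interval": "day"}
            },
        },
    }
```

Rejecting invalid requests before they reach the cluster keeps malformed or overly expensive queries from competing with legitimate traffic, which matters when many clients search at the same time.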