So, a couple of years back, my mates at work and I were tasked with migrating an existing Solr-based search architecture to ElasticSearch, a different storage system, with one little condition: ZERO DOWNTIME! Looking back at the challenges we faced, I thought this was a story that deserved to be sketched out.
Before we jump into the fun part, let's do a brief introduction to Solr and ElasticSearch (ES). Please feel free to skip this section if you are already familiar with them. Solr and ES are both data storage engines optimized for fast searching. Basically, when you need to find a needle in a haystack in milliseconds, you keep one of these in your right pocket (assuming you are right-handed). Both of them are open-source search engines built on top of Apache Lucene. Both engines offer distributed full-text search, faceting, near real-time indexing, and NoSQL features. On top of that, ES is known for its geo-search, multi-tenancy, powerful query DSL, and built-in sharding and replication, which has made it relatively more popular.
We had hundreds of gigabytes of textual data in a MySQL database, which was synced with the Solr storage. All the writes went to the database, and then the database synced with Solr. All the searches were served from Solr. Our objective was to replace Solr with ES. From a high-level perspective, this process can be broken down into three steps:
- Provision ES instances.
- Create a syncing pipeline for MySQL-ES.
- Create a query-processor for ES.
ES instances are provisioned with safety precautions, backup schedules, and workloads in mind. That means you maintain fallback measures in case an instance becomes unhealthy, replicate and back up data to guard against accidental data loss, deploy multiple instances for load balancing, and so on. The size (in terms of CPU and memory) of each instance is decided based on the expected number of concurrent requests.
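To make the replication part a bit more concrete, here is a minimal sketch of how an index-creation body could be put together. `number_of_shards` and `number_of_replicas` are real ES index settings, but the helper names and the "one primary shard per data node" rule are just assumptions for illustration, not what we actually ran in production.

```python
def allocatable_replicas(data_nodes: int, replicas: int) -> int:
    """ES never places a replica on the same node as its primary,
    so at most (data_nodes - 1) replicas per shard can be assigned."""
    return max(0, min(replicas, data_nodes - 1))


def build_index_settings(data_nodes: int, replicas: int = 1) -> dict:
    """Build a body for index creation (hypothetical sizing rule).

    Assumption: one primary shard per data node. Real sizing depends
    on data volume and query load.
    """
    if data_nodes < 1:
        raise ValueError("need at least one data node")
    if replicas < 0:
        raise ValueError("replicas cannot be negative")
    return {
        "settings": {
            "index": {
                "number_of_shards": data_nodes,
                "number_of_replicas": allocatable_replicas(data_nodes, replicas),
            }
        }
    }


body = build_index_settings(data_nodes=3, replicas=1)
print(body)
```

With a single node, the sketch falls back to zero replicas, which is a reminder that replication only protects you once there is a second instance to hold the copy.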
We had an existing syncing pipeline for MySQL-Solr. But because of the differences in indexing structure, reusing it wasn't an option. We needed to write a completely new pipeline for ES. We also had to write a new query-processor, keeping in mind that its query response had to be exactly the same as Solr's. Otherwise, the services that depended on the search APIs wouldn't work. Because of this condition, these two steps required a lot of effort, study, and trial-and-error iterations.
Finally, we had the infrastructure ready for indexing and searching. Two things remained to wrap up the project: back-populate the existing data into ES and hook ES up to the overall system.
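To give a feel for the "same response shape" requirement, here is a rough sketch of an adapter that reshapes an ES search response into the standard Solr JSON layout (`responseHeader` plus `response.numFound/start/docs`). The function name is made up, and our real query-processor had to handle far more than this (facets, highlighting, field naming), so treat it as an illustration only.

```python
from typing import Any, Dict, List


def es_to_solr_response(es_response: Dict[str, Any], start: int = 0) -> Dict[str, Any]:
    """Reshape an ES search response into a Solr-style response (sketch)."""
    hits = es_response.get("hits", {})
    total = hits.get("total", 0)
    # ES 7+ reports the total as {"value": n, "relation": ...};
    # older versions use a plain integer.
    num_found = total["value"] if isinstance(total, dict) else total

    docs: List[Dict[str, Any]] = []
    for hit in hits.get("hits", []):
        doc = dict(hit.get("_source", {}))
        # Assumption: Solr clients expect the document key under "id".
        doc.setdefault("id", hit.get("_id"))
        docs.append(doc)

    return {
        "responseHeader": {"status": 0, "QTime": es_response.get("took", 0)},
        "response": {"numFound": num_found, "start": start, "docs": docs},
    }


sample_es_response = {
    "took": 5,
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "hits": [{"_id": "42", "_source": {"title": "needle"}}],
    },
}
print(es_to_solr_response(sample_es_response))
```

Keeping the translation in one small function like this means the downstream services never need to know which engine answered the query.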
At this stage, if you can afford downtime, you can simply take Solr down, put ES in its place, and you are good to go. But in our case, we had to implement a double-write pipeline, so that both Solr and ES would sync with MySQL at the same time. All the queries were still being served from Solr. Before we decoupled Solr, we back-populated the data. Back-populating data is easy: you just run indexing for the whole database. Depending on the amount of data, it can take a long time. So be sure to keep some Tarantino movies to put yourself to sleep 😴.
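Here is a small sketch of the double-write and back-population idea, using in-memory dictionaries as stand-ins for Solr and ES. Every name in it (`double_write`, `back_populate`, `index_into`, the sample rows) is hypothetical, and the real pipeline talked to MySQL and the actual search clusters rather than Python dicts.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Doc = Dict[str, object]
Indexer = Callable[[List[Doc]], None]


def double_write(docs: List[Doc], indexers: Dict[str, Indexer]) -> Dict[str, bool]:
    """Send the same batch to every search backend and report per-backend success."""
    results: Dict[str, bool] = {}
    for name, index in indexers.items():
        try:
            index(docs)
            results[name] = True
        except Exception:
            results[name] = False  # the caller decides whether to retry or alert
    return results


def batched(rows: Iterable[Doc], size: int) -> Iterator[List[Doc]]:
    """Yield rows in lists of at most `size` items."""
    batch: List[Doc] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def back_populate(rows: Iterable[Doc], index: Indexer, batch_size: int = 500) -> int:
    """Index every existing row into the new backend; return how many were sent."""
    total = 0
    for batch in batched(rows, batch_size):
        index(batch)
        total += len(batch)
    return total


def index_into(store: Dict[object, Doc]) -> Indexer:
    """In-memory stand-in for a search backend, keyed by document id."""
    def _index(docs: List[Doc]) -> None:
        for doc in docs:
            store[doc["id"]] = doc
    return _index


solr_store: Dict[object, Doc] = {}
es_store: Dict[object, Doc] = {}
write_solr, write_es = index_into(solr_store), index_into(es_store)

# A live write goes to both backends while Solr still serves the queries.
live_result = double_write([{"id": 100, "title": "fresh doc"}],
                           {"solr": write_solr, "es": write_es})

# Existing rows (a stand-in for the MySQL table) are back-populated into ES only.
existing_rows = [{"id": i, "title": f"doc {i}"} for i in range(1, 8)]
backfilled = back_populate(existing_rows, write_es, batch_size=3)
print(live_result, backfilled, len(es_store))
```

Because documents are keyed by id, indexing the same row twice overwrites it instead of duplicating it, which is what lets the backfill run alongside live double-writes.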
The migration was completed. But there was one last thing left to do: find the right server configuration for ES clusters. First, we came up with a set of potential configurations. Then we applied each of them for a span of time and gathered data on runtime and memory consumption. From a comparative study of the data, we picked the one that seemed statistically optimal.
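As a rough illustration of that comparison step, the sketch below picks the configuration with the lowest mean latency among those that stay under a memory budget. The selection rule, the configuration names, and all of the numbers are made up for the example; they are not the measurements from our migration.

```python
from statistics import mean
from typing import Dict, List

Sample = Dict[str, float]


def pick_config(measurements: Dict[str, List[Sample]], memory_budget_mb: float) -> str:
    """Return the config with the lowest mean latency whose peak memory
    stays within the budget (an assumed selection rule)."""
    candidates: Dict[str, float] = {}
    for name, samples in measurements.items():
        peak_memory = max(s["memory_mb"] for s in samples)
        if peak_memory <= memory_budget_mb:
            candidates[name] = mean(s["latency_ms"] for s in samples)
    if not candidates:
        raise ValueError("no configuration fits the memory budget")
    return min(candidates, key=lambda name: candidates[name])


# Purely illustrative numbers, not real measurements.
measurements = {
    "small-heap": [{"latency_ms": 42.0, "memory_mb": 900.0},
                   {"latency_ms": 48.0, "memory_mb": 950.0}],
    "medium-heap": [{"latency_ms": 35.0, "memory_mb": 1500.0},
                    {"latency_ms": 37.0, "memory_mb": 1600.0}],
    "large-heap": [{"latency_ms": 30.0, "memory_mb": 2100.0},
                   {"latency_ms": 34.0, "memory_mb": 2300.0}],
}
best = pick_config(measurements, memory_budget_mb=2000.0)
print(best)
```

In practice you would probably look at tail latency and variance as well as the mean, but the shape of the decision stays the same: measure each candidate under real traffic, then compare.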
The whole plan, along with task breakdowns and timelines, was laid out before we jumped into the work. But the most challenging aspect of this project was keeping the search API fully functional throughout the migration.