Elasticsearch implementation: Brute force ‘stemming’

During my last project I was responsible, as a project manager, for implementing the open source search engine Elasticsearch and the crawler Nutch. It proved to be everything they promised and then some. To get the stemming of the Dutch content right we used a brute force approach: a synonym file containing all the conjugations in the Dutch language (for details see the end of this post). The result can be viewed on op.nl.

Business case

The job started with the client asking for the replacement of (part of) their web technology stack with open source solutions. They asked me to deliver a solid business case and a POC, so they could evaluate both and decide whether to proceed with the implementation.

During the evaluation we took a good look at all the existing solutions in place and found that the search solution was a good candidate for replacement. The existing license structure and associated costs made using the existing search solution undesirable for some functionality. As a result, the project was paying the license costs for the search solution and, on top of that, building custom software to provide functionality the search solution was supposed to deliver.

It proved possible to create a profitable business case around the implementation of a search engine and a web crawler. The web crawler is of course an undesirable technical workaround for the fact that not all content was available in a structured format, or could be made available within a reasonable amount of time and budget. In addition, the goal was to create a system that could easily assimilate more data from unstructured sources.

Before we could start the POC we had to choose between the available open source search engines. For this purpose we applied the Open Source Maturity Model (OSMM) to the most prominent open source search engines: Elasticsearch and Solr, both based on the search engine library Lucene. From the OSMM evaluation we learned that both solutions were deemed ‘enterprise fit’, with a clear lead in maturity for Solr. However, from our research into both systems we took the popular view that Elasticsearch was easier to use and built for the sort of scalability we were looking for.

Proof of concept (POC)

During the POC we established that the advertised ease of use in installing, feeding and querying Elasticsearch held true. In addition we were able to ‘scale’ the system by simply starting another instance of Elasticsearch: both instances automatically started sharing their data and dividing the work. During the POC we also set up the open source variant of Puppet to be able to automatically provision new Elasticsearch nodes to increase performance or replace defective nodes.

During the POC we also selected a web crawler for the search solution: Apache Nutch. OpenIndex was selected for implementing this part of the solution and did a brilliant job of configuring the crawler and implementing the interface between Elasticsearch and Nutch 1.x.

Brute force ‘stemming’

The only hiccup worth mentioning was when we started to evaluate the quality of the search results. We found that none of the traditional stemming algorithms for the Dutch language (which is somewhat irregular compared to English) could meet our quality goals. Fortunately I thought of a better way to approach the problem: brute force. I created a file which contained a line for each word, and all its conjugations, in the Dutch language. We added this file (which contained ~110K lines) as a list of synonyms in Elasticsearch, to be used at index time. In spite of the reservations of some of the experts I consulted, this approach works superbly. The quality goal we set was easily reached. The only significant drawback was the increase in the size of the index (about 50%). As we did not hit the RAM limit, the performance of our Elasticsearch cluster was not negatively impacted.
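
To make the idea concrete, here is a minimal sketch in Python of how such a synonym file and the matching index settings could be put together. It is not the code we used on the project. The sample words, the file path and the names dutch_conjugations and dutch_brute_force are made up for illustration, and I am assuming the comma-separated (Solr style) synonym format, in which all words on a line are treated as equivalent.

```python
import json

# Hypothetical sample data: a word mapped to its conjugations.
# The real file held roughly 110K lines; these two entries are illustrative.
CONJUGATIONS = {
    "lopen": ["loop", "loopt", "liep", "liepen", "gelopen"],
    "fiets": ["fietsen", "fietsje", "fietsjes"],
}


def synonym_lines(conjugations):
    """Return one comma-separated line per word (Solr synonym format)."""
    lines = []
    for word, forms in sorted(conjugations.items()):
        unique = []
        for form in [word, *forms]:
            form = form.strip().lower()
            # Skip blanks and duplicates, keep the base word first.
            if form and form not in unique:
                unique.append(form)
        lines.append(", ".join(unique))
    return lines


def index_settings(synonyms_path):
    """Build index settings with a synonym token filter reading the file.

    The filter and analyzer names are assumptions made for this sketch.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "dutch_conjugations": {
                        "type": "synonym",
                        "synonyms_path": synonyms_path,
                    }
                },
                "analyzer": {
                    "dutch_brute_force": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "dutch_conjugations"],
                    }
                },
            }
        }
    }


LINES = synonym_lines(CONJUGATIONS)
SETTINGS = index_settings("analysis/dutch_conjugations.txt")

if __name__ == "__main__":
    print("\n".join(LINES))
    print(json.dumps(SETTINGS, indent=2))
```

In this sketch the analyzer would be set as the index-time analyzer on the text fields, with a plain analyzer (without the synonym filter) at search time. Every conjugation of a word is then written to the index next to the form that appears in the text, which matches the index growth mentioned above.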
