Elasticsearch implementation: Brute force ‘stemming’

During my last project I was responsible, as project manager, for implementing the open source search engine Elasticsearch and the crawler Nutch. It proved to be everything they promised and then some. To get the stemming of the Dutch content right we took a brute-force approach: a synonym file covering all the conjugations in the Dutch language (for details see the end of this post). The result can be viewed on op.nl.

Business case

The job started with the client asking for the replacement of (part of) their web technology stack with open source solutions. They asked me to deliver a solid business case and a POC so they could evaluate both and decide whether to proceed with the implementation.

During the evaluation we took a good look at all the existing solutions in place and found that the search solution was a good candidate for replacement. The existing license structure and the associated cost made using the existing search solution undesirable for some functionality. This meant that the project, in addition to paying the license cost for the search solution, was building custom software to provide functionality the search solution was supposed to deliver.

It proved to be possible to create a profitable business case around the implementation of a search engine and a web crawler. The web crawler is of course an undesirable technical workaround for the fact that not all content was available in a structured format or could be made available within a reasonable amount of time and budget. In addition the goal was to create a system that could easily assimilate more data from unstructured sources.

Before we could start the POC we had to choose between the available open source search engines. For this purpose we applied the Open Source Maturity Model (OSMM) to the most prominent open source search engines, Elasticsearch and Solr, both based on the search engine library Lucene. From the OSMM evaluation we learned that both solutions were deemed ‘enterprise fit’, with a clear lead in maturity for Solr. However, from our research into both systems we took the popular view that Elasticsearch was easier to use and built for the sort of scalability we were looking for.

Proof of concept (POC)

During the POC we established that the advertised ease of installing, feeding and querying Elasticsearch was real. In addition we were able to ‘scale’ the system by simply starting another instance of Elasticsearch: both instances automatically started sharing their data and dividing the work. During the POC we also set up the open source variant of Puppet to be able to automatically provision new Elasticsearch nodes to increase performance or replace defective nodes.

During the POC we also selected a web crawler for the search solution: Apache Nutch. OpenIndex was selected for implementing this part of the solution and did a brilliant job of configuring the crawler and implementing the interface between Elasticsearch and Nutch 1.x.

Brute force ‘stemming’

The only hiccup worth mentioning came when we started to evaluate the quality of the search results. We found that none of the traditional stemming algorithms for the Dutch language (which, compared to English, is a bit irregular) could meet our quality goals. Fortunately I thought of a better way to approach the problem: brute force. I created a file which contained a line for each word in the Dutch language together with all its conjugations. We added this file (which contained ~110K lines) as a list of synonyms in Elasticsearch, to be applied at index time. In spite of the reservations of some of the experts I consulted, this approach works superbly. The quality goal we set was easily reached. The only significant drawback was the increase in the size of the index (about 50%). As we did not hit the RAM limit, the performance of our Elasticsearch cluster was not negatively impacted.
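
For readers who want to try something similar, here is a minimal sketch in Python of how such a file could be generated and wired into an index. It is an illustration under assumptions, not the code we used on the project: the example words, the function names, the filter name dutch_conjugations, the analyzer name dutch_brute_force and the file path are all made up for this post. The only things it relies on from Elasticsearch are the comma-separated (Solr-style) synonym line format and the synonym token filter with its synonyms_path setting.

```python
def build_synonym_lines(word_forms):
    """Turn {base_word: [conjugations, ...]} into Solr-style synonym lines.

    Each line lists equivalent terms separated by commas, which is the
    format the Elasticsearch synonym token filter reads from a file.
    """
    lines = []
    for base, forms in word_forms.items():
        terms = []
        for term in [base, *forms]:
            term = term.strip().lower()
            # Skip empty entries and duplicates, keep the base word first.
            if term and term not in terms:
                terms.append(term)
        # A word without any other forms needs no synonym line.
        if len(terms) > 1:
            lines.append(", ".join(terms))
    return lines


def index_settings(synonyms_path):
    """Index settings with a custom analyzer that applies the synonym file.

    The filter and analyzer names are hypothetical; synonyms_path is
    resolved by Elasticsearch relative to its config directory.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "dutch_conjugations": {
                        "type": "synonym",
                        "synonyms_path": synonyms_path,
                    }
                },
                "analyzer": {
                    "dutch_brute_force": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "dutch_conjugations"],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # Two illustrative Dutch verbs; the real file had roughly 110K lines.
    example = {
        "lopen": ["loop", "loopt", "liep", "liepen", "gelopen"],
        "fietsen": ["fiets", "fietst", "fietste", "fietsten", "gefietst"],
    }
    for line in build_synonym_lines(example):
        print(line)
    # lopen, loop, loopt, liep, liepen, gelopen
    # fietsen, fiets, fietst, fietste, fietsten, gefietst
```

The generated lines would be written to a file in the Elasticsearch config directory, and the analyzer would be assigned to the text fields in the mapping so that it runs at index time. Because every form on a line is treated as equivalent to all the others, each indexed word is expanded to its whole family, which fits with the index growth described above.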

My future career development

How do I see my career developing and how will an MBA @RSM help me achieve my goals?

I want to become one of the best conductors of customer journeys: removing the hurdles customers experience when interacting with the same company through different channels and touch-points, and optimizing the outcome of the interaction between organizations and customers from both the company’s and the customer’s perspective.

At this point in my career my focus is shifting from solving technical problems to spotting business opportunities and making the most of them. One of the hottest topics in my field is CXM (customer experience management); most companies are planning to implement parts of this concept. The holy grail of CXM is creating compelling and fully integrated customer journeys across touch-points. Only a few organizations seem to be successful in consistently creating customer journeys, and the million dollar question is: How do you create an organization that can consistently generate new profitable customer journeys while executing and optimizing the existing ones?

In my opinion creating, executing and optimizing cross-channel customer journeys will demand a huge increase in cross-functional cooperation. The current operating model, where organizations are made up of a number of loosely coupled functional or product-oriented silos, each with their own unique KPIs and targets, is not designed to facilitate the necessary level of cooperation. To effectively facilitate the necessary cross-functional cooperation, organizations will need to change drastically and become more customer centric.

To change organizations from their product or functional grouping into a customer centric organization, while at the same time losing as little as possible of the advantages of their previous organizational form, is a job I am pretty excited about. It will take a lot of pioneering, determination, creativity and leadership to drive such change.

In order to be successful in such a role I will need to develop a broad skill set in business administration. It will require learning about business economics, strategy, marketing, human resources, organizational behavior and operations. I could acquire these skills by following separate courses on each subject, by going to a university and getting an academic degree, by learning on the job or by getting an MBA.

Acquiring the skills through separate courses will not provide me with enough insight into the intricate couplings between the different aspects of managing a business. The university option would provide me with a lot of know-how, but would fall short in giving me a good sense of the context in which to apply it. On-the-job training would provide the necessary context, but might prove to be a long journey.

An executive MBA offers both the content and context I am looking for. The context might be less than I would get with learning on the job; however, this is easily compensated by the much shorter period in which the skills are acquired. In addition it will provide me with a network of successful and motivated individuals across industries and functions, which is a nice bonus.

My reason for choosing an MBA at Rotterdam School of Management is partly based on practical grounds. It will allow me to keep my job at Deloitte Consulting and give me the opportunity to directly apply my newly acquired knowledge and insights in a professional environment. The other deciding argument is the focus on leadership development. Leadership is, in my humble opinion, the most important skill for a business leader to be successful in any setting.

Note to readers: This is one of my admissions essays. Please provide me with feedback.

MBA sponsorship approved

My business case for the Executive MBA at the Rotterdam School of Management has been approved. Somehow together we (thanks to all who helped) succeeded in convincing:

  1. My wife
  2. My counselor
  3. 2x Service line leader
  4. Service area leader
  5. Consulting leader, talent partner of consulting and the learning manager.

Thank you for the faith you have placed in me by sponsoring my tuition fee and giving me time off to study. I have already started working on my application for a position in the class of 2015. My application essays will follow shortly; please come back later to provide me with feedback.

Seven Years in Deloitte Consulting

After seven years in Deloitte Consulting I finally made it and landed on the beach (at least partially), which of course gives me the time to do all sorts of activities:

 

By now I finished all my chores and thought it might be nice to make good on my promise of blogging something about my career and share some thoughts with you.

How I got here.

My career in IT started when I received a phone call from the Royal Dutch Air Force telling me I had failed the very last test: an allergy test. No fighter pilot training for me. Fortunately my mother had insisted I enroll in a bachelor program. I chose software engineering as a major, moved out to live on my own and had an excellent time with some high school chums.

Then reality broke in and I had to decide on an internship. This was in the period of the dot-com bust, and at that time I saw web “scripting” as something for people without real programming skills. So I decided to test my skills in computer vision and applied for an internship at Urenco. With the logical outcome that I ended up at Stentec building a tool for importing 3D objects into their DirectX simulation engine for Sail simulator 4.0.

After deciding that game development, although a worthy occupation, was too small a niche in The Netherlands to base a solid career on, I tried out technical software development. I moved to Eindhoven and worked on Motion Control software, which proved to be incredibly boring. At first sight it seemed to me like playing with Lego Technic; unfortunately that is only a very small part of the job.

Still unsure of what career path would be right for me, I applied and was accepted at TU/e (Eindhoven University of Technology), which was like a never-ending math camp. To top this experience off I decided to do my master’s thesis on a research topic inside the university walls. I think the professor who supervised my work summed it up well at graduation:

Fundamental technology research might not be the best career fit, but with the combination of your engineering and communication skills I foresee a bright future in consulting.

Luckily for me I had already come to that conclusion and secured a job as Business Analyst (= junior consultant) at Deloitte Consulting. Where I had a lightning start and was staffed on a web project within my first month, never to return to the beach. Until now.

What I did

I worked for a very diverse set of clients from consumer business to public sector and from telecom to education and the financial services industry. They all had one thing in common: they manage their online content in SDL Tridion. Projects ranged from e-commerce optimizations to content aggregation and included assessments, implementation advice, troubleshooting and a project salvage operation.

Next to my client work I started blogging about SDL Tridion somewhere around September 2008. Back in those days there were almost no public sources of information about Tridion. Luckily that slowly changed, and now the online SDL Tridion community seems to be thriving. The blog brought me two clients, who contacted me directly via the contact form, and a lot of exposure to the rest of the community and clients at large. My blogging frequency has dropped to an all-time low, which is something I regret. Perhaps I will pick it up again in the near future, as I am working on a very interesting project in a part of the CXM technology stack other than content management.

In my spare time I used to fly gliders competitively. Unfortunately I have chosen to stop competing, as the time needed for a decent ranking was more than I was willing to invest. Without the competition element gliding has gradually lost my interest, and by now both my instructor’s and pilot licences have expired.

Currently most of my spare time is spent with my family and on creating and maintaining my own empire of ‘small’ websites. This gives me a lot of satisfaction, although there seems to be too little time to try out every new idea (without neglecting my wife and daughter). Seems I turned my work into a hobby …

What is next?

My technical background (2x computer science bachelor & MSc in algorithms and data structures) has served me well. However, I am discovering that in my current role I am expected to shift my focus from solving technical problems to solving business problems. My formal education gave me the ability to excel in analyzing and solving complex problems. Unfortunately it did not give me much of a frame of reference or many tools for applying these skills effectively to business problems.

Fortunately my employer thought ahead and encourages young (ahem) ambitious personnel to apply for sponsorship of postgraduate education. I figured that an MBA might be just what I need to fill my head with new business tools and methodologies. The education of my choice is an EMBA at the Rotterdam School of Management. I talked to a recent graduate (and their poster boy), Wing Lee, who was positive about the experience and felt it had been worth the investment in both time and lost opportunities. Currently I am in the process of getting my business case approved by the senior management of my service area, as Deloitte sponsors tuition fees and grants some paid leave to selected candidates.

Wish me luck!

SDL Tridion SEO: Managing inbound links

404 pages are the best way to lower search engine rankings and scare visitors away from your site. In many cases the content is still available on the site; only its location has changed. Tridion eliminates broken links within your website, provided your content editors make correct use of component linking. Component linking also makes it very easy (and tempting) to change the location of content within a website.

Unfortunately inbound links and content indexed by search engines are not managed out of the box by Tridion, which results in the dreaded 404 pages being served to visitors and crawlers. The solution is simple: redirect (301) the crawlers and visitors to the new location of the content yourself. To do this you need to:

Read more

My promotion to: Father

My daughter Kira

Kira

Sorry, I was forced to post the above picture by Alvin. He wondered why I stopped posting and I think it is only fair to share with you the reason for my absence. Alvin suggested that I stopped posting because of a promotion and I have to admit: it sure feels like a promotion. I got a fancy new title, huge responsibilities, a sizeable addition to my workload and very little extra pay. However, when I look into those blue eyes I get the feeling it is all worth it, and I hope you will forgive me for not posting. In any case the number of visits to my blog has doubled since last year, which is a testimony to the success of SDL Tridion. To keep this blog relevant I hereby promise to resume posting.

SDL Tridion Troubles

This post is a response to “Rants on Tridion Implementations”, which I ran into on my Sunday evening round of blog reading. In that post Nuno L rants about the burden of fixing broken Tridion implementations. We seem to be in the same business and I would like to share my perspective on this subject. I strongly disagree with “my job is not always an easy or necessarily happy one”. In addition, when I (please read the rant of Nuno first) walk into a project it means your Tridion Troubles are over. I love to fix things that are broken or find a solution to problems other people have given up on. To see despair turn to optimism and see smiles on the customer’s face always makes me feel happy with my job. Though I have to admit I find that the best part of fixing the impossible is the bragging rights afterwards.

In the language debate I would like to make a stand for reduction in the number of languages/technologies needed to implement Tridion. The cost of maintaining different development environments alone should be enough reason to want to limit the number of languages and technologies.

That said I would also like to vent about the most unintelligible Tridion troubles I have come across: Read more

SDL Tridion (5.3) deleting large publications

Recently I had the opportunity to fix one of my customer’s blueprinting structures, which was of questionable quality. The change involved phasing out twelve large publications filled with all types of content imaginable. Of course I used the tips & tricks from a previous post on this topic, which handled the ‘unpublishing’ of content. However, I quickly ran into problems caused by the time the application needed to delete a publication.

The publications had been in use for over 4 years and a lot of content had accumulated in them. In order to be able to delete these publications I had to set a number of timeouts to more than 600 seconds. The following list of settings might prove useful if anybody attempts a similar clean-up:

  1. Tridion configuration -> timeout settings -> Seconds before a time out is generated when executing a long query
  2. MSDTC: Admin Tools -> Component Services, then right-click Computer -> Properties.
  3. IIS: the “script time out” error in Active Server Pages.

After these changes I was able to delete most publications and, in addition, serve a lot of coffee to my colleagues. Unfortunately some publications had not yet reconciled themselves to their fate and refused to be deleted, throwing all kinds of incomprehensible errors. I found the following solution to this problem:

<content disapproved by Tridion support>

  1. Please back-up your DB before attempting the following.
  2. Run the Stored procedure: EDA_PUBLICATIONS_DELETE using the publication id as parameter.

</content disapproved by Tridion support>

XSLT templating Tcm Script Assistant

I have to admit I am an XSLT junkie. It seems to be useful for almost anything, from transforming Enterprise Architect models into full-blown reports to simply adding sequential numbering to data dictionaries. My XSLT skills have proven to be a huge time saver on many occasions. The only flaw in XSLT is its readability.

In a rare bout of regression to my very first project role, I got down and dirty with some of the finer points of SDL WCMS XSLT Component Templating, specifically the use of the Tcm Script Assistant in XSLT. From this experience I would like to share my most valuable lesson learned: do not forget to cast the XSLT node text to a string. Hopefully the code below will prove beneficial for someone running into a similar challenge:

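<xsl:template match="Content:photo">
  <!-- Emit an element with the same local name as the matched node. -->
  <xsl:element name="{local-name()}">
    <!-- Cast the attribute node to a string before handing it to the script assistant. -->
    <xsl:value-of select="tcmse:PublishBinary(string(@xlink:href))"/>
  </xsl:element>
</xsl:template>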

Without the cast, the error I received was: “Error occurred during a call to property or method ‘PublishBinary’”

PS: do not forget to declare the tcmse namespace: xmlns:tcmse="http://www.tridion.com/ContentManager/5.1/TcmScriptAssistant"

SDL Tridion dynamic website performance

During one of my recent Tridion consulting assignments the client asked me about the performance difference between a website using a dynamic publishing approach and one using a static approach, because he had concerns about the performance of dynamic websites. In the graph below you can see the difference the Google crawler noticed after a website, which consists mainly of article-type content, changed from a static to a dynamic publishing approach. The left side of the graph shows the average time it took the Google crawler to download a page from the website (created by a well-known Tridion implementation partner, not Deloitte) based on the static publishing model. The right side shows the average time it took to download pages from the exact same front-end, but with the back-end rebuilt based on the dynamic publishing model.

As you can see, the average download times went down significantly. We have regularly seen the cached (home)page load within 100 ms and uncached pages in double that time. The peaks that show up from the middle of December mark the go-live of an application which was not fed with Tridion data and which showed mediocre performance until the first week of January.

The project of replacing the Tridion back-end for this particular website felt like killing my white whale. The website performance increase was an added bonus on top of the other improvements we realized during this project.

Why page load times matter:
