Full text search on big data

Now that you have bought into how analytics on “big data” will improve your company’s bottom line, you are tasked with identifying your company’s search strategy for big data.
Full text search is offered by various established vendors:

  • Microsoft with its acquisition of FAST Search and Powerset
  • HP with its acquisition of Autonomy, which had IDOL eDiscovery
  • IBM with its acquisition of Vivisimo
  • EMC with its acquisition of Kazeon
  • Oracle with its acquisition of Endeca
  • Dassault with its acquisition of Exalead

The flip side of each of the above is that you are forced into buying a big-vendor solution, much like a consumer forced to buy a complex CD/MP3/radio/clock combination when all they wanted was a clock with big blue lettering that is visible in the dark!

What alternatives do you have to the big vendors? Lucene is a full text search engine written in Java. A more precise definition I’ve seen on the net is “Lucene is an information retrieval library with support for tokenizers, analyzers, query parsers.” What does this mean to you and me? Just that Lucene is a low-level toolkit, a library without built-in support for key things like distributed indexes. That makes it of limited use to a CIO with limited engineering staff. If you are a company like Salesforce, LinkedIn or Twitter, with legions of developers, Lucene may be perfectly adequate.
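
To make “tokenizers, analyzers, query parsers” a little more concrete, here is a toy sketch in Python of what such a library does under the hood: break text into terms, normalize them, and build an inverted index from term to documents. This is not Lucene’s API or its actual implementation; the function names, the stop-word list and the sample documents are all made up for illustration.

```python
import re
from collections import defaultdict

# Illustrative stop-word list only; not the list Lucene ships with.
STOP_WORDS = {"a", "an", "the", "is", "of", "on", "in"}


def analyze(text):
    """Lowercase the text, split it into alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_index(docs):
    """Build an inverted index: each term maps to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query (AND semantics)."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents.
DOCS = {
    1: "Full text search on big data",
    2: "Lucene is a search library",
    3: "The clock with big blue lettering",
}
INDEX = build_index(DOCS)
```

Calling search(INDEX, "big data") runs the query through the same analyze step as the documents and intersects the matching document sets. Everything a real deployment needs beyond this (ranking, persistence, distributing the index across machines) is what you would have to build or buy on top of a low-level library.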

What then should the typical enterprise that is not one of the booming dot-coms do? Look into technologies like Apache Solr, an HTTP search server built on top of Lucene. Specifically, Solr gives you REST-like HTTP/XML and JSON APIs, which means you don’t need developers writing Java code.
What Solr gives you is the following:

  • Full-text search
  • Highlighting of hits
  • Faceted search
  • Dynamic clustering
  • Database integration
  • Rich document handling of formats like Word and PDF
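
As a rough illustration of what “REST-like HTTP” means in practice, the Python sketch below only builds the URL for a Solr search request that asks for JSON output, hit highlighting and a facet count; it does not contact any server. The parameter names (q, wt, rows, hl, hl.fl, facet, facet.field) are standard Solr query parameters, but the host, the core name “articles” and the field names “body” and “category” are assumptions made up for this example, and the exact path depends on how your Solr instance is set up.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, core, text, facet_field, highlight_field, rows=10):
    """Build a Solr /select URL with JSON output, highlighting and one facet field."""
    params = [
        ("q", text),                  # the full-text query
        ("wt", "json"),               # response format
        ("rows", rows),               # number of hits to return
        ("hl", "true"),               # turn on highlighting of hits
        ("hl.fl", highlight_field),   # field to highlight
        ("facet", "true"),            # turn on faceted search
        ("facet.field", facet_field), # field to count facet values for
    ]
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


# Hypothetical host, core and field names.
URL = build_solr_query_url(
    "http://localhost:8983/solr", "articles", "big data", "category", "body"
)
```

Any HTTP client, in any language, can then fetch that URL and parse the JSON that comes back, which is why no Java code is required on the application side.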

Are there companies using Solr over Lucene today? Groupon is said to use Solr/Lucene to index email sent to Groupon customers and to inform consumers about deals available in their geographical vicinity. Other companies said to use Solr/Lucene include Netflix, AT&T, Sears, Ford, and Verizon.

By this time you’re asking yourself, “If I invest in developing search using Solr over Lucene, how do I get support if it is all open source?” Emerging vendors such as Lucidworks and OpenLogic offer consulting and support for deploying search applications using Solr over Lucene. The other question you might have is, “Why invest in x86 servers, rack space, cooling and floor tiles, and what are the pros and cons if I move everything into the Amazon cloud?” OpenLogic has published an interesting cost comparison on this topic. Having read all of the above, what if you want to explore open source alternatives to Solr? Look into Elasticsearch and Blur. In short, there are many ways to find the proverbial needle in the “big data” haystack.
