Solr 4.0 uber cool usages

One of the most common questions I have been asked recently is: what are the new features in Solr 4.0? There are plenty of posts on the web that provide this information, and the Solr wiki explains it quite well too.

I will try to throw some light on the features that are really cool, what their practical usages are, and which of them we are trying to leverage in our current projects. I will also define each feature in brief to set the ground.

  • Pseudo-fields: They provide the ability to alias a field or add metadata to the returned documents. We are using them to return the confidence score of the matched document (see the example query after this list)
  • New spell checker implementation: It no longer requires a separate index and works on the main index, so no extra index needs to be maintained for the spell checker to work
  • Enhancements to function queries: Conditional function queries are now allowed. We had a scenario where we were boosting documents on the basis of download count, and documents for which the download count was not available were badly affected. Now conditional boosting can be done, or only documents with more than a specified download count can be boosted (see the example query after this list)
  • Atomic update: Provides the flexibility to update only the fields of the document that have been modified. Prior versions required us to send the complete document even if a single field had changed. Note: internally it is still implemented as a delete followed by an add, not a DB-like update
  • New relevance ranking models such as BM25 and language models have been introduced. Analysis needs to be done to check whether some other model works better than the current VSM
  • Indexed terms are no longer UTF-16 char sequences; terms can now be any binary value encoded as byte arrays. By default, text terms are encoded as UTF-8 bytes
  • A transaction log ensures that even uncommitted documents are never lost
  • FuzzyQueries are 100x faster
  • SolrCloud. I will refrain from using this until 4.1 is released
  • NoSQL features: As of now, I prefer to use Solr for search; for NoSQL I will stick with the NoSQL DB of my choice
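
To make the pseudo-field and conditional-boost points concrete, here is a rough sketch of a query that aliases the relevance score as a pseudo-field named confidence and boosts only documents that actually have a download count. The host and core are the same as in the indexing examples later in this post; the field names (song, download_count) and the edismax boost formula are illustrative assumptions, not from an actual project:

http://localhost:8080/solr/umdb_mapping/select?q=love&defType=edismax&fl=id,song,confidence:score&boost=if(exists(download_count),log(download_count),1)

Here confidence:score simply renames the score in the response, and if(exists(download_count),log(download_count),1) falls back to a neutral boost of 1 when the download count is missing, so such documents are no longer penalized.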

Solr Atomic Update

One of the promising features of Solr 4.0 is atomic updates. With the previous releases of Solr, to update a document you had to send all the fields, even those that had not changed. If you provided only the fields that had changed, the values of the other fields were lost. Why does it behave so? It's because Lucene deletes the document and then adds it back.

Many a time, you form a Lucene index by reading data from different sources or from different tables in a DB. Forming a complete Solr document is a costly operation in many cases, say when you are forming a Solr document from different graph DBs. Solr 4.0 sets you free! Just send the field that is to be updated (along with a few additional parameters) together with the unique key field and you are done. Internally, Solr queries the document based on the unique id, modifies it and adds it back to the index, but it sets you free from doing the same in your client application.

To update a field, add an update attribute to the field tag with set as its value. Here is an example:

<add>
<doc>
<field name="id">1</field>
<field name="profession" update="set">consultant</field>
</doc>
</add>
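
This atomic-update document is sent to the update handler like any other add request. A minimal sketch using curl, assuming the same host, port and core as the indexing examples later in this post:

curl http://localhost:8080/solr/umdb_mapping/update?commit=true -H "Content-Type: text/xml" --data-binary '<add><doc><field name="id">1</field><field name="profession" update="set">consultant</field></doc></add>'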

This works really well for single-valued fields. But if you want to update the values of a multi-valued field, you are better off sending the whole document, as you can only make it work with a workaround as of now.

I tried setting the values of a multi-valued field, and this is what happened.

<add>
<doc>
<field name="id">1</field>
<field name="skills" update="set">mac</field>
<field name="skills" update="set">linux</field>
</doc>
</add>

The document added to Solr was as follows:

<doc>
<str name="id">1</str>
<arr name="skills">
<str>{set=mac}</str>
<str>{set=linux}</str>
</arr>
</doc>

This is really not what I was looking for.

I happened to find a workaround to achieve a proper update of a multi-valued field. The trick is to update any other field with the same value (one you are sure of), or to have a dummy field and update it with a null value. Also pass the values for the multi-valued field the way you do while adding a new document. Here is an example:

<add>
<doc>
<field name="id">1</field>
<field name="profession" update="set">consultant</field>
<field name="skills">java</field>
<field name="skills">J2EE</field>
<field name="skills">Unix</field>
</doc>
</add>

With this I was able to achieve what I wanted.

If you just want to add a value to the existing set of values in a multi-valued field, that is simple:

<add>
<doc>
<field name="id">1</field>
<field name="skills" update="add">windows</field>
</doc>
</add>

If you want to reset/remove the value of a field in a document, pass the additional attribute null="true" as follows:

<field name="name" update="set" null="true"></field>


Index XML documents to Solr

The two primary operations in Solr are indexing and searching. When it comes to indexing, documents can be indexed from different sources like a DB, XML files, CSV etc. In this blog, we are going to focus on indexing XML files. XML can be indexed to Solr as follows:

  1. Over HTTP: To index documents to Solr, the XML should be created in the following format:
    <add>
     <doc>
       <field name="field1">value to be indexed</field>
       <field name="field2">value to be indexed</field>
     </doc>
     <doc>
       <field name="field1">value to be indexed</field>
       <field name="field2">value to be indexed</field>
     </doc>
    </add>

Note: field1 & field2 should correspond to field names in schema.xml. Ensure that values for all required fields are present in the XML.
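
For reference, here is a minimal sketch of what the corresponding field definitions in schema.xml might look like; the field types and attributes are assumptions for illustration only:

<!-- illustrative field types and attributes; adjust to your own schema -->
<field name="field1" type="string" indexed="true" stored="true" required="true"/>
<field name="field2" type="text_general" indexed="true" stored="true" multiValued="true"/>

The multiValued="true" attribute on field2 is what allows the repeated field tags shown in the multiValued example further below.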

The documents can be indexed using the GET or POST method. Use GET only when adding a few documents.

If the documents are being added using the GET method, index them as follows:

http://localhost:8080/solr/umdb_mapping/update?stream.body=<add><doc><field name="song">Love the way you cry</field><field name="album">Rihanna</field></doc></add>

If POSTing the document using curl, do it as follows:

curl http://<host>:<port>/solr/<core-if-applicable>/update?commit=true -H "Content-Type: text/xml" --data-binary '<add><doc><field name="song">Love the way you cry</field><field name="album">Rihanna</field></doc></add>'

If the XML document is in a file (assuming the file is in the same directory) instead of a stream, pass the file name as follows:

curl http://<host>:<port>/solr/<core-if-applicable>/update?commit=true -H "Content-Type: text/xml" --data-binary '@solr_xml_sample.txt'

The documents added are not committed by default. This has to be done either by making a separate commit request like '<commit/>', or the add request should contain the parameter commit=true.
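
As a quick sketch, a standalone commit can be issued over HTTP with curl, using the same placeholders as the examples above:

curl http://<host>:<port>/solr/<core-if-applicable>/update -H "Content-Type: text/xml" --data-binary '<commit/>'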

Ever wondered how to add values for a multiValued Solr field? Well, no new tag needs to be learned. All you need to do is have multiple field tags with that name. It looks something like this:

<add>
 <doc>
   <field name="field1">value to be indexed</field>
   <field name="field2">value to be indexed</field>
   <field name="field2">value to be indexed2</field>
 </doc>
</add>

2. Using DataImport: Will be taking this up soon..

Please refer to the Solr wiki for details of other optional attributes and other XML operations like update & delete.


Solr vs Lucene

In this post, we are not trying to compare Solr and Lucene as different technologies; rather, we are trying to identify when to use which. I would recommend that in 90% of the cases, or even more, Solr would be the preferred choice, as it is essentially a serverization of Lucene. Below is the list of additional features that Solr provides on top of Lucene:

  • Processing requests over HTTP
  • Caching mechanism
  • Admin interface
  • Configuration in XML files, with the notion of fieldType
  • DisMax query parser (see the example query after this list)
  • Spell check & suggest
  • More like this
  • Distributed & cloud features
  • DataImportHandler & other handlers for extracting data
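
As a small sketch of the DisMax point, the query below searches the user's input across several fields with per-field weights; the core and field names (song, album) are assumptions carried over from the indexing examples in the previous post:

http://localhost:8080/solr/umdb_mapping/select?defType=dismax&q=love&qf=song^2+album

The qf parameter tells DisMax which fields to search and how much weight to give each, something a plain Lucene application would have to assemble by hand.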
The above features make it the preferred choice. Now comes the question: when should you use Lucene directly? In most of the cases you would not. But if the available memory is limited, as in the case of mobile devices, or you need to write a lot of low-level code, tuning or adding your own logic, Lucene would be your choice.