Solr Atomic Update

One of the promising features of Solr 4.0 is atomic updates. With previous releases of Solr, to update a document you had to send all the fields, even those that had not changed. If you provided only the fields that had changed, the values of the other fields were lost. Why does it behave so? It’s because Lucene deletes the document and then adds it.

Many a time, you build a Lucene index by reading data from different sources or from different tables in a DB. Forming a complete Solr document is a costly operation in many cases, say when you are forming a Solr document from different graph DBs. Solr 4.0 sets you free! Just send the field that is to be updated (along with a few additional parameters) together with the unique key field, and you are done. Internally, Solr fetches the document by its unique ID, modifies it and adds it back to the index, so you no longer have to do the same in your client application.

To update a field, add the update attribute to the field tag, with set as the value. Here is an example:

<add>
<doc>
<field name="id">1</field>
<field name="profession" update="set">consultant</field>
</doc>
</add>

This works really well for a single-valued field. But if you want to set the values of a multi-valued field, it is better to send the whole document, since as of now you can make it work only with a workaround.

I tried setting the values of a multi-valued field and this is what happened.

<add>
<doc>
<field name="id">1</field>
<field name="skills" update="set">mac</field>
<field name="skills" update="set">linux</field>
</doc>
</add>

The document added to Solr was as follows:

<doc>
<str name="id">1</str>
<arr name="skills">
<str>{set=mac}</str>
<str>{set=linux}</str>
</arr>
</doc>

This is really not what I was looking for.

I happened to find a workaround to properly set the values of a multi-valued field. The trick is to update some other field with the same value it already has (one you are sure of), or to have a dummy field and update it with a null value. Along with that, pass the values for the multi-valued field the way you do while adding a new document. Here is an example:

<add>
<doc>
<field name="id">1</field>
<field name="profession" update="set">consultant</field>
<field name="skills">java</field>
<field name="skills">J2EE</field>
<field name="skills">Unix</field>
</doc>
</add>

With this I was able to achieve what I wanted.

And if you only want to add a value to the existing set of values in a multi-valued field, that is simple:

<add>
<doc>
<field name="id">1</field>
<field name="skills" update="add">windows</field>
</doc>
</add>

If you want to reset/remove the value of a field in a document, pass the additional attribute null="true" as follows:

<field name="name" update="set" null="true"></field>
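If the update messages are generated by client code rather than written by hand, the set, add and null variants shown above could be produced along these lines. This is only a sketch using the Python standard library: the helper name atomic_update_xml is made up for illustration, and it assumes "id" is the unique key field, as in the examples.

```python
import xml.etree.ElementTree as ET


def atomic_update_xml(doc_id, field, value, modifier="set"):
    """Build an <add> message that atomically updates one field.

    Hypothetical helper, not part of any Solr client library.
    Passing value=None emits null="true", which resets the field.
    """
    if modifier not in ("set", "add"):
        raise ValueError("only the modifiers shown in this post are supported")
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Assumption: "id" is the unique key field of the schema.
    ET.SubElement(doc, "field", name="id").text = str(doc_id)
    attrs = {"name": field, "update": modifier}
    if value is None:
        attrs["null"] = "true"
    target = ET.SubElement(doc, "field", attrs)
    if value is not None:
        target.text = str(value)
    return ET.tostring(add, encoding="unicode")


print(atomic_update_xml(1, "profession", "consultant"))
print(atomic_update_xml(1, "skills", "windows", modifier="add"))
print(atomic_update_xml(1, "name", None))
```

Building the message with an XML library instead of string concatenation also takes care of escaping characters such as & and < in field values.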


Index XML documents to Solr

The two primary operations in Solr are indexing and searching. When it comes to indexing, documents can be indexed from different sources like a DB, XML, CSV etc. In this post, we are going to focus on indexing XML. XML can be indexed into Solr as follows:

  1. Over HTTP: To index documents into Solr, the XML should be created in the following format:
    <add>
     <doc>
       <field name="field1">value to be indexed</field>
       <field name="field2">value to be indexed</field>
     </doc>
     <doc>
       <field name="field1">value to be indexed</field>
       <field name="field2">value to be indexed</field>
     </doc>
    </add>

Note: field1 & field2 should correspond to field names in schema.xml. Ensure that values for all required fields are present in the XML.

The document can be indexed using the GET or POST method. Use GET only when adding a few documents.

If the document is being added using the GET method, index it as follows:

http://localhost:8080/solr/umdb_mapping/update?stream.body=<add><doc><field name="song">Love the way you cry</field><field name="album">Rihanna</field></doc></add>
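A browser will usually escape such a URL for you, but when it is built in code the XML has to be URL-encoded first. A minimal Python sketch, assuming the same local core as above (the function name stream_body_url is made up for illustration):

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def stream_body_url(update_url, xml_body, commit=False):
    """Return a GET URL that carries the XML in the stream.body parameter."""
    params = {"stream.body": xml_body}
    if commit:
        params["commit"] = "true"
    # urlencode percent-escapes <, >, quotes and spaces in the XML.
    return update_url + "?" + urlencode(params)


xml_body = (
    '<add><doc><field name="song">Love the way you cry</field>'
    '<field name="album">Rihanna</field></doc></add>'
)
url = stream_body_url("http://localhost:8080/solr/umdb_mapping/update", xml_body)
print(url)
```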

If POSTing the document using curl, do it as follows:

curl "http://<host>:<port>/solr/<core-if-applicable>/update?commit=true" -H "Content-Type: text/xml" --data-binary '<add><doc><field name="song">Love the way you cry</field><field name="album">Rihanna</field></doc></add>'

If the XML document is in a file (assuming the file is in the current directory) instead of being passed inline, pass the file name as follows:

curl "http://<host>:<port>/solr/<core-if-applicable>/update?commit=true" -H "Content-Type: text/xml" --data-binary '@solr_xml_sample.txt'

The documents added are not committed by default. This has to be done either by making a separate commit request like '<commit/>' or by adding the parameter commit=true to the add request.
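The same POST can be made from code. The following is a rough Python equivalent of the curl calls above, using only the standard library. The function names are made up for illustration, and the update URL is whatever your own host, port and core give you.

```python
import urllib.request


def build_update_request(update_url, xml_body, commit=True):
    """Prepare (but do not send) a POST of an XML update message."""
    url = update_url + ("?commit=true" if commit else "")
    return urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
        method="POST",
    )


def post_update(update_url, xml_body, commit=True):
    """Send the update to a running Solr and return the response body."""
    request = build_update_request(update_url, xml_body, commit)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


request = build_update_request(
    "http://localhost:8080/solr/umdb_mapping/update",
    '<add><doc><field name="song">Love the way you cry</field></doc></add>',
)
print(request.get_method(), request.full_url)
```

To commit separately instead, call post_update(update_url, "<commit/>", commit=False) after the adds.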

Ever wondered how to add values for a multiValued Solr field? Well, there is no new tag to learn. All you need to do is have multiple field tags with the same name, as follows:

<add>
 <doc>
   <field name="field1">value to be indexed</field>
   <field name="field2">value to be indexed</field>
   <field name="field2">value to be indexed2</field>
 </doc>
</add>
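When the documents live in your program as dictionaries, the repeated field tags fall out naturally from list values. Here is a small Python sketch (the helper name add_xml is made up for illustration) that emits one field tag per value:

```python
import xml.etree.ElementTree as ET


def add_xml(docs):
    """Turn a list of dicts into an <add> message.

    A list or tuple value becomes several <field> tags with the same
    name, which is how a multiValued field is filled.
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for single in values:
                ET.SubElement(doc_el, "field", name=name).text = str(single)
    return ET.tostring(add, encoding="unicode")


message = add_xml([
    {"field1": "value to be indexed",
     "field2": ["value to be indexed", "value to be indexed2"]},
])
print(message)
```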

  2. Using DataImport: Will be taking this up soon.

Please refer to the Solr Wiki for details of other optional attributes and other XML operations like update & delete.


Stemming vs Lemmatization

How are stemming and lemmatization different?

Stemming works on a single word without knowledge of the context. Stemmers are easier to implement and faster to run.

The lemma of a word changes with context, and hence lemmatizers are more difficult to implement. ‘Running’ has ‘run’ as its lemma as well as its stem. ‘Better’ has ‘good’ as its lemma, but not as its stem.
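The difference can be made concrete with a toy example. The Python sketch below is not a real stemmer or lemmatizer: the suffix rules and the tiny lemma table are invented purely to reproduce the two examples above. A real system would use an established algorithm and a full dictionary.

```python
def toy_stem(word):
    """Strip a common suffix using string rules only, with no context."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # Undo consonant doubling: "runn" becomes "run".
            if word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            break
    return word


# Illustrative lookup table: a lemma depends on the word AND its part of speech.
TOY_LEMMAS = {
    ("running", "verb"): "run",
    ("better", "adj"): "good",
}


def toy_lemma(word, part_of_speech):
    """Look the word up in the dictionary; fall back to the word itself."""
    word = word.lower()
    return TOY_LEMMAS.get((word, part_of_speech), word)


for word, pos in (("Running", "verb"), ("Better", "adj")):
    print(word, "stem:", toy_stem(word), "lemma:", toy_lemma(word, pos))
```

The stemmer never gets from ‘better’ to ‘good’ because no suffix rule connects them; only the dictionary lookup does.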


Solr vs Lucene

In this post, we are not trying to compare Solr and Lucene, as they are not different technologies; rather, we are trying to identify when to use which. I would recommend Solr in 90% of the cases, or even more, as it’s nothing but a serverization of Lucene. Below is the list of additional features that Solr provides on top of Lucene:

  • Processing requests over HTTP
  • Caching mechanism
  • Admin interface
  • Configuration in XML files, with the notion of fieldType
  • DisMax query
  • Spell check & suggest
  • More like this
  • Distributed & cloud features
  • DataImportHandler & other handlers for extracting data
The above features make it the preferred choice. Now comes the question of when you should use Lucene. In most cases you would not. But if the available memory is limited, as on mobile devices, or you need to write a lot of low-level code, tuning or adding your own logic, Lucene would be your choice.