The processing instruction target matching “[xX][mM][lL]” is not allowed

This exception reminds me of the phrase “much ado about nothing”.

I experienced this exception while parsing an XML document using SAXParser. I searched the web and found many questions on this, and several others who had experienced the same issue. So I thought of putting a solution here, in case it helps someone who faces something similar.

Well, the solution is pretty simple. The problem is somewhere in the first line of your XML, i.e.

<?xml version="1.0" encoding="utf-8"?>

In my case there was whitespace before the start tag <. I trimmed the XML string and that was all!
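Here is a minimal sketch of the fix, assuming the XML arrives as a string (the variable names are mine):

import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Leading whitespace before the <?xml ... ?> declaration triggers the exception,
// so trim the raw string before handing it to the parser.
String rawXml = "  <?xml version=\"1.0\" encoding=\"utf-8\"?><root/>";
String cleanXml = rawXml.trim();

SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
parser.parse(new ByteArrayInputStream(cleanXml.getBytes("utf-8")), new DefaultHandler());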

How to identify endianness of a system

Though Java is platform independent, there are times when we need to identify the endianness of a system.

In this short blog, we will see how this can be achieved.

1. Using System.getProperty()

System.getProperty("sun.cpu.endian");

Ex:

System.out.println("Endianness: " + System.getProperty("sun.cpu.endian"));

Output:

Endianness: little

2. Using ByteOrder.nativeOrder()

import java.nio.ByteOrder;

System.out.println("ByteOrder: " + ByteOrder.nativeOrder());
if (ByteOrder.nativeOrder().equals(ByteOrder.LITTLE_ENDIAN)) {
    System.out.println("Little Endian");
}

Output:

ByteOrder: LITTLE_ENDIAN
Little Endian

3. Using ByteBuffer

The classic C trick of shifting bits and inspecting memory does not carry over to Java, because Java integer arithmetic is defined to behave identically on every platform; a check like (1 << 16) | 0 == 65536 is always true and would print the same result everywhere. What we can do instead is write an int into a ByteBuffer set to the native byte order and inspect which byte comes first:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

ByteBuffer buffer = ByteBuffer.allocate(4).order(ByteOrder.nativeOrder());
buffer.putInt(1);
// On a little-endian machine the least significant byte is stored first.
if (buffer.get(0) == 1) {
    System.out.println("Little Endian");
} else {
    System.out.println("Big Endian");
}

Output:

Little Endian


Solr 4.0 uber cool usages

One of the most common questions I have been asked recently is: what are the new features in Solr 4.0? Well, there are plenty of posts on the web that provide this information. The Solr wiki also explains it quite well.

I will try to throw some light on the features that are really cool, the practical usages of these features, and which of them we are trying to leverage in our current projects. I will also define each feature in brief to set the ground.

  • Pseudo-fields: Provides the ability to alias a field, or to add metadata along with returned documents. We are using it to return the confidence score of the matched document.
  • New spell checker implementation: This does not require a new index to be created and works on the main index. Hence, no extra index needs to be maintained for the spell checker to work.
  • Enhancements have been made to function queries, and conditional function queries are now allowed. We had a scenario where we were boosting documents on the basis of download count. Documents for which the download count was not available were badly affected. Now conditional boosting can be done, or only documents with more than a specified download count can be boosted (see the sketch after this list).
  • Atomic update: Provides the flexibility to update only the fields of the document that have been modified. Prior versions required us to send the complete document even if a single field had been modified. Note: internally, it is still implemented as a delete and add, and is not a DB-like update.
  • New relevance ranking models like BM25, language models, etc. have been introduced. Analysis needs to be done to check if some other model works better than the current VSM.
  • Indexed terms are no longer UTF-16 char sequences; instead, terms can be any binary value encoded as byte arrays. By default, text terms are now encoded as UTF-8 bytes.
  • A transaction log ensures that even uncommitted documents are never lost.
  • FuzzyQueries are 100x faster.
  • Solr Cloud. I will refrain from using this until 4.1 releases.
  • NoSQL features: As of now, I prefer to use Solr for search. For NoSQL, I prefer to stick with the NoSQL DB of my choice.
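As an illustration of the conditional boosting, here is a sketch of an edismax request; downloadCount is a hypothetical field name and the log expression is just one possible boost:

http://localhost:8080/solr/select?q=lucene&defType=edismax&boost=if(exists(downloadCount),log(sum(downloadCount,1)),1)

Here if() and exists() are among the function queries introduced in 4.0; documents without a download count get a neutral boost of 1 instead of being penalized.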

Precisely, what is precision and recall?

Precision and recall are the first things that come to mind when we talk of information retrieval.

Whenever we develop an IR engine or tune an existing one, we are interested in knowing how good our search results are, or how much they have improved. This is where precision and recall come into play.

Suppose the corpus contains z documents in total, of which y are relevant to a query. Whenever we query the IR system, we retrieve x documents, of which some number a are relevant.

Precision can be defined as a/x and recall as a/y.

Hence, we can define precision as the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved.

For example, suppose the index has 20 documents about music and 10 about movies. A query for some music returns 10 documents, of which 5 are music and 5 are movies. Hence, the precision is 5/10 = 1/2, i.e. 50%, and the recall is 5/20 = 1/4, i.e. 25% for the query.

In a nutshell, we can say that precision is a measure of quality while recall is a measure of quantity. So, high recall means that an algorithm returned most of the relevant results, and high precision means that an algorithm returned more relevant results than irrelevant ones.

Precision = |relevant ∩ retrieved| / |retrieved|
Recall = |relevant ∩ retrieved| / |relevant|
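
To make the arithmetic concrete, here is a minimal Java sketch of the two formulas; the document IDs are made up:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

Set<String> relevant = new HashSet<String>(Arrays.asList("m1", "m2", "m3", "m4", "m5"));
Set<String> retrieved = new HashSet<String>(Arrays.asList("m1", "m2", "m3", "x1", "x2"));

// relevant ∩ retrieved
Set<String> hits = new HashSet<String>(relevant);
hits.retainAll(retrieved);

double precision = (double) hits.size() / retrieved.size(); // 3/5 = 0.6
double recall = (double) hits.size() / relevant.size();     // 3/5 = 0.6

System.out.println("Precision: " + precision + ", Recall: " + recall);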

In the next blog, we will try to dive deeper into the concept.


Solr Atomic Update

One of the promising features of Solr 4.0 is atomic updates. With previous releases of Solr, to update a document you were supposed to send all the fields, even those that had not changed. If you provided only the fields that had changed, the values of the other fields would be lost. Why does it behave so? It’s because Lucene deletes the document and then adds it back.

Many a time, you form the Lucene index by reading data from different sources or from different tables in a DB. Forming a complete Solr document is a costly operation in many cases, say when you are forming a Solr document from different graph DBs. Solr 4.0 sets you free! Just send the field that is to be updated (along with a few additional parameters) together with the unique key field, and you are done. Internally, Solr queries the document based on the uniqueId, modifies it, and adds it back to the index. But it sets you free from doing the same in your client application.

To update a field, add the update attribute to the field tag, with set as the value. Here is an example:

<add>
<doc>
<field name="id">1</field>
<field name="profession" update="set">consultant</field>
</doc>
</add>

This works really well for a single-valued field. But if you want to update the values of a multi-valued field, it is better to send the whole document, as you can make it work only with a workaround as of now.

I tried setting the values of a multi-valued field, and this is what happened.

<add>
<doc>
<field name="id">1</field>
<field name="skills" update="set">mac</field>
<field name="skills" update="set">linux</field>
</doc>
</add>

The document added to Solr was as follows:

<doc>
<str name="id">1</str>
<arr name="skills">
<str>{set=mac}</str>
<str>{set=linux}</str>
</arr>
</doc>

This is really not what I was looking for.

I happened to find a workaround to properly add values to a multi-valued field. The trick is to update any other field with the same value (one you are sure of), or to have a dummy field and update it with a null value. Also, pass the values for multi-valued fields the way you do while adding a new document. Here is an example:

<add>
<doc>
<field name="id">1</field>
<field name="profession" update="set">consultant</field>
<field name="skills">java</field>
<field name="skills">J2EE</field>
<field name="skills">Unix</field>
</doc>
</add>

With this I was able to achieve what I wanted.

And if you want to add a value to the existing set of values in a multi-valued field, that is simple:

<add>
<doc>
<field name="id">1</field>
<field name="skills" update="add">windows</field>
</doc>
</add>

If you want to reset/remove the value of a field in a document, pass the additional attribute null="true" as follows:

<field name="name" update="set" null="true"></field>
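
If you are on SolrJ, the same set operation can be expressed by passing a Map whose key is the modifier. Here is a minimal sketch, assuming Solr runs at http://localhost:8080/solr:

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

HttpSolrServer server = new HttpSolrServer("http://localhost:8080/solr");

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "1");

// The map key ("set" or "add") plays the role of the update attribute in the XML.
Map<String, Object> partialUpdate = new HashMap<String, Object>();
partialUpdate.put("set", "consultant");
doc.addField("profession", partialUpdate);

server.add(doc);
server.commit();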


Index XML documents to Solr

The two primary operations on Solr are indexing and searching. When it comes to indexing in Solr, documents can be indexed from different sources like a DB, XML files, CSV, etc. In this blog, we are going to focus on indexing XML. The XML can be indexed to Solr as follows:

  1. Over HTTP: To index a document to Solr, the XML should be created in the following format:
    <add>
     <doc>
       <field name="field1">value to be indexed</field>
       <field name="field2">value to be indexed</field>
     </doc>
     <doc>
       <field name="field1">value to be indexed</field>
       <field name="field2">value to be indexed</field>
     </doc>
    </add>

Note: field1 & field2 should correspond to field names in schema.xml. Ensure that the values for required fields are present in the XML.

The document can be indexed using the GET or POST method. Use GET only when adding a few documents.

If the document is being added using GET method, index the documents as follows:

http://localhost:8080/solr/umdb_mapping/update?stream.body=<add><doc><field name="song">Love the way you cry</field><field name="album">Rihanna</field></doc></add>

If POSTing the document using curl, do it as follows:

curl http://<host>:<port>/solr/<core-if-applicable>/update?commit=true -H "Content-Type: text/xml" --data-binary '<add><doc><field name="song">Love the way you cry</field><field name="album">Rihanna</field></doc></add>'

If the XML document is in a file (assuming the file is in the same directory) instead of stream, pass the file-name as follows:

curl http://<host>:<port>/solr/<core-if-applicable>/update?commit=true -H "Content-Type: text/xml" --data-binary '@solr_xml_sample.txt'

The documents added are not committed by default. This has to be done either by making a separate commit request like '<commit/>', or the add query should contain the parameter commit=true.
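For instance, a separate commit request via curl could look like this (host and port are placeholders, as above):

curl http://<host>:<port>/solr/<core-if-applicable>/update -H "Content-Type: text/xml" --data-binary '<commit/>'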

Ever wondered how to add values for a multiValued Solr field? Well, no new tag needs to be known. All you need to do is repeat the field tag with that name. It has to be something as follows:

<add>
 <doc>
   <field name="field1">value to be indexed</field>
   <field name="field2">value to be indexed</field>
   <field name="field2">value to be indexed2</field>
 </doc>
</add>

2. Using DataImport: Will be taking this up soon.

Please refer to the Solr wiki for details of other optional attributes and other XML operations like update & delete.


Solr vs Lucene

In this post, we are not trying to compare Solr and Lucene, as the two are not different technologies; rather, we are trying to identify when to use which. I would recommend that in 90% of the cases, or even more, Solr would be the preferred choice, as it is nothing but a serverization of Lucene. Below is the list of additional features which Solr provides on top of Lucene:

  • Processing requests over HTTP
  • Caching mechanism
  • Admin interface
  • Configuration in XML files, with the notion of fieldType
  • DisMax query
  • Spell check & suggest
  • More like this
  • Distributed & cloud features
  • DataImportHandler & other handlers for extracting data
The above features make it the preferred choice. Now comes the question: when should you use Lucene? In most cases you would not. But if the available memory is limited, as on mobile devices, or you need to write a lot of low-level code, tuning or adding your own logic, Lucene would be your choice (see the sketch below).
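
As a taste of what that low-level code looks like, here is a minimal sketch of embedded Lucene (4.0-era API) that indexes one document in memory and searches it back; the field name and query text are made up:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Index a single document into an in-memory directory.
RAMDirectory dir = new RAMDirectory();
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_40, analyzer));
Document doc = new Document();
doc.add(new TextField("title", "hello lucene", Field.Store.YES));
writer.addDocument(doc);
writer.close();

// Search it back.
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
TopDocs hits = searcher.search(new QueryParser(Version.LUCENE_40, "title", analyzer).parse("hello"), 10);
System.out.println("Hits: " + hits.totalHits);

Notice how much plumbing (directory, analyzer, writer, reader, searcher) Solr handles for you behind a single HTTP request.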