Solr Highlighting - Enterprise Solr vs ColdFusion's Built-in Solr

First, a big THANK YOU to everyone who attended my session at CF Objective. It was very rewarding to be able to spread the information rattling around in my head and answer your questions about Solr. One of the questions I received has inspired a blog post.

This particular attendee needed to display highlighting from multiple areas of a document when the matching search term appears more than once in that document. Using the Solr instance built into ColdFusion, she was unable to achieve this. In this post I'll outline the highlighting options available in both ColdFusion's built-in Solr and the full version of Apache Lucene Solr.

First, I am in no way bashing Adobe's rollout of Solr. It will work fine for applications that only require basic search and indexing features. If you need more detail out of your search, however, the full version available from Apache is the way to go.

So let's see what ColdFusion has to offer us when we want highlighting results with our search.

I created a document containing 5 paragraphs of "Lorem Ipsum" and placed the word "dog" at four random locations within the document. I then used CFIndex to add the document to my ColdFusion collection and performed a search using CFSearch. In the "context" column of the returned query object I get:

ColdFusion provides us with a short snippet of text with the search term surrounded by HTML emphasis tags. Although the search term appears four times inside the document text, we only get the first match here. The HTML tags surrounding the matched term can be customized using the contextHighlightBegin and contextHighlightEnd attributes of the CFSearch tag, in case you want to change the background color or mark your match a different way.
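As a rough sketch of that customization, the snippet below wraps each match in a styled span instead of the default emphasis tags. The collection name "myCollection" and the CSS class "hit" are placeholders chosen for illustration. They are not taken from the test setup described above.

```cfml
<!--- Assumption: a collection named "myCollection" already exists and was populated with CFIndex. --->
<!--- The markers are kept in variables so the angle brackets stay out of the tag attributes. --->
<cfset hlBegin = '<span class="hit">'>
<cfset hlEnd = '</span>'>

<cfsearch
    collection="myCollection"
    criteria="dog"
    name="searchResults"
    contextHighlightBegin="#hlBegin#"
    contextHighlightEnd="#hlEnd#">

<!--- The context column holds the snippet with the match already wrapped in the markers above. --->
<cfoutput query="searchResults">
    <p>#searchResults.key#: #searchResults.context#</p>
</cfoutput>
```

This only changes how the match is marked up. It does not change how many matches come back.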

Okay, so if we want ALL of the highlighting data, that doesn't help us. Let's see what we get with the full version of Solr 4.8.0 with CFSolrLib to integrate it into our CF application.

I indexed the same text file, using Apache Tika to extract the text from the document and the "add" method of CFSolrLib to index my data. I placed the text in the "title" field since I already have that field set up for highlighting. Apache Solr offers more than one highlighter. The two I'll compare here are the "simple" (standard) highlighting component, which is much like what you see in ColdFusion's instance, and the "Fast Vector Highlighter", which is a much faster and far superior highlighting component. For this example, I will be using the Fast Vector Highlighter.

First thing to note, when using the Fast Vector Highlighter, the field you intend to pull your highlighting data from has to have Term Offsets, Term Positions and Term Vectors set to true. In the Solr instance distributed with CFSolrLib you'll find this line in your schema:


<field name="title" type="text_general" indexed="true" stored="true" termOffsets="true" termPositions="true" termVectors="true" />

This field has the proper attributes set to use the Fast Vector Highlighter.

Now let's look at our search example (searchExample.cfm distributed with CFSolrLib). Taking a look at the top of the example, there are highlighting attributes being passed to Solr.


<cfset local.params = structNew()>
<cfset local.params["hl"] = "on">
<cfset local.params["hl.fl"] = "title">
<cfset local.params["hl.fragListBuilder"] = "simple">
<cfset local.params["hl.fragsize"] = 20>
<cfset local.params["hl.snippets"] = 10>
<cfset local.params["hl.useFastVectorHighlighter"] = true>
<cfset local.params["hl.fragmentsBuilder"] = "colored">
<cfset local.params["hl.boundaryScanner"] = "default">
<cfset local.params["hl.usePhraseHighlighter"] = true>
<cfset searchResponse = sampleSolrInstance.search(URL.q,0,100,"title",local.params) />

hl.fl contains the list of fields we want highlighting results from, and hl.fragListBuilder tells Solr to use the simple frag list builder. hl.fragsize tells Solr how long, in characters, to make the snippets of text returned in the results, and hl.snippets caps the number of snippets returned per field. We're telling Solr we want to use the Fast Vector Highlighter and, through the colored fragments builder, that we want it to automatically change the background color behind the matching text. This sets a different color for each matching term if you're searching for more than one word. We've also enabled phrase highlighting for the case where we search for a phrase, which must be surrounded by quotes. Finally, we're passing the field name "title" along with our parameters to the search function, since we're expecting highlighting results from that field. Currently CFSolrLib is configured to return highlighting results from only one field, but it could easily be modified to accept a list of fields.
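To see what these parameters look like without CFSolrLib in the middle, here is a minimal sketch that sends a few of them straight to Solr over HTTP. The host, port and core name ("localhost:8983" and "collection1") are assumptions about a default local Solr 4 install, and the variable names are illustrative. Adjust them to match your own instance.

```cfml
<!--- Assumption: Solr is running locally with a core named "collection1". --->
<cfhttp url="http://localhost:8983/solr/collection1/select" method="get" result="solrResponse">
    <cfhttpparam type="url" name="q" value="title:dog">
    <cfhttpparam type="url" name="wt" value="json">
    <!--- The same highlighting switches used in searchExample.cfm --->
    <cfhttpparam type="url" name="hl" value="on">
    <cfhttpparam type="url" name="hl.fl" value="title">
    <cfhttpparam type="url" name="hl.useFastVectorHighlighter" value="true">
    <cfhttpparam type="url" name="hl.fragsize" value="20">
    <cfhttpparam type="url" name="hl.snippets" value="10">
</cfhttp>

<!--- toString() covers the case where ColdFusion hands the response body back as a binary object. --->
<cfset solrData = deserializeJSON(solrResponse.fileContent.toString())>

<!--- The "highlighting" node is keyed by document id, then by field name, and holds an array of snippets. --->
<cfdump var="#solrData.highlighting#">
```

Dumping the raw response this way is handy for checking what Solr returns before any library reshapes it.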

If I run searchExample.cfm in a browser and search for "dog" with highlighting enabled, I'll get my document back with just the first instance of "dog" highlighted. If we take a look at the code we'll see:


<cfif structKeyExists(currentResult,"highlightingResult")>#currentResult.highlightingResult[1]#</cfif>

If we get highlighting results back, the example displays only the first value in the array. This was done for simplicity. If you wanted to display all of the results, you could easily loop over the array, or display a specific element. If we dump out our result we see:
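A minimal sketch of that loop might look like the following. It assumes the same currentResult struct used in searchExample.cfm. The variable name "snippet" and the list markup are only for illustration. If this code already sits inside a cfoutput block in your page, drop the cfoutput wrapper shown here.

```cfml
<!--- Show every highlighting snippet Solr returned for this result, not just the first one. --->
<cfif structKeyExists(currentResult, "highlightingResult")>
    <cfoutput>
        <ul>
            <cfloop array="#currentResult.highlightingResult#" index="snippet">
                <!--- Each snippet already contains the highlighting markup generated by Solr. --->
                <li>#snippet#</li>
            </cfloop>
        </ul>
    </cfoutput>
</cfif>
```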

You'll notice that instead of a query object, we get back a ColdFusion structure containing an array of results. In our highlighting node of the struct we get back an array containing each of the highlighting snippets Solr generated. This array can easily be accessed to display the highlighting result from any section of the document. The fragsize attribute can be adjusted to make these snippets of text longer or shorter. I recommend playing with the example some to see how you can change the way your highlighting results are returned.

As always, I'm happy to answer any questions you have about this post. And keep in mind: just because you're not getting what you want from ColdFusion's Solr doesn't mean you should give up on Solr. It is fast and feature rich. Chances are, it will do more than you think.

CFSolrLib can be downloaded at https://github.com/iotashan/cfsolrlib. The version available for download there is distributed with Solr 4.0.0, but I'm working on updating my fork of the repository, https://github.com/VWRacer/cfsolrlib, with Solr 4.8.0. I'll update this post when it's done.

My Proposed Topic for CF Objective 2014

This time around, I decided to throw my hat in the ring to potentially be a speaker at CF Objective and, of course, it involves Apache Solr. My topic is named "Beyond CFIndex - Apache Solr Enterprise Search Server and ColdFusion Integration". I did a similar presentation in October 2012 at CF.Objective(ANZ) in Melbourne, Australia. Solr 4.0 had just been released and I was able to highlight some of the new shiny parts, especially the redesigned web admin interface. My goal with this topic for 2014 is to show off some of the powerful features of Apache Solr that just aren't available when using ColdFusion's built-in version of Solr, and how to harness them within one's CF application. In addition, I'll be highlighting some of the breakthroughs that have come about in Solr 4 since its initial release. Most recently, Solr 4.5 has introduced some VERY useful features, including the ability to manage your schema through the API or run in a schema-less mode where Solr uses its best guess on the field type depending on what you send it for indexing.

With all of that said, voting is now open for proposed topics and we need the CF community's opinions to make CF Objective the great conference it always has been. I have never had the means to go on my own, so speaking would open a lot of doors for me. Speaker or not, I will find some way to be there this time around.

Voting on topics can be done at:

https://trello.com/b/4M6JSoyL/cf-objective-call-for-speakers-2014

You will have to sign up for a Trello username to vote, or you can link Trello to your Google+ account.

I thank everyone for their support and will see you out there!

Copyright © 2008 - Jim Leether