Apache Solr, Tika and Those Dreaded "X-Files"

I had a great post written on multicore mode, indexing and searching in multicore mode and doing a distributed search over several cores, but good ole Skype locked up my computer and I lost the entire thing. So instead I decided to touch on this subject.

So we all know Solr is nifty by now. If you've done any in-depth reading on its capabilities, you know the ways you can break up and analyze text to make your searches relevant are endless. Solr uses Apache Tika to parse many types of documents, and Tika does a great job extracting content and metadata from a large number of different kinds of files.

Those of you using Solr with your CF applications probably know that it's smart to load Tika on the ColdFusion side and parse a file's content before sending it over to Solr. That way you're not streaming an entire file over HTTP, but simply sending a string. This frees up resources and bandwidth for other things, and is also a lot faster. There is an example of how to use Tika on the ColdFusion side in the index example in the latest CFSolrLib.

In an application for which I maintain the Solr server and code, things were humming along just fine until I tried to parse a .docx file. All of a sudden the application choked and I received a ColdFusion error. This became a huge thorn in my side and was happening with any of the "newer" Microsoft Office files with file extensions ending in "X" (aka Open XML). They became known around the office as "X-Files". In the stack trace of the error was:

Caused by: java.lang.ClassCastException: org.dom4j.DocumentFactory cannot be cast to org.dom4j.

So what the heck does that mean???

To put it in English, Tika processes Open XML formatted documents (docx, xlsx, pptx, etc.) in a different way and, as a result, must use a different context class loader. This used to be difficult and required a lot of code, but thanks to Mark Mandel, a switchThreadContextClassLoader method was added to JavaLoader that does this for us automatically.
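
For the curious, here is roughly what "switching the context class loader" means on the Java side. This is only a minimal sketch of the general pattern, not JavaLoader's actual source: the class name ContextClassLoaderSwitch and the method callWith are made up for illustration, and the empty URLClassLoader is a stand-in for a loader that would really point at the Tika jars.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.concurrent.Callable;

// Hypothetical helper, for illustration only (not part of JavaLoader or Tika).
final class ContextClassLoaderSwitch {

    // Runs the task with `loader` as the thread context class loader,
    // then puts the previous loader back, even if the task throws.
    static <T> T callWith(ClassLoader loader, Callable<T> task) throws Exception {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(loader);
        try {
            return task.call();
        } finally {
            current.setContextClassLoader(previous);
        }
    }

    public static void main(String[] args) throws Exception {
        ClassLoader original = Thread.currentThread().getContextClassLoader();

        // Assumption: in a real application this loader would be built from the Tika jar URLs.
        ClassLoader isolated = new URLClassLoader(new URL[0], original);

        // Inside the task, code that asks for the context class loader sees `isolated`.
        boolean swapped = callWith(isolated,
                () -> Thread.currentThread().getContextClassLoader() == isolated);

        System.out.println("swapped during call: " + swapped);
        System.out.println("restored afterwards: "
                + (Thread.currentThread().getContextClassLoader() == original));
    }
}
```

The try/finally is the important part: the original loader goes back in place no matter what happens during parsing, so the rest of the request keeps running with the loader it started with.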

In the application I work on, the file extension is stored in a database, so it's very easy for me to make a comparison and switch the context class loader when needed:


<cfscript>
// I probably do not have all the file formats in this list, but these are some common ones.
var fileObject = "";
if (listFindNoCase("docx,xlsx,pptx,docm,xlsm,pptm,ppsx", arguments.fileExtension)) {
    // parsing Open XML files must be done using a different context class loader
    fileObject = application.javaloader.switchThreadContextClassLoader(processOpenXMLFile, { filePath = arguments.filePath });
} else {
    // use our cached copy of Tika and parse the file
    application.tika.setMaxStringLength(-1);
    fileObject = application.tika.parseToString(createObject("java", "java.io.File").init(arguments.filePath));
}
// either way, send the extracted text to Solr along with the file name and id
SolrInstance.add([{name="content",value=fileObject},{name="attr_fileName",value=arguments.fileName},{name="id",value=arguments.id}]);
</cfscript>

The code above reads the file extension and decides how to process the incoming file based on whether or not it's an Open XML formatted file. If it is an Open XML file, it calls the switchThreadContextClassLoader method and loads a new instance of Tika in the processOpenXMLFile method, where the file content is parsed and returned.

Here's a look at the processOpenXMLFile method:


<cffunction name="processOpenXMLFile" access="private" returntype="string">
<cfargument name="filepath" type="string" required="yes">

<cfscript>
    // grab a new instance of tika
    var tika = application.javaloader.create("org.apache.tika.Tika").init();
        
    // parse the file
    tika.setMaxStringLength(-1);
    var returnValue = tika.parseToString(createObject("java","java.io.File").init(arguments.filePath));
        
    // return the parsed string
    return returnValue;
        
</cfscript>

</cffunction>

Take note of the tika.setMaxStringLength(-1); setting. By default, Tika's parseToString will only return the first 100,000 characters from a document. You can set this to as many characters as you want, but setting it to -1 will remove the restriction altogether.

Using the code above will allow your application to handle any Open XML file you want to throw at it. Just make sure the file extensions you need are in the list passed to listFindNoCase. If you need to place the code in more than one location, you could store the list of file extensions in a database or variable so that you only have to maintain it in one place.

This error was a huge pain for me and I want to thank Mark Mandel and Jeff Coughlin for helping me to flush out the issue. I'm a big fan of "passing along the love", so hopefully this will help someone who encounters this problem flush it out in a timely manner. I wasn't able to find much on the subject, and what I was able to find was more for the Java programmer, which I am not.

Have fun out there and keep on indexing!

Comments
I need to use Solr with rich documents and Tika, but I don't know ColdFusion. I use Struts 2. Can you help me?
# Posted By Javier | 2/9/13 10:52 AM
Since you're using Struts, I assume you're somewhat familiar with Java programming. I would take a look at the documentation for both Tika and SolrJ. SolrJ is the Java-based API we use with CFSolrLib to communicate with Solr. I believe you can use the Tika and SolrJ jar files in your application and call their parsing and indexing methods directly.

http://tika.apache.org/
http://wiki.apache.org/solr/Solrj
# Posted By Jim Leether | 2/9/13 12:47 PM
Copyright © 2008 - Jim Leether