ColdFusion Solr Client - SolColdfusion

As I hinted at yesterday, I was close to having some code in the pipeline to abstract using Solr. I've finished the initial code, which handles updating, deleting, and searching. Here's a brief setup guide to start playing with the code.

First, you're going to need to grab the latest release version of Solr (currently 1.2). The only real requirement to run this software is a JRE of 1.5 or higher. Untar/unzip the file somewhere convenient and open a command prompt. Change to the example directory inside the apache-solr-1.2.x folder (cd example). To start up the sample server running Jetty, just issue the following command:

java -jar start.jar

This will start a new instance of the Solr server on your computer on port 8983. You can make sure this is running by navigating to http://localhost:8983/solr (NOTE: this is a link to your computer. If you get an error, it's because your computer isn't running an instance of Solr on port 8983).
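If you'd rather check from ColdFusion than from a browser, a quick cfhttp request will do. This is only a sketch: it assumes the example server's admin page is at /solr/admin/, and solrCheck is just a variable name picked for illustration.

```cfml
<!--- Ask the local Solr instance for its admin page (the URL is an assumption) --->
<cfhttp url="http://localhost:8983/solr/admin/" method="get" timeout="5" result="solrCheck" />

<!--- statusCode looks like "200 OK" on success; a refused connection reports a failure message instead --->
<cfif left(solrCheck.statusCode, 3) eq "200">
    <cfoutput>Solr is up.</cfoutput>
<cfelse>
    <cfoutput>Solr did not answer: #htmlEditFormat(solrCheck.statusCode)#</cfoutput>
</cfif>
```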

At this point, it's probably good to send you over to the Solr website to take a look at their tutorial. Go ahead. I'll wait...

...

Great, you're back.

You've seen some basic inserting, deleting, and querying of Solr index data. You may have also noticed that there are clients for PHP, Ruby, Python, and Java...no ColdFusion. I want to do a little more testing on this before I submit the patch, but I've added the initial code as an enclosure here to do updating, deleting, and searching in ColdFusion.

The CFC SolColdfusion should be in the path org/apache/solr/client (at least that's where I'm putting it for the purposes of this initial demonstration). The initialization takes one required parameter (the Solr host) and two optional parameters (port and path).

To set this up, create an instance with

<cfset solr = createObject("component", "org.apache.solr.client.SolColdfusion").init("http://localhost", "8983", "/solr") />

Now, there are a lot of different parameters you can send to Solr to perform different queries. Since some of these key names can repeat, I chose to send the parameters as an array of name/value pairs rather than a struct. So, let's set this up.

<cfset params = arrayNew(1) />

<cfset params[1][1] = "indent" />
<cfset params[1][2] = "on" />
<cfset params[2][1] = "wt" />
<cfset params[2][2] = "standard" />
<cfset params[3][1] = "fl" />
<cfset params[3][2] = "*,score" />
<cfset params[4][1] = "qt" />
<cfset params[4][2] = "standard" />
<cfset params[5][1] = "wt" />
<cfset params[5][2] = "standard" />

These parameters are basically the defaults that Solr uses anyway. If you want highlighting, you need to add two more rows to the array: hl set to on, and hl.fl set to the fields you want highlighted.
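Continuing the array above, the two highlighting rows might look like this. The field list is an assumption: name and features are fields in Solr's example schema, so substitute your own.

```cfml
<!--- Turn highlighting on --->
<cfset params[6][1] = "hl" />
<cfset params[6][2] = "on" />
<!--- Fields to highlight; "name,features" is an assumed choice for illustration --->
<cfset params[7][1] = "hl.fl" />
<cfset params[7][2] = "name,features" />
```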

Searching is straightforward, taking a query, the start row, the number of rows to return, and the array of parameters:

<cfset results = solr.search("*:*", 0, 10, params) />

This searches all fields and all content and returns an XML document with the search results in it.

<cfdump var="#results#" />

In the result node, you'll see that Solr returns an xmlAttribute of numFound of 0 (assuming you don't have anything in the index). Let's add an example document from the documents that come with Solr.
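To pull that count out in code rather than reading it off the dump, something like the following should work. It's a sketch that assumes the standard Solr XML response layout (a result element directly under response); it checks whether search() handed back a string or an already parsed XML object, and the variable names are illustrative.

```cfml
<!--- Parse the response only if it is not already an XML document object --->
<cfif isXmlDoc(results)>
    <cfset responseXml = results />
<cfelse>
    <cfset responseXml = xmlParse(results) />
</cfif>

<!--- The result element carries numFound as an attribute --->
<cfset resultNodes = xmlSearch(responseXml, "/response/result") />
<cfset numFound = resultNodes[1].xmlAttributes.numFound />

<cfoutput>#numFound# document(s) found</cfoutput>
```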

<!--- Create a new sample document --->
<cfxml variable="sample">
<doc>
<field name="id">F8V7067-APL-KIT</field>
<field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field>
<field name="manu">Belkin</field>
<field name="cat">electronics</field>
<field name="cat">connector</field>
<field name="features">car power adapter, white</field>
<field name="weight">4</field>
<field name="price">19.95</field>
<field name="popularity">1</field>
<field name="inStock">false</field>
</doc>
</cfxml>

<!--- add this document to the index --->
<cfset solr.add(sample) />
<cfset solr.commit() />
<cfset solr.optimize() />

<!--- search for the newly added document --->
<cfset results = solr.search("id:F8V7067-APL-KIT", 0, 10, params) />

<cfdump var="#xmlParse(results)#" />

You'll notice I used a commit and an optimize statement. Neither of these is necessary every time you add a document, but be aware that newly added documents won't show up in searches until they are committed. Solr caches documents and won't flush the new documents to disk unless you either commit them or reach the mergeFactor setting you used in your solrconfig.xml file.
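For example, when adding several documents in one go, you can hold the commit until the end. A minimal sketch, using only the add() and commit() calls shown above; the ids and names are made up, and the fields assume Solr's example schema.

```cfml
<!--- Made-up ids for illustration --->
<cfset ids = "DEMO-1,DEMO-2,DEMO-3" />

<cfloop list="#ids#" index="docId">
    <!--- Build one document per id --->
    <cfxml variable="sampleDoc">
    <cfoutput>
    <doc>
        <field name="id">#xmlFormat(docId)#</field>
        <field name="name">Demo document #xmlFormat(docId)#</field>
    </doc>
    </cfoutput>
    </cfxml>
    <cfset solr.add(sampleDoc) />
</cfloop>

<!--- A single commit at the end makes all three documents searchable --->
<cfset solr.commit() />
```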

Now, let's delete this document...

<cfset solr.deleteById("F8V7067-APL-KIT") />
<cfset solr.commit() />

Don't forget to commit deletions to the index!

There'll be more soon (adding multiple documents, deleting by query). In the meantime, try it out. If you have any comments, questions, concerns, whatever, let me know.

ColdFusion and Solr

I've spent the last few months working on some projects that didn't really have anything to do with ColdFusion (lots of Java and PHP). One of the projects I've been working with (Vufind.org) uses Solr as its indexing/search engine. Solr is starting to get picked up by some pretty big companies (Netflix just relaunched their search using Solr this week).

I've been working with Solr in Java for a bit now, and I wanted to start building an interface for using it as a search engine in ColdFusion (my Lucene code is stuck in open source limbo). One of the cool things about Solr is that it returns results over HTTP (in XML, JSON, or Ruby).
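Since it's all plain HTTP, you can talk to Solr with nothing but cfhttp. A rough sketch, assuming the example server from the Solr tutorial at localhost:8983 and its /solr/select handler; wt=json asks for JSON instead of the default XML, and solrResponse is just an illustrative variable name.

```cfml
<!--- Query every document and ask for a JSON response (host, port, and path are assumptions) --->
<cfhttp url="http://localhost:8983/solr/select" method="get" result="solrResponse">
    <cfhttpparam type="url" name="q" value="*:*" />
    <cfhttpparam type="url" name="wt" value="json" />
</cfhttp>

<!--- Show the raw response body --->
<cfoutput><pre>#htmlEditFormat(solrResponse.fileContent)#</pre></cfoutput>
```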

As soon as I get the code finished, I'll post it as a patch in Solr.

Fun with PDFs

I've been working with a lot of PDF files lately for a few different projects (see The Flat Hat and Card Catalog). With our special collections cards, when you got a result back, Acrobat viewer would blow up the image to around 600%, making for a rather ugly image. For The Flat Hat, I really wanted to be able to open a PDF and have the search terms highlighted, so I started hunting for ways to actually do this.

I've been using PDFBox to extract text from our PDFs to index with Lucene, so I started there and they clued me in to Adobe's PDF Open Parameters. This really killed a few birds with one stone.

When I was working on the Flat Hat newspaper, I was originally only returning the page that the search result was on. I had some misgivings about this (like what if the story was on more than one page), but being able to pass the search query from the engine into the PDF is really nice since the user doesn't have to search through the entire issue to find the context they are searching for (e.g. whistle bait: when I saw that term, I cracked up; definitely a different era).

Basically, the PDF Open Parameters let you pass commands into a PDF to control how it is opened. They're passed with a "#" after the filename (e.g. filename.pdf#zoom=100). You can string commands together with an ampersand (&), with a few caveats:

  1. only one digit after a decimal is retained
  2. parameters and their values can only be 32 total characters long
  3. you can't use the reserved characters (=, #, and &) in values, and there's no way to escape them
  4. if you turn bookmarks off for a PDF that had bookmarks showing, they won't go away until the PDF has been rendered

Anyway, here are some examples of what you can do:
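One such link, sketched in ColdFusion with a made-up file name, sets the zoom and passes the search terms through. The zoom and search parameters come from Adobe's PDF Open Parameters; pdfFile, terms, and pdfLink are illustrative names only.

```cfml
<!--- Hypothetical file name and search terms --->
<cfset pdfFile = "flathat-issue.pdf" />
<cfset terms = "whistle bait" />

<!--- A doubled pound sign is a literal one in CFML; the search word list goes in double quotes --->
<cfset pdfLink = pdfFile & '##zoom=100&search="' & terms & '"' />

<!--- Yields: flathat-issue.pdf#zoom=100&search="whistle bait" --->
<cfoutput><a href="#htmlEditFormat(pdfLink)#">Open the issue with the search terms highlighted</a></cfoutput>
```

Each parameter and its value stays well under the 32-character limit noted above.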

ColdFusion and Lucene

It seems that every couple of years someone has need for some aspect of an information retrieval (IR) system with features that ColdFusion's bundled Verity IR doesn't have (see cflucene and Aaron Johnson's blog). I too ran into a situation that called for investigating alternatives to Verity.

Our Special Collections Research Center has used 3x5 index cards to catalog their archives and manuscript collections. There are myriad problems with a hard-copy index catalog (that's why we use computers, right?), so we started a process of scanning the cards and running them through OCR software (we're using ABBYY). My original thought was just to dump these all in a location and index them with Verity. All was going well until I realized we had a little more than 110,000 of these individual files, which was pushing the 150,000 document limit for our version of ColdFusion. I also knew that there were some other projects to digitize back issues of student newspapers that would push this document count higher.

We had a few choices to make: upgrade our current license to the Enterprise Edition, which has a 250,000 document limit; purchase a Google Search Appliance; or find a relatively easy-to-implement IR engine. Each had its own pros and cons, and we initially leaned toward purchasing the Enterprise license. I also have a forthcoming project (hopefully) that will take traditional structured data and pair it with unstructured data reports. For that, I need something that is capable of indexing pretty much anything. However, as funding in academic libraries is challenging at times, I started poking around with Lucene.

I remembered that a few years ago, Macromedia had released some code called lindex in its DRK 3. I found the CD and tried it out. I noticed that it was built on the 1.2 release of Lucene and used an old version of PDFBox to extract text from PDFs. Since there have been some significant improvements to Lucene since version 1.2 (the most current version is 2.1), I thought I would try replacing the lucene-core jar file with the more recent one, but that just led to a heap of problems. I kept looking around, but most of the projects out there haven't kept their code up to date with Lucene, so I figured I'd start playing around with it.

I'll be doing a series of posts on indexing, on implementing different parts of Lucene and some of its contributed modules, on third-party software that helps create thematic categories in the search results (http://demo.carrot2.org/demo-stable/main), and on index management using ColdFusion.

For the time being, here's a demo of the search engine I'm working on. It is an index of William and Mary's student newspaper, The Flat Hat, from 1939 to 1950.