SolColdfusion Update

I've been knocked out by a really bad cold the last couple of weeks and I'm just starting to get things back to normal. I did want to send a quick post on the status of the SolColdfusion project. Seeing Ray's Seeker project reminded me that I hadn't set up a project at RIAForge yet, so I took care of that last night. You can now access the official project at http://solcoldfusion.riaforge.org/.

Since the project site is up and running, I also submitted it to the Solr project as SOLR-404 (I know it's a coincidence that it's number 404, but it makes me wonder if it's some type of bad omen ;) ...). Anyway, if you think the client should get included in the project, be sure to vote for it!

I've also been working on a generic interface, much like Erik Hatcher's Solr Flare. It's been a bit slow coming as I've not had a lot of time to work on these projects, but hopefully things will settle down shortly so I can devote a bit more time to them.

Solr Schema

If you've ever worked on a project that involved ColdFusion's bundled version of Verity, you've no doubt run into the issue of trying to confine your fields to the structure that Verity imposes, and those custom fields are really precious in these instances. About 6 months ago, I ran into an issue with a search project where I had about 125,000 documents to index. Since we also wanted to be able to use the indexes for some other projects, I was a bit nervous about committing almost the entire allotment of indexable objects to one collection. This launched me into writing a custom search engine and indexer using Lucene and slapping ColdFusion around the responses to do the things that Verity did. However, once the projects were complete, I never really got around to making it easy to use. It does cool stuff like searching across multiple collections, context highlighting, relevancy calculations, term vector calculations, "did you mean", etc.: essentially everything I think a good search engine needs to be able to do. What this system lacked was an easy way to define the fields that you wanted indexed; making those changes required a knowledge of Java.

The ability to create any number of fields and index them in different ways (along with faceting) is a real strong point of Solr. Not only can you add fields and choose how their data is analyzed, you can also create your own field types that process the information in your index the way you want.

This is done in the $SOLR_HOME/conf/schema.xml file. The first section (<types>) defines the types of fields that you will be using and how Solr should process them with Lucene. If you look at some of the fieldtypes, you'll get an idea of what's possible. For instance, the fieldtype for "string" is an untokenized field that omits norms (so no length normalization is stored for the field) and sorts documents that are missing the field last.

<fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

However, if you need a more robust fieldtype, look at the fieldtype for "text". This uses a whitespace tokenizer (splits the text on whitespace) and removes the stopwords defined in the stopwords.txt file. It also does some other processing (filters words, converts them to lowercase, runs a Porter stemmer, and then removes duplicates). This fieldtype also defines what to do when a query is passed to it (it uses largely the same filters). This is slightly different from the predefined "textTight" type, which declares a single analyzer that is applied at both index and query time and matches more strictly. You'll probably find that these work for most cases, but if you need to, you can build your own fieldtype that has very specific indexing and query filters.
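To make that description concrete, here is a trimmed sketch of what a "text"-style fieldtype can look like. It is modeled from memory on the example schema that ships with Solr 1.2, so treat the filter list and attribute values as assumptions and compare them with the schema.xml in your own download. The name "textSketch" is made up for illustration, and the word-delimiter and synonym filters from the real "text" type are left out to keep it short.

```xml
<!-- Sketch only: a tokenized type with separate index-time and query-time analysis -->
<fieldtype name="textSketch" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- split the input on whitespace -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- drop the words listed in stopwords.txt -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- lowercase every token -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Porter stemming, skipping the words listed in protwords.txt -->
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <!-- remove duplicate tokens at the same position -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- the query side repeats the same chain so that query terms match indexed terms -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>
```

Dropping the type="index" and type="query" attributes and keeping a single analyzer element gives a type whose one chain is used for both indexing and querying.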

The next section contains the actual fields you want to use in the aptly named "fields" element. This is where you actually define the fields that will be in your index, the type of analysis to perform on the field, if it should be indexed, stored, have term vectors, or be multivalued (have multiple instances of the same field in the index).

Let's say you want to develop an indexing schema for books (hey, I work in a library). At a very basic level, you'd want fields for an id, title, author, reviews, and a set of topics (or tags). Your fields element would contain something along the lines of:

<field name="id" type="string" indexed="true" stored="true"/>
<field name="title" type="text" indexed="true" stored="true" termVectors="true" />
<field name="titleStr" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="author" type="text" indexed="true" stored="true" termVectors="true" />
<field name="authorStr" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="review" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="topic" type="text" indexed="true" stored="true" multiValued="true" termVectors="true"/>
<field name="topicStr" type="string" indexed="true" stored="false" multiValued="true"/>

You'll notice that I have a couple of extra fields for title, author, and topic. These are for the faceting info and are just untokenized fields that make the facet calculations a little more efficient.

Now, we're almost done with creating the schema. We just need to declare a unique key, default search field, and default search operator.

<uniqueKey>id</uniqueKey>
<defaultSearchField>title</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>

Remember when I made the fields with the "Str" suffix? We can use a really cool feature of Solr called a "copyField" that literally copies the information from one field to another.

<copyField source="author" dest="authorStr"/>
<copyField source="title" dest="titleStr"/>
<copyField source="topic" dest="topicStr"/>
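Put together, the pieces above sit in schema.xml roughly as follows. This is a skeleton rather than a loadable schema: the schema name "books" is made up, the version attribute is assumed from the Solr 1.2 example schema, and the comments stand in for the definitions already shown.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<schema name="books" version="1.1">
  <types>
    <fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <!-- the "text" fieldtype and any custom fieldtypes go here -->
  </types>

  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="title" type="text" indexed="true" stored="true" termVectors="true"/>
    <field name="titleStr" type="string" indexed="true" stored="false" multiValued="true"/>
    <!-- author, authorStr, review, topic and topicStr follow the same pattern -->
  </fields>

  <uniqueKey>id</uniqueKey>
  <defaultSearchField>title</defaultSearchField>
  <solrQueryParser defaultOperator="OR"/>

  <!-- fill the untokenized facet fields from their analyzed counterparts -->
  <copyField source="title" dest="titleStr"/>
  <!-- the author and topic copyField lines go here as well -->
</schema>
```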

It's worth mentioning here that Solr indexes are not databases! While there are some similarities in the way that Solr allows you to add, update, select, store, and delete information from the index, Solr isn't an RDBMS. I've seen a few discussions where there is some confusion as to why Solr can't do the equivalent of a stored procedure, or some other function of a database.

Now, your index server is ready to receive documents to search against. The server, with the above example as its schema, will expect documents to be in the following format:

<doc>
   <field name="id">1</field>
   <field name="title">Solr Rocks!</field>
   <field name="author">Barr, Foo</field>
   <field name="review">This book rocks!</field>
   <field name="review">This book is horrible!</field>
   <field name="topic">information retrieval systems</field>
   <field name="topic">xml</field>
   <field name="topic">search</field>
   <field name="topic">apache foundation</field>
</doc>
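If you post a document to Solr's update handler yourself (typically the /update path under the Solr webapp), the doc element has to be wrapped in an add element. A client library may add that wrapper for you; whether SolColdfusion does is an assumption here, not something stated above. A minimal sketch using a few fields from the book example:

```xml
<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Solr Rocks!</field>
    <field name="author">Barr, Foo</field>
    <!-- a multivalued field is simply repeated -->
    <field name="topic">xml</field>
    <field name="topic">search</field>
  </doc>
  <!-- further <doc> elements can follow in the same message -->
</add>
```

The documents will not show up in searches until a commit is sent.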

Next week when I get some time, I'll write about creating facet queries...

Solr and ColdFusion -- Setting Up

To get up and running with Solr, you'll need some type of servlet container. Typically when folks start talking about servlet containers, they're talking about Tomcat or Jetty. In fact, Solr comes with Jetty 6.1.3 (they haven't upgraded to 6.1.5 yet in the distribution). You may also hear about Resin, but in my experience, it runs a bit slower than Jetty and Tomcat. As a small note, servlet containers are different from J2EE application servers like JRun, Geronimo, GlassFish, and JBoss (which use servlet containers like Tomcat and Jetty, but also have EJB containers and can handle other types of logic). If you have a J2EE application server running, you can easily use Solr, and if not, consider using Jetty or Tomcat as your container server.

Since your environment can be as varied as there are IT departments, I won't try to cover everything. Essentially you need to have at least the Java 1.5 JRE. However, I would strongly suggest the most current Java JDK (and not the JRE) as it has performance enhancements to run in server mode (with -server). If you don't already have this Java version installed on your server (assuming this is the same server running CF), don't worry, ColdFusion will still work if you install the required Java runtime.

Essentially, the process for deploying Solr, once you have a servlet container up and running, is to drop the solr.war file into the webapps directory on the server. It won't do anything at this point, as you still need to set up the configuration files for Solr. The easiest way to do this is to copy the files from example/solr into a new directory (which I will refer to from now on as solr_home).

You can tell Java about the home directory by setting the solr.solr.home system property (-Dsolr.solr.home), by setting the JNDI lookup ("java:comp/env/solr/home"), or by just putting it in the JVM's working directory (the default path is ./solr). Now you just need to make sure everything is running. Just point your browser to http://<server>:<port>/solr/admin. You should then see the administration interface (you may need to restart your servlet container to get everything working properly), but it's not an administrative interface like you get in CFAdmin. This is more of an informational panel. You can make sure everything is running, check that your index is set up properly and has documents in it, and look at the schema and configuration files and thread information. Really the only thing you can administer here is the log level.
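For the JNDI route on Tomcat, the Solr wiki describes a context fragment along these lines. Both paths are placeholders, and the file location (for example $CATALINA_HOME/conf/Catalina/localhost/solr.xml) is the usual Tomcat convention rather than anything specific to this setup.

```xml
<!-- docBase points at the war file; the paths below are placeholders -->
<Context docBase="/path/to/solr.war" debug="0" crossContext="true">
  <!-- exposes java:comp/env/solr/home to the Solr webapp -->
  <Environment name="solr/home" type="java.lang.String"
               value="/path/to/solr_home" override="true"/>
</Context>
```

With a fragment like this, the war file does not need to be copied into the webapps directory, since docBase tells Tomcat where to find it.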

For some more specific notes on installing Solr in Tomcat and Jetty, check out Solr's wiki. In particular, if you're going to need multiple instances of Solr to run, pay attention to the sections on multiple Solr apps on those wiki pages.

ColdFusion Solr Client - SolColdfusion

As I hinted at yesterday, I was close to having some code in the pipeline to abstract using Solr. I've now finished the initial code, so here's a brief setup guide to start playing with it.

First, you're going to need to grab the latest release version of Solr (currently 1.2). The only real requirement to run this software is that you have a JRE of 1.5 or higher. Untar/unzip the file somewhere convenient and open a command prompt. Change to the example directory inside the apache-solr-1.2.x folder (cd example). To start up the sample server running Jetty, just issue the following command:

java -jar start.jar

This will start a new instance of the Solr server on your computer on port 8983. You can make sure this is running by navigating to http://localhost:8983/solr (NOTE: this is a link to your computer. If you get an error, it's because your computer isn't running an instance of Solr on port 8983).

At this point, it's probably good to send you over to the Solr website to take a look at their tutorial. Go ahead. I'll wait...

...

Great, you're back.

You've seen some basic inserting, deleting, and querying of Solr index data. You may have also noticed that there are clients for PHP, Ruby, Python, and Java...no ColdFusion. I want to do a little more testing on this before I submit the patch, but I've added the initial code as an enclosure here to do updating, deleting, and searching in ColdFusion.

The SolColdfusion CFC should be in the path org/apache/solr/client (at least that's where I'm putting it for the purposes of this initial demonstration). The initialization takes one required parameter (the Solr host) and two optional parameters (port and path).

To set this up, create an instance with

<cfset solr = createObject("component", "org.apache.solr.client.SolColdfusion").init("http://localhost", "8983", "/solr") />

Now, there are a lot of different parameters you can send to Solr to perform different queries. And, since some of these key names can repeat, I chose to implement sending these parameters as an array. So, let's set this up.

<cfset params = arrayNew(1) />

<cfset params[1][1] = "indent" />
<cfset params[1][2] = "on" />
<cfset params[2][1] = "wt" />
<cfset params[2][2] = "standard" />
<cfset params[3][1] = "fl" />
<cfset params[3][2] = "*,score" />
<cfset params[4][1] = "qt" />
<cfset params[4][2] = "standard" />

These parameters are basically the defaults that Solr uses anyway. If you want highlighting, you would need to add two more rows to the array, 'hl = on' and 'hl.fl = ' (the latter names the fields to highlight).

Searching is straightforward, taking a query, the start row, the number of rows to return, and the array of parameters:

<cfset results = solr.search("*:*", 0, 10, params) />

This matches every document in the index (*:* means any field, any value) and returns an XML document with the search results in it.

<cfdump var="#xmlParse(results)#" />

In the result node, you'll see that Solr returns an xmlAttribute numFound of 0 (assuming you don't have anything in the index). Let's add an example document from the documents that come with Solr.
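For orientation, the response for an empty index has roughly the following shape. This is a hand-written sketch of the standard XML response format, not captured output; the real header carries more entries (such as the query time) and the result element may carry additional attributes.

```xml
<response>
  <lst name="responseHeader">
    <!-- status 0 means the request succeeded -->
    <int name="status">0</int>
    <!-- query time and other header entries omitted -->
  </lst>
  <!-- numFound is the total number of matches; start is the offset of the first row returned -->
  <result name="response" numFound="0" start="0"/>
</response>
```

Once the index has matching documents, each one appears as a doc element inside the result element.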

<!--- Create a new sample document --->
<cfxml variable="sample">
<doc>
<field name="id">F8V7067-APL-KIT</field>
<field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field>
<field name="manu">Belkin</field>
<field name="cat">electronics</field>
<field name="cat">connector</field>
<field name="features">car power adapter, white</field>
<field name="weight">4</field>
<field name="price">19.95</field>
<field name="popularity">1</field>
<field name="inStock">false</field>
</doc>
</cfxml>

<!--- add this document to the index --->
<cfset solr.add(sample) />
<cfset solr.commit() />
<cfset solr.optimize() />

<!--- search for the newly added document --->
<cfset results = solr.search("id:F8V7067-APL-KIT", 0, 10, params) />

<cfdump var="#xmlParse(results)#" />

You'll notice I used a commit and an optimize statement. Neither statement is necessary every time you add a document, but be aware that Solr buffers added documents: they won't be visible to searches until you commit, even if the index settings in your solrconfig.xml file (such as mergeFactor) have already caused some of them to be written to disk.

Now, let's delete this document...

<cfset solr.deleteById("F8V7067-APL-KIT") />
<cfset solr.commit() />

Don't forget to commit deletions to the index!
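Calls like these correspond to Solr's XML update messages, which are posted to the update handler. How SolColdfusion builds them internally is not shown here, so take this as a sketch of what Solr itself accepts; the delete-by-query example uses the manu field from the sample document and is only an illustration.

```xml
<!-- each element below is a separate message, posted on its own -->

<!-- make pending adds and deletes visible to searches -->
<commit/>

<!-- merge index segments; worth doing occasionally, not after every add -->
<optimize/>

<!-- delete a single document by its unique key -->
<delete><id>F8V7067-APL-KIT</id></delete>

<!-- delete every document matching a query -->
<delete><query>manu:Belkin</query></delete>
```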

There'll be more soon (adding multiple documents, deleting by query). In the meantime, try it out. If you have any comments, questions, concerns, whatever, let me know.

ColdFusion and Solr

I've spent the last few months working on some projects that didn't really have anything to do with ColdFusion (lots of Java and PHP). One of the projects I've been working with (Vufind.org) uses Solr as its indexing/search engine. Solr is starting to get picked up by some pretty big companies (Netflix just relaunched their search using Solr this week).

I've been working with Solr in Java for a bit now, and I wanted to start building an interface for using it as a search engine in ColdFusion (my Lucene code is stuck in open source limbo). One of the cool things about Solr is that it returns results over HTTP (in XML, JSON, or Ruby).

As soon as I get the code finished, I'll post it as a patch in Solr.