Solr Schema

If you've ever worked on a project that involved ColdFusion's bundled version of Verity, you've no doubt run into the problem of squeezing your data into the structure that Verity imposes, where the handful of custom fields becomes really precious. About 6 months ago, I ran into an issue with a search project where I had about 125,000 documents to index. Since we also wanted to be able to use the indexes for some other projects, I was a bit nervous about committing almost the entire allotment of indexable objects to one collection. This launched me into writing a custom search engine and indexer using Lucene and slapping ColdFusion around the responses to do the things that Verity did.

It does cool stuff like searching across multiple collections, context highlighting, relevancy calculations, term vector calculations, "did you mean", etc. Essentially everything I think a good search engine needs to be able to do. However, once the projects were complete, I never really got around to making it easy to use. What the system lacked was an easy way to define the fields you wanted indexed; making a change required a knowledge of Java.

The ability to create any number of fields and index them in different ways (along with faceting) is a real strong point of Solr. Not only can you add fields and choose how their data is analyzed, you can also create your own field types that process the information in your index the way you want it processed.

This is done in the $SOLR_HOME/conf/schema.xml file. The first section (<types>) defines the types of fields that you will be using, and how Solr should process them with Lucene. If you look at some of the fieldtypes, you'll get an idea of what's possible. For instance, the fieldtype for "string" is an untokenized field that omits norms (the length normalization and index-time boost data) and sorts documents that are missing the field last.

<fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

However, if you need a more robust fieldtype, look at the fieldtype for "text". This uses a whitespace tokenizer (splits words on whitespace) and removes the stopwords defined in the stopwords.txt file. It also does some other processing (filters words, converts them to lowercase, runs a Porter stemmer, and then removes duplicates). This fieldtype also defines what to do when a query is passed to it (it uses the same filters). This is slightly different from the predefined "textTight", which does not perform any further analysis on the text when it is queried. You'll probably find that the stock fieldtypes work in most instances, but if you need to, you can build your own fieldtype with very specific index and query filters.
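As a rough sketch of what a home-grown fieldtype could look like, here is one with separate index and query analyzers. The name "textCustom" is made up for illustration, and the factory class names are the ones I would expect in a stock Solr install; check them against the example schema that ships with your version before relying on them.

```xml
<!-- Hypothetical fieldtype: the name "textCustom" is an assumption, not part of the stock schema -->
<fieldtype name="textCustom" class="solr.TextField" positionIncrementGap="100">
  <!-- Analysis applied when documents are added to the index -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <!-- Analysis applied to the query text; here it skips stemming so queries match more strictly -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
```

Keeping the two analyzer chains different is a deliberate choice in this sketch; in most cases you would want them to line up so that query terms match what was indexed.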

The next section contains the actual fields you want to use in the aptly named "fields" element. This is where you actually define the fields that will be in your index, the type of analysis to perform on the field, if it should be indexed, stored, have term vectors, or be multivalued (have multiple instances of the same field in the index).

Let's say you're wanting to develop an indexing schema for books (hey, I work in a library). At a very basic level, you'd want a field for an id, title, author, reviews, and a set of topics (or tags). Your fields element would contain something along the lines of:

<field name="id" type="string" indexed="true" stored="true"/>
<field name="title" type="text" indexed="true" stored="true" termVectors="true" />
<field name="titleStr" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="author" type="text" indexed="true" stored="true" termVectors="true" />
<field name="authorStr" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="review" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="topic" type="text" indexed="true" stored="true" multiValued="true" termVectors="true"/>
<field name="topicStr" type="string" indexed="true" stored="false" multiValued="true"/>

You'll notice that I have a couple of extra fields for title, author, and topic. These are for the faceting info; they are just untokenized fields that make the facet calculations a little more efficient.

Now, we're almost done with creating the schema. We just need to declare a unique key, default search field, and default search operator.

<uniqueKey>id</uniqueKey>
<defaultSearchField>title</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>

Remember when I made the fields with the "Str" suffix? We can use a really cool feature of Solr called a "copyField" that literally copies the information from one field to another.

<copyField source="author" dest="authorStr"/>
<copyField source="title" dest="titleStr"/>
<copyField source="topic" dest="topicStr"/>

It's worth mentioning here that Solr indexes are not databases! While there are some similarities in the way that Solr allows you to add, update, select, store, and delete information from the index, Solr isn't an RDBMS. I've seen a few discussions where there is some confusion as to why Solr can't do the equivalent of a stored procedure, or some other function of a database.

Now, your index server is ready to receive documents to search against. With the above example as its schema, the server will expect information in the following format:

<doc>
   <field name="id">1</field>
   <field name="title">Solr Rocks!</field>
   <field name="author">Barr, Foo</field>
   <field name="review">This book rocks!</field>
   <field name="review">This book is horrible!</field>
   <field name="topic">information retrieval systems</field>
   <field name="topic">xml</field>
   <field name="topic">search</field>
   <field name="topic">apache foundation</field>
</doc>
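When the document is actually sent to Solr's XML update handler, it goes inside an <add> element, and the change only becomes searchable after a commit. A minimal sketch of the add message, with the field values trimmed down from the example above:

```xml
<!-- Posted to the update handler; one <add> can hold many <doc> elements -->
<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Solr Rocks!</field>
    <field name="author">Barr, Foo</field>
    <field name="topic">search</field>
  </doc>
</add>
<!-- A <commit/> message is then posted as a separate request to make the document searchable -->
```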

Next week when I get some time, I'll write about creating facet queries...

Real Life XSLT 2.0 transformations

I ran into a bit of a situation that was really blowing my mind. I have a rather large XML file (around 20,000+ lines) marked up in TEI that I wanted to do some transformations on (a day book and ledger from the 1850s). Essentially the code follows the format

...
<figure>
   <head>Page 12</head>
   <graphic url="0023_p12"/>
</figure>

<fw type="header" place="top-center">
   <name type="place" key="7022220">Williamsburg</name>,
   <date value="1850">1850</date>,
</fw>

<table>
   <row>
      <cell>
         <date value="1850-10-03"><choice><abbr>Oct<hi rend="sup;underline">r</hi></abbr><expan>October</expan></choice> 3<hi rend="sup">th</hi> 1850</date>
      </cell>
      <cell>
         <name type="person" key="griffss01">Doct<hi rend="sup;underline">r</hi> S S Griffin</name>
       </cell>
       <cell>&nbsp;</cell>
   </row>
   ...
</table>
<pb/>
...

What I wanted to accomplish was to group each page's content together in its own div for HTML output (ok, I actually need to write each page to its own file, but that is pretty much just one more step).

I just could not find a way to group this info this way using XSLT 1 without wrapping each page within its own div structure. I didn't really want to go back and do this, so I asked the TEI-L list. David Sewell pinged me back with some XQuery code that recursively recalls the document structure for a given node.

He also mentioned that it would be pretty easy to write an XSLT 2 transformation that groups these nodes together. I did a little bit of digging and came up with

<xsl:template match="tei:div">
   <xsl:for-each-group select="*" group-ending-with="tei:pb">
      <div class="page">
          <xsl:apply-templates select="current-group()" />
      </div>
   </xsl:for-each-group>
</xsl:template>

This transformed the pages to what I was wanting

<div class="page">
   <img src="0023_12.png" alt="Page 12" />

   <h1 class="fw">Williamsburg, 1850,</h1>

   <table>
      <tr>
         <td>
            <span class="abbr">Oct<sup><u>r</u></sup></span><span class="expan">October</span> 3<sup>th</sup> 1850
         </td>
         <td>
            <a href="javascript:getName('griffss01');">Doct<sup><u>r</u></sup> S S Griffin</a>
         </td>
         <td>&nbsp;</td>
      </tr>
      ...
   </table>
</div>

<div class="page">

   ...
</div>
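The "one more step" of writing each page to its own file could look something like the sketch below, which wraps each group in xsl:result-document. The TEI namespace URI and the page-N.html naming scheme are assumptions on my part; adjust both to match your own document and site layout.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the tei namespace URI below is an assumption about the source document -->
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="tei">

  <xsl:template match="tei:div">
    <!-- Each group runs up to and including the next page break -->
    <xsl:for-each-group select="*" group-ending-with="tei:pb">
      <!-- position() is the number of the current group, so files are page-1.html, page-2.html, ... -->
      <xsl:result-document href="page-{position()}.html" method="html">
        <html>
          <body>
            <div class="page">
              <xsl:apply-templates select="current-group()" />
            </div>
          </body>
        </html>
      </xsl:result-document>
    </xsl:for-each-group>
  </xsl:template>

</xsl:stylesheet>
```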

The XSLT processor for ColdFusion doesn't support XSLT 2.0 (it's still a draft spec). However, Saxon does (specifically Saxon 8). For more on doing XSLT transformations, see XSLT 2.0 in ColdFusion.

Getting XML from MSSQL Server

I've been playing with all the AJAX stuff that's been coming out lately. I suppose that like a lot of folks, I was creating a query, then having a generic function that created the XML in a proxy file for the JavaScript (Ray Camden has a really nice function for transforming a query to XML).

Last week I was doing some research to find a way to do some XML searching and stumbled upon the FOR XML clause. I knew that most RDBMSs were capable of dealing with XML record sets, but it's been years since I've even looked at any of the XML stuff for MSSQL.

Adding FOR XML to a SELECT statement returns the query result with its rows transformed into XML elements. There are three modes it can take:

  • RAW: Transforms each row into an element with a generic identifier (<row/>) as the element tag.
  • AUTO: Returns the results in a simple nested XML tree.
  • EXPLICIT: Allows you to define the shape of the XML tree returned.


Controller Generator Update

Ray asked an interesting question on the modelglue list about the proper way to return from a controller CFC. In older versions of Model Glue, you had to return the event; however, changes have since been made so that event objects are passed by reference, which makes returning the event obsolete. I had missed this subtle change, and the generator XSL was still returning the event passed to it. I've updated the code to return void.

So, you can point your transformer to http://swem.wm.edu/blogs/waynegraham/software/mg-stub-generator.xsl (or download it and change it to your liking) to reflect the updated method.

Stub Generator for Model-Glue

In working on a recent project, I got tired of going back and forth between my modelglue config file and then writing controllers, so I wrote an XSLT to generate method stubs for me. This stylesheet takes the modelglue.xml in the config folder and generates controller files with method stubs. Nothing too fancy, just with default controller stub code.

So, here are some warnings...

DO NOT test this code out on a live app. It can overwrite your existing controller(s) and leave you with just method stubs.

I use the XSLT 2.0 xsl:result-document element to write the resulting CFC file, so you'll need an XSLT 2.0 compliant processor (Saxon 8 is currently the only one available). I use oXygen's Eclipse plugin, and all you have to do is point the XSL URL to http://swem.wm.edu/blogs/waynegraham/software/mg-stub-generator.xsl (or download it) and change the transformer to Saxon8B.

One last note: most of the samples for modelglue have the default controller named "myController", which can be different from the actual path to your controller. The XSLT is based on the name attribute of the controller element, so make sure you change this value to what you want your controller file to be named.

So, what does the code look like? Like I said, it's relatively simple (less than 40 lines). First, I set the output method to text and then loop over each of the controller nodes creating a file for each:

<xsl:output method="text"/>
<xsl:strip-space elements="*"/>

<xsl:template match="/">
   <xsl:for-each select="/modelglue/controllers/controller">
      <xsl:variable name="filename" select="concat('../controller/',@name, '.cfc')" />
      <xsl:message>
         Creating <xsl:value-of select="$filename"/>
      </xsl:message>
      <xsl:result-document href="{$filename}" method="text" indent="yes">
         <xsl:call-template name="mg-controller" />
      </xsl:result-document>
   </xsl:for-each>
</xsl:template>

The filename variable is what tells xsl:result-document where to write the file, so if you want the stub in a different place, just change the relative path. This code assumes that you're currently in the /appname/config folder. Inside the xsl:for-each loop, I call a template (mg-controller) that actually creates the component code.

<xsl:template name="mg-controller">
&lt;cfcomponent name="<xsl:value-of select="@name"/>" displayname="<xsl:value-of select="@name"/>" output="false" hint="I am a generated controller" extends="ModelGlue.Core.Controller"&gt;
   &lt;cffunction name="init" access="Public" returnType="Controller" output="false" hint="I build a new controller"&gt;
      &lt;cfargument name="ModelGlue" required="true" type="ModelGlue.ModelGlue" /&gt;
      &lt;cfargument name="InstanceName" required="true" type="string" /&gt;
      &lt;cfset super.Init(arguments.ModelGlue) /&gt;

      &lt;!--- Controllers are in the application scope: Put any application startup code here. ---&gt;

      &lt;cfreturn this /&gt;
   &lt;/cffunction&gt;
   <xsl:for-each select="message-listener">
   <xsl:sort select="@function"/>
   &lt;cffunction name="<xsl:value-of select="@function"/>" access="Public" returnType="ModelGlue.Core.Event" output="false" hint="I am an event handler."&gt;
      &lt;cfargument name="event" type="ModelGlue.Core.Event" required="true"&gt;
      &lt;!--- TODO: Implement <xsl:value-of select="@function" /> function ---&gt;

      &lt;cfreturn arguments.event /&gt;
      &lt;/cffunction&gt;
   </xsl:for-each>
&lt;/cfcomponent&gt;
</xsl:template>
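For reference, the stylesheet only looks at the controller elements and their message-listener children, so the relevant part of the config it reads would be shaped roughly like this. The controller name, type, and listener names here are placeholders, not taken from a real application.

```xml
<!-- Hypothetical excerpt of config/modelglue.xml; all names are placeholders -->
<modelglue>
  <controllers>
    <!-- @name becomes the file name: ../controller/bookController.cfc -->
    <controller name="bookController" type="appname.controller.bookController">
      <!-- Each @function becomes one generated cffunction stub, sorted by name -->
      <message-listener message="OnRequestStart" function="onRequestStart" />
      <message-listener message="getBooks" function="getBooks" />
    </controller>
  </controllers>
</modelglue>
```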

You might notice that the tabs are a little out of whack...this is on purpose, since text output carries the whitespace in the stylesheet straight through to the generated code.

What would be really nice is to adapt the code to get something like MyEclipse's Struts editor that allows you to graphically map out your nodes and pages by filling out a wizard. Now if I can just figure out Eclipse's GEF...

RSS Information Visualization

Last week I attended a lecture by Andries van Dam (Brown University's Vice President for Research) entitled Immersive Virtual Reality in Scientific Visualization. He highlighted how their Cave project was allowing geologists to explore Mars (and train astronauts for their eventual missions on the planet). He closed the presentation with a couple of tablet PC applications, one for chemistry that allows you to hand draw 3D models of molecules and the other, a math visualization program.

The math application (MathPad2) was actually a lot cooler. The premise of the software is that it's far easier to hand write complex mathematical formulas than it is to type them in (ever try to type in a calculus problem?). Not only does the software allow you to write down your problem, you can then draw pictures to simulate the mathematical formulas you wrote down. You can see an example of the difference between constant speed and velocity (along with damped harmonic oscillation) at their site.

These examples got me thinking about how we visualize our data. Take a blog for example. We generally tag (or otherwise categorize) the content we author. But what exactly does that mean? I believe this is where data visualization begins to play an important role.

When you go to a site with a tag cloud, you can quickly infer information about the content of that site. If, say, ColdFusion is denoted in a larger font in the cloud than ASP, you may conclude (even if at a subconscious level) that the site is geared more toward ColdFusion than ASP.

There's a small (some may argue that it's actually quite significant) flaw in this data visualization...you cannot see the interaction between the different tags, nor their relation to individual postings.


XSLT 2.0 in ColdFusion

While not totally on topic for this posting, Steven Erat's explanation of using the XMLSearch function with default namespaces is definitely worth reading.

XSLT 2.0 was released as a W3C Candidate Recommendation earlier this month, on November 3 (http://www.w3.org/TR/2005/CR-xslt20-20051103/). There are a lot of nice new features that go along with the new specification (user functions, grouping, multiple outputs, temporary trees, character mappings, datatype bindings, etc.). And, it's relatively painless to implement in ColdFusion.

The first thing you need is an XSLT 2.0 compliant processor. Right now, that means Saxon 8. You'll need to put at least the Saxon8.jar file into your ColdFusion classpath (see Christian Cantrell's cheat sheet).

The next part is to write a wrapper for the transformation. Because Mark Mandel already wrote a replacement for XMLTransform that allows for parameters, I used his xslt function code as a jumping off point. Really there's only one line that has to be changed (other than renaming the function). The code line

var tFactory = createObject("java", "javax.xml.transform.TransformerFactory").newInstance();

simply needs to be rewritten as

var tFactory = createObject("java", "net.sf.saxon.TransformerFactoryImpl");

You can grab the function here.

So, now a quick example of the power of XSLT 2.0...

Let's say you have an XML file that lists cities in the US with their state and populations:

<?xml version="1.0" encoding="UTF-8"?>
<cities>
<city name="Williamsburg" state="Virginia" pop="11998" />
<city name="New York City" state="New York" pop="80000" />
<city name="Washington" state="DC" pop="553523" />
<city name="Richmond" state="Virginia" pop="300000" />
</cities>

Now, you want to display cities grouped by state and output the state's total population.

In XSLT 1.0, you needed to rely on XPath queries with nested for loops:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="/">
<table>
<xsl:for-each select="cities/city[not(@state=preceding::*/@state)]">
<tr>
<td><xsl:value-of select="@state" /></td>
<td>
<xsl:for-each select="../city[@state = current()/@state]">
<xsl:value-of select="@name"/>
<xsl:if test="position() != last()">, </xsl:if>
</xsl:for-each>
</td>
<td>
<xsl:value-of select="sum(../city[@state=current()/@state]/@pop)" />
</td>
</tr>
</xsl:for-each>
</table>
</xsl:template>
</xsl:stylesheet>

However, in XSLT 2.0, the xsl:for-each-group instruction makes this same transformation much more straightforward:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:template match="/">
<table>
<xsl:for-each-group select="cities/city" group-by="@state">
<tr>
<td><xsl:value-of select="@state" /></td>
<td>
<xsl:value-of select="current-group()/@name" separator=", " />
</td>
<td><xsl:value-of select="sum(current-group()/@pop)" /></td>
</tr>
</xsl:for-each-group>
</table>
</xsl:template>
</xsl:stylesheet>

Notice the difference in the amount of code, and in how you access the different attributes. The current-group() function takes away much of the complexity of getting back to the current node in your transformation, making development quicker and less buggy.

You can see the results from the above example here.

This seriously only scratches the surface of what you can do with ColdFusion and XSLT 2.0. The Saxon processor also implements XQuery 1.0 and XPath 2.0, opening a whole new set of possibilities in ColdFusion.

Default XML Namespaces and ColdFusion

I ran into a bit of a problem late last week working with XML in ColdFusion. I was writing my XSL in oXygen, where it was running beautifully, but when it came time to move it to CF and use XmlTransform, I kept getting the error "An error occured while Transforming an XML document. Content is not allowed in prolog."

Essentially, the XmlTransform function didn't like my root node:

<modsCollection xmlns="http://www.loc.gov/mods/v3">

The funny thing to me was that CF appears to use Xalan for XML processing, which is what I was also using, since my stylesheet relies on java.net.URLEncoder. So I figured I'd need to start writing a wrapper to access the Xalan package I was using. I did a little Googling and found this over at Compound Theory.

The post originally is about using , but also solved this problem as the javax packages have parsers that handle default namespace (without a prefix). It's a good read, especially if you try to do anything with more complex XML in ColdFusion.
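As a side note on default namespaces in general (separate from the parser swap described above), XPath expressions in a stylesheet only match elements in a default namespace if you bind that namespace to a prefix and use the prefix in your paths. A minimal sketch, assuming the collection holds ordinary MODS records with titleInfo/title elements:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch: the "mods" prefix is arbitrary; only the namespace URI has to match the document -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:mods="http://www.loc.gov/mods/v3">

  <xsl:output method="text"/>

  <!-- An unprefixed match="modsCollection" would not match the namespaced root element -->
  <xsl:template match="/mods:modsCollection">
    <xsl:for-each select="mods:mods">
      <xsl:value-of select="mods:titleInfo/mods:title"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```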

Export OPML from Thunderbird

I've been using Thunderbird as my RSS reader for a while now...and it does a pretty decent job. However, with the flurry of posts about the Google RSS reader, I decided to have a look at that too. There's an option to import your feeds from other programs, but this requires OPML format.
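If you haven't seen one before, an OPML subscription list is just a small XML file of outline elements, one per feed. A bare-bones sketch, with a made-up feed title and example.com URLs standing in for real ones:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative only: the title and URLs are placeholders -->
<opml version="1.1">
  <head>
    <title>My Subscriptions</title>
  </head>
  <body>
    <!-- xmlUrl points at the feed itself, htmlUrl at the site it belongs to -->
    <outline text="Example Blog" type="rss"
             xmlUrl="http://example.com/rss.xml"
             htmlUrl="http://example.com/" />
  </body>
</opml>
```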

Unfortunately, Thunderbird doesn't offer an option to create this output file out of the box. After a little digging, though, I found that there actually is a way to get Thunderbird to export OPML from your aggregator. I found this post by Dougal Campbell where he basically fixed the bug that hid the export/import function in the "Manage Subscriptions" portion.

All you need to do is exit Thunderbird and download his patched newsblog.jar file. Then, go to your Thunderbird installation home (Program files\Mozilla Thunderbird\chrome) and rename newsblog.jar to something else (he recommends 'newsblog.jar.orig') and put the patched version in there. Now, when you start Thunderbird and go to manage your subscriptions, you will see the option to import and export OPML files.

Then just head over to Google Reader and upload the file. It takes a little while to upload (I suspect they're getting hammered pretty hard right now). You'll probably also want to remove the "patched" newsblog.jar and rename your newsblog.jar.orig back to its original name, since the patched file is not part of the normal distribution.

oXygen Template for Model-Glue 1.0

I was writing some XML templates today in oXygen, and since Model-Glue 1.0 was released today, I decided to make one for model-glue. If you use <oXygen/> (a really great XML editor by the way that also has an Eclipse plugin), you can use the template by clicking on File --> New From Templates... (in Eclipse, File --> New --> New From Templates...)

You should get the templates dialog box (you have to name your file in Eclipse first), but click on the From URL and type in

and click on the Load button. You should see a new Model-Glue 1.0 template. Click OK.

The template adds a couple of things, and assumes that you have extracted the modelglue.dtd file into /ModelGlue. I added an XML declaration, a public DOCTYPE declaration, and an entity reference for appName.

Anyway, if you use <oXygen/> as your XML editor, this can save you a little time.

Atom for BlogCFC

There are a lot of different formats for generating syndicated content (RSS 1 and 2, OPML, Atom, etc.). Currently (as of 2005), there are two big players, RSS 2.0 and Atom 1.0. One of the drawbacks of the RSS 2.0 standard is that it is copyrighted by Harvard University, and work on the specification has halted. Because of this, future work has to be carried out under a different name (and organization). The Atompub Working Group is one organization carrying out further development of online syndication with the Atom 1.0 standard.

Without getting too far into the differences between the two, I thought one of the coolest things about Atom is that it can be transferred over any network protocol (like XMPP). Atom also defines its own XML namespace and comes with a non-normative RELAX NG schema.

Anyway, I extended this blog (BlogCFC) to generate an Atom feed (in addition to the RSS 1.0 and RSS 2.0 feeds it already generated). The code isn't quite ready to post yet, but I basically created a new file named atom.cfm, which is almost exactly like rss.cfm, but calls a new function in blog.cfc called generateAtom.

The generateAtom function creates the Atom XML, and includes a stylesheet from the AtomEnabled.org site. Because Atom uses the xhtml namespace, it actually uses your browser's default HTML stylesheet rather than the XML stylesheet with the tree view.
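To give an idea of the shape of the output, a minimal Atom 1.0 feed with one XHTML entry looks roughly like the following. This is a hand-written sketch rather than the actual output of generateAtom, and the titles, ids, dates, and example.com URLs are all placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Placeholder values throughout; not real output from the blog -->
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <id>http://example.com/</id>
  <updated>2005-11-15T12:00:00Z</updated>
  <author>
    <name>Example Author</name>
  </author>
  <link rel="self" href="http://example.com/atom.cfm" />

  <entry>
    <title>First post</title>
    <id>http://example.com/entries/1</id>
    <updated>2005-11-15T12:00:00Z</updated>
    <link rel="alternate" href="http://example.com/entries/1" />
    <!-- XHTML content is wrapped in a single div in the xhtml namespace -->
    <content type="xhtml">
      <div xmlns="http://www.w3.org/1999/xhtml">
        <p>Entry body goes here.</p>
      </div>
    </content>
  </entry>
</feed>
```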

I have validated the feed against the Feed Validator, though the enclosures and subjects aren't currently in there...they will be added shortly.
