I’ve been using Apache SOLR 1.4 as an indexing server for search lately. Among the fields I index are people’s names. Most of the users are English speakers, and many use their proper English name on their profiles, but their friends or colleagues only know to search for them via their common nickname. Thus, if a user stores “Kimberly” as her first name, a search for “Kim” returns no results. That’s because SOLR doesn’t know how to “stem” proper nouns. Perhaps somebody out there has written a SOLR “common English names stemmer,” but I haven’t found it.
A seemingly easy solution would be to wrap all the search queries in wildcards under the hood, so if a user enters into the search field “Kim” we silently change that to “*kim*” or some such, and only use that wildcard pattern for the “name” field in the index. For example, our query might become something like:
Where the “text” field is an aggregation of all the fields we index for each document, and the “name” field only indexes the person’s name.
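A rough sketch of that rewriting step in Python might look like the following. The helper name `build_wildcard_query` and the exact shape of the query string are assumptions for illustration only; the sketch does not escape query syntax characters and is not necessarily the query described above.

```python
def build_wildcard_query(user_input: str) -> str:
    """Search the aggregate "text" field as-is and the "name" field with wildcards.

    Hypothetical helper: it does no escaping of special query characters.
    """
    terms = user_input.lower().split()
    if not terms:
        return ""
    text_clause = " ".join(f"text:{term}" for term in terms)
    # Wrap every term in leading and trailing wildcards for the name field only.
    name_clause = " ".join(f"name:*{term}*" for term in terms)
    return f"({text_clause}) OR ({name_clause})"


print(build_wildcard_query("Kim"))  # (text:kim) OR (name:*kim*)
```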
Not only does this seem a little hacky, there are a few problems. One is that I use the DisMaxRequestHandler, which doesn’t allow wildcard search patterns. Another is that, as of SOLR 1.4, leading wildcards generally don’t play nice, though I believe there are ways to handle them. Finally, a search for “Richard” could never find “Dick” (or vice versa) via this method. We need something more than just simple stemming.
SOLR provides a “synonyms” token filter. This essentially allows us to create a map of words that should be considered equal. Thus, we could map:
Kim => Kimberly
SOLR would then know that “Kim” means “Kimberly” and rewrite a search for “Kim” into a search for “Kimberly”. However, searching for “Kimberly” does not also search for “Kim”. The “=>” arrow delimiter specifies that the map is one way: the tokens on the left are replaced by the tokens on the right. (To keep the original token in the search as well, repeat it on the right-hand side, as in Kim => Kim, Kimberly.) By configuring the synonyms token filter to “expand” the map, designating this map:
Kim, Kimberly
(with commas) means that the synonyms work in both directions. Searching for “Kim” or “Kimberly” will search for both. And you can specify multiple synonyms in one line, like so:
Kim, Kimmy, Kimberly, Kimberlicious
Or, for one way (do you really want a search for “Kimberly” to also search for “Kimberlicious”?):
Kim, Kimmy, Kimberlicious => Kimberly
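To make the difference between the two rule styles concrete, here is a small Python sketch that mimics how I understand the synonyms filter to treat them. It is a simplified model, not SOLR’s implementation: `parse_rules` and `expand_token` are made-up names, and multi-word synonyms are ignored.

```python
def parse_rules(lines, expand=True):
    """Build a map of token -> set of tokens that a query for it will search."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit one-way rule: left-hand tokens are replaced by the right-hand ones.
            left, right = line.split("=>", 1)
            sources = [t.strip().lower() for t in left.split(",") if t.strip()]
            targets = {t.strip().lower() for t in right.split(",") if t.strip()}
        else:
            # Comma list: every token maps to the whole list when expand is true,
            # otherwise every token is reduced to the first entry.
            sources = [t.strip().lower() for t in line.split(",") if t.strip()]
            if not sources:
                continue
            targets = set(sources) if expand else {sources[0]}
        for source in sources:
            mapping.setdefault(source, set()).update(targets)
    return mapping


def expand_token(mapping, token):
    """Return the tokens searched for a query token (itself if no rule applies)."""
    token = token.lower()
    return mapping.get(token, {token})


two_way = parse_rules(["Kim, Kimmy, Kimberly"])
one_way = parse_rules(["Kim, Kimmy => Kimberly"])

print(sorted(expand_token(two_way, "Kim")))       # ['kim', 'kimberly', 'kimmy']
print(sorted(expand_token(one_way, "Kim")))       # ['kimberly']
print(sorted(expand_token(one_way, "Kimberly")))  # ['kimberly']
```

In this model the one-way rule sends “Kim” to “Kimberly” only, while “Kimberly” is left alone.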
You can download my SOLR-formatted synonyms file. You should note that this file doesn’t do any one-way mapping, even though it probably should in some cases. For example, based on this synonyms line:
Caroline, Carolyn, Carolina, Carlyne, Carline, Karoline, Carrie, Carry, Caddie, Caddy, Carlie, Carly, Callie, Cally, Carol, Lynn, Lynne, Lin
A search for “Carlie” also searches for “Lynn”, which is probably not desirable. In reality, this line should probably be broken up into two or more maps that make a little more sense:
Carrie, Carry, Caddie, Caddy, Carlie, Carly, Callie, Cally, Carol, Caroline, Carolyn, Carlyne, Carline, Karoline
Lynn, Lynne, Lin, Caroline, Carolyn, Carlyne, Carline, Karoline
Here all the nicknames become grouped, even though the “real names” are repeated. For example, a search for “Lynn” now searches for everything on line 2, but not, say, “Caddy”. And a search for “Carolyn” will search for “Caddy” and “Lynn”, since “Carolyn” appears on both lines.
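The overlap can be checked with a few lines of Python. This is only a toy model of the two-line split, assuming that a name appearing on several lines picks up the synonyms from every one of those lines; the `expand` helper is made up for the illustration.

```python
# The two proposed synonym lines, one string per line of the file.
GROUPS = [
    "Carrie, Carry, Caddie, Caddy, Carlie, Carly, Callie, Cally, Carol, "
    "Caroline, Carolyn, Carlyne, Carline, Karoline",
    "Lynn, Lynne, Lin, Caroline, Carolyn, Carlyne, Carline, Karoline",
]


def expand(name):
    """Collect every name that shares at least one synonym line with `name`."""
    name = name.lower()
    result = {name}
    for line in GROUPS:
        tokens = {token.strip().lower() for token in line.split(",")}
        if name in tokens:
            result |= tokens  # merge the whole line into the result
    return result


print("caddy" in expand("Lynn"))     # False
print("lynn" in expand("Carlie"))    # False
print("caddy" in expand("Carolyn"))  # True
print("lynn" in expand("Carolyn"))   # True
```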
I have not done any empirical benchmarking using this synonyms file, but I can say based on my observations that searches do run slower using the synonyms filter. I don’t know the exact performance implications of making the file more complicated, but I assume that length and overlapping maps/words would only serve to slow things down.
To use the synonyms filter with this file, I first created a special field type in my SOLR schema.xml file:
<fieldType name="textEnglishName" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="english_names.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Nothing fancy here! I’m using the ASCIIFoldingFilter to replace accented characters when possible with ASCII equivalents. Also, no stemming filter is present. And, the synonyms filter is only used during query, not during index.
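For a feel of what the folding and lowercasing steps do to a name at index time, here is a loose Python approximation. It is only a sketch: `fold_to_ascii` and `analyze_name` are hypothetical helpers, Unicode decomposition covers fewer characters than the ASCIIFoldingFilter, and the word delimiter and duplicate-removal filters are left out entirely.

```python
import unicodedata


def fold_to_ascii(token):
    """Strip combining accents after decomposing the token (rough ASCII folding)."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze_name(value):
    """Whitespace-tokenize, fold accents, then lowercase each token."""
    return [fold_to_ascii(token).lower() for token in value.split()]


print(analyze_name("Renée Zoë"))  # ['renee', 'zoe']
```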
Then, I just use the “textEnglishName” field type for any field that indexes a person’s name:
<fields>
  <field name="name" type="textEnglishName" indexed="true" stored="false"/>
</fields>