Facebook App Prefetching Looks Like a DDoS Attack

Recently, Facebook announced that their mobile application will implement content pre-fetching. This means that if somebody creates an FB post about a page on your website (or you run Facebook ads with a link to your site), as FB users view their timeline and see that post, the mobile app fires off a “GET” request to the linked content on your server. The FB app caches that content for a short time, and if the user clicks on the post, the app serves up the cache before sending you on to the actual site so that the response time appears to be reduced.

This is both good and bad. It’s good because who doesn’t want their website to appear to load faster for users who are trying to reach their content? It’s potentially bad, because the higher the “post reach” in the Facebook network, the more prefetching that is going to occur on your server. And it looks a lot like a DDoS attack: spikes of traffic from all over the world, but none of it gets logged in your JS analytics solutions (since prefetching does not parse and execute the JS).

Recently, even with a small reach (60,000 users reached), we experienced a consistent surge of 60 requests per minute, in bursts lasting 10 minutes at a time, as a boosted post and some ads rolled out across the FB network. All on Android devices (according to logged user agents) from the FB mobile app web-view user agent, mostly in the ads’ target geographic region.

With enough cash and desire to drive user acquisition, we could essentially pay Facebook to DDoS ourselves. (Or if you’re lucky enough to have a huge page following, your post could potentially do that without boosting).

Facebook does send the “X-Purpose: preview” header to let you know what the request is about, but standard “combined” format log files will at first look a bit confusing (lots of traffic, random IPs, all from Android devices, and nothing logged in your JS analytics platforms).

In NGINX try this:

log_format combinedPurpose '$remote_addr - $remote_user [$time_local]  '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent" purpose:$http_x_purpose';

access_log /var/log/nginx/access.log combinedPurpose;
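With the purpose field logged, a quick tally of prefetch versus ordinary traffic is easy to script. A minimal sketch in Python (the sample log lines are fabricated for illustration; nginx substitutes “-” when the header is absent):

```python
import re
from collections import Counter

# Matches the trailing "purpose:<value>" field of the combinedPurpose
# format above; nginx logs "-" when the X-Purpose header is absent.
PURPOSE_RE = re.compile(r'purpose:(\S+)\s*$')

def tally_purposes(lines):
    """Count requests per X-Purpose value."""
    counts = Counter()
    for line in lines:
        match = PURPOSE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Fabricated sample lines in the combinedPurpose format:
sample = [
    '1.2.3.4 - - [01/Jan/2015:00:00:01 +0000] "GET /page HTTP/1.1" 200 512 "-" "FB-app-webview" purpose:preview',
    '5.6.7.8 - - [01/Jan/2015:00:00:02 +0000] "GET /page HTTP/1.1" 200 512 "-" "Mozilla/5.0" purpose:-',
]
print(tally_purposes(sample))
```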

I’m not sure I agree with Facebook’s decision to do this. If website operators need to prepare extra capacity just to handle users scrolling through an app (with a huge install base) that they don’t control, this seems like a huge waste of resources (electricity, virtual machine images spun up to respond to traffic), or operators need a very smart caching plan. All for the possibility that a user clicks a post and saves a few seconds.

Using Iodine DNS Tunneling on OS X Mavericks

For a long time I have had a T-Mobile unlimited data plan which allowed me to tether my laptop to my Android Moto G LTE phone running Android KitKat 4.4.3. After switching plans, tethering is now blocked by an “up-sell” screen when I connect to my hotspot. I don’t really mind paying for tethering, so I called up T-Mobile, only to be told my particular day-by-day “unlimited” data plan doesn’t even have a tethering add-on, even though I’m willing to pay a little extra for it. Well then. Even the “un-carrier” is still a carrier.

So — what to do?

Well, back in the old days, T-Mobile used to block tethering just by inspecting the browser’s User-Agent string and would redirect users to an up-sell tethering page if a mobile browser wasn’t detected. That might stop most users, but with User-Agent switching plugins readily available for all the major browsers, this used to be an easy workaround.

After a lot of googling, nobody really seems to know definitively how T-Mobile is detecting tethering, and thus there are lots of proposed workarounds. Many suspect that T-Mobile is inspecting the packets themselves, rather than their payloads, to determine where those packets originated, and blocking tethered packets.

Perhaps it has something to do with how KitKat sets up separate routing tables for tethered data, thus allowing the carrier to differentiate between tethered and non-tethered data. By rooting your phone you can set up different routing tables, and that might work, but I don’t want to root my phone (yet).

Perhaps the carrier is inspecting packets’ TTL values. Since packets from the tethered computer have a different TTL value than packets from the phone, the carrier could discriminate which packets to block this way. By changing OS X’s TCP TTL value, perhaps we can slip around the roadblock. This didn’t work for me.

Some people reported success after some technical machinations to set the “tether_dun_required” flag on the Android phone from 1 to 0. Perhaps this worked for a while, or perhaps on some phones, but it doesn’t work for me on the Moto G.

UPDATE: Actually, if you are on T-Mobile and tethering is blocked, there is something quite easy you can try to get it working, which did work for me: you need to update the T-Mobile APN (“Access Point Name”) configuration. To do so on your Android: Settings -> More… -> Mobile Networks -> Access Point Names. Then, click “T-Mobile US LTE” (it might have a different name, but the URL you see underneath it should be “fast.t-mobile.com”). Tap “APN type” and to that setting, append this:

,dun

Then save the settings. Tethering works again. My understanding of this “dun” APN type is that it stands for “Dial Up Networking”. Essentially, when your phone requires network access, it needs to connect to the data network to do any variety of things, say use the internet or send an MMS. When your phone’s tethering hotspot is enabled, it is requesting to use this “dun” APN type. Android then connects using the settings for an APN in your list that has the “dun” type. Apparently, the fast.t-mobile.com APN seems to allow tethering, whereas the other “dun” APN that existed on my phone, pcweb.t-mobile.com, does not. That seems a little fragile and one day T-Mobile may get wise. But for now, that works.

So — what about this whole Iodine DNS tunneling thing, then?

Well, we still have airport hotspots to tunnel through, now don’t we?

DNS Tunneling basically means that if your computer can send and receive valid DNS responses, we can hide our network traffic inside the DNS packets. This means we need to run a server process (iodine) on a remote machine with port 53 open to receive and deal with these packed DNS packets, configure DNS entries to point to that server in a particular way, and then run a local client on your OS X machine. Once you’ve got that client running, you can basically access the remote machine, but then you have to route your computer’s traffic through it. This can be done in different ways. If you’re just browsing the web, probably the easiest way is via an SSH SOCKS proxy. Or, you could fiddle with your routing tables to send all your traffic over the tunnel.

Let’s begin.

The DNS Entries

I use Amazon’s Route 53 service, which makes it easy to manipulate DNS records for any domain you control. The basic process is that you need to set up an “NS” record for a subdomain of a domain you control. It can be a little confusing:

  • Let’s say you control example.com.
  • Choose a subdomain you want to use; it can be anything, say: t.example.com (t for tunnel! keep it short)
  • We’re going to run the iodined process on a server at IP Address A.B.C.D.
  • This iodined process on A.B.C.D is going to act as the canonical Nameserver for the t.example.com subdomain

This iodined server process binds to port 53, just like a DNS server. Our iodine client process that we’ll run on our laptop is going to take our computer’s traffic and wrap it up into DNS requests for t.example.com. Any upstream DNS server from our client is going to say “Oh, hey, you should query A.B.C.D for the IP of t.example.com — send those packets over there”.

Iodined will then take our DNS requests with the wrapped-up traffic, dump the traffic onto the server’s network, get a response, and then “answer” the DNS request with a wrapped-up response. The packets look like normal DNS traffic, so in theory they should be passed around the internet “per usual”, except that these DNS packets contain extra data, namely the traffic to and from your computer.
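The wrapping idea can be illustrated in a few lines: stuff bytes into DNS-safe labels under the tunnel domain. This is a deliberately simplified sketch (iodine’s actual wire encoding is more elaborate), but it shows why the payload survives a trip through ordinary resolvers:

```python
import base64

TOPDOMAIN = "t.example.com"  # the tunnel subdomain from the setup above
MAX_LABEL = 63               # DNS limits each label to 63 characters

def wrap(data: bytes) -> str:
    """Pack arbitrary bytes into a DNS-safe query name under TOPDOMAIN."""
    text = base64.b32encode(data).decode().rstrip("=").lower()
    labels = [text[i:i + MAX_LABEL] for i in range(0, len(text), MAX_LABEL)]
    return ".".join(labels + [TOPDOMAIN])

def unwrap(name: str) -> bytes:
    """Recover the bytes from a query name produced by wrap()."""
    payload = "".join(name[:-(len(TOPDOMAIN) + 1)].split("."))
    padded = payload.upper() + "=" * (-len(payload) % 8)
    return base64.b32decode(padded)

query = wrap(b"GET / HTTP/1.1")
print(query)          # looks like an ordinary (if odd) DNS name
print(unwrap(query))  # the original bytes come back out
```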

Because a lot of captive portals (like a cell carrier that blocks tethering, or an airport hotspot) allow DNS traffic to the outside world (if not HTTP or other traffic), as long as we can reach/query our iodined DNS server and receive responses from it, we’re in business.

The trick is, we need to tell the world “Hey — if you want to know the IP address of t.example.com, look for it here, at DNS server A.B.C.D”.

So, basically, let’s call our nameserver ns.t.example.com. It points to A.B.C.D in our DNS setup as an A record:

ns.t.example.com A record => A.B.C.D

Now, we need to assign that shiny new ns.t.example.com as a nameserver for t.example.com; this is a “nameserver” (NS) record:

t.example.com NS => ns.t.example.com

That’s it — anytime any client wants to know the IP address of “t.example.com”, it’s going to ask “ns.t.example.com” which runs our iodined process.

The Server

I use a small Amazon AWS EC2 instance running Ubuntu. You need to make sure that the security group assigned to the instance allows incoming traffic on port 53 (the standard port for DNS processes).

As root:

apt-get install iodine

We then need to actually run an iodined process. In doing so we need to tell iodine what subnet we are going to use for our little private tunnel. Your computer is going to create a virtual tunnel device that will use this same subnet. So it’s very important to use a subnet that is not being used by the server OR your computer. Amazon EC2 uses portions of the private 10.0.0.0/8 subnet for internal addressing, and it uses portions of the link-local 169.254.0.0/16 subnet for internal services like its own DNS system. Most home routers use 192.168.0.0/24 or sometimes 192.168.1.0/24. This worked for me (192.168.99.1 here is the tunnel IP I chose, an address inside an otherwise unused subnet; iodined takes it as its first argument):

iodined -f -c -P secret 192.168.99.1 t.example.com

Replace “secret” with a passphrase that the client will also supply; we don’t want to route traffic for just anybody. (Make sure you are running “iodined” with a “d” at the end! The program “iodine” (no “d”) is for the client…)
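A quick way to sanity-check a candidate tunnel subnet against the ranges mentioned above is Python’s ipaddress module. The in_use list below is an assumption; substitute whatever networks your server and laptop actually report:

```python
import ipaddress

# Ranges assumed to be in use (adjust to your environment): EC2's private
# 10.0.0.0/8 space, the 169.254.0.0/16 link-local range EC2 uses for
# internal services, and the two subnets most home routers hand out.
in_use = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("169.254.0.0/16"),
    ipaddress.ip_network("192.168.0.0/24"),
    ipaddress.ip_network("192.168.1.0/24"),
]

def is_free(candidate: str) -> bool:
    """True if the candidate tunnel subnet overlaps none of the in-use nets."""
    net = ipaddress.ip_network(candidate)
    return not any(net.overlaps(used) for used in in_use)

print(is_free("192.168.99.0/24"))  # safe to hand to iodined
print(is_free("10.0.1.0/24"))      # collides with 10.0.0.0/8
```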


The Client

Obviously you need the iodine program installed on your laptop. The easiest way is to install homebrew and then “brew install” it:

brew install iodine

On OS X Mavericks, you are going to have to do this as well (to get the tuntap tunnel working correctly):

sudo cp -pR $(brew --prefix tuntap)/Library/Extensions/tap.kext /Library/Extensions/
sudo cp -pR $(brew --prefix tuntap)/Library/Extensions/tun.kext /Library/Extensions/
sudo chown -R root:wheel /Library/Extensions/tap.kext
sudo chown -R root:wheel /Library/Extensions/tun.kext
sudo touch /Library/Extensions/
sudo cp -pR $(brew --prefix tuntap)/tap /Library/StartupItems/
sudo chown -R root:wheel /Library/StartupItems/tap
sudo cp -pR $(brew --prefix tuntap)/tun /Library/StartupItems/
sudo chown -R root:wheel /Library/StartupItems/tun
sudo kextload /Library/Extensions/tun.kext
sudo kextload /Library/Extensions/tap.kext

Then, we should be able to run the iodine client on our localhost:

sudo iodine -f -P secret t.example.com

Note that “iodine” might not be in your PATH. If it’s not, you can call it directly from where homebrew installs programs:

sudo /usr/local/Cellar/iodine/0.7.0/sbin/iodine -f -P secret t.example.com

Note: your version might not be “0.7.0” — adjust as needed.

You should now be able to “ping” the remote server through the tunnel, at the tunnel IP you passed to iodined:

ping <server tunnel IP>
If your local iodine process complains that you are getting too many “SERVFAIL” responses, you can start the client with a smaller request interval (the -I flag), but note that the smaller the interval, the more DNS traffic you’ll be creating:

sudo iodine -f -P secret -I1 t.example.com

Routing Traffic

Now that you can ping, you can also SSH into the remote machine. If all you need is SSH, hey, you’re good to go:

ssh user@<server tunnel IP>

But most of us want to at least browse the web. The easiest way is to set up a SOCKS proxy via SSH, then tell your browsers to use that proxy to route all HTTP traffic. Another way is to fiddle with our routes to send all traffic over the tunnel.


To set up a SOCKS proxy over SSH:

ssh -N user@<server tunnel IP> -D 1080

This binds the proxy to localhost:1080. Any connections we make through localhost:1080 will be forwarded out via the remote machine. Tell OS X browsers to use the proxy:

  • Go to Settings -> Network -> Advanced -> Proxies.
  • Select “SOCKS Proxy”.
  • Set the proxy to localhost:1080
  • Click the “OK” button
  • Click the “Apply” button on the main network settings pane

Open a browser and your traffic should be routed over the SSH proxy.

Ok, that’s awesome, but what if we want Mail, Dropbox and other non-HTTP traffic to be sent over the tunnel as well? Set up some routes. Oh, and you need to set up NAT on the remote server and alter the iptables rules as well. A little more of a headache, but doable.

Routing and NAT

This script is a great way to automatically start up iodine on your laptop and it also sets up the routes (and tears them down later) for routing all traffic through the iodine tunnel:


Once you grab that, you need to alter some of the variables. In our example with a homebrew iodine and the given subnet, change the variables at the top of the script to the following:

#### EDIT HERE ####

# Path to your iodine executable

# Your top domain

# You may choose to store the password in this script or enter it every time

# You might need to change this if you use linux, or already have
# tunnels running.  In linux iodine uses dnsX and fbsd/osX use tunX
# X represents how many tunnel interfaces exist, starting at 0

# The IP your iodined server uses inside the tunnel
# The man page calls this tunnel_ip

#### STOP EDITING ####

Make sure that script is executable:

chmod a+x NStun.sh

Then run it as root:

sudo ./NStun.sh

And then try to use your browser… and it fails. Why? Because you need to make sure your remote server is set up to actually forward the packets to the outside world via NAT.

NAT on the Server

A great writeup about this process already exists, please see the section called “Configuring NAT and IP masquerading”.

And that’s it — DNS tunneling for captive portals on OS X Mavericks.

Speed / Connectivity

So this method gets us around the captive portal, but the connection is not all that fast. And sometimes, even though we’re tunneled, some portals still are not completely defeated. Often you’ll need to restart your tunnel, or perhaps your networking, if you see a dropped tunnel or connection.

The phone’s speed test results over 4G LTE:

Speed Test on a 4G LTE phone

Moments later, using tunneling, the tethered/tunneled computer’s results:

Speed Test on a DNS Tunneled Computer

Mapping NYC GIS Data with Google Maps

New York City publishes lots of geographic data from a variety of city departments. A lot of it is GIS data for mapping things like city park locations, beaches, playgrounds and bathrooms. There’s even a tree census GIS project you can download for every borough. Every street light. Zoning data. Lots of fun stuff! It’s called the NYC DataMine, the geo-data sets are here. It’s cool, but the value of the data is limited unless you’re a GIS wonk or use GIS mapping tools, and if you use Linux or Mac like me, you might be out of luck for free GUI tools. Why doesn’t the city publish its data in easier-to-read web formats? People could use it to throw onto Google Maps, make location-aware NYC applications, etc. It is possible to work with the data in this way, but it takes a little wrangling.

Let’s take a look at the city Parks GIS project.

What’s inside that zip file? It’s a set of mostly binary files that describe shapes and polygons using points and line segments that demarcate the boundaries of all the NYC parks defined in the database. The shape files are “ESRI Shapefiles”, a format created by Esri, a GIS mapping software company. According to Wikipedia, Esri has captured a lot of the GIS toolset market, and apparently NYC uses their products. Along with these shape files is a DBase 3 database that contains metadata about those shapes (like the name of the park, what borough it’s in, its area, etc.). Normally, you’d open these files in a program like ArcGIS, but I don’t use Windows. Besides, this is 2011. I want to look at it on the web, probably on a Google Map.

So we have a few issues. The first is that the binary ESRI Shapefile (Parks.shp) needs to be interpreted into some kind of serial format for easier handling. Libraries exist in different languages to read this file format, but I’ve found them to be a bit clunky and it’s easier just to get it into something else.

Shape files basically contain definitions of shapes identified by points in 2D space. These points are (obviously) meant to be plotted on a map. But what kind of map? How is that map projected? You remember from elementary school the basic Mercator Projection: Take a transparent globe that has a light in the middle, wrap a sheet of paper around the globe’s equator to form a cylinder, turn on the light and trace the lines being projected from the globe. (That’s why it’s called a projection, after all.) Actually, what we were all taught in elementary school is not exactly the correct physical method for creating the projection, but the point is that when you project a spherical object onto a 2D surface, it gets distorted somewhere. This is important because the points of a shapefile can be spatially referenced to any of a number of projections. Today on the web, we mostly use latitude and longitude as input to a mapping API (like Google, or Yahoo!) and let the service figure out how to flatten it out back into 2D. Points described in degrees latitude and longitude are “spherically projected” but I have found it rare indeed for GIS data to be described so simply. GIS data tends to be described in a different spatial reference, and this is where our NYC Parks data gets a little complicated.
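As a concrete illustration of “flattening”, here is the simple spherical Mercator formula in a few lines of Python. This is a sketch of the general idea, not the exact math any particular mapping API uses internally:

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 equatorial radius, in meters

def web_mercator(lat_deg: float, lon_deg: float):
    """Project a WGS84 lat/lon pair onto the spherical Mercator plane.

    x grows east of the prime meridian, y north of the equator; the
    y-stretching is where the familiar high-latitude distortion comes from.
    """
    lam = math.radians(lon_deg)
    phi = math.radians(lat_deg)
    x = EARTH_RADIUS_M * lam
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + phi / 2))
    return x, y

print(web_mercator(0.0, 0.0))     # the equator/prime-meridian origin
print(web_mercator(40.7, -74.0))  # roughly New York City
```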

First, we need a tool that can actually read and write shapefiles and hopefully output them into more friendly formats. This is where the Geospatial Data Abstraction Layer (GDAL) library comes in. It’s available for Debian and Ubuntu as packages, and probably most other Linux distros as well. The GDAL toolset comes with a program called ogr2ogr and that’s what we’re going to use to get the shape file into something more handy.

But, in order to effectively convert our shape file, we need to know what spatial projection the points are described in, and what we want to re-project them into. The switches for ogr2ogr we are interested in for this are -s_srs and -t_srs which identify the “source file SRS (spatial reference system)” and the SRS we want to convert/re-project into. It turns out there are a lot of ways to describe an SRS. Some are well known and organizations have labeled them in a standard way. But, often two different standards bodies or organizations use different labels for the same SRS. SRS’s are sometimes described by a formatted string of key/value pairs (sometimes called “Well Known Text” in the GIS world, or “WKT”). Some geo-spatial libraries even define their own standard for describing SRS’s (if you’ve used Proj.4 you’ll know about their way). What it comes down to is that “standards” for describing spatial references don’t really exist. Or rather, there seem to be several parallel standards. Luckily, GDAL is good at understanding them.

So what’s our input SRS? The “WKT” of that SRS is going to be found in Parks.prj (for projection?). Just cat it out:


Nice. So, you can see that our projection is the Lambert Conformal Conic, and we have some other parameters in here as well. As far as my research goes, things like “Datum” (“D North American 1983”) and “PROJCS” (“NAD_1983_StatePlane_New_York_Long_Island_FIPS_3104_Feet”) indicate known US “state plane coordinate systems” that describe portions of the earth.

Ok, that’s the WKT for our input SRS (don’t worry, GDAL will just deal with that sucker). We need the output SRS. The coordinate system that GPS uses and pretty much all the mapping APIs expect as input is known as the World Geodetic System, last revised in 1984. Shorthand: WGS84. I’m not so sure ogr2ogr “knows” what WGS84 is by name. Its man page, however, indicates that it does know about SRS’s described by a particular standards body, the Geomatics Committee. The Geomatics Committee calls WGS84 “EPSG:4326” and GDAL’s tool can handle that. (By the way, if you are using a Ruby or Python library that wraps proj.4, or you have a need to open and parse shapefile data with other tools that require quirky SRS definitions, the Geomatics Committee website has great translations of the SRS’s into WKT and proj.4 command line switches, which you will definitely need when you instantiate that RGeo object, or some such).

One more thing before actually doing this conversion/re-projection. GDAL doesn’t actually understand the Lambert Conformal Conic projection described in Parks.prj. There’s an updated (and as far as my testing goes, backwards compatible) revision of this projection which is defined as “Lambert_Conformal_Conic_2SP” and you must change your Parks.prj to read:


OK! Now… what output format do we want the shapefile in? Indeed, ogr2ogr can output a new ESRI shapefile, or we can do something like… output it to JSON, which seems like a winning format to me (check the man page for other fun formats you can use):

ogr2ogr -f "GeoJSON" -s_srs Parks.prj -t_srs EPSG:4326 Parks.json Parks.shp

And we’re done! We have a nice JSON-encoded string in Parks.json (albeit a very large one), with descriptions of all the Polygons and MultiPolygons that describe the boundaries of New York City’s parks in latitude and longitude! Easily munged to throw onto a Google map or some such. Each park entry even has its associated metadata.
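Consuming the result is then plain JSON handling. A sketch with a tiny hand-made FeatureCollection in the shape ogr2ogr emits — note that the property name “SIGNNAME” and the coordinates are hypothetical stand-ins; check your Parks.json for the actual attribute names carried over from the DBase file:

```python
import json

# A miniature GeoJSON FeatureCollection; "SIGNNAME" and the coordinates
# are made up for illustration.
parks_json = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "properties": {"SIGNNAME": "Central Park"},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[-73.97, 40.76], [-73.95, 40.80],
                                       [-73.98, 40.79], [-73.97, 40.76]]]}},
    ],
})

data = json.loads(parks_json)
for feature in data["features"]:
    name = feature["properties"]["SIGNNAME"]
    ring = feature["geometry"]["coordinates"][0]  # the outer boundary ring
    lons = [pt[0] for pt in ring]
    lats = [pt[1] for pt in ring]
    # Print each park's name with its bounding box (min/max lon/lat):
    print(name, (min(lons), min(lats), max(lons), max(lats)))
```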

RMagick Gem install on Debian Lenny

I run Debian Lenny and I want to install RMagick, a Ruby interface to the ImageMagick libraries. Let’s try it (I have the ruby1.9.1 packages installed on Lenny for using Ruby 1.9.2, so many “ruby” commands have “1.9.1” appended to them):

sudo gem1.9.1 install rmagick


Can't install RMagick 2.13.1. Can't find MagickWand.h.

Also, there are some warnings about “Found more than one ImageMagick installation.” Conventional wisdom and Google searching suggest that we can install that handy MagickWand.h header file dependency by installing the “libmagick9-dev” package from Lenny. Unfortunately, if you do this and then re-install the gem, you are going to get an error that looks like:

checking for ImageMagick version >= 6.4.9... no

Ouch, so to install the MagickWand.h dependency, we had to downgrade our ImageMagick install to the point where the RMagick gem won’t even try to compile. The problem is in how the ImageMagick packages in Lenny are arranged, and I don’t quite understand the logic: “libmagick-dev” provides a more recent version of ImageMagick than “libmagick9-dev”… but only the older “libmagick9-dev” has the needed header files!

What to do? Backport a newer version of ImageMagick from Squeeze, of course! Follow these instructions for adding my Debian backports repository to your apt sources; once you’re updated (make sure to pin anything you don’t want from my backports), do this:

sudo aptitude install libmagickwand-dev
sudo gem1.9.1 install rmagick

And it compiles with the latest libmagick from Squeeze:

Successfully installed rmagick-2.13.1

New Lenny Backports

After rebuilding my VirtualBox Debian Lenny development images to 64bit installations, I realized that I needed to update my Debian personal package repository to include the amd64 architecture. So! I went ahead and rebuilt the latest packages from Squeeze for amd64 and i386 for Lenny, which includes Lighttpd at 1.4.28.

The following packages from Debian Squeeze are now in the PPA:

erlang-1:14.a-dfsg-2~bpo50+1 [amd64, i386]
couchdb-0.11.0-2.1~bpo50+1 [amd64, i386]
lighttpd-1.4.28-1~bpo50+1 [amd64, i386]
dkimproxy-1.2-6~bpo50+1 [amd64, i386]

Installation instructions for aptitude

UPDATE! 2015-06-04: I no longer maintain this Debian PPA.

Couchdb 0.11 Backport for Lenny

In wanting to play around with couchdb in Lenny, I found that Lenny’s official package is at version 0.8 but I wanted to test out some features from >= 0.10. Squeeze packages 0.11 so it was fairly easy to backport. I haven’t used the backport extensively, but I have noticed that Futon contains a JS error (“this.live is not a function”, futon.js?0.11.0, line 382). Perhaps this error or another issue is making the web interface essentially useless. Note that installing the backport also installs a backport of Erlang.

Installation Instructions for Aptitude:

Add this to your /etc/apt/sources.list:

deb http://www.jonmoniaci.com/debian-ppa/ lenny main contrib non-free

Then install via:

aptitude update && aptitude install couchdb

UPDATE 2010-10-19:

The “this.live is not a function” problem is due to the fact that Couchdb relies on the “libjs-jquery” Debian package but does not specify a version. The jQuery “live” function was added in version 1.3.x, but Lenny ships with jQuery 1.2.x. This means you must manually install the libjs-jquery backport from either the lenny-backports repository or my repository (I have backported it in my repository — duplicating the effort of the lenny-backports folks — since I have backported a version of couchdb that relies on this package, and I think you should be able to get it all up and running with only one addition to your /etc/apt/sources.list file):

aptitude update && aptitude install libjs-jquery

UPDATE! 2015-06-04: I no longer maintain this Debian PPA.

Searching Common Nicknames in SOLR

I’ve been using Apache SOLR 1.4 as an indexing server for search lately.  Among the fields I index are people’s names. Most of the users are English speakers, and many use their proper English name on their profiles, but their friends or colleagues only know to search for them via their common nickname. Thus, if a user stores “Kimberly” as her first name, a search for “Kim” returns no results. That’s because SOLR doesn’t know how to “stem” proper nouns. Perhaps somebody out there has written a SOLR “common English names stemmer,” but I haven’t found it.

A seemingly easy solution would be to wrap all the search queries in wildcards under the hood, so if a user enters into the search field “Kim” we silently change that to “*kim*” or some such, and only use that wildcard pattern for the “name” field in the index. For example, our query might become something like:

q=(text:kim, name:*kim*)

Where the “text” field is an aggregation of all the fields we index for each document, and the “name” field only indexes the person’s name.

Not only does this seem a little hacky, there are a few problems. One is that I use the DisMaxRequestHandler, which doesn’t allow wildcard search patterns. The other problem is that as of SOLR 1.4, leading wildcards generally don’t play nice, though I believe there are ways to handle them. Also, a search for “Richard” could never find “Dick” (or vice versa) via this method. We need something more than just simple stemming.

SOLR provides a “synonyms” token filter. This essentially allows us to create a map of words that should be considered equals. Thus, we could map:

Kim => Kimberly

SOLR would then know that “Kim” can also mean “Kimberly” and it should search for both those tokens. However, searching for “Kimberly” does not also search for “Kim”. The “=>” arrow delimiter specifies that the map is one way. By configuring the synonyms token filter to “expand” the map, designating this map:

Kim, Kimberly

(with commas) means that the synonyms work in both directions. Searching for “Kim” or “Kimberly” will search for both. And you can specify multiple synonyms in one line, like so:

Kim, Kimmy, Kimberly, Kimberlicious

Or, for one way (do you really want a search for “Kimberly” to also search for “Kimberlicious”?):

Kim, Kimmy, Kimberlicious => Kimberly

All we need now is a map file that has all the common English nicknames. UsefulEnglish.ru to the rescue! They have lists of common male and female English nicknames.

You can download my SOLR formatted synonyms file. You should note that this file doesn’t do any “one way mapping” even though it probably should in some cases. For example, based on this synonyms line:

Caroline, Carolyn, Carolina, Carlyne, Carline, Karoline, Carrie, Carry, Caddie, Caddy, Carlie, Carly, Callie, Cally, Carol, Lynn, Lynne, Lin

A search for “Carlie” also searches for “Lynn” which is probably not desirable. In reality, this line should probably be broken up into one or more maps that make a little more sense:

Carrie, Carry, Caddie, Caddy, Carlie, Carly, Callie, Cally, Carol, Caroline, Carolyn, Carlyne, Carline, Karoline

Lynn, Lynne, Lin, Caroline, Carolyn, Carlyne, Carline, Karoline

Here all the “nicknames” become grouped even though the “real names” are repeated. For example, a search for “Lynn” now searches for everything on line 2… but not, say, “Caddy”. And a search for “Carolyn” will search for “Caddy” and “Lynn”, since “Carolyn” appears on both lines.
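The expansion semantics described above can be sketched as a small parser for the synonyms-file syntax. This mirrors the behavior as described in this post (comma lists expand in both directions, “=>” lists expand one way), not SOLR’s internals:

```python
def parse_synonyms(text):
    """Build a query-expansion table from SOLR synonyms-file syntax:
    'a, b, c' expands every member to the whole group (expand="true"),
    while 'a, b => c' expands only the left-hand terms."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "=>" in line:
            lhs, rhs = line.split("=>")
            sources = [t.strip().lower() for t in lhs.split(",")]
            targets = [t.strip().lower() for t in rhs.split(",")]
        else:
            sources = targets = [t.strip().lower() for t in line.split(",")]
        for term in sources:
            entry = table.setdefault(term, set())
            entry.update(targets)
            entry.add(term)  # the query term itself is always searched
    return table

rules = "Kim, Kimmy, Kimberlicious => Kimberly\nDick, Richard"
table = parse_synonyms(rules)
print(sorted(table["kim"]))      # one-way: "kim" also searches "kimberly"
print(sorted(table["richard"]))  # two-way: "richard" also searches "dick"
print("kimberly" in table)       # one-way: "kimberly" is not expanded
```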

I have not done any empirical benchmarking using this synonyms file, but I can say based on my observations that searches do run slower using the synonyms filter. I don’t know the exact performance implications of making the file more complicated, but I assume that length and overlapping maps/words would only serve to slow things down.

To use the synonyms filter with this file, I first created a special field type in my SOLR schema.xml file:

<fieldType name="textEnglishName" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="english_names.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
</fieldType>

Nothing fancy here! I’m using the ASCIIFoldingFilter to replace accented characters when possible with ASCII equivalents. Also, no stemming filter is present. And, the synonyms filter is only used during query, not during index.

Then, I just use the “textEnglishName” field type for any field that indexes a person’s name:

  <field name="name" type="textEnglishName" indexed="true" stored="false"/>

VirtualBox ACPI problems

After using the freely available VMware Player for a while to launch Debian images for web development, as well as Windows images for IE browser testing, I’ve been slowly migrating everything to VirtualBox. In my experience, VirtualBox images seem to run a bit faster, and when using Windows images the OS UI seems more responsive. VirtualBox doesn’t come “out-of-the-box” with virtualized network interfaces bridged to your host, so you have to do some configuration to be able to SSH or otherwise connect to the virtual images (outside of using the console provided by VirtualBox).

When I created a new VirtualBox instance (host OS is Ubuntu Karmic) with an installation of Windows XP SP2, the image would freeze for up to 30 seconds at a time (during installation as well as normal operation after rebooting into the installed OS), and I noticed VirtualBox was logging messages like this during the freezes:

TM: Giving up catch-up attempt at a 61 452 850 245 ns lag; new total: 1 121 014 977 319 ns

After some searching, it seems like my particular Windows installation’s ACPI was not working well with VBox. This forum thread (post by user kyboren) solved the issue. It’s a long thread, so I’m re-posting below:

After booting into Windows:

  • Right-click ‘My Computer’
  • Go to the hardware tab, click ‘Device Manager’
  • Expand the ‘Computer’ item
  • Select ‘ACPI Multiprocessor PC’
  • Right-click it and select ‘Update Driver’
  • Choose ‘Install from a list or specific location (Advanced)’
  • Choose ‘Don’t search. I will choose the driver to install.’
  • Choose ‘Standard PC’ from the list.
  • Reboot

After rebooting, Windows hardware detection will re-detect most of the virtualized hardware, simply walk through that process.

lighty 1.4.26 in debian lenny

Recently I’ve been playing around with SWFUpload v2.2.0.  In testing, I discovered that I would always receive back from the server an HTTP 400 “Bad Request” when the uploader tried to POST.  I run Debian Lenny and Lighty. Now, Lenny ships with Lighttpd 1.4.19 and my browser has Shockwave Flash 10.0 r42 (OS 10.5, Firefox 3.6.2). Flash seems to send an erroneous “Expect: 100-Continue” HTTP header during HTTP POST operations. Apparently some servers, like Apache, will silently ignore this and allow the operation, however Lighty does not.

Before researching this more, I decided that backporting the version of Lighty from Debian Squeeze (testing) to Lenny might solve this issue. Following these instructions I built a deb package in Lenny based on source from Squeeze and installed Lighty 1.4.26.

Although the backport functions well, it seems that even though Lighty (as of version 1.4.21) allows you to set a custom configuration value to ignore the erroneous 100-Continue header, it still cannot handle the Flash multi-part boundary bug. Alas, we have to wait until Lighty 1.5 to get this working, which I’m unwilling to run in production right now.

Since others might find the backport of Lighty 1.4.26 to Lenny useful, here it is:


Download and install as root with:

wget http://www.jonmoniaci.com/debian-ppa/pool/main/l/lighttpd/lighttpd_1.4.26-1~bpo50+1_i386.deb
dpkg -i lighttpd_1.4.26-1~bpo50+1_i386.deb

I have not tested the other portions of the backport (mysql vhosts, etc), since I only use the main server.

UPDATE! 2010-05-07: I have recently created a Debian repository on my server so instead of just downloading and using dpkg -i to install the backport, you can add this to your /etc/apt/sources.list:

deb http://www.jonmoniaci.com/debian-ppa/ lenny main contrib non-free

Then install via:

aptitude update && aptitude install lighttpd

And it will install from my repository. Note that I have not signed the repository as of yet, so the packages will be considered “untrusted”.

UPDATE! 2010-10-18: The backported version of Lighttpd in the repository is now 1.4.28 (from Squeeze).

UPDATE! 2015-06-04: I no longer maintain my Debian PPA.