New York City publishes lots of geographic data from a variety of city departments. A lot of it is GIS data for mapping things likes city park locations, beaches, playgrounds and bathrooms. There’s even a tree census GIS project you can download for every borough. Every street light. Zoning data. Lots of fun stuff! It’s called the NYC DataMine, the geo-data sets are here. It’s cool, but the value of the data is limited unless you’re a GIS wonk, or use GIS mapping tools, which, if you use Linux or Mac like me, you might be out of luck for the free GUI tools. Why doesn’t the city publish their data in easier to read web-formats? People could use it to throw onto Google maps, make location aware NYC applications, etc. It is possible to work with their data in this way, but it takes a little wrangling.
Let’s take a look at the city Parks GIS project.
What’s inside that zip file? It’s a set of mostly binary files that describe shapes and polygons using points and line segments that demarcate the boundaries of all the NYC parks defined in the database. The shape files are “ESRI Shapefiles“, a format created by Esri, a GIS mapping software company. According to Wikipedia, Esri has captured a lot of the GIS toolset market and apparently NYC uses their products. Along with these shape files is a DBase 3 database that contains meta-data about those shapes (like, the name of the park, what borough it’s in, it’s area, etc.). Normally, you’d open these files in a program like ArcGis, but I don’t use Windows. Besides, this is 2011. I want to look at it on the web, probably on a Google Map.
So we have a few issues. The first is that the binary ESRI Shapefile (Parks.shp) needs to be interpreted into some kind of serial format for easier handling. Libraries exist in different languages to read this file format, but I’ve found them to be a bit clunky and it’s easier just to get it into something else.
Shape files basically contain definitions of shapes identified by points in 2D space. These points are (obviously) meant to be plotted on a map. But what kind of map? How is that map projected? You remember from elementary school the basic Mercator Projection: Take a transparent globe that has a light in the middle, wrap a sheet of paper around the globe’s equator to form a cylinder, turn on the light and trace the lines being projected from the globe. (That’s why it’s called a projection, after all.) Actually, what we were all taught in elementary school is not exactly the correct physical method for creating the projection, but the point is that when you project a spherical object onto a 2D surface, it gets distorted somewhere. This is important because the points of a shapefile can be spatially referenced to any of a number of projections. Today on the web, we mostly use latitude and longitude as input to a mapping API (like Google, or Yahoo!) and let the service figure out how to flatten it out back into 2D. Points described in degrees latitude and longitude are “spherically projected” but I have found it rare indeed for GIS data to be described so simply. GIS data tends to be described in a different spatial reference, and this is where our NYC Parks data gets a little complicated.
First, we need a tool that can actually read and write shapefiles and hopefully output them into more friendly formats. This is where the Geospatial Data Abstraction Layer (GDAL) library comes in. It’s available for Debian and Ubuntu as packages, and probably most other Linux distros as well. The GDAL toolset comes with a program called
ogr2ogr and that’s what we’re going to use to get the shape file into something more handy.
But, in order to effectively convert our shape file, we need to know what spatial projection the points are described in, and what we want to re-project them into. The switches for
ogr2ogr we are interested in for this are
-t_srs which identify the “source file SRS (spatial reference system)” and the SRS we want to convert/re-project into. It turns out that are a lot of ways to describe an SRS. Some are well known and organizations have labeled them in a standard way. But, often two different standards bodies or organizations use different labels for the same SRS. SRS’s are sometimes described by a formatted string of key/value pairs (sometimes called “Well Known Text” in the GIS world, or “WKT”). Some geo-spatial libraries even define their own standard for describing SRS’s (if you’ve used Proj.4 you’ll know about their way). What it comes down to is that “standards” for describing spatial references don’t really exist. Or rather, there seem to be several parallel standards. Luckily, GDAL is good at understanding them.
So what’s our input SRS? The “WKT” of that SRS is going to be found in Parks.prj (for projection?). Just cat it out:
Nice. So, you can see that our projection is the Lambert Conformal Conic, and we have some other parameters in here as well. As far as my research goes, things like “Datum” (“D North American 1983”) and “PROJCS” (“NAD_1983_StatePlane_New_York_Long_Island_FIPS_3104_Feet”) indicate known US “state plane coordinate systems” that describe portions of the earth.
Ok, that’s the WKT for our input SRS (don’t worry, GDAL will just deal with that sucker). We need the output SRS. The coordinate system that GPS uses and pretty much all the mapping APIs expect as input is known as the World Geodetic System, last revised in 1984. Shorthand: WGS84. I’m not so sure
ogr2ogr “knows” what WGS84 is by name. It’s man page, however, indicates that it does know about SRS’s described by a particular standards body, the Geomatics Committee. The Geomatics Committee calls WGS84 “EPSG:4326” and GDAL’s tool can handle that. (By the way, if you are using a Ruby or Python library that wraps proj.4, or you have a need to open and parse shapefile data with other tools that require quirky SRS definitions, the Geomatics Committee website has great translations of the SRS’s into WKT and proj.4 command line switches, which you will definitely need when you instantiate that RGeo object, or some such).
One more thing before actually doing this conversion/re-projection. GDAL doesn’t actually understand the Lambert Conformal Conic projection described in Parks.prj. There’s an updated (and as far as my testing goes, backwards compatible) revision of this projection which is defined as “Lambert_Conformal_Conic_2SP” and you must change your Parks.prj to read:
OK! now… what output format do we want the shapefile in? Indeed,
ogr2ogr2 can output a new ESRI shapefile, or we can do something like… output it to JSON which seems like a winning format to me (check the man page for other fun formats you can use):
ogr2ogr -f "GeoJSON" -s_srs Parks.prj -t_srs EPSG:4326 Parks.json Parks.shp
And we’re done! We have a nice JSON encoded string in Parks.json (albeit a very large one), with descriptions of all the Polygons and MultiPolygons that describe the boundaries of New York City’s parks in latitude and longitude! Easily munged to throw into a Google map or some such. Each park entry even has it’s associated meta-data.