Saturday, February 28, 2015

Raspberry Pi v2 is awesome

For some months, I have been hosting my personal websites at home. Back in the 90s and the Noughties, a lot of nerd kids did that: a PC on the shelf, your ISDN line, and a static IP. There's not a lot of that anymore, what with "real" website hosting being as cheap as the electricity to run a PC. Anyway, I do. I have a Raspberry Pi hosting my personal websites (well, just the one right now: http://fightingfantasyfan.info/) from home. And last week it got a major upgrade.

The Raspberry Pi 2 is a major upgrade to the CPU: a quad-core ARMv7 clocked at 900 MHz, which overclocks safely to 1 GHz. Contrast this with the previous RasPi B+, which has a single ARMv6 core at 700 MHz. And the board's form factor hasn't changed at all: same size and shape, same connector locations, just roughly six times the CPU.

Frankly, hosting on my RasPi B+ was not fun. Editing my FF pages was a nuisance, waiting 1 or 2 minutes for my Save to happen. Several times I strongly considered dropping the experiment, and just paying five bucks for hosting. But not anymore. The RasPi 2 is just excellent.

Caching and Etag headers - A day's work in Apache and TileStache

Some months ago, I spun up an Amazon EC2 instance, installed Varnish on it, and set it up as a caching proxy for our map tiles. Gorgeous map tiles, but rather a lot of data every month, and it seemed best to offload that onto Amazon's fast network.
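The Varnish side of that is pleasantly small. Here's a minimal VCL 4.0 sketch of the idea -- not our production config, and the origin hostname is made up:

vcl 4.0;

# the TileStache origin server (hostname is hypothetical)
backend tilestache {
    .host = "tiles.example.com";
    .port = "80";
}

sub vcl_backend_response {
    # keep tiles in Varnish's own cache for 90 days
    set beresp.ttl = 90d;
}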

It's working well: after a bit of fussing with setting Cache-Control and Expires headers via Apache's Expires and Headers modules, browsers definitely have expiration dates and are caching the map tiles. And that's where I let it lie for some months, on account of more pressing duties.

But I got to revisit this last week and noticed: only Chrome was reporting that the map tiles were being loaded from cache. Firebug and IE's F12 tools were showing code 200 for those map tiles. That is to say, they had the map tiles in their cache... but were requesting them anew anyway!

Cache Expiration


These map tiles change very rarely, perhaps once per year. So we were very generous about their maximum age in the cache, advising the browser that 90 days is good:
ExpiresActive On
ExpiresDefault "access plus 90 days"
Header set Cache-Control "max-age=7776000"
But so what? The browser has in its cache an image fetched on Feb 22 (6 days ago), and knows that the file is definitely too old come late May... But what about today? How does the browser know that there isn't a newer version?

If the image file had a Last-Modified time, the browser could send along an If-Modified-Since header, and the server could send back a 304 (Not Modified) if the file hasn't changed. But this is not a static file. It's generated by TileStache (perhaps from an on-disk cache, perhaps not), and TileStache does not include a Last-Modified header.
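On the wire, that conversation would look like this (dates made up):

GET /12/34/56.jpeg
If-Modified-Since: Sun, 22 Feb 2015 10:00:00 GMT

HTTP/1.1 304 Not Modified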

From the standpoint of TileStache's capability to support multiple storage backends (Caches class types), this kinda makes sense. An S3 backend, an on-disk backend, memcache, or even hybrids of multiple cache backends... I can see why they skipped that.

As a result, the browser has the tile in its cache, but the only way to know that it's the most recent version... is to download it again.

Etag


The other mechanism for identifying that a file has not changed since you last downloaded it is the humble Etag header. An Etag is a string that identifies a version of the file: it could be a timestamp, a version number, an MD5 or CRC32 checksum, just about anything. Apache's own default combines the file's mtime, inode number, and filesize into a nice scramble.
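That behavior is controlled by Apache's FileETag directive; the classic default amounts to:

FileETag INode MTime Size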

The server sends the Etag along with the file content:
Content-Type: image/jpeg
Etag: "abcdefgh12345678"
Content-Length: 213452

(file payload begins here)
On a subsequent request to that same URL, the browser sends that same Etag back to the server as an advisory: "If your Etag matches this, then I already have the current copy and I don't need it again":
GET /12/34/56.jpeg
If-None-Match: "abcdefgh12345678"
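And if that Etag is still current, the server skips the payload entirely:

HTTP/1.1 304 Not Modified
Etag: "abcdefgh12345678"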
An added benefit of using an Etag instead of a Last-Modified time is that the Etag won't run afoul of your clock being borked. With our extreme max-age of 90 days that's not a big issue, but if you're doing more typical caching of 1 or 2 hours and someone's PC gets set to the wrong timezone, date-based caching misbehaves -- but Etag abc123 will still be abc123 no matter what time it is.

But TileStache doesn't generate Etag headers either, so now what?

TileStache Issue 22


It looks like someone else ran into this two years ago, and asked for either Last-Modified or Etag support to be added to TileStache. It didn't get done. Fortunately, shr3k's advice was pretty close to the mark, and it proved quite easy to add an MD5 digest to all three entry points: WSGI, CGI, and ModPython.
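The change itself is tiny. Here's a sketch of the approach (not the literal patch): hash the rendered tile bytes with MD5 and emit that digest as the Etag header alongside the content.

from hashlib import md5

def etag_headers(headers, tile_bytes):
    # given a list of (name, value) header tuples and the rendered tile bytes,
    # append an Etag derived from the tile content itself
    headers.append(('ETag', '"%s"' % md5(tile_bytes).hexdigest()))
    return headers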

So, if they ever accept that pull request, or if you patch your own installation, your TileStache will automagically generate Etag headers.

At first I was concerned that generating that Etag would add significant processing time. But in fact it's not even measurable: if there's a millisecond or two of difference, it's nothing detectable by the time a browser and a network are involved.

Hooray! (well, for caching proxies)


So we're all set: our Varnish CDN is passing the Etag on to the client and is also caching it for itself. Browsers are sending the Etag back to Varnish, and they're getting 304 responses instead of kilobytes of image content. For tiles being fetched fresh, response times are 100-150 ms; for a tile already in cache, more like 20-40 ms start-to-finish for the browser to find out that its copy is still current.

But there is a limitation: the Etag is only useful to caching proxies such as Varnish, not if your browser is accessing TileStache directly. TileStache will not heed an If-None-Match header from the browser; it will simply send back the image content. This trick is definitely limited to caching proxies which are smart enough to handle the Etag and know to bail with a 304.

Then again, how hard would it be to detect the header, fetch the tile content anyway, then calculate the Etag and compare? It'd come after fetching the content, so perhaps not a win for server load; but it could save some network load... An interesting idea...
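It could even be done without touching TileStache itself, as a WSGI wrapper around it. A hypothetical sketch (the class name and wiring are my own):

from hashlib import md5

class Etag304Wrapper(object):
    # wraps any WSGI app: the tile still gets rendered (no savings in server load),
    # but if the client's If-None-Match matches the fresh digest, we answer
    # 304 and skip sending the body (savings in network load)
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        captured = {}
        def capture(status, headers, exc_info=None):
            captured['status']  = status
            captured['headers'] = headers
        body = b''.join(self.app(environ, capture))

        etag = '"%s"' % md5(body).hexdigest()
        headers = [(k, v) for k, v in captured['headers'] if k.lower() != 'etag']
        headers.append(('ETag', etag))

        if environ.get('HTTP_IF_NONE_MATCH') == etag:
            start_response('304 Not Modified', headers)
            return []
        start_response(captured['status'], headers)
        return [body]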

Thursday, February 19, 2015

Gerrymander-inator

So here was a fun puzzle, to occupy a few hours last week.

There's an agency which does inspections at certain locations, either levying a fine for failing to comply or giving a thumbs-up for following the rules. It's infeasible to visit every location in the state, so they have a sampling methodology.

Clusters and Samples

Their testing methodology is as follows:
  • Define a set of clusters for the year. A "cluster" is a set of ZIP code boundaries which are contiguous and which contain a total of between 20 and 80 locations.
  • A "sample" is a set of locations (say 500), composed by randomly selecting clusters and pooling them together until that many locations are reached (so perhaps 6-25 zipcode clusters).
The rationale here is:
  • The use of clusters means that locations get visited and inspected in geographical blocks, which makes travel easier. An individual zipcode is too small, perhaps having 1 or 2 locations or even none at all, and they prefer to send out a group to do 50-ish locations in an area.
  • The clusters themselves are randomly generated, so one can't claim favoritism or bias. A given zipcode may or may not be clustered in with any particular neighboring zipcodes from one year to the next.
  • To generate the sample, clusters are randomly selected: a cluster of 5 zipcodes and 32 locations here, 8 zipcodes and 78 locations there, a cluster of 4 zipcodes and 21 locations... Again, no favoritism or bias -- it's a random collection of randomly-generated polygons. (A sketch of this sampling step follows below.)
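That sampling step is simple once the clusters exist. Here's a minimal PHP sketch of the idea -- the 500-location target is from the description above, and the $clusters structure matches the clustering code later in this post:

// draw a sample: shuffle the clusters, then take whole clusters
// until we have accumulated the target number of locations
$target_locations = 500;
$sample           = array();
$sample_size      = 0;

$shuffled = array_values($clusters);
shuffle($shuffled);
foreach ($shuffled as $cluster) {
    if ($sample_size >= $target_locations) break;
    $sample[]     = $cluster;
    $sample_size += $cluster['locationcount'];
}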

Having read the first part, about the clustering system, I thought to myself: "A gerrymander maker!" Starting at a given polygon, it crawls over neighboring areas, collecting them into one giant polygon until that polygon has the desired characteristics. In this case it's the number of locations, but hypothetically it could be any other calculation doable on that polygon, such as census demographics or voting habits.

Gerrymanders / Clusters

So here's part 1: the generation of those clusters.

Prerequisites:
  • Zipcode boundary polygons in a table.
  • Each zipcode polygon has an attribute locationcount, which is the number of location-points contained inside that polygon. This was calculated earlier in desktop GIS since they wanted those statistics anyway; a SQL version is sketched after this list. (see also http://gis.stackexchange.com/questions/54661/count-points-in-polygon-with-postgis)
  • The zipcode polygons are reasonably well digitized so their edges touch, but human error means that two neighboring zipcodes may not actually touch. Note the buffer operation below which accommodates this.
  • The examples below are PHP using Kohana's DB API, with PostgreSQL and PostGIS as the database and GIS heavy-lifting.
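That locationcount prerequisite could also have been computed right in PostGIS instead of desktop GIS. A sketch, assuming a hypothetical "locations" point table:

// hypothetical: fill in each zipcode's locationcount via a point-in-polygon count
DB::query(Database::UPDATE, "
    UPDATE zipcodeboundaries z
    SET locationcount = (
        SELECT COUNT(*)
        FROM locations p
        WHERE ST_CONTAINS(z.the_geom, p.the_geom)
    )
")->execute();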
Strategy:
  • Take a list of all zipcode polygons, iterate over them.
  • For each one, do a "while true" collecting neighboring polygons and neighboring-neighboring polygons and neighboring-neighboring-neighboring polygons... until we have a list of polygon ID#s whose summed locationcount is in range.
  • Keep an associative array of polygon ID#s that have already been used in a cluster. If this polygon is already in a cluster, skip it. If not, well, it is now.
  • Having accumulated a set of clusters (cluster 1234 = polygons [ 56, 789, 1023, 45, 67, 8910 ]), get the geometric union of those zipcode polygons to form the geometry of the cluster as a whole. This is done in a second pass, for performance and for debugging: the two steps can easily be debugged separately from each other.



// the preferred range of locations in a cluster
// the maximum is not enforced: if adding a polygon takes us from 19 to 81 locations, it's still >= 20 and we call it done
// however, in the final checking/reporting stage we do tally up clusters whose
// location population is outside this range, and report them as under-populated or over-populated, just FYI
$cluster_locations_min = 20;
$cluster_locations_max = 80;

// create our registry of clusters: uniqueID => cluster-info ( name, polygonids, geom, locationcount )
//     the unique ID for a cluster is not important, as long as it's unique
//     we use the ID# of the first zipcode polygon
$clusters    = array();

// create a quick-reference of zipcode polygons which have already been assigned to a cluster
// so we can skip over them quickly
// remember: associative array key lookups are swift; searching an array's values (in_array) is slow
$already     = array( '0'=>TRUE ); // don't let this be empty, makes for invalid NOT IN () clauses

// make up some prepared statements
// to fetch the total number of locations out of a given set of polygon ID#s, so we can construct clusters
// to fetch a random neighbor of the geomunion of a given set of polygon ID#s, so we can construct clusters (the cluster will sprawl and crawl, neighbors of neighbors)
//      tip: buffering a little bit, allows us to make up for not-quite-precise digitization
// to fetch the total union geometry of a set of polygon ID#s, used when we have a final set of IDs and want to grab the final geometry
$findclusterlocationcount  = DB::query(Database::SELECT, "SELECT SUM(locationcount) AS locationcount FROM zipcodeboundaries WHERE id IN :clusteridlist");
$findclusterrandomneighbor = DB::query(Database::SELECT, "SELECT id FROM zipcodeboundaries WHERE ST_INTERSECTS( the_geom, (SELECT ST_BUFFER(ST_UNION(the_geom),0.0001) FROM zipcodeboundaries WHERE id IN :clusteridlist ) ) AND id NOT IN :clusteridlist AND id NOT IN :alreadylist ORDER BY RANDOM() LIMIT 1");
$findclustergeomunion      = DB::query(Database::SELECT, "SELECT ST_MULTI(ST_UNION(the_geom)) AS the_geom FROM zipcodeboundaries WHERE id IN :clusteridlist");

//
// pass 1: collect the polygon ID#s to form clusters
//

// the list of all Polygon ID#s so we can iterate over them and construct a cluster from each one
// the name of the polygons used merely for output/display/debugging
$all_polygons = DB::select('name', 'id')->from('zipcodeboundaries')->execute();

// start in on a cluster, and recursively keep adding RANDOMLY-SELECTED neighboring polygons to it
// until we collect a location count in the acceptable range
foreach ($all_polygons as $polygon) {
    // first and easiest: if this polygon has already been assigned to a cluster, we don't need to process it at all
    if (array_key_exists($polygon['id'],$already)) continue;

    // start by assuming that this cluster is this 1 polygon and has no locations
    // then use the while loop to keep spreading outward
    $polygon_ids_in_this_cluster = array($polygon['id']);
    $locationcount               = 0;
    while (TRUE) {
        $locationcount = $findclusterlocationcount->param(':clusteridlist',$polygon_ids_in_this_cluster)->execute();
        $locationcount = $locationcount[0]['locationcount'];

        // if we exceed the minimum then we have a cluster
        if ($locationcount >= $cluster_locations_min) {
            // verbose output
            print "     Complete Cluster: ID {$polygon['id']} with $locationcount locations. Components are: " . implode(',',$polygon_ids_in_this_cluster);
            // tag all component polygons as taken
            foreach ($polygon_ids_in_this_cluster as $id) $already[$id] = TRUE;
            // log the cluster
            $clusters[ $polygon['id'] ] = array( 'name'=>$polygon['name'], 'polygonids'=>$polygon_ids_in_this_cluster, 'locationcount'=>$locationcount );
            // done
            break;
        }

        // got here so we don't have enough locations yet
        // pick a random neighbor and add it to the $polygon_ids_in_this_cluster and try again
        // if we come up with 0 neighbors (all neighbors already in a cluster) then we give up,
        //     and accept that this cluster is undersized
        $new_neighbor_id = $findclusterrandomneighbor->param(':clusteridlist',$polygon_ids_in_this_cluster)->param(':alreadylist',array_keys($already))->execute();
        if (! @$new_neighbor_id[0]['id']) {
            // verbose output
            print "     Undersized Cluster: ID {$polygon['id']} with $locationcount locations. Components are: " . implode(',',$polygon_ids_in_this_cluster);
            // tag all component polygons as taken
            foreach ($polygon_ids_in_this_cluster as $id) $already[$id] = TRUE;
            // log the cluster
            $clusters[ $polygon['id'] ] = array( 'name'=>$polygon['name'], 'polygonids'=>$polygon_ids_in_this_cluster, 'locationcount'=>$locationcount );
            // done
            break;
        }
        $polygon_ids_in_this_cluster[] = $new_neighbor_id[0]['id'];
    }
}
printf("Done calculating %d clusters.", sizeof($clusters) );

//
// pass 2: we have polygon IDs, use geom union to create geometries for each cluster
// $clusters is an assocarray of a cluster-ID mapped onto info: list of polygon IDs, location count
//

printf("Calculating geometry union for %d clusters.", sizeof($clusters) );
foreach (array_keys($clusters) as $clusterid) {
    $unions = $findclustergeomunion->param(':clusteridlist',$clusters[$clusterid]['polygonids'])->execute();
    $clusters[$clusterid]['the_geom'] = $unions[0]['the_geom'];
}

//
// save to database
//

print "Saving to database.";
DB::delete('sampling_clusters')->execute();
foreach (array_values($clusters) as $cluster) {
    $fields = array_keys($cluster);
    $values = array_values($cluster);
    DB::insert('sampling_clusters', $fields)->values($values)->execute();
}
Database::instance()->query(NULL, "VACUUM FULL ANALYZE sampling_clusters");
print "Done saving.";

//
// generate some statistics about the generated cluster set
// not vital, but informative
//
$stats = array();
$stats['total_clusters']     = sizeof($clusters);
$stats['notenoughlocations'] = 0;
$stats['toomanylocations']   = 0;
$stats['polygoncount_min']  = array(); // these are lists during the loop; we reduce them with min() and max() below, faster than doing "if X < min, min=X" here in the loop
$stats['polygoncount_max']  = array();
$stats['polygoncount_avg']  = 0;       // summed up in the loop, then divided by the number of clusters for the average
$stats['locationcount_min'] = array();
$stats['locationcount_max'] = array();
$stats['locationcount_avg'] = 0;
foreach (array_values($clusters) as $cluster) {
    $polygoncount = sizeof($cluster['polygonids']);
    $stats['polygoncount_min'][]  = $polygoncount;
    $stats['polygoncount_max'][]  = $polygoncount;
    $stats['polygoncount_avg']   += $polygoncount;

    $stats['locationcount_min'][] = $cluster['locationcount'];
    $stats['locationcount_max'][] = $cluster['locationcount'];
    $stats['locationcount_avg']  += $cluster['locationcount'];

    if ($cluster['locationcount'] < $cluster_locations_min) $stats['notenoughlocations']++;
    if ($cluster['locationcount'] > $cluster_locations_max) $stats['toomanylocations']++;
}

$stats['polygoncount_min'] = min($stats['polygoncount_min']);
$stats['polygoncount_max'] = max($stats['polygoncount_max']);
$stats['polygoncount_avg'] /= sizeof($clusters);

$stats['locationcount_min'] = min($stats['locationcount_min']);
$stats['locationcount_max'] = max($stats['locationcount_max']);
$stats['locationcount_avg'] /= sizeof($clusters);

foreach ($stats as $keyword=>$value) {
    print "STATS: $keyword = $value";
}

Results

These excerpts from three runs show the sort of randomness and variation we're looking for. Polygon number 76 (I don't know its ZIP code; it doesn't matter) sometimes got joined up with 1 other zipcode, sometimes with 4, and the resulting cluster was right on target for number of locations.

Tallied up a Cluster: 76 : 23 Locations / Polygon IDs 76,167,102,107
Tallied up a Cluster: 100 : 31 Locations / Polygon IDs 100,94,99
Tallied up a Cluster: 392 : 39 Locations / Polygon IDs 392,16

Tallied up a Cluster: 76 : 20 Locations / Polygon IDs 76,75
Tallied up a Cluster: 100 : 21 Locations / Polygon IDs 100,106,101,117
Tallied up a Cluster: 392 : 28 Locations / Polygon IDs 392,393

Tallied up a Cluster: 76 : 27 Locations / Polygon IDs 76,167,173,102,145
Tallied up a Cluster: 100 : 26 Locations / Polygon IDs 100,133,101,104,267
Tallied up a Cluster: 392 : 30 Locations / Polygon IDs 392,395

Even the undersized clusters follow the pattern we hope for: not being consistent. Polygon 204 has no locations at all. In one run it was stuck into a corner and resulted in a cluster with 0 locations (not a problem; it just means nobody will be visiting that area), but on other runs Polygon 204 got paired up with 6 or 4 or 10 others.

Undersized Cluster: 204 : Locations 0 / Polygon IDs 204
Tallied up a Cluster: 199 : 22 Locations / Polygon IDs 199,210,304,201,209,204,289
Tallied up a Cluster: 207 : 30 Locations / Polygon IDs 207,211,210,194,198,204,209,199,201,305,304,289

Not too shabby!

Next: Sampling Those Clusters


So there it is: a gerrymander maker. Our criterion for an acceptable cluster is the number of locations contained within it, but it could just as easily have been based on income, race, voting habits... The point is, keep accumulating areas until you get the demographic you want, and there's your district / cluster / whatever.

And the neat thing is that even for a good-sized state in the western USA, this thing executes in about a minute. So if you aren't happy with the results, you can grind through it as many times as you like.