Skip to content

Latest commit

 

History

History
114 lines (77 loc) · 8.17 KB

README.md

File metadata and controls

114 lines (77 loc) · 8.17 KB

jeospatial

jeospatial is a simple geospatial point database library for Java. It aims to provide an easy-to-use and reasonably-high-performance set of tools for solving the k-nearest-neighbor problem on the earth's surface.

Geospatial point databases in this library are implemented using vantage point trees (or vp-trees), which are a data structure that performs binary space partitioning on a metric space. Construction of a geospatial point database executes in O(n log(n)) time and searches against that database execute in O(log(n)) time. As a practical point of reference, it takes about a second and a half to construct a vp-tree that contains roughly 30,000 geospatial points on a 2007 MacBook Pro, and about two seconds to execute 10,000 searches against that tree (for a search throughput of about 5,000 searches/second). By contrast, it takes between 300 and 400 milliseconds to sort a list of those 30,000 points by distance from a query point (for a search throughput of 2-3 searches/second).

Major concepts

The two most important interfaces in the jeospatial library are in the com.eatthepath.jeospatial package.

The GeospatialPoint interface defines a single point on the earth's surface; concrete implementations of the GeospatialPoint interface can be found in the com.eatthepath.jeospatial.util package.

The GeospatialPointDatabase interface defines the behavioral contract for classes that index collections of GeospatialPoints and provide facilities for performing nearest-neighbor searches among those points. The VPTree and LockingVPTree classes are both concrete implementations of the GeospatialPointDatabase class and can be found in the com.eatthepath.jeospatial.vptree package.

For additional details, see the API documentation.

Examples

Finding nearest neighbors

Let's say we have a list of all of the zip codes in the United States and we want to find the ten closest zip codes to some point in the world (let's say Davis Square in Somerville, MA, USA). We might do something like this:

// Load a bunch of zip codes from a file and construct a vp-tree from
// those points
List<ZipCode> zipCodes = ZipCode.loadAllFromCsvFile();
VPTree<ZipCode> pointDatabase = new VPTree<ZipCode>(zipCodes);

// Pick a query point (Davis Square in Somerville, MA, USA)
SimpleGeospatialPoint davisSquare = new SimpleGeospatialPoint(42.396745, -71.122479);

// Find the ten nearest zip codes to Davis Square
List<ZipCode> nearestZipCodes = pointDatabase.getNearestNeighbors(davisSquare, 10);

The nearestZipCodes list will have ten elements; the first will be the closest zip code to Davis Square, the second will be the second closest, and so on.

Finding all neighbors within a certain radius

Assuming we have the same set of zip codes, we might want to find all of the zip codes that are within a fixed distance Davis Square. For example:

// Find all zip codes within ten kilometers of Davis Square
List<ZipCode> zipCodesWithinRange = database.getAllNeighborsWithinDistance(davisSquare, 10 * 1000);

The zipCodesWithinRange list will contain all of the zip codes—sorted in order of increasing distance from Davis Square—that are within ten kilometers of Davis Square.

Finding neighbors by other search criteria

Assuming we still have the now-familiar list of zip codes, we might want to find the closest zip codes to Davis Square that are outside of the state of Massachusetts. We could do that using the SearchCriteria class:

// Specify search criteria that matches anything outside of the state of
// Massachusetts
SearchCriteria<ZipCode> searchCriteria = new SearchCriteria<ZipCode>() {
    @Override
    public boolean matches(ZipCode zipCode) {
        return !zipCode.getState().equals("MA");
    }
};

// Find the ten closest zip codes to Davis Square that are outside of
// Massachusetts
List<ZipCode> closestOutsideMA = database.getNearestNeighbors(davisSquare, 10, searchCriteria);

The closestOutsideMA list will contain the ten closest zip codes to Davis Square sorted in order of increasing distance.

Finding points inside a bounding box

If you're working with a section of a map with a cylindrical projection (e.g. a Google or Bing map), you might want to find all of the zip codes that are visible in that section of the map. A set of bounding box search methods comes in handy here:

// Find all of the zip codes in a bounding "box"
List<ZipCode> inBoundingBox = database.getAllPointsInBoundingBox(-75, -70, 43, 42);

As might expected, the inBoundingBox list contains all of the zip codes that fall between the longitude lines of -75 and -70 degrees and the latitude lines of 42 and 43 degrees. Other variants of the getAllPointsInBoundingBox method allow for sorting the results by proximity to some point and applying additional search criteria.

Memory usage

jeospatial tries to keep memory overhead to a reasonable level by using sensible internal collection types and by aggressively trimming internal point lists. Still, any spatial index will necessarily introduce some memory overhead, and it's important—especially when dealing with large datasets—to understand how data structures and design decisions affect the memory usage of an application.

A detailed exploration of jeospatial's memory usage will be available soon, but the short version is that the overhead imposed by a VPTree is a function of the node capacity (set at construction time) of the tree. The per-point overhead cost of using a VPTree is roughly as follows for non-trivial datasets:

Node capacityOverhead per point
1112.0 bytes
256.0 bytes
430.0 bytes
817.0 bytes
1610.5 bytes
327.3 bytes
645.6 bytes
1284.8 bytes
2564.4 bytes

The default node capacity for a VPTree (at the time of this writing) is 32 points per node, so it's reasonable to expect that a default database would impose overhead memory usage somewhere between 7.3 and 10.5 bytes per point. A tree containing 1,048,576 (2^20) SimpleGeospatialPoints would take up 24 MB for the points themselves and an additional 7.3 to 10.5 MB in overhead for a total memory footprint somewhere between 31.3 and 34.5 MB.

While it may be tempting to set a high node capacity when working with VPTrees to keep memory overhead to a minimum, it's important to remember that larger node capacities carry performance tradeoffs. In addition to a lower memory overhead ratio, a larger node capacity means that a search will have to visit fewer nodes of the tree to return the desired number of points, but will have to examine every point in that node for inclusion in the search results. If your use case calls for frequent searches that return large numbers of points, a larger node capacity is a good idea across the board. Otherwise, carefully consider the tradeoffs between memory usage and processing time for your specific application. As a general rule of thumb, a node capacity somewhere around twice the size of your usual search result size strikes a good balance between processing time and memory usage.

Acknowledgements

United States ZIP code data used in examples and unit tests comes from The United States Census Bureau. Examples and unit tests make use of the opencsv library under the Apache License, Version 2.0.

License

jeospatial is an open-source project provided under the BSD License.