Publication | Closed Access

Geospatial Hadoop (GS-Hadoop) an efficient mapreduce based engine for distributed processing of shapefiles

Year: 2016

Citations: 14

References: 10

Abstract

Geospatial data, the input to any Geographic Information System (GIS), are gathered from various sources such as earth observation satellites (EOS), drones, mobile devices, sensor networks, RFIDs and the web. Multi-spectral and temporal images are examples of raster data, while vector data are stored in the shapefile format, which comprises .shp, .shx and .dbf component files, and in XML-based formats (KML and GML). The shapefile format is the most widely used and can store millions of vector features. A multi-temporal dataset can contain thousands of shapefiles, running into multiple terabytes, challenging the limits of frameworks such as Apache Hadoop. A limitation of HDFS is that it does not co-locate the data blocks of a file. Shapefiles are binary and cannot be processed when their blocks are split across the cluster nodes. Our proposed Extended Shapefile format (.shpx) allows MapReduce to directly access the shapefile component files using memory-mapped input/output. The SHPX format and the accompanying ShapeDist library have been compared with the most widely used archival formats. We used a modified GeoTools library for in-memory processing of shapefiles. We achieve a speedup of ~8.3 for distributed processing, and our results demonstrate considerable improvement in performance when processing thousands of extended shapefiles and millions of features on a Hadoop cluster.
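To make the idea of memory-mapped access to a binary shapefile concrete, the following is a minimal sketch in Python. It is not the authors' ShapeDist library and does not reproduce the .shpx layout, which the abstract does not describe. It only reads the fixed 100-byte header of a standard .shp main file through a memory map, and the function name `read_shp_header` is hypothetical. Because the header sits only at the start of the file, the sketch also hints at why an arbitrary HDFS block of a shapefile cannot be interpreted on its own.

```python
import mmap
import struct


def read_shp_header(path):
    """Read the 100-byte header of a standard .shp file via a memory map.

    Hypothetical helper for illustration only; it is not part of ShapeDist
    or the .shpx format.
    """
    with open(path, "rb") as f:
        # Map the file read-only; bytes are paged in on access, not copied up front.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            if len(mm) < 100:
                raise ValueError("file is shorter than a shapefile header")
            # Bytes 0-3: file code, big-endian, always 9994.
            (file_code,) = struct.unpack_from(">i", mm, 0)
            if file_code != 9994:
                raise ValueError("not a shapefile main file")
            # Bytes 24-27: file length in 16-bit words, big-endian.
            (length_words,) = struct.unpack_from(">i", mm, 24)
            # Bytes 28-35: version and shape type, little-endian.
            version, shape_type = struct.unpack_from("<ii", mm, 28)
            # Bytes 36-67: bounding box (xmin, ymin, xmax, ymax), little-endian doubles.
            bbox = struct.unpack_from("<4d", mm, 36)
    return {
        "length_bytes": length_words * 2,
        "version": version,
        "shape_type": shape_type,
        "bbox": bbox,
    }
```

A record reader on a cluster node would need the whole file, or at least the header and the matching .shx index, to be locally available before it could seek to individual features in this way.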
