Publication | Closed Access
Global analytics in the face of bandwidth and regulatory constraints
Citations: 165
References: 31
Year: 2015
Engineering, Big Data Analytics, Business Analytics, Global Data, Distributed Data Analytics, Net Neutrality, Global Analytics, Data Science, SQL Analytics, Management, Parallel Computing, Data Management, Networked Intelligence, Advanced Analytics, Computer Science, Information Management, Distributed Query Processing, Regulatory Harmonization, Data-intensive Computing, Global-scale Organizations, Data Portability, Regulation, Massive Data Processing, Big Data
Global-scale organizations generate vast amounts of data across geographically distributed data centers, and querying this data as a whole introduces new research challenges at the intersection of networks and databases. Current systems compute SQL analytics by pulling all data to a central location, an approach that is costly over expensive transoceanic links and may be rendered impossible by emerging regulatory constraints. The new problem of Wide-Area Big Data (WABD) seeks to orchestrate query execution across data centers to minimize bandwidth while respecting regulatory constraints. WABD combines classical query planning with novel network-centric mechanisms such as pseudodistributed execution, joint query optimization, and deltas on cached subquery results. The prototype, Geode, is built on Hive and supports all SQL operators, including joins, across global data. Geode uses 250× less bandwidth than centralized analytics in a Microsoft production workload and up to 360× less on popular analytics benchmarks such as TPC-CH and Berkeley Big Data.
Global-scale organizations produce large volumes of data across geographically distributed data centers. Querying and analyzing such data as a whole introduces new research issues at the intersection of networks and databases. Today, systems that compute SQL analytics over geographically distributed data operate by pulling all data to a central location. This is problematic at large data scales due to expensive transoceanic links, and may be rendered impossible by emerging regulatory constraints. The new problem of Wide-Area Big Data (WABD) consists of orchestrating query execution across data centers to minimize bandwidth while respecting regulatory constraints. WABD combines classical query planning with novel network-centric mechanisms designed for a wide-area setting, such as pseudodistributed execution, joint query optimization, and deltas on cached subquery results. Our prototype, Geode, builds upon Hive and uses 250× less bandwidth than centralized analytics in a Microsoft production workload and up to 360× less on popular analytics benchmarks including TPC-CH and Berkeley Big Data. Geode supports all SQL operators, including joins, across global data.
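As a rough illustration of the "deltas on cached subquery results" idea named in the abstract, the following Python sketch shows how a data center might send only the changes to a previously shipped subquery result rather than the full result. This is not Geode's implementation: the function names `compute_delta` and `apply_delta`, and the assumption that a subquery result is a mapping from a group-by key to an aggregate value, are hypothetical choices made only for this example.

```python
# Hypothetical sketch: both sides keep a copy of the last subquery result.
# On a repeated query, the source site ships only the difference.

def compute_delta(cached, fresh):
    """Return (upserts, deletes) that turn `cached` into `fresh`.

    Both arguments map a group-by key to an aggregate value (an assumption
    of this sketch, not something stated in the abstract).
    """
    # Rows that are new, or whose value changed since the cached copy.
    upserts = {k: v for k, v in fresh.items()
               if k not in cached or cached[k] != v}
    # Keys that existed in the cached copy but no longer appear.
    deletes = sorted(k for k in cached if k not in fresh)
    return upserts, deletes


def apply_delta(cached, upserts, deletes):
    """Rebuild the fresh result on the receiving side from cache + delta."""
    dropped = set(deletes)
    result = {k: v for k, v in cached.items() if k not in dropped}
    result.update(upserts)
    return result


# Example: only two changed rows and one deleted key cross the wide-area link.
cached = {"us": 10, "eu": 7, "asia": 3}
fresh = {"us": 10, "eu": 9, "jp": 1}
upserts, deletes = compute_delta(cached, fresh)
rebuilt = apply_delta(cached, upserts, deletes)
```

In this toy case the unchanged `"us"` row is never retransmitted. Any bandwidth saving in practice would depend on how much of a subquery result changes between runs.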