
Publication | Closed Access

Data warehousing and analytics infrastructure at facebook

Citations: 433

References: 7

Year: 2010

TLDR

Scalable analysis of large data sets is core to many Facebook teams: it supports ad hoc analysis, business intelligence dashboards, and site features such as Insights and friend recommendations. This calls for a flexible, cost-effective infrastructure that scales with growing data. The paper describes how three open-source systems, Scribe for log collection, Hadoop for storage, and Hive for analytics, combine into a data warehouse that stores more than 15 PB of data and loads more than 60 TB of new data every day. It also covers the motivations behind the design, the challenges of day-to-day operations, and planned improvements.

Abstract

Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook, both engineering and non-engineering. Apart from ad hoc analysis of data and creation of business intelligence dashboards by analysts across the company, a number of Facebook's site features are also based on analyzing large data sets. These features range from simple reporting applications like Insights for Facebook advertisers to more advanced kinds such as friend recommendations. In order to support this diversity of use cases on an ever-increasing amount of data, a flexible infrastructure that scales up in a cost-effective manner is critical. We have leveraged, authored and contributed to a number of open source technologies in order to address these requirements at Facebook. These include Scribe, Hadoop and Hive, which together form the cornerstones of the log collection, storage and analytics infrastructure at Facebook. In this paper we present how these systems have come together and enabled us to implement a data warehouse that stores more than 15PB of data (2.5PB after compression) and loads more than 60TB of new data (10TB after compression) every day. We discuss the motivations behind our design choices, the capabilities of this solution, the challenges that we face in day-to-day operations, and the future capabilities and improvements that we are working on.
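
The storage figures in the abstract imply a compression ratio that is easy to check by hand. The sketch below is only illustrative arithmetic on the numbers quoted above; the function name `compression_ratio` is a hypothetical helper, not something from the paper, and since the abstract gives lower bounds ("more than"), the result should be read as approximate.

```python
def compression_ratio(raw: float, compressed: float) -> float:
    """Return raw size divided by compressed size (both in the same unit)."""
    if compressed <= 0:
        raise ValueError("compressed size must be positive")
    return raw / compressed


# Figures quoted in the abstract (lower bounds, so the ratios are approximate).
warehouse_ratio = compression_ratio(15.0, 2.5)  # total warehouse, in PB
daily_ratio = compression_ratio(60.0, 10.0)     # daily load, in TB

print(f"warehouse: ~{warehouse_ratio:.0f}x, daily load: ~{daily_ratio:.0f}x")
```

Both pairs of figures work out to roughly a sixfold reduction, so the stated totals and the stated daily load are consistent with each other.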
