TLDR

Data drives machine learning, yet collecting real data is costly, laborious, and raises privacy and bias concerns, while synthetic data can provide cheap, richly annotated, controllable datasets but lacks mature generation tools. The authors introduce Kubric, an open‑source Python framework designed to generate photo‑realistic scenes with rich annotations at scale, aiming to fill the gap in data‑generation tooling. Kubric integrates PyBullet physics simulation with Blender rendering, supports distributed execution across thousands of machines, and outputs terabytes of annotated data, with all code, assets, and datasets released for reuse. The authors validate Kubric by releasing 13 diverse synthetic datasets that support tasks such as 3D NeRF modeling and optical flow estimation, demonstrating its versatility and effectiveness.

Abstract

Data is the driving force of machine learning: the amount and quality of training data often matter more for a system's performance than architecture and training details. But collecting, processing, and annotating real data at scale is difficult and expensive, and frequently raises additional privacy, fairness, and legal concerns. Synthetic data is a powerful tool with the potential to address these shortcomings: 1) it is cheap, 2) it supports rich ground-truth annotations, 3) it offers full control over the data, and 4) it can circumvent or mitigate problems regarding bias, privacy, and licensing. Unfortunately, software tools for effective data generation are less mature than those for architecture design and training, which leads to fragmented generation efforts. To address these problems we introduce Kubric, an open-source Python framework that interfaces with PyBullet and Blender to generate photo-realistic scenes with rich annotations, and that scales seamlessly to large jobs distributed over thousands of machines, generating TBs of data. We demonstrate the effectiveness of Kubric by presenting a series of 13 different generated datasets for tasks ranging from studying 3D NeRF models to optical flow estimation. We release Kubric, the assets used, all of the generation code, and the rendered datasets for reuse and modification.
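
The claim that synthetic data comes with ground-truth annotations and full control over the data can be illustrated with a toy sketch. The code below is not Kubric and does not use its API, PyBullet, or Blender; the function `render_toy_scene` and all of its parameters are hypothetical names chosen for illustration. It rasterizes a few random rectangles in pure Python and returns, alongside the image, a per-pixel segmentation mask and bounding boxes that are exact by construction because the generator itself placed every object.

```python
import random


def render_toy_scene(seed, width=16, height=12, num_objects=3):
    """Rasterize random axis-aligned rectangles; return the image plus annotations.

    Hypothetical toy generator for illustration only (not Kubric's API).
    """
    rng = random.Random(seed)  # seeded: the same seed reproduces the same scene
    image = [[0] * width for _ in range(height)]         # 0 = background intensity
    segmentation = [[0] * width for _ in range(height)]  # 0 = background id
    boxes = {}
    for object_id in range(1, num_objects + 1):
        w = rng.randint(2, width // 2)
        h = rng.randint(2, height // 2)
        x0 = rng.randint(0, width - w)
        y0 = rng.randint(0, height - h)
        intensity = rng.randint(50, 255)
        for y in range(y0, y0 + h):
            for x in range(x0, x0 + w):
                image[y][x] = intensity
                segmentation[y][x] = object_id  # later objects occlude earlier ones
        # Box covers the object's full extent, including any occluded part.
        boxes[object_id] = (x0, y0, x0 + w, y0 + h)
    return image, segmentation, boxes


image, segmentation, boxes = render_toy_scene(seed=0)
```

Because the scene is produced from a seed, rerunning with the same seed yields identical pixels and labels, and changing `num_objects` or the image size changes the data distribution directly. A real pipeline replaces the rectangle rasterizer with a physics simulator and a renderer, but the principle that annotations fall out of the generation process is the same.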
