
Publication | Closed Access

Skluma: An Extensible Metadata Extraction Pipeline for Disorganized Data

Year: 2018 | Citations: 16 | References: 31

Abstract

To mitigate the effects of high-velocity data expansion and to automate the organization of filesystems and data repositories, we have developed Skluma, a system that automatically processes a target filesystem or repository, extracts content- and context-based metadata, and organizes the extracted metadata for subsequent use. Skluma can extract diverse metadata, including aggregate values derived from embedded structured data; named entities and latent topics buried within free-text documents; and content encoded in images. Skluma implements an overarching probabilistic pipeline to extract increasingly specific metadata from files: it applies machine learning methods to determine file types, dynamically prioritizes and then executes a suite of metadata extractors, and explores contextual metadata based on relationships among files. The derived metadata, represented in JSON, describes probabilistic knowledge of each file that may subsequently be used for discovery or organization. Skluma's architecture enables it to be deployed locally or used as an on-demand, cloud-hosted service that creates and executes dynamic extraction workflows on massive numbers of files. It is modular and extensible, allowing users to contribute their own specialized metadata extractors. Thus far we have tested Skluma on local filesystems, remote FTP-accessible servers, and publicly accessible Globus endpoints. We have demonstrated its efficacy by applying it to a scientific environmental data repository of more than 500,000 files, from which we extracted metadata at modest cloud cost in a few hours.
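The abstract describes a pipeline in which extractors are prioritized by predicted file-type probability and emit JSON metadata. The following is a minimal illustrative sketch of that idea, not Skluma's actual API; all names (`register`, `process`, the extractor functions, the 0.5 threshold) are hypothetical.

```python
import json

# Hypothetical sketch of an extensible extractor pipeline in the spirit of
# Skluma: extractors register for a predicted file type, are run in order of
# type probability, and contribute to a JSON metadata record.
EXTRACTORS = []

def register(file_type):
    """Register an extractor function for a predicted file type."""
    def wrap(fn):
        EXTRACTORS.append((file_type, fn))
        return fn
    return wrap

@register("tabular")
def extract_tabular(text):
    # Aggregate values from embedded structured (CSV-like) data.
    rows = [line.split(",") for line in text.strip().splitlines()]
    nums = [float(v) for row in rows[1:] for v in row
            if v.replace(".", "", 1).isdigit()]
    return {"columns": rows[0],
            "mean": sum(nums) / len(nums) if nums else None}

@register("free_text")
def extract_free_text(text):
    # Trivial stand-in for topic/entity extraction from free text.
    words = text.split()
    return {"word_count": len(words)}

def process(text, type_probs):
    """Run extractors in descending order of predicted type probability,
    skipping unlikely types, and return a JSON metadata record."""
    record = {"type_probabilities": type_probs, "metadata": {}}
    ranked = sorted(EXTRACTORS, key=lambda e: -type_probs.get(e[0], 0.0))
    for ftype, fn in ranked:
        if type_probs.get(ftype, 0.0) > 0.5:  # illustrative cutoff
            record["metadata"][ftype] = fn(text)
    return json.dumps(record)

sample = "temp,depth\n1.5,10\n2.5,20"
print(process(sample, {"tabular": 0.9, "free_text": 0.2}))
```

In this toy version, only the `tabular` extractor fires because its predicted probability exceeds the cutoff; a real system would learn the type model and let contributed extractors refine the record iteratively.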

