Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers

Abstract

Identifying and extracting figures and tables along with their captions from scholarly articles is important both as a way of providing tools for article summarization, and as part of larger systems that seek to gain a deeper, semantic understanding of these articles. While many “off-the-shelf” tools exist that can extract embedded images from these documents, e.g. PDFBox, Poppler, etc., these tools are unable to extract tables, captions, and figures composed of vector graphics. Our proposed approach analyzes the structure of individual pages of a document by detecting chunks of body text, and locates the areas wherein figures or tables could reside by reasoning about the empty regions within that text. This method can extract a wide variety of figures because it does not make strong assumptions about the format of the figures embedded in the document, as long as they can be differentiated from the main article’s text. Our algorithm also includes a caption-to-figure matching component that is effective even in cases where individual captions are adjacent to multiple figures. Our contribution also includes methods for leveraging particular consistency and formatting assumptions to identify titles, body text and captions within each article. We introduce a new dataset of 150 computer science papers along with ground truth labels for the locations of the figures, tables and captions within them. Our algorithm achieves 96% precision at 92% recall when tested against this dataset, surpassing the previous state of the art. We release our dataset, code, and evaluation scripts on our project website to enable future research.
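The idea of reasoning about empty regions between chunks of body text can be illustrated with a small sketch. This is not the algorithm of the paper; it is a deliberately simplified, single-column illustration that assumes text blocks have already been detected as bounding boxes. The function name `empty_vertical_regions`, the `min_height` threshold, and the example coordinates are all hypothetical.

```python
from typing import List, Tuple

# (x0, y0, x1, y1) in page units, with y growing downward (an assumption).
Box = Tuple[float, float, float, float]


def empty_vertical_regions(
    text_boxes: List[Box],
    page_height: float,
    min_height: float = 50.0,  # hypothetical threshold, not from the paper
) -> List[Tuple[float, float]]:
    """Return (top, bottom) spans of the page not covered by any text box."""
    # Sort the vertical extents of the text blocks from top to bottom.
    spans = sorted((y0, y1) for _, y0, _, y1 in text_boxes)
    gaps = []
    cursor = 0.0  # lowest y covered so far, starting at the page top
    for top, bottom in spans:
        # A tall enough gap above this block is a candidate region.
        if top - cursor >= min_height:
            gaps.append((cursor, top))
        cursor = max(cursor, bottom)
    # Check the space between the last text block and the page bottom.
    if page_height - cursor >= min_height:
        gaps.append((cursor, page_height))
    return gaps


# Two hypothetical body-text blocks on a 792-unit-tall page.
boxes = [(50, 40, 550, 200), (50, 420, 550, 700)]
candidates = empty_vertical_regions(boxes, page_height=792)
print(candidates)
```

In this sketch the gap between the two text blocks is reported as a candidate region, and so is the bottom margin. A real system would need further checks (for example, whether a caption is adjacent or whether the region contains any graphical content) before treating a candidate as a figure or table.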
