
Publication | Closed Access

Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers

Citations: 63

References: 12

Year: 2015

Abstract

Identifying and extracting figures and tables along with their captions from scholarly articles is important both as a way of providing tools for article summarization and as part of larger systems that seek to gain a deeper, semantic understanding of these articles. While many "off-the-shelf" tools (e.g., PDFBox and Poppler) can extract embedded images from these documents, they are unable to extract tables, captions, and figures composed of vector graphics. Our proposed approach analyzes the structure of individual pages of a document by detecting chunks of body text, and locates the areas wherein figures or tables could reside by reasoning about the empty regions within that text. This method can extract a wide variety of figures because it does not make strong assumptions about the format of the figures embedded in the document, as long as they can be differentiated from the main article's text. Our algorithm also includes a caption-to-figure matching component that is effective even in cases where individual captions are adjacent to multiple figures. Our contribution also includes methods for leveraging particular consistency and formatting assumptions to identify titles, body text and captions within each article. We introduce a new dataset of 150 computer science papers along with ground truth labels for the locations of the figures, tables and captions within them. Our algorithm achieves 96% precision at 92% recall when tested against this dataset, surpassing the previous state of the art. We release our dataset, code, and evaluation scripts on our project website to enable future research.
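
The idea of locating figures by reasoning about the empty regions within body text can be illustrated with a minimal sketch. This is not the authors' released code: it assumes a single-column page whose body-text blocks have already been detected, a coordinate system in which y grows downward, and a hypothetical `Box` type, `empty_bands` function and `min_height` threshold introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle; y grows downward (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def empty_bands(content: Box, text_blocks: List[Box],
                min_height: float = 40.0) -> List[Box]:
    """Return full-width bands of `content` that no text block covers.

    Each band at least `min_height` tall is a candidate figure/table region.
    """
    # Vertical extents of the text blocks, sorted top to bottom.
    spans = sorted((b.y0, b.y1) for b in text_blocks)
    bands = []
    cursor = content.y0  # lowest y already covered by text
    for top, bottom in spans:
        if top - cursor >= min_height:
            bands.append(Box(content.x0, cursor, content.x1, top))
        cursor = max(cursor, bottom)
    # Check the gap between the last text block and the bottom of the area.
    if content.y1 - cursor >= min_height:
        bands.append(Box(content.x0, cursor, content.x1, content.y1))
    return bands


# Illustrative layout: a paragraph, a gap, a caption line, more body text.
content_area = Box(50, 50, 550, 750)
blocks = [
    Box(50, 50, 550, 200),    # body text
    Box(50, 420, 550, 440),   # caption
    Box(50, 460, 550, 750),   # body text
]
candidates = empty_bands(content_area, blocks)
```

In this layout the only candidate is the band between y=200 and y=420, directly above the caption block. A fuller system would also need to handle multi-column pages and captions adjacent to several candidate regions, which this sketch does not attempt.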
