Publication | Closed Access
Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers
Citations: 63
References: 12
Year: 2015
Unknown Venue
Identifying and extracting figures and tables along with their captions from scholarly articles is important both as a way of providing tools for article summarization, and as part of larger systems that seek to gain deeper, semantic understanding of these articles. While many “off-the-shelf” tools exist that can extract embedded images from these documents, e.g. PDFBox, Poppler, etc., these tools are unable to extract tables, captions, and figures composed of vector graphics. Our proposed approach analyzes the structure of individual pages of a document by detecting chunks of body text, and locates the areas wherein figures or tables could reside by reasoning about the empty regions within that text. This method can extract a wide variety of figures because it does not make strong assumptions about the format of the figures embedded in the document, as long as they can be differentiated from the main article’s text. Our algorithm also demonstrates a caption-to-figure matching component that is effective even in cases where individual captions are adjacent to multiple figures. Our contribution also includes methods for leveraging particular consistency and formatting assumptions to identify titles, body text and captions within each article. We introduce a new dataset of 150 computer science papers along with ground truth labels for the locations of the figures, tables and captions within them. Our algorithm achieves 96% precision at 92% recall when tested against this dataset, surpassing previous state of the art. We release our dataset, code, and evaluation scripts on our project website for enabling future research.
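The idea of locating candidate figure areas by reasoning about empty regions between chunks of body text can be illustrated with a small sketch. This is not the authors' algorithm: it is a heavily simplified, single-column version that assumes body text has already been detected as bounding boxes. The names `Box`, `empty_vertical_regions`, and `min_height` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1), with y growing downward.
Box = Tuple[float, float, float, float]


def empty_vertical_regions(
    page_box: Box, text_boxes: List[Box], min_height: float = 50.0
) -> List[Box]:
    """Return full-width page regions not covered by any body-text box.

    Simplifying assumption: a single column, so only vertical gaps
    between text blocks are considered.
    """
    x0, top, x1, bottom = page_box
    # Vertical extents of the text blocks, sorted top to bottom.
    spans = sorted((b[1], b[3]) for b in text_boxes)

    regions: List[Box] = []
    cursor = top  # lowest y covered by text so far
    for y0, y1 in spans:
        # A tall enough gap above this block is a candidate region.
        if y0 - cursor >= min_height:
            regions.append((x0, cursor, x1, y0))
        cursor = max(cursor, y1)
    # Check the gap between the last text block and the page bottom.
    if bottom - cursor >= min_height:
        regions.append((x0, cursor, x1, bottom))
    return regions


page = (0, 0, 600, 800)
body_text = [(50, 40, 550, 200), (50, 450, 550, 760)]
print(empty_vertical_regions(page, body_text))  # [(0, 200, 600, 450)]
```

In this toy page, the gap between the two text blocks is reported as a candidate region, while the narrow top and bottom margins fall under the assumed `min_height` threshold and are ignored. A real system would also need to handle multiple columns and match each region to a caption.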