Deep Learning–based Assessment of Oncologic Outcomes from Natural Language Processing of Structured Radiology Reports

TLDR

The study trains a deep NLP model on structured oncology reports to classify tumor response categories (TRCs) in free-text reports and compares its performance with human readers and conventional NLP algorithms. An automated data-mining pipeline extracts RECIST-related TRCs from the structured reports as ground truth, and a BERT model and three feature-rich algorithms are trained on these reports to predict TRCs in free-text reports, with F1 scores benchmarked against radiologists, medical students, and radiology technologist students. On 802 free-text reports in the test set, BERT achieved an F1 of 0.70, outperforming the best reference model, a linear support vector classifier (0.63), and technologist students (0.65), performing similarly to medical students (0.73), but falling short of radiologists (0.79). Lexical complexity and semantic ambiguity caused F1 score drops of up to 0.19. © RSNA, 2022.

Abstract

To train a deep natural language processing (NLP) model, using data mined structured oncology reports (SOR), for rapid tumor response category (TRC) classification from free-text oncology reports (FTOR) and to compare its performance with human readers and conventional NLP algorithms.

In this retrospective study, databases of three independent radiology departments were queried for SOR and FTOR dated from March 2018 to August 2021. An automated data mining and curation pipeline was developed to extract Response Evaluation Criteria in Solid Tumors-related TRCs for SOR for ground truth definition. The deep NLP bidirectional encoder representations from transformers (BERT) model and three feature-rich algorithms were trained on SOR to predict TRCs in FTOR. Models' F1 scores were compared against scores of radiologists, medical students, and radiology technologist students. Lexical and semantic analyses were conducted to investigate human and model performance on FTOR.

Oncologic findings and TRCs were accurately mined from 9653 of 12 833 (75.2%) queried SOR, yielding oncology reports from 10 455 patients (mean age, 60 years ± 14 [SD]; 5303 women) who met inclusion criteria. On 802 FTOR in the test set, BERT achieved better TRC classification results (F1, 0.70; 95% CI: 0.68, 0.73) than the best-performing reference linear support vector classifier (F1, 0.63; 95% CI: 0.61, 0.66) and technologist students (F1, 0.65; 95% CI: 0.63, 0.67), had similar performance to medical students (F1, 0.73; 95% CI: 0.72, 0.75), but was inferior to radiologists (F1, 0.79; 95% CI: 0.78, 0.81). Lexical complexity and semantic ambiguities in FTOR influenced human and model performance, revealing maximum F1 score drops of -0.17 and -0.19, respectively.

The developed deep NLP model reached the performance level of medical students but not radiologists in curating oncologic outcomes from radiology FTOR.

Keywords: Neural Networks, Computer Applications-Detection/Diagnosis, Oncology, Research Design, Staging, Tumor Response, Comparative Studies, Decision Analysis, Experimental Investigations, Observer Performance, Outcomes Analysis

Supplemental material is available for this article. © RSNA, 2022.
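The F1 scores and 95% CIs reported above can be illustrated with a minimal sketch. The abstract does not state how F1 was averaged across TRCs or how the intervals were derived, so macro averaging and a percentile bootstrap are assumptions here, as are the four RECIST-style labels (CR, PR, SD, PD) and the toy data; the function names `macro_f1` and `bootstrap_ci` are hypothetical.

```python
import random

# Hypothetical label set: the four standard RECIST response categories.
# The study's actual TRC labels are not listed in the abstract.
LABELS = ["CR", "PR", "SD", "PD"]


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def bootstrap_ci(y_true, y_pred, labels, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for macro F1 (assumed CI method)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        # Resample reports with replacement, keeping true/predicted pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(
            macro_f1([y_true[i] for i in idx], [y_pred[i] for i in idx], labels)
        )
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data, invented purely for illustration (not from the study).
y_true = ["CR", "PR", "SD", "PD", "PD", "SD"]
y_pred = ["CR", "PR", "SD", "PD", "SD", "SD"]

point = macro_f1(y_true, y_pred, LABELS)
low, high = bootstrap_ci(y_true, y_pred, LABELS)
print(f"macro F1 = {point:.2f} (95% CI: {low:.2f}, {high:.2f})")
```

Applying the same two functions to the predictions of each model and each reader group on a shared test set would give scores that can be compared directly, which is the kind of comparison the abstract describes.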
