EndoNet: A Deep Architecture for Recognition Tasks on Laparoscopic Videos

TLDR

Surgical workflow recognition, which is useful for indexing surgical video databases and optimizing operating room scheduling, has been studied across several kinds of surgery, typically using handcrafted visual features or tool usage signals obtained through manual annotation or additional equipment. The authors introduce EndoNet, a CNN that learns visual features automatically from cholecystectomy videos and jointly performs phase recognition and tool presence detection in a multi-task setting, using only visual input. According to the authors, it is the first CNN-based approach for multiple recognition tasks on laparoscopic videos. Experimental comparisons with other methods show that EndoNet achieves state-of-the-art results on both tasks.

Abstract

Surgical workflow recognition has numerous potential medical applications, such as the automatic indexing of surgical video databases and the optimization of real-time operating room scheduling. As a result, surgical phase recognition has been studied in the context of several kinds of surgeries, such as cataract, neurological, and laparoscopic surgeries. In the literature, two types of features are typically used to perform this task: visual features and tool usage signals. However, the visual features used are mostly handcrafted. Furthermore, the tool usage signals are usually collected via a manual annotation process or by using additional equipment. In this paper, we propose a novel method for phase recognition that uses a convolutional neural network (CNN) to automatically learn features from cholecystectomy videos and that relies solely on visual information. Previous studies have shown that tool usage signals can provide valuable information for the phase recognition task. Thus, we present a novel CNN architecture, called EndoNet, that is designed to carry out the phase recognition and tool presence detection tasks in a multi-task manner. To the best of our knowledge, this is the first work proposing to use a CNN for multiple recognition tasks on laparoscopic videos. Experimental comparisons to other methods show that EndoNet yields state-of-the-art results for both tasks.
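
To make the multi-task idea concrete, the following is a minimal sketch of how one shared feature vector could feed two prediction heads: a softmax head for the single current phase and a sigmoid head for the tools that may be present at the same time. The abstract does not specify the layers, the loss, or the number of phases and tools, so everything here is an assumption for illustration only: the function names, the plain sum of the two losses, and the toy sizes (3 phases, 2 tools, 2 features) are hypothetical and are not taken from EndoNet.

```python
import math

def linear(weights, bias, features):
    """Apply one fully connected layer; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def multitask_loss(features, phase_head, tool_head, phase_label, tool_labels):
    """Joint loss over two heads that share the same input features.

    phase_head and tool_head are (weights, bias) pairs (hypothetical layout).
    Phase recognition is treated as multi-class (exactly one phase per frame);
    tool presence is treated as multi-label (each tool present or absent).
    """
    phase_probs = softmax(linear(*phase_head, features))
    tool_probs = [sigmoid(z) for z in linear(*tool_head, features)]

    # Cross-entropy for the phase head.
    phase_loss = -math.log(phase_probs[phase_label])
    # Binary cross-entropy summed over tools for the tool head.
    tool_loss = -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(tool_labels, tool_probs)
    )
    # Assumed combination: an unweighted sum of the two task losses.
    return phase_probs, tool_probs, phase_loss + tool_loss

# Toy example with made-up weights standing in for learned parameters.
features = [0.5, -1.0]  # stand-in for features produced by the shared CNN
phase_head = ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])
tool_head = ([[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0])

phase_probs, tool_probs, loss = multitask_loss(
    features, phase_head, tool_head, phase_label=0, tool_labels=[1, 0]
)
```

Because both heads read the same features, the gradient of the summed loss would push the shared layers toward representations that serve both tasks, which is the usual motivation for a multi-task design like the one the abstract describes.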
