Bidirectional long short-term memory for surgical skill classification of temporally segmented tasks

Jason D Kelly, Ashley Petersen, Thomas S. Lendvay, Timothy M. Kowalewski

Research output: Contribution to journal › Article › peer-review

7 Scopus citations


Purpose: Most prior surgical skill research analyzes holistic, task-level summary metrics to classify the skill of a performance. Recent advances in machine learning enable time-series classification at the sub-task level, so predictions can be made on segments of tasks, which could improve task-level technical skill assessment.

Methods: A bidirectional long short-term memory (LSTM) network was applied to 8-s windows of multidimensional time-series data from the Basic Laparoscopic Urologic Skills dataset. The network was trained on expert and novice performances from four common surgical tasks. Stratified cross-validation with regularization was used to avoid overfitting. Misclassified cases were re-submitted to crowds on Amazon Mechanical Turk for surgical technical skill assessment, both to re-evaluate them and to analyze the level of agreement with previous scores.

Results: Performance was best for the suturing task: 96.88% accuracy at predicting whether a performance was expert or novice, with 1 misclassification, when compared with previously obtained crowd evaluations. When compared with expert surgeon ratings, the LSTM predictions yielded a Spearman coefficient of 0.89 for suturing tasks. When crowds re-evaluated the misclassified performances, for all 5 misclassified cases from the peg transfer and suturing tasks the crowds agreed more with our LSTM model than with the previously obtained crowd scores.

Conclusion: The technique presented yields results comparable to the labels that would be obtained from crowd-sourced assessment of surgical tasks. However, these results raise questions about the reliability of crowd-sourced labels for videos of surgical tasks. We, as a research community, should examine crowd labeling with greater scrutiny, systematically investigate biases, and quantify label noise.
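The windowing idea described in the Methods can be illustrated with a minimal sketch. It assumes a fixed sampling rate, non-overlapping windows, and a majority vote to turn window-level predictions into a task-level label; the abstract does not state any of these choices, and the function names are hypothetical. The classifier itself (the bidirectional LSTM) is not reproduced here.

```python
from collections import Counter
from typing import List, Sequence


def segment_windows(
    samples: Sequence[Sequence[float]],
    sample_rate_hz: float,
    window_s: float = 8.0,
) -> List[List[Sequence[float]]]:
    """Split a (time x channels) recording into fixed-length windows.

    Assumption: windows do not overlap and a trailing partial window
    is dropped. The sampling rate is supplied by the caller.
    """
    window_len = int(round(sample_rate_hz * window_s))
    if window_len <= 0:
        raise ValueError("window length must be at least one sample")
    n_windows = len(samples) // window_len
    return [
        list(samples[i * window_len:(i + 1) * window_len])
        for i in range(n_windows)
    ]


def aggregate_task_label(window_labels: Sequence[str]) -> str:
    """Combine per-window predictions into one label by majority vote.

    Assumption: majority voting is an illustrative choice; on a tie,
    the label that appeared first among the tied labels wins.
    """
    if not window_labels:
        raise ValueError("need at least one window prediction")
    return Counter(window_labels).most_common(1)[0][0]


# Hypothetical recording: 20 s at an assumed 30 Hz with 3 channels.
recording = [[0.0, 0.0, 0.0] for _ in range(600)]
windows = segment_windows(recording, sample_rate_hz=30.0)

# Placeholder per-window predictions standing in for classifier output.
task_label = aggregate_task_label(["expert", "novice", "expert"])
```

With these assumed values each window holds 240 samples, so the 600-sample recording yields two full windows and the remaining 120 samples are discarded.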

Original language: English (US)
Pages (from-to): 2079-2088
Number of pages: 10
Journal: International Journal of Computer Assisted Radiology and Surgery
Issue number: 12
State: Published - Dec 2020

Bibliographical note

Funding Information:
This work was supported, in part, by the Office of the Assistant Secretary of Defense for Health Affairs under Award No. W81XWH-15-2-0030, the National Science Foundation M3X CAREER grant under Award No. 1847610, as well as the National Institutes of Health’s National Center for Advancing Translational Sciences, Grant UL1TR002494. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the Department of Defense, the National Science Foundation, or the National Institutes of Health’s National Center for Advancing Translational Sciences.

Publisher Copyright:
© 2020, CARS.


  • Bidirectional LSTM
  • Crowd sourcing
  • Machine learning
  • Surgical skill
  • Surgical technical skill

