Available

Project reference

2024_P5_Granados_Shapey

Start date

June or October 2025

Vision, audio, and language models for surgical interventions

The aim of this project is to propose vision and language models that support analysis and prediction of the surgical team's performance in the operating room, facilitating human-in-the-loop artificial intelligence. This PhD project will be carried out in the context of neurosurgical interventions, particularly endoscopic transsphenoidal pituitary surgery (eTSS), for which our group is curating a multimodal dataset. The specific objectives of this research are to:

1) investigate audio and language models that automatically transcribe audio into language;
2) design vision-audio-and-language models of surgical data captured in the mock operating room to assess communication within the surgical team;
3) design vision-audio-and-language models of surgical data captured in the mock operating room to process feedback articulated by experts and deliver it to trainees automatically; and
4) evaluate the performance of such models retrospectively on publicly available datasets and on a private collection of surgical videos from the mock operating room.
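
As an illustration of the first objective, the minimal sketch below shows how recorded operating-room audio could be transcribed into text with a pretrained speech recognition model. The library, model name, and audio file are assumptions made for illustration and are not specified by the project.

```python
# Minimal sketch of objective 1: turning operating-room audio into text with a
# pretrained automatic speech recognition (ASR) model. The model name
# ("openai/whisper-small") and audio file are illustrative assumptions only.
from transformers import pipeline

# Load a pretrained ASR pipeline (Whisper is one common choice).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Transcribe a recorded clip from the mock operating room (hypothetical file).
result = asr("mock_or_audio.wav")
print(result["text"])  # transcript, ready for downstream language-model analysis
```

The resulting transcript could then serve as the language input to the vision-audio-and-language models described in objectives 2 and 3.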