Exploiting multi-task learning for endoscopic vision in robotic surgery

Project reference: SIE_28_22
First supervisor:  Miaojing Shi
Second supervisor: Tom Vercauteren

Start date:  February 2022

Project summary: Multi-task learning is common in deep learning, and there is clear evidence that jointly learning correlated tasks can improve on the performance of each task learned individually. In practice, however, many tasks are still processed independently. The reasons are manifold: 1) many tasks are not strongly correlated, so joint learning may benefit only one of the tasks, or none; 2) multi-task learning scales poorly with the number of tasks, in terms of both network optimization and practical implementation. A scalable and robust multi-task learning strategy would nevertheless have substantial potential in many real applications, e.g. endoscopic image processing. This project studies multi-task learning in endoscopic vision for robotic surgery, with a particular focus on depth and optical flow estimation, surgical instrument detection and anatomy recognition, and surgical action recognition. The aim is to design effective multi-task learning strategies that improve performance on all tasks.


Project description: Multi-task learning is common in deep learning. For similar tasks such as detection and segmentation, or detection and counting, it has already been achieved by using the supervision of one task to help the other. There is clear evidence that adding a side task can improve the main task, yet it is unclear how much both tasks benefit in these combinations, especially if they are not strongly correlated. For this reason, multiple tasks are currently processed independently in most cases. Another main obstacle is the scalability of learning multiple tasks together, in terms of both network optimization and practical implementation. Tackling this requires careful design of how multiple tasks are combined; novel learning paradigms are also expected.

This project is placed in the endoscopic image processing domain. We aim to develop a machine learning model with general visual intelligence capacity in robotic surgery, which includes depth and optical flow estimation, surgical instrument detection and anatomy recognition, and surgical action recognition. Depth and optical flow estimation, together with anatomy recognition, are key requirements for developing autonomous robotic control schemes that are cognizant of the surgical scene. Automatic detection and tracking of surgical instruments in laparoscopic surgery videos further plays an important role in providing advanced surgical assistance to the clinical team, given the uncertainties associated with the kinematic chains of surgical robots and the potential presence of tools not directly manipulated by the robot. Knowing how many instruments are present and where they are has applications such as placing informative overlays on the screen, performing augmented reality without occluding instruments, visual servoing, and surgical task automation. Surgical action recognition is also critical for advancing autonomous robotic assistance during the procedure and for automated auditing purposes.


Year 1: In the first year, the student will focus on jointly learning two tasks in robotic surgery. The project will begin with easier combinations such as depth and optical flow estimation, or surgical instrument detection and anatomy recognition. Because these tasks are highly correlated, a typical multi-branch deep neural network can work. We then move on to tasks where one is built upon the other, for instance surgical instrument detection and surgical action recognition. They can be organized in a recurrent manner by encoding the results of their analysis in a common shared representation of the data [3].
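
As an illustration of the multi-branch idea, the following is a minimal sketch of a shared trunk with one output branch per task, trained with a weighted sum of per-task losses. It uses plain Python so that it runs without any deep learning library. The class name `TwoBranchModel`, the layer sizes, the choice of depth and optical flow as the two tasks, and the mean squared error loss are all assumptions made for the example, not the design the project commits to.

```python
import random


def init_layer(n_out, n_in, rng):
    """Create a dense layer as (weights, bias) with small random weights."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(weights, bias, x):
    """Apply a dense layer: one output value per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


class TwoBranchModel:
    """Shared trunk followed by one head per task (hypothetical sizes)."""

    def __init__(self, n_in=8, n_shared=16, n_depth=1, n_flow=2, seed=0):
        rng = random.Random(seed)
        self.trunk = init_layer(n_shared, n_in, rng)
        self.depth_head = init_layer(n_depth, n_shared, rng)
        self.flow_head = init_layer(n_flow, n_shared, rng)

    def forward(self, x):
        # Both heads read the same shared representation.
        shared = relu(linear(*self.trunk, x))
        return {
            "depth": linear(*self.depth_head, shared),
            "flow": linear(*self.flow_head, shared),
        }


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(outputs, targets, task_weights):
    """Weighted sum of per-task losses; the weights are a tuning choice."""
    return sum(task_weights[k] * mse(outputs[k], targets[k]) for k in outputs)


model = TwoBranchModel()
outputs = model.forward([0.1] * 8)
loss = joint_loss(outputs, {"depth": [0.0], "flow": [0.0, 0.0]},
                  {"depth": 1.0, "flow": 1.0})
```

In a real system the trunk would be a convolutional or transformer encoder over endoscopic frames, and the gradient of the joint loss would update the trunk from both tasks at once, which is where the benefit of correlated tasks is expected to come from.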

Year 2: In the second year, the student will expand from two tasks to multiple tasks. When many tasks are not directly related, the quality of the predictions is often observed to suffer from so-called negative transfer among tasks. This is the main challenge to tackle in Year 2.
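
One common way to make negative transfer concrete is to look at the gradients that two tasks produce on the shared parameters: when their cosine similarity is negative, a step that helps one task hurts the other. The sketch below measures that conflict and removes the opposing component with a projection in the style of PCGrad. This is an illustrative technique chosen for the example, not the method the project proposes; the function names and the toy gradient vectors are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two gradient vectors (0.0 if either is zero)."""
    norm_a, norm_b = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)


def project_conflict(grad, other):
    """Remove from `grad` the component that opposes `other`.

    If the two gradients do not conflict (non-negative dot product),
    `grad` is returned unchanged.
    """
    d = dot(grad, other)
    sq_norm = dot(other, other)
    if d >= 0.0 or sq_norm == 0.0:
        return list(grad)
    scale = d / sq_norm
    return [g - scale * o for g, o in zip(grad, other)]


# Toy gradients of two tasks with respect to the shared parameters.
grad_detection = [1.0, 0.0]
grad_action = [-1.0, 1.0]

conflict = cosine(grad_detection, grad_action)               # negative: the tasks disagree
adjusted = project_conflict(grad_detection, grad_action)     # [0.5, 0.5]
```

After projection the adjusted detection gradient is orthogonal to the action gradient, so following it no longer increases the action loss to first order.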

Year 3: The multi-task learning in Years 1 and 2 assumes sufficient data and supervision on all tasks. The third-year plan will focus on the insufficient supervision of some tasks in practice. For some complicated tasks, e.g. surgical action recognition, sufficient and strong annotations are essentially unavailable, owing to both the lack of professional annotators and the complexity and cost of annotation. The target is thus to find intrinsic connections between multiple tasks such that information can be transferred from tasks with sufficient supervision to those without.
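
A simple baseline for training under partial supervision is to let each sample contribute only to the losses of the tasks it is annotated for, so that well-labelled tasks still shape the shared representation used by poorly labelled ones. The sketch below shows such a masked loss in plain Python. The function name `masked_multitask_loss`, the task names, and the convention that `None` marks a missing annotation are assumptions for the example; the project's actual transfer mechanism is not specified in the text above.

```python
def masked_multitask_loss(sample_losses, task_weights):
    """Combine per-sample, per-task losses while skipping missing labels.

    `sample_losses` is a list with one dict per sample, mapping a task name
    to its loss value, or to None when the sample has no annotation for it.
    Each task's loss is averaged over its labelled samples only; tasks with
    no labelled sample in the batch contribute nothing.
    """
    totals = {task: 0.0 for task in task_weights}
    counts = {task: 0 for task in task_weights}
    for losses in sample_losses:
        for task, value in losses.items():
            if value is None:
                continue  # no annotation for this task on this sample
            totals[task] += value
            counts[task] += 1
    return sum(
        task_weights[task] * totals[task] / counts[task]
        for task in task_weights
        if counts[task] > 0
    )


# Two samples: both have instrument labels, only the second has an action label.
batch = [
    {"instrument": 0.2, "action": None},
    {"instrument": 0.4, "action": 1.0},
]
loss = masked_multitask_loss(batch, {"instrument": 1.0, "action": 1.0})
```

Here the instrument loss is averaged over two samples and the action loss over one, so scarce action labels are not diluted by the unlabelled samples in the batch.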