# A preliminary deep learning study on automatic segmentation of contrast-enhanced boluses in videofluorography of swallowing

### Ethical considerations

This study was conducted with the approval of the Ethics Committee of the Aichi Gakuin University School of Dentistry (#586) and in accordance with the Declaration of Helsinki. It was a non-invasive observational study using only existing anonymized video data. Subjects were given the opportunity to refuse participation through an opt-out procedure. The Ethics Committee of the Aichi Gakuin University School of Dentistry waived the requirement of informed consent from all participants.

### Participants

Twelve patients participated (seven men and five women; mean age, 58.4 ± 23.3 years; age range, 20-89 years) who attended the outpatient swallowing clinic of our hospital between November 2018 and January 2020; all underwent videofluorography (VFG) for examination of swallowing function.

### Videofluorography

Patients sat on a VFG chair (MK-102, Tomomi-koubou, Shimane, Japan) in normal eating position without head fixation; they were examined with a fluorographic machine (DCW-30A, Canon Medical Systems, Tokyo, Japan).

The contrast sample was prepared with 50 mL of 50% w/v barium sulfate (Baritogen Deluxe, Fushimi Laboratory, Kagawa, Japan) mixed with thickener (Throsoft Liquid 12 g/pack, Kissei Pharmaceutical Co. Ltd, Nagano, Japan). The concentration of this barium is much lower than that (200% w/v - 240% w/v) typically used for the upper gastrointestinal tract. This lower concentration could help prevent adhesion of barium to the mucosa of the oral and pharyngeal cavities while still providing images of sufficient quality. The examiner placed a spoonful of sample (approximately 5 mL) in the patient’s mouth, and the patient began swallowing it on the examiner’s signal. Swallowing tests with this sample were performed three times and moving images were recorded.

Subsequently, examinations were performed using a 50 mL sample of 50% w/v barium sulfate (Baritogen Deluxe) without thickener. The patient was instructed to take a portion of the sample (approximately 5 mL) from a paper cup into the mouth and begin swallowing on the examiner’s signal. Swallowing tests with this sample were performed three times and moving images were recorded. Consequently, six swallow tests were performed for each patient.

Diagnoses of swallowing function based on VFG images were made with the mutual consent of two radiologists and an oral surgeon with more than 20 years of experience. The presence or absence of residual contrast-enhanced bolus events and aspiration/penetration were assessed on VFG images. The severity of dysphagia was graded with the penetration-aspiration scale11: seven patients in this study had healthy swallowing function, while five patients showed aspiration or laryngeal penetration.

### Image Preparation

VFG images (oral to pharyngeal phases) were continuously converted at 15 static images per second. Still images were standardized to a size of 256 × 256 pixels by cutting off extra space at the top and front of the images and then saved in JPEG format (Fig. 2).
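The cropping and resizing step above can be illustrated with a short numpy sketch. The function name `standardize_frame` and the choice of which rows/columns to discard (top rows and leading columns) are assumptions for illustration; the study performed this step with image-editing tools, and the exact crop region is not specified beyond "extra space at the top and front":

```python
import numpy as np

def standardize_frame(frame, size=256):
    """Crop extra space at the top and front (assumed left) of the frame,
    then resize to size x size with nearest-neighbour sampling."""
    h, w = frame.shape[:2]
    side = min(h, w)
    # keep the bottom-right square region: drop rows from the top,
    # columns from the front
    cropped = frame[h - side:, w - side:]
    # nearest-neighbour resize via integer index mapping
    idx = np.arange(size) * side // size
    return cropped[np.ix_(idx, idx)]
```

In practice a library resampler (e.g. bilinear interpolation) would be preferable to nearest-neighbour sampling; the sketch only shows the geometry of the standardization.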

### Assignment to training, validation and test data sets

Images were arbitrarily assigned to training, validation, and test data sets (Table 2). For the training data set, 1845 static images were used, including 1005 static images of 18 swallows in three patients with healthy swallowing and 840 static images of 12 swallows in two patients with laryngeal aspiration or invasion. For the validation data set, 155 static images of six swallows in a healthy swallowing patient were used. As test data set 1, 510 static images of 18 swallows in three patients with healthy swallowing were used. As test dataset 2, 1400 static images of 18 swallows in three patients with laryngeal aspiration or invasion were used.

### Deep learning system

The deep learning system was built on a Windows PC with an NVIDIA GeForce 11 GB GPU (NVIDIA, Santa Clara, CA, USA) and 128 GB of memory. The deep learning segmentation procedure was performed using a U-Net built on the Neural Network Console (Sony, Tokyo, Japan). U-Net is a neural network for fast and accurate image segmentation, composed of a symmetric encoder-decoder structure, as shown in Fig. 3.

### Annotation

For the training and validation data sets, images were created in which contrast-enhanced areas of the bolus were segmented and colorized using Photoshop (Adobe, Tokyo, Japan); these were used in addition to the original images (Fig. 4). The bolus in the still images had very strong contrast and was easy to delineate. In the annotation work, a radiologist with more than 30 years of experience performed the segmentation of the contrast-enhanced bolus areas, and another radiologist with more than 20 years of experience confirmed them. If the latter determined that an annotation was incorrect, the two radiologists discussed and corrected it. The number of revisions was less than 0.5% of the total images.

### Training process

The training process was performed with a U-Net neural network using training and validation data sets paired with the original and colorized images (Fig. 5). U-Net is a convolutional neural network for semantic segmentation of lesions or tissues in images, and has a nearly symmetric encoder-decoder structure7,8. The encoder module downsamples the image, reducing the resolution of the feature maps to capture high-level image features. The decoder module consists of a set of layers that upsample the encoder’s feature maps to recover spatial information. Learning continued until the training loss was sufficiently small on the learning curve; in total, 500 learning epochs were performed. Thereafter, a trained model was created.
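The encoder-decoder symmetry described above can be sketched in a few lines of numpy. This is an illustrative shape-level sketch only, not the Neural Network Console model: a real U-Net uses learned convolutions and concatenates skip features along a channel axis, whereas here pooling/upsampling and simple addition stand in for those operations:

```python
import numpy as np

def down(x):
    """Encoder step: 2x2 max-pooling halves the spatial resolution."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def up(x):
    """Decoder step: nearest-neighbour upsampling doubles the resolution."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def unet_shapes(x, depth=3):
    """Trace a feature map through a symmetric encoder-decoder path."""
    skips = []
    for _ in range(depth):
        skips.append(x)        # skip connection saved for the decoder
        x = down(x)
    for skip in reversed(skips):
        x = up(x) + skip       # merge upsampled features with the skip
    return x
```

The point of the sketch is that the output spatial size matches the input (e.g. 256 × 256 in, 256 × 256 out), which is what allows pixel-wise segmentation masks to be predicted.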

### Inference process

In the inference process, test data set 1 or 2 was applied to the trained model to evaluate the model (Fig. 5). Prior to evaluation, a radiologist identified the ground truth of the contrast-enhanced areas of the bolus on the test images. For the evaluation of the model, the Jaccard index (JI), the Sørensen-Dice coefficient (DSC) and the sensitivity were calculated according to the following equations12:

$${\text{JI}} = {\text{S}}\left( {{\text{P}} \cap {\text{G}}} \right)/{\text{S}}( {\text{P}} \cup {\text{G}})$$

$${\text{DSC}} = 2 \times {\text{S}}\left( {{\text{P}} \cap {\text{G}}} \right)/\left( {{\text{S}}\left( {\text{P}} \right) + {\text{S}}\left( {\text{G}} \right)} \right)$$

$${\text{Sensitivity}} = {\text{S}}\left( {{\text{P}} \cap {\text{G}}} \right)/{\text{S}}\left( {\text{G}} \right)$$

where S(P) was the colored bolus area in the images predicted by the learning model, and S(G) was the actual (ground-truth) bolus area. S(P ∩ G) was the overlapping area of P and G, and S(P ∪ G) was the combined area. The actual images and the images predicted by the deep learning model were overlaid, and the number of pixels in the above areas was counted using Photoshop.
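The three metrics can be computed directly from binary masks, which may clarify the pixel-counting step. The function name `segmentation_metrics` is illustrative; the study obtained the pixel counts manually in Photoshop rather than programmatically:

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Jaccard index, Dice coefficient and sensitivity for binary masks.

    pred: predicted bolus mask P, gt: ground-truth bolus mask G.
    """
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()  # S(P intersect G)
    union = np.logical_or(pred, gt).sum()   # S(P union G)
    ji = inter / union
    dsc = 2 * inter / (pred.sum() + gt.sum())
    sens = inter / gt.sum()
    return ji, dsc, sens
```

Note that JI and DSC are monotonically related (DSC = 2·JI/(1+JI)), so they rank models identically, while sensitivity isolates how much of the ground-truth bolus was recovered.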