## Ⅰ. Introduction

Pretreatment patient-specific delivery quality assurance (DQA) is critical for ensuring accurate dose delivery in advanced radiotherapy techniques, including intensity-modulated radiation therapy (IMRT), volumetric modulated arc therapy (VMAT), helical tomotherapy (HT), and stereotactic body radiation therapy [1-5].

HT is an IMRT delivery technique that enables highly accurate dose delivery and image acquisition using megavoltage (MV) computed tomography (CT) [6,7]. The HT system uses an MV linear accelerator mounted on a ring gantry, with beam modulation provided by a binary multileaf collimator (MLC) with 64 leaves [7]. Various treatment planning parameters can influence the DQA results and treatment plan quality in HT, including the pitch, field width (FW), leaf open time (LOT), and the planned and actual modulation factors (MFs) [6].

The DQA of HT is a laborious and complex procedure that encompasses various tasks such as the generation of DQA plans, setup of DQA devices, beam delivery, and analysis of DQA results. To handle all of these tasks, Dosimetry Check (DC) software is employed to perform DQA for HT using the exit beam fluence of the patient or phantom throughout dose reconstruction [8-10].

To reduce the workloads of dosimetrists and medical physicists, a previous study used statistical process control to identify treatment plan parameters associated with DQA failure in HT [6,11], finding LOT and pitch to be such parameters [6]. In addition, several studies have employed machine and deep learning models to shorten the DQA process while increasing its accuracy in radiation therapy [11-19]. However, many institutions still perform HT DQA using EBT films, cheese phantoms, and dedicated DQA devices, and there have been no published reports on the use of machine learning models, such as the k-nearest neighbor (KNN), support vector machine (SVM), and logistic regression (LR) models, to evaluate the impact of treatment planning parameters on HT DQA results and the accuracy with which those results can be predicted.

The purpose of this study was to evaluate the accuracy and impact of LOT and pitch using various machine learning models when EBT film-based delivery DQA is performed in HT.

## Ⅱ. Materials and methods

### 1. Data acquisition

In this study, 211 patients with successful (n=191) and failed (n=20) DQA measurements were randomly selected to evaluate the accuracy of each machine learning model. Patients with brain, head and neck (H&N), pelvic, prostate, and rectal cancer were included in this study. All selected patients were treated with tomotherapy (Accuray Inc., Sunnyvale, CA, USA).

### 2. Patient-specific DQA and data analysis

Treatment planning for all patients was performed using an HT planning station (Accuray Inc., Sunnyvale, CA, USA). A cylindrical solid water phantom (“Cheese phantom,” Accuray Inc., Sunnyvale, CA, USA) was used to create the DQA plan for every treatment plan. The center of an ionization chamber (IC; Exradin^{®} A1SL; Standard Imaging, Middleton, WI, USA) was positioned in the phantom and moved to a low-dose-gradient region in the target volume. The cheese phantom with the IC and Gafchromic EBT3 film (International Specialty Products, Wayne, NJ, USA) was used to measure the absolute dose and gamma values for all HT plans [10]. The differences between the calculated and measured point doses and dose distributions were computed using TomoTherapy DQA software (Accuray Inc., Sunnyvale, CA, USA). The absolute point dose difference (DD) and global gamma passing rate (GPR) were analyzed for all patients using a 10% threshold of the global maximum dose. The DD was evaluated against a tolerance of ±5%, whereas the GPR was evaluated with a 3%/3 mm criterion. If either criterion failed, the DQA was considered a failure [4,10]. We analyzed only the LOT and pitch parameters, as these were previously established to have the greatest impact on DQA results (Table 1). The mean and standard deviation of each of the two parameters were analyzed according to the DQA results, and the proportion of LOTs shorter than 100 ms was assessed [11].
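The two-criterion pass/fail logic can be sketched as a simple check. This is an illustrative pure-Python sketch, not the TomoTherapy DQA software's logic; the gamma passing-rate action level (`gpr_threshold`) is an assumed placeholder, since the text specifies only the ±5% point-dose tolerance and the 3%/3 mm gamma criterion.

```python
def dqa_passes(dd_percent, gpr_percent, dd_tol=5.0, gpr_threshold=90.0):
    """Return True only if BOTH criteria pass.

    dd_percent:    absolute point dose difference (%); tolerance is +/-5%.
    gpr_percent:   gamma passing rate (%) at 3%/3 mm.
    gpr_threshold: assumed action level (not stated in the text).
    """
    return abs(dd_percent) <= dd_tol and gpr_percent >= gpr_threshold

# One failed criterion is enough to fail the DQA as a whole
print(dqa_passes(2.1, 85.0))   # False: DD passes, GPR fails
print(dqa_passes(-3.0, 96.5))  # True: both criteria pass
```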

### 3. Classification and logistic regression models

#### 1) K-nearest neighbor (KNN)

The KNN algorithm is a supervised learning method designed to handle classification and regression problems using a feature space constructed from the training data. New data points are predicted according to their feature similarity to existing data points. The algorithm computes the distances between an unknown data point and the training data, identifies its nearest ‘k’ points, and assigns the unknown point to the class most common among them, where ‘k’ is the number of training points considered [12]. The distance between a new data point and its nearest ‘k’ neighbors is calculated using metrics such as the Euclidean, Manhattan, and Minkowski distances [13].
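As a minimal pure-Python sketch of this idea (not the implementation used in this study), a KNN classifier with a Minkowski distance might look as follows; the feature values below are hypothetical:

```python
from collections import Counter

def minkowski(a, b, p=2):
    """Minkowski distance; p=2 gives Euclidean, p=1 gives Manhattan."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn_predict(train_X, train_y, query, k=3, p=2):
    """Classify `query` by majority vote among its k nearest neighbors."""
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda xy: minkowski(xy[0], query, p))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical features: (proportion of LOTs < 100 ms, pitch)
X = [(0.15, 0.30), (0.20, 0.29), (0.55, 0.43), (0.60, 0.45)]
y = ["pass", "pass", "fail", "fail"]
print(knn_predict(X, y, (0.18, 0.31), k=3))  # "pass"
```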

#### 2) Support Vector Machine (SVM)

The SVM is a supervised machine learning method that determines a decision boundary that maximizes the margin for data classification. For a given dataset of features from two groups of patients, an SVM attempts to find the maximum-margin hyperplane between the two classes, maximizing the distance to the closest data points (the support vectors) on each side [14]. The SVM can employ a kernel that maps input features to a higher-dimensional space, thereby facilitating nonlinear predictive modeling [15].
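Once trained, a linear SVM's prediction reduces to the sign of w·x + b; a kernel such as the RBF replaces the inner product to enable nonlinear boundaries. The sketch below uses illustrative, unfitted weights in a hypothetical (LOT fraction, pitch) feature space, not coefficients from this study:

```python
import math

def decision(w, b, x):
    """Linear SVM decision value f(x) = w.x + b; its sign gives the class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel: an implicit mapping to a higher-dimensional space."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Illustrative hyperplane -- NOT fitted to real DQA data
w, b = (4.0, 1.0), -2.0
print("fail" if decision(w, b, (0.60, 0.45)) >= 0 else "pass")  # fail
print("fail" if decision(w, b, (0.20, 0.30)) >= 0 else "pass")  # pass
```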

#### 3) Naive Bayes (NB) method

The NB algorithm is a supervised learning algorithm primarily used for binary and multiclass classification problems. NB belongs to the family of “probabilistic classifiers” based on Bayes’ theorem, which uses the strong assumption of independence between features [14]. The NB assumption can be used to infer the conditional probability of the value of an output variable from a given input value. As one of the simplest Bayesian network models, NB is often used for document classification [13]. However, higher accuracy levels can be achieved when this algorithm is combined with kernel density estimation [16].
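Under the feature-independence assumption, the unnormalized posterior for each class is simply the prior times the product of per-feature likelihoods. A minimal Gaussian NB sketch follows; the class priors and per-feature statistics are hypothetical, not fitted to the study's data:

```python
import math

def gaussian_pdf(x, mean, std):
    """Gaussian likelihood of a single feature value."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def nb_posterior_scores(x, classes):
    """Unnormalised posteriors: P(c) * prod_i P(x_i | c), features independent."""
    scores = {}
    for label, (prior, feature_params) in classes.items():
        likelihood = prior
        for xi, (mean, std) in zip(x, feature_params):
            likelihood *= gaussian_pdf(xi, mean, std)
        scores[label] = likelihood
    return scores

# Hypothetical class statistics for (proportion of LOTs < 100 ms, pitch)
classes = {
    "pass": (0.9, [(0.22, 0.08), (0.30, 0.05)]),
    "fail": (0.1, [(0.49, 0.10), (0.40, 0.06)]),
}
scores = nb_posterior_scores((0.25, 0.31), classes)
print(max(scores, key=scores.get))  # "pass"
```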

#### 4) Logistic regression (LR)

LR is a classical machine learning algorithm typically used for binary classification tasks. It predicts the probability that a given data point belongs to a certain category as a value between 0 and 1 [13,14], using a function that represents this probability as an S-shaped curve within that interval; the representative example is the logistic function [13].
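The S-shaped mapping is the logistic (sigmoid) function. A minimal sketch, with illustrative coefficients rather than values fitted in this study:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lr_probability(weights, bias, x):
    """P(failure | x) under a logistic regression model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

# Hypothetical coefficients for (proportion of LOTs < 100 ms, pitch)
p = lr_probability((8.0, 2.0), -4.0, (0.55, 0.43))
print(round(p, 3))  # 0.779 -> above 0.5, so classified as failure
```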

### 4. Dataset strategy for validation

To evaluate the machine learning methods under consideration, all data in this study were randomly divided into a training set (n = 168) and a test set (n = 43) at a ratio of 8:2. Additionally, ten-fold cross-validation was performed to evaluate model performance using the Python programming language (Python 3.8).
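The 8:2 split and ten-fold partitioning can be sketched in plain Python; the helpers below are illustrative, not the study's actual pipeline:

```python
import math
import random

def split_train_test(data, test_ratio=0.2, seed=0):
    """Shuffle and split at the given ratio; 211 samples at 8:2 gives 168/43."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_indices(n, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    indices = list(range(n))
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        yield indices[:start] + indices[start + size:], indices[start:start + size]
        start += size

train, test = split_train_test(list(range(211)))
print(len(train), len(test))  # 168 43
```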

### 5. Predictive model evaluation

Various evaluation metrics were used to analyze the prediction models in terms of accuracy, precision, sensitivity, and F1 score.

#### 1) Coefficient of determination (*R*^{2} or *R*-squared)

The coefficient of determination, denoted as *R*^{2}, represents the proportion of variance in a dependent variable as explained by a linear regression model. In other words, *R*^{2} is a measure of the model’s ability to predict or explain results in a linear regression setting. Generally, a high *R*^{2} value indicates that the model is a good fit to the data. *R*^{2} is determined as follows:

$$R^{2} = 1 - \frac{\sum_{i}\left(y_{i} - \widehat{y}_{i}\right)^{2}}{\sum_{i}\left(y_{i} - \bar{y}\right)^{2}}$$

where $\widehat{y}_{i}$ is the predicted value of $y_{i}$, $\bar{y}$ is the mean of the observed values, and $y_{i}$ is the *i*-th observed value [13,17].
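The definition translates directly into code; a pure-Python sketch with toy values (illustrative only):

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot: the proportion of variance explained."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((yi - mean_y) ** 2 for yi in y_true)               # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy binary labels (1 = pass, 0 = fail) and model probabilities
y_true = [1, 0, 1, 1, 0]
y_pred = [0.8, 0.2, 0.9, 0.6, 0.1]
print(round(r_squared(y_true, y_pred), 3))  # 0.783
```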

#### 2) Mean squared error (MSE)

MSE is the average squared difference between the actual and predicted values, calculated as the sum of squared residuals divided by the number of data points; it measures the variance of the residuals. The MSE is calculated using the following equation:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_{i} - \widehat{y}_{i}\right)^{2}$$

where $N$ is the number of observations, $y_{i}$ is the *i*-th observed value, and $\widehat{y}_{i}$ is the predicted value of $y_{i}$ [13,17].

#### 3) Root mean squared error (RMSE)

RMSE is the standard deviation of the residuals (prediction errors), which measure how far the data points fall from the regression line. Thus, the RMSE quantifies the spread of the residuals and is defined as

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{i} - \widehat{y}_{i}\right)^{2}}$$

where MSE is the mean squared error, $N$ is the number of observations, $y_{i}$ is the *i*-th observed value, and $\widehat{y}_{i}$ is the predicted value of $y_{i}$ [13,17].
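Both metrics follow directly from their definitions, with RMSE as the square root of MSE; a pure-Python sketch with toy values (illustrative only):

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: the average squared residual."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

# Toy binary labels (1 = pass, 0 = fail) and model probabilities
y_true = [1, 0, 1, 1, 0]
y_pred = [0.8, 0.2, 0.9, 0.6, 0.1]
print(round(mse(y_true, y_pred), 3), round(rmse(y_true, y_pred), 3))  # 0.052 0.228
```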

#### 4) Confusion matrix

In this study, the performance of the prediction model was evaluated using confusion matrices encompassing accuracy, precision, recall, and F1-score.
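These four metrics derive from the 2×2 confusion matrix counts (TP, TN, FP, FN). A minimal sketch with hypothetical predictions, treating "pass" as the positive class:

```python
def confusion_metrics(y_true, y_pred, positive="pass"):
    """Accuracy, precision, recall (sensitivity), and F1 from a 2x2 confusion matrix."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical predictions on a small test set: 7 TP, 2 TN, 1 FN, 0 FP
y_true = ["pass"] * 8 + ["fail"] * 2
y_pred = ["pass"] * 7 + ["fail"] * 3
m = confusion_metrics(y_true, y_pred)
print({k: round(v, 3) for k, v in m.items()})
```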

## Ⅲ. Results

### 1. Evaluation of four prediction algorithms using scale-dependent metrics

The evaluation results of the four prediction algorithms using scale-dependent metrics for all patients are presented in Table 2 and Fig. 1. For R^{2}, the NB model achieved the highest value (0.624), whereas the SVM model exhibited the lowest value (0.078). Furthermore, the NB model achieved the lowest MSE value of 0.038, whereas the SVM model obtained the highest MSE of 0.139. Similar results were observed for the RMSE.

### 2. Performance metrics of the four prediction models using the confusion matrix

Fig. 2 shows the confusion matrices summarizing the prediction results obtained by the four models. The KNN model produced two false negatives (FNs) and two false positives (FPs) (Fig. 2(a)), whereas the SVM model produced four FNs. In contrast, the NB and LR models each produced only two FNs (Fig. 2(c, d)).

The performance metrics of the four prediction models in terms of accuracy, precision, sensitivity, and F1-score using the confusion matrix are presented in Table 2 and Fig. 3. The NB and LR models achieved the best results, with an accuracy of 0.953. In terms of precision, all models except KNN achieved a score of 1.000. Furthermore, all models except the SVM achieved a sensitivity of 0.949. Additionally, the NB and LR models both achieved F1-scores of 0.973, whereas the SVM and KNN models obtained F1-scores of 0.950 and 0.949, respectively (Fig. 3).

## Ⅳ. Discussion

This study represents the first attempt to evaluate the performance of various machine learning models on EBT-film-based DQA in the context of HT. The evaluation results of the four prediction algorithms using scale-dependent metrics are summarized in Table 2 and Fig. 1, with predictive performance metrics shown in Figs. 2 and 3.

For all DQAs, the proportions of LOTs below 100 ms are summarized in Table 1 along with the pitch values. Accuray recommends keeping the proportion of LOTs shorter than 100 ms below 30%, owing to the risk of increased MLC errors and DQA failures [18]. We confirmed that the proportion of LOTs below 100 ms averaged 22% in successful DQA cases and exceeded 49% in failed cases. These results are consistent with Accuray’s recommendations and with our previous results [4].

Cavinato et al.[19] were the first to develop models for determining patient-specific QA results using LOTs and sinograms for HT plans, confirming that one of the three models under consideration achieves 100% sensitivity while reducing the DQA load by approximately 35% [19]. The present study was not conducted to develop a model to predict DQA results, but rather to predict DQA results using existing machine learning techniques commonly used for classification and regression analysis. Consequently, it is difficult to directly compare the results of this study with those of other studies. However, our results confirm that LOT is the treatment plan parameter that has the most significant impact on DQA results, which correlates with the findings of a prior study.

Wall et al.[15] demonstrated that the SVM model achieves the best performance when predicting DQA outcomes in VMAT treatment plans, whereas the present study identified the NB and LR algorithms as optimal prediction models. Because the study by Wall et al. examined VMAT treatment plans, whereas our study focused on HT plans, it was difficult to analyze the resulting differences in model accuracy. However, the SVM model developed by Wall et al. was optimized through hyperparameter tuning, which may have led to improved performance. Accordingly, we plan to tune the models’ hyperparameters in subsequent studies to further evaluate model accuracy.

Because HT planning systems span a wide range of treatment planning parameters, it is time-consuming and labor-intensive to replan and repeat patient-specific QA whenever these parameters are modified. Therefore, we believe that dosimetrists can predict DQA results in advance by adhering to the acceptable DQA ranges of each parameter for specific anatomical regions.

This study had several limitations owing to its retrospective design. The sample size of patients considered in this study was 211, which is less than that of a previously published study [19]. To address this limitation in the future and prepare predictive models for routine institutional use, we plan to collect additional DQA data from various anatomical sites. Many researchers have evaluated the impact of treatment plan parameters on the O-ring gantry linac (Halcyon). Other studies have analyzed VMAT plans using decision tree models such as the random forest, AdaBoost, and gradient boosting algorithms [20,21]. In a previous study, we similarly examined the parameters that have the greatest impact on DQA using the classification and regression tree (CART) model, a type of decision tree model [11]. However, all prediction models considered in this study were evaluated using only the LOT and pitch parameters. Therefore, in the future, we plan to apply various decision tree models to determine the most influential treatment planning parameters for DQA. Furthermore, the present study only considered publicly available machine learning models, which are difficult to directly compare to those developed in prior studies. Ultimately, we plan to develop a highly accurate prediction model that can minimize the workload associated with the DQA process.

## Ⅴ. Conclusion

In this study, various machine learning methods were used to evaluate the significance of LOT and pitch as treatment planning parameters affecting film-based DQA results in the context of HT. The prediction accuracy of DQA results was confirmed using machine learning. In clinical practice, although it may be difficult to routinely predict DQA results using only LOT and pitch as parameters, the methodology examined in this study may reduce the workloads of medical physicists and dosimetrists by predicting these results in advance.