ISSN : 2288-3509(Print)
ISSN : 2384-1168(Online)
Journal of Radiological Science and Technology Vol.47 No.4 pp.263-270
DOI : https://doi.org/10.17946/JRST.2024.47.4.263

Prediction of Delivery Quality Assurance Via Machine Learning in Helical Tomotherapy

Kyung Hwan Chang
Department of Radiological Science, Far East University

This work was supported by the 2023 Far East University Research Grant (FEU2023R29)


Corresponding author: Kyung Hwan Chang, Department of Radiological Science, Far East University, 76-32 Daehak-gil, Gamgok-myeon,
Eumseong-gun, Chungcheongbuk-do, 27601, Republic of Korea / Tel: +82-43-880-3825 / E-mail: nightholicmp@gmail.com
18/06/2024 01/07/2024 08/07/2024

Abstract


The objective of this study was to evaluate the accuracy and impact of leaf open time (LOT) and pitch using various machine learning models on EBT film-based delivery quality assurance (DQA) performed for 211 patients treated with helical tomotherapy (HT). We randomly selected passed (n=191) and failed (n=20) DQA measurements to evaluate the accuracy of the k-nearest neighbor (KNN), support vector machine (SVM), naive Bayes (NB), and logistic regression (LR) models using scale-dependent metrics such as the coefficient of determination (R2), mean squared error (MSE), and root MSE (RMSE). We also evaluated the performance of the four prediction models in terms of accuracy, precision, sensitivity, and F1-score using confusion matrices; the NB and LR models achieved the best results. The results of this study are expected to reduce the workload of medical physicists and dosimetrists by allowing DQA results to be predicted in advance from LOT and pitch.



Korean title (translated): Prediction of Delivery Quality Assurance Results Using Various Machine Learning Methods in Radiation Therapy


    Ⅰ. Introduction

    Pretreatment patient-specific delivery quality assurance (DQA) is critical in ensuring accurate dose delivery in advanced radiotherapy techniques, including intensity-modulated radiation therapy (IMRT), volumetric modulated arc therapy (VMAT), helical tomotherapy (HT), and stereotactic body radiation therapy [1-5].

    HT is an IMRT delivery technique that enables highly accurate dose delivery and image acquisition using megavoltage (MV) computed tomography (CT) [6,7]. The HT setup uses an MV linear accelerator mounted on a ring gantry, and the beam is modulated using a 64-leaf binary multileaf collimator (MLC) [7]. Various treatment planning parameters can influence the DQA and treatment plan quality for HT, including pitch, field width (FW), leaf open time (LOT), and the planned and actual modulation factors (MF) [6].

    The DQA of HT is a laborious and complex procedure that encompasses various tasks such as the generation of DQA plans, setup of DQA devices, beam delivery, and analysis of DQA results. To handle all of these tasks, Dosimetry Check (DC) software is employed to perform DQA for HT by reconstructing the dose from the exit beam fluence of the patient or phantom [8-10].

    To reduce the workloads of dosimetrists and medical physicists, a study using statistical process control was previously conducted to predict treatment plan parameters associated with the failure of DQA in HT [6,11], identifying LOT and pitch to be such parameters [6]. In addition, several studies have employed machine and deep learning models to reduce the duration of the DQA process while increasing its accuracy for radiation therapy [11-19]. However, many institutions still perform DQA using EBT films, cheese phantoms, and DQA devices in HT. There have been no published reports on the use of machine learning models – such as the k-nearest neighbor (KNN), support vector machine (SVM), and logistic regression (LR) models – to evaluate the impact and accuracy of treatment planning parameters in HT DQA.

    The purpose of this study was to evaluate the accuracy and impact of LOT and pitch using various machine learning models when EBT film-based DQA is performed in HT.

    Ⅱ. Materials and methods

    1. Data acquisition

    In this study, 211 patients with successful (n=191) and failed (n=20) DQA measurements were randomly selected to evaluate the accuracy of each machine learning model. Patients with brain, head and neck (H&N), pelvic, prostate, and rectal cancer were included in this study. All selected patients were treated with tomotherapy (Accuray Inc., Sunnyvale, CA, USA).

    2. Patient-specific DQA and data analysis

    Treatment planning for all patients was performed using an HT planning station (Accuray Inc., Sunnyvale, CA, USA). A cylindrical solid water phantom (“Cheese phantom”; Accuray Inc., Sunnyvale, CA, USA) was used to create the DQA plan for every treatment plan. The center of the ionization chamber (IC; Exradin® A1SL; Standard Imaging, Middleton, WI, USA) was positioned in the phantom and moved to a low-dose-gradient region within the target volume. The cheese phantom with the IC and Gafchromic EBT3 film (International Specialty Products, Wayne, NJ, USA) was used to measure the absolute dose and gamma values for all HT plans [10]. The differences between the calculated and measured point doses and dose distributions were computed using TomoTherapy DQA software (Accuray Inc., Sunnyvale, CA, USA). The absolute point dose difference (DD) and global gamma passing rate (GPR) were analyzed for all patients at a 10% global maximum. The DD was evaluated against a tolerance of ±5%, and the GPR was evaluated with a 3%/3 mm criterion. If either criterion was not met, the DQA was considered a failure [4,10]. We analyzed only the LOT and pitch parameters, as these were previously established to have the greatest impact on DQA results (Table 1). The mean and standard deviation of each parameter were analyzed according to the DQA result, and the proportion of LOTs below 100 ms was assessed [11].
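    The pass/fail rule and the LOT proportion described above can be sketched as follows. This is a minimal illustration, not the TomoTherapy DQA software: the function names, the example values, and in particular the GPR passing threshold (which the text does not state and which is therefore a required argument) are assumptions.

```python
from typing import Sequence


def lot_fraction_below(lots_ms: Sequence[float], cutoff_ms: float = 100.0) -> float:
    """Fraction of leaf open times strictly below cutoff_ms."""
    if not lots_ms:
        raise ValueError("lots_ms must not be empty")
    return sum(1 for t in lots_ms if t < cutoff_ms) / len(lots_ms)


def dqa_passes(dd_percent: float, gpr_percent: float,
               gpr_threshold: float, dd_tolerance: float = 5.0) -> bool:
    """Pass only if BOTH the point-dose and the gamma criteria are met.

    gpr_threshold is hypothetical: the source gives the 3%/3 mm gamma
    criterion but not the passing rate required, so the caller supplies it.
    """
    dd_ok = abs(dd_percent) <= dd_tolerance
    gpr_ok = gpr_percent >= gpr_threshold
    return dd_ok and gpr_ok


# Hypothetical plan: 4 of 10 leaf open times fall below 100 ms.
example_lots = [60, 80, 95, 99, 100, 120, 150, 200, 250, 300]
print(lot_fraction_below(example_lots))            # 0.4
print(dqa_passes(2.1, 96.0, gpr_threshold=90.0))   # True
print(dqa_passes(6.3, 96.0, gpr_threshold=90.0))   # False: DD outside +/-5%
```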

    3. Classification and logistic regression models

    1) K-nearest neighbor (KNN)

    The KNN algorithm is a supervised learning method for classification and regression that operates on the feature space spanned by the training data. New data points are predicted according to their feature similarity to existing data points. The algorithm computes the distance between an unknown data point and the training data points, selects the ‘k’ nearest ones, and assigns the new point to the class most common among them, where ‘k’ is the number of neighbors drawn from the training set [12]. The distance between a new data point and its nearest ‘k’ neighbors is calculated using metrics such as the Euclidean, Manhattan, or Minkowski distance [13].
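    As a rough illustration of the voting rule, the sketch below classifies a query point by majority vote among its k nearest training points under the Euclidean distance. The feature pairs (percentage of LOTs below 100 ms, pitch) and their labels are invented for the example and are not study data; in practice the features would likely need scaling so that one of them does not dominate the distance.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points closest to query."""
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training points")
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: (LOT proportion below 100 ms in %, pitch).
train_X = [(15, 0.43), (20, 0.29), (25, 0.43), (45, 0.22), (50, 0.29), (55, 0.22)]
train_y = ["pass", "pass", "pass", "fail", "fail", "fail"]

print(knn_predict(train_X, train_y, (22, 0.30)))  # pass
print(knn_predict(train_X, train_y, (48, 0.25)))  # fail
```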

    2) Support Vector Machine (SVM)

    The SVM is a supervised machine learning method that determines a decision boundary to maximize the margin for data classification. For a given dataset of features from two groups of patients, an SVM attempts to determine the maximum-margin hyperplane between the two classes, maximizing the distance to the closest data points (i.e., the support vectors) on each side [14]. The SVM can use a kernel that maps input features to a higher-dimensional space, thereby facilitating nonlinear predictive modeling [15].
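    To make the notion of margin concrete, the sketch below does not train an SVM; it only compares the geometric margin of two hand-picked separating hyperplanes on a toy two-class dataset. The data, the weights, and the function names are all assumptions made for illustration.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def geometric_margin(w, b, X, y):
    """Smallest signed distance y_i * (w.x_i + b) / ||w|| over the data.

    A positive value means every point is on the correct side; the points
    that attain the minimum play the role of support vectors.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(yi * decision_value(w, b, xi) / norm for xi, yi in zip(X, y))


# Toy data with labels -1 / +1 (hypothetical, not study data).
X = [(1, 1), (2, 0), (4, 4), (5, 3)]
y = [-1, -1, 1, 1]

margin_a = geometric_margin((1, 1), -5, X, y)  # hyperplane x1 + x2 = 5
margin_b = geometric_margin((1, 0), -3, X, y)  # hyperplane x1 = 3

# Both hyperplanes separate the classes, but A leaves a wider margin,
# so a margin-maximizing classifier would prefer A over B.
print(margin_a > margin_b)  # True
```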

    3) Naive Bayes (NB) method

    The NB algorithm is a supervised learning algorithm primarily used for binary and multiclass classification problems. NB belongs to the family of “probabilistic classifiers” based on Bayes’ theorem, which uses the strong assumption of independence between features [14]. The NB assumption can be used to infer the conditional probability of the value of an output variable from a given input value. As one of the simplest Bayesian network models, NB is often used for document classification [13]. However, higher accuracy levels can be achieved when this algorithm is combined with kernel density estimation [16].
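    The independence assumption can be illustrated with a small Gaussian naive Bayes classifier. The source does not state which NB variant was used, so the Gaussian likelihood here is an assumption, as are the function names and the toy data.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the prior and per-feature mean/variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(X, y):
        grouped[label].append(row)
    model = {}
    for label, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9  # floor avoids zero variance
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior (up to a constant)."""
    def log_posterior(prior, means, variances):
        total = math.log(prior)
        # Independence assumption: per-feature log-likelihoods simply add up.
        for v, m, var in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total
    return max(model, key=lambda label: log_posterior(*model[label]))


# Hypothetical data: (LOT proportion below 100 ms in %, pitch).
X = [(15, 0.43), (20, 0.29), (25, 0.43), (45, 0.22), (50, 0.29), (55, 0.22)]
y = ["pass", "pass", "pass", "fail", "fail", "fail"]
nb_model = fit_gaussian_nb(X, y)

print(predict_gaussian_nb(nb_model, (22, 0.30)))  # pass
print(predict_gaussian_nb(nb_model, (48, 0.25)))  # fail
```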

    4) Logistic regression (LR)

    LR is a classical machine learning algorithm typically used for binary classification tasks. It predicts the probability that a given data point belongs to a certain category as a value between 0 and 1 [13,14], using a function that represents probability as an S-shaped curve within that interval. A representative example is the logistic function [13].
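    The S-shaped mapping can be written as sigma(z) = 1 / (1 + exp(-z)), applied to a linear combination of the features. The sketch below shows this mapping only; the weights and bias are hypothetical values chosen for illustration, not fitted coefficients from this study.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to stay numerically stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, x):
    """Probability of the positive class for feature vector x."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)


# Hypothetical coefficients: passing becomes less likely as the
# percentage of LOTs below 100 ms grows; pitch is ignored here.
weights, bias = (-0.2, 0.0), 7.0

p_low_lot = predict_proba(weights, bias, (22, 0.30))
p_high_lot = predict_proba(weights, bias, (48, 0.25))
print(p_low_lot > 0.5, p_high_lot < 0.5)  # True True
```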

    4. Dataset strategy for validation

    To evaluate the machine learning methods under consideration, all data in this study were randomly divided between a training set (n = 168) and a test set (n = 43) at a ratio of 8:2. Additionally, ten-fold cross-validation was performed to evaluate model performance using the Python programming language (Python 3.8).
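    A split of this kind can be sketched with the standard library alone. Rounding the test size up (which reproduces the reported 168/43 split for 211 cases), the random seed, and running the ten folds over the training indices are assumptions; the source does not specify these details or the library used.

```python
import math
import random


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle 0..n-1 and hold out ceil(n * test_fraction) indices for testing."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=10):
    """Yield (train, validation) index lists for k roughly equal folds."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


train_idx, test_idx = train_test_split_indices(211)
print(len(train_idx), len(test_idx))  # 168 43

for fold_train, fold_val in k_fold_indices(train_idx, k=10):
    # A model would be fitted on fold_train and scored on fold_val here.
    pass
```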

    5. Predictive model evaluation

    Various evaluation metrics were used to analyze the prediction models in terms of accuracy, precision, sensitivity, and F1 score.

    1) Coefficient of determination (R2 or R-squared)

    The coefficient of determination, denoted as R2, represents the proportion of variance in a dependent variable as explained by a linear regression model. In other words, R2 is a measure of the model’s ability to predict or explain results in a linear regression setting. Generally, a high R2 value indicates that the model is a good fit to the data. R2 is determined as follows:

    $$R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2} \tag{1}$$

    where $\hat{y}_i$ is the predicted value of $y_i$, $\bar{y}$ is the mean value of $y$, and $y_i$ is the $i$-th observed value [13,17].

    2) Mean squared error (MSE)

    MSE, calculated as the sum of squared differences between predicted and actual target values divided by the number of data points, represents the average squared difference between actual and predicted values. The resulting value measures the variance in the residuals. The MSE is calculated using the following equation:

    $$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \tag{2}$$

    where $N$ is the number of observations, $y_i$ is the $i$-th observed value, and $\hat{y}_i$ is the predicted value of $y_i$ [13,17].

    3) Root mean squared error (RMSE)

    RMSE is the standard deviation of the residuals (prediction errors), which measure how far the data points lie from the regression line. Thus, the RMSE is a measure of the spread of the residuals, defined as

    $$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2} \tag{3}$$

    where MSE is the mean squared error, $N$ is the number of observations, $y_i$ is the $i$-th observed value, and $\hat{y}_i$ is the predicted value of $y_i$ [13,17].
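    The three metrics defined in Eqs. (1) to (3) can be computed directly from paired observed and predicted values. The sketch below assumes pass/fail outcomes encoded as 1/0, which is an assumption about how the metrics were applied to classifier output; the ten example values are invented.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error, Eq. (2)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, Eq. (3)."""
    return math.sqrt(mse(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination, Eq. (1)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical outcomes (1 = pass, 0 = fail) with one misclassified case.
y_true = [1, 1, 1, 1, 0, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1]

print(round(mse(y_true, y_pred), 3))   # 0.1
print(round(rmse(y_true, y_pred), 3))  # 0.316
print(round(r2(y_true, y_pred), 3))    # 0.524
```

    With 0/1 labels the MSE equals the misclassification rate, which is why a single error in ten cases gives 0.1.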

    4) Confusion matrix

    In this study, the performance of each prediction model was evaluated using a confusion matrix, from which the accuracy, precision, recall (sensitivity), and F1-score were derived.
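    The four scores follow from the cell counts of a binary confusion matrix, as sketched below. The counts in the example are illustrative values chosen to sum to a 43-case test set; they are an assumption and are not read from Fig. 2.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, sensitivity (recall), and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
    }


# Illustrative counts: 37 true positives, 0 false positives,
# 2 false negatives, 4 true negatives (43 cases in total).
scores = classification_metrics(tp=37, fp=0, fn=2, tn=4)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```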

    Ⅲ. Results

    1. Evaluation of four prediction algorithms using scale-dependent metrics

    The evaluation results of the four prediction algorithms using scale-dependent metrics for all patients are presented in Table 2 and Fig. 1. For R2, the NB model achieved the highest value (0.624), whereas the SVM model exhibited the lowest value (0.078). Furthermore, the NB model achieved the lowest MSE value of 0.038, whereas the SVM model obtained the highest MSE of 0.139. Similar results were observed for the RMSE.

    2. Performance metrics of four prediction models using confusion matrices

    Fig. 2 shows the confusion matrices that summarize the prediction results obtained by the four models. In the KNN model, two false negatives (FNs) and two false positives (FPs) were observed (Fig. 2(a)). The SVM model exhibited four FNs (Fig. 2(b)). In contrast, the NB and LR models each exhibited only two FNs (Fig. 2(c, d)).

    The performance metrics of the four prediction models in terms of accuracy, precision, sensitivity, and F1-score using the confusion matrix are presented in Table 2 and Fig. 3. The NB and LR models achieved the best results, with an accuracy of 0.953. In terms of precision, all models except KNN achieved a score of 1.000. Furthermore, all models except the SVM achieved a sensitivity of 0.949. Additionally, the NB and LR models both achieved F1-scores of 0.973, whereas the SVM and KNN models obtained F1-scores of 0.950 and 0.949, respectively (Fig. 3).

    Ⅳ. Discussion

    This study represents the first attempt to evaluate the performance of various machine learning models on EBT film-based DQA in the context of HT. The evaluation results of the four prediction algorithms using scale-dependent metrics are summarized in Table 2 and Fig. 1, with predictive performance metrics shown in Figs. 2 and 3.

    For all DQAs, the proportions of LOTs below 100 ms are summarized in Table 1 along with pitch. Accuray, the manufacturer, recommends keeping the proportion of LOTs below 100 ms under 30% owing to the risk of increased MLC errors and DQA failures [18]. We confirmed that the proportion of LOTs below 100 ms averaged 22% in passing DQA cases and exceeded 49% in failing cases. These results are consistent with Accuray’s recommendation and with our previous results [4].

    Cavinato et al. [19] were the first to develop models for predicting patient-specific QA results using LOTs and sinograms for HT plans, reporting that one of the three models under consideration achieved 100% sensitivity while reducing the DQA workload by approximately 35%. The present study did not aim to develop a new model for predicting DQA results, but rather to predict DQA results using existing machine learning techniques commonly used for classification and regression analysis. Consequently, it is difficult to directly compare the results of this study with those of other studies. However, our results confirm that LOT is the treatment plan parameter with the most significant impact on DQA results, which is consistent with the findings of a prior study.

    Wall et al. [15] demonstrated that the SVM model achieves the best performance when predicting DQA outcomes in VMAT treatment plans, whereas the present study identified the NB and LR algorithms as the best-performing prediction models. Because the study by Wall et al. examined VMAT treatment plans, whereas our study focused on HT plans, it is difficult to analyze the resulting differences in model accuracy. However, the SVM model developed by Wall et al. was optimized through hyperparameter tuning, which may have led to improved performance. Accordingly, we plan to tune the models’ hyperparameters in subsequent studies to further evaluate model accuracy.

    Because HT planning systems span a wide range of treatment planning parameters, it is time-consuming and labor-intensive to modify these parameters, replan, and repeat patient-specific QA. Therefore, we believe that dosimetrists can anticipate DQA results in advance by keeping each parameter within its acceptable DQA range for the anatomical region being treated.

    This study had several limitations owing to its retrospective design. The sample size of patients considered in this study was 211, which is less than that of a previously published study [19]. To address this limitation in the future and prepare predictive models for routine institutional use, we plan to collect additional DQA data from various anatomical sites. Many researchers have evaluated the impact of treatment plan parameters on the O-ring gantry linac (Halcyon). Other studies have analyzed VMAT plans using decision tree models such as the random forest, AdaBoost, and gradient boosting algorithms [20,21]. In a previous study, we similarly examined the parameters that have the greatest impact on DQA using the classification and regression tree (CART) model, a type of decision tree model [11]. However, all prediction models considered in this study were evaluated using only the LOT and pitch parameters. Therefore, in the future, we plan to apply various decision tree models to determine the most influential treatment planning parameters for DQA. Furthermore, the present study only considered publicly available machine learning models, which are difficult to directly compare to those developed in prior studies. Ultimately, we plan to develop a highly accurate prediction model that can minimize the workload associated with the DQA process.

    Ⅴ. Conclusion

    In this study, various machine learning methods were used to evaluate the significance of LOT and pitch as treatment planning parameters affecting film-based DQA results in the context of HT. The prediction accuracy of DQA results was confirmed using machine learning. In clinical practice, although it may be difficult to routinely predict DQA results using only LOT and pitch as parameters, the methodology examined in this study may reduce the workloads of medical physicists and dosimetrists by predicting these results in advance.

    Figure

    Fig. 1. Evaluation results of the k-nearest neighbor (KNN), support vector machine (SVM), naive Bayes (NB), and logistic regression (LR) prediction models using scale-dependent metrics. The histogram bars for the coefficient of determination (R2), mean squared error (MSE), and root mean squared error (RMSE) of each prediction model are shown in red, blue, and pink, respectively.

    Fig. 2. Confusion matrices for the machine learning algorithms (a: KNN, b: SVM, c: NB, and d: LR).

    Fig. 3. Performance metrics of the k-nearest neighbor (KNN), support vector machine (SVM), naive Bayes (NB), and logistic regression (LR) prediction models in terms of accuracy, precision, sensitivity, and F1-score. The histogram bars for the accuracy, precision, sensitivity, and F1-score of each prediction model are shown in red, blue, pink, and light green, respectively.

    Table

    Table 1. Summary of the pitch and leaf open time (LOT) in the passing and failing delivery quality assurance groups

    Abbreviations: LOT, the percentage of leaf open time below 100 ms; DD, dose difference; GPR, gamma passing rate

    Table 2. Evaluation results of four prediction algorithms using scale-dependent metrics

    Abbreviations: R2, the coefficient of determination; MSE, mean squared error; RMSE, root mean squared error; KNN, k-nearest neighbor; SVM, support vector machine; NB, naive Bayes; LR, logistic regression

    References

    1. Chang KH, Ji Y, Kwak J, Kim SW, Jeong C, Cho B, et al. Clinical implications of High Definition Multileaf Collimator (HDMLC) Dosimetric Leaf Gap (DLG) Variations. Prog Med Phys. 2016;27(3):111-6.
    2. Cho B. Intensity-modulated radiation therapy: A review with a physics perspective. Radiat Oncol J. 2018;36(1):1-10. Epub 2018 Mar 30.
    3. Thiyagarajan R, Nambiraj A, Sinha SN, Yadav G, Kumar A, Subramani V, et al. Analyzing the performance of ArcCHECK diode array detector for VMAT plan. Reports of Practical Oncology & Radiotherapy. 2016;21(1):50-6. Epub 2015 Dec 2.
    4. Chang KH. Treatment planning guideline of EBT-film based delivery quality assurance using statistical process control in helical tomotherapy. Journal of Radiological Science and Technology. 2022;45(5):439-48.
    5. Chang KH. A comparison of patient-specific delivery quality assurance (DQA) devices in radiation therapy. Journal of Radiological Science and Technology. 2023;46(3):231-8.
    6. Guckenberger M, Meyer J, Wilbert J, Baier K, Bratengeier K, Vordermark D, et al. Precision required for dose-escalated treatment of spinal metastases and implications for image-guided radiation therapy (IGRT). Radiother Oncol. 2007;84(1):56-63. Epub 2007 Jun 11.
    7. Montgomery DC. Statistical quality control. New York: Wiley; 2009.
    8. Chung E, Kwon D, Park T, Kang H, Chung Y. Clinical implementation of Dosimetry Check™ for TomoTherapy® delivery quality assurance. J Appl Clin Med Phys. 2018;19(6):193-9.
    9. McCowan PM, Asuni G, van Beek T, van Uytven E, Kujanpaa K, McCurdy BM. A model-based 3D patient-specific pre-treatment QA method for VMAT using the EPID. Phys Med Biol. 2017;62(4):1600-12.
    10. Chang KH, Kim DW, Choi JH, et al. Dosimetric comparison of four commercial patient-specific quality assurance devices for helical tomotherapy. J Korean Phys Soc. 2020;76:257-63.
    11. Chang KH, Lee YH, Park BH, Han MC, Kim J, Kim H, et al. Statistical analysis of treatment planning parameters for prediction of delivery quality assurance failure for helical tomotherapy. Technol Cancer Res Treat. 2020;19:1533033820979692.
    12. Siddalingappa R, Kanagaraj S. K-nearest-neighbor algorithm to predict the survival time and classification of various stages of oral cancer: A machine learning approach. F1000Res. 2023;16(11):70.
    13. Kubat M. An introduction to machine learning. 1st ed. Springer Publishing Company, Incorporated; 2015. https://link.springer.com/book/10.1007/978-3-319-20010-1
    14. Cilla S, Viola P, Romano C, Craus M, Buwenge M, Macchia G, et al. Prediction and classification of VMAT dosimetric accuracy using plan complexity and log-files analysis. Phys Med. 2022;103:76-88.
    15. Wall PDH, Fontenot JD. Application and comparison of machine learning models for predicting quality assurance outcomes in radiation therapy treatment planning. Informatics in Medicine Unlocked. 2020;18:100292.
    16. Kononenko I. Inductive and Bayesian learning in medical diagnosis. Appl Artif Intell. 1993;7(4):317-37.
    17. Jierula A, Wang S, Oh T-M, Wang P. Study on accuracy metrics for evaluating the predictions of damage locations in deep piles using artificial neural networks with acoustic emission data. Applied Sciences. 2021;11(5):2314.
    18. Thomas SJ, Geater AR. Implications of leaf fluence opening factors on transfer of plans between matched helical tomotherapy machines. Biomedical Physics & Engineering Express. 2017;4(1):017001.
    19. Cavinato S, Bettinelli A, Dusi F, Fusella M, Germani A, Marturano F, et al. Prediction models as decision-support tools for virtual patient-specific quality assurance of helical tomotherapy plans. Phys Imaging Radiat Oncol. 2023;26:100435.
    20. Zhu H, Zhu Q, Wang Z, Yang B, Zhang W, Qiu J. Patient-specific quality assurance prediction models based on machine learning for novel dual-layered MLC linac. Med Phys. 2023;50(2):1205-14.
    21. Kusunoki T, Hatanaka S, Hariu M, Kusano Y, Yoshida D, Katoh H, et al. Evaluation of prediction and classification performances in different machine learning models for patient-specific quality assurance of head-and-neck VMAT plans. Med Phys. 2022;49(1):727-41.