Performance analysis and knowledge-based quality assurance of critical organ auto-segmentation for pediatric craniospinal irradiation

Patient data selection

This study was performed using retrospectively acquired CT scans from a Philips IQon Spectral CT scanner (Philips Healthcare, Cleveland, OH) and commercially available clinical MIM software (MIM, version 7.3.2, MIM Software Inc., Cleveland, OH). The datasets of pediatric patients who underwent CSI treatments between November 2015 and August 2023 were randomly selected. Institutional review board approval was obtained prior to the study. Patient ages ranged from 2 to 25 years, with a median age of 8 years. Per the CSI protocol, the CT scans were acquired with 120 kVp, a spiral pitch factor of 1.171, 0.625 mm collimation, a 50 cm field of view, a 512 × 512 matrix size, and a slice thickness of 1.5 mm.

For each patient, dosimetrists first manually delineated the normal structures with great care, followed by a thorough review of their work. After this initial phase, physicians made any necessary adjustments and provided their final approvals. This step-by-step collaboration between dosimetrists and physicians ensured that the contours of normal organ structures met the high clinical standards required for treatment planning. This process established the ground truth contours for 170 patients, including brain, brainstem, chiasm, left cochlea, right cochlea, esophagus, left eye, right eye, left lens, right lens, left lung, right lung, left kidney, right kidney, left optic nerve, and right optic nerve.

Automated segmentation was applied to the CT scans of the CSI patients through an atlas segmentation software, employing a deformable image registration algorithm known as the VoxAlign Deformation Engine, developed by MIM Software. The atlas algorithm searched an archive containing 170 prior CSI ground truth cases and identified the CT scan that most closely matched the current patient’s anatomy. By minimizing intensity disparities between the two images through deformable image registration of the historical patient’s CT, the algorithm generated a deformation vector field. This vector field was then employed to modify the contours from the previous patient, aligning them with the anatomical specifics of the current patient [20]. The atlas algorithm was applied to 100 CSI test cases to establish the atlas test contours.

The CT scans of these 100 CSI test patients were also auto-segmented using a neural network algorithm (Contour ProtégéAI, MIM Software Inc., Cleveland, OH) to establish the neural network test contours. For each patient, the neural network algorithm was executed, employing a U-Net architecture to convert the input image into a segmentation mask for each specific anatomical structure [21]. Both the CT Head and Neck and the CT Thorax neural network models (Contour ProtégéAI, MIM Software Inc., Cleveland, OH) were applied to each patient. The individual contours were aggregated, resulting in a collection of the 16 listed contours for each of the 100 test patients, recorded for the purpose of comparison.

Contour comparison

The 100 CSI patients auto-segmented by both the atlas and neural network methods were analyzed using 13 distinct metrics and compared against the ground truth contours. The metrics can be split into two categories: overlap and distance measures. The crisp definitions for all metrics were adapted from a previous study [22].

Overlap metrics

The segmentations of the ground truth and the test were defined as \(S_g\) and \(S_t\), respectively. Both segmentations were split into two classes, \(S_g=\left\{S_g^1,S_g^2\right\}\) and \(S_t=\left\{S_t^1,S_t^2\right\}\), where the first class \(\left(S_g^1,S_t^1\right)\) was the anatomy of interest and the second class \(\left(S_g^2,S_t^2\right)\) was the background. An assignment function \(f_g^i(x)\) defined whether a point \(x\) in a medical image volume belonged to the feature of interest or the background, where \(f_g^i\left(x\right)=1\) if \(x \in S_g^i\), and \(f_g^i\left(x\right)=0\) if \(x \notin S_g^i.\) The assignment function for the test segmentation, \(f_t^i(x)\), was defined similarly. If \(X=\left\{x_1,\dots ,x_n\right\}\) defined the point set of all points inside the medical image volume, every point in the image was a member of either the feature-of-interest or the background class, meaning \(f_g^1\left(x\right)+ f_g^2\left(x\right)=1\) and \(f_t^1\left(x\right)+ f_t^2\left(x\right)=1\) for all \(x \in X\).

The common cardinalities describing the overlap of the two segmentations were the four entries of a confusion matrix: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). These were defined by the sum of agreement between class \(i\) of \(S_g\) and class \(j\) of \(S_t\), calculated as

$$m_{ij}=\sum_{r=1}^{\left|X\right|}f_g^i\left(x_r\right)f_t^j\left(x_r\right),$$

where \(TP= m_{11}\), \(FP=m_{21}\), \(FN= m_{12}\), and \(TN= m_{22}\).

True positives described the instances that were part of the feature of interest and were correctly classified as part of the feature of interest by the test segmentation. False positives described the instances that were part of the background and were incorrectly classified as part of the feature of interest by the test segmentation. False negatives described the instances that were part of the feature of interest and were incorrectly classified as background by the test segmentation. True negatives described the instances that were part of the background and were correctly classified as background by the test segmentation.
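As an illustration, the four cardinalities can be computed directly from two binary masks with NumPy. This is a minimal sketch; the function name and toy masks are ours, not part of the MIM workflow:

```python
import numpy as np

def confusion_cardinalities(gt: np.ndarray, test: np.ndarray):
    """Voxel-wise confusion-matrix cardinalities for two binary masks.

    gt, test: boolean arrays of identical shape, True inside the
    feature of interest and False in the background.
    """
    tp = int(np.sum(gt & test))    # feature in both segmentations
    fp = int(np.sum(~gt & test))   # background labeled as feature
    fn = int(np.sum(gt & ~test))   # feature labeled as background
    tn = int(np.sum(~gt & ~test))  # background in both segmentations
    return tp, fp, fn, tn

# Toy 1-D "volume" of 10 voxels
gt   = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)
test = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0], dtype=bool)
print(confusion_cardinalities(gt, test))  # (3, 1, 1, 5)
```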

The overlap measures used these cardinalities. All overlap measures ranged from 0 to 1, where 0 indicated low agreement between ground truth and test segmentations and 1 indicated high agreement between ground truth and test segmentations. The Dice Similarity Coefficient (DSC), commonly used in comparing medical volume segmentation, measured the degree of overlap between two segmentations. It is defined by

$$DSC=\frac{2\left|S_g^1 \cap S_t^1\right|}{\left|S_g^1\right|+\left|S_t^1\right|}=\frac{2TP}{2TP+FP+FN}.$$

The Jaccard index (JAC) measured overlap and can be related to DSC by,

$$JAC=\frac{\left|S_g^1 \cap S_t^1\right|}{\left|S_g^1 \cup S_t^1\right|}=\frac{TP}{TP+FP+FN}=\frac{DSC}{2-DSC}.$$

The true positive rate (TPR), also known as Sensitivity or Recall, measured the proportion of positive cases correctly identified by the test segmentation and is defined by,

$$Recall = Sensitivity = TPR =\frac{TP}{TP+FN}.$$

The true negative rate (TNR), also known as Specificity, measured the proportion of negative cases correctly identified by the test segmentation and is defined by,

$$Specificity = TNR =\frac{TN}{TN+FP}.$$

The positive predictive value (PPV), also known as Precision, represented the proportion of predicted positive cases that were actually positive and is defined by,

$$Precision= PPV =\frac{TP}{TP+FP}.$$

The Rand index (RI) is commonly used to measure similarity between data clusterings but can also evaluate classifications. The RI described the proportion of correct identifications by the test segmentation and is defined by,

$$RI =\frac{TP+TN}{TP+TN+FP+FN}.$$

It was noted during the calculations that, because the entire CT volume was used as the input for both the ground truth and test segmentations, the number of true negatives (TN) was extremely high, causing the TNR and RI to be very close to 1 (e.g., 0.99998) even when a segmentation had a low TPR and PPV (e.g., below 0.2). To allow the TNR and RI to provide meaningful data, the segmentation comparisons were performed within a bounding box. The bounding box was the smallest rectangular volume enclosing both the ground truth and test segmentations. Note that this bounding box had no effect on TP, FP, or FN, since all actual positives and predicted positives were included in the bounding box.
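The six overlap measures, including the bounding-box restriction described above, can be sketched as follows. This is an illustrative implementation of the published formulas, not the software actually used in the study, and it assumes both masks are non-empty:

```python
import numpy as np

def overlap_metrics(gt: np.ndarray, test: np.ndarray):
    """DSC, JAC, TPR, TNR, PPV, and RI for two non-empty binary masks.

    TN is counted inside the smallest rectangular volume enclosing both
    segmentations, so TNR and RI are not inflated by empty CT background.
    """
    # Crop to the bounding box of the union of both segmentations
    idx = np.nonzero(gt | test)
    box = tuple(slice(i.min(), i.max() + 1) for i in idx)
    g, t = gt[box], test[box]

    tp = np.sum(g & t)
    fp = np.sum(~g & t)
    fn = np.sum(g & ~t)
    tn = np.sum(~g & ~t)   # unaffected voxels outside the box are dropped

    dsc = 2 * tp / (2 * tp + fp + fn)
    return {
        "DSC": dsc,
        "JAC": dsc / (2 - dsc),  # equivalent to TP / (TP + FP + FN)
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "RI":  (tp + tn) / (tp + tn + fp + fn),
    }
```

Cropping first changes only TN: every true positive, false positive, and false negative lies inside the box by construction.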

Distance metrics

The next set of metrics were the distance metrics, which were based on the Hausdorff Distance (HD), a measure of the distance between the ground truth and test segmentations. The HD measured the maximum, over each point in both sets of surface points, of the distance from that point to its closest point in the other set of surface points, and is defined by,

$$HD=\text{max}\left(h\left(A,B\right),h\left(B,A\right)\right),$$

where \(h\left(A,B\right)\) is the directed Hausdorff Distance between point set A and point set B described by,

$$h\left(A,B\right)=\underset{a\in A}{\text{max}}\,\underset{b\in B}{\text{min}}\Vert a-b\Vert ,$$

where \(\Vert a-b\Vert \) is the Euclidean distance between two points. However, the HD was sensitive to outliers, so other metrics were derived to compare the distances. The other distance metrics were calculated from the nearest-neighbor distances for all points in A and all points in B. This nearest-neighbor function, returning a vector of the minimum distances from set A to set B, can be written as:

$$d\left(A,B\right)= \underset{b\in B}{\text{min}}\Vert a-b\Vert \quad \text{for } a \text{ in } A.$$

The traditional HD was designated as \(HD_{max}\), which can also be written as,

$$HD_{max}=\text{max}\left(d\left(A,B\right),d\left(B,A\right)\right),$$

where \(d\left(A,B\right)\) and \(d\left(B,A\right)\) are vectors. The other distance metrics were calculated as follows:

$$HD_{std}=\text{std}\left(d\left(A,B\right),d\left(B,A\right)\right),$$

$$HD_{min}=\text{min}\left(d\left(A,B\right),d\left(B,A\right)\right),$$

$$HD_{mean}=\text{mean}\left(d\left(A,B\right),d\left(B,A\right)\right),$$

$$HD_{median}=\text{median}\left(d\left(A,B\right),d\left(B,A\right)\right),$$

$$HD_{95\%}=\text{percentile}_{95\%}\left(d\left(A,B\right),d\left(B,A\right)\right).$$

The mean distance to agreement (MDA) measured the mean of the nearest-neighbor distances from only one set of surface points to the other, as described by

$$MDA=\text{mean}\left(d\left(A,B\right)\right).$$
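The distance metrics above reduce to statistics of the pooled nearest-neighbor distance vectors. A minimal sketch is shown below, assuming SciPy's `cKDTree` for the nearest-neighbor queries and that the surface point sets have already been extracted as (N, 3) coordinate arrays (the extraction step itself is not shown):

```python
import numpy as np
from scipy.spatial import cKDTree

def distance_metrics(surf_a: np.ndarray, surf_b: np.ndarray):
    """Hausdorff-family metrics from two (N, 3) arrays of surface points.

    d(A, B) is the vector of nearest-neighbor distances from every point
    of A to the set B; pooling both directions feeds the max, std, min,
    mean, median, and 95th-percentile statistics.
    """
    d_ab = cKDTree(surf_b).query(surf_a)[0]  # d(A, B)
    d_ba = cKDTree(surf_a).query(surf_b)[0]  # d(B, A)
    pooled = np.concatenate([d_ab, d_ba])
    return {
        "HD_max":    pooled.max(),
        "HD_std":    pooled.std(),
        "HD_min":    pooled.min(),
        "HD_mean":   pooled.mean(),
        "HD_median": np.median(pooled),
        "HD_95":     np.percentile(pooled, 95),
        "MDA":       d_ab.mean(),  # one-directional mean, d(A, B) only
    }
```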

Knowledge-based quality assurance tool

An in-house knowledge-based quality assurance tool was developed by leveraging the distinctive patterns of HU distributions for each individual organ. Sample HU distributions were acquired for each ground truth contour, for a total of 1600 HU distributions encompassing 100 patients and 16 organs. The QA tool used kernel density estimation (KDE), a statistical technique that estimates the probability density function of a random variable and provides insight into the underlying distribution of data points. In a CT image, the data points were interpreted as voxels, each with a corresponding HU value. To generate a KDE for each patient’s organs, a collection of HU values was extracted. The KDE approach utilized kernel densities to construct a probability density function, effectively capturing the normalized distribution so that the total area under the probability distribution equals 1. This methodology permitted a comparison that remained uninfluenced by the organs’ varying sizes, which can differ widely within a pediatric population.

To validate the consistency of the HU distributions across multiple patients, KDEs were randomly divided into two groups. One group consisted of KDEs for 80 patients, which were utilized to establish a reference baseline distribution. The second group comprised KDEs for 20 patients, intended for validation purposes.

Subsequently, the KDEs for the cohort of 80 reference patients were averaged for each organ. This procedure yielded 16 averaged KDEs, corresponding to the 16 distinct organs. The standard deviation (SD) among the KDEs across the reference patient group was also determined. An agreement value of 0.95 was observed when the ground truths of the 20 test patients were compared against the reference KDEs of the 80 patients. This correspondence directly conformed to the statistical empirical guideline referred to as the two-SD, or two-sigma, rule. Therefore, it was logical to anticipate that newly generated contours would fall within bounds encompassing 95% of the ground truth HU distributions.

To test the agreement of the newly auto-segmented contours, KDEs were created for two datasets: the first comprised KDEs from 100 patients, used to establish a baseline distribution of ground truth contours; the second consisted of KDEs from 10 patients who were not included in the library of the auto-segmentation tools. These 10 patients’ data were segmented by both the atlas and neural network methods for further analysis. The KDEs from the 100 patients were averaged for each organ, resulting in 16 benchmark KDEs corresponding to the 16 organs. These averaged KDEs were employed as the baseline distribution for each respective organ. Upper bounds for the baselines were determined by adding two SDs to the distributions, while lower bounds were established by subtracting two SDs from the distributions. These upper and lower bounds encompassed 95% of the ground truth distributions at each HU value, adhering to the empirical two-sigma rule.

Test contours from the atlas and neural network methods were compared against the baseline distribution using the ± 2 SD values as part of the knowledge-based quality assurance procedure. This comparison was performed by creating a KDE for the test contour from its HU distribution. The extent of the test KDE, denoted \(KDE_T\), falling within the lower and upper bounds of the baseline KDE, denoted \(LB_{KDE_B}\) and \(UB_{KDE_B}\) respectively, was then calculated. This calculation yielded a quantitative measure that gauged the level of agreement, as formulated below.

$$Agreement \,\,value= \frac{\sum _{i}\left[LB_{KDE_B}(i)\le KDE_T(i)\le UB_{KDE_B}(i)\right]}{\text{length}(KDE_T)}.$$

The agreement values between the ground truth belonging to 100 cases and 10 test distributions were computed for each organ. This procedure was carried out to verify whether the agreement value with the baseline distribution effectively represented the quality of auto-segmented contour sets. Based on the agreement values, the discrepancies were investigated between ground truth and auto-segmented contour pairs.
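The full QA pipeline — per-organ KDEs, an averaged baseline with ± 2 SD bounds, and the agreement value — can be sketched as below. This is an illustrative reimplementation using SciPy's `gaussian_kde`; the HU grid range and the synthetic HU samples are our assumptions for demonstration, not the study's patient data:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Common HU grid on which every KDE is evaluated (illustrative range)
grid = np.linspace(-200, 200, 401)

def organ_kde(hu_values):
    """Normalized HU density for one organ contour (area integrates to 1)."""
    return gaussian_kde(hu_values)(grid)

def agreement_value(kde_test, baseline_kdes):
    """Fraction of the test KDE lying inside the baseline mean ± 2 SD band."""
    mean = baseline_kdes.mean(axis=0)
    sd = baseline_kdes.std(axis=0)
    lb, ub = mean - 2 * sd, mean + 2 * sd
    inside = (kde_test >= lb) & (kde_test <= ub)
    return inside.mean()

# Illustrative check with synthetic HU samples (not real patient data):
# 20 "reference patients" drawn from the same soft-tissue-like HU range.
rng = np.random.default_rng(0)
baseline = np.array([organ_kde(rng.normal(30, 15, 5000)) for _ in range(20)])
good = organ_kde(rng.normal(30, 15, 5000))   # matches the baseline organ
bad = organ_kde(rng.normal(-120, 15, 5000))  # wrong HU range entirely
print(agreement_value(good, baseline), agreement_value(bad, baseline))
```

A contour sampling the wrong tissue produces a density that escapes the ± 2 SD band over much of the grid, so its agreement value drops well below that of a matching contour.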

Institutional review board statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board (IRB) (IRB number: 23-1456 and date of approval: 8/3/2023) of St. Jude Children’s Research Hospital, Memphis, TN.

Informed consent

Informed consent was waived due to the nature of retrospective study by the Institutional Review Board (IRB) (IRB number: 23-1456 and date of approval: 8/3/2023) of St. Jude Children’s Research Hospital, Memphis, TN.
