Deep Learning-based Heart Localization and Segmentation for Congenital Heart Disease Diagnosis Using You Only Look Once
*Corresponding author: Aymen Djellouli, Department of Computer Science, University of Tlemcen, Tlemcen, Algeria. djellouli201@gmail.com
How to cite this article: Djellouli A, Merzoug M, Hadjila M, M’hamedi M, Etchiali A, Bekkouche A. Deep Learning-based Heart Localization and Segmentation for Congenital Heart Disease Diagnosis Using You Only Look Once. J Card Crit Care TSS. 2025;9:226-38. doi: 10.25259/JCCC_45_2025
Abstract
Congenital heart disease (CHD) presents significant diagnostic challenges due to complex anatomical variations. Accurate whole-heart segmentation from 3D computed tomography (CT) is important for treatment planning but remains difficult. This paper introduces and evaluates a two-phase deep learning pipeline leveraging you only look once (YOLO) architectures for efficient heart localization and segmentation in CHD cases using the ImageCHD dataset. The first phase uses YOLOv8n for heart localization, achieving high accuracy with 99.5% mean average precision (mAP)@50 and 81.168% mAP@50-95 by utilizing a custom slice-filtering data preparation strategy. The second phase uses YOLOv11-seg variants (n, s, m) for pixel-wise segmentation of seven cardiac structures within the localized regions. While training metrics indicated effective learning, validation results revealed significant limitations for the segmentation task across all models. Key challenges included overfitting, evidenced by increasing validation loss and low mask mAP@50-95 (plateauing around 0.26–0.27), and difficulty in distinguishing small foreground structures from the background, confirmed by confusion matrix analysis. Notably, increasing model size did not resolve these core issues. Despite the segmentation challenges, this study demonstrates the strong potential of YOLOv8 for rapid medical object localization and explores the feasibility of YOLOv11-seg for whole-heart segmentation in CHD. Future work should focus on advanced augmentation, regularization, and potentially alternative architectures to improve segmentation robustness for clinical applicability.
Keywords
Congenital heart disease
Heart localization
Heart segmentation
You only look once
INTRODUCTION
Accurate heart segmentation in medical imaging is important for precise diagnosis and effective treatment planning, especially in cases of congenital heart disease (CHD). CHD includes a variety of structural anomalies present at birth, affecting the heart’s architecture and function. The complexity of these malformations requires detailed analysis of cardiac anatomy to guide clinical decisions,[1,2] and makes accurate segmentation essential for understanding the spatial relationships between structures and planning surgical or interventional procedures.[3,4]
Accurate segmentation maps allow the reconstruction of 3D heart models that support surgical planning and simulation by clarifying anatomical relationships and aiding preoperative decision-making,[5,6] which enhances surgical outcomes.[7,8] Moreover, it is essential for quantitatively assessing cardiac function, such as measuring ventricular volumes and ejection fractions,[9] which support disease monitoring and evaluation of therapeutic efficacy.[10,11]
Despite its importance, whole-heart segmentation remains challenging due to the variability of cardiac anatomy, particularly in CHD cases,[12] as well as imaging issues such as noise, low contrast, and artifacts in computed tomography (CT) and magnetic resonance imaging (MRI).[13] Furthermore, manual segmentation, the current gold standard, is labor-intensive, time-consuming, and susceptible to variations between observers, making it impractical for routine clinical workflows.[14]
Deep learning has advanced cardiac segmentation considerably. Convolutional neural networks (CNNs), particularly U-Net and its variants, have achieved state-of-the-art performance in segmenting cardiac structures from 3D medical images.[15,16] However, their complexity means these methods are computationally expensive and often unsuitable for real-time applications.[17]
Recently, the you only look once (YOLO) family of models has garnered attention in computer vision due to their balance of speed and accuracy in object detection tasks.[18] Though initially developed for 2D detection problems, YOLO has been successfully adapted for several medical applications, including tumor detection, organ localization, and lesion classification.[19,20] Its efficiency suggests potential for time-sensitive clinical use in cardiac imaging. Recent studies have explored its utility in various 3D imaging contexts, such as lung nodule detection and brain tumor segmentation, suggesting that similar approaches could be applied to cardiac imaging.[21,22]
This study aims to explore the feasibility of a YOLO-based framework for whole-heart segmentation in 3D CT scans, particularly in CHD cases. Using YOLO for localization and segmentation, we aim to evaluate whether YOLO can provide a lightweight alternative to traditional segmentation networks in scenarios where rapid inference and reduced computational demand are essential.
The rest of the paper is organized as follows. Section 2 reviews existing heart segmentation methods, including both traditional techniques and recent deep learning approaches, with a focus on CHD cases. Section 3 introduces our proposed method, highlighting the application of YOLO in medical imaging, identifying current research gaps, and comparing our approach with 3D segmentation methods. Section 4 outlines the materials and experimental setup. Section 5 presents the results, showcasing the model’s performance through quantitative metrics and visual examples. Finally, Section 6 concludes the study with key findings and future directions.
Related work
Traditional methods for heart segmentation
Traditional methods for heart segmentation have relied heavily on atlas-based approaches, deformable models, and thresholding techniques. Atlas-based methods use predefined templates of cardiac anatomy to guide segmentation, leveraging registration techniques to align the atlas with the target image.[12] While effective for normal cardiac anatomy, these methods often struggle with the high anatomical variability seen in CHD cases.[14] Deformable models, such as active contours and level sets, iteratively adjust a shape to fit the boundaries of cardiac structures.[23,24] These methods are flexible but can be sensitive to initialization and noise, limiting their robustness in clinical settings.[25] Thresholding techniques, which segment regions based on intensity values, are simple and fast but often fail to capture complex anatomical details, especially in low-contrast regions.[26] Despite their limitations, these traditional methods laid the groundwork for automated heart segmentation and highlighted the need for more advanced techniques to handle the challenges posed by CHD.
Recent advances in deep learning for heart segmentation and CHD cases
The advent of deep learning has significantly impacted medical image analysis, particularly in cardiac segmentation. The medical image computing and computer-assisted intervention (MICCAI) 2017 multi-modality whole heart segmentation[27] and automated cardiac diagnosis[28] challenges have accelerated the development of numerous state-of-the-art heart segmentation models. For example, Yang et al.,[29] integrated fully convolutional networks[30] with 3D operations, transfer learning, and deep supervision to effectively extract 3D contextual information, addressing challenges in training deep neural networks using hybrid loss functions.
Two-stage approaches have also garnered attention in cardiac image analysis. Wang et al.,[31] proposed a modified U-Net architecture that first detects a region of interest from the full volume and subsequently segments voxels at the original resolution, enhancing segmentation accuracy. Similarly, Payer et al.,[32] utilized a dual-CNN framework, comprising a location CNN to localize the centers of heart substructures and a segmentation CNN focused on the heart region, promoting anatomically plausible configurations.
Xu et al.,[33] employed a deep learning-based segmentation followed by graph matching to determine categories of anomalous vessels, demonstrating the utility of combining segmentation with structural analysis. In a subsequent study, Xu et al.,[34] explored CHD diagnosis using deep learning and shape similarity metrics; however, the performance was not yet sufficient for clinical application.
While 3D methodologies dominate the field, 2D approaches remain pertinent due to their reduced computational demands. One notable method is DeSPPNet, a multiscale deep learning model that integrates spatial pyramid pooling with dense connectivity to capture both detailed and contextual information in cardiac MRI images. In comparative analyses, DeSPPNet outperformed other networks, showcasing its efficacy in segmenting complex cardiac structures.[35] Another significant contribution comes from Azarmehr et al.,[36] who investigated various encoder–decoder models for segmenting the endocardium of the left ventricle (LV) in 2D echocardiographic images. Their study found that the U-Net model achieved superior performance, with an average Dice coefficient of 0.92 and a Hausdorff distance of 3.97, highlighting the robustness of 2D U-Net architectures in cardiac segmentation tasks. In addition, Wibowo et al.,[37] proposed an innovative approach using an ensemble of lightweight encoders within a U-Net architecture to optimize cine MRI segmentation. They introduced a novel 2D thickness algorithm to convert segmentation outputs into 2D representations of cardiac volumes, enabling the classification of cardiac muscle diseases without relying on clinical features. This method not only improved segmentation accuracy but also provided a computationally efficient solution for real-time applications. These studies collectively underscore the potential of 2D deep learning methods in effectively segmenting cardiac structures in CHD cases, offering promising avenues for improved diagnosis, treatment planning, and clinical applicability.
Our proposed approach
YOLO in medical imaging
While YOLO was originally designed for real-time object detection in 2D images,[18] its applications in medical imaging have expanded significantly. YOLO-based models have been successfully used for tumor detection,[38,39] organ localization,[40,41] and lesion classification,[42,43] demonstrating their speed and accuracy in handling medical data. For example, YOLO has been adapted for detecting lung nodules in CT scans, achieving real-time performance without compromising accuracy.[21] Similarly, YOLO-based frameworks have been proposed for segmenting organs in abdominal CT scans, highlighting their potential for handling 3D data.[44] Another valuable work is the study of Balasubramani et al.,[45] on enhancing cardiac function assessment, where a YOLO-based approach was used to automate LV segmentation in echocardiography. These successes suggest that YOLO could be a viable alternative for whole-heart segmentation, particularly in scenarios where speed and efficiency are critical.
Gaps in the literature
Despite the advancements in deep learning-based segmentation, several gaps remain. First, most state-of-the-art methods, such as 3D U-Net and nnU-Net, are computationally intensive and may not be suitable for real-time applications.[16] Second, while these methods perform well on normal cardiac anatomy, their accuracy often drops in the presence of complex congenital defects, which are common in CHD cases.[12] Third, the reliance on 3D convolutions in methods like 3D U-Net can lead to high memory usage, limiting their applicability to large 3D volumes.[17]
To address these gaps, this paper proposes an approach that processes 2D slices sequentially in all axes (axial, sagittal, and coronal) and aggregates the results to reconstruct 3D segmentations. This method leverages the efficiency of 2D convolutions while maintaining the spatial context of 3D data. Unlike 3D U-Net, which processes entire volumes at once, the sequential processing of 2D slices reduces memory requirements and enables real-time performance. In addition, this approach can be easily integrated with YOLO-based models, which are inherently designed for 2D data but can be extended to 3D tasks.
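To make the slice-then-fuse idea concrete, the sketch below processes a NumPy volume along all three axes and merges the per-axis predictions by majority vote. This is an illustrative sketch rather than the exact implementation used in this study: the per-slice segmenter is a placeholder intensity threshold standing in for a trained 2D model, and the fusion rule (keep voxels flagged on at least two of the three axes) is one simple choice among several.

```python
import numpy as np

def segment_slice(slice_2d: np.ndarray) -> np.ndarray:
    """Placeholder per-slice segmenter; a trained 2D YOLO model would go here."""
    return (slice_2d > 0.5).astype(np.uint8)

def segment_volume(volume: np.ndarray) -> np.ndarray:
    """Run the 2D segmenter along each axis and fuse results by majority vote."""
    votes = np.zeros_like(volume, dtype=np.uint8)
    for axis in range(3):                      # axial, sagittal, coronal passes
        for i in range(volume.shape[axis]):
            mask = segment_slice(np.take(volume, i, axis=axis))
            # write the 2D mask back into the 3D vote accumulator
            idx = [slice(None)] * 3
            idx[axis] = i
            votes[tuple(idx)] += mask
    return (votes >= 2).astype(np.uint8)       # voxel kept if flagged on >= 2 axes

vol = np.random.rand(8, 8, 8)
fused = segment_volume(vol)
```

Other fusion rules (union, intersection, or per-class probability averaging) trade sensitivity against false positives and can be swapped in without changing the per-slice model.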
Comparison with 3D segmentation methods
The choice between sequential 2D slice processing and full 3D segmentation depends on the specific requirements of the task. 3D U-Net and similar methods excel in capturing spatial relationships across all three dimensions, making them ideal for tasks where volumetric context is critical. However, their high computational cost and memory usage can be prohibitive for real-time applications or large datasets. In contrast, sequential 2D slice processing offers a more efficient alternative, particularly when combined with YOLO’s real-time capabilities. While this approach may sacrifice some volumetric context, it has been shown to achieve comparable accuracy in many medical imaging tasks, especially when combined with postprocessing techniques to refine the final 3D segmentation.[19,20]
For the aim of this study, developing a fast, accurate, and robust method for whole heart segmentation in CHD cases, the sequential 2D slice processing approach is particularly well-suited. It balances computational efficiency with the ability to handle complex anatomical variations, making it a practical choice for clinical applications. Moreover, the integration of YOLO’s real-time detection capabilities ensures that the method can be deployed in time-sensitive scenarios, such as intraoperative imaging or emergency diagnostics.
MATERIAL AND METHODS
Dataset
In this study, we used the ImageCHD dataset, a comprehensive collection of 110 three-dimensional (3D) CT images encompassing various types of CHD. This dataset is a valuable resource for CHD classification and segmentation studies.[34]
The dataset includes 16 types of CHD, categorized into eight common types and eight rarer ones. The common types include atrial septal defect, atrioventricular septal defect, patent ductus arteriosus, pulmonary atresia, ventricular septal defect, coarctation, tetralogy of Fallot, and transposition of the great arteries. The less common types include aortic arch hypoplasia, anomalous pulmonary venous drainage, common arterial trunk, double aortic arch, double outlet right ventricle (RV), double superior vena cava, interrupted aortic arch, and pulmonary artery (PA) sling. Accurate labeling and segmentation of cardiac images are crucial, particularly when dealing with the intricate structures associated with congenital heart defects. In the ImageCHD dataset, a team of four cardiovascular radiologists undertook the annotation process. Each radiologist individually segmented the images, ensuring detailed precision, while diagnoses were collectively validated by all four experts to maintain consistency and accuracy. The annotation process for each image took an average of 1–1.5 h.
The dataset contains detailed segmentations of seven primary cardiac structures: LV, RV, left atrium (LA), right atrium (RA), myocardium (Myo), aorta (Ao), and PA. These comprehensive annotations serve as a robust foundation for both localization and segmentation tasks, thereby enhancing research in automated detection and analysis of congenital heart diseases.[34]
YOLO algorithm
Object detection techniques are generally classified into two categories: two-stage detectors and single-stage detectors.[46] Two-stage detectors, such as the region-based convolutional neural network (R-CNN),[47] operate by first identifying potential regions of interest in an image and subsequently applying a classifier and bounding box regressor to these regions. Although this method typically offers high detection accuracy, it requires substantial computational resources and processing time, making it less practical for real-time applications or deployment on resource-constrained hardware.
In contrast, single-stage detectors streamline the detection process by simultaneously performing localization and classification, resulting in reduced computational demands and faster inference speeds. Prominent examples of single-stage detectors include the single shot detector (SSD),[48] Deconvolutional SSD,[49] RetinaNet,[50] and the YOLO series.[18] Among these, the YOLO series has gained significant popularity due to its balance between accuracy and speed, making it particularly suitable for deployment on edge computing devices.
Introduced in 2015 by Redmon et al. in their paper “You Only Look Once: Unified, Real-Time Object Detection,”[18] YOLO redefined object detection by treating it as a single regression problem, predicting bounding boxes and class probabilities directly from full images in one evaluation. This unified approach allowed for faster and more accurate detections compared to traditional methods.
Since its advent, the YOLO series has experienced rapid evolution. Although Redmon and Farhadi ceased their involvement with the project after YOLOv3,[51] other researchers have continued to advance the model’s capabilities. In January 2023, Ultralytics released YOLOv8, which outperformed its predecessors and marked a new state of the art in the YOLO series. YOLOv8 introduced several architectural enhancements, including a new backbone network, a refined anchor-free detection head, and a novel loss function, contributing to improved accuracy and efficiency across various tasks. The visual artificial intelligence tasks supported by YOLOv8 are object detection, instance segmentation, pose estimation, oriented bounding box detection, and image classification.[52]
YOLOv11, a new version of YOLO developed by Ultralytics in 2024, has significantly advanced computer vision capabilities across various tasks, including object detection, feature extraction, instance segmentation, pose estimation, tracking, and classification. This model represents a substantial breakthrough in real-time object detection technology.
Key architectural enhancements in YOLOv11 include the introduction of novel modules such as C3K2 (cross stage partial with 2 × 2 kernels), the spatial pyramid pooling-fast block, and the convolutional block with parallel spatial attention. These enhancements improve the model’s ability to extract features and detect objects more effectively. The improved feature extraction capabilities have broadened YOLOv11’s applicability to a wider range of scenarios, making it more versatile for various computer vision tasks.[53,54]
During this study, two versions of YOLO models were used: the YOLOv8[52] and the YOLOv11.[55]
Models’ architecture
The YOLO model is divided into three main components: the backbone, neck, and head. This design facilitates efficient and accurate object detection by streamlining the process into a unified framework.
The backbone’s primary function is feature extraction from input images at various spatial resolutions. This process involves stacked convolutional layers and specialized modules that generate progressively abstract feature maps at varying resolutions. For instance, in YOLOv3, the backbone utilized Darknet-53, a 53-layer convolutional network, to perform this task.[51] By capturing low-level details, such as edges and textures, as well as high-level semantic information, the backbone lays the foundation for effective object detection.
The neck module acts as an intermediary layer that refines and aggregates features output by the backbone. By integrating multiple-scale representations through operations such as upsampling, downsampling, and feature concatenation, the neck enhances the network’s ability to detect objects of varying sizes. Techniques such as path aggregation network are commonly incorporated in this stage to enhance feature fusion capabilities.
The head of the YOLO model is responsible for the final prediction outputs. It interprets the refined feature maps received from the neck and produces the coordinates of bounding boxes along with their class probabilities. In YOLOv8, the architecture adopts a decoupled head design, allowing separate optimization paths for object classification and localization. This decoupling facilitates more targeted learning, improving both detection precision and overall model performance.[53,54]
Experimental setup
In this work, the methodology was structured into two distinct phases:
Phase 1: Heart localization
The first phase of this work focused on heart localization, which is a critical step for reducing computational complexity and improving segmentation accuracy. A YOLOv8n object detection model was used to localize the heart and its great arteries within CT scans. YOLOv8n is a lightweight variant of YOLOv8, designed for efficient object detection with minimal computational overhead.
A custom data preparation process was implemented, wherein for each CT scan, an algorithm calculated the ground truth bounding box of the heart based on its label. Subsequently, we scanned through each slice of the CT scan along all axes, saving those slices as images when heart pixels constituted more than 30% of the total pixels in the slice. This approach focused on relevant data, eliminating non-essential information.
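The filtering step described above can be sketched as follows. This is a minimal illustration assuming the label volume encodes heart structures as nonzero voxels; the helper names are ours, not from the released code, and image saving is omitted.

```python
import numpy as np

def heart_bbox_3d(label: np.ndarray):
    """3D bounding box (lo, hi) enclosing all labeled heart voxels, half-open [lo, hi)."""
    coords = np.argwhere(label > 0)
    return coords.min(axis=0), coords.max(axis=0) + 1

def filter_slices(label: np.ndarray, min_heart_frac: float = 0.30):
    """Return (axis, index) pairs for slices whose heart-pixel fraction
    exceeds the threshold (30% in our pipeline), scanning all three axes."""
    kept = []
    for axis in range(3):
        for i in range(label.shape[axis]):
            sl = np.take(label, i, axis=axis)
            if (sl > 0).mean() > min_heart_frac:
                kept.append((axis, i))
    return kept
```

In the actual pipeline, each kept slice is exported as an image together with the 2D projection of the ground-truth bounding box for YOLO training.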
Following this preparation, we obtained a total of 8940 training samples, with an additional 1873 samples reserved for validation and 794 for testing. After the data preparation phase, the YOLOv8n model was trained for 500 epochs on a batch size of 64 images per iteration using the free hardware configuration provided by Google Colaboratory, which includes an Intel(R) Xeon(R) CPU @ 2.00GHz, a Tesla T4 Graphics processing unit (GPU) with 16 GB of VRAM, and 12 GB of RAM.
Phase 2: Heart segmentation
The second phase of this work focused on heart segmentation. For this stage, we wanted to test the improvements made in the YOLOv11 model; therefore, the YOLOv11n-seg, YOLOv11s-seg, and YOLOv11m-seg variants were chosen to compare their performance. These models are segmentation variants of YOLOv11, designed to predict pixel-wise masks in addition to bounding boxes.
To prepare the data for segmentation, we transformed the heart masks from the ImageCHD dataset, after cropping them from the scans in the first phase (heart localization), into the YOLO annotation format. This involved converting the masks of each class (heart part) into polygon coordinates, which were then used to train the models. The models were trained to detect and segment seven key anatomical heart regions against the background: the left and right atria, the left and right ventricles, the Myo, the Ao, and the PA. Each model’s training process was monitored through key metrics, including segmentation loss, box loss, classification loss, and detection-specific metrics such as mean average precision (mAP)@50, mAP@50-95, precision, and recall for both bounding box (B) and mask (M) predictions.
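For reference, a YOLO segmentation label stores, per object, a class index followed by the polygon vertices normalized to [0, 1] by the image width and height. The formatter below is a minimal sketch of that final conversion step; extracting the polygon from a binary mask in the first place (e.g., with OpenCV contour tracing) is omitted, and the class index used in the example is arbitrary.

```python
def yolo_seg_line(class_id: int, polygon_px, img_w: int, img_h: int) -> str:
    """Format one pixel-space polygon as a YOLO segmentation label line:
    '<class> x1 y1 x2 y2 ...' with coordinates normalized to [0, 1]."""
    coords = []
    for x, y in polygon_px:
        coords.append(f"{x / img_w:.6f}")
        coords.append(f"{y / img_h:.6f}")
    return f"{class_id} " + " ".join(coords)

# e.g., a triangular region for class 3 inside a 256 x 256 crop
line = yolo_seg_line(3, [(64, 64), (192, 64), (128, 192)], 256, 256)
# -> "3 0.250000 0.250000 0.750000 0.250000 0.500000 0.750000"
```

One such line is written per object instance into a `.txt` file sharing the image's base name, which is the layout the Ultralytics trainers expect.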
After the data preparation phase, a total of 76362 training images and 9420 validation images were obtained. We trained the YOLOv11n-seg and YOLOv11s-seg models for 500 epochs with batch sizes of 144 and 80 images per iteration, respectively. Due to time constraints, YOLOv11m-seg was trained for 224 epochs with a batch size of 48 images. The hardware used in this phase was provided by Kaggle, with the dual Tesla T4 GPU accelerator enabled to exploit the multi-GPU training feature of YOLO models. This phase aimed to achieve precise segmentation of the heart structures within the localized regions identified in the first phase. The two-phase approach leverages the strengths of the YOLO architecture in both localization and segmentation tasks and is designed to address the complexities inherent in CHD cases. Figure 1 shows an overview of our approach, while the flowchart depicted in Figure 2 provides a graphical representation of our methodology.
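For illustration, the YOLOv11n-seg run described above maps onto the Ultralytics Python API roughly as follows. The dataset YAML path is hypothetical, and the training call itself is shown commented out because it requires the `ultralytics` package, the prepared dataset, and two GPUs.

```python
# Hypothetical training configuration mirroring the YOLOv11n-seg run above.
train_args = dict(
    data="chd_seg.yaml",  # dataset YAML listing image dirs and the 7 class names (hypothetical path)
    epochs=500,
    batch=144,
    device=[0, 1],        # two Tesla T4 GPUs -> Ultralytics multi-GPU training
)

# from ultralytics import YOLO
# model = YOLO("yolo11n-seg.pt")
# results = model.train(**train_args)
```

Passing a list of device indices is how Ultralytics enables data-parallel training across GPUs; the other variants differ only in the weights file and batch size.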

- Overview of the proposed method, where the red box contains the heart area.

- Methodology flowchart.
RESULTS
For the heart and great arteries localization phase, the results showed that the model achieved satisfactory accuracy in localizing the heart in CT scan images, demonstrating the effectiveness of both the YOLOv8n architecture and the data preparation strategy used in this study. Model performance was evaluated using mAP metrics at varying intersection over union (IoU) thresholds, with the model achieving 99.5% mAP@50 and 81.168% mAP@50-95.
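As a brief reminder of what these metrics measure: a predicted box counts as correct at mAP@50 when its IoU with the ground-truth box is at least 0.5, while mAP@50-95 averages the score over IoU thresholds from 0.5 to 0.95 in steps of 0.05, so it rewards much tighter localization. A minimal IoU computation for axis-aligned boxes (illustrative only, not the evaluation code used here):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # overlap width/height, clamped to zero when the boxes do not intersect
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# identical boxes score 1.0; half-shifted boxes score 2 / (4 + 4 - 2) = 1/3
iou((0, 0, 2, 2), (1, 0, 3, 2))
```

The large gap between mAP@50 (99.5%) and mAP@50-95 (81.168%) therefore indicates that detections are almost always found but their boundaries are not always pixel-tight.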
The results are illustrated in Figure 3, and in Figure 4, we can see the evaluation results on a random batch of the validation set.

- Final results after YOLOv8n training for 500 epochs, showing box_loss, cls_loss, and dfl_loss for both the training and validation sets, along with the precision, recall, mAP50, and mAP50-95 metrics (x axis: epochs; y axis: loss/metric value). Blue dots represent the loss/metric value for each epoch; the red curve is the smoothed trend.

- YOLOv8n evaluation results, where (a) shows a random batch from the validation set with its ground truth and (b) shows the model’s predictions on the same batch.
For the segmentation phase, due to the complexity of the task, segmentation accuracy was low.
The training results of YOLOv11n-seg, YOLOv11s-seg, and YOLOv11m-seg are illustrated in Figures 5-7, respectively, which show the box-loss, seg-loss, cls-loss, and dfl-loss curves for both the training and validation sets, as well as the precision, recall, mAP50, and mAP50-95 curves for both object detection (metrics/precision[B], metrics/recall[B], metrics/mAP50[B], and metrics/mAP50-95[B], respectively) and segmentation (metrics/precision[M], metrics/recall[M], metrics/mAP50[M], and metrics/mAP50-95[M], respectively). For YOLOv11n-seg, Figure 5 shows that all training losses follow a clear decreasing pattern throughout the 500 epochs, indicating that the model is effectively learning from the training data; the curves then flatten out, suggesting convergence on the training set. The training metrics improve consistently, mirroring the decrease in the losses: precision, recall, and mAP for both bounding boxes (B) and masks (M) increase and begin to plateau, confirming that learning occurs, even if performance remains low. On the validation side, the box loss remains relatively flat after an initial adjustment, and the classification loss is also stable. However, the validation distribution focal loss (val/dfl-loss, related to bounding box refinement) shows a slight but steady increase after epoch ~150.

- Training results for YOLOv11n-seg, where the x axis represents the epochs and the y axis represents the loss/metric value for each epoch. Blue dots represent the loss/metric value for each epoch; the red curve is the smoothed trend.

- Training results for YOLOv11s-seg, where the x axis represents the epochs and the y axis represents the loss/metric value for each epoch. Blue dots represent the loss/metric value for each epoch; the red curve is the smoothed trend.

- Training results for YOLOv11m-seg, where the x axis represents the epochs and the y axis represents the loss/metric value for each epoch. Blue dots represent the loss/metric value for each epoch; the red curve is the smoothed trend.
The most concerning plot is the val/seg-loss curve, where the validation segmentation loss decreases initially but then increases significantly from around epoch 100 onwards. This is a classic indicator of overfitting, specifically on the segmentation task: the model is learning intricate details, such as small or adjacent parts and noise specific to the training set’s segmentation masks, that do not generalize to the unseen validation data.
For the validation metrics, we can observe that they increase initially but plateau relatively early (around epoch 100–150) and remain stagnant or even slightly decrease afterward. The final validation mAP50-95 for masks is around 0.25, which is quite low for segmentation, suggesting poor pixel-level accuracy and difficulty at stricter IoU thresholds. The mAP50 is higher (~0.45), indicating the model is better at getting the general location and shape correct but struggles with precise boundary delineation. The divergence between steadily improving training metrics and plateauing or worsening validation metrics and losses confirms overfitting.
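The divergence just described is the classic early-stopping signal. As a sketch (the patience value below is hypothetical, not a setting from this study), one would keep the checkpoint at the minimum validation segmentation loss and stop training once no improvement has been seen for a fixed number of epochs:

```python
def best_epoch(val_loss, patience=50):
    """Return the epoch with the lowest validation loss, plus whether early
    stopping with the given patience would have triggered before the end."""
    best = min(range(len(val_loss)), key=val_loss.__getitem__)
    stopped_early = (len(val_loss) - 1 - best) >= patience
    return best, stopped_early

# toy curve shaped like the val/seg-loss here: falls, bottoms out, then rises
losses = [1.0, 0.6, 0.4, 0.35, 0.4, 0.5, 0.6, 0.7]
best, stopped = best_epoch(losses, patience=3)  # best == 3, stopped is True
```

Ultralytics exposes the same idea through its `patience` training argument, which halts training when validation metrics stop improving.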
DISCUSSION
As Figure 6 shows, the YOLOv11s-seg model behaves much like YOLOv11n-seg: all training losses decrease steadily and training metrics improve throughout, indicating effective learning on the training set. The final training metrics (e.g., precision, mAP) are slightly higher than those of the “n” model, which is expected given the larger capacity of the “s” variant. On the validation side, the val/box-loss is perhaps slightly more stable than in the nano model. However, the critical val/seg-loss shows exactly the same pattern of initially decreasing and then sharply increasing after ~150 epochs, indicating significant overfitting on segmentation, just as in the “n” model. The val/dfl-loss also increases, suggesting overfitting in localization refinement. The validation metrics plateau similarly to the “n” model, reaching slightly higher peak values in some cases (e.g., mask mAP50-95 around 0.27 versus 0.25, mask mAP50 around 0.44 versus 0.45, very similar). Despite the slightly better peak, the overall pattern and final performance are comparable to the “n” model, the overfitting issue persists strongly, and the gap between training and validation performance remains large.
As shown in Figure 7, and as with the previous models, all training losses decrease consistently throughout the 224 epochs, indicating effective learning and convergence on the training dataset. The training metrics for both bounding boxes (B) and masks (M) show steady improvement, plateauing towards the end of training. The peak training metrics are comparable to or slightly higher than those of the “s” model, as expected from the increased capacity. However, the model still suffers from overfitting on the segmentation task; the problem is not resolved by increasing the model size to “m”.
To better understand the models’ performance, we examined their confusion matrices. While metrics such as mAP provide an overall score, the confusion matrix offers a per-class view of performance. It helps identify specific misclassification patterns by revealing which classes are confused with one another, and it quantifies these errors, helping us compare the models and diagnose their weaknesses for future improvement. Figures 8-10 show the confusion matrices for YOLOv11n-seg, YOLOv11s-seg, and YOLOv11m-seg, respectively.

- Confusion matrices for the YOLOv11n-seg model.

- Confusion matrices for the YOLOv11s-seg model.

- Confusion matrices for the YOLOv11m-seg model.
The confusion matrix of the YOLOv11n-seg model in Figure 8 provides insight into how well the network is able to classify pixels across different anatomical classes. The diagonal elements, which represent the true positives, show a reasonably accurate classification for several heart structures. Among these, the LA and PA stand out with the highest number of correctly identified pixels. Other structures, such as the LV, RV, RA, Myo, and Ao, also demonstrate adequate correct classification, despite the lower pixel counts. However, a major issue is revealed in the off-diagonal elements, particularly concerning confusion with the background. A large number of pixels belonging to actual cardiac structures are misclassified as background, which is clearly evident from the last row of the matrix. For example, 24307 pixels from the PA are wrongly identified as background. Furthermore, the last column of the matrix highlights a critical false-positive issue, where background pixels are misclassified as various structures, such as 11093 background pixels predicted as PA and 10656 as LA. This suggests that the model has significant difficulty in effectively distinguishing small heart regions from the surrounding background, resulting in both false negatives (missed structures) and false positives (incorrectly predicted structures).
In addition, there is notable inter-class confusion between anatomically adjacent structures. For instance, the LA is misclassified as PA in 3186 pixels, and the PA is confused with the LA in 5189 pixels. These errors imply that the model struggles to accurately delineate boundaries between nearby regions, likely due to subtle intensity variations or overlapping textures within the CT images. Overall, the YOLOv11n-seg model lacks precision in separating foreground structures both from one another and from the background, indicating a need for better boundary modeling and spatial awareness.
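The background-related counts cited above (e.g., 24307 PA pixels lost to background and 11093 background pixels claimed as PA) sit in the last row and last column of the matrix. A small sketch of extracting them programmatically, assuming predictions on the rows and background as the last class (the class ordering is an illustrative assumption):

```python
import numpy as np

CLASSES = ["LV", "RV", "LA", "RA", "Myo", "AO", "PA", "background"]

def background_errors(cm: np.ndarray):
    """Split out the two dominant failure modes from a confusion matrix
    with predictions on rows and background as the last class:
    missed  = structure pixels predicted as background (last row),
    spurious = background pixels predicted as a structure (last column)."""
    bg = len(CLASSES) - 1
    missed = {CLASSES[i]: int(cm[bg, i]) for i in range(bg)}
    spurious = {CLASSES[i]: int(cm[i, bg]) for i in range(bg)}
    return missed, spurious

# Toy matrix seeded only with the YOLOv11n-seg PA counts reported above.
cm = np.zeros((8, 8), dtype=int)
cm[7, 6] = 24307   # PA pixels predicted as background
cm[6, 7] = 11093   # background pixels predicted as PA
missed, spurious = background_errors(cm)
print(missed["PA"], spurious["PA"])  # 24307 11093
```

Running this over each model's matrix yields the per-class comparisons made in the following paragraphs.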
As shown in Figure 9, the confusion matrix for the YOLOv11s-seg model exhibits behavior similar to that of its smaller counterpart, the “n” variant. The diagonal entries, representing correctly classified pixels, remain comparable, with some classes showing slight improvements. For example, PA classification improves marginally (18635 vs. 19852 correctly classified pixels), while other classes, such as the RV, show a minor drop in correct predictions (5371 vs. 5185). These variations suggest small quantitative differences but no significant architectural advantage.
The issue of background confusion persists in this model as well. Some improvements are observed: slightly fewer LV (4268 vs. 4253) and Myo (7194 vs. 7168) pixels are misclassified, and the number of LA and PA pixels predicted as background decreases notably (20831 vs. 20276 and 24307 vs. 23438, respectively). However, the number of background pixels misclassified as structures remains high and almost identical to that of the YOLOv11n-seg model. This reinforces the observation that the model, despite its increased capacity, continues to struggle with small, low-contrast, or spatially ambiguous boundaries between structures and the background. Patterns of inter-structure confusion also remain consistent: the PA is still misclassified as LA in 4880 instances (slightly improved from 5189), while the LA is misclassified as PA even more frequently (3186 vs. 3441). These persistent misclassifications suggest that increasing model complexity alone is insufficient to resolve ambiguities caused by overlapping anatomical features or limited contextual awareness.
Figure 10 shows that YOLOv11m-seg, despite its increased capacity, suffers from the same weaknesses as the smaller variants: the same difficulty in separating small foreground structures from the background and the same confusion between adjacent structures, with little improvement over the previous models.
CONCLUSION
This study introduced a YOLO-based pipeline for whole-heart segmentation in CHD cases, combining the strengths of YOLOv8 for localization and YOLOv11-seg for mask prediction. For localization of the heart and its great vessels, YOLOv8n achieves 99.5% mAP@50 and 81.168% mAP@50-95, a satisfactory accuracy that supports automatic CHD diagnosis and demonstrates the potential of YOLO models in medical assistance tasks, with room for future improvement. The segmentation phase, however, was less successful. Although the models demonstrated effective training behavior, as indicated by steadily decreasing training losses and improving performance metrics, evaluation on the validation set revealed notable limitations. A key challenge was overfitting in the segmentation task, with the validation mask mAP plateauing at low levels. The confusion matrix analysis revealed significant difficulty in distinguishing foreground cardiac structures from background regions, suggesting problems with boundary detection, low-contrast region handling, or insufficient model confidence. Furthermore, there was notable misclassification between anatomically adjacent or visually similar regions, such as the left and right atria, highlighting insufficient fine structural discrimination. Scaling the model led to only minor gains and failed to resolve key issues such as overfitting and segmentation inaccuracies, suggesting limitations in the architecture or training strategy. These findings suggest that, in its current form, YOLO is not competitive with benchmark segmentation frameworks such as nnUNet and 3D U-Net. However, the potential value of YOLO lies in its lightweight architecture and speed, which may be relevant for applications requiring rapid inference or operating under constrained computational resources.
This exploratory study highlights the trade-offs between efficiency and accuracy and underscores the need for further development, such as incorporating regularization strategies, improved data augmentation, or hybrid architectures that combine the strengths of detection and segmentation networks.
Acknowledgment:
The authors would like to acknowledge the ImageCHD project team for granting access to the dataset used in this research.
Ethical approval:
Institutional Review Board approval was not required, as this study involved only anonymized data that are publicly available upon request.
Declaration of patient consent:
Patient consent was not required as there are no patients in this study.
Conflicts of interest:
There are no conflicts of interest.
Use of artificial intelligence (AI)-assisted technology for manuscript preparation:
AI tools were used only for language assistance in this manuscript. The extent of use was limited to improving grammar, sentence structure, and readability. No part of the scientific content, data analysis, interpretation of results, figures, or images was generated or modified using AI.
Financial support and sponsorship: Nil.
References
- The Challenge of Congenital Heart Disease Worldwide: Epidemiologic and Demographic Facts. Semin Thorac Cardiovasc Surg Pediatr Card Surg Annu. 2010;13:26-34.
- [CrossRef] [PubMed] [Google Scholar]
- The Incidence of Congenital Heart Disease. J Am Coll Cardiol. 2002;39:1890-900.
- [CrossRef] [PubMed] [Google Scholar]
- Multimodality Imaging Guidelines for Patients with Repaired Tetralogy of Fallot: A Report from the American Society of Echocardiography: Developed in Collaboration with the Society for Cardiovascular Magnetic Resonance and the Society for Pediatric Radiology. J Am Soc Echocardiogr. 2014;27:111-41.
- [CrossRef] [PubMed] [Google Scholar]
- The Role of Machine Learning in Congenital Heart Disease Diagnosis: Datasets, Algorithms, and Insights. arXiv 2025:2501.04493.
- [Google Scholar]
- Advancements in Cardiac Biomodels using 3D Printing and Bioprinting for Surgical Planning and Training: A Systematic Literature Review. Res Biomed Eng. 2025;41:18.
- [CrossRef] [Google Scholar]
- Efficacy and Safety of 3D-Printed Models in the Surgical Planning of Congenital Heart Defects: A Systematic Review. J Med Biosci Res. 2025;2:934-44.
- [CrossRef] [Google Scholar]
- Impact of 3D Printing on Cardiac Surgery in Congenital Heart Diseases: A Systematic Review and Meta-Analysis. Arq Bras Cardiol. 2025;121:e20240430.
- [CrossRef] [Google Scholar]
- The Usefulness of 3D Heart Models as a Tool of Congenital Heart Disease Education: A Narrative Review. J Saudi Heart Assoc. 2025;37:1.
- [CrossRef] [PubMed] [Google Scholar]
- Automatic Measurements of Left Ventricular Volumes and Ejection Fraction by Artificial Intelligence: Clinical Validation in Real Time and Large Databases. Eur Heart J Cardiovasc Imaging. 2023;25:383-95.
- [CrossRef] [PubMed] [Google Scholar]
- Semi-Automated Quantification of Left Ventricular Volumes and Ejection Fraction using 3-Dimensional Echocardiography. Available from: https://www.escardio.org/journals/e-journal-of-cardiology-practice/volume-4/vol4n11-title-semi-automated-quantification-of-left-ventricular-volumes-and-e [Last accessed on 2025 Mar 20]
- [Google Scholar]
- Quantification of Left Ventricular Volume and Global Function using a Fast Automated Segmentation Tool: Validation in a Clinical Setting. Int J Cardiovasc Imaging. 2013;29:309-16.
- [CrossRef] [PubMed] [Google Scholar]
- Evaluation of Algorithms for Multi-Modality Whole Heart Segmentation: An Open-Access Grand Challenge. Med Image Anal. 2019;58:101537.
- [CrossRef] [PubMed] [Google Scholar]
- A Survey on Deep Learning in Medical Image Analysis. Med Image Anal. 2017;42:60-88.
- [CrossRef] [PubMed] [Google Scholar]
- A Review of Segmentation Methods in Short Axis Cardiac MR Images. Med Image Anal. 2011;15:169-84.
- [CrossRef] [PubMed] [Google Scholar]
- U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Navab N, Hornegger J, Wells WM, Frangi AF, eds. Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015. Cham: Springer International Publishing; 2015. p. 234-41.
- [CrossRef] [Google Scholar]
- nnUNet: A Self-Configuring Method for Deep Learning-Based Biomedical Image Segmentation. Nat Methods. 2021;18:203-11.
- [CrossRef] [PubMed] [Google Scholar]
- 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation. In: Ourselin S, Joskowicz L, Sabuncu MR, Unal G, Wells W, eds. Medical Image Computing and Computer-Assisted Intervention - MICCAI 2016. Cham: Springer International Publishing; 2016. p. 424-32.
- [CrossRef] [Google Scholar]
- You Only Look Once: Unified, Real-Time Object Detection. Presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016:779-88. Available from: https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/redmon_you_only_look_cvpr_2016_paper.html [Last accessed on 2025 Mar 21]
- [CrossRef] [Google Scholar]
- YOLO for Medical Object Detection (2018-2024). In: 2024 IEEE 3rd International Conference on Electrical Power and Energy Systems (ICEPES). 2024. p. 1-7.
- [CrossRef] [Google Scholar]
- A Comprehensive Systematic Review of YOLO for Medical Object Detection (2018 to 2023) IEEE Xplore. Available from: https://ieeexplore.ieee.org/abstract/document/10494845 [Last accessed on 2025 Mar 21]
- [Google Scholar]
- Using YOLO Based Deep Learning Network for Real Time Detection and Localization of Lung Nodules from Low Dose CT Scans. In: Medical Imaging 2018: Computer-Aided Diagnosis. London: SPIE; 2018. p. 347-55.
- [CrossRef] [PubMed] [Google Scholar]
- Automated Brain Tumor Segmentation and Classification in MRI Using YOLO-Based Deep Learning. IEEE Access. 2024;12:16189-207.
- [CrossRef] [Google Scholar]
- Deformable Models in Medical Image Analysis: A Survey. Med Image Anal. 1996;1:91-108.
- [CrossRef] [PubMed] [Google Scholar]
- A Combined Deep-Learning and Deformable-Model Approach to Fully Automatic Segmentation of the Left Ventricle in Cardiac MRI. Med Image Anal. 2016;30:108-19.
- [CrossRef] [PubMed] [Google Scholar]
- Active Shape Models-their Training and Application. Comput Vis Image Underst. 1995;61:38-59.
- [CrossRef] [Google Scholar]
- A Threshold Selection Method from Gray-Level Histograms. IEEE Trans Syst Man Cybern. 1979;9:62-6.
- [CrossRef] [Google Scholar]
- Multi-Scale Patch and Multi-Modality Atlases for Whole Heart Segmentation of MRI. Med Image Anal. 2016;31:77-87.
- [CrossRef] [PubMed] [Google Scholar]
- Deep Learning Techniques for Automatic MRI Cardiac Multi-Structures Segmentation and Diagnosis: Is the Problem Solved? IEEE Trans Med Imaging. 2018;37:2514-25.
- [CrossRef] [PubMed] [Google Scholar]
- Hybrid Loss Guided Convolutional Networks for Whole Heart Parsing. In: Pop M, Sermesant M, Jodoin PM, Lalande A, Zhuang X, Yang G, eds. Statistical Atlases and Computational Models of the Heart. ACDC and MMWHS Challenges. Cham: Springer International Publishing; 2018. p. 215-23.
- [CrossRef] [Google Scholar]
- Fully Convolutional Networks for Semantic Segmentation. Presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015:3431-40. Available from: https://openaccess.thecvf.com/content_cvpr_2015/html/long_fully_convolutional_networks_2015_cvpr_paper.html [Last accessed on 2025 Mar 22]
- [CrossRef] [Google Scholar]
- A Two-Stage 3D Unet Framework for Multi-Class Segmentation on Full Resolution Image. arXiv 2018:1804.04341.
- [CrossRef] [Google Scholar]
- Multi-Label Whole Heart Segmentation Using CNNs and Anatomical Label Configurations. In: Pop M, Sermesant M, Jodoin PM, Lalande A, Zhuang X, Yang G, eds. Statistical Atlases and Computational Models of the Heart. ACDC and MMWHS Challenges. Cham: Springer International Publishing; 2018. p. 190-8.
- [CrossRef] [Google Scholar]
- Whole Heart and Great Vessel Segmentation in Congenital Heart Disease Using Deep Neural Networks and Graph Matching. In: Shen D, Liu T, Peters TM, Staib LH, Essert C, Zhou S, eds. Medical Image Computing and Computer Assisted Intervention - MICCAI 2019. Cham: Springer International Publishing; 2019. p. 477-85.
- [CrossRef] [Google Scholar]
- ImageCHD: A 3D Computed Tomography Image Dataset for Classification of Congenital Heart Disease. 2021. Available from: https://arxiv.org/abs/2101.10799 [Last accessed on 2023 Nov 14]
- [CrossRef] [Google Scholar]
- DeSPPNet: A Multiscale Deep Learning Model for Cardiac Segmentation. Diagnostics (Basel). 2024;14:2820.
- [CrossRef] [PubMed] [Google Scholar]
- Automated Segmentation of Left Ventricle in 2D Echocardiography using Deep Learning. arXiv 2020:2003.07628.
- [CrossRef] [Google Scholar]
- Cardiac Disease Classification Using Two-Dimensional Thickness and Few-Shot Learning Based on Magnetic Resonance Imaging Image Segmentation. J Imaging. 2022;8:194.
- [CrossRef] [PubMed] [Google Scholar]
- YOLO-TumorNet: An Innovative Model for Enhancing Brain Tumor Detection Performance. Alex Eng J. 2025;119:211-21.
- [CrossRef] [Google Scholar]
- Application of YOLO Models for Assisted Tumor Diagnosis: A Computer Vision-Based Approach [v1] Available from: https://www.preprints.org/manuscript/202502.1402 [Last accessed on 2025 Mar 22]
- [Google Scholar]
- Automatic Localization of Normal Active Organs in 3D PET Scans. Comput Med Imaging Graph. 2018;70:111-8.
- [CrossRef] [PubMed] [Google Scholar]
- Multi-Site Organ Detection in CT Images using Deep Learning. 2020. Available from: https://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-279290 [Last accessed on 2025 Mar 22]
- [Google Scholar]
- A Dermoscopic Skin Lesion Classification Technique Using YOLO-CNN and Traditional Feature Model. Arab J Sci Eng. 2021;46:9797-808.
- [CrossRef] [Google Scholar]
- Automatic Detection and Categorization of Skin Lesions for Early Diagnosis of Skin Cancer Using YOLO-v3-DCNN Architecture. Image Anal Stereol. 2023;42:101-17.
- [CrossRef] [Google Scholar]
- MedYOLO: A Medical Image Object Detection Framework. J Imaging Inform Med. 2024;37:3208-16.
- [CrossRef] [PubMed] [Google Scholar]
- Automated Left Ventricle Segmentation in Echocardiography Using YOLO: A Deep Learning Approach for Enhanced Cardiac Function Assessment. Electronics. 2024;13:2587.
- [CrossRef] [Google Scholar]
- Optimizing the Trade-off between Single-Stage and Two-Stage Object Detectors using Image Difficulty Prediction. arXiv 2018:1803.08707.
- [CrossRef] [Google Scholar]
- Region-Based Convolutional Networks for Accurate Object Detection and Segmentation. IEEE Trans Pattern Anal Mach Intell. 2016;38:142-58.
- [CrossRef] [PubMed] [Google Scholar]
- SSD: Single Shot MultiBox Detector. In: Leibe B, Matas J, Sebe N, Welling M, eds. Computer Vision - ECCV 2016. Cham: Springer International Publishing; 2016. p. 21-37.
- [CrossRef] [Google Scholar]
- DSSD: Deconvolutional Single Shot Detector. arXiv 2017:1701.06659.
- [Google Scholar]
- RetinaNet with Difference Channel Attention and Adaptively Spatial Feature Fusion for Steel Surface Defect Detection. IEEE Xplore. Available from: https://ieeexplore.ieee.org/abstract/document/9270024 [Last accessed on 2025 Mar 23]
- [Google Scholar]
- Ultralytics YOLOv8. 2023. Available from: https://github.com/ultralytics/ultralytics [Last accessed on 2025 Jul 18]
- [Google Scholar]
- YOLOv11: An Overview of the Key Architectural Enhancements. Available from: https://arxiv.org/html/2410.17725v1 [Last accessed on 2025 Mar 24]
- [Google Scholar]
- YOLOv8 to YOLO11: A Comprehensive Architecture in-depth Comparative Review. arXiv 2025:2501.13400.
- [Google Scholar]
- Ultralytics YOLO11. 2024. Available from: https://github.com/ultralytics/ultralytics [Last accessed on 2025 Jul 18]
- [Google Scholar]

