May. 24, 2024
Data and AI Metrics with Intel® Geti™
Learn how to define and leverage metrics to identify improvements and iteratively refine your AI project with Intel® Geti™ software.
Data and AI Metrics
Metrics are a critical and common foundation for AI projects, helping us measure the impact and effectiveness of our data and machine learning systems. Well-defined metrics help pinpoint deficiencies in our AI projects, enable us to communicate those deficiencies to the relevant stakeholders, and let us plan further improvements to achieve the desired project KPIs. They can in turn help us meet reporting and transparency requirements when collaborating with multiple domain experts.
In this article, we will explore the types of metrics that are essential for computer vision projects throughout the phases of an AI project, how these metrics are defined and measured, and best practices for leveraging metrics in classification, detection, and segmentation tasks.
Right Data + Right People = Right Models
A common theme in the AI space when it comes to input data is “Garbage in, Garbage out”. The quality of our data directly impacts the performance and fit of our AI systems to the use case – “bad” data can consistently hinder models from effectively generalizing to the real world. On the other hand, with high-quality data and the right stakeholders involved in a project, we can achieve the right models!
By understanding our datasets, we can form expectations around our problem statement and influence the data editing and annotation processes to refine the quality of our data.
Data Statistics
There are three overall themes of data quality and usefulness that developers often target to characterize their dataset with metrics and statistics:
- Balance: Datasets that are balanced are distributed across different categories in a way that meets expectations. There are a variety of metrics, including the number of objects per label and performance per label, that can help with the measurement of balanced data.
- Diversity: Diverse data contains samples covering a wide range of scenarios. This can include images captured in different lighting conditions, angles, environments, etc. If our dataset is not representative of the instances we expect our models to perform well on, this can lead to data imbalances (e.g., not having a similar number of apples and oranges in our dataset) and directly impact the use of the model in the actual application (e.g., our model classifying apples as oranges). The diversity of data instances is also connected to Responsible AI when dealing with data containing or impacting people: low diversity can lead to model biases. Consider an intelligent vehicle application for pedestrian detection to alert drivers of nearby pedestrians. If the dataset does not contain instances of individuals in wheelchairs, the model could fail to detect disabled pedestrians and make significant mistakes in its predictions, placing certain pedestrians at greater risk than others.
- Level of complexity: Data quality and usefulness hinge on datasets that reflect real-world complexity with appropriate annotations and configurations.
Number of objects per label
The number of objects per label measures the distribution of the data objects across different categories or classes. In the Intel Geti media dashboard, domain experts may review this metric to identify whether the number of objects per label is skewed or uneven beyond expectations, in which case the data might not represent the real-world distribution of the categories. Additionally, it is possible to see the number of media items (images and videos), annotated images, and annotated videos and frames.
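As a rough illustration, the same per-label counts can be computed outside the dashboard from an annotation export. The snippet below is a minimal sketch that assumes a COCO-style annotation file; the file name and the "half the mean count" imbalance check are illustrative choices, not part of Intel Geti.

```python
import json
from collections import Counter

def objects_per_label(annotation_path: str) -> dict:
    """Count annotated objects per label from a COCO-style export (assumed format)."""
    with open(annotation_path) as f:
        coco = json.load(f)
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    return dict(Counter(id_to_name[a["category_id"]] for a in coco["annotations"]))

# Illustrative check: flag labels whose counts fall far below the mean.
counts = objects_per_label("annotations.json")
mean_count = sum(counts.values()) / len(counts)
underrepresented = {label: n for label, n in counts.items() if n < 0.5 * mean_count}
print(counts)
print("Possibly underrepresented:", underrepresented)
```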
Object Size Distribution
Object size distribution measures the variation in the size of the objects in our dataset. This metric is typically leveraged to understand how representative the data is and to assess how well the AI system is likely to generalize to it.
With Intel Geti software, we approach object size distribution through three categories: i) tall, ii) wide, and iii) balanced.
We can visualize the object size corresponding to each label, including the overall height and width of objects in relation to each other.
For example, when annotating images of individuals, the ground truth bounding boxes are typically created to snap around individuals’ figures, which results in an output similar to Figure 2, with all of the object size results being balanced.
However, consider objects such as vehicles on a freeway that can appear larger when closer to a camera and smaller when farther away. If these objects are being annotated, we can expect extreme variations of object sizes, with some bounding boxes being very tall in terms of height and others having a much larger width compared to the rest of the instances in the dataset.
In scenarios such as these, we may expect taller and wider bounding boxes. However, in scenarios where such variation is unexpected – for example, detection of smaller objects (fruits, transistors, etc.) – very tall or very wide bounding boxes can represent annotation anomalies that might need to be addressed.
The ellipse in the distribution chart represents the object sizes that are closest to the overall average of the object sizes. One of the key things we can observe with this metric is whether the dots representing those size variations fall within or around the dotted ellipse, which tells us that the object sizes are similar and do not vary extremely.
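For intuition, a simple way to approximate this kind of breakdown outside the dashboard is to bucket bounding boxes by aspect ratio. The sketch below is illustrative only; the 1.5 ratio threshold is an assumption, not the rule Intel Geti uses internally.

```python
def size_category(width: float, height: float, ratio_threshold: float = 1.5) -> str:
    """Bucket a bounding box as 'tall', 'wide', or 'balanced' by its aspect ratio.

    The 1.5 threshold is an illustrative assumption, not Intel Geti's internal rule.
    """
    if height >= ratio_threshold * width:
        return "tall"
    if width >= ratio_threshold * height:
        return "wide"
    return "balanced"

# (width, height) pairs in pixels, e.g., taken from annotated bounding boxes.
boxes = [(30, 90), (120, 40), (50, 55)]
print([size_category(w, h) for w, h in boxes])  # ['tall', 'wide', 'balanced']
```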
Model Metrics
Model Variant metrics
Model accuracy and file size are key metrics that allow us to compare the expected performance of model variants on a target device. Models with higher accuracy and a lower footprint are often a more appealing choice for lightweight deployment on edge or client devices.
Accurate models with a lower footprint can process more data in less time, enabling faster frame rates, and model execution on devices with limited or constrained compute, such as embedded systems. This is particularly important for applications such as real-time imaging for surgical guidance and diagnosis, and accurate pose estimation for smart fitness applications.
One of the methods we can use to obtain a lower model footprint is “halving” the precision of our models from FP32 (32 bits per floating-point value) to FP16 (16 bits), reducing the memory usage of the models at runtime. By halving the precision with FP16, we expect to see very minimal degradation of performance compared to the model with FP32 precision. Quantizing models to INT8 (8-bit integer values) can further reduce the model size, but introduces a tradeoff to consider: mapping floating-point numbers to integer values may impact accuracy more noticeably, depending on the use case.
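For readers who want to experiment with this outside the platform, the sketch below shows one possible workflow using OpenVINO and NNCF on an already-exported model. The file names, input shape, and calibration data are placeholders, and this is not a description of Intel Geti's internal optimization pipeline.

```python
# Minimal sketch: FP16 weight compression and INT8 post-training quantization.
# Assumes the openvino and nncf packages are installed and that "model.xml"
# is an existing OpenVINO IR export; paths and calibration data are placeholders.
import numpy as np
import openvino as ov
import nncf

core = ov.Core()
model = core.read_model("model.xml")

# FP16: store weights at half precision when saving the IR.
ov.save_model(model, "model_fp16.xml", compress_to_fp16=True)

# INT8: post-training quantization needs a small, representative calibration set.
def transform_fn(data_item):
    # Map a dataset item to the model's input; adapt to your own data loader.
    return data_item["image"]

calibration_data = [{"image": np.zeros((1, 3, 224, 224), dtype=np.float32)}]  # placeholder samples
calibration_dataset = nncf.Dataset(calibration_data, transform_fn)
quantized_model = nncf.quantize(model, calibration_dataset)
ov.save_model(quantized_model, "model_int8.xml")
```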
Dataset split
Dataset splits are performed to train, validate, and test AI models on different portions of the data:
- Training subset: We reserve a large subset of the data for model training.
- Validation subset: This subset is reserved for observing the model’s performance and tuning hyperparameters.
- Test subset: The test subset is used to test the model’s generalization capabilities on unseen data that is not present in the training and validation subsets.
The size of the data allocated to each of these subsets plays an important role in the success of the model. A large training subset can lead to better model performance – however, a small test set can hinder our ability to understand how the model will perform in the real world.
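Intel Geti determines the split automatically, but for intuition, a typical stratified split outside the platform might look like the sketch below. The 70/15/15 ratios, the toy file names, and the labels are illustrative assumptions, not the platform's internal logic.

```python
from sklearn.model_selection import train_test_split

# Toy data: 100 images with two classes, split 70/15/15 while preserving label balance.
images = [f"img_{i}.jpg" for i in range(100)]
labels = ["apple" if i % 2 == 0 else "orange" for i in range(100)]

train_x, rest_x, train_y, rest_y = train_test_split(
    images, labels, test_size=0.30, stratify=labels, random_state=42
)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.50, stratify=rest_y, random_state=42
)
print(len(train_x), len(val_x), len(test_x))  # 70 15 15
```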
Intel Geti software provides a visualization for the optimally determined dataset split to identify the portions of the dataset used for training, validation, and testing. These dataset sub-samples can be reviewed in the Training Datasets tab too.
F-Measure
Frequently leveraged for imbalanced data, the F-measure score is calculated as the harmonic mean of precision and recall. Precision and recall are two metrics used for evaluating the performance of models using true and false positives:
- True Positive: A case where a positive data point is correctly classified.
- False Positive: A case where a data point is incorrectly classified as belonging to a “positive” class. For example, an image of a rock classified as a hardhat is an example of a false positive.
Precision is calculated as the fraction of relevant instances among all retrieved instances, i.e., TP / (TP + FP), and recall is the fraction of relevant instances that were retrieved, i.e., TP / (TP + FN), where a false negative (FN) is a positive instance the model misses. Both metrics range from 0 to 1, with 1 being the best value and 0 being the worst.
The F-measure or F-score is calculated with the following equation, which ensures that both precision and recall must be high to achieve a high F-score: 2 x [(Precision x Recall) / (Precision + Recall)]. This metric also ranges from 0 to 1, with values at or near 1 being ideal.
We can visualize the F-measure score as applied to the validation and test datasets, as well as the F-measure per label on the validation and test slices.
The overall F-measure and the per-label metrics serve a similar purpose but can present interesting differences. Although the overall F-measure score for these subsets can show high performance, the performance-per-label (or performance-per-class) metrics allow us to identify the specific classes the model underperforms on, which we might need to remediate with additional data. This information may not be apparent from the overall F-measure metric. If the model is underperforming on a specific class, a user may consider increasing the number of instances of that class in the dataset, or examining the quality of those samples to understand their impact on the model’s underperformance.
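As a quick illustration of the difference between the overall and per-label views, the sketch below computes both from a toy set of ground-truth and predicted labels using scikit-learn; the class names and values are placeholders.

```python
from sklearn.metrics import f1_score

y_true = ["apple", "apple", "orange", "orange", "pear", "pear", "pear", "pear"]
y_pred = ["apple", "apple", "orange", "orange", "pear", "apple", "orange", "apple"]

labels = ["apple", "orange", "pear"]
overall = f1_score(y_true, y_pred, labels=labels, average="macro")
per_label = f1_score(y_true, y_pred, labels=labels, average=None)

print(f"Overall (macro) F-score: {overall:.2f}")
# The per-label view reveals that 'pear' underperforms and may need more data.
for label, score in zip(labels, per_label):
    print(f"{label}: {score:.2f}")
```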
A large and balanced dataset typically helps improve the F-score metric, both overall and per label, on both the validation and test subsets. Sampling techniques can additionally help improve the F-score by reducing the bias of the AI model towards a majority class and increasing its sensitivity to minority classes.
Users typically leverage the F-measure on the validation set, which is sampled from the training dataset, to understand the effect of the chosen hyperparameters. A low F-measure on the test dataset indicates that the model will perform poorly in its intended application and that re-training is needed.
Optimal Confidence Threshold
The optimal confidence threshold is the minimum confidence score needed for a model prediction to be considered a correct prediction. If the confidence is below this threshold, such as the value 0.08 in Figure 5, the model ignores the prediction entirely. The confidence threshold ensures that only predictions/inferences the model is confident about are kept, while incorrect or unsure predictions are filtered out to reduce the number of extraneous predictions.
If the optimal confidence threshold is defined to be higher than 0.8, this implies the model will be stricter regarding its confidence – it must be more certain about a prediction before it considers it true. The model’s performance (e.g., on the F-measure metric) also needs to be high to achieve these objectives successfully.
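In practice, applying a confidence threshold is a simple filter over the model's raw predictions. The sketch below is illustrative only; the labels, scores, and 0.8 default are placeholder values, not the behavior of a specific Intel Geti API.

```python
def filter_predictions(predictions, confidence_threshold: float = 0.8):
    """Keep only predictions whose confidence meets or exceeds the threshold.

    `predictions` is a list of (label, confidence) pairs; the 0.8 default mirrors
    the stricter setting discussed above and is purely illustrative.
    """
    return [(label, conf) for label, conf in predictions if conf >= confidence_threshold]

raw = [("person", 0.93), ("person", 0.41), ("bicycle", 0.79), ("person", 0.85)]
print(filter_predictions(raw))       # strict: keeps only the 0.93 and 0.85 detections
print(filter_predictions(raw, 0.4))  # lenient: keeps all four detections
```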
Training metrics
Intel Geti provides different training metrics in its metrics dashboard, as displayed in Figures 6 and 7. Some examples are:
- Loss: Loss is a measure of how well an AI model fits the data. The objective of model training is to minimize the loss, which implies a more accurate and reliable model over time. As part of curve visualizations, we can check for reduced loss over time (i.e., going down and saturating) to ensure the objectives of training our model are being met.
- Learning rate: The learning rate is a parameter controlling how much the model learns from each training step. A low learning rate implies the model learns more slowly and may reach the optimal solution (but in certain cases can get stuck in a local minimum and miss the global optimum). A higher learning rate implies the model learns faster and may overshoot the optimal solution. The ideal learning-rate curve declines and saturates, similar to an exponential decay.
- Time across epochs: This metric represents the training time – particularly, the duration of each training cycle – where an epoch represents one complete pass through the entire dataset. Note that, across training rounds, visualizations often show the size of the dataset increasing with each iteration. The number of epochs and the time across epochs help us determine how long the model trains for. A small or large training time isn’t necessarily good or bad – short training times can sometimes be preferred when we want models that can be created rapidly, completing more training cycles in the same amount of time and stopping before the performance becomes stagnant. However, this may also indicate that the model is not learning enough from the data and may not perform well in production. (See the sketch after this list.)
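To make these curves concrete, the following is a minimal, self-contained PyTorch sketch that logs loss, learning rate, and time per epoch on toy data; it is not how Intel Geti trains models internally, and the model, data, and schedule are placeholders.

```python
import time
import torch
from torch import nn, optim

# Toy data and model, purely to illustrate tracking loss, learning rate, and epoch time.
x = torch.randn(256, 10)
y = torch.randint(0, 2, (256,))
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)  # decaying LR curve

for epoch in range(5):
    start = time.time()
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
    # Healthy training: loss trends down and saturates; the learning rate decays smoothly.
    print(f"epoch {epoch}: loss={loss.item():.3f} "
          f"lr={scheduler.get_last_lr()[0]:.4f} time={time.time() - start:.2f}s")
```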
Momentum and Validation Metrics
There are several other validation metrics that can help identify how well a model is generalizing to data, i.e., how well the model will perform on unseen data. These metrics support decisions about model readiness by measuring the accuracy and consistency of the model’s predictions/inferences, which ultimately represent how the model will perform in production.
For detection and segmentation models, Intel Geti software supports the following metrics that can help business users and domain experts achieve this goal, as displayed in Figure 8:
- AP50: This metric represents the Average Precision score at an Intersection over Union (IoU) threshold of 0.5. IoU is used to measure how well a predicted bounding box overlaps with the ground truth bounding box (see the sketch after this list). A higher AP50 score implies that the model is more precise and can detect objects with high confidence and accuracy.
- mAP: This metric represents the Mean Average Precision score, or the mean of the Average Precision values for different classes of objects. A higher mAP score implies the model is more consistent and can detect objects across different types of classes and instances.
- Iterations per epoch: This metric, calculated for the validation dataset, represents the number of times a model is evaluated on the validation dataset during one epoch. A higher number of iterations per epoch implies the model is validated more frequently, which can help surface overfitting or underfitting earlier and directly impacts the performance of the model.
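IoU itself is straightforward to compute. The sketch below calculates it for two axis-aligned boxes given as (x1, y1, x2, y2) corners; the example coordinates are placeholders.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

predicted = (50, 50, 150, 150)
ground_truth = (60, 60, 160, 160)
overlap = iou(predicted, ground_truth)
# Under an AP50 criterion, this prediction counts as a match only if IoU >= 0.5.
print(f"IoU = {overlap:.2f}, counts toward AP50: {overlap >= 0.5}")
```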
Conclusion
Data and AI metrics play a key role in the development of AI systems. The types of metrics in relation to the computer vision task and use case are constantly evolving to better understand and map the distribution of data and performance of AI systems.
What metrics do you use in your development process? We welcome your feedback on additional metrics that we can incorporate! Sign up to try Intel Geti software yourself here.
About Ria Cheruvu:
Ria Cheruvu is an AI Software Architect and Evangelist at Intel, with a master’s degree in data science from Harvard University. As a technical pathfinder, she is passionate about the importance of open-source communities and enjoys learning about and contributing to disruptive technology spaces.