Sept. 27, 2022
Does Active Learning Help Speed Up AI Model Development?
(Spoiler alert: the answer is a resounding yes!)
What will you learn from this blog?
- A brief overview of typical AI development workflow
- What is active learning?
- How does it work in Intel® Geti™ AI software?
- Real-life use cases of active learning in Intel Geti software
- Reducing sampling bias with active learning
Typical AI development workflow
In a typical computer vision model development workflow, you start with a large dataset, typically made up of images or video frames, and label it to classify, detect, or segment objects of interest. A neural network is then trained and tested on the labeled data, and once the model is deemed ready based on its performance statistics, it is deployed in production.
While that sounds straightforward, the typical process is in fact laborious and rife with uncertainty. Before beginning, you need to determine how much data is required to get started. There is no easy answer to this question, as it varies with the complexity of the use case. So, to be safe, you typically start with a large dataset, on the order of tens of thousands of data inputs. (There are, of course, exceptions for use cases where you just don’t have that much data.) Next, the entire dataset must be annotated (or labeled), which is a time-consuming process.
Finally, a team of data scientists uses the annotated data to work on model training, testing, and validation. The training, test, and validation samples are selected randomly. Domain experts provide no feedback until a model is ready for deployment, which makes effective collaboration challenging.
The key to improving the productivity of both teams, data scientists and domain experts, is to involve human experts in the model-building loop for a continuous feedback process. Active learning in the Intel® Geti™ computer vision AI software enables just that. Powered by active learning, Intel Geti software keeps the human expert in the loop as the model iteratively learns from a small number of data inputs, speeding up the model development process for enterprise teams.
What is active learning?
Active learning is a scientifically proven method [1] that lets deep learning models choose samples from a dataset in the way that helps them learn the most about the object of interest: the model looks for the most informative data samples. For a computer vision use case, active learning selects the images or video frames that are most useful for model training and presents them to an expert for annotation. This enables the algorithm to learn from a much smaller set of samples and reduces the annotation effort required, resulting in time and cost savings in model development.
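To make "most informative" concrete, here is a minimal sketch of one common selection criterion, predictive entropy. It is an illustration only: the function names and the example probabilities are hypothetical, and nothing here is meant to describe how any particular product computes its scores.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def most_informative(predictions, k):
    """Return the indices of the k predictions with the highest entropy."""
    ranked = sorted(
        range(len(predictions)),
        key=lambda i: predictive_entropy(predictions[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical class probabilities for four unlabeled images (two classes).
predictions = [
    [0.98, 0.02],  # model is confident
    [0.55, 0.45],  # model is unsure
    [0.80, 0.20],  # model is fairly confident
    [0.50, 0.50],  # model is maximally unsure
]

to_annotate = most_informative(predictions, k=2)
print(to_annotate)  # indices of the two images the model is least sure about
```

The images where the model splits its probability most evenly across classes are the ones sent to the expert first, since a label for them is likely to teach the model the most.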
How does active learning work in Intel Geti AI software?
Data scientists need quality annotations, with input from domain experts, during the training process. If they were to follow the typical AI development workflow described earlier, both teams might spend a lot of time waiting for a large number of annotations to be completed, either by themselves or by outsourced annotators. Ideally, they would start with a small set of data and iteratively add the data samples that help the model learn fast.
Additionally, the tools and applications that data scientists use for machine learning model development have a steep learning curve, which makes it hard for domain experts to participate actively and provide feedback during the data labeling and model development phases. Imagine a hospital radiologist or a quality engineer on a manufacturing production line finding time to learn software applications originally developed for data science teams in order to annotate large amounts of data.
Intel Geti software strengthens this collaboration by enabling data scientists and domain experts to work together within a single platform. Within the platform, we employ active learning based on how uncertain a prediction is and how different a sample looks from currently labeled samples. Users start with a small sample of data, often 10 to 20 images, which is enough to train an initial rough model. This model is used to compute active learning scores based on its predictions for a sample of unannotated data. The samples with the highest active learning scores, i.e., the ones that the model is most uncertain about or that look most distinct, are presented to the users first for annotation. The users can look at the predictions made by the initial rough model and accept or correct them. This new information is used in the subsequent iterations of model training, helping the model improve its accuracy much faster, and with fewer data inputs, than the traditional approach to AI development based on random sampling. [1]
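The scoring step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the scoring used in Intel Geti software: uncertainty is approximated by predictive entropy, distinctness by the distance to the nearest labeled sample in a feature space, and the two are simply added. All names, feature vectors, and the `diversity_weight` parameter are hypothetical.

```python
import math


def entropy(probs):
    """Uncertainty term: Shannon entropy of the predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def min_distance(feature, labeled_features):
    """Distinctness term: distance to the closest already-labeled sample."""
    return min(math.dist(feature, lf) for lf in labeled_features)


def active_learning_scores(unlabeled, labeled_features, diversity_weight=1.0):
    """Score each unlabeled sample; a higher score means annotate sooner."""
    return {
        name: entropy(probs) + diversity_weight * min_distance(feat, labeled_features)
        for name, (probs, feat) in unlabeled.items()
    }


# Hypothetical feature vectors of the images labeled so far.
labeled_features = [(0.0, 0.0), (1.0, 0.0)]

# Hypothetical (class probabilities, feature vector) from the rough model.
unlabeled = {
    "img_a": ([0.95, 0.05], (0.1, 0.0)),  # confident, looks like labeled data
    "img_b": ([0.50, 0.50], (0.9, 0.1)),  # uncertain, looks like labeled data
    "img_c": ([0.90, 0.10], (3.0, 4.0)),  # confident, looks very different
}

scores = active_learning_scores(unlabeled, labeled_features)
annotation_queue = sorted(scores, key=scores.get, reverse=True)
print(annotation_queue)
```

In a real system the two terms would normally be normalized to comparable ranges before being combined. After the expert accepts or corrects the predictions for the top-ranked samples, those samples join the labeled set, the model is retrained, and the scores are recomputed for the next iteration.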
Active learning in Intel Geti software in action
Consider the experience of one of our early access customers: Royal Brompton and Harefield hospitals. They needed to train a computer vision model to assist medical experts in diagnosing a rare respiratory disease, primary ciliary dyskinesia (PCD). While machine learning is well suited to identifying the anomalies used in diagnosis, transferring the needed human expertise into algorithmic form has been impossible for medical specialists who are not AI experts. Additionally, medical experts’ time is highly valuable, and annotating the tens of thousands of images that a traditional computer vision development workflow requires for such a complex use case would not have been a good use of it. How did Intel Geti software help?
The active learning feature in Intel Geti software, coupled with smart annotations (similar to the familiar drawing tools in photo editing software), played a key role in accelerating their model development process. For this use case, medical experts at Royal Brompton and Harefield hospitals used both features within Intel Geti software’s intuitive user interface to reduce the annotation effort needed to build a model that assists medical experts in PCD research and potential diagnosis.
Similarly, for Naturalis Biodiversity Center, active learning in Intel Geti software helped massively reduce their annotation efforts as they worked to build a model based on tens of millions of images to monitor insect biodiversity by identifying the population of various insect species. In a typical AI development workflow, human experts would have needed to annotate hundreds of thousands of images randomly sampled across their dataset of tens of millions of images. With active learning, they were able to build an accurate model with a few tens of thousands of images.
Reducing sampling bias with active learning
What if your data is imbalanced? Say, for example, that Naturalis’ dataset of millions of images had only a few tens of images that captured a rare insect. With random sampling on such imbalanced data, there may not be enough samples of the rare classes, and more train-test-validate iterations will be needed for the model to reach the desired metrics. This is often the case in real-world datasets, which tend to have a long-tailed class distribution. Such imbalance can arise from many factors, for example, a change in lighting conditions or a new type of part added to the production line.
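A quick back-of-the-envelope calculation shows why random sampling struggles here. The figures below are purely illustrative assumptions (10 million images, 30 of them showing the rare insect, 100,000 images drawn at random for annotation); they are not actual Naturalis numbers.

```python
def expected_rare(total, rare, sample_size):
    """Expected number of rare images in a uniform random sample."""
    return sample_size * rare / total


def prob_no_rare(total, rare, sample_size):
    """Exact probability that a random sample (drawn without replacement)
    contains none of the rare images."""
    p = 1.0
    for i in range(rare):
        p *= (total - sample_size - i) / (total - i)
    return p


# Illustrative assumptions, not real dataset statistics.
TOTAL_IMAGES = 10_000_000
RARE_IMAGES = 30
SAMPLE_SIZE = 100_000

expected = expected_rare(TOTAL_IMAGES, RARE_IMAGES, SAMPLE_SIZE)
p_none = prob_no_rare(TOTAL_IMAGES, RARE_IMAGES, SAMPLE_SIZE)

print(f"expected rare images in the sample: {expected:.2f}")
print(f"probability the sample has no rare image: {p_none:.2f}")
```

Under these assumptions, annotators would label 100,000 images and still expect to see fewer than one example of the rare insect, with roughly a three-in-four chance of seeing none at all.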
As summarized by Krishnan et al. in their paper, active learning algorithms achieve higher accuracy with a small fraction of labeled images than randomly selected samples do in the case of imbalanced data. [1] Overall, active learning methods reduce sampling bias in situations with imbalanced data. So, in the scenario above, active learning helps the model learn about the rare insect class more accurately than random sampling would.
Conclusions
Active learning algorithms help speed up time to value by reducing the annotation efforts needed to get to desired accuracy levels for production use cases, while also helping reduce the sampling biases in real-world use cases where data is often imbalanced. Active learning in Intel Geti software, coupled with human expertise, helps data scientists, machine learning professionals, and domain experts collaborate easily within the same platform and improve the productivity of AI development efforts.
1. Krishnan, R., Ahuja, N., Sinha, A., Subedar, M., Tickoo, O., & Iyer, R. (2022). Robust Contrastive Active Learning with Feature-guided Query Strategies. arXiv:2109.06873 [cs.LG].