Both industry and research agree that, to succeed with AI development, an organization must be “data-centric”. Data-centricity is defined as “the discipline of systematically engineering the data used to build an AI system”. As AI stakeholders, we need to move beyond tweaking models on a predefined set of labels and adopt a strategy in which data is “systematically engineered”: constantly analyzed, iterated upon, experimented with, and fully operationalized to produce a productive AI model.
As teams move to this “data-centric” approach, they have two options for systematizing their training data workflows.
The first option is to continue with the status quo from a tooling perspective, that is, either a homebuilt solution or a combination of open-source software and custom-built Python scripts. These require constant updates to ensure that they remain truly state-of-the-art.
The second option is to purchase software that aims to empower teams to “systematically engineer” their training data. A burgeoning category of such software has sprung up under names such as Data Engine, Training Data Platform, or our preference, Training Data Ops Software. These tools allow teams to stay at the cutting edge of advancements in data-centric AI but require an upfront investment.
Each option brings its own pros and cons, and in this guide, we’ll work through questions to help you decide where it makes sense to stick with home-built software, and where and when it makes sense to invest in a training data operations solution.
Training data ops software should aim at helping businesses achieve the following goals:
To justify software spend, teams need to ensure that any purchased solution is adopted and used internally; otherwise, the spend is wasted. The first items to consider are the “upstream” processes in the AI development lifecycle, namely:
Without those in place, succeeding in the development of a computer vision product will be impossible, and if success is impossible, investing in software is pointless. The average product requires not only a large amount of training data for the initial training, but a continuous supply of training data to allow for the maintenance of the model and its development should there be changes to outside conditions.
Furthermore, the team requires access to resources with a working knowledge of Python. Whilst AutoML systems such as V7’s Model Training, Google Vertex AI or AWS SageMaker allow business users to train models rapidly, a working knowledge of Python will be useful for building true “Production” AI, either to embed those models into devices or to iterate further upon them.
Once a team has access to data, an idea for automating visual tasks, and a team who can provide the coding infrastructure for the creation of the product, they are ready to maximize the value of their training data, and thus a training data software product might be needed.
Crucially, what is not needed is a large labeling workforce. There are other routes to accurate AI that do not require a large BPO workforce as a prerequisite. Equally, an understanding of end goals is more relevant than deciding the minutiae of algorithm design before a project begins.
Successful data science departments are constantly innovating and looking to establish new projects for R&D purposes. However, many data scientists are drowning in Proof-of-Concept and Pilot projects as business users create new demands, without the ML Engineers being given time to properly execute on proven concepts.
Some teams can already scout for new R&D projects easily, establish their viability through quick and accurate Pilots or POCs, and understand which are most likely to succeed, so that they “bet” their computer vision resources on the right ones whilst maintaining focus. Those teams likely have less need of training data ops software.
However, those teams that struggle with the above may benefit from training data ops software. A good training data ops platform enables R&D by:
The market for Computer Vision talent is very thin, and Computer Vision Engineers are some of the most important people in their organization for creating value.
Moreover, leaders who hire, retain, and efficiently deploy great Computer Vision engineers will win. Leaders who struggle with this will continue to struggle.
Whenever a Computer Vision engineer is involved in tasks that are manual, and do not require their specialist knowledge, they are being wasted, and wasted computer vision engineering time is costly not only in terms of salary, but also in terms of time and delay to potential AI projects coming to market.
Leaders who are considering investing in training data ops software should make a note of how their top team members are being deployed. If they are spending more than a few hours per week on manual tasks, their time is being poorly deployed. Computer Vision engineers should undertake the same exercise and track their own time. If they find that more than 3-4 hours per week are spent on data operations, interacting with labeling teams, or even worse, labeling data themselves, then they are in need of training data ops software. In short, organizations where expensive computer vision talent is being wasted have a critical need for training data operations software. If they waste their talent on menial tasks, not only will they miss their KPIs for this year, but they will also struggle to retain and attract computer vision engineers, and so risk falling behind competitors in innovation.
About 92% of AI failure comes from poor management of training data, and poor practices pertaining to the labeling of training data.
Put simply, any team that waits more than one week to access the training data it needs for training models is causing unnecessary delays to its AI development lifecycle.
Reasons for delays include:
Our recommendation is to use the one-week rule: if the delays caused by data run to weeks or months (as they do for most organizations), then consider training data operations software.
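As a rough illustration, the one-week rule can be checked against a simple log of when training data was requested and when it arrived. The sketch below is only that: the project names, dates, and function names are hypothetical, and it assumes your team records both dates somewhere.

```python
from datetime import date, timedelta

# The threshold from the rule of thumb above: waits longer than one week are a warning sign.
ONE_WEEK = timedelta(days=7)


def data_wait(requested: date, delivered: date) -> timedelta:
    """Time between asking for training data and receiving it."""
    return delivered - requested


def breaches_one_week_rule(requested: date, delivered: date) -> bool:
    """True when the wait for training data exceeded one week."""
    return data_wait(requested, delivered) > ONE_WEEK


# Hypothetical request log: project -> (date requested, date delivered).
requests = {
    "defect-detection": (date(2024, 3, 1), date(2024, 3, 5)),
    "shelf-audit": (date(2024, 3, 1), date(2024, 4, 12)),
}

# Projects whose data took longer than one week to arrive.
late = [
    name
    for name, (requested, delivered) in requests.items()
    if breaches_one_week_rule(requested, delivered)
]
print(late)
```

If most projects end up in the `late` list, the delays are systemic rather than one-off, which is the situation the rule is meant to surface.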
Many AI teams are struggling under the sheer cost of supervised learning and creating the training data for multiple projects.
Data Science teams are consistently trapped in a Catch-22 situation: they know that they need more data for all of their projects, but more data drives costs up sharply, as every additional labeler you contract with requires payment and will usually be less efficient than your previously engaged labeling resources.
BPOs will tell you that adding more team members is the answer for efficient AI development, but this is incorrect. The key to reducing your total labeling cost is as follows:
If your current tooling does not provide the above six functionalities, then you are not using your resources as efficiently as you could be. At this point, consider making use of training data operations software that can provide them. A particularly key piece is the constant observation of labeling behaviour, so that you understand the success of a task on an individual-by-individual basis. This allows you to identify and amplify good practices, and to remove those who are not performing as you may wish.
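To make the idea of per-individual observation concrete, here is a minimal sketch that computes a review pass rate for each labeler. It assumes you can export review outcomes as (labeler, passed) pairs; the labeler names, the sample data, and the 50% threshold are all hypothetical and not taken from any particular platform.

```python
from collections import defaultdict


def pass_rate_by_labeler(reviews):
    """Return the share of reviewed annotations that passed, per labeler.

    `reviews` is an iterable of (labeler_id, passed_review) pairs.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for labeler, ok in reviews:
        totals[labeler] += 1
        if ok:
            passed[labeler] += 1
    return {labeler: passed[labeler] / totals[labeler] for labeler in totals}


# Hypothetical review outcomes exported from a labeling workflow.
reviews = [
    ("ana", True), ("ana", True), ("ana", False),
    ("ben", True), ("ben", False), ("ben", False), ("ben", False),
]

rates = pass_rate_by_labeler(reviews)

# Assumed threshold: labelers passing fewer than half their reviews need attention.
needs_attention = sorted(name for name, rate in rates.items() if rate < 0.5)
print(rates)
print(needs_attention)
```

In practice a pass rate is only one signal; time per task and the type of error made are equally worth tracking before acting on any individual's numbers.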
Data-Centricity demands an understanding of your training data, but also an ability to experiment with data and track metrics from those experiments. Teams mostly experiment with model hyperparameters, thanks to the proliferation of tools like Weights and Biases, but the training data experimentation features of those tools (outside of adjusting Training:Test:Validation splits) are limited.
Teams who cannot confidently assert “I tried Experiment X with my training data, and that produced an improvement/worsening of model performance by Y” are not truly data-centric, and may wish to use training data operations software to develop more accurate AI models.
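The statement above boils down to recording, for every change made to the training data, the metric before and after. A minimal sketch of such a record follows; the class name, field names, and the numbers in the example are hypothetical and purely illustrative, not real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataExperiment:
    """One change to the training data and its effect on a single metric."""

    name: str               # e.g. "exp-07"
    change: str             # what was done to the data
    baseline_metric: float  # metric before the change
    new_metric: float       # metric after retraining on the changed data

    @property
    def delta(self) -> float:
        """Positive when the metric went up, negative when it went down."""
        return self.new_metric - self.baseline_metric

    def summary(self) -> str:
        """Render the 'I tried X and it changed performance by Y' statement."""
        if self.delta == 0:
            return f"{self.name}: {self.change} left the metric unchanged"
        direction = "improved" if self.delta > 0 else "worsened"
        return f"{self.name}: {self.change} {direction} the metric by {abs(self.delta):.3f}"


# Illustrative values only.
exp = DataExperiment(
    name="exp-07",
    change="relabeling 500 ambiguous images",
    baseline_metric=0.81,
    new_metric=0.84,
)
print(exp.summary())
```

This assumes a higher metric is better and that the model and hyperparameters were held fixed between the two runs; otherwise the delta cannot be attributed to the data change.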
Side Note: A side benefit of using a training data operations platform is that not only can you experiment with data and datasets, but you can also visualise model performance across your data and understand through a visual correction layer which models are performing best from a qualitative perspective on your data.
Unfortunately, the ever-changing nature of AI as a cutting-edge field means that static internal tooling rarely keeps up with external demands. Most teams who have spent the past few years developing this tooling now realize that, as AI research develops, their tooling cannot keep up. (One customer told us they spent $3m trying to develop their own training data ops platform and could not manage to produce a working software package!)
For some teams, their demands may be so specific to their own use case that external software may not be able to keep up out-of-the-box. In many of those cases, staying with an internal solution makes the most sense, as custom development for custom features is usually expensive for SaaS providers. For many other use cases, however, it may make more sense to work with external vendors to adapt the project to take advantage of best-in-class tooling, whilst crucially not spending expensive internal resources on maintaining and updating existing platforms.
As above, any time spent by a Machine Learning engineer or Data Scientist on the updating or maintenance of internal infrastructure is time poorly spent, and being able to re-deploy even a fraction of that engineer’s time onto more specialist projects is often a significant ROI for a Computer Vision team.
In this article we’ve examined some of the reasons for and against moving to external tooling; the downside typically is losing some custom developments, and of course expenditure, but the benefits can often provide substantial Return on Investment due to better AI models, more focused Computer Vision engineering time, and lower total cost of ownership for labeling.
If there are any questions about the above, or if you want help in justifying expenditure and creating a business case for external tooling, please reach out to our AI specialists who can help you navigate those questions. Alternatively, reach out to me—matt@V7labs.com—directly and I can help walk you through this process and provide honest advice about the right time to switch.
“Collecting user feedback and using human-in-the-loop methods for quality control are crucial for improving AI models over time and ensuring their reliability and safety. Capturing data on the inputs, outputs, user actions, and corrections can help filter and refine the dataset for fine-tuning and developing secure ML solutions.”