What It Really Takes to Build an AI System: It’s more complicated than many think
Building an AI system sounds straightforward until it isn't. From sourcing quality training data to handling real-world edge cases, productionizing models, and maintaining them over time, the true cost of building your own is far greater than most teams anticipate. Here's an honest breakdown of what it actually takes.

There is an old saying in software development: the last 20% of the work takes 80% of the time. Nowhere is that more true than in building AI systems.
We live in a world of unprecedented open-source code. Companies like Google and Facebook have released their internal AI frameworks as open source, and there is no shortage of tutorials, frameworks, and blog posts promising that standing up an AI application is just a matter of picking the right model and switching it on.
It is not.
The level of quality and reliability required for true production deployments is consistently underestimated, even by experienced developers and engineering managers. A massive amount of work remains after the prototype is built, and the costs, both financial and in terms of team bandwidth, have a way of compounding in ways that are easy to miss at the outset.
To make this concrete, consider a Traffic Sign Recognition (TSR) system built for an automaker. It is a classic case study in the "build vs. buy" debate, and it illustrates almost every reason why real-world AI is so much harder than it looks on paper.
Why Is Building a Production AI System So Hard?
The gap between a compelling demo and a system that actually works in production is enormous. The difficulty does not come from any single problem. It comes from the accumulation of many smaller problems, each of which takes far more time to solve than anticipated.
Edge Cases Are Everywhere
Data is the single largest consumer of time and money in any AI project. This is not because data is intrinsically difficult to handle. It is because the real world contains an almost infinite number of edge cases, and most teams do not fully appreciate this until they are deep into the project.
During the TSR project described above, the team encountered LED highway signs, which look completely different from standard road signs and create camera capture problems similar to filming a computer screen. Trucks in Europe carry speed limit stickers on their backs that are visually identical to roadside signs, but they indicate the vehicle's own speed limit rather than the road's. Exit speed limit signs at highway intersections can be clearly visible from the main highway, creating false positives. And snow, which is the same white as the background of many traffic signs, can obscure critical visual information entirely.
This complexity is not limited to computer vision. Natural language presents its own version of the same problem, something Limina's own CTO explored in depth in a companion article on regular expressions and real-world language. A phone number alone can be written in dozens of different formats across different countries, with letters, extensions, and shorthand variants that no regex can reliably capture. What looks like a simple pattern-matching problem is, at scale, a deep contextual reasoning problem.
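The regex brittleness described above is easy to demonstrate. The sketch below uses a hypothetical "reasonable" US-style phone pattern against a handful of illustrative sample strings; both the pattern and the examples are assumptions for demonstration, not from any real dataset.

```python
import re

# A seemingly reasonable US-centric phone pattern:
# optional parentheses around the area code, flexible separators.
PHONE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

samples = [
    "(415) 555-2671",    # standard US format: matched
    "415.555.2671",      # dot separators: matched
    "+44 20 7946 0958",  # UK grouping with country code: missed
    "1-800-FLOWERS",     # vanity number with letters: missed
    "555-2671 ext. 23",  # local number with extension: missed
]

for s in samples:
    print(f"{s!r}: {'match' if PHONE.search(s) else 'no match'}")
```

Every pattern broad enough to catch the missed formats starts matching things that are not phone numbers at all, which is exactly why the problem becomes contextual rather than syntactic at scale.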
Why Is Finding a Good Training Dataset So Difficult?
Even when you know what you need to build, getting the data to build it is another challenge entirely. Many high-quality models are published and open-sourced, but the datasets used to train them for production applications are typically kept proprietary. For some data types, like credit card numbers, obtaining realistic training samples is especially difficult. A strong data moat is, in fact, the primary competitive advantage of many AI companies.
Publicly available research datasets tend to fail production requirements on several fronts. They often prohibit commercial use (ImageNet being a well-known example). They are frequently riddled with labeling errors. And they are usually built under controlled conditions that do not reflect the messiness of real-world inputs. Google's OpenImages dataset, which contains 1.7 million images across 600 labeled classes, illustrates the problem clearly: the training split contains fewer than half the labels per image compared to the validation split, which suggests a significant portion of training examples are incompletely labeled.
For the TSR system referenced here, freely available datasets did not allow commercial use, had too few examples to be useful, and only contained images captured under good lighting conditions in a single country. Cars, of course, have a habit of crossing into new jurisdictions with entirely different sign designs and traffic laws.
How Much Does It Cost to Create a Custom AI Dataset?
Building your own dataset is the obvious alternative to relying on public data, but it is neither cheap nor fast. The process begins with defining labels and collecting data, making sure every edge case is represented. From there, the team must construct reliable validation and test sets, perform data hygiene and formatting (a step that sounds mundane but has an outsized impact on model performance), and then label the data.
For most real-world tasks, labeling requires building or heavily customizing annotation tooling, because off-the-shelf tools rarely fit the specific requirements of the task at hand. Data infrastructure must also be set up to manage, version, and serve the dataset over time.
Then come the annotators themselves.
If the data can be shared outside the organization and the task does not require deep domain expertise, it may be possible to outsource annotation to a service like Amazon Mechanical Turk. In practice, however, outsourced annotation tends to be expensive and produces lower-quality labels than many teams expect. For sensitive or domain-specific tasks, internal annotators are often the only viable option, which means hiring, training, and managing an entirely new function within the team.
Annotator training is not trivial. Labels are often ambiguous unless defined with precision, which means building a detailed annotation guide complete with exhaustive examples and a living FAQ section that grows as new edge cases are discovered. Turnover in annotation roles is high, so the training and onboarding cycle tends to repeat more often than teams plan for.
All of this is compounded by the fact that requirements change. In live projects, internal specifications shift, and external requirements, such as data protection regulations and evolving privacy law, change as well. Going back over a dataset multiple times to re-label content under new criteria is common and costly.
The process has also gotten harder over the past several years. The TSR project described here predates GDPR. Building a similar dataset today would require navigating a significantly more complex privacy landscape from the very beginning of data collection.
Building the Model: What Happens After You Have Your Data?
With data in hand, the team can turn to the most visible part of the process: model development. Open-source frameworks like TensorFlow and PyTorch have made it faster than ever to get a model running, but getting it to run well is a different matter.
Even well-regarded open-source implementations frequently contain subtle bugs that affect accuracy in ways that are not immediately obvious. For instance, when building a custom MobileNet V3 implementation, the team found that none of the publicly available implementations, including the keras-applications version, accurately matched the original paper. Getting state-of-the-art models to operate at their full documented capacity consistently requires significant remediation work.
Commercial licensing is another hidden constraint. A large proportion of research paper implementations are not licensed for commercial use, which means teams must either reimplement them from scratch or find alternatives.
Production systems also rarely rely on a single model. They typically combine multiple domain-specific techniques, which means integrating several codebases together and building the test coverage needed to make that integration reliable. Open-source code is notoriously light on tests.
What Does It Take to Deploy an AI System to Production?
Deployment is where many AI projects hit their steepest and most unexpected wall.
If the application runs entirely in the cloud, deployment is relatively straightforward: package the model into a Docker container and run it. But cloud compute for machine learning is expensive. A handful of GPU-equipped instances can easily cost tens of thousands of dollars per year, and running in multiple zones to reduce latency multiplies that cost further.
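The cloud-cost claim above is easy to sanity-check with back-of-the-envelope arithmetic. The figures below are illustrative assumptions (an always-on GPU instance at roughly $2.50 per hour), not quotes from any actual cloud provider.

```python
# Back-of-the-envelope annual cost for always-on GPU inference.
# All figures are illustrative assumptions, not real cloud prices.
hourly_rate = 2.50   # assumed USD/hour for one GPU-equipped instance
instances = 3        # a "handful" of instances
zones = 2            # duplicated across zones to reduce latency

annual_cost = hourly_rate * instances * zones * 24 * 365
print(f"~${annual_cost:,.0f} per year")  # ~$131,400 per year
```

Even a single modest GPU instance running around the clock lands well into five figures annually, before accounting for storage, egress, or traffic spikes.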
For mobile and embedded applications, the complexity increases considerably. Hardware fragmentation, particularly on Android, often forces teams to run models on CPU rather than GPU, which requires model optimization. Deep learning inference packages are significantly less mature and harder to use than the training frameworks teams used to build the model in the first place. A recent example: converting a transformer model to Intel's OpenVINO toolkit required going into OpenVINO's source code directly to fix compatibility issues after Intel's own demo example stopped working with the latest version of PyTorch.
The TSR system referenced here was built for an automotive context that required all code to be written to a 30-year-old C standard, fit within a few megabytes, and use no external libraries because of safety certification requirements. That is an extreme case, but embedded and mobile deployments routinely impose constraints that are not visible until teams are deep into the work.
Beyond the model itself, real-world applications require substantial pre- and post-processing logic, all of which must also be productionized. In the TSR case, a large amount of additional code was required to match detected signs against a navigation map. Porting code to the application language, whether C++, Java, or something else, adds further overhead.
People with deep expertise in model deployment are genuinely difficult to find and expensive to retain.
If your organization is deploying AI for data-heavy regulated use cases, such as clinical document processing in healthcare or sensitive data handling in financial services, the deployment requirements become even more demanding. Privacy-safe processing, regulatory compliance, and auditability must be built into the architecture from day one, not bolted on afterward.
The Build vs. Buy Question: When Does It Make More Sense to Use an Existing AI Solution?
The build vs. buy debate in AI is not just a cost question. It is a question of organizational capacity, risk tolerance, and strategic focus.
Building a production AI system requires a team with genuinely diverse expertise: data scientists, annotation managers, ML engineers with deployment experience, and domain specialists. Demand for these skills remains intense, which means that assembling this team takes time and costs significantly more than many organizations budget for at the outset. Staff turnover introduces additional risk. A system that required many decades of cumulative developer time to reach a high standard of accuracy can become difficult or impossible to maintain if key people leave.
There is also the ongoing cost of maintenance to consider. No AI system is static. Edge cases that were missed during initial data collection will surface in production and require the data collection, annotation, and retraining cycle to begin again. The world itself changes: a chatbot trained before 2019 would have no understanding of COVID-19. Keeping a production AI system accurate and current is a continuous investment, not a one-time project.
For organizations in regulated industries, the stakes are higher still. Pharma and life sciences companies working with clinical trial data, insurance carriers processing claims documents, and contact centers handling sensitive customer information all face compliance requirements that a general-purpose AI system is unlikely to meet out of the box, and that a custom-built system will require substantial ongoing effort to address.
Purchasing a purpose-built solution shifts those costs, risks, and maintenance burdens to a vendor whose entire organization is structured around solving them. For many companies, this is a substantially better use of capital and engineering capacity than building and maintaining the same capabilities in-house.
Ready to skip the build cycle and go straight to production-ready accuracy? Reach out to Limina to see how our data de-identification platform handles the hard parts for you.
How Limina Solves the Problems That Make AI Systems Hard to Build
The challenges described in this article are not hypothetical. They are the exact problems Limina was built to solve.
Limina's context-aware data de-identification platform is built by linguists and trained on carefully annotated data covering more than 50 entity types across 52-plus languages. The platform processes more than 70,000 words per second with better than 99.5% accuracy, a level of performance that took years of data collection, annotation iteration, and model optimization to achieve.
Where generic models treat PII detection as a pattern-matching exercise, Limina's approach understands the context surrounding sensitive information. It recognizes that "I like David Lynch's 1984 rendition of Dune" is more identifying than "I like Game of Thrones," and it handles the full complexity of real-world text, including coreference resolution, nested entities, and language-specific variations, with the kind of nuance that regex-based or lightly trained models cannot reliably deliver.
That level of accuracy does not happen by accident. It is the product of exactly the kind of sustained, expensive, expert-intensive work described throughout this article. For organizations that need accurate, compliant, privacy-safe AI, the question is not whether to invest in that quality. It is whether to build it yourself or work with a team that has already done it.