“A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.”
John Gall
“Gallia est omnia divisa in partes tres”
Julius Caesar, Commentarii de Bello Gallico
If you’re going to build a complex system, you should start with a simple system. Seems obvious. And yet, if you’re building machine learning (ML) systems in a high-pressure environment, plenty of forces will push you to act against this sage advice. Students of Latin will remember with awe (or gritted teeth) Julius Caesar’s habit of dividing not only Gaul into three parts, but nearly everything else into lists of three (think: “I came, I saw, I conquered”), a kind of “Gaul’s Law,” if you will. Inspired by a mishearing of Gall’s law (see above) as “Gaul’s Law,” I started breaking my natural language processing (NLP) projects down into three phases: end-to-end heuristics and labeling, some ML components, and systematic ML models.
In the first phase, I build a parameterized end-to-end system with heuristics and I label data. Every component sits behind a parameter that selects its implementation. At first this will feel like a waste of time, because you only have one implementation for every component, but starting with this modularity makes subsequent phases much easier to implement and, more importantly, easier to instrument and evaluate. Label data relevant to each component, write unit tests, and establish accuracy expectations. Labeled data are represented as little burlap sacks because they are precious, like burlap bags of gold.
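As a rough sketch of what “parameterized” can mean here: each stage has a named set of implementations, and a configuration picks one per stage. The registry, the stage names, and the toy heuristics below are illustrative assumptions, not the Contact Information Parser’s actual code.

```python
import re
from typing import Callable, Dict, List

# Hypothetical registry: stage name -> implementation name -> callable.
REGISTRY: Dict[str, Dict[str, Callable]] = {"detect": {}, "tag_title": {}}


def register(stage: str, name: str):
    """Decorator that files an implementation under a stage and a name."""
    def wrap(fn: Callable) -> Callable:
        REGISTRY[stage][name] = fn
        return fn
    return wrap


@register("detect", "heuristic")
def detect_by_blank_lines(text: str) -> List[str]:
    # Toy heuristic: each blank-line-separated block is a contact candidate.
    return [block.strip() for block in re.split(r"\n\s*\n", text) if block.strip()]


@register("tag_title", "heuristic")
def title_by_keyword(candidate: str) -> List[str]:
    # Toy heuristic: a line containing a known title keyword is a title.
    keywords = ("engineer", "director", "manager")
    return [line for line in candidate.splitlines()
            if any(k in line.lower() for k in keywords)]


def run_pipeline(text: str, config: Dict[str, str]) -> List[dict]:
    """Run every stage using whichever implementation the config names."""
    detect = REGISTRY["detect"][config["detect"]]
    tag_title = REGISTRY["tag_title"][config["tag_title"]]
    return [{"candidate": c, "titles": tag_title(c)} for c in detect(text)]


# Only one implementation per stage exists at first; the config is still explicit.
config = {"detect": "heuristic", "tag_title": "heuristic"}
sample = "Jane Doe\nDirector of Sales\nAcme Corp\n\nThanks for the call!"
results = run_pipeline(sample, config)
```

With this shape, adding an ML implementation later means registering a second function under the same stage and changing one value in the config; the rest of the pipeline and its tests stay put.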
In the second phase, I replace the worst performing heuristics with ML models.
In the final phase I replace subsystems (up to and including the entire system) with more complex models.
The foundation of this development process is the data and not the model. When you start with the data, you start with facts, and they will not become wrong and you won’t have to redo them. When you start with the model, you put together the first model much faster, but you run the risk of building the wrong model and needing to redo it, only now with more pressure because you wasted time.
To make this more concrete, I’ll show how I followed Gaul’s Law when building Salesforce’s Contact Information Parser. The Contact Information Parser is a library that detects, extracts, parses, and enhances unstructured contact data to create high quality Salesforce contact objects. Its uses include automatically creating contacts from email signatures and OCR business card data.
End-to-end Heuristics and Labeling
The initial version of the system was built entirely with heuristics. It detected spans of text that might be a contact, validated each contact candidate, and then applied a sequence of heuristics that detected individual contact fields such as name, title, and company. A resolver handled ties when a text span received multiple tags.
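To give a feel for this stage, here is a minimal sketch of field heuristics plus a tie-breaking resolver. The regular expressions, field names, and priority order are made up for illustration; the real library’s rules are not shown in this post.

```python
import re
from typing import Callable, Dict, List

# Each heuristic returns True if the line looks like its field.
# These rules are illustrative assumptions, not the library's own.
HEURISTICS: Dict[str, Callable[[str], bool]] = {
    "email": lambda s: re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", s) is not None,
    "phone": lambda s: re.search(r"\+?\d[\d\s().-]{6,}\d", s) is not None,
    "title": lambda s: re.search(r"\b(director|engineer|manager|vp)\b", s, re.I) is not None,
    "company": lambda s: re.search(r"\b(inc|corp|llc|ltd)\b", s, re.I) is not None,
}

# Assumed tie-break order: earlier fields win when a line gets several tags.
PRIORITY = ["email", "phone", "company", "title"]


def tag_line(line: str) -> List[str]:
    """Return every field whose heuristic fires on this line."""
    return [field for field, rule in HEURISTICS.items() if rule(line)]


def resolve(tags: List[str]) -> str:
    """Pick one tag per line; lines with no tags are marked 'unknown'."""
    for field in PRIORITY:
        if field in tags:
            return field
    return "unknown"


def parse_contact(candidate: str) -> Dict[str, str]:
    """Tag each line of a contact candidate and keep the first hit per field."""
    contact: Dict[str, str] = {}
    for line in candidate.splitlines():
        field = resolve(tag_line(line))
        if field != "unknown":
            contact.setdefault(field, line.strip())
    return contact


signature = "Jane Doe\nEngineering Manager, Acme Corp\njane@acme.example\n+1 (555) 010-9999"
contact = parse_contact(signature)
```

The second line of the signature fires both the title and the company rule, and the resolver has to pick one. That single forced choice is the kind of problem that later grows into layers of exceptions.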
This phase takes the most time and effort, because you need to build all of the scaffolding that holds the interchangeable parts together and label a lot of data, but it is also the phase in which you learn the most. This is where you find out that your input contains a lot of non-ASCII characters, or has had its punctuation stripped, or that what you thought were spaces were actually non-breaking spaces.
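A quick character audit is one cheap way to surface surprises like these early. The sketch below uses only Python’s standard library; the function name and the sample string are hypothetical.

```python
import unicodedata
from collections import Counter


def audit_characters(text: str) -> Counter:
    """Count every non-ASCII character in the text, keyed by its Unicode name."""
    counts: Counter = Counter()
    for ch in text:
        if ord(ch) > 127:
            # Fall back to the code point for characters without a name.
            counts[unicodedata.name(ch, f"U+{ord(ch):04X}")] += 1
    return counts


# Sample with accented letters, two non-breaking spaces, and an en dash.
sample = "Jos\u00e9\u00a0Garc\u00eda\u00a0\u2013 Acme"
report = audit_characters(sample)

for name, count in report.most_common():
    print(f"{count:3d}  {name}")
```

Running something like this over a sample of real input tells you whether your tokenizer and regular expressions need to treat NO-BREAK SPACE as whitespace before you write a single rule that depends on it.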
Your components are easy to understand and the entire system is more or less interpretable. You can still reason about the individual components and the system in general, and you can hand-tune everything. Because you can reason about your model, you can develop intuitions and theories about why it works and what it needs. This is information that you did not have when you started. Imagine debugging a deep-learning model if you hadn’t gone through this process.
There are non-technical benefits to starting with heuristics. They are relatively easy to write, so you can move quickly. The sooner you understand your data, have a working prototype, and get baseline accuracy and performance metrics, the less stressed your manager is going to be and the more time you can spend thinking instead of responding to bureaucratic pressure. Although your boss wants to hear “it’s done” more than anything else, telling your boss “We have data and know where to invest our effort going forward” is better than “I’m using the latest technology and this model is going to be awesome, but I can’t quantify any of this.”
At some point, however, your heuristic models are going to drive you crazy. Rules tend to be high precision and low recall, and as they start to overlap each other, you have to install layers of exceptions and resolvers. At that point, machine learning becomes an easier and more sustainable way to improve performance. It’s time to use some ML components.
Use Some ML Components
Your labeled data allows you to identify which heuristics are the weakest and should be replaced with ML models. The machine-learned models might initially perform worse than the heuristics, but since the pipeline is parameterized, you can keep developing them until they are ready to swap in. Thanks to all the work in the first phase, this phase is actually pretty easy: some of your featurization is already done, because heuristics make for excellent features. Your recall will most likely improve, although precision will probably get worse. Because you labeled the data, you’ll know exactly by how much, and you’ll have a confusion matrix for each model the moment you swap it in.
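The comparison itself needs very little machinery. Here is a minimal sketch of a per-component confusion matrix with precision and recall, using tiny made-up label lists (not real results from the parser) to show the typical trade: the model recovers more titles but also produces a false positive.

```python
from collections import Counter
from typing import List, Tuple


def confusion_matrix(gold: List[str], predicted: List[str]) -> Counter:
    """Count (gold_label, predicted_label) pairs."""
    return Counter(zip(gold, predicted))


def precision_recall(matrix: Counter, label: str) -> Tuple[float, float]:
    """Precision and recall for one label, read off the confusion matrix."""
    true_positives = matrix[(label, label)]
    predicted_as_label = sum(n for (g, p), n in matrix.items() if p == label)
    actually_label = sum(n for (g, p), n in matrix.items() if g == label)
    precision = true_positives / predicted_as_label if predicted_as_label else 0.0
    recall = true_positives / actually_label if actually_label else 0.0
    return precision, recall


# Made-up labels for six spans, purely for illustration.
gold      = ["title", "title", "title", "other", "other", "other"]
heuristic = ["title", "other", "other", "other", "other", "other"]
model     = ["title", "title", "title", "title", "other", "other"]

scores = {
    name: precision_recall(confusion_matrix(gold, predictions), "title")
    for name, predictions in {"heuristic": heuristic, "model": model}.items()
}
```

On this toy data the heuristic has perfect precision but finds only one of three titles, while the model finds all three at the cost of one false positive. Running the same function over the same labeled data for every implementation of a stage keeps the comparison honest.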
For the Contact Information Parser, we swapped out the heuristic contact detector for a conditional random field (CRF) model and the title heuristics module for a logistic regression model. Both used features from their respective heuristic versions. They were the weakest components in the pipeline, and replacing them with models significantly improved performance.
Once you’ve replaced enough components, you can set your parameters to choose the ML components over the heuristic components, and you can look at system and subsystem level performance. The worst part of the Contact Information Parser at this point was the Resolver for the field taggers. We built a recursive function to split/retag and installed multiple layers of rules to resolve spans that persisted with multiple tags. Understanding, maintaining, and improving it became very difficult and unpleasant. We thought that the system would perform better if it had information about all of the spans at once. This brought us to the final phase.
Systematic ML Models
You are now able to make informed decisions about how big of a piece is worth replacing with an ML model, how far away your system is from “good enough,” and what resources are necessary to finish the project to varying degrees of completion. You can take your entire system, with all of its labeled data and performance data, and replace it end-to-end with a gigantic deep learning model, or you can replace smaller subsystems with ML models. For this reason, this stage requires the most judgment. Luckily for you, this is also the phase in which you have the most information about and experience in solving this problem.
In the Contact Information Parser, we replaced the sequence of models and heuristics used to assign text spans to contact fields with a single CRF model, which we called the CRF Field Model. Building the model at this point in our development was simple because of all the work we had already done. We used all of the heuristics as features, added in n-grams, and produced a model that improved performance across all classes. It was easier to understand, and it was easier to improve. Instead of having to think up new rules and exceptions to those rules, we simply labeled more data. Building the CRF model from the very beginning would have been much harder and it is unlikely that we would have had ideas for features that were as good.
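To make “heuristics as features” concrete, here is a small sketch of a featurizer that turns each text span into a dictionary of heuristic outputs plus n-gram indicators, one dictionary per span in the sequence. The rule names are hypothetical, and character trigrams are an assumption; this post does not say which kind of n-grams the CRF Field Model used.

```python
import re
from typing import Callable, Dict, List

# Hypothetical heuristics carried over from the rule-based phase.
HEURISTICS: Dict[str, Callable[[str], bool]] = {
    "has_at_sign": lambda s: "@" in s,
    "has_title_word": lambda s: re.search(r"\b(director|engineer|manager)\b", s, re.I) is not None,
    "has_company_suffix": lambda s: re.search(r"\b(inc|corp|llc)\b", s, re.I) is not None,
}


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """All overlapping character n-grams of the lowercased text."""
    lowered = text.lower()
    return [lowered[i:i + n] for i in range(len(lowered) - n + 1)]


def featurize(span: str) -> Dict[str, float]:
    """One feature dict per span: heuristic outputs plus n-gram indicators."""
    features = {f"rule:{name}": float(rule(span)) for name, rule in HEURISTICS.items()}
    for gram in char_ngrams(span):
        features[f"ngram:{gram}"] = 1.0
    return features


def featurize_sequence(spans: List[str]) -> List[Dict[str, float]]:
    """Featurize every span of a contact so a sequence model sees them together."""
    return [featurize(span) for span in spans]


X = featurize_sequence(["Jane Doe", "Director, Acme Corp"])
```

A list of feature dictionaries per sequence is a common input shape for CRF toolkits, so the rules written in the first phase keep earning their keep: each one becomes a feature the model can weigh instead of a decision the resolver has to arbitrate.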
Practical Consequences
On a final note, you might notice that earlier I used the bizarre phrase “varying degrees of completion.” This brings me to the final, and probably most important, reason why you should follow Gaul’s Law. As early as possible, you had a system that completed the required task. It was probably unsophisticated and made harsh tradeoffs between precision and recall, but it provided end-to-end functionality. Things change quickly in the AI world. You might work in an organization that boasts of being “agile” and only looks two weeks into the future. The odds that you will have enough resources to complete your project to the highest standards are nearly zero. If you plan your work so that nothing is done until everything is done, you are unlikely to ever finish. Following Gaul’s Law allows you to ship something when organizational priorities and resources shift away from your project, and it gives you incremental performance data to help keep the resources around for as long as possible.