Or rather, creating a reusable ML Pipeline initiated by a single config file and five user-defined functions that performs classification, is finetuning-based, is distributed-first, runs on AWS Sagemaker, uses Huggingface Transformers, Accelerate, Datasets & Evaluate, PyTorch, wandb and more.
This is the introductory post in a three-part series. To jump to the other posts, check out Creating a ML Pipeline Part 2: The Data Steps or Creating a ML Pipeline Part 3: Training and Inference.
On the Data & Machine Learning team at VISO Trust, one of our core goals is to provide Document Intelligence to our auditor team. Every document that passes through the system is subject to collection, parsing, reformatting, analysis, reporting and more. Part of that intelligence is automatically determining what type of document has been uploaded into the system. Knowing what type of document has entered the system allows us to perform specialized analysis on that document.
The task of labeling or classifying a thing is a traditional use of machine learning; however, classifying an entire document, which for us can run to 300+ pages, is on the bleeding edge of machine learning research. At the time of this writing, researchers are racing to use the advances in Deep Learning, and specifically in Transformers, to classify documents. In fact, at the outset of this task, I researched the space with keywords like “Document Classification/Intelligence/Representation” and came across nearly 30 different papers that use Deep Learning and were published between 2020 and 2022. For those familiar with the space, these include names like LayoutLM/v2/v3, TiLT/LiLT, SelfDoc, StructuralLM, Longformer/Reformer/Performer/Linformer and many more.
This result convinced me that trying a multitude of these models would be a better use of our time than trying to decide which was the best among them. As such, I decided to pick one and use the experience of fine-tuning it as a proof-of-concept to build a reusable ML pipeline the rest of my team could use. The goal was to reduce the time to perform an experiment from weeks to a day or two. This would allow us to experiment with many of the models quickly to decide which are the best for our use case.
The result of this work was an interface where an experimenter writes a single config file and five user-defined functions, which together kick off data reconciliation, data preparation, training or tuning, and inference testing automatically.
When I set out on that proof-of-concept (pre-ML Pipeline), it took over a month to collect and clean the data, prepare the model, perform inference and get everything working on Sagemaker in a distributed setup. Since building the ML Pipeline, we’ve used it repeatedly to quickly experiment with new models, retrain existing models on new data, and compare the performance of multiple models. A new experiment now takes about half a day to a day on average. This has enabled us to iterate incredibly fast, getting models into production in our Document Intelligence platform quickly.
What follows is a description of the above Pipeline; I hope that it will save you from some of the multi-day pitfalls I encountered building it.
ML Experiment Setup
An important architectural decision we made at the beginning was to keep experiments isolated and easily reproducible. Every time an experiment is performed, it has its own set of raw data, encoded data, docker files, model files, inference test results, etc. This makes it easy to trace a given experiment across repos/S3/metrics tools, and to trace where it came from once it is in production. However, one trade-off worth noting is that training data is copied separately for every experiment; for some orgs this may simply be infeasible and a more centralized solution is necessary. With that said, what follows is the process of creating an experiment.
An experiment is created in an experiments repo and tied to a ticket (e.g. JIRA) like EXP-3333-longformer. This name will follow the experiment across services; for us, all storage occurs on S3, so in the experiment’s bucket, objects will be saved under the EXP-3333-longformer parent directory. Furthermore, in wandb (our tracker), the top level group name will be EXP-3333-longformer.
Next, example stubbed files are copied in and modified to fit the particulars of the experiment. This includes the config file and user-defined function stubs mentioned above. Also included are two docker files: one represents the dependencies required to run the pipeline, the other represents the dependencies required to run the stages that execute on AWS Sagemaker (data preparation, training or tuning, and inference). Both of these docker files are made simple by extending from base docker files maintained in the ML pipeline library; the intent is that they only need to include extra libraries required by the experiment. This follows the convention established by AWS’s Deep Learning Containers (DLCs) and, in fact, our base sagemaker container starts by extending one of these DLCs.
There is an important trade-off here: we use one monolithic container to run three different steps on Sagemaker. We preferred a simpler setup for experimenters (one dockerfile) over having to create a different container per Sagemaker step. The downside is that for a given step, the container will likely contain some unnecessary dependencies, which make it larger. Let’s look at an example to solidify this.
In our base Sagemaker container, we extend:
This gives us pytorch 1.10.2 with cuda 11.3 bindings, transformers 4.17, python 3.8 and ubuntu, all ready to run on the GPU. You can see available DLCs here. We then add wandb. Now when an experimenter goes to extend this image, they only need to worry about any extra dependencies their model might need. For example, a model might depend on detectron2, which is an unlikely dependency among other experiments. So the experimenter would only need to extend the base sagemaker container, install detectron2, and be done worrying about dependencies.
With the base docker containers in place, the files needed for the start of an experiment would look like:
In brief, these files are:
- settings.ini: A single (gitignored) configuration file that takes all settings for every step of the ML pipeline (copied into the dockerfiles)
- sagemaker.Dockerfile: Extends the base training container discussed above and adds any extra model dependencies. In many cases the base container itself will suffice.
- run.Dockerfile: Extends the base run container discussed above and adds any extra run dependencies the experimenter needs. In many cases the base container itself will suffice.
- run.sh: A shell script that builds and runs run.Dockerfile.
- build_and_push.sh: A shell script that builds and pushes sagemaker.Dockerfile to ECR.
- user_defined_funcs.py: Contains the five user defined functions that will be called by the ML pipeline at various stages (copied into the dockerfiles). We will discuss these in detail later.
These files represent the necessary and sufficient requirements for an experimenter to run an experiment on the ML pipeline. As we discuss the ML pipeline, we will examine these files in more detail. Before that discussion, however, let’s look at the interface on S3 and wandb. Assume that we’ve set up and run the experiment as shown above. The resulting directories on S3 will look like:
The run_number will increment with each subsequent run of the experiment. This run number will be replicated in wandb and also prefixed to any deployed endpoint for production so the exact run of the experiment can be traced through training, metrics collection and production. Finally, let’s look at the resulting wandb structure:
I hope that getting a feel for the interface of the experimenter will make it easier to understand the pipeline itself.
The ML pipeline
The ML pipeline will (eventually) expose some generics that specific use cases can extend to modify the pipeline for their purposes. Since it was recently developed in the context of one use case, we will discuss it in that context; however, below I will show what it might look like with multiple use cases:
Let’s focus in on the top-level folders. The environment folder will house the files for building the base containers we spoke of earlier, one for running the framework and one for any code that executes on Sagemaker (preprocessing, training/tuning, inference). These are named using the same conventions as AWS DLCs so it is simple to create multiple versions of them with different dependencies. We will ignore the test folder for the remainder of this blog.
The lib directory houses our implementation of the ML pipeline. Let’s zoom in again on just that directory.
Let’s start with run_framework.py since that will give us a bird’s-eye view of what is going on. The skeleton of run_framework will look like this:
The settings.ini file a user defines for an experiment will be copied into the same dir (BASE_PACKAGE_PATH) inside each docker container and parsed into an object called MLPipelineConfig(). In our case, we chose to use Python Decouple to handle config management. In this config file, the initial settings are RUN_RECONCILIATION/PREPARATION/TRAINING/TUNING/INFERENCE, so the pipeline can run exactly the steps an experimenter is looking for. These values constitute the conditionals in the run_framework skeleton.
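To make the config step concrete, here is a minimal sketch of how the RUN_* flags could be parsed into an MLPipelineConfig object. It is not our actual implementation: it swaps Python Decouple for the standard library's configparser so that it runs without extra dependencies, and the [settings] section name, the field names, the from_ini_text helper and the example values are all assumptions made for illustration.

```python
import configparser
from dataclasses import dataclass

# Hypothetical settings.ini contents; only the RUN_* names come from the post.
EXAMPLE_SETTINGS = """
[settings]
RUN_RECONCILIATION = true
RUN_PREPARATION = true
RUN_TRAINING = true
RUN_TUNING = false
RUN_INFERENCE = false
"""

STEP_FLAGS = (
    "RUN_RECONCILIATION",
    "RUN_PREPARATION",
    "RUN_TRAINING",
    "RUN_TUNING",
    "RUN_INFERENCE",
)


@dataclass(frozen=True)
class MLPipelineConfig:
    """Illustrative config object holding one boolean per pipeline step."""

    run_reconciliation: bool
    run_preparation: bool
    run_training: bool
    run_tuning: bool
    run_inference: bool

    @classmethod
    def from_ini_text(cls, text, section="settings"):
        parser = configparser.ConfigParser()
        parser.read_string(text)
        values = parser[section]
        # A flag missing from the file is treated as "do not run this step".
        return cls(**{
            flag.lower(): values.getboolean(flag, fallback=False)
            for flag in STEP_FLAGS
        })

    def enabled_steps(self):
        """Return the names of the enabled steps, in pipeline order."""
        return [flag for flag in STEP_FLAGS if getattr(self, flag.lower())]


if __name__ == "__main__":
    config = MLPipelineConfig.from_ini_text(EXAMPLE_SETTINGS)
    print(config.enabled_steps())
```

Keeping the flags on a single frozen object means every later step reads the same parsed values instead of re-reading the file.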
Note the importlib line in the run_framework skeleton. This line allows us to import use-case specific functions and pass them into the steps (shown here is just data reconciliation) using an experimenter-set config value for the use case.
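As a rough illustration of that pattern, the sketch below resolves a function from a module path with importlib and then runs only the steps whose flags are enabled. The helper names load_use_case_func and run_enabled_steps are hypothetical, and the standard library's json module stands in for a real use-case package so the example can run anywhere.

```python
import importlib
from typing import Callable, Dict, List


def load_use_case_func(module_path: str, func_name: str) -> Callable:
    """Import module_path at runtime and return the callable named func_name."""
    module = importlib.import_module(module_path)
    func = getattr(module, func_name, None)
    if not callable(func):
        raise AttributeError(f"{module_path} has no callable named {func_name!r}")
    return func


def run_enabled_steps(
    flags: Dict[str, bool], steps: Dict[str, Callable[[], None]]
) -> List[str]:
    """Run each step whose flag is true, in the order the steps are given."""
    executed = []
    for flag, step in steps.items():
        if flags.get(flag, False):
            step()
            executed.append(flag)
    return executed


if __name__ == "__main__":
    # Stand-in for a use-case module path read from the config file.
    dumps = load_use_case_func("json", "dumps")
    flags = {"RUN_RECONCILIATION": True, "RUN_TRAINING": False}
    steps = {
        "RUN_RECONCILIATION": lambda: print(dumps({"step": "reconciliation"})),
        "RUN_TRAINING": lambda: print(dumps({"step": "training"})),
    }
    print(run_enabled_steps(flags, steps))
```

Resolving the module from a config value keeps the framework code free of hard-coded imports for any one use case.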
The moment the config file is parsed, we want to run validation to identify misconfigurations now instead of in the middle of training. Without getting into too much detail on the validation step, here is what the function might look like:
The _validate_funcs function ensures that functions with those definitions exist and that they are not defined as pass (i.e. a user has created them and defined them). The user_defined_funcs.py file above simply defines them as pass, so a user must overwrite these to execute a valid run.
The _validate_run_num function throws an exception if the settings.ini-defined RUN_NUM already exists on s3. This saves us from common pitfalls that could occur an hour into a training run.
We’ve gotten to the point now where we can look at each pipeline step in detail. You can jump to the second and third post via these links: Part Two: The Data Steps, Part Three: Training and Inference.