
A step-by-step guide to rapidly deploying a full-stack machine learning application
Introduction
We’ve built a fantastic machine learning model. It performs beautifully in notebooks, passes all tests, and promises significant business value.
But a model isn’t truly valuable until it’s in production, serving real users and solving real problems.
In this article, I’ll present a step-by-step guide for moving from development notebooks to a production-ready ML application.
As a practical example, I’ll build and deploy a dynamic pricing system for an online retailer that predicts optimal price points using machine learning models:
- Deep Feedforward Network (DFN),
- LightGBM, and
- Simple Linear Regression,
all hosted on containerized serverless architectures.
What We Build — The Machine Learning Application
Our final goal is to recommend price points based on predictions from the dynamic pricing system.
Then, results are displayed on a client-side user interface (UI) to demonstrate the prediction serving process:
To build this quickly, we first define the scope and architectures of Minimum Viable Models (MVMs).
Defining Minimum Viable Models (MVMs)
Minimum Viable Models (MVMs) are the simplest possible versions of the models that achieve the goal with minimal resource investment.
MVMs must focus on a single, most impactful prediction to solve the core problem, while leveraging simple data access and achieving “good enough” performance.
In our case:
The core problem:
Let us imagine a small online retailer facing a decline in sales due to fierce price competition.
The business owner aims to pinpoint optimal pricing strategies that maximize individual product sales.
The prediction:
To solve this core problem, MVMs solely focus on predicting logged values of sales volumes by product at a specific price point.
Price points are distributed evenly across an estimated price range of the selected product.
The system will then recommend the price with the highest predicted sales volume.
For instance:
Suppose the system estimates that Product A has a price range of $100 to $200.
The MVMs predict logged sales volumes at multiple price points in the range:
- $100 — predicted logged sales: 2.71
- $150 — predicted logged sales: 3.10
- $200 — predicted logged sales: 2.30
The system recommends $150 as the optimal price for Product A because it has the highest predicted sales volume.
Here, directly predicting price points is not optimal because it creates a causality mismatch: a model that predicts price treats the price as an output determined by other factors, including sales, whereas in the true causal relationship, the price directly influences sales.
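As a minimal sketch of this selection logic (the `predict_logged_sales` callable is a hypothetical wrapper around one of the trained models), the system simply takes the price point with the highest prediction:

```python
import numpy as np

def recommend_price(predict_logged_sales, price_min, price_max, n_points=5):
    """Recommend the price point with the highest predicted (logged) sales volume.

    `predict_logged_sales` is a hypothetical callable wrapping one of the
    trained models; it returns the predicted logged sales volume for a price.
    """
    # Distribute candidate prices evenly across the estimated price range.
    price_points = np.linspace(price_min, price_max, n_points)
    predictions = [predict_logged_sales(p) for p in price_points]
    # Because log is monotonic, the argmax over logged sales is also the
    # argmax over actual sales volumes.
    return float(price_points[int(np.argmax(predictions))])
```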
The simple data access:
For demonstration, I’ll use the online retail data from the UC Irvine Machine Learning Repository:
The data is preprocessed and stored in the feature store.
In practice, data can be sourced directly from a server or even a simple Excel file.
The “good enough” performance:
MVMs don’t need to achieve the best performance as their priority is accessibility, not accuracy.
I’ll leverage three metrics to assess if the performance is “good enough”:
- Mean Squared Error (MSE) as a primary metric for logged sales values,
- Root Mean Squared Logarithmic Error (RMSLE) as a secondary metric for actual sales values to assess the magnitude of the deviation from the true sales, and
- Mean Absolute Error (MAE) as a support metric for actual sales values to understand the actual deviation from the true sales in dollars.
RMSLE serves as the main metric for actual sales because MAE cannot distinguish the relative magnitude of an error.
For example, MAE returns the same value (100) for a $100 error whether the true sales are A) $100,000 or B) $200.
But in reality, B is more impactful because the magnitude of the error is disproportionately large to the true sales value.
RMSLE can measure the ratio between predicted and actual sales, indicating that the prediction is off by a multiplicative factor of e^RMSLE(X) from the true sales:
- A: RMSLE(A) = 0.001 implies that the prediction is approximately 1.001 times the actual sales.
- B: RMSLE(B) = 0.68 implies that the prediction is approximately 1.974 times the actual sales.
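As a rough sketch of how I’d compute the three metrics (assuming the logged sales target uses a log1p transform; the values below are illustrative only):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Illustrative values: actual sales volumes and predicted logged sales.
y_true = np.array([120.0, 45.0, 300.0])
y_pred_log = np.array([4.9, 3.7, 5.6])

# Primary metric: MSE on the logged sales values.
mse_log = mean_squared_error(np.log1p(y_true), y_pred_log)

# Back-transform to actual sales for the other two metrics.
y_pred = np.expm1(y_pred_log)

# Secondary metric: RMSLE on actual sales (relative, multiplicative error).
rmsle = np.sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_pred)))

# Support metric: MAE on actual sales (absolute deviation from true sales).
mae = mean_absolute_error(y_true, y_pred)

print(f"MSE (log): {mse_log:.3f}, RMSLE: {rmsle:.3f}, MAE: {mae:.1f}")
```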
Learn More: A Comprehensive Guide on Loss Functions in Machine Learning
The models:
To set performance benchmarks, I’ll train three models:
- Deep Feedforward Network (DFN) (on PyTorch) as the primary model for robust prediction,
- LightGBM (on Scikit-learn) as the secondary model, a backup in case the DFN fails, and
- Linear Regression (on Scikit-learn) as a performance benchmark (I’ll use Elastic Net to add regularization).
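As an illustrative sketch only (not the exact architecture trained here), a minimal PyTorch feedforward network for this regression task could look like:

```python
import torch
import torch.nn as nn

class DFN(nn.Module):
    """A minimal deep feedforward network mapping product/price features
    to a single predicted logged sales volume."""

    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # predicted logged sales volume
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)
```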
That concludes the scope of the MVMs.
Next, I’ll define system architectures of the MVM application.
Defining Architectures of MVMs
The pricing system initially requires only rapid access to pricing recommendations.
I’ll leverage a microservice architecture where an independent service is developed solely for providing price estimations.
This design enables quick horizontal scaling to handle increased traffic and features in the future.
For immediate accessibility, I’ll deploy the application as a single interface on a serverless platform, AWS Lambda, for real-time inference.
This approach enables dynamic adjustments in the future, making it more suitable for the system than batch inference, which serves pre-stored predictions.
The entire system architecture is illustrated in the diagram below:

In this architecture, the application is containerized on Docker to ensure universal accessibility.
Then, the Docker container image is pushed to a container repository (AWS Elastic Container Registry, ECR), from which a serverless function (AWS Lambda) fetches the necessary models, data, and other functions.
Predictions are then served to a React-based frontend application via API Gateway’s REST API endpoints.
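A minimal sketch of the Lambda entry point (the `pricing` module and its `load_models`/`recommend_price` helpers are hypothetical placeholders for the application code):

```python
import json

# Hypothetical module: in the real application this would load the bundled
# models and run the price-point predictions described above.
from pricing import load_models, recommend_price

MODELS = load_models()  # loaded once per container and reused across invocations

def lambda_handler(event, context):
    """Entry point invoked by the API Gateway REST endpoint."""
    body = json.loads(event.get("body") or "{}")
    product_id = body.get("product_id")
    recommendation = recommend_price(MODELS, product_id)
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"product_id": product_id, "recommended_price": recommendation}),
    }
```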
Notably, cloud ecosystems from other major providers such as Google Cloud Platform (GCP) and Microsoft Azure offer comprehensive alternatives to AWS. Building a frontend application or connecting an online feature store is optional.
Prioritizing System Design Techniques
Next, I’ll choose system design techniques to implement.
Although many choices remain manual at this stage, I’ll list the MVM focus for each key component to demonstrate a comprehensive approach.
1. Data Management
Effective data management is the foundation of any ML system.
The process involves data pipeline automation, quality management, feature engineering, and versioning.
1–1) Data Ingestion Pipelines:
Implements automation techniques like full-fledged ETL/ELT frameworks or Kafka/Pulsar for streaming data.
- MVM Focus: Python scripts for loading and engineering data.
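A minimal sketch of such a script, assuming the UCI online retail data has been exported to a local CSV (the `online_retail.csv` path is a placeholder):

```python
import pandas as pd

def load_and_engineer(path: str = "online_retail.csv") -> pd.DataFrame:
    """Load raw retail transactions and derive a simple training table:
    total sales volume per product at each observed unit price."""
    df = pd.read_csv(path, parse_dates=["InvoiceDate"])
    features = (
        df.groupby(["StockCode", "UnitPrice"], as_index=False)
          .agg(sales_volume=("Quantity", "sum"))
    )
    return features
```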
1–2) Data Quality Management:
Proactively catches data drift using anomaly detection, data versioning frameworks, and schema enforcement.
- MVM Focus: Python scripts for data cleaning and validation.
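For example, a lightweight validation step might enforce a basic schema and drop impossible rows before features reach the store (column names assume the same retail dataset):

```python
import pandas as pd

REQUIRED_COLUMNS = {"StockCode", "UnitPrice", "Quantity", "InvoiceDate"}

def clean_and_validate(df: pd.DataFrame) -> pd.DataFrame:
    """Enforce a minimal schema and drop rows that would corrupt the features."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    clean = df.dropna(subset=["UnitPrice", "Quantity"])
    # Remove returns (negative quantities) and free items (zero prices).
    clean = clean[(clean["Quantity"] > 0) & (clean["UnitPrice"] > 0)]
    return clean
```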
1–3) Feature Store:
Stores versioned, reusable features using dedicated platforms like Feast, database systems like Redis, or data warehouses like Snowflake.
- MVM Focus: AWS ElastiCache for Redis. Connect the Lambda function using the standard Redis client library, redis-py.
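A minimal sketch of reading features from ElastiCache inside the Lambda function with redis-py (the endpoint variable and key layout are assumptions):

```python
import json
import os

import redis

# ElastiCache endpoint supplied via an environment variable (assumed name).
r = redis.Redis(host=os.environ["REDIS_HOST"], port=6379, decode_responses=True)

def get_product_features(product_id: str) -> dict:
    """Fetch preprocessed features for a product from the online feature store."""
    raw = r.get(f"features:{product_id}")  # hypothetical key schema
    return json.loads(raw) if raw is not None else {}
```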
1–4) Data Versioning:
Versions and tracks data updates and quality issues.
- MVM Focus: Timestamped files to version raw data.
2. Model Development and Serving
This section includes techniques to deploy trained models.
2–1) Reproducible training environment:
Ensures that models can be retrained consistently anywhere.
- MVM Focus: Docker for containerization and uv for dependency locking and environment management.
2–2) Distributed Training:
Splits data, models, or training processes into multiple workers and GPUs (data, model, pipeline parallelisms).
- MVM Focus: Not needed. Prioritize training on a single node.
2–3) Experiment Tracking:
Logs model metrics, parameters, and code versions using dedicated tools like MLflow, Weights & Biases, and Comet ML.
- MVM Focus: Manual tracking with ad-hoc folder structures.
2–4) Hyperparameter Optimization:
Finds optimal hyperparameters to minimize generalization errors.
- MVM Focus: Runs Grid Search and Bayesian Optimization.
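For instance, the grid-search half could be sketched with scikit-learn’s GridSearchCV over the LightGBM benchmark (the parameter grid is illustrative; a Bayesian optimizer such as Optuna could be swapped in the same way):

```python
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV

def tune_lightgbm(X_train, y_train_log):
    """Grid-search LightGBM hyperparameters against the primary metric
    (MSE on logged sales)."""
    param_grid = {
        "num_leaves": [15, 31, 63],
        "learning_rate": [0.01, 0.05, 0.1],
        "n_estimators": [200, 500],
    }
    search = GridSearchCV(
        LGBMRegressor(),
        param_grid,
        scoring="neg_mean_squared_error",
        cv=5,
    )
    search.fit(X_train, y_train_log)
    return search.best_estimator_, search.best_params_
```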
2–5) Model Versioning:
Secures centralized repositories for registered models.
- MVM Focus: Save models in an S3 bucket with unique, timestamped names.
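A minimal versioning sketch with boto3 (the bucket name and key prefix are placeholders):

```python
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def upload_model(local_path: str, model_name: str, bucket: str = "my-model-bucket") -> str:
    """Upload a trained model artifact under a unique, timestamped S3 key."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    key = f"models/{model_name}/{stamp}/{model_name}.pkl"
    s3.upload_file(local_path, bucket, key)
    return key
```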
3. System Performance
This section includes techniques to improve the overall system performance.
3–1) Scalable Inference:
Handles high volumes of concurrent requests with low latency and high throughput, leveraging techniques like horizontal scaling, load balancing, and orchestration.
- MVM Focus: Not a priority. Focus on a single instance.
3–2) Latency Throughput Optimization:
Balances latency with the number of requests processed per second (throughput), leveraging techniques like model compression, caching, batching, and asynchronous inference.
- MVM Focus: Asynchronous inference using a response stream.
3–3) Model Version Updates:
Ensures reliable and observable deployment of new model versions via blue/green deployments, canary deployments or A/B testing.
- MVM Focus: Not a priority. Focus on direct deployment.
3–4) API Gateway/Endpoint Design:
Provides a single, secure, and managed entry point for clients.
- MVM Focus: AWS API Gateway for authentication, rate limiting, monitoring, and SSL termination.
4. Fault Tolerance
Fault tolerance ensures that the system continues to function despite failures of some of its components.
4–1) Backup Instances:
Runs multiple active instances behind a load balancer (active-active setup) or runs one primary instance with standby (active-passive setup).
- MVM Focus: Not a priority. Focus on a single instance.
4–2) Fallback & Graceful Degradation:
Serves a simpler model or a cached prediction if the primary model fails.
- MVM Focus: Implement graceful degradation (a sketch follows this list):
- Primary serving: DFN,
- Secondary serving: LightGBM,
- Backup serving: Elastic Net,
- System failure or API Gateway timeout: locally stored default values.
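A minimal sketch of this fallback chain (the model objects and their `predict` wrappers are assumptions about the serving code):

```python
import logging

logger = logging.getLogger(__name__)

DEFAULT_LOGGED_SALES = 0.0  # locally stored default served when every model fails

def predict_with_fallback(features, dfn_model, lgbm_model, elastic_net_model):
    """Try each model in priority order and degrade gracefully on failure."""
    candidates = (
        ("DFN", dfn_model),
        ("LightGBM", lgbm_model),
        ("ElasticNet", elastic_net_model),
    )
    for name, model in candidates:
        try:
            return float(model.predict(features)[0])
        except Exception:
            logger.warning("%s prediction failed; falling back to the next model", name)
    return DEFAULT_LOGGED_SALES
```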
4–3) Timeouts & Retries:
Handles transient failures in external calls.
- MVM Focus: Triggers the graceful degradation.
Other advanced fault handling techniques like circuit breakers, redundancy & replication, or idempotency are skipped.
5. Post-Deployment Operation
This section covers operation techniques after the deployment.
5–1) Monitoring & Alerting:
Proactively detects issues in data, model performance, or infrastructure.
- MVM Focus: AWS CloudWatch, live log tails.
5–2) CI/CD for MLOps Pipelines:
Automates the entire ML lifecycle from data ingestion to monitoring setup.
- MVM Focus: Not a priority. Automation comes in once the system is stabilized and frequent iterations are required.
Frontend-Backend Interaction — CAP Theorem
Lastly, I’ll define the node interactions using the CAP theorem to guide the design principle.
The CAP theorem is a system design principle stating that a distributed system can deliver only two of the following three characteristics:
- “C” Consistency: All clients see the same data at the same time.
- “A” Availability: All working nodes return a response for any request.
- “P” Partition Tolerance: The system continues to work despite any communication breakdowns (partitions) between nodes.
In our case:
Nodes refer to:
- API Gateway instances,
- Individual Lambda execution environments, and
- Underlying database instances that store shared mutable state: information exchanged among nodes, such as prediction results, client requests, and other feature data.
And network partitions can be any network failure between the client (React app) and the server like:
- User goes offline (no network),
- Spotty Wi-Fi,
- Server-side network issues, or
- Mobile data dropping in and out.
When a network partition occurs, the client loses connectivity to the server, and the system faces a CAP theorem decision:
Prioritizing Consistency (“CP”):
- Shows an error message like “Please try again later” without displaying any prediction.
- Trade-off: The user experience is interrupted as the app becomes unresponsive.
- In an ML context: Typical for batch inference, where prediction accuracy and data integrity are prioritized over real-time accessibility.
Prioritizing Availability (“AP”):
- Shows a response from a cache or a fallback model.
- Trade-off: The user gets a response, but it is not up-to-date, introducing potential inconsistency that needs to be reconciled later.
- In an ML context: Typical for real-time inference, where responsive feature lookups must be ensured.
I’ll prioritize AP for the system, where backup responses are served when the system fails.