MLOps end-to-end Technical Blueprint
Status | Authors | Coach | DRIs | Owning Stage | Created |
---|---|---|---|---|---|
proposed | a_akgun, fdegier | igor.drozdov | sean_carroll | devops modelops | 2025-01-30 |
This blueprint describes the architecture of GitLab's end-to-end MLOps platform, designed to support the complete machine learning lifecycle from experimentation to production deployment. This initiative supports our SaaS instance and self-managed instances while maintaining our “single application” philosophy.
Summary
GitLab MLOps is an integrated platform that provides end-to-end machine learning lifecycle management capabilities within GitLab’s single application. It extends GitLab’s existing CI/CD and registry capabilities to support ML workflows from experimentation to production and observability.
Motivation
Organizations face several key challenges when operationalizing ML:
- Reproducibility: Data scientists struggle to track experiments and recreate results
- Collaboration: Disconnect between data science, engineering and governance teams slows development
- Deployment: Manual, error-prone processes for moving models to production
- Governance: Difficulty maintaining oversight of model development, deployment and impact
These challenges often result in:
- Extended time-to-production for ML models
- Inconsistent development practices
- Security and compliance risks
- Resource inefficiencies
Goals
- Provide end-to-end ML lifecycle management integrated with existing GitLab DevOps workflows
- Provide a Model Registry - a place to store model versions, runs, metadata and artifacts
- Enable deployment of model versions from the model registry using CI/CD pipelines
- Enable importing models from Vertex AI and Hugging Face into the GitLab model registry
- Offer limited compatibility with the MLflow client for model experiments and the model registry
Non-Goals
- Providing extensive computation resources for model training beyond GPU runners
- Providing a model serving infrastructure
- Implementing feature stores
- Implementing data stores and becoming a dataops platform
- Developing a full-fledged MLflow server by achieving 100% MLflow API compatibility
- Model monitoring and tracing
Proposal
GitLab will provide a comprehensive MLOps platform built on top of existing GitLab infrastructure, leveraging and extending our CI/CD capabilities and using the package registry for artifact storage. The platform will support the full ML lifecycle through dedicated components while maintaining GitLab's single application philosophy.
Design and Implementation Details
Component Architecture
graph TB
  subgraph DevPhase["Development Phase"]
    direction TB
    A1[Experiment Tracking]:::ongoing
    A2[Model Registry]:::ongoing
    A3[GPU Runner Management]:::ongoing
    A4[Code repository]:::completed
  end
  subgraph CiCd["CI/CD Pipeline"]
    B4[Deployment Pipeline]:::new
  end
  A1 --> A2
  DevPhase --> CiCd
  A3 --> A1
  A4 --> A1
  CiCd --> DevPhase
  %% Define styles for different statuses
  classDef completed fill:#a3cfbb,stroke:#178344,color:black
  classDef new fill:#ffdebd,stroke:#ff8c00,color:black
  classDef ongoing fill:#b8d0ff,stroke:#0066cc,color:black
  %% Place the legend at the bottom with right alignment
  subgraph Legend[" "]
    direction LR
    L3[Ongoing]:::ongoing
    L2[New]:::new
    L1[Completed]:::completed
  end
  %% Position the legend at the bottom right
  style Legend fill:none,stroke:none
Diagram Notes
- Code Repository: The Git repository, either remote or local.
- Experiment tracking: Code produces runs, artifacts and metrics; their metadata is stored centrally in Experiment Tracking.
- Model Registry: Uses Package Registry to store artifacts.
- Deployment pipeline: Pipelines are triggered either from the Model Registry or by Git triggers.
Core Components
1. Experiment Tracking (existing feature)
The experiment management system will track ML training runs and their parameters:
- Experiment tracking with metadata storage
- Metric logging and visualization
- Storing artifacts
- Compatibility with MLflow client
- Access control and security policies for model experiments based on existing roles, custom roles and model registry read and write permissions. See Roles and permissions for model registry and experiments
- Data is stored in the code repository: smaller data sets with Git, larger ones with Git LFS.
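The MLflow client compatibility listed above can be sketched as follows. This is a minimal illustration, not a definitive integration: it assumes the project exposes an MLflow-compatible endpoint under `/api/v4/projects/<project_id>/ml/mlflow` and that the access token is passed through the `MLFLOW_TRACKING_TOKEN` environment variable. The host, project ID, experiment name, parameter and metric are placeholders, and the `mlflow` package has to be installed separately.

```python
import os


def gitlab_tracking_uri(gitlab_host: str, project_id: int) -> str:
    """Build the MLflow tracking URI for a GitLab project (assumed endpoint layout)."""
    return f"{gitlab_host.rstrip('/')}/api/v4/projects/{project_id}/ml/mlflow"


def log_training_run(gitlab_host: str, project_id: int, token: str) -> None:
    """Log one illustrative run; all names and values below are placeholders."""
    # Imported lazily so the URI helper works without mlflow installed.
    import mlflow

    # The MLflow client reads the token from this environment variable.
    os.environ["MLFLOW_TRACKING_TOKEN"] = token
    mlflow.set_tracking_uri(gitlab_tracking_uri(gitlab_host, project_id))
    mlflow.set_experiment("churn-baseline")  # hypothetical experiment name

    with mlflow.start_run():
        mlflow.log_param("learning_rate", 0.01)  # placeholder parameter
        mlflow.log_metric("accuracy", 0.93)  # placeholder metric
```

Because compatibility is intentionally limited, a training script written this way should stick to basic logging calls and not rely on the full MLflow API surface.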
2. Model Registry (existing feature)
Central repository for ML model management: Model registry docs.
- Model versioning and tagging (link to docs)
- Model metadata and lineage tracking
- Model approval workflows using GitLab labels for models and versions
- Integration with CI/CD pipelines to allow training and deployment
- Access control and security policies for model registry based on existing roles, custom roles and model registry read and write permissions.
- Compatibility with MLflow client
- Model cards with freeform markdown descriptions
- Governance instruments
- Users can also store large data files in the model registry, for example next to their model version artifacts.
- Integration with GCP Vertex AI model registry
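One way to picture the label-based approval workflow is a gate that a deployment job evaluates before promoting a model version. The sketch below is purely illustrative: `ModelVersion` is a stand-in and not the GitLab data model, the label name `model::approved` is hypothetical, and the version check assumes model versions follow semantic versioning.

```python
import re
from dataclasses import dataclass, field

# Simplified MAJOR.MINOR.PATCH pattern with an optional pre-release suffix.
SEMVER = re.compile(r"^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?$")


@dataclass
class ModelVersion:
    """Illustrative stand-in for a registry entry; not the GitLab data model."""
    name: str
    version: str
    labels: set[str] = field(default_factory=set)


def is_deployable(candidate: ModelVersion, required_label: str = "model::approved") -> bool:
    """A version is deployable when it is well-formed and carries the approval label."""
    return bool(SEMVER.match(candidate.version)) and required_label in candidate.labels
```

Keeping the gate a pure function of version and labels means the same rule can be applied whether a deployment is started from the registry or from a Git trigger.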
3. Connection to GPU resources (existing feature)
Link to GPU runners docs.
- Maintain compatibility with GitLab runner
- Ensure ease of use with GPU runners
4. Model Deployment
Automated model deployment pipeline:
- Container-based deployment
- Multi-variate testing support
- Canary deployments
- Rollback capabilities
- Environment management
- Integration with GCP Vertex AI for deployment
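The canary and rollback capabilities above can be illustrated with a small routing sketch. It is an assumption-laden example rather than the planned implementation: the function names, the doubling schedule and the 1% error threshold are all hypothetical, and a real deployment would delegate traffic splitting to the serving environment.

```python
import hashlib


def route(request_key: str, canary_percent: int) -> str:
    """Return "canary" or "stable" for a request, deterministically per key."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Hash the key into one of 100 buckets so a caller stays on the same version.
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "canary" if bucket < canary_percent else "stable"


def next_canary_percent(current: int, error_rate: float, max_error_rate: float = 0.01) -> int:
    """Widen the rollout while healthy; return 0 (full rollback) otherwise."""
    if error_rate > max_error_rate:
        return 0
    # Start at 5% and double on each healthy step, capped at 100%.
    return min(100, max(current * 2, 5))
```

Hashing a stable key such as a user ID keeps each caller on one model version for the duration of a rollout step, which makes metrics from the two versions easier to compare.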
5. API Clients
- GitLab MLOps client for Python
- Limited MLflow client support: logging of metrics and artifacts; creation of models, versions and runs.
- Command-line (cURL) support with the existing API
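For callers that use neither Python client, a raw HTTP request is the equivalent of the cURL path. The sketch below only builds the request and does not send it. It assumes the MLflow-compatible endpoint lives under `/api/v4/projects/<project_id>/ml/mlflow`, that the MLflow REST route `experiments/get-by-name` is among the supported calls, and that the token is accepted as a bearer token; the host, project ID and experiment name are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request


def experiment_lookup_request(gitlab_host: str, project_id: int, name: str, token: str) -> Request:
    """Build (but do not send) a GET request that looks up an experiment by name."""
    # Assumed base path of the MLflow-compatible API for a project.
    base = f"{gitlab_host.rstrip('/')}/api/v4/projects/{project_id}/ml/mlflow"
    query = urlencode({"experiment_name": name})
    url = f"{base}/api/2.0/mlflow/experiments/get-by-name?{query}"
    # The MLflow client sends its tracking token as a bearer token.
    return Request(url, headers={"Authorization": f"Bearer {token}"}, method="GET")
```

Passing the returned object to `urllib.request.urlopen` would perform the call; the same URL and header can be used directly with cURL.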
Integration Points
- GitLab CI/CD Integration
  - Provide training, evaluation and validation CI/CD templates for ML workflows using GitLab (GPU) runners.
  - Predefined variables for ML operations
  - ML-specific CI/CD stages
- Issue Tracking Integration
  - Model development issues
  - Approval workflows
- GitLab Package registry
  - Used for storage of model artifacts
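A validation template could, for instance, run a small script that compares evaluation metrics against minimum values and fails the job when a check does not pass. The sketch below shows that idea under stated assumptions: the metric names and thresholds are hypothetical, and how metrics reach the script (file, API call, job artifact) is left open.

```python
import sys


def failed_checks(metrics: dict[str, float], minimums: dict[str, float]) -> list[str]:
    """Return a message for every metric that is missing or below its minimum."""
    failures = []
    for name, minimum in minimums.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < minimum:
            failures.append(f"{name}: {value} < {minimum}")
    return failures


def main(metrics: dict[str, float], minimums: dict[str, float]) -> int:
    """Exit code for the CI job: 0 lets the pipeline continue, 1 stops it."""
    failures = failed_checks(metrics, minimums)
    for failure in failures:
        print(f"validation failed: {failure}", file=sys.stderr)
    return 1 if failures else 0
```

A job would end with something like `sys.exit(main(metrics, minimums))`, so that a non-zero exit code keeps later stages, such as deployment, from running.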
Deployment Options
MLOps will support GitLab.com, GitLab Dedicated, and self-managed installations, including air-gapped environments.
Development Guidelines
No additional tooling is needed beyond GDK. You might also need the MLflow client and the GitLab MLOps Python client.
Out of scope
- Full MLflow client compatibility
- LLMOps
- AgentOps
- Model Governance, Security and Compliance
- Container Registry Integration
Conclusion
This technical blueprint provides a framework for implementing a comprehensive MLOps platform within GitLab. The proposed architecture leverages GitLab's existing strengths while adding ML-specific capabilities that enable organizations to effectively manage their ML workflows at scale.