MLOps Engineer
2 days ago
We are seeking a talented MLOps Engineer to take full ownership of the AI pipelines that power our computer vision models. You'll design and operate a containerized training/inference stack that runs both on a local GPU workstation cluster (multiple workstations with multiple GPUs each) and in Google Cloud. Your mission is to streamline the entire model lifecycle—from data ingestion and feature build, through training, evaluation, packaging, deployment, and monitoring—so researchers and engineers can iterate quickly and ship reliable models to production.
You will build robust orchestration and observability around our pipelines, implement resource-aware scheduling for heterogeneous queues, and lead the rollout of model/experiment tracking and performance analytics. You'll also own the evolution of our documentation to ensure the platform is easy to understand, extend, and support.
Responsibilities
- Own the end-to-end ML pipeline for computer vision: data prep, training, evaluation, model packaging, artifact/version management, deployment, and monitoring (local GPU cluster + GCP).
- Design and maintain containerized workflows for multi-GPU training and distributed workloads (e.g., PyTorch DDP, Ray, or similar).
- Build and operate orchestration (e.g., Airflow/Argo/Kubeflow/Ray Jobs) for scheduled and on-demand pipelines across on-prem and cloud.
- Implement and tune resource allocation strategies based on current and upcoming task queues (GPU/CPU/memory-aware scheduling; preemption/priority; autoscaling).
- Introduce and integrate monitoring/telemetry for:
  - job health and failure analysis (retry, backoff, alerts),
  - data/feature drift and model performance (precision/recall, latency, throughput),
  - infra metrics (GPU utilization, memory, I/O, cost).
- Harden GCP environments (permissions, networks, registries, storage) and optimize for reliability, performance, and cost (spot/managed instance groups, autoscaling).
- Establish model governance: experiment tracking, model registry, promotion gates, rollbacks, and audit trails.
- Standardize CI/CD for ML (data/feature pipelines, model builds, tests, and canary/blue-green rollouts).
- Collaborate with CV researchers/engineers to productionize new models and improve training throughput & inference SLAs.
- Continuously improve documentation: update existing pipeline docs and produce concise runbooks, diagrams, and "how-to" guides.
Requirements
- Hands-on MLOps experience building and running ML pipelines at scale (preferably computer vision) across on-prem GPUs and a public cloud (GCP preferred).
- Strong with Docker and Docker Compose in local and cloud environments; solid understanding of image build optimization and artifact caching.
- GitLab CI/CD expertise (modular templates, YAML optimization, build/test stages for ML, environment promotion).
- Proficiency with Python and Bash for pipeline tooling, glue code, and automation; Terraform for infra-as-code (GCP resources, IAM, networking, storage).
- Experience with orchestration: one or more of Airflow, Argo Workflows, Kubeflow, Ray, or Prefect.
- Experience operating GPU workloads: NVIDIA driver/CUDA stack, container runtimes, device plugins (k8s), multi-GPU training, utilization tuning.
- Observability & monitoring for ML and infra: Prometheus/Grafana, OpenTelemetry/Loki (or similar) for metrics, logs, traces; alerting and SLOs.
- Experiment tracking / model registry with tools like MLflow or Weights & Biases (runs, params, artifacts, metrics, registry/promotion).
- Data versioning & validation: DVC/lakeFS (or similar), Great Expectations/whylogs, schema checks, drift detection.
- Cloud services: GCP (Compute Engine, GKE or Autopilot, Cloud Run, Artifact Registry, Cloud Storage, Pub/Sub). Equivalent AWS/Azure experience is acceptable.
- Security & compliance for ML stacks: secrets management, SBOM/image scanning, least-privilege IAM, network policies, artifact signing.
- Solid understanding of containerized deployment patterns (blue-green/canary), rollout strategies, and rollback safety.
- Kubernetes & Helm in production; NVIDIA GPU Operator, node labeling/taints, and MIG partitioning.
- Ray/Dask for distributed training/inference and hyperparameter sweeps.
- Feature stores (e.g., Feast) and streaming features (Pub/Sub/Kafka).
- Inference serving frameworks: TorchServe, Triton Inference Server, FastAPI + Uvicorn/Gunicorn, or Vertex AI endpoints.
- Batch & real-time pipelines: Apache Beam/Dataflow, Spark, or Flink.
- Cost optimization playbook on GCP: preemptibles/spot, autoscaling policies, right-sizing, per-project budget alerts.
- Testing for ML: pytest fixtures for data/model tests, golden datasets, regression tests, property-based tests.
- Experience with service proxies (Traefik/Nginx), DNS management, certificate management, and SSL/TLS automation.
- Familiarity with Edge/embedded deployments for CV models a plus.
Benefits
We believe great work starts with feeling valued and supported. That's why we are building a thoughtful, competitive package of benefits and perks to help you thrive, professionally and personally, through every step of your career with us. You will be eligible for:
- Salary from 2,500 EUR to 5,500 EUR per month (before taxes)
- A Birthday Gift
After Probationary Period
- Health Insurance
- Health Recovery Days (which can be taken as you need)
- Paid Study Leave
- Funding for the purchase of Vision Glasses after one (1) year of service
Join us in Building a Cleaner, Smarter Future — one quality process improvement at a time.