Computer Vision · AWS · MLOps · Open Source

Scaling Computer Vision Workflows on AWS: Lessons from Building SkyNewt

Eric Garcia, PhD


February 5, 2026 · 10 min read

Computer vision workloads have a unique infrastructure problem: they're expensive to run, bursty in nature, and require specialized hardware. After building CV pipelines for biotech applications, we started developing SkyNewt—open-source tooling to make CV deployment on AWS less painful.

The Problem with CV at Scale

When you're processing thousands of images—whether for biotech research, quality inspection, or content moderation—you hit infrastructure walls fast:

  • **GPU costs explode**: Running inference GPUs 24/7 gets expensive quickly
  • **Batch processing is complex**: Orchestrating jobs across spot instances requires careful engineering
  • **Model versioning gets messy**: Which model processed which batch of images?
  • **Results need to be traceable**: In regulated industries, you need audit trails

Most CV tutorials stop at "run inference on a single image." Real production CV systems need much more.

Why We're Building SkyNewt

SkyNewt is our answer to these challenges. It's designed for teams that need to:

  • Process large batches of images cost-effectively
  • Scale up and down based on demand
  • Track which model version produced which results
  • Integrate with existing data pipelines

Core Design Principles

1. Spot Instance First

GPU spot instances are 60-90% cheaper than on-demand. SkyNewt is built around spot instance patterns from the ground up—graceful interruption handling, checkpoint resumption, and automatic requeuing.
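Graceful interruption handling starts with watching for AWS's spot interruption notice, which is published on the EC2 instance metadata endpoint. A minimal sketch of a watcher, assuming IMDSv1 access (the helper names are illustrative, not SkyNewt's actual API):

```python
import json
import urllib.request
from urllib.error import HTTPError, URLError

# EC2 instance metadata path for spot interruption notices.
# Returns 404 until an interruption is scheduled.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def parse_interruption_notice(body: str):
    """Return the scheduled termination time if the body is a valid notice."""
    try:
        notice = json.loads(body)
    except json.JSONDecodeError:
        return None
    if not isinstance(notice, dict):
        return None
    # A real notice looks like:
    # {"action": "terminate", "time": "2026-02-05T12:00:00Z"}
    if notice.get("action") in ("stop", "terminate", "hibernate"):
        return notice.get("time")
    return None


def interruption_imminent(timeout: float = 1.0) -> bool:
    """Poll the metadata service; any HTTP error means no notice yet."""
    try:
        with urllib.request.urlopen(SPOT_ACTION_URL, timeout=timeout) as resp:
            return parse_interruption_notice(resp.read().decode()) is not None
    except (HTTPError, URLError):
        return False
```

A worker loop would call `interruption_imminent()` between batches and, on a notice, flush its checkpoint and requeue the remaining work within the two-minute window.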

2. Separation of Concerns

The orchestration layer (job management, scaling, monitoring) is separate from the inference layer (model loading, prediction). This means you can swap models without touching infrastructure code.
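One way to express that boundary is a narrow interface between the two layers: the orchestrator only knows about `load` and `predict`, so any backend that satisfies the contract can be swapped in. A sketch under those assumptions (interface and class names are hypothetical):

```python
from typing import Any, Protocol, Sequence


class InferenceBackend(Protocol):
    """The only surface the orchestration layer sees."""

    def load(self, model_uri: str) -> None: ...
    def predict(self, batch: Sequence[Any]) -> list: ...


class EchoBackend:
    """Trivial stand-in backend used here to show the contract."""

    def load(self, model_uri: str) -> None:
        self.model_uri = model_uri

    def predict(self, batch: Sequence[Any]) -> list:
        return [f"pred:{item}" for item in batch]


def run_job(backend: InferenceBackend, model_uri: str, images: Sequence[Any]) -> list:
    """Orchestration code: load once, predict on the batch, return results."""
    backend.load(model_uri)
    return backend.predict(images)
```

Swapping a segmentation model for a classifier then means writing a new backend class, with no changes to queueing, scaling, or monitoring code.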

3. Cost Visibility

Every job tracks its compute cost. You know exactly what each batch of images cost to process before you get the AWS bill.
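The arithmetic behind per-job cost tracking is simple: runtime multiplied by the instance's hourly rate, optionally divided by image count. A minimal sketch (the function name and rounding are assumptions, not SkyNewt's actual accounting):

```python
def job_cost(runtime_seconds: float, hourly_rate: float) -> float:
    """Estimate a job's compute cost from its runtime and the instance hourly rate."""
    return round(runtime_seconds / 3600.0 * hourly_rate, 4)


def cost_per_image(runtime_seconds: float, hourly_rate: float, n_images: int) -> float:
    """Break the job cost down to a per-image figure for comparing configurations."""
    return round(job_cost(runtime_seconds, hourly_rate) / n_images, 6)
```

For a real bill you would feed in the actual spot price paid (available from the EC2 spot price history), not the on-demand rate.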

Architecture Overview

```
┌─────────────────────────────────────────────────────┐
│                    SkyNewt Core                     │
├─────────────┬──────────────┬────────────────┬───────┤
│  Job Queue  │ Orchestrator │ Model Registry │Results│
│    (SQS)    │ (ECS/Batch)  │ (S3 + DynamoDB)│ (S3)  │
└─────────────┴──────────────┴────────────────┴───────┘
        ↓              ↓               ↓
 Spot Instances   Model Artifacts   Output Data
 (GPU Workers)      (Versioned)     (Traceable)
```

The key insight: treat CV inference like a data pipeline, not like a web service. Most CV workloads aren't real-time. They're batch jobs that can tolerate latency in exchange for cost savings.

Lessons Learned So Far

1. Checkpoint Everything

Spot instances can be reclaimed with only two minutes' notice. If you're processing 10,000 images and get interrupted at image 9,500, you don't want to start over. Checkpoint after every N images and resume from the last checkpoint.
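The checkpoint-and-resume pattern can be sketched in a few lines: persist a cursor every N items, and start from the persisted cursor on restart. A minimal sketch, assuming a local JSON state file (in practice you'd write the checkpoint to S3 so a replacement instance can pick it up):

```python
import json
from pathlib import Path


def process_with_checkpoints(images, process_fn, state_path, every=100):
    """Process images in order, persisting progress every `every` items.

    On restart with the same state file, already-processed images are skipped.
    """
    state = Path(state_path)
    done = json.loads(state.read_text())["done"] if state.exists() else 0

    for i in range(done, len(images)):
        process_fn(images[i])
        if (i + 1) % every == 0:
            state.write_text(json.dumps({"done": i + 1}))

    state.write_text(json.dumps({"done": len(images)}))
```

If an interruption lands between checkpoints, at most `every` images are reprocessed, which is the trade-off behind the batch-sizing advice below.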

2. Pre-warm Model Loading

Model loading time dominates inference time for many CV models. A large segmentation model can take 30-60 seconds to load. Pre-warm instances with models loaded before sending jobs.
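At minimum, "pre-warming" means paying the load cost once per process rather than once per job. A sketch of a process-level model cache, with double-checked locking so concurrent workers don't load twice (the `get_model` helper is illustrative):

```python
import threading

_model = None
_lock = threading.Lock()


def get_model(loader):
    """Load the model on first call; every later call reuses the cached object."""
    global _model
    if _model is None:
        with _lock:
            if _model is None:  # re-check after acquiring the lock
                _model = loader()
    return _model
```

Calling `get_model(load_segmentation_model)` at worker startup, before any job arrives, moves the 30-60 second load out of the latency path of the first job.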

3. Right-size Your Batch

There's a sweet spot between "too few images per batch" (high overhead) and "too many images per batch" (checkpoint intervals too large). We've found 100-500 images per job works well for most use cases.
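Splitting a large image list into right-sized jobs is a one-liner worth getting right at the edges. A minimal sketch, defaulting to the middle of the 100-500 range described above:

```python
def make_batches(items, batch_size=250):
    """Split `items` into consecutive jobs of at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]
```

The last batch is allowed to be short; padding it out would only waste GPU time on duplicate work.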

4. Monitor GPU Utilization, Not Just Job Completion

Low GPU utilization means you're paying for compute you're not using. Data loading, preprocessing, and result writing can leave GPUs idle. Profile your pipeline end-to-end.
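A cheap way to get this signal is to sample `nvidia-smi` alongside the job and alert when utilization stays low. A sketch of the parsing side, assuming the standard query flags (`--query-gpu=utilization.gpu --format=csv,noheader,nounits`, one integer per GPU per line):

```python
def parse_gpu_utilization(sample: str) -> list:
    """Parse nvidia-smi CSV output into one utilization percentage per GPU."""
    return [int(line.strip()) for line in sample.splitlines() if line.strip()]


def underutilized(sample: str, threshold: int = 50) -> bool:
    """Flag a sample where every GPU is below the threshold (likely I/O-bound)."""
    readings = parse_gpu_utilization(sample)
    return bool(readings) and all(r < threshold for r in readings)
```

Sustained readings under ~50% usually point at the data loader or result writer, not the model, which is why the profiling has to cover the whole pipeline.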

What's Next

SkyNewt is still in active development. We're using it internally for biotech CV projects and refining the API based on real-world usage. The goal is to release it as open source once the core patterns are stable.

If you're building CV pipelines and want to follow along, we'll be posting updates here and eventually on GitHub.

The Bigger Picture

CV infrastructure shouldn't be a competitive moat. The moat should be your models, your data, and your domain expertise—not your ability to wrangle AWS services. SkyNewt is our contribution to making the infrastructure layer table stakes.

If you're building computer vision systems and want help with the infrastructure side, [let's talk](/contact).

Eric Garcia, PhD


Founder & Principal Consultant

PhD in Machine Learning from UW. Former Spotify ML engineer. 15+ years building production ML systems.

Learn more about Eric →

Need Help with Production ML?

We help companies build ML systems that actually work.

Learn About Our ML Consulting