Computer Vision · Biotech · Image Analysis · Production ML

Computer Vision in Biotech: Building Imaging Pipelines That Scale

Eric Garcia, PhD

February 3, 2026 · 8 min read

Biotech imaging presents unique challenges that most CV tutorials don't cover. We're currently building computer vision systems for biological research, and the lessons transfer to any domain dealing with scientific or industrial imaging.

What Makes Biotech CV Different

1. Ground Truth is Expensive

In web-scale CV, you can crowdsource labels. In biotech, labeling requires domain experts (biologists, pathologists) who cost $100+/hour. Every labeled image is precious.

This changes your entire approach:

  • Active learning becomes essential—prioritize labeling the most informative samples
  • Semi-supervised methods earn their keep
  • Transfer learning from adjacent domains matters more
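
The first bullet can be made concrete with uncertainty sampling, the simplest active-learning strategy: spend your labeling budget on the images the current model is least sure about. A minimal numpy sketch, with a toy pool of softmax outputs standing in for real model predictions:

```python
import numpy as np

def select_for_labeling(probs, budget):
    """Rank unlabeled samples by predictive entropy; most uncertain first."""
    # probs: (n_samples, n_classes) softmax outputs from the current model
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[::-1][:budget]

# Toy pool of 4 samples: the near-uniform prediction is the most informative.
pool = np.array([
    [0.98, 0.01, 0.01],   # confident -> low priority
    [0.34, 0.33, 0.33],   # near-uniform -> label this first
    [0.70, 0.20, 0.10],
    [0.05, 0.90, 0.05],
])
print(select_for_labeling(pool, budget=2))  # [1 2]
```

More sophisticated criteria (BALD, core-set selection) exist, but entropy ranking is a strong baseline when every label costs an expert's hour.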

2. Images Are Weird

Biotech images aren't like ImageNet photos:

  • **Multi-channel**: Fluorescence microscopy can have 4+ channels, not RGB
  • **High dynamic range**: 16-bit images are common
  • **Strange artifacts**: Bubbles, focus issues, illumination variations
  • **Variable scale**: Same structure looks different at different magnifications

Pre-trained ImageNet models need careful adaptation. Sometimes you need to train from scratch.

3. Interpretability Is Required

When your model flags a sample as "abnormal," a biologist needs to understand why. Black-box predictions aren't acceptable in most biotech contexts.

This means:

  • Attention maps and saliency are standard outputs
  • Simpler architectures often win over marginally better complex ones
  • Uncertainty quantification is non-negotiable
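
Occlusion sensitivity is one model-agnostic way to produce the saliency maps mentioned above: blank out a patch, re-run the model, and record how much the score drops. A sketch with a hypothetical toy model (real use would pass your trained classifier's score function):

```python
import numpy as np

def occlusion_saliency(model, img, patch=8, baseline=0.0):
    """Slide a patch over the image; the heatmap records how much the
    model's score drops when each region is blanked out."""
    base_score = model(img)
    h, w = img.shape
    heat = np.zeros((h // patch, w // patch), dtype=np.float32)
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = img.copy()
            occluded[i:i + patch, j:j + patch] = baseline
            heat[i // patch, j // patch] = base_score - model(occluded)
    return heat

# Toy "model": score = mean intensity of the top-left quadrant,
# so saliency should concentrate there.
def toy_model(img):
    return float(img[:16, :16].mean())

img = np.ones((32, 32), dtype=np.float32)
heat = occlusion_saliency(toy_model, img)
print(heat)  # nonzero only in the top-left 2x2 patches
```

It is slower than gradient-based saliency but needs no access to model internals, which makes it easy to show a biologist exactly which region drove a flag.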

4. Reproducibility Is Everything

In research contexts, you need to reproduce results exactly. This requires:

  • Deterministic inference (set all random seeds, use deterministic algorithms)
  • Version control for models AND preprocessing code
  • Audit trails for every prediction
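
The seed-setting bullet is worth pinning down in code, because forgetting a single RNG silently breaks reproducibility. A minimal helper (for PyTorch pipelines you would additionally call `torch.manual_seed(seed)` and `torch.use_deterministic_algorithms(True)`):

```python
import os
import random

import numpy as np

def set_determinism(seed=42):
    """Pin every RNG the pipeline touches."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)

set_determinism(42)
a = np.random.rand(3)
set_determinism(42)
b = np.random.rand(3)
print(np.array_equal(a, b))  # True: identical draws after re-seeding
```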

Pipeline Architecture

Here's the general architecture we've found works well:

```
Raw Images → Quality Check → Preprocessing → Inference → Post-processing → Results + Provenance
                  ↓                                                              ↓
            Flagged for                                                   Full audit trail
           re-acquisition                                  (model version, preprocessing params, timestamp)
```

Stage 1: Quality Check

Before running expensive inference, check for:

  • Focus quality (variance of Laplacian is a quick proxy)
  • Illumination issues (histogram analysis)
  • Expected structure present (quick classifier)

Reject bad images early. Don't waste GPU cycles on garbage.
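
The variance-of-Laplacian focus check is a few lines. This is a plain-numpy stand-in for the usual `cv2.Laplacian(img, cv2.CV_64F).var()` idiom, so the pipeline has no OpenCV dependency for a simple gate:

```python
import numpy as np

def focus_score(img):
    """Variance of the Laplacian: low values indicate a blurry image."""
    img = img.astype(np.float64)
    # 4-neighbor discrete Laplacian on the interior pixels
    lap = (-4 * img[1:-1, 1:-1]
           + img[:-2, 1:-1] + img[2:, 1:-1]
           + img[1:-1, :-2] + img[1:-1, 2:])
    return lap.var()

rng = np.random.default_rng(0)
sharp = rng.random((64, 64))      # high-frequency content -> in focus
blurry = np.full((64, 64), 0.5)   # flat image -> no detail
print(focus_score(sharp) > focus_score(blurry))  # True
```

The absolute score depends on magnification and exposure, so calibrate the rejection threshold per assay rather than hard-coding one.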

Stage 2: Preprocessing

Standardize images for consistent model input:

  • Normalize intensity distributions
  • Handle multi-channel appropriately
  • Resize/crop to expected input dimensions
  • Apply augmentation at inference time (test-time augmentation for better predictions)
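
Test-time augmentation in its simplest form averages predictions over flipped views. A sketch with a hypothetical toy model; a real pipeline would pass the trained network's forward function:

```python
import numpy as np

def predict_tta(model, img):
    """Average predictions over the four flip variants of the input."""
    views = [img, np.flipud(img), np.fliplr(img), np.flipud(np.fliplr(img))]
    preds = [model(v) for v in views]
    return np.mean(preds, axis=0)

# Toy model reads one corner pixel, so TTA averages the four corners.
def toy_model(img):
    return float(img[0, 0])

img = np.arange(4, dtype=np.float64).reshape(2, 2)
print(predict_tta(toy_model, img))  # 1.5
```

Flips are safe for most microscopy because cells have no preferred orientation; be more careful with augmentations (e.g. intensity jitter) that interact with the quantities you measure downstream.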

Stage 3: Inference

Run your model(s). Key considerations:

  • Ensemble multiple models for critical applications
  • Output uncertainty estimates (MC dropout, deep ensembles)
  • Generate interpretability outputs alongside predictions
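
For deep ensembles, the disagreement between members is itself the uncertainty estimate. A minimal sketch, with three toy "models" standing in for independently trained networks:

```python
import numpy as np

def ensemble_predict(models, img):
    """Return the mean prediction and the per-class std across members.
    High std flags a sample worth a human look."""
    preds = np.stack([m(img) for m in models])
    return preds.mean(axis=0), preds.std(axis=0)

# Three toy members that disagree on class 0 vs class 1
models = [
    lambda x: np.array([0.9, 0.1]),
    lambda x: np.array([0.6, 0.4]),
    lambda x: np.array([0.3, 0.7]),
]
mean, std = ensemble_predict(models, img=None)
print(mean, std)
```

MC dropout follows the same shape: replace the member list with repeated stochastic forward passes of a single model with dropout left on.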

Stage 4: Post-processing

Transform raw model outputs into biologically meaningful results:

  • Convert segmentation masks to measurements (area, perimeter, intensity)
  • Apply domain-specific rules (minimum size thresholds, morphological constraints)
  • Aggregate cell-level to well-level to experiment-level
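
The mask-to-measurements step can be sketched with `scipy.ndimage`; the 0.65 µm pixel size here is an assumed example value and must come from your microscope's calibration in practice:

```python
import numpy as np
from scipy import ndimage

def mask_to_measurements(mask, intensity, pixel_size_um=0.65):
    """Turn a binary segmentation mask into per-object measurements."""
    labels, n = ndimage.label(mask)  # 4-connected components by default
    rows = []
    for obj in range(1, n + 1):
        region = labels == obj
        rows.append({
            "area_um2": float(region.sum()) * pixel_size_um ** 2,
            "mean_intensity": float(intensity[region].mean()),
        })
    return rows

mask = np.zeros((10, 10), dtype=bool)
mask[1:4, 1:4] = True   # one 3x3 object
mask[6:8, 6:9] = True   # one 2x3 object
intensity = np.full((10, 10), 100.0)
rows = mask_to_measurements(mask, intensity)
print(rows)
```

Applying the domain rules from the list above is then a filter over these rows, e.g. dropping objects below a minimum area.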

Stage 5: Provenance

Every result gets metadata:

  • Which model version made this prediction
  • What preprocessing parameters were used
  • When was this processed
  • What was the raw image hash

This isn't optional overhead—it's a regulatory requirement in many contexts.
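
The metadata above fits in a small record attached to every prediction. A sketch using the standard library; the model version string and parameter dict are hypothetical placeholders:

```python
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(image_bytes, model_version, preproc_params):
    """Build the audit-trail record stored alongside each prediction."""
    return {
        "model_version": model_version,
        "preprocessing": preproc_params,
        "processed_at": datetime.now(timezone.utc).isoformat(),
        "image_sha256": hashlib.sha256(image_bytes).hexdigest(),
    }

rec = provenance_record(
    image_bytes=b"raw-tiff-bytes",           # hash the raw file, pre-preprocessing
    model_version="seg-net-v2.3.1",          # hypothetical version tag
    preproc_params={"norm": "percentile-1-99"},
)
print(json.dumps(rec, indent=2))
```

Hashing the raw image (not the preprocessed one) is the key design choice: it lets you prove later that a result came from a specific acquisition.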

Practical Tips

Start with Classical Methods

Before reaching for deep learning, try classical CV approaches:

  • Otsu thresholding for segmentation
  • Watershed for cell separation
  • Template matching for structured objects

Classical methods are interpretable, fast, and often "good enough." Use deep learning when you've proven classical methods don't work.
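
To show how little code "good enough" can be, here is a minimal numpy implementation of Otsu's method (scikit-image's `threshold_otsu` is the production choice; this just makes the idea concrete):

```python
import numpy as np

def otsu_threshold(img, nbins=256):
    """Pick the threshold that maximizes between-class variance."""
    hist, edges = np.histogram(img, bins=nbins)
    p = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(p)                      # class-0 probability per cutoff
    w1 = 1 - w0
    cum_mu = np.cumsum(p * centers)
    mu0 = cum_mu / np.where(w0 == 0, 1, w0)
    mu1 = (cum_mu[-1] - cum_mu) / np.where(w1 == 0, 1, w1)
    between = w0 * w1 * (mu0 - mu1) ** 2
    return centers[np.argmax(between)]

# Bimodal toy data: dim background around 20, bright cells around 200
rng = np.random.default_rng(1)
img = np.concatenate([rng.normal(20, 5, 5000), rng.normal(200, 10, 1000)])
t = otsu_threshold(img)
print(t)  # lands between the two modes
```

If a histogram split like this separates your cells from background, you do not need a U-Net.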

Invest in Labeling Infrastructure

If you need to label data, build good tooling first:

  • Integrated with your pipeline (show context, not just isolated images)
  • Support for uncertain labels (biologists should be able to say "not sure")
  • Quality control (inter-annotator agreement tracking)

Good labeling infrastructure pays for itself quickly.

Test on Held-Out Experiments

Standard train/test splits don't work well in biology. Images from the same experiment are correlated in ways your model will exploit.

Split by experiment, not by image. If your model works on experiment A's images, will it work on experiment B's images?
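
A group-aware split is a few lines (scikit-learn's `GroupShuffleSplit` does the same job); the key invariant is that no experiment contributes images to both sides:

```python
import numpy as np

def split_by_experiment(experiment_ids, test_frac=0.25, seed=0):
    """Hold out whole experiments, not individual images."""
    rng = np.random.default_rng(seed)
    groups = np.unique(experiment_ids)
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    is_test = np.array([g in test_groups for g in experiment_ids])
    return np.where(~is_test)[0], np.where(is_test)[0]

# 8 images from 4 experiments
exp = np.array(["A", "A", "B", "B", "C", "C", "D", "D"])
train_idx, test_idx = split_by_experiment(exp)
# No experiment appears on both sides of the split
print(set(exp[train_idx]) & set(exp[test_idx]))  # set()
```

Expect your held-out-experiment accuracy to be noticeably lower than a random-split accuracy; the gap is the batch effect your model was exploiting.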

The Path Forward

Biotech CV is a maturing field. The tools are getting better, and transfer learning is reducing the need for massive labeled datasets. But it still requires more domain expertise than applying CV to natural images.

If you're building imaging pipelines for biotech or life sciences, [we'd love to hear about your challenges](/contact).

Eric Garcia, PhD

Founder & Principal Consultant

PhD in Machine Learning from UW. Former Spotify ML engineer. 15+ years building production ML systems.

Learn more about Eric →

Need Help with Production ML?

We help companies build ML systems that actually work.

Learn About Our ML Consulting