
Feb 2026 • 11 min read

Building Optimus in Public

Why I built an autonomous homelab SRE, how it works under the hood, and what I learned from real incidents.


When people ask what Optimus is, the short answer is: it is my autonomous homelab SRE.

The longer answer is more interesting. Optimus is the system I built because I got tired of treating operations as a sequence of interruptions. I wanted something that could observe, reason, act, and learn without waiting for me to wake up, log in, and manually stitch context together.

This post is the technical version of that story — why I built it, how the architecture evolved, what contracts keep it simple, and what real outages taught me.

I built Optimus to make reliability feel boring again.

If you're curious about what I ship outside infra automation too, I keep a running summary on my projects page.

Why I built an autonomous homelab SRE

My homelab stopped being a toy a while ago. It grew into a real distributed system: four Proxmox hosts, a Flux-managed Kubernetes cluster, and shared storage over NFS and CephFS.

Once you're at that scale, "I'll notice when it breaks" stops working. Incidents overlap. Symptoms lie. By the time I checked dashboards, the root cause was already buried under second-order failures.

I also noticed a bigger pattern in my own behavior: I was good at fixing incidents, but terrible at preserving the learning. I'd solve the problem, move on, and relearn the same lesson three weeks later.

So the goal became clear:

  1. Make detection fast and consistent.
  2. Encode operational knowledge so it compounds.
  3. Prefer autonomous remediation over human-in-the-loop prompts.
  4. Escalate only when automation reaches a boundary.

That's the origin of Optimus.

The architecture: FastAPI + bash skills + React dashboard

Optimus v2 was my "don't overthink it" rewrite.

At the center is a Python 3.11 + FastAPI control plane. Around it are bash skills that do sensing and action, and a React dashboard that gives me one place to inspect health, incidents, and remediation history.

At a high level:

  1. Bash skills sense system state and report through a uniform status contract.
  2. The FastAPI control plane correlates reports, applies cooldowns, and decides on remediation.
  3. The React dashboard surfaces health, incidents, and remediation history.

I intentionally chose this stack for leverage, not novelty. FastAPI gives me predictable service ergonomics. Bash lets me integrate with literally anything installed on the box. React gives me an operational UI that is fast to iterate.

The skill contract that keeps complexity in check

The most important design decision in Optimus is also the simplest one: every skill returns a first-line status contract.

STATUS:OK
Everything is healthy. Optional details below.

STATUS:PROBLEM
Something is broken right now. Include symptom + impact + likely scope.

STATUS:ACTION
Not broken yet, but requires intervention soon.

That line is what turns a random shell script into a composable building block.

It means I can write a skill in any language, call any tool, and still feed a uniform pipeline. The orchestrator does not care how the skill gathers data. It only cares about the contract.

In practice, this contract gives me three big wins:

  1. Predictable orchestration: cooldowns, routing, and escalation become straightforward.
  2. Cross-domain consistency: storage checks and network checks look the same upstream.
  3. Faster incident triage: severity starts as machine-readable status, then enriches with narrative.
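On the consumer side, the orchestrator needs only a tiny parser for that first line. A minimal sketch in Python; the names `Status` and `parse_skill_output` are mine, not the actual Optimus API:

```python
from enum import Enum

class Status(Enum):
    OK = "OK"
    PROBLEM = "PROBLEM"
    ACTION = "ACTION"

def parse_skill_output(raw: str) -> tuple[Status, str]:
    """Split a skill's output into (status, detail) per the first-line contract.
    Unknown or missing status lines are treated as PROBLEM so that broken
    skills fail loudly rather than silently."""
    lines = raw.strip().splitlines()
    first = lines[0].strip() if lines else ""
    detail = "\n".join(lines[1:]).strip()
    if first.startswith("STATUS:"):
        token = first.removeprefix("STATUS:").strip()
        try:
            return Status(token), detail
        except ValueError:
            pass  # unrecognized token falls through to PROBLEM
    return Status.PROBLEM, raw.strip()
```

The default-to-PROBLEM fallback is the important design choice: a skill that forgets the contract becomes an incident itself instead of a silent gap in coverage.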

A tiny deployment loop still matters

Even with autonomous repair, I still keep critical loops explicit. This was in my earliest post draft and it's still true today.

// Explicit, ordered deployment loop: validate, apply, verify.
const deploy = async () => {
  await validateClusterState();   // refuse to deploy into a degraded cluster
  await applyGitOpsChanges();     // reconcile desired state from Git
  await runPostDeployChecks();    // verify workloads after the change
};

That sequence is boring on purpose. Every autonomous system needs boring, legible boundaries.

I also run kubectl rollout status after each apply so failures surface fast, close to the change that caused them.

The self-healing loop: cooldown -> playbook -> droid -> escalation

When a skill reports STATUS:PROBLEM, Optimus enters a staged response loop.

  1. Cooldown gate suppresses repeat noise from known incidents.
  2. Correlation layer groups related failures and avoids duplicate spam.
  3. Playbook engine executes deterministic runbooks from YAML.
  4. Droid invocation handles ambiguous situations with contextual reasoning.
  5. Escalation triggers when impact or frequency crosses configured thresholds.

I think about this as a reliability funnel: deterministic first, adaptive second, human last.

# pseudo-flow (a sketch, not the literal implementation)
if [[ "$skill_status" == "PROBLEM" ]]; then
  if in_cooldown "$skill" "$target"; then
    suppress_incident
  elif runbook_exists "$skill" "$signature"; then
    execute_runbook
  elif legacy_playbook_matches "$skill_output"; then
    execute_legacy_playbook
  else
    invoke_factory_droid_with_context
  fi

  if (( $(incidents_last_hour "$target") >= 3 )); then
    escalate_urgent "@here"
  fi
fi

This flow is where Optimus went from "alert bot" to "operations teammate." The point isn't just to report. The point is to close loops.

The intelligence layer (v3): anomaly detection, forecasting, correlation

Once basic remediation was stable, I added an intelligence layer that helps answer: is this just noisy, or is this the beginning of something expensive?

1) Anomaly detection

I use adaptive baselines with z-score-style scoring for metrics where static thresholds are brittle. Baselines reset after known infra changes so planned shifts don't register as incidents.
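A minimal version of that scoring, assuming a rolling window of recent samples; the function name and default threshold are mine:

```python
from statistics import mean, pstdev

def is_anomalous(history: list[float], value: float, z_threshold: float = 3.0) -> bool:
    """Flag a sample whose z-score against the baseline window exceeds the threshold.
    With a perfectly flat baseline (zero variance), any deviation is anomalous."""
    if len(history) < 2:
        return False  # not enough baseline yet; stay quiet rather than guess
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold
```

Resetting the baseline after a planned infra change is then just clearing `history`, which is how planned shifts avoid registering as incidents.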

2) Forecasting

Simple linear forecasting has been surprisingly effective for storage and memory pressure. I track slope, confidence, and estimated time-to-threshold, then convert that into STATUS:ACTION before it becomes an outage.

3) Correlation

Correlation is my favorite module because it kills alert fatigue at the source. If five symptoms come from one upstream fault, I want one incident narrative with linked evidence, not five disconnected pages.
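A toy version of that grouping, assuming each incident carries a timestamp, a target, and an optional upstream dependency; the dict schema is mine, not the real incident model:

```python
def correlate(incidents: list[dict], window: float = 300.0) -> list[dict]:
    """Merge incidents that share a root (their upstream dependency, or the
    target itself) and arrive within `window` seconds of the previous member.
    One group becomes one incident narrative instead of N separate pages."""
    groups: list[dict] = []
    for inc in sorted(incidents, key=lambda i: i["ts"]):
        root = inc.get("depends_on") or inc["target"]
        for g in groups:
            if g["root"] == root and inc["ts"] - g["last_ts"] <= window:
                g["members"].append(inc)
                g["last_ts"] = inc["ts"]
                break
        else:
            groups.append({"root": root, "members": [inc], "last_ts": inc["ts"]})
    return groups
```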

4) Change awareness

I attach Git and Flux context to incidents. "What changed recently?" is the first human question in most incidents, so the system should answer it by default.

5) Capacity and SLA awareness

I track scheduling headroom and error-budget burn. This turns reliability from reactive firefighting into an explicit budget conversation.
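Error-budget burn is just arithmetic once you pick an SLO. A sketch, assuming request-counted availability; the function name is mine:

```python
def error_budget_burn(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the window's error budget already consumed.
    slo is the target success ratio, e.g. 0.999 allows 0.1% failures.
    A value above 1.0 means the budget for this window is blown."""
    budget = (1.0 - slo) * total_requests
    if budget == 0:
        return float("inf") if failed_requests else 0.0
    return failed_requests / budget
```

Framed this way, "can I take this risky change window?" becomes a number instead of a gut call.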

None of these modules are individually exotic. The value comes from composition.

How Factory AI droids fit into incident response

Factory droids are not a gimmick in this stack; they are a deliberate escalation layer for ambiguity.

Deterministic runbooks work great when the failure signature is known. But real incidents often include partial failure, misleading symptoms, or cross-domain side effects. That's where I hand context to a droid and let it reason over the system state.

The key is structured context injection. Every invocation includes identity, memory, recent incidents, and operator preferences.

SYSTEM CONTEXT INJECTION

Load and include:
- SOUL.md      # identity, constraints, hard safety rules
- MEMORY.md    # distilled operational learnings
- HEARTBEAT.md # current system snapshot
- last_5_incidents.json
- recent_changes.log

Operator preference:
"Sagar prefers full autonomy — fix things, don't ask."

Required output:
1) diagnosis
2) actions taken (or blocked)
3) post-action verification
4) rollback path if risk increases

This gives droids persistent operational memory without inventing a complicated memory service. The files are plain markdown, git-friendly, and auditable.
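Because the memory is just files, the injection step is a few lines of I/O. A sketch assuming the layout above lives under /opt/optimus; the function name is mine:

```python
import json
from pathlib import Path

def build_droid_context(root: str = "/opt/optimus") -> dict:
    """Gather identity, memory, and recent incident files into one payload.
    Missing files degrade to empty values so a partial setup still works."""
    base = Path(root)
    context: dict = {}
    for name in ("SOUL.md", "MEMORY.md", "HEARTBEAT.md"):
        path = base / name
        context[name] = path.read_text() if path.exists() else ""
    incidents = base / "last_5_incidents.json"
    context["incidents"] = json.loads(incidents.read_text()) if incidents.exists() else []
    changes = base / "recent_changes.log"
    context["recent_changes"] = changes.read_text() if changes.exists() else ""
    return context
```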

Incident story #1: the power outage chain reaction

One night around 23:41Z I lost power across the lab. All four Proxmox hosts dropped and then rebooted together.

Most workloads came back quickly, which looked like success on the surface. But one critical dependency — the NAS — didn't auto-boot. Anything depending on that NFS path entered weird half-alive states.

This was a classic distributed-systems trap: "recovered" at layer A, degraded at layer B, silent damage at layer C.

Optimus did three things that mattered:

  1. Detected host recovery and service degradation as separate events.
  2. Correlated stuck pods to storage dependency, not app-level regressions.
  3. Force-deleted zombie pods once storage recovered, then filed follow-up issues.

Without that correlation, I would've wasted time blaming workloads that were just victims.

Incident story #2: the NIC hang saga

Another memorable one: repeated NIC hangs on pve4 (Intel e1000e), three times in roughly twenty-two hours.

First incident looked transient. Second suggested pattern. Third demanded architecture-level mitigation.

Optimus detected each crash, migrated VMs where possible, cleaned up orphaned workloads, and tracked recurrence frequency. By incident three, escalation rules tripped hard and I shipped a set of permanent mitigations.

The useful part wasn't just automation of each response. It was the accumulated memory that converted "huh, weird" into "this is now a known unstable signature with a canonical fix path."

Incident story #3: CephFS + SQLite corruption

This one hurt because it was subtle and repetitive: an app using SQLite on CephFS corrupted five separate times.

Every isolated incident felt like bad luck. The pattern made the real lesson obvious: this wasn't an app bug, it was a storage/workload mismatch.

The fix was structural: move SQLite-backed workloads off CephFS and onto storage with the locking semantics SQLite actually expects.

That rule now lives in SOUL.md, and droids actively enforce it when proposing remediation. This is exactly what I mean by compounded operational knowledge.

The three design choices I would absolutely repeat

1) Skills as bash scripts

Bash is not trendy, but it is unbeatable for ops glue.

When a check is 30 lines of shell and returns STATUS:*, iteration speed is incredible.

2) Memory as files

SOUL.md, MEMORY.md, and HEARTBEAT.md keep the system grounded across invocations.

I considered fancy vector stores. Plain files won because they are simple, transparent, and version-controlled. During a postmortem, I can diff what the system "believed" before and after an incident.

3) Full autonomy by default

My preference is explicit: if a fix is safe and reversible, execute it.

Human approval is expensive during incidents. Autonomy buys mean time to recovery. Safety rules and rollback checks contain risk.

The self-update mechanism (with a test gate)

Autonomous systems rot if they can't safely update themselves. Optimus has a guarded self-update flow: pull, test, restart; rollback on failure.

#!/usr/bin/env bash
set -euo pipefail

cd /opt/optimus

git fetch origin main
PREV_COMMIT="$(git rev-parse HEAD)"
git reset --hard origin/main

if ! .venv/bin/python -m pytest -q; then
  echo "Tests failed after update; rolling back"
  git reset --hard "$PREV_COMMIT"
  exit 1
fi

systemctl restart optimus
echo "STATUS:OK"
echo "Optimus updated and restarted at $(date -u +%FT%TZ)"

That tiny test gate has saved me from shipping broken automation into the thing responsible for fixing breakage.

What I still don't automate (on purpose)

Even with aggressive autonomy, I keep hard guardrails: anything destructive or irreversible still waits for a human.

Autonomy is only useful if it's trustworthy under stress.

How this changed my day-to-day engineering

Before Optimus, I spent too much attention on "is everything okay right now?"

After Optimus, I spend more attention on "what should be impossible in the future?"

That's the best shift an ops platform can create. It moves you from reactive debugging to system design.

It also changed how I write software generally. I now bias toward explicit contracts, machine-readable outputs, and post-incident memory updates in every project.

For ecosystem details and frontend patterns I'm experimenting with, I still reference the React docs often, but the philosophy comes from operating real infra under real failure.

Where Optimus is going next

I'm focusing on five directions:

  1. Richer causality graphs for multi-layer incidents.
  2. Safer autonomous change windows tied to confidence and blast radius.
  3. Predictive maintenance skills for hardware drift (especially storage and networking).
  4. Better operator narratives that summarize "what changed, what was tried, what worked" in one thread.
  5. Cross-system federation so Optimus can coordinate with adjacent control planes cleanly.

Long term, I want this pattern to generalize: a practical architecture where autonomous agents are accountable, observable, and useful in production-like environments.

If you've built anything similar — even a much smaller version — I highly recommend starting with a strict output contract and a boring remediation loop. You can always add intelligence later. It's much harder to add reliability to a system that started as pure cleverness.

And yes, I'm still building this in public, because operational systems improve faster when the design decisions are explicit and critiqueable.
