Model monitoring is the continuous measurement of deployed model behaviour, output quality, and risk signals after launch. It is categorically different from system monitoring. System monitoring — CPU usage, memory, API uptime, error rates — tells you whether the service is running. Model monitoring tells you whether the model is still doing what it was designed to do, at the quality it was designed to do it.
This distinction matters because AI systems fail in a way that traditional software does not. A conventional API either responds or it throws an error. An AI model can respond to every request with no errors, no latency spikes, and no infrastructure alerts — while producing increasingly wrong, biased, or policy-violating outputs. The infrastructure looks healthy. The model is quietly degrading. Without monitoring at the output quality level, you will not know until users complain or business KPIs fall.
Four Signal Categories
A production model monitoring programme covers four signal categories. Most teams start with reliability signals and stop there. That is insufficient.
Reliability Signals
Request volume, error rate, latency (P50, P95, P99), fallback usage rate, timeout rate. These are the signals traditional observability tools already capture. They are necessary. An error rate above baseline or a doubling of P95 latency is an incident worth paging someone for. But reliability signals alone tell you nothing about output quality. Alert on sustained deviations from your established baseline — define that baseline in the first two weeks of production operation.
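As a concrete sketch, the percentile and baseline-deviation checks can be computed directly from logged latencies. The two-times-baseline factor below is illustrative, not a recommendation:

```python
def latency_percentiles(samples_ms):
    """Compute P50/P95/P99 (nearest-rank method) from request latencies in ms."""
    ordered = sorted(samples_ms)

    def pct(p):
        # Nearest-rank percentile: the value at roughly p% of the way
        # through the sorted sample, clamped to valid indices.
        idx = min(len(ordered) - 1, max(0, round(p / 100 * len(ordered)) - 1))
        return ordered[idx]

    return {"p50": pct(50), "p95": pct(95), "p99": pct(99)}


def p95_breached(current, baseline, factor=2.0):
    """Flag the 'P95 latency doubling' condition against the established baseline."""
    return current["p95"] > factor * baseline["p95"]
```

In practice the samples would come from a rolling window of request logs, and the baseline percentiles from the first weeks of production operation.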
Output Quality Signals
Task accuracy on sample evaluations, hallucination rate, format adherence rate, sentiment consistency, policy violation rate. These signals do not come from logs. They require active sampling and evaluation. A representative sample of outputs — typically 1–5% for high-volume systems, 100% for low-volume or high-stakes systems — must be evaluated against defined quality criteria.
LLM-as-judge is the practical method for automated output quality evaluation at scale: use a capable model (GPT-4o or equivalent) to score sampled outputs against a rubric, then track scoring distributions over time. Human sampling — weekly review of a random output sample by someone with domain expertise — remains valuable for catching failure modes that automated scoring misses.
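A minimal sketch of the LLM-as-judge loop. The judge call is injected as a plain function so any model or client library can sit behind it; the rubric wording and the JSON score format are assumptions for illustration, not a prescribed schema:

```python
import json
import statistics
from typing import Callable

# Illustrative rubric; a real one would be specific to the task and policies.
RUBRIC = (
    "Score the assistant output from 1 (unacceptable) to 5 (excellent) for "
    'factual accuracy and format adherence. Reply with JSON: {"score": <int>}.'
)


def judge_sample(outputs: list[str], call_judge: Callable[[str], str]) -> dict:
    """Score each sampled output with a judge model and summarise the distribution.

    `call_judge` wraps whatever judge you use (e.g. a chat-completion call with
    RUBRIC as the system prompt); injecting it keeps this loop testable.
    """
    scores = []
    for text in outputs:
        raw = call_judge(f"{RUBRIC}\n\nOutput to score:\n{text}")
        scores.append(json.loads(raw)["score"])
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "min": min(scores),
        "share_below_3": sum(s < 3 for s in scores) / len(scores),
    }
```

Tracking `mean` and `share_below_3` per day gives the scoring distribution over time that the trend analysis needs.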
Business Impact Signals
The KPIs the model is supposed to move. Booking conversion rate, support resolution rate, cost per handled interaction, escalation rate, lead qualification accuracy. These are the signals that tell you whether monitoring is catching real problems. A model that shows stable output quality scores but a declining business metric is a signal to investigate. Business impact signals require instrumentation at the application level, not the model level.
Risk Signals
Safety guideline violations, sensitive topic detection, unusual output patterns, data access anomalies. For customer-facing AI, policy violation rate is a must-monitor metric: a single policy-violating output, shared widely, can create significant reputational or regulatory risk. For regulated sectors — finance, healthcare, legal — risk signal monitoring is often a compliance requirement, not just a best practice.
Monitoring Infrastructure Options
LangSmith: Tracing and evaluation platform for LLM applications. Captures every prompt, completion, tool call, and chain step. Built-in support for human annotation, LLM-as-judge evaluation, and dataset management. The standard choice for teams building on the LangChain ecosystem. Docs: smith.langchain.com.
Helicone: Lightweight LLM observability. Minimal integration — a proxy layer that captures requests and responses without SDK dependencies. Faster to set up than LangSmith. Good for teams that want basic tracing and cost tracking without framework lock-in.
Arize AI: Full ML observability platform covering drift detection, output quality, reliability, and performance monitoring. Appropriate for teams with multiple models in production, strict SLAs, and dedicated ML engineers to manage the platform. Docs: arize.com/docs.
Weights & Biases (W&B): Strong for experiment tracking that extends into production monitoring. If your team already uses W&B for training, the Weave product extends tracing and evaluation into production deployments with minimal additional tooling.
Datadog LLM Observability: For teams already running infrastructure observability on Datadog who want to extend unified monitoring to include model quality. Less purpose-built for LLM quality evaluation than LangSmith or Arize, but sufficient for teams that prioritise consolidation.
Custom pipeline: Sampling a percentage of model inputs and outputs to a data warehouse (BigQuery, Snowflake, Redshift) and running quality evaluations as a scheduled job is effective and low-dependency. More engineering work upfront, lower ongoing tool cost, no vendor lock-in.
Alerting Strategy
Alert fatigue kills monitoring programmes. Teams that alert on everything stop responding to alerts. Define thresholds for the metrics that matter:
- Error rate above X% sustained for 5 minutes
- P95 latency above Y ms sustained for 10 minutes
- Policy violation rate above Z% in a 1-hour rolling window
- Output quality score below your defined floor for 24 hours
- Business KPI deviation greater than 10% from 30-day baseline
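The thresholds above translate into a simple per-window check. Every number below is a placeholder to be replaced with values derived from your own baselines:

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    # Placeholder values; set these from your measured production baselines.
    max_error_rate: float = 0.02
    max_p95_latency_ms: float = 3000.0
    max_policy_violation_rate: float = 0.001
    min_quality_score: float = 3.5
    max_kpi_deviation: float = 0.10  # vs 30-day baseline


def evaluate_alerts(metrics: dict, t: Thresholds = Thresholds()) -> list[str]:
    """Return the breached thresholds for one evaluation window.
    `metrics` keys mirror the bullet list above; values are window aggregates."""
    alerts = []
    if metrics["error_rate"] > t.max_error_rate:
        alerts.append("error_rate")
    if metrics["p95_latency_ms"] > t.max_p95_latency_ms:
        alerts.append("p95_latency")
    if metrics["policy_violation_rate"] > t.max_policy_violation_rate:
        alerts.append("policy_violation_rate")
    if metrics["quality_score"] < t.min_quality_score:
        alerts.append("quality_score")
    if abs(metrics["kpi"] - metrics["kpi_baseline"]) / metrics["kpi_baseline"] > t.max_kpi_deviation:
        alerts.append("kpi_deviation")
    return alerts
```

The "sustained for N minutes" conditions belong in the alerting layer (requiring several consecutive breached windows before paging), which keeps single noisy windows from firing alerts.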
Route alerts to the owner who has authority to act. A monitoring alert that goes to a shared channel and requires a meeting to assign ownership is a monitoring alert that will not be acted on quickly enough. Define incident ownership before launch.
Evaluation Cadence
Automated sampling with LLM-as-judge should run continuously and produce daily quality trend data. Human review of a random output sample — 20–50 outputs reviewed by someone with domain expertise — should run weekly for high-stakes systems and at minimum monthly for lower-stakes internal tools.
Quarterly, run a formal quality review comparing performance against your launch baseline. Use this review to decide whether retraining or prompt updates are needed, and to update monitoring thresholds as the system matures.
Common Failure Modes
Monitoring only infrastructure, not quality: The most common failure. Infrastructure dashboards look green. Output quality is degrading. Users are complaining. Build output quality monitoring from day one, not as an afterthought.
Alert thresholds set without baselines: You cannot set a meaningful threshold without knowing what normal looks like. Spend the first two weeks of production operation establishing baseline distributions before setting alert thresholds.
No clear incident owner: Monitoring alerts mean nothing if there is no named person who is responsible for responding. Assign model monitoring to a specific owner — not a team, a person — with a defined response SLA.
Treating all outputs as equally important: High-stakes outputs (anything that directly affects a customer transaction, a medical recommendation, or a regulatory decision) deserve 100% sampling and review. Low-stakes outputs can tolerate 1–5% sampling. Calibrate your monitoring intensity to the risk level.
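One way to encode that calibration is a per-tier sampling table, with unknown tiers failing safe to full review. The tiers and rates here are illustrative:

```python
import hashlib

# Illustrative rates per risk tier; calibrate to your own risk assessment.
SAMPLING_RATES = {
    "high": 1.00,    # customer transactions, medical, regulatory: review everything
    "medium": 0.20,
    "low": 0.02,     # low-stakes, high-volume: 1-5% is typically enough
}


def should_review(request_id: str, risk_tier: str) -> bool:
    """Decide whether one output enters the review queue. Unknown tiers fail
    safe to 100% review; sampling is deterministic per request id."""
    rate = SAMPLING_RATES.get(risk_tier, 1.00)
    if rate >= 1.0:
        return True
    digest = hashlib.sha256(request_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % 10_000 < rate * 10_000
```

Tagging each request with a risk tier at the application layer then lets one monitoring pipeline apply the right intensity everywhere.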
References
- LangSmith docs: smith.langchain.com
- Arize AI docs: arize.com/docs
- Evidently AI docs: docs.evidentlyai.com
- Google MLOps guidance: cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
Talk to an AI Implementation Expert
If you want help designing a monitoring programme for your production AI systems, book a working session.
Book a call: https://calendly.com/ai-creation-labs/30-minute-chatgpt-leads-discovery-call
During the call we can cover:
- monitoring architecture for your specific model types and stack
- output quality evaluation design
- alerting threshold strategy and incident ownership
- tooling selection and integration approach