How Auravio works.

A trust-aware clinical AI system. Speech-to-text, translation, multi-signal trust scoring, structured summarization, and FHIR-ready extraction, built on the premise that knowing when you're wrong matters more than sounding fluent.

The Output

{
  "trust_score": 0.87,
  "risk_band": "low",
  "flags": [
    {"code": "untranslated_term", "message": "...", "severity": "medium"}
  ],
  "recommendation": "proceed",
  "explanation": "Translation is consistent with source; one untranslated term noted."
}

The Pipeline

Auravio combines speech recognition, medical translation, trust analysis, and structured outputs so cross-language care is safer, clearer, and easier to document.

Speech-to-text

Audio from the speaker is transcribed in the source language with timestamps and segment-level confidence.

Translation

The transcript is translated into the target language by a purpose-prompted medical-translation model. The translation model can flag terms it couldn't confidently render. These are passed forward as signals to the trust layer rather than silently corrected.

Trust scoring

The translation is evaluated against the source across multiple independent signals: some deterministic, some model-based. Each signal produces a normalized score and, where relevant, a discrete flag.

Banding and recommendation

Scores and flags combine into a risk band (green / yellow / red) and a concrete recommendation (proceed, verify, clarify, or escalate to a human interpreter).

Summarization and extraction

At session end, Auravio produces a clinician-facing summary, a patient-friendly summary, and structured extractions (symptoms, medications, allergies, onset, etc.) mapped toward FHIR resources.

The Trust Layer

Auravio produces an output, a trust score, a list of specific flags, and a recommendation. The design starts from a claim about how clinical errors actually happen: not from globally bad translations, but from one precise, localized failure in an otherwise fluent sentence.

Auravio deliberately combines seven signals: five deterministic rules run synchronously, then two asynchronous signals run in parallel.

Medication name integrity

A dictionary of 263 medication names (English, Spanish, Italian, plus 18 synonym groups like acetaminophen ↔ paracetamol ↔ tylenol) is checked against both source and translation. If a medication appears in the source and is missing from the translation, a high-severity flag fires. Weight in the composite score: 0.25, the highest of any signal.
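A minimal sketch of how such a check could look, assuming a tiny stand-in dictionary. The entries, the `missing_medication` flag code, and the function names are illustrative assumptions, not Auravio's actual data or API.

```python
import re

# Hypothetical miniature dictionary: each set is one synonym group.
# The dictionary described above holds 263 names; these entries are illustrative.
SYNONYM_GROUPS = [
    {"acetaminophen", "paracetamol", "tylenol", "paracetamolo"},
    {"ibuprofen", "ibuprofeno", "ibuprofene"},
    {"metformin", "metformina"},
]

# Map every surface form to the index of its synonym group.
NAME_TO_GROUP = {name: i for i, group in enumerate(SYNONYM_GROUPS) for name in group}


def medication_groups(text: str) -> set[int]:
    """Return the synonym-group ids of every known medication mentioned in text."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of Unicode letters
    return {NAME_TO_GROUP[t] for t in tokens if t in NAME_TO_GROUP}


def missing_medications(source: str, translation: str) -> list[dict]:
    """Flag each medication group present in the source but absent from the translation."""
    missing = medication_groups(source) - medication_groups(translation)
    return [
        {"code": "missing_medication", "severity": "high", "group": sorted(SYNONYM_GROUPS[i])}
        for i in sorted(missing)
    ]
```

Comparing group ids rather than raw strings is what lets "paracetamol" in the source match "Tylenol" in the translation without raising a flag.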

Numeric and dosage fidelity

Frequency phrases ("dos veces al día", "twice daily") are canonicalized to a shared representation first, then raw numbers are compared separately. A frequency phrase or a number that appears in the source but is missing from the translation triggers a high-severity flag. This catches dosage errors that slip past fluency-focused translation.
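The two-step comparison can be sketched as follows. The pattern table, the canonical tokens, and the flag codes are assumptions made for illustration; a production table would need far broader coverage, and the number handling here is deliberately naive.

```python
import re

# Hypothetical canonicalization table: phrase pattern -> shared token.
FREQUENCY_PATTERNS = {
    r"\bdos veces al d[ií]a\b": "FREQ_2_PER_DAY",
    r"\btwice (?:daily|a day)\b": "FREQ_2_PER_DAY",
    r"\buna vez al d[ií]a\b": "FREQ_1_PER_DAY",
    r"\bonce (?:daily|a day)\b": "FREQ_1_PER_DAY",
}


def extract_frequencies(text: str) -> tuple[set[str], str]:
    """Return canonical frequency tokens and the text with those phrases removed."""
    found = set()
    remaining = text.lower()
    for pattern, token in FREQUENCY_PATTERNS.items():
        remaining, count = re.subn(pattern, " ", remaining)
        if count:
            found.add(token)
    return found, remaining


def extract_numbers(text: str) -> set[str]:
    """Collect raw numerals, normalizing a decimal comma to a decimal point."""
    return {n.replace(",", ".") for n in re.findall(r"\d+(?:[.,]\d+)?", text)}


def dosage_flags(source: str, translation: str) -> list[dict]:
    """Compare frequency phrases first, then the numbers left in the remaining text."""
    src_freq, src_rest = extract_frequencies(source)
    tgt_freq, tgt_rest = extract_frequencies(translation)
    flags = []
    if src_freq - tgt_freq:
        flags.append({"code": "missing_frequency", "severity": "high"})
    if extract_numbers(src_rest) - extract_numbers(tgt_rest):
        flags.append({"code": "missing_number", "severity": "high"})
    return flags
```

Removing frequency phrases before collecting numbers keeps a written-out "dos" or "twice" from being confused with a dose, which is one plausible reason to run the two comparisons separately.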

Negation preservation

Negation markers are counted per language, with Spanish double-negation ("no tiene nada" = one semantic negation, not two) handled as a special case. Any mismatch of one or more is a high-severity flag. Clinically, dropped negation is one of the most dangerous failure modes in medical translation.
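As a rough illustration, negation counting with the Spanish special case might look like this. The marker lists and the concord pattern are simplified assumptions, not the real per-language rules.

```python
import re

# Illustrative marker lists; the actual per-language lists are not given in the text.
NEGATION_MARKERS = {
    "en": r"\b(?:no|not|never|none|nothing|without)\b|n't\b",
    "es": r"\b(?:no|nunca|nada|nadie|ningún|ninguna|ninguno|sin|tampoco)\b",
}

# Spanish negative concord: "no ... nada/nunca/nadie" expresses a single negation.
# Allows one to three words between "no" and the second marker (an assumption).
ES_CONCORD = re.compile(
    r"\bno\b(?:\s+\w+){1,3}?\s+(?:nada|nunca|nadie|ningún|ninguna|ninguno|tampoco)\b"
)


def count_negations(text: str, lang: str) -> int:
    """Count semantic negations, collapsing Spanish double negation into one."""
    lowered = text.lower()
    count = len(re.findall(NEGATION_MARKERS[lang], lowered))
    if lang == "es":
        count -= len(ES_CONCORD.findall(lowered))
    return count


def negation_flags(source: str, src_lang: str, translation: str, tgt_lang: str) -> list[dict]:
    """Raise a high-severity flag when the negation counts differ."""
    src_n = count_negations(source, src_lang)
    tgt_n = count_negations(translation, tgt_lang)
    if src_n != tgt_n:
        return [{"code": "negation_mismatch", "severity": "high",
                 "source_count": src_n, "translation_count": tgt_n}]
    return []
```

With these rules, "No tiene nada" counts as one negation and matches "She doesn't have anything", while "No tiene alergias conocidas" rendered as "She has known allergies" raises the flag.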

Translation length ratio

Translations under 50% or over 200% of source length raise a medium-severity flag. It is a blunt signal, but it reliably catches truncation and fabrication.
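A sketch of the ratio check, assuming length is measured in characters (the text does not say whether characters or tokens are used) and using a hypothetical `length_ratio` flag code:

```python
def length_ratio_flag(source: str, translation: str) -> list[dict]:
    """Flag translations shorter than 50% or longer than 200% of the source."""
    src_len = len(source.strip())
    if src_len == 0:
        return []  # nothing to compare against
    ratio = len(translation.strip()) / src_len
    if ratio < 0.5 or ratio > 2.0:
        return [{"code": "length_ratio", "severity": "medium", "ratio": round(ratio, 2)}]
    return []
```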

Untranslated term count

If the translation model flagged terms it couldn't confidently render, each produces a medium-severity flag.

Back-translation similarity

The translation is sent back through the same model in reverse (target → source) and compared to the original using character-level similarity. This is a deliberate simplification: character-level similarity will under-score paraphrases that are semantically equivalent, but it catches significant drift reliably.
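The comparison step can be sketched with Python's standard `difflib`. The text specifies only that the similarity is character-level, so `SequenceMatcher` is a stand-in chosen for illustration, and the back-translated text is taken as an input rather than produced by a model call.

```python
from difflib import SequenceMatcher


def back_translation_similarity(source: str, back_translated: str) -> float:
    """Character-level similarity in [0.0, 1.0] between source and its back-translation."""
    # Normalize case and whitespace so trivial differences do not lower the score.
    a = " ".join(source.lower().split())
    b = " ".join(back_translated.lower().split())
    return SequenceMatcher(None, a, b).ratio()
```

A faithful paraphrase such as "has no fever" versus "is not feverish" scores lower than its meaning deserves under this measure, which is the under-scoring trade-off noted above.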

LLM judge scores

A separate Claude call, running concurrently with back-translation rather than sequentially, scores semantic consistency, hallucination risk, and clinical nuance preservation. The judge is prompted as a "clinical translation auditor" with a specific schema, and its output is fused into the composite score.
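The concurrency pattern can be illustrated with `asyncio.gather`. Both coroutines below are placeholders: they sleep instead of calling a model, and the field names and returned values are assumptions rather than the judge's real schema.

```python
import asyncio


async def back_translation_signal(source: str, translation: str) -> dict:
    """Placeholder for the reverse-translation call plus similarity comparison."""
    await asyncio.sleep(0.05)  # stands in for model latency
    return {"name": "back_translation", "score": 0.9}


async def llm_judge_signal(source: str, translation: str) -> dict:
    """Placeholder for the judge call; the three fields mirror the criteria named above."""
    await asyncio.sleep(0.05)  # stands in for model latency
    return {
        "name": "llm_judge",
        "semantic_consistency": 0.95,
        "hallucination_risk": 0.05,
        "clinical_nuance": 0.9,
    }


async def run_async_signals(source: str, translation: str) -> list[dict]:
    """Run both asynchronous signals at the same time and return results in call order."""
    return await asyncio.gather(
        back_translation_signal(source, translation),
        llm_judge_signal(source, translation),
    )


if __name__ == "__main__":
    print(asyncio.run(run_async_signals("No tiene fiebre", "She has no fever")))
```

Because the two calls do not depend on each other, total wait time is roughly that of the slower call rather than the sum of both.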

Each signal is normalized to 0.0–1.0. The overall trust score is a weighted average, with weights tuned to match clinical risk. Medication integrity is the highest-weighted signal because it's the most dangerous to miss; length ratio contributes less because it's a blunt signal. The tuning is the product of testing against real clinical-style transcripts.
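A sketch of the weighted average. Only the 0.25 weight for medication integrity comes from the text; every other weight, the signal names, and the renormalization over available signals are assumptions chosen so that the weights sum to 1.0.

```python
# Only "medication" (0.25) is stated in the text; the other weights are assumed.
WEIGHTS = {
    "medication": 0.25,
    "numeric": 0.20,
    "negation": 0.20,
    "length_ratio": 0.05,
    "untranslated": 0.05,
    "back_translation": 0.10,
    "llm_judge": 0.15,
}


def trust_score(signal_scores: dict[str, float]) -> float:
    """Weighted average of normalized signal scores, each in the range 0.0 to 1.0."""
    # Renormalize over the signals actually present (an assumption), so a
    # missing asynchronous signal does not silently drag the score down.
    total_weight = sum(WEIGHTS[name] for name in signal_scores)
    weighted = sum(WEIGHTS[name] * score for name, score in signal_scores.items())
    return round(weighted / total_weight, 2)
```

Under these assumed weights, a translation that fails the medication check completely but is perfect everywhere else still scores 0.75, which shows why a composite score alone cannot be the safety mechanism.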

Score and flags are not the same thing

Most ML scoring systems let a high overall score paper over individual failures. In a system with five signals, if four signals look great and one critical signal fails, you'd still land in the "acceptable" band. That's backwards for clinical safety.

Auravio separates continuous scoring from discrete safety-critical flags. Four specific failure modes (missing medication names, missing numeric values, negation mismatches, and missing frequency phrases) trigger a hardcoded override: if any of these fire, the risk band is red regardless of the composite score, and the recommendation routes to escalation rather than just verification.

A translation with a 0.95 overall trust score and a missing medication name is still red. A 0.95 with a dropped "no" in front of "known drug allergies" is still red. The score is a summary; the flags are the safety net.
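The separation between score and flags can be sketched as a banding function in which the override is checked before any threshold. The flag code strings and the 0.80 / 0.60 thresholds are assumptions, and the "clarify" recommendation is left out because the text does not say when it applies.

```python
# The four safety-critical failure modes. The code strings are hypothetical names.
OVERRIDE_CODES = {
    "missing_medication",
    "missing_number",
    "negation_mismatch",
    "missing_frequency",
}


def band_and_recommend(trust_score: float, flags: list[dict]) -> tuple[str, str]:
    """Return (risk_band, recommendation) for one translated segment."""
    # Discrete override first: any safety-critical flag forces red and escalation,
    # no matter how high the composite score is.
    if any(flag["code"] in OVERRIDE_CODES for flag in flags):
        return "red", "escalate"
    # Otherwise fall back to score thresholds (assumed values).
    if trust_score >= 0.80:
        return "green", "proceed"
    if trust_score >= 0.60:
        return "yellow", "verify"
    return "red", "escalate"
```

For example, `band_and_recommend(0.95, [{"code": "missing_medication", "severity": "high"}])` returns `("red", "escalate")`, while a 0.87 score with only a medium-severity untranslated-term flag returns `("green", "proceed")`.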

The explanation is clinician-facing: one sentence, no ML internals. The full signal breakdown is retained internally for audit and model improvement, but not surfaced to the clinician. A clinician needs to know "proceed," "verify," or "escalate" and why in plain English, not a 7-dimensional score vector.