How Auravio works.
A trust-aware clinical AI system. Speech-to-text, translation, multi-signal trust scoring, structured summarization, and FHIR-ready extraction, built on the premise that knowing when you're wrong matters more than sounding fluent.
The Output
01
Speech-to-text
Audio from the speaker is transcribed in the source language with timestamps and segment-level confidence.
02
Translation
The transcript is translated into the target language by a purpose-prompted medical-translation model. The translation model can flag terms it couldn't confidently render. These are passed forward as signals to the trust layer rather than silently corrected.
03
Trust scoring
The translation is evaluated against the source across multiple independent signals: some deterministic, some model-based. Each signal produces a normalized score and, where relevant, a discrete flag.
04
Banding and recommendation
Scores and flags combine into a risk band (green / yellow / red) and a concrete recommendation (proceed, verify, clarify, or escalate to a human interpreter).
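A minimal sketch of how banding could work. The thresholds, the `Flag` type, and the way the four recommendations map onto three bands are all illustrative assumptions, not Auravio's published rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flag:
    signal: str    # which signal raised the flag, e.g. "negation"
    severity: str  # "medium" or "high"


def band_and_recommend(score: float, flags: list[Flag]) -> tuple[str, str]:
    """Map a composite score in [0, 1] plus flags to (band, recommendation).

    All cut-offs below are hypothetical placeholders.
    """
    high = sum(1 for f in flags if f.severity == "high")
    medium = sum(1 for f in flags if f.severity == "medium")

    if high >= 2 or score < 0.50:
        return "red", "escalate"   # hand over to a human interpreter
    if high == 1:
        return "red", "clarify"    # ask the speaker to restate
    if medium > 0 or score < 0.80:
        return "yellow", "verify"
    return "green", "proceed"
```

In this sketch a single high-severity flag is enough to leave the green band regardless of the numeric score, which reflects the idea that one localized failure matters more than overall fluency.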
05
Summarization and extraction
At session end, Auravio produces a clinician-facing summary, a patient-friendly summary, and structured extractions (symptoms, medications, allergies, onset, etc.) mapped toward FHIR resources.
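As an illustration of what "mapped toward FHIR resources" could look like, the sketch below turns one extracted allergy into a FHIR R4 `AllergyIntolerance` resource expressed as a plain dictionary. The function name and the choice to mark every machine-extracted item as unconfirmed are assumptions made for this example.

```python
def allergy_to_fhir(substance: str, patient_ref: str) -> dict:
    """Build a minimal FHIR R4 AllergyIntolerance resource (hypothetical helper).

    `substance` is free text from the extraction step; no terminology
    coding is attempted here, so only `code.text` is populated.
    """
    return {
        "resourceType": "AllergyIntolerance",
        # Assumption: extracted items stay unconfirmed until a clinician reviews them.
        "verificationStatus": {
            "coding": [
                {
                    "system": "http://terminology.hl7.org/CodeSystem/allergyintolerance-verification",
                    "code": "unconfirmed",
                }
            ]
        },
        "code": {"text": substance},
        "patient": {"reference": patient_ref},
    }


resource = allergy_to_fhir("penicillin", "Patient/example")
```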
The Trust Layer
Auravio produces an output, a confidence score, a list of specific flags, and a recommendation. The design starts from a claim about how clinical errors actually happen: not from globally bad translations, but from one precise, localized failure in an otherwise fluent sentence.
Auravio deliberately combines seven signals: five deterministic rules that run synchronously, followed by two asynchronous signals that run in parallel.
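The ordering described above (synchronous rules first, then two concurrent model-backed signals) can be sketched with `asyncio`. The function names and the constant scores are placeholders; the sleeps stand in for model calls.

```python
import asyncio


def run_deterministic_rules(source: str, translation: str) -> dict[str, float]:
    # Placeholder: the five rule-based checks would run here, one after another.
    return {
        "medication": 1.0,
        "numeric": 1.0,
        "negation": 1.0,
        "length_ratio": 1.0,
        "untranslated": 1.0,
    }


async def back_translation_score(source: str, translation: str) -> float:
    await asyncio.sleep(0.01)  # stands in for the reverse-translation call
    return 0.9


async def judge_score(source: str, translation: str) -> float:
    await asyncio.sleep(0.01)  # stands in for the LLM judge call
    return 0.8


async def score_segment(source: str, translation: str) -> dict[str, float]:
    scores = run_deterministic_rules(source, translation)
    # The two slow signals are awaited together, so total latency is
    # roughly the slower of the two rather than their sum.
    back, judge = await asyncio.gather(
        back_translation_score(source, translation),
        judge_score(source, translation),
    )
    scores["back_translation"] = back
    scores["llm_judge"] = judge
    return scores


scores = asyncio.run(score_segment("No tiene fiebre", "He has no fever"))
```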
01
Medication name integrity
A dictionary of 263 medication names (English, Spanish, and Italian, plus 18 synonym groups such as acetaminophen ↔ paracetamol ↔ tylenol) is checked against both source and translation. If a medication appears in the source but is missing from the translation, the check raises a high-severity flag. Its weight in the composite score is 0.25, the highest of any signal.
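A minimal sketch of the synonym-aware check, assuming single-word medication names. The two synonym groups here are a tiny illustrative stand-in for the real dictionary, and the helper names are hypothetical.

```python
import re

# Illustrative subset; the real dictionary is far larger.
SYNONYM_GROUPS = [
    {"acetaminophen", "paracetamol", "tylenol"},
    {"ibuprofen", "ibuprofeno"},
]

# Map every surface form to one canonical name per group
# (here simply the alphabetically first member).
CANONICAL = {name: min(group) for group in SYNONYM_GROUPS for name in group}


def medications_in(text: str) -> set[str]:
    """Return the canonical medication names mentioned in `text`."""
    words = re.findall(r"[a-záéíóúñü]+", text.lower())
    return {CANONICAL[w] for w in words if w in CANONICAL}


def missing_medications(source: str, translation: str) -> set[str]:
    """Medications present in the source but absent from the translation."""
    return medications_in(source) - medications_in(translation)


missing = missing_medications("Tomo paracetamol y ibuprofeno.", "I take Tylenol.")
# `missing` contains "ibuprofen": the second drug was dropped in translation.
```

Because both sides are reduced to canonical names before comparison, translating "paracetamol" as "Tylenol" does not raise a flag, while dropping a drug entirely does.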
02
Numeric and dosage fidelity
Frequency phrases ("dos veces al día", "twice daily") are first canonicalized to a shared representation; the remaining raw numbers are then compared separately. If either a frequency or a number from the source is missing in the translation, a high-severity flag fires. This catches dosage errors that slip past fluency-focused translation.
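The two-stage comparison can be sketched as follows. The pattern table and token names such as `FREQ_2_PER_DAY` are assumptions for illustration, and thousands separators are not handled.

```python
import re

# Hypothetical canonicalization table: phrase pattern -> shared token.
FREQUENCY_PATTERNS = {
    r"\bdos veces al d[ií]a\b": "FREQ_2_PER_DAY",
    r"\btwice (?:daily|a day)\b": "FREQ_2_PER_DAY",
    r"\buna vez al d[ií]a\b": "FREQ_1_PER_DAY",
    r"\bonce (?:daily|a day)\b": "FREQ_1_PER_DAY",
}


def extract(text: str) -> tuple[list[str], list[str]]:
    """Return (canonical frequencies, raw numbers) found in `text`."""
    lowered = text.lower()
    freqs: list[str] = []
    for pattern, token in FREQUENCY_PATTERNS.items():
        freqs.extend(token for _ in re.findall(pattern, lowered))
        # Remove the phrase so it cannot interfere with number extraction.
        lowered = re.sub(pattern, " ", lowered)
    numbers = [n.replace(",", ".") for n in re.findall(r"\d+(?:[.,]\d+)?", lowered)]
    return sorted(freqs), sorted(numbers)


def dosage_mismatch(source: str, translation: str) -> bool:
    """True when frequencies or numbers differ between source and translation."""
    return extract(source) != extract(translation)
```

For example, `dosage_mismatch("Tome 500 mg dos veces al día", "Take 50 mg twice daily")` is true because the numbers differ even though the frequency survived.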
03
Negation preservation
Negation markers are counted per language, with Spanish double negation ("no tiene nada" is one semantic negation, not two) handled as a special case. Any difference between the source and translation counts raises a high-severity flag. Clinically, dropped negation is one of the most dangerous failure modes in medical translation.
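A rough sketch of per-language negation counting for a Spanish source and an English translation. The marker lists are short illustrative samples, and the rule that a concord word after "no" in the same clause is not counted again is one assumed way to handle Spanish double negation.

```python
import re

# Spanish words that reinforce an earlier "no" instead of adding a negation.
ES_CONCORD = {"nada", "nadie", "nunca", "ningún", "ninguna", "ninguno", "tampoco"}
EN_MARKERS = {"no", "not", "never", "nothing", "nobody", "none", "neither", "cannot"}


def count_negations_es(text: str) -> int:
    total = 0
    for clause in re.split(r"[.,;:!?]", text.lower()):
        seen_no = False
        for word in re.findall(r"[a-záéíóúñü]+", clause):
            if word == "no":
                total += 1
                seen_no = True
            elif word in ES_CONCORD and not seen_no:
                # Stands alone as a negation ("nunca fuma").
                total += 1
    return total


def count_negations_en(text: str) -> int:
    lowered = text.lower().replace("\u2019", "'")
    contracted = len(re.findall(r"n't\b", lowered))  # doesn't, isn't, ...
    words = re.findall(r"[a-z]+", re.sub(r"n't\b", " ", lowered))
    return contracted + sum(1 for w in words if w in EN_MARKERS)


def negation_mismatch(source_es: str, translation_en: str) -> bool:
    return count_negations_es(source_es) != count_negations_en(translation_en)
```

With these rules, "No tiene nada" and "He doesn't have anything" both count as one negation, while "No tiene fiebre" against "He has a fever" is a mismatch.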
04
Translation length ratio
Translations under 50% or over 200% of the source length are flagged as medium severity. It is a blunt signal, but it reliably catches truncation and fabrication.
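The check itself is a one-line comparison. The sketch below assumes length is measured in characters after trimming whitespace, which the description above does not specify.

```python
def length_ratio_flag(source: str, translation: str,
                      low: float = 0.5, high: float = 2.0) -> bool:
    """True when the translation is under 50% or over 200% of source length."""
    src_len = len(source.strip())
    if src_len == 0:
        # Assumption: any output for an empty source is suspicious.
        return bool(translation.strip())
    ratio = len(translation.strip()) / src_len
    return ratio < low or ratio > high
```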
05
Untranslated term count
If the translation model flagged terms it couldn't confidently render, each produces a medium-severity flag.
06
Back-translation similarity
The translation is sent back through the same model in reverse (target → source) and compared to the original. This is a deliberate simplification: character-level similarity will under-score paraphrases that are semantically equivalent, but it reliably catches significant drift.
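One simple way to compute a character-level similarity is Python's standard `difflib.SequenceMatcher`. This is an assumed stand-in; the description above does not name the metric actually used.

```python
import difflib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def back_translation_similarity(original: str, back_translated: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical after normalization."""
    matcher = difflib.SequenceMatcher(None, normalize(original), normalize(back_translated))
    return matcher.ratio()


# A faithful round trip scores high; a valid paraphrase scores lower
# even though the meaning is preserved, which is the known limitation.
exact = back_translation_similarity("No tiene fiebre", "no  tiene fiebre")
paraphrase = back_translation_similarity("No tiene fiebre", "No presenta fiebre")
```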
07
LLM judge scores
A separate Claude call, running concurrently with back-translation rather than after it, scores semantic consistency, hallucination risk, and clinical nuance preservation. The judge is prompted as a "clinical translation auditor" with a specific output schema, and its scores are fused into the composite score.
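A sketch of how a judge reply could be validated before fusion. The JSON field names, the 0 to 1 range, and the averaging in `as_quality` are assumptions; the actual schema and fusion rule are not given here, and no model call is made in this example.

```python
import json
from dataclasses import dataclass

FIELDS = ("semantic_consistency", "hallucination_risk", "clinical_nuance")


@dataclass(frozen=True)
class JudgeScores:
    semantic_consistency: float
    hallucination_risk: float
    clinical_nuance: float

    def as_quality(self) -> float:
        """Collapse to one score; risk is inverted so higher always means better."""
        return (
            self.semantic_consistency
            + (1.0 - self.hallucination_risk)
            + self.clinical_nuance
        ) / 3.0


def parse_judge_reply(raw: str) -> JudgeScores:
    """Parse and range-check the judge's JSON reply (hypothetical schema)."""
    data = json.loads(raw)
    values = {}
    for field in FIELDS:
        value = float(data[field])  # KeyError if the judge omitted a field
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{field} out of range: {value}")
        values[field] = value
    return JudgeScores(**values)


reply = '{"semantic_consistency": 0.9, "hallucination_risk": 0.1, "clinical_nuance": 0.9}'
judge = parse_judge_reply(reply)
```

Rejecting malformed or out-of-range replies up front keeps a misbehaving judge from silently skewing the composite score.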

