Question Details

No question body available.

Tags

azure openai-api azure-openai azure-cost-calculation

Answers (1)

April 1, 2026 · Score: -3 · Rep: 71

The discrepancy you are seeing is primarily due to the difference between Model Inference Tokens (reported by the SDK) and Billing Unit Tokens (used by Azure Cost Management).

In the Azure OpenAI Realtime API, audio is not billed based on the semantic "tokens" the model uses to process sound, but rather on a fixed duration-to-token conversion rate.

1. The Audio "Duration" Conversion (The ~6x Gap)

While the response.done event reports the model's internal representation of the audio (110 tokens), Azure bills audio based on the actual duration of the stream.

  • The Math: For the Realtime models, Azure/OpenAI typically bills 1 minute of audio as 3,600 tokens (which is 60 tokens per second).

  • Your Test: You sent ~11 seconds of audio.

  • Calculation: $11 \text{ seconds} \times 60 \text{ tokens/sec} = \mathbf{660 \text{ tokens}}$.

  • Result: Your billing meter shows 640, within about 3% of the duration-based estimate (the small shortfall likely comes from VAD silence trimming).
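The duration-based conversion above can be sketched as a few lines of Python. The 60 tokens/sec rate and the 11-second test duration come from this answer; the function name and constant are illustrative, not part of any SDK.

```python
# Assumed rate from the answer above: ~3,600 tokens per minute of audio.
AUDIO_TOKENS_PER_SECOND = 60

def billed_audio_tokens(duration_seconds: float) -> int:
    """Estimate duration-billed tokens for a Realtime audio stream."""
    return round(duration_seconds * AUDIO_TOKENS_PER_SECOND)

# ~11 seconds of input audio -> 660 estimated tokens,
# close to the 640 reported on the billing meter after VAD trimming.
print(billed_audio_tokens(11))
```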

2. The Whisper "Double-Bill" (Text Input Gap)

You have AudioTranscriptionOptions enabled. This creates a two-step billing scenario:

  1. Audio Input Meter: You are billed for the raw audio duration (the 640 tokens above).

  2. Text Input Meter: The resulting Whisper transcript is automatically injected into the conversation history as a text item, and the model then "reads" this text to generate its response. The transcript, together with the internal default system instructions and your Session Configuration JSON, is billed as standard Text Input tokens.

  • The response.done usage often only counts the "new" tokens in the current turn's delta, whereas the billing meter counts the entire active session context required to generate that response.
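The gap between the SDK's per-turn delta and the billed total can be modeled as a simple sum over the session context. All token counts below are illustrative placeholders, not measured values, and the function is a sketch rather than anything from the Azure SDK.

```python
def billed_text_input(transcript_tokens: int,
                      system_instruction_tokens: int,
                      session_config_tokens: int,
                      prior_history_tokens: int = 0) -> int:
    """Everything the model re-reads for a turn is billed as text input,
    not just the new transcript delta reported in response.done."""
    return (transcript_tokens
            + system_instruction_tokens
            + session_config_tokens
            + prior_history_tokens)

# Hypothetical numbers: the SDK delta shows only the transcript,
# while the meter counts the full active context.
sdk_delta = 45
meter_total = billed_text_input(transcript_tokens=45,
                                system_instruction_tokens=120,
                                session_config_tokens=60)
print(sdk_delta, meter_total)
```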

3. The Audio Output Gap

Similar to input, output audio is billed by duration:

  • Your Billing: 631 tokens.

  • Duration Calculation: $631 / 60 \text{ tokens/sec} \approx \mathbf{10.5 \text{ seconds}}$.

  • The assistant transcript you provided is approximately 65 words. At a natural speaking rate, that takes roughly 10–11 seconds to synthesize. The 420 tokens in your SDK output are the model's internal audio tokens; the 631 reflects the billed duration.
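The same rate can be inverted to back out the implied audio duration from the billed output tokens, assuming the 60 tokens/sec figure used above for input also applies to output. This is a reconciliation sketch, not an official billing formula.

```python
# Assumed duration-to-token rate, as for input audio.
TOKENS_PER_SECOND = 60

def implied_duration_seconds(billed_tokens: int) -> float:
    """Back out the audio duration implied by a duration-billed token count."""
    return billed_tokens / TOKENS_PER_SECOND

# 631 billed output tokens imply roughly 10.5 seconds of synthesized audio.
print(round(implied_duration_seconds(631), 1))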

4. Missing "Cached" Meters

In many Azure regions (including Sweden Central as of early 2026), the billing backend does not yet break out "Cached" tokens into a separate line item in the Cost Management UI. Instead:

  • Cached tokens are often rolled into the Standard Input meter, with the discounted rate applied to those specific units.

  • Alternatively, the dedicated cached meter simply reports "0" while the cached tokens are bundled into the primary meter.
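The bundling behavior can be modeled as standard-rate tokens plus cached tokens at a discount, all reported under one meter. The rates and the 50% discount below are placeholders, not published Azure prices.

```python
def bundled_input_cost(standard_tokens: int,
                       cached_tokens: int,
                       rate_per_1k: float,
                       cached_discount: float = 0.5) -> float:
    """Cost when cached tokens are folded into the Standard Input meter
    at a discounted rate instead of appearing on their own line item."""
    return (standard_tokens * rate_per_1k
            + cached_tokens * rate_per_1k * cached_discount) / 1000

# Hypothetical: 1,000 standard + 1,000 cached tokens at $1.00 per 1K.
print(bundled_input_cost(1000, 1000, rate_per_1k=1.0))
```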

The SDK usage object is for monitoring model performance/latency; the Azure Cost Management report is the only source of truth for billing, as it applies the duration-based commercial logic.