The difference in cache performance between Google Vertex AI and Google AI Studio

The Curious Case of Cache Misses: A Deep Dive into Google's Dual Gateway Mystery

Google Vertex and AI Studio
Date: Dec 17, 2025
Author: Andrew Zheng

Prologue: When Monitoring Reveals the Unexpected

It was a typical Tuesday morning at OneRouter headquarters. Our SRE team was conducting routine health checks on AI provider endpoints—a mundane but critical task that ensures our routing infrastructure maintains optimal performance across dozens of LLM providers.

Sarah, our lead monitoring engineer, was scanning the dashboard when she noticed something odd in the metrics visualization. The graph showed two lines representing cache hit rates for Google's gemini-2.5-flash-preview-09-2025 model, but instead of tracking closely together as expected, they diverged dramatically.

"Hey, take a look at this," she called out to the team. "Why would the same model have such different cache performance?"

The chart was clear: Google AI Studio was achieving cache hit rates around 78-82%, while Google Vertex AI plateaued at a concerning 15-22%. For identical requests to the same underlying model, this discrepancy made no sense.

What began as a routine monitoring task was about to turn into a fascinating technical investigation.


Chapter 1: Formulating the Hypothesis

Our first instinct was to assume instrumentation error. Perhaps our telemetry was miscategorized, or we were comparing apples to oranges—different workload patterns, different request distributions, different times of day.

But after triple-checking our metrics pipeline, the data stood firm:

{
  "provider": "google-ai-studio",
  "model": "gemini-2.5-flash-preview-09-2025",
  "cache_hit_rate": 0.801,
  "sample_size": 45672
}

{
  "provider": "google-vertex",
  "model": "gemini-2.5-flash-preview-09-2025",
  "cache_hit_rate": 0.287,
  "sample_size": 44891
}

The sample sizes were comparable. The temporal distribution was identical. The user prompts? Routed through the same OneRouter gateway with identical preprocessing.

Our hypothesis crystallized: Google AI Studio and Google Vertex AI, despite serving the same model, implement fundamentally different token caching mechanisms.


Chapter 2: The Investigation

To validate this hypothesis, we designed a controlled experiment. The methodology was straightforward but rigorous:

Experimental Design

Test Setup:

  • Model: gemini-2.5-flash-preview-09-2025

  • Providers: Google AI Studio vs. Google Vertex AI

  • Request Pattern: Identical sequence of 1,000 prompts with varying prefix overlap

  • Control Variables: Same API keys, same geographic region (us-central1), same time window

  • Measurement: Cache hit indicators from response headers and billing metadata

Test Prompts Structure:

# Pattern designed to maximize cache opportunity
prompts = [
    {
        "system": LONG_SHARED_CONTEXT,  # 15K tokens, identical across all requests
        "user": f"Question {i}: {generate_unique_query()}"  # 200-500 tokens, unique
    }
    for i in range(1000)
]

Execution

We instrumented both endpoints with detailed logging:

import time
import hashlib

def test_cache_behavior(provider, prompts):
    results = []
  
    for idx, prompt in enumerate(prompts):
        request_hash = hashlib.sha256(
            prompt['system'].encode()
        ).hexdigest()[:16]
      
        response = call_llm_api(
            provider=provider,
            model="gemini-2.5-flash-preview-09-2025",
            messages=[
                {"role": "system", "content": prompt['system']},
                {"role": "user", "content": prompt['user']}
            ]
        )
      
        cache_hit = detect_cache_usage(response)
      
        results.append({
            "request_id": idx,
            "context_hash": request_hash,
            "cache_hit": cache_hit,
            "latency_ms": response.latency,
            "tokens_cached": response.metadata.get('cached_tokens', 0)
        })
      
        time.sleep(0.1)  # Rate limiting
  
    return results
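
detect_cache_usage is left undefined above, so here is a minimal sketch of the heuristic we mean. It assumes the same response.metadata dict used in the logging code, with a prompt_tokens count alongside cached_tokens; those field names are assumptions modeled on Gemini's usage metadata (cachedContentTokenCount / promptTokenCount), so verify them against whatever your client actually returns.

def detect_cache_usage(response, min_cached_fraction=0.5):
    """Heuristically classify a response as a prefix-cache hit.

    Assumes `response.metadata` carries token accounting with
    `cached_tokens` (tokens billed at the cached rate) and
    `prompt_tokens` (total input tokens). Field names are assumptions
    mirroring Gemini's usage metadata, not guaranteed by any SDK.
    """
    metadata = response.metadata or {}
    cached = metadata.get('cached_tokens', 0)
    prompt = metadata.get('prompt_tokens', 0)

    if prompt <= 0:
        return False

    # Count it as a hit only if a substantial share of the prompt
    # (at least half, by default) was served from the cache.
    return cached / prompt >= min_cached_fraction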

The Results

After 72 hours of testing across multiple time zones and request patterns, the data was unambiguous:

Metric                           | Google AI Studio | Google Vertex AI
---------------------------------|------------------|-----------------
Cache Hit Rate                   | 79.3%            | 28.1%
Avg Latency (cache hit)          | 340 ms           | 385 ms
Avg Latency (cache miss)         | 1,240 ms         | 1,190 ms
Cost per 1M tokens (with cache)  | $0.42            | $1.185
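
To translate the hit-rate gap into dollars, a back-of-the-envelope model helps. The prices and discount below are illustrative placeholders rather than Google's published rates, so the output won't reproduce the exact figures in the table (those came from billing metadata); the point is how strongly the blended input price depends on the hit rate.

def effective_input_cost_per_1m(hit_rate, base_price_per_1m,
                                cached_discount=0.75, cacheable_fraction=0.97):
    """Blended input cost per 1M tokens, given a prefix-cache hit rate.

    Illustrative assumptions (not published pricing):
      - cached tokens cost (1 - cached_discount) of the base input price
      - cacheable_fraction of each prompt is the shared prefix
        (roughly 15K of ~15.4K tokens in our test)
    """
    cached_share = hit_rate * cacheable_fraction   # tokens billed at the cached rate
    fresh_share = 1.0 - cached_share               # tokens billed at the full rate
    return base_price_per_1m * (fresh_share + cached_share * (1 - cached_discount))

BASE = 1.50  # placeholder $/1M input tokens
print(effective_input_cost_per_1m(0.793, BASE))   # AI Studio-like hit rate
print(effective_input_cost_per_1m(0.281, BASE))   # Vertex-like hit rate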

The evidence was overwhelming. But why?

Chapter 3: Understanding the Architecture

To understand the discrepancy, we needed to map the architectural differences between the two services.

Google AI Studio: Developer-First Design

AI Studio appears optimized for interactive development workflows:

  • Shared cache pool across API keys from the same project

  • Longer cache TTL (time-to-live) for context prefixes

  • Aggressive cache matching using semantic similarity, not just exact byte matching

  • Single-region deployment, which likely reduces cache fragmentation

Google Vertex AI: Enterprise Multi-Tenancy

Vertex, designed for production enterprise workloads, takes a different approach:

  • Isolated cache per service account (security boundary)

  • Shorter cache TTL to ensure consistency across distributed deployments

  • Stricter cache invalidation policies

  • Multi-region load balancing causing cache fragmentation

This explained everything. Vertex's architectural choices—perfectly reasonable for enterprise security and consistency—resulted in lower cache efficiency for workloads with repeated context.
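
To build intuition for why fragmentation plus a shorter TTL hurts so much, consider a toy simulation: identical-prefix requests arrive about once a minute and are spread uniformly across N independent caches, each of which forgets the prefix after a fixed idle TTL. This is illustrative only, not a model of Google's actual infrastructure.

import random

def simulate_hit_rate(num_requests=1000, num_caches=1,
                      ttl_seconds=300, mean_gap_seconds=60):
    """Toy prefix-cache model: requests with an identical prefix arrive
    roughly once per mean_gap_seconds and are routed uniformly at random
    across num_caches independent caches, each of which evicts the prefix
    after ttl_seconds of inactivity."""
    now = 0.0
    last_warmed = {}  # cache_id -> timestamp when the prefix was last cached
    hits = 0
    for _ in range(num_requests):
        now += random.expovariate(1.0 / mean_gap_seconds)
        cache = random.randrange(num_caches)
        if cache in last_warmed and (now - last_warmed[cache]) <= ttl_seconds:
            hits += 1
        last_warmed[cache] = now
    return hits / num_requests

print(f"1 shared cache, 5-min TTL:    {simulate_hit_rate(num_caches=1):.0%}")
print(f"8 isolated caches, 5-min TTL: {simulate_hit_rate(num_caches=8):.0%}")

With one shared cache the simulated hit rate stays around 99%; split the same traffic across eight isolated caches and it drops below 50%, simply because each cache sees the prefix too rarely to keep it warm.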


Chapter 4: Real-World Impact

To confirm the effect wasn't an artifact of our gateway, we ran our own comparison between:

  • OneRouter - model="google-ai-studio/gemini-2.5-flash-preview-09-2025"

  • Google AI Studio - model="gemini-2.5-flash-preview-09-2025"

Then we ran a second comparison between:

  • OneRouter - model="google-vertex/gemini-2.5-flash-preview-09-2025"

  • Google Vertex - model="gemini-2.5-flash-preview-09-2025"

Whether accessed directly or through OneRouter, Google Vertex exhibited a significantly lower cache hit rate.


Epilogue: Lessons Learned

This investigation reinforced several principles that guide our work at OneRouter:

1. Monitor Everything, Assume Nothing

The cache discrepancy would have gone unnoticed without comprehensive telemetry. Instrumentation isn't overhead—it's insight.

2. Same API ≠ Same Behavior

Just because two providers expose OpenAI-compatible endpoints doesn't mean they behave identically at the infrastructure level. Abstract carefully, but measure always.

3. Give Users Control, With Guardrails

The best abstraction layer provides sensible defaults but allows expert users to optimize. Our provider-prefix syntax strikes this balance.

4. Resilience Through Redundancy

No provider achieves 100% uptime. Multi-provider fallback isn't a luxury—it's table stakes for production AI applications.

If your application involves sessions with a lot of repeated context, AI Studio is the better choice. However, because AI Studio is positioned as an experimental, developer-facing service and can't provide enterprise-level SLA guarantees, we recommend a dual approach if you want both cost efficiency and stability: configure OneRouter to route requests to AI Studio by default, with automatic fallback enabled. That way, the roughly 1% of requests that hit 429 rate-limit errors are automatically rerouted to Vertex, and your overall costs shouldn't increase significantly.
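
Here's roughly what that setup could look like. This is a hedged sketch, not OneRouter's definitive API: we assume an OpenAI-compatible chat completions endpoint, a placeholder base URL, and a hypothetical fallback field passed via extra_body; check the provider-routing and fallbacks documentation linked below for the exact request shape.

from openai import OpenAI

SHARED_CONTEXT = "..."  # your long, repeated system prompt

client = OpenAI(
    base_url="https://api.onerouter.pro/v1",  # placeholder base URL -- confirm in the docs
    api_key="YOUR_ONEROUTER_API_KEY",
)

response = client.chat.completions.create(
    # Primary route: the cache-friendly AI Studio gateway, pinned via the provider prefix.
    model="google-ai-studio/gemini-2.5-flash-preview-09-2025",
    messages=[
        {"role": "system", "content": SHARED_CONTEXT},
        {"role": "user", "content": "Summarize the latest session notes."},
    ],
    # Hypothetical fallback list: if AI Studio returns a 429 (or another provider error),
    # retry the same request against Vertex. Verify the actual parameter name and semantics.
    extra_body={
        "models": ["google-vertex/gemini-2.5-flash-preview-09-2025"],
    },
)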

5. Transparency Builds Trust

By exposing routing decisions and cache performance in response metadata, we empower users to understand and optimize their applications.


Open Questions

Our investigation also raised interesting questions for future research:

  1. Does Vertex's cache isolation improve security sufficiently to justify the cost trade-off?

  2. Can semantic cache matching (AI Studio style) be implemented client-side for any provider?

  3. What is the optimal cache TTL for different application archetypes?

We're exploring these in ongoing research.
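
On question 2 specifically, a client-side approximation is conceptually simple but comes with a caveat: without provider support you can only reuse responses for near-duplicate prompts, not warm the provider's server-side prefix cache. A minimal sketch, assuming a hypothetical embed() helper that returns unit-normalized vectors:

import numpy as np

class SemanticResponseCache:
    """Client-side response cache keyed by prompt embeddings (illustrative).

    `embed` is a hypothetical helper: str -> unit-normalized numpy vector.
    """

    def __init__(self, embed, threshold=0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached_response) pairs

    def get(self, prompt):
        if not self.entries:
            return None
        query = self.embed(prompt)
        vectors = np.stack([emb for emb, _ in self.entries])
        scores = vectors @ query  # cosine similarity, since vectors are unit-normalized
        best = int(np.argmax(scores))
        return self.entries[best][1] if scores[best] >= self.threshold else None

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

Misses still go to the provider and get stored with put(); the open part of the question is how high the similarity threshold has to be before reusing a response stops being safe for a given application.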


Conclusion

What started as a curious anomaly in our monitoring dashboard led to a comprehensive investigation that ultimately benefited all OneRouter users. By understanding the nuanced differences between Google's two API gateways, we were able to build routing intelligence that optimizes for both performance and reliability.

The lesson? In the rapidly evolving landscape of AI infrastructure, details matter. A 50-point difference in cache hit rates isn't just a technical curiosity—it's thousands of dollars in cost savings and measurably better user experiences.

OneRouter continues to monitor, investigate, and optimize across all AI providers, so you don't have to.

Try OneRouter today: https://onerouter.pro
Documentation: https://docs.onerouter.pro/features/provider-routing-and-fallbacks
Questions? Reach us at support@onerouter.pro

Scale without limits

Seamlessly integrate OneRouter with just a few lines of code and unlock unlimited AI power.
