When to move your OLAP workloads off Postgres
I know what you're thinking. Oh no, here's another 3,000-word article from Ariel about Postgres performance. But don't worry - this one's shorter because I've already covered most of these warning signs in previous articles. I'm just putting it all together here and focusing on helping you plan your exit before things get messy.
In my last article, I showed you how to push your Postgres analytics to its limits - from columnar storage extensions to BRIN indexes and vertical partitioning. And if you've implemented those optimizations, your analytics are probably running smoother than ever.
But how do you know when you've hit the ceiling? When do the gains from tuning start to flatline and the workarounds become more painful than beneficial?
After years of scaling Postgres for analytics and seeing many teams go through similar struggles, I've learned to watch for two types of signals. First, there are the technical metrics - query times, resource usage, and performance degradation. But just as important are the organizational signs - when your team spends more time maintaining complex optimizations than building features, when simple schema changes require careful planning to avoid outages, or when your on-call rotation becomes a source of burnout.
Technical Signs You've Hit the Wall
I covered what to look for as technical warning signs - query performance, replication lag, resource utilization, and others - in the previous articles in this series. Now I'm introducing a new metric we haven't discussed: query complexity evolution, which helps you track how much gnarlier your analytics queries are getting.
The queries below extend what I've covered before to help you identify trends in these metrics and what to watch for. It's not lost on me that I'm helping you solve the problem of too much analytics by building analytics for your analytics! I can't help but replay the Xzibit meme in the back of my head on a loop.
Now, on to how to build the necessary analytics:
Query Response Time
-- Note: pg_stat_statements_history is not a built-in view; it assumes you snapshot
-- pg_stat_statements on a schedule (a sketch of that setup appears near the end of
-- this article). On PostgreSQL 13+, total_time and mean_time are named
-- total_exec_time and mean_exec_time.
WITH monthly_stats AS (
SELECT
date_trunc('month', calls.timestamp) as month,
sum(total_time)/sum(calls) as avg_time,
percentile_cont(0.95)
WITHIN GROUP (ORDER BY mean_time) as p95_time
FROM pg_stat_statements_history calls
GROUP BY 1 ORDER BY 1
)
SELECT
month,
avg_time,
p95_time,
(avg_time - lag(avg_time) over (order by month))/
lag(avg_time) over (order by month) * 100 as time_increase_pct
FROM monthly_stats;
Warning thresholds (a query that flags these automatically follows the list):
- Month-over-month increase in average query time >20%
- P95 response times 3x higher than average
- Sustained increases for 3+ consecutive months
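If you want Postgres to flag these for you rather than eyeballing the output, here's a minimal sketch that layers the thresholds on top of the same monthly_stats CTE. It makes the same assumption as above - that pg_stat_statements_history is a periodic snapshot you maintain yourself - and the cutoff values are just the ones from the list, worth tuning to your workload.
WITH monthly_stats AS (
    SELECT
        date_trunc('month', calls.timestamp) AS month,
        sum(total_time) / sum(calls) AS avg_time,
        percentile_cont(0.95) WITHIN GROUP (ORDER BY mean_time) AS p95_time
    FROM pg_stat_statements_history calls
    GROUP BY 1
),
trends AS (
    SELECT
        month,
        avg_time,
        p95_time,
        (avg_time - lag(avg_time) OVER (ORDER BY month))
            / NULLIF(lag(avg_time) OVER (ORDER BY month), 0) * 100 AS time_increase_pct
    FROM monthly_stats
)
SELECT
    month,
    round(avg_time::numeric, 2) AS avg_time_ms,
    round(p95_time::numeric, 2) AS p95_time_ms,
    round(time_increase_pct::numeric, 1) AS time_increase_pct,
    time_increase_pct > 20 AS mom_regression,   -- >20% month-over-month growth
    p95_time > 3 * avg_time AS heavy_tail,      -- p95 more than 3x the average
    -- "sustained": the 20% threshold breached in each of the last three months
    count(*) FILTER (WHERE time_increase_pct > 20)
        OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) = 3
        AS sustained_regression
FROM trends
ORDER BY month DESC;
The FILTER-over-window expression at the end is just one way to express "three breaching months in a row"; eyeballing the mom_regression column works too.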
Storage and VACUUM Health
-- Note: pg_stat_user_tables only exposes current values. The generate_series scaffold
-- below labels the current numbers by week, so the week-over-week deltas are only
-- meaningful if you persist these rows as weekly snapshots (see the snapshot sketch
-- near the end of this article).
WITH
weekly_stats AS (
SELECT
date_trunc('week', current_timestamp - interval '1 week' * g) AS week,
schemaname,
relname,
COALESCE(n_dead_tup, 0) as dead_tuples,
COALESCE(n_live_tup, 0) as live_tuples,
last_vacuum,
last_autovacuum,
pg_total_relation_size(pgc.oid) as total_bytes,
pg_table_size(pgc.oid) as table_bytes,
pg_indexes_size(pgc.oid) as index_bytes
FROM
generate_series(0, 52) g
CROSS JOIN pg_stat_user_tables pgst
JOIN pg_class pgc ON pgc.oid = pgst.relid
),
vacuum_storage_stats AS (
SELECT
week,
schemaname,
relname,
dead_tuples::float / NULLIF(dead_tuples + live_tuples, 0) as dead_ratio,
EXTRACT(
EPOCH
FROM
(week - GREATEST(last_vacuum, last_autovacuum))
) as seconds_since_vacuum,
total_bytes::numeric / 1024 / 1024 as total_mb,
table_bytes::numeric / 1024 / 1024 as table_mb,
index_bytes::numeric / 1024 / 1024 as index_mb
FROM
weekly_stats
)
SELECT
week,
schemaname,
relname,
dead_ratio,
seconds_since_vacuum / 3600 as hours_since_vacuum,
total_mb,
table_mb,
index_mb,
total_mb - lag(total_mb) OVER (
PARTITION BY
relname
ORDER BY
week
) as mb_growth,
dead_ratio - lag(dead_ratio) OVER (
PARTITION BY
relname
ORDER BY
week
) as dead_ratio_change
FROM vacuum_storage_stats
ORDER BY week DESC, total_mb DESC;
Warning signs to look out for (a quick spot check against current state follows the list):
- Weekly storage growth >20%
- Index size growing faster than table size
- Total size growing non-linearly with data ingest rate
- Dead ratio increasing >5% week-over-week
- Hours since last VACUUM exceeding 24
- Dead ratio change accelerating over consecutive weeks
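If you don't have weekly snapshots yet and just want a spot check against the current state, something like this sketch works directly off pg_stat_user_tables. The 24-hour vacuum cutoff comes from the list above; the 20% dead-tuple ratio is my own rule of thumb (it mirrors autovacuum's default scale factor), not a number from this series.
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    round(n_dead_tup::numeric / NULLIF(n_dead_tup + n_live_tup, 0), 3) AS dead_ratio,
    GREATEST(last_vacuum, last_autovacuum) AS last_vacuumed,   -- NULL = never vacuumed
    pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
    pg_size_pretty(pg_indexes_size(relid)) AS index_size,
    -- flags for the two thresholds that can be checked without history
    n_dead_tup::numeric / NULLIF(n_dead_tup + n_live_tup, 0) > 0.2 AS high_dead_ratio,
    GREATEST(last_vacuum, last_autovacuum) < now() - interval '24 hours' AS vacuum_overdue
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;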
Query Complexity Evolution
-- Note: regexp_count() requires PostgreSQL 15+. The live pg_stat_statements view also
-- has no start_time column, so the query below assumes you are reading from a periodic
-- snapshot (like pg_stat_statements_history above) that records when stats were captured.
WITH
query_stats AS (
SELECT
query,
regexp_count (lower(query), 'join') as join_count,
regexp_count (lower(query), 'with') as cte_count,
regexp_count (lower(query), 'over\s*\(') as window_count,
length(query) as query_length,
total_time / calls as avg_time,
calls,
date_trunc('week', start_time) as week
FROM pg_stat_statements
WHERE query ILIKE '%SELECT%'
)
SELECT
week,
avg(join_count) as avg_joins,
avg(cte_count) as avg_ctes,
avg(window_count) as avg_windows,
avg(query_length) as avg_length,
count(*) as query_count
FROM query_stats
GROUP BY week
ORDER BY week;
Warning signs to look out for (a sketch for the share-of-queries threshold follows the list):
- Average joins per query >5
- CTEs or window functions in >30% of queries
- Query length growing >10% month over month
- Consistent increase in complexity metrics over 3+ months
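The ">30% of queries" threshold is easier to check as a share than as an average, so here's a hedged sketch that computes it straight from the live pg_stat_statements view. Like the query above, it assumes PostgreSQL 15+ for regexp_count(); the \m...\M word boundaries are my addition so that counting "with" doesn't also count every "WITHIN GROUP".
WITH classified AS (
    SELECT
        calls,
        regexp_count(lower(query), 'join') > 0 AS has_join,
        regexp_count(lower(query), '\mwith\M') > 0 AS has_cte,
        regexp_count(lower(query), 'over\s*\(') > 0 AS has_window
    FROM pg_stat_statements
    WHERE query ILIKE '%select%'
)
SELECT
    count(*) AS distinct_statements,
    round(100.0 * count(*) FILTER (WHERE has_join) / NULLIF(count(*), 0), 1) AS pct_with_joins,
    round(100.0 * count(*) FILTER (WHERE has_cte) / NULLIF(count(*), 0), 1) AS pct_with_ctes,
    round(100.0 * count(*) FILTER (WHERE has_window) / NULLIF(count(*), 0), 1) AS pct_with_windows,
    -- weight by executions so one hot dashboard query counts for how often it actually runs
    round(100.0 * sum(calls) FILTER (WHERE has_cte) / NULLIF(sum(calls), 0), 1) AS pct_of_calls_with_ctes
FROM classified;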
Replication Health
-- Note: like the storage query, this scaffolds weeks over current-state views, so the
-- trends only become real once you snapshot pg_stat_replication over time. Also,
-- pg_last_xact_replay_timestamp() returns NULL on a primary; on PostgreSQL 10+ the
-- write_lag/flush_lag/replay_lag columns of pg_stat_replication are a more direct
-- measure of time lag.
WITH weekly_replication_stats AS (
SELECT
date_trunc('week', current_timestamp - interval '1 week' * g) AS week,
application_name,
-- Current WAL location for each week
pg_current_wal_lsn() as current_lsn,
-- Get replay location
replay_lsn,
-- Calculate lag in bytes and seconds
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as lag_bytes,
EXTRACT (
EPOCH FROM (now() - pg_last_xact_replay_timestamp())
) as lag_seconds,
-- Track disconnections via pg_stat_activity
(
SELECT count(*)
FROM pg_stat_activity
WHERE backend_type = 'walsender'
AND state = 'active'
) as active_connections
FROM generate_series(0, 52) g
CROSS JOIN pg_stat_replication
)
SELECT
week,
application_name,
round(lag_bytes / 1024 / 1024, 2) as lag_mb,
round(lag_seconds, 2) as lag_seconds,
active_connections,
round(lag_bytes / 1024 / 1024 - lag(lag_bytes / 1024 / 1024)
OVER (PARTITION BY application_name ORDER BY week), 2) as mb_lag_change,
round(lag_seconds - lag(lag_seconds)
OVER (PARTITION BY application_name ORDER BY week), 2) as seconds_lag_change
FROM weekly_replication_stats
ORDER BY week DESC, lag_mb DESC;
Warning signs to look out for (a current-lag check follows the list):
- Week-over-week increase in lag_mb >50%
- Consistent growth in seconds_lag_change
- Active connections dropping below expected replica count
- Sustained growth in both MB and time lag for several weeks
- Replication lag >5 minutes during business hours
- Consistent lag growth week over week
- Frequent replica disconnections
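For a point-in-time check of where replication stands right now, the built-in lag columns on pg_stat_replication (PostgreSQL 10+) are the most direct source, and they sidestep the pg_last_xact_replay_timestamp() caveat noted above. A minimal sketch, run on the primary, that flags the 5-minute threshold:
SELECT
    application_name,
    client_addr,
    state,
    sync_state,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_size,
    write_lag,
    flush_lag,
    replay_lag,
    -- flag replicas past the 5-minute threshold from the list above
    replay_lag > interval '5 minutes' AS lag_threshold_breached
FROM pg_stat_replication
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) DESC NULLS LAST;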
The Human Cost Behind the Metrics
Engineering productivity metrics can tell some pretty compelling stories about database scaling challenges. A B2B SaaS company I was working with watched their deployment frequency slowly drop over six months, effectively getting cut in half. After some digging, they identified the root cause: their analytics queries had grown so complex that every deployment required extensive testing and coordination windows.
The pattern became even clearer when looking at their SPACE metrics. Their engineering team's satisfaction scores were dropping in parallel with the deployment slowdown. "Our sprint velocity looks fine on paper," their tech lead explained, "but we're spending more time maintaining database optimizations than building the insights dashboards our customers are asking for."
A DevOps startup tracking millions of daily log events hit this wall last year. "We used to estimate features in story points," their VP of Engineering told me. "Now every ticket gets a 'database complexity multiplier' based on how much it touches our analytics queries." Their team had become experts at Postgres optimization - but that expertise came at the cost of shipping customer-facing features.
These aren't isolated cases. When reviewing engineering team health metrics across companies scaling their Postgres analytics, common patterns emerge in DORA metrics:
- Deployment frequency shows a sustained drop, especially for changes involving the database
- Change failure rate increases, especially for analytics-related changes
- Mean time to recovery doubles for database-related incidents
- Lead time for changes grows exponentially for queries touching large tables
Planning Your Analytics Migration Runway
Don't wait for a crisis. Track both technical and team health indicators monthly:
Technical Metrics to Monitor:
- Query response time trends
- Storage growth vs. query performance correlation
- Dead tuple ratio on critical tables
- Replication lag patterns
- Query complexity evolution
Team Health Indicators:
- Deployment frequency trends
- Percentage of engineering capacity spent on DB optimization
- Code review duration for DB changes
- On-call incident patterns
- Team satisfaction trends
- Knowledge distribution across team
Set clear thresholds and review monthly. Most teams need 3-6 months to plan and execute an analytics migration. The key is spotting the signals early enough to make the transition on your terms, not during a crisis.
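Several of the queries above lean on history that Postgres doesn't keep for you, so part of planning your runway is standing up that scaffolding. Here's a minimal sketch assuming the pg_cron extension is available (any external scheduler works just as well); the pg_stat_statements_history name matches the table the response-time query expects.
-- One-time setup: a history table shaped like pg_stat_statements plus a capture time
CREATE TABLE IF NOT EXISTS pg_stat_statements_history AS
SELECT now() AS "timestamp", s.*
FROM pg_stat_statements s
WITH NO DATA;

-- Capture a snapshot every hour (requires pg_cron; an external cron job
-- calling psql works just as well)
SELECT cron.schedule(
    'snapshot-pg-stat-statements',
    '0 * * * *',
    $$INSERT INTO pg_stat_statements_history
      SELECT now(), s.* FROM pg_stat_statements s$$
);
The same pattern - a table shaped like the view plus a scheduled INSERT - also works for pg_stat_user_tables and pg_stat_replication, which is what the weekly storage and replication trend queries assume.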
Making the Call
Moving off Postgres for analytics isn't admitting defeat - it's recognizing that different tools serve different purposes. Your OLTP workload will thank you, and your team can focus on building features instead of maintaining increasingly complex optimizations.
In the next article, I’ll help you think through how to evaluate different OLAP solutions based on your specific needs. But for now, use these signals to make an informed decision about when to start planning your transition. Sometimes, the best engineering decision is knowing when to stop optimizing and start migrating.
Need to move your analytics off Postgres?
Tinybird is a real-time data platform optimized for low-latency, high-concurrency analytics. You can query your Postgres tables with SQL and publish queries as dynamic, scalable REST API Endpoints to power in-product reporting, real-time dashboards, and more.
You can try Tinybird for free with no time limit. Sign up here