When to move your OLAP workloads off Postgres
I know what you're thinking. Oh no, here's another 3,000-word article from Ariel about Postgres performance. But don't worry - this one's shorter because I've already covered most of these warning signs in previous articles. I'm just putting it all together here and focusing on helping you plan your exit before things get messy.
In my last article, I showed you how to push your Postgres analytics to its limits - from columnar storage extensions to BRIN indexes and vertical partitioning. And if you've implemented those optimizations, your analytics are probably running smoother than ever.
But how do you know when you've hit the ceiling? When do the gains from tuning start to flatline and the workarounds become more painful than beneficial?
After years of scaling Postgres for analytics and seeing many teams go through similar struggles, I've learned to watch for two types of signals. First, there are the technical metrics - query times, resource usage, and performance degradation. But just as important are the organizational signs - when your team spends more time maintaining complex optimizations than building features, when simple schema changes require careful planning to avoid outages, or when your on-call rotation becomes a source of burnout.
Technical Signs You've Hit the Wall
I covered what to look for as technical warning signs - query performance, replication lag, resource utilization, and others - in the previous articles in this series. Now I'm introducing a new metric we haven't discussed: query complexity evolution, which helps you track how much gnarlier your analytics queries are getting.
The queries below extend what I've covered before to help you identify trends in these metrics and what to watch for. It's not lost on me that I'm helping you solve the problem of too much analytics by building analytics for your analytics! I can't help but replay the Xzibit meme in the back of my head on a loop.
Now, on to how to build the necessary analytics:
Query Response Time
-- Note: pg_stat_statements_history is not a built-in view; it assumes you snapshot
-- pg_stat_statements on a schedule (a sketch of that setup appears near the end of
-- this article). On PostgreSQL 13+, total_time and mean_time are named
-- total_exec_time and mean_exec_time.
WITH monthly_stats AS (
SELECT
date_trunc('month', calls.timestamp) as month,
sum(total_time)/sum(calls) as avg_time,
percentile_cont(0.95)
WITHIN GROUP (ORDER BY mean_time) as p95_time
FROM pg_stat_statements_history calls
GROUP BY 1 ORDER BY 1
)
SELECT
month,
avg_time,
p95_time,
(avg_time - lag(avg_time) over (order by month))/
lag(avg_time) over (order by month) * 100 as time_increase_pct
FROM monthly_stats;
Warning thresholds (a query that flags these automatically follows the list):
- Month-over-month increase in average query time >20%
- P95 response times 3x higher than average
- Sustained increases for 3+ consecutive months
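If you want Postgres to flag these for you rather than eyeballing the output, here's a minimal sketch that layers the thresholds on top of the same monthly_stats CTE. It makes the same assumption as above - that pg_stat_statements_history is a periodic snapshot you maintain yourself - and the cutoff values are just the ones from the list, worth tuning to your workload.
WITH monthly_stats AS (
    SELECT
        date_trunc('month', calls.timestamp) AS month,
        sum(total_time) / sum(calls) AS avg_time,
        percentile_cont(0.95) WITHIN GROUP (ORDER BY mean_time) AS p95_time
    FROM pg_stat_statements_history calls
    GROUP BY 1
),
trends AS (
    SELECT
        month,
        avg_time,
        p95_time,
        (avg_time - lag(avg_time) OVER (ORDER BY month))
            / NULLIF(lag(avg_time) OVER (ORDER BY month), 0) * 100 AS time_increase_pct
    FROM monthly_stats
)
SELECT
    month,
    round(avg_time::numeric, 2) AS avg_time_ms,
    round(p95_time::numeric, 2) AS p95_time_ms,
    round(time_increase_pct::numeric, 1) AS time_increase_pct,
    time_increase_pct > 20 AS mom_regression,   -- >20% month-over-month growth
    p95_time > 3 * avg_time AS heavy_tail,      -- p95 more than 3x the average
    -- "sustained": the 20% threshold breached in each of the last three months
    count(*) FILTER (WHERE time_increase_pct > 20)
        OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) = 3
        AS sustained_regression
FROM trends
ORDER BY month DESC;
The FILTER-over-window expression at the end is just one way to express "three breaching months in a row"; eyeballing the mom_regression column works too.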
Storage and VACUUM Health
-- Note: pg_stat_user_tables only exposes current values. The generate_series scaffold
-- below labels the current numbers by week, so the week-over-week deltas are only
-- meaningful if you persist these rows as weekly snapshots (see the snapshot sketch
-- near the end of this article).
WITH
weekly_stats AS (
SELECT
date_trunc('week', current_timestamp - interval '1 week' * g) AS week,
schemaname,
relname,
COALESCE(n_dead_tup, 0) as dead_tuples,
COALESCE(n_live_tup, 0) as live_tuples,
last_vacuum,
last_autovacuum,
pg_total_relation_size(pgc.oid) as total_bytes,
pg_table_size(pgc.oid) as table_bytes,
pg_indexes_size(pgc.oid) as index_bytes
FROM
generate_series(0, 52) g
CROSS JOIN pg_stat_user_tables pgst
JOIN pg_class pgc ON pgc.oid = pgst.relid
),
vacuum_storage_stats AS (
SELECT
week,
schemaname,
relname,
dead_tuples::float / NULLIF(dead_tuples + live_tuples, 0) as dead_ratio,
EXTRACT(
EPOCH
FROM
(week - GREATEST(last_vacuum, last_autovacuum))
) as seconds_since_vacuum,
total_bytes::numeric / 1024 / 1024 as total_mb,
table_bytes::numeric / 1024 / 1024 as table_mb,
index_bytes::numeric / 1024 / 1024 as index_mb
FROM
weekly_stats
)
SELECT
week,
schemaname,
relname,
dead_ratio,
seconds_since_vacuum / 3600 as hours_since_vacuum,
total_mb,
table_mb,
index_mb,
total_mb - lag(total_mb) OVER (
PARTITION BY
relname
ORDER BY
week
) as mb_growth,
dead_ratio - lag(dead_ratio) OVER (
PARTITION BY
relname
ORDER BY
week
) as dead_ratio_change
FROM vacuum_storage_stats
ORDER BY week DESC, total_mb DESC;
Warning signs to look out for (a quick spot check against current state follows the list):
- Weekly storage growth >20%
- Index size growing faster than table size
- Total size growing non-linearly with data ingest rate
- Dead ratio increasing >5% week-over-week
- Hours since last VACUUM exceeding 24
- Dead ratio change accelerating over consecutive weeks
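If you don't have weekly snapshots yet and just want a spot check against the current state, something like this sketch works directly off pg_stat_user_tables. The 24-hour vacuum cutoff comes from the list above; the 20% dead-tuple ratio is my own rule of thumb (it mirrors autovacuum's default scale factor), not a number from this series.
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    round(n_dead_tup::numeric / NULLIF(n_dead_tup + n_live_tup, 0), 3) AS dead_ratio,
    GREATEST(last_vacuum, last_autovacuum) AS last_vacuumed,   -- NULL = never vacuumed
    pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
    pg_size_pretty(pg_indexes_size(relid)) AS index_size,
    -- flags for the two thresholds that can be checked without history
    n_dead_tup::numeric / NULLIF(n_dead_tup + n_live_tup, 0) > 0.2 AS high_dead_ratio,
    GREATEST(last_vacuum, last_autovacuum) < now() - interval '24 hours' AS vacuum_overdue
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;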
Query Complexity Evolution
-- Note: regexp_count() requires PostgreSQL 15+. The live pg_stat_statements view also
-- has no start_time column, so the query below assumes you are reading from a periodic
-- snapshot (like pg_stat_statements_history above) that records when stats were captured.
WITH
query_stats AS (
SELECT
query,
regexp_count (lower(query), 'join') as join_count,
regexp_count (lower(query), 'with') as cte_count,
regexp_count (lower(query), 'over\s*\(') as window_count,
length(query) as query_length,
total_time / calls as avg_time,
calls,
date_trunc('week', start_time) as week
FROM pg_stat_statements
WHERE query ILIKE '%SELECT%'
)
SELECT
week,
avg(join_count) as avg_joins,
avg(cte_count) as avg_ctes,
avg(window_count) as avg_windows,
avg(query_length) as avg_length,
count(*) as query_count
FROM query_stats
GROUP BY week
ORDER BY week;
Warning signs to look out for (a sketch for the share-of-queries threshold follows the list):
- Average joins per query >5
- CTEs or window functions in >30% of queries
- Query length growing >10% month over month
- Consistent increase in complexity metrics over 3+ months
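The ">30% of queries" threshold is easier to check as a share than as an average, so here's a hedged sketch that computes it straight from the live pg_stat_statements view. Like the query above, it assumes PostgreSQL 15+ for regexp_count(); the \m...\M word boundaries are my addition so that counting "with" doesn't also count every "WITHIN GROUP".
WITH classified AS (
    SELECT
        calls,
        regexp_count(lower(query), 'join') > 0 AS has_join,
        regexp_count(lower(query), '\mwith\M') > 0 AS has_cte,
        regexp_count(lower(query), 'over\s*\(') > 0 AS has_window
    FROM pg_stat_statements
    WHERE query ILIKE '%select%'
)
SELECT
    count(*) AS distinct_statements,
    round(100.0 * count(*) FILTER (WHERE has_join) / NULLIF(count(*), 0), 1) AS pct_with_joins,
    round(100.0 * count(*) FILTER (WHERE has_cte) / NULLIF(count(*), 0), 1) AS pct_with_ctes,
    round(100.0 * count(*) FILTER (WHERE has_window) / NULLIF(count(*), 0), 1) AS pct_with_windows,
    -- weight by executions so one hot dashboard query counts for how often it actually runs
    round(100.0 * sum(calls) FILTER (WHERE has_cte) / NULLIF(sum(calls), 0), 1) AS pct_of_calls_with_ctes
FROM classified;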
Replication Health
-- Note: like the storage query, this scaffolds weeks over current-state views, so the
-- trends only become real once you snapshot pg_stat_replication over time. Also,
-- pg_last_xact_replay_timestamp() returns NULL on a primary; on PostgreSQL 10+ the
-- write_lag/flush_lag/replay_lag columns of pg_stat_replication are a more direct
-- measure of time lag.
WITH weekly_replication_stats AS (
SELECT
date_trunc('week', current_timestamp - interval '1 week' * g) AS week,
application_name,
-- Current WAL location for each week
pg_current_wal_lsn() as current_lsn,
-- Get replay location
replay_lsn,
-- Calculate lag in bytes and seconds
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as lag_bytes,
EXTRACT (
EPOCH FROM (now() - pg_last_xact_replay_timestamp())
) as lag_seconds,
-- Track disconnections via pg_stat_activity
(
SELECT count(*)
FROM pg_stat_activity
WHERE backend_type = 'walsender'
AND state = 'active'
) as active_connections
FROM generate_series(0, 52) g
CROSS JOIN pg_stat_replication
)
SELECT
week,
application_name,
round(lag_bytes / 1024 / 1024, 2) as lag_mb,
round(lag_seconds, 2) as lag_seconds,
active_connections,
round(lag_bytes / 1024 / 1024 - lag(lag_bytes / 1024 / 1024)
OVER (PARTITION BY application_name ORDER BY week), 2) as mb_lag_change,
round(lag_seconds - lag(lag_seconds)
OVER (PARTITION BY application_name ORDER BY week), 2) as seconds_lag_change
FROM weekly_replication_stats
ORDER BY week DESC, lag_mb DESC;
Warning signs to look out for (a current-lag check follows the list):
- Week-over-week increase in lag_mb >50%
- Consistent growth in seconds_lag_change
- Active connections dropping below expected replica count
- Sustained growth in both MB and time lag for several weeks
- Replication lag >5 minutes during business hours
- Consistent lag growth week over week
- Frequent replica disconnections
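For a point-in-time check of where replication stands right now, the built-in lag columns on pg_stat_replication (PostgreSQL 10+) are the most direct source, and they sidestep the pg_last_xact_replay_timestamp() caveat noted above. A minimal sketch, run on the primary, that flags the 5-minute threshold:
SELECT
    application_name,
    client_addr,
    state,
    sync_state,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_size,
    write_lag,
    flush_lag,
    replay_lag,
    -- flag replicas past the 5-minute threshold from the list above
    replay_lag > interval '5 minutes' AS lag_threshold_breached
FROM pg_stat_replication
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) DESC NULLS LAST;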
The Human Cost Behind the Metrics
Engineering productivity metrics can tell some pretty compelling stories about database scaling challenges. A B2B SaaS company I was working with watched their deployment frequency slowly drop over six months, effectively getting cut in half. After some digging, they identified the root cause: their analytics queries had grown so complex that every deployment required extensive testing and coordination windows.
The pattern became even clearer when looking at their SPACE metrics. Their engineering team's satisfaction scores were dropping in parallel with the deployment slowdown. "Our sprint velocity looks fine on paper," their tech lead explained, "but we're spending more time maintaining database optimizations than building the insights dashboards our customers are asking for."
A DevOps startup tracking millions of daily log events hit this wall last year. "We used to estimate features in story points," their VP of Engineering told me. "Now every ticket gets a 'database complexity multiplier' based on how much it touches our analytics queries." Their team had become experts at Postgres optimization - but that expertise came at the cost of shipping customer-facing features.
These aren't isolated cases. When reviewing engineering team health metrics across companies scaling their Postgres analytics, common patterns emerge in DORA metrics:
- Deployment frequency shows a sustained drop, especially for changes involving the database
- Change failure rate increases, especially for analytics-related changes
- Mean time to recovery doubles for database-related incidents
- Lead time for changes grows exponentially for queries touching large tables
Planning Your Analytics Migration Runway
Don't wait for a crisis. Track both technical and team health indicators monthly:
Technical Metrics to Monitor:
- Query response time trends
- Storage growth vs. query performance correlation
- Dead tuple ratio on critical tables
- Replication lag patterns
- Query complexity evolution
Team Health Indicators:
- Deployment frequency trends
- Percentage of engineering capacity spent on DB optimization
- Code review duration for DB changes
- On-call incident patterns
- Team satisfaction trends
- Knowledge distribution across team
Set clear thresholds and review monthly. Most teams need 3-6 months to plan and execute an analytics migration. The key is spotting the signals early enough to make the transition on your terms, not during a crisis.
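Several of the queries above lean on history that Postgres doesn't keep for you, so part of planning your runway is standing up that scaffolding. Here's a minimal sketch assuming the pg_cron extension is available (any external scheduler works just as well); the pg_stat_statements_history name matches the table the response-time query expects.
-- One-time setup: a history table shaped like pg_stat_statements plus a capture time
CREATE TABLE IF NOT EXISTS pg_stat_statements_history AS
SELECT now() AS "timestamp", s.*
FROM pg_stat_statements s
WITH NO DATA;

-- Capture a snapshot every hour (requires pg_cron; an external cron job
-- calling psql works just as well)
SELECT cron.schedule(
    'snapshot-pg-stat-statements',
    '0 * * * *',
    $$INSERT INTO pg_stat_statements_history
      SELECT now(), s.* FROM pg_stat_statements s$$
);
The same pattern - a table shaped like the view plus a scheduled INSERT - also works for pg_stat_user_tables and pg_stat_replication, which is what the weekly storage and replication trend queries assume.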
Making the Call
Moving off Postgres for analytics isn't admitting defeat - it's recognizing that different tools serve different purposes. Your OLTP workload will thank you, and your team can focus on building features instead of maintaining increasingly complex optimizations.
In the next article, I’ll help you think through how to evaluate different OLAP solutions based on your specific needs. But for now, use these signals to make an informed decision about when to start planning your transition. Sometimes, the best engineering decision is knowing when to stop optimizing and start migrating.
Need to move your analytics off Postgres?
Tinybird is a real-time data platform optimized for low-latency, high-concurrency analytics. You can query your Postgres tables with SQL and publish queries as dynamic, scalable REST API Endpoints to power in-product reporting, real-time dashboards, and more.
You can try Tinybird for free with no time limit. Sign up here