Performance Bottleneck in Flink Application with RocksDB State Backend

Problem Context

I'm experiencing intermittent performance degradation in my Flink 1.19 streaming application, which uses the RocksDB state backend. The application processes around 35,000 records per second through a KeyedProcessFunction with a ValueState<Boolean> configured with a 5-minute TTL.
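
For context, the stateful part of the job looks roughly like this (a simplified sketch, not the actual code; class and field names are illustrative):

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Simplified sketch of the stateful operator; identifiers are illustrative.
public class DedupFunction extends KeyedProcessFunction<String, String, String> {

    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        // Boolean flag per key, expiring 5 minutes after it was written.
        StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.minutes(5))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();

        ValueStateDescriptor<Boolean> descriptor =
                new ValueStateDescriptor<>("seen", Boolean.class);
        descriptor.enableTimeToLive(ttl);
        seen = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void processElement(String record, Context ctx, Collector<String> out) throws Exception {
        // Each record triggers a point lookup (RocksDB.get) on the keyed state.
        if (seen.value() == null) {
            seen.update(true);
            out.collect(record);
        }
    }
}
```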

Performance Symptoms

  • Processing throughput fluctuates dramatically
  • Per-task-slot throughput drops from roughly 6,500 records/s to 100-150 records/s
  • Each degradation episode lasts 4-5 minutes
  • A flame graph shows org.rocksdb.RocksDB.get() consuming 100% of the processing time

Configuration Details

  • RocksDB state backend with incremental checkpointing
  • Checkpoints stored in S3
  • Managed Memory: 1.5 GB
  • Framework Heap + Task Heap: 400 MB
  • 3 CPUs, 3 task slots
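
In configuration terms, the setup above corresponds roughly to the following sketch (shown programmatically for a local run; on the cluster the same keys live in flink-conf.yaml, the S3 path is a placeholder, and the exact framework/task heap split is approximate):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.MemorySize;
import org.apache.flink.configuration.TaskManagerOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class JobConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set(TaskManagerOptions.NUM_TASK_SLOTS, 3);
        conf.set(TaskManagerOptions.MANAGED_MEMORY_SIZE, MemorySize.parse("1536m"));  // ~1.5 GB managed memory
        // The 400 MB heap figure is framework + task heap combined; the split below is approximate.
        conf.set(TaskManagerOptions.FRAMEWORK_HEAP_MEMORY, MemorySize.parse("128m"));
        conf.set(TaskManagerOptions.TASK_HEAP_MEMORY, MemorySize.parse("272m"));
        conf.setString("state.backend", "rocksdb");            // EmbeddedRocksDBStateBackend
        conf.setString("state.backend.incremental", "true");   // incremental checkpointing
        conf.setString("state.checkpoints.dir", "s3://<bucket>/checkpoints");  // placeholder bucket

        // Only meaningful for a local MiniCluster run; on the cluster the same keys go in flink-conf.yaml.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
    }
}
```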

What factors might be causing these performance inconsistencies in my Flink stream processing setup?

Hey, have you tried tweaking your RocksDB cache settings? Your reads look really heavy. Maybe bump up the background threads or adjust the block cache share? The memory config seems like it might be your bottleneck :thinking: Want to share more about your exact processing logic?
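
Something like this is where I'd start, just as a sketch (the numbers are guesses against your 1.5 GB of managed memory, not tested values; the same keys can go straight into flink-conf.yaml):

```java
import org.apache.flink.configuration.Configuration;

// Sketch of RocksDB memory-related knobs; values are guesses, not tested recommendations.
public class RocksDbMemoryTuning {
    public static Configuration rocksDbReadTuning() {
        Configuration conf = new Configuration();
        // Keep RocksDB inside Flink's managed memory (the default) so the 1.5 GB budget is respected.
        conf.setString("state.backend.rocksdb.memory.managed", "true");
        // Shift the managed-memory split away from write buffers and toward the block cache (default ratio is 0.5).
        conf.setString("state.backend.rocksdb.memory.write-buffer-ratio", "0.4");
        // Reserve part of the cache for index/filter blocks so they are not evicted by data blocks.
        conf.setString("state.backend.rocksdb.memory.high-prio-pool-ratio", "0.2");
        // Bloom filters let point lookups (ValueState#value -> RocksDB.get) skip SST files that cannot contain the key.
        conf.setString("state.backend.rocksdb.use-bloom-filter", "true");
        return conf;
    }
}
```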

Based on the performance symptoms you've described, I recommend diving deeper into RocksDB tuning and Flink configuration. The intermittent throughput drops suggest potential issues with state management and resource allocation.

Specifically, consider adjusting RocksDB's compaction and memory configuration. Time spent almost entirely in `RocksDB.get()` usually means point lookups are missing the block cache and going to disk, and that gets worse as uncompacted SST files pile up. You might want to experiment with the column family settings, increase the background thread count for compaction, and adopt a more aggressive caching strategy.
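
For example, a custom `RocksDBOptionsFactory` along those lines might look like the following sketch (untested; the class name and values are placeholders, and with managed memory enabled Flink still controls the block cache itself, so this only touches threads and bloom filters):

```java
import java.util.Collection;

import org.apache.flink.contrib.streaming.state.RocksDBOptionsFactory;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.BloomFilter;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

// Untested sketch of an options factory; the values are starting points, not recommendations.
public class TunedRocksDbOptions implements RocksDBOptionsFactory {

    @Override
    public DBOptions createDBOptions(DBOptions current, Collection<AutoCloseable> handlesToClose) {
        // More background jobs so flushes/compactions keep up instead of piling up read amplification.
        return current.setMaxBackgroundJobs(4);
    }

    @Override
    public ColumnFamilyOptions createColumnFamilyOptions(
            ColumnFamilyOptions current, Collection<AutoCloseable> handlesToClose) {
        // Bloom filters help the point lookups behind ValueState#value() skip irrelevant SST files.
        BloomFilter bloomFilter = new BloomFilter(10, false);
        handlesToClose.add(bloomFilter);                  // let Flink close the native handle
        BlockBasedTableConfig table = new BlockBasedTableConfig();
        table.setFilterPolicy(bloomFilter);
        return current.setTableFormatConfig(table);
    }
}
```

You would wire this in with `new EmbeddedRocksDBStateBackend(true)` plus `setRocksDBOptions(...)` and `env.setStateBackend(...)`, or point `state.backend.rocksdb.options-factory` at the class in the configuration.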

Additionally, verify your checkpoint interval and S3 storage configuration; network latency or checkpoint creation overhead could be contributing to the fluctuations. Monitoring RocksDB's native metrics through Flink's metric reporters (JMX, for example), or enabling more granular logging, should give deeper insight into where the contention actually is.
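
For instance, RocksDB's native metrics can be enabled per property through the configuration. A small selection like the following sketch is usually enough, since every enabled metric adds some overhead (the same keys also work in flink-conf.yaml):

```java
import org.apache.flink.configuration.Configuration;

// Sketch: enable a handful of RocksDB native metrics to inspect during the slow phases.
public class RocksDbMetricsConfig {
    public static Configuration nativeMetrics() {
        Configuration conf = new Configuration();
        conf.setString("state.backend.rocksdb.metrics.block-cache-usage", "true");                 // how full the block cache is
        conf.setString("state.backend.rocksdb.metrics.estimate-num-keys", "true");                 // estimated keys per column family
        conf.setString("state.backend.rocksdb.metrics.num-running-compactions", "true");           // compaction activity
        conf.setString("state.backend.rocksdb.metrics.estimate-pending-compaction-bytes", "true"); // compaction backlog driving read amplification
        conf.setString("state.backend.rocksdb.metrics.is-write-stopped", "true");                  // whether RocksDB is stalling writes
        return conf;
    }
}
```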

Hey, check your RocksDB write-ahead log settings! You're using incremental checkpointing with S3, so maybe log compression or sync frequency is causing the latency? Try tweaking the background threads and see if it helps performance :rocket:
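
If it helps, both the background thread count and the checkpoint cadence are plain configuration; a sketch like this, with values that are just guesses for your job:

```java
import org.apache.flink.configuration.Configuration;

// Sketch: RocksDB background threads plus checkpoint cadence; values are guesses, same keys work in flink-conf.yaml.
public class CheckpointAndCompactionConfig {
    public static Configuration sketch() {
        Configuration conf = new Configuration();
        conf.setString("state.backend.rocksdb.thread.num", "4");      // flush/compaction background threads
        conf.setString("execution.checkpointing.interval", "1 min");  // how often incremental checkpoints start
        conf.setString("execution.checkpointing.min-pause", "30 s");  // breathing room between checkpoints
        conf.setString("execution.checkpointing.timeout", "10 min");  // slow S3 uploads fail the checkpoint instead of hanging
        return conf;
    }
}
```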

Hmm, interesting problem! Have you considered that your key distribution might be causing these odd bottlenecks? RocksDB can get messy with unevenly spread data. Want to share more about your keying strategy? Curious how you're processing those records :thinking:
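
One quick way to check the spread offline, for what it's worth: a sketch like this, where `sampleKeys` stands in for a sample of your real keys and `maxParallelism` is 128 unless you changed it:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

// Sketch: histogram of how sampled keys land on subtasks; heavy skew here means hot task slots.
public class KeySkewCheck {
    public static Map<Integer, Long> subtaskHistogram(List<String> sampleKeys, int maxParallelism, int parallelism) {
        Map<Integer, Long> perSubtask = new TreeMap<>();
        for (String key : sampleKeys) {
            int keyGroup = KeyGroupRangeAssignment.assignToKeyGroup(key, maxParallelism);
            int subtask = KeyGroupRangeAssignment.computeOperatorIndexForKeyGroup(maxParallelism, parallelism, keyGroup);
            perSubtask.merge(subtask, 1L, Long::sum);
        }
        return perSubtask;
    }
}
```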