Performance Bottleneck in Flink Application with RocksDB State Backend

Problem Context

I'm experiencing intermittent performance degradation in my Flink 1.19 streaming application, which uses the RocksDB state backend. The application processes around 35,000 records per second through a KeyedProcessFunction with a ValueState<Boolean> configured with a 5-minute TTL.
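
For context, the stateful part of the job looks roughly like this (a simplified sketch, not the actual code; class and field names are illustrative):

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Simplified sketch of the stateful operator; identifiers are illustrative.
public class DedupFunction extends KeyedProcessFunction<String, String, String> {

    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        // Boolean flag per key, expiring 5 minutes after it was written.
        StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.minutes(5))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();

        ValueStateDescriptor<Boolean> descriptor =
                new ValueStateDescriptor<>("seen", Boolean.class);
        descriptor.enableTimeToLive(ttl);
        seen = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void processElement(String record, Context ctx, Collector<String> out) throws Exception {
        // Each record triggers a point lookup (RocksDB.get) on the keyed state.
        if (seen.value() == null) {
            seen.update(true);
            out.collect(record);
        }
    }
}
```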

Performance Symptoms

  • Processing throughput fluctuates dramatically
  • Per-task-slot throughput drops from roughly 6,500 records/s to 100-150 records/s
  • Each degradation episode lasts 4-5 minutes
  • A flame graph shows org.rocksdb.RocksDB.get() consuming 100% of the processing time

Configuration Details

  • RocksDB state backend with incremental checkpointing
  • Checkpoints stored in S3
  • Managed Memory: 1.5 GB
  • Framework Heap + Task Heap: 400 MB
  • 3 CPUs, 3 task slots
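
In configuration terms, the setup above corresponds roughly to the following sketch (shown programmatically for a local run; on the cluster the same keys live in flink-conf.yaml, the S3 path is a placeholder, and the exact framework/task heap split is approximate):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.MemorySize;
import org.apache.flink.configuration.TaskManagerOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class JobConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set(TaskManagerOptions.NUM_TASK_SLOTS, 3);
        conf.set(TaskManagerOptions.MANAGED_MEMORY_SIZE, MemorySize.parse("1536m"));  // ~1.5 GB managed memory
        // The 400 MB heap figure is framework + task heap combined; the split below is approximate.
        conf.set(TaskManagerOptions.FRAMEWORK_HEAP_MEMORY, MemorySize.parse("128m"));
        conf.set(TaskManagerOptions.TASK_HEAP_MEMORY, MemorySize.parse("272m"));
        conf.setString("state.backend", "rocksdb");            // EmbeddedRocksDBStateBackend
        conf.setString("state.backend.incremental", "true");   // incremental checkpointing
        conf.setString("state.checkpoints.dir", "s3://<bucket>/checkpoints");  // placeholder bucket

        // Only meaningful for a local MiniCluster run; on the cluster the same keys go in flink-conf.yaml.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
    }
}
```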

What factors might be causing these performance inconsistencies in my Flink stream processing setup?

Hey, have you tried tweaking your RocksDB cache settings? Your reads look really heavy. Maybe bump up the background threads or adjust the block cache share? The memory config seems like it might be your bottleneck :thinking: Want to share more about your exact processing logic?
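
Something like this is where I'd start, just as a sketch (the numbers are guesses against your 1.5 GB of managed memory, not tested values; the same keys can go straight into flink-conf.yaml):

```java
import org.apache.flink.configuration.Configuration;

// Sketch of RocksDB memory-related knobs; values are guesses, not tested recommendations.
public class RocksDbMemoryTuning {
    public static Configuration rocksDbReadTuning() {
        Configuration conf = new Configuration();
        // Keep RocksDB inside Flink's managed memory (the default) so the 1.5 GB budget is respected.
        conf.setString("state.backend.rocksdb.memory.managed", "true");
        // Shift the managed-memory split away from write buffers and toward the block cache (default ratio is 0.5).
        conf.setString("state.backend.rocksdb.memory.write-buffer-ratio", "0.4");
        // Reserve part of the cache for index/filter blocks so they are not evicted by data blocks.
        conf.setString("state.backend.rocksdb.memory.high-prio-pool-ratio", "0.2");
        // Bloom filters let point lookups (ValueState#value -> RocksDB.get) skip SST files that cannot contain the key.
        conf.setString("state.backend.rocksdb.use-bloom-filter", "true");
        return conf;
    }
}
```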

Based on the performance symptoms you've described, I recommend diving deeper into RocksDB tuning and Flink configuration. The intermittent throughput drops suggest potential issues with state management and resource allocation.

Specifically, consider adjusting RocksDB's compaction and memory configuration. Time spent almost entirely in `RocksDB.get()` usually means point lookups are missing the block cache and going to disk, and that gets worse as uncompacted SST files pile up. You might want to experiment with the column family settings, increase the background thread count for compaction, and adopt a more aggressive caching strategy.
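
For example, a custom `RocksDBOptionsFactory` along those lines might look like the following sketch (untested; the class name and values are placeholders, and with managed memory enabled Flink still controls the block cache itself, so this only touches threads and bloom filters):

```java
import java.util.Collection;

import org.apache.flink.contrib.streaming.state.RocksDBOptionsFactory;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.BloomFilter;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

// Untested sketch of an options factory; the values are starting points, not recommendations.
public class TunedRocksDbOptions implements RocksDBOptionsFactory {

    @Override
    public DBOptions createDBOptions(DBOptions current, Collection<AutoCloseable> handlesToClose) {
        // More background jobs so flushes/compactions keep up instead of piling up read amplification.
        return current.setMaxBackgroundJobs(4);
    }

    @Override
    public ColumnFamilyOptions createColumnFamilyOptions(
            ColumnFamilyOptions current, Collection<AutoCloseable> handlesToClose) {
        // Bloom filters help the point lookups behind ValueState#value() skip irrelevant SST files.
        BloomFilter bloomFilter = new BloomFilter(10, false);
        handlesToClose.add(bloomFilter);                  // let Flink close the native handle
        BlockBasedTableConfig table = new BlockBasedTableConfig();
        table.setFilterPolicy(bloomFilter);
        return current.setTableFormatConfig(table);
    }
}
```

You would wire this in with `new EmbeddedRocksDBStateBackend(true)` plus `setRocksDBOptions(...)` and `env.setStateBackend(...)`, or point `state.backend.rocksdb.options-factory` at the class in the configuration.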

Additionally, verify your checkpoint interval and S3 storage configuration; network latency or checkpoint creation overhead could be contributing to the fluctuations. Monitoring RocksDB's native metrics through Flink's metric reporters (JMX, for example), or enabling more granular logging, should give deeper insight into where the contention actually is.
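
For instance, RocksDB's native metrics can be enabled per property through the configuration. A small selection like the following sketch is usually enough, since every enabled metric adds some overhead (the same keys also work in flink-conf.yaml):

```java
import org.apache.flink.configuration.Configuration;

// Sketch: enable a handful of RocksDB native metrics to inspect during the slow phases.
public class RocksDbMetricsConfig {
    public static Configuration nativeMetrics() {
        Configuration conf = new Configuration();
        conf.setString("state.backend.rocksdb.metrics.block-cache-usage", "true");                 // how full the block cache is
        conf.setString("state.backend.rocksdb.metrics.estimate-num-keys", "true");                 // estimated keys per column family
        conf.setString("state.backend.rocksdb.metrics.num-running-compactions", "true");           // compaction activity
        conf.setString("state.backend.rocksdb.metrics.estimate-pending-compaction-bytes", "true"); // compaction backlog driving read amplification
        conf.setString("state.backend.rocksdb.metrics.is-write-stopped", "true");                  // whether RocksDB is stalling writes
        return conf;
    }
}
```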

Hey, check your RocksDB write-ahead log settings! You're using incremental checkpointing with S3, so maybe log compression or sync frequency is causing the latency? Try tweaking the background threads and see if it helps performance :rocket:
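
If it helps, both the background thread count and the checkpoint cadence are plain configuration; a sketch like this, with values that are just guesses for your job:

```java
import org.apache.flink.configuration.Configuration;

// Sketch: RocksDB background threads plus checkpoint cadence; values are guesses, same keys work in flink-conf.yaml.
public class CheckpointAndCompactionConfig {
    public static Configuration sketch() {
        Configuration conf = new Configuration();
        conf.setString("state.backend.rocksdb.thread.num", "4");      // flush/compaction background threads
        conf.setString("execution.checkpointing.interval", "1 min");  // how often incremental checkpoints start
        conf.setString("execution.checkpointing.min-pause", "30 s");  // breathing room between checkpoints
        conf.setString("execution.checkpointing.timeout", "10 min");  // slow S3 uploads fail the checkpoint instead of hanging
        return conf;
    }
}
```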

Hmm, interesting problem! Have you considered that your key distribution might be causing these odd bottlenecks? RocksDB can get messy with unevenly spread data. Want to share more about your keying strategy? Curious how you're processing those records :thinking:
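
One quick way to check the spread offline, for what it's worth: a sketch like this, where `sampleKeys` stands in for a sample of your real keys and `maxParallelism` is 128 unless you changed it:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

// Sketch: histogram of how sampled keys land on subtasks; heavy skew here means hot task slots.
public class KeySkewCheck {
    public static Map<Integer, Long> subtaskHistogram(List<String> sampleKeys, int maxParallelism, int parallelism) {
        Map<Integer, Long> perSubtask = new TreeMap<>();
        for (String key : sampleKeys) {
            int keyGroup = KeyGroupRangeAssignment.assignToKeyGroup(key, maxParallelism);
            int subtask = KeyGroupRangeAssignment.computeOperatorIndexForKeyGroup(maxParallelism, parallelism, keyGroup);
            perSubtask.merge(subtask, 1L, Long::sum);
        }
        return perSubtask;
    }
}
```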