StreamShield: A Production-Proven Resiliency Solution for Apache Flink at ByteDance
Abstract
Distributed Stream Processing Systems (DSPSs) form the backbone of real-time processing and analytics at ByteDance, where Apache Flink powers one of the largest production clusters worldwide. Meeting strict Service Level Objectives (SLOs) requires both resiliency, the ability to withstand and rapidly recover from failures, and operational stability, meaning consistent and predictable performance under normal conditions. However, achieving resiliency and stability in large-scale production environments remains challenging because of cluster scale, business diversity, and significant operational overhead. In this work, we present StreamShield, a production-proven resiliency solution deployed in ByteDance's Flink clusters. Designed from the complementary perspectives of the engine and the cluster, StreamShield introduces key techniques to enhance resiliency: runtime optimization, fine-grained fault tolerance, a hybrid replication strategy, and high availability with respect to external systems. Furthermore, StreamShield proposes a testing and deployment pipeline that ensures reliability and robustness in production releases. Extensive evaluations on a production cluster demonstrate the efficiency and effectiveness of the techniques proposed by StreamShield.