• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (02): 237-245.

• High Performance Computing •

A dynamic watermark adjustment strategy in Flink cluster

Lv He-xuan(1,2,3), HUANG Shan(1,2,3), Alkam·Zabibul(1,2,3), WU Si-heng(1,2,3), DUAN Xiao-dong(1,2,3)

  (1. College of Computer Science and Engineering, Dalian Minzu University, Dalian 116600;
   2. State Ethnic Affairs Commission Key Laboratory of Big Data Applied Technology, Dalian 116600;
   3. Dalian Key Laboratory of Digital Technology for National Culture, Dalian 116600, China)
  • Received: 2022-09-14 Revised: 2022-10-28 Accepted: 2023-02-25 Online: 2023-02-25 Published: 2023-02-15

Abstract: Real-time performance and accuracy are two of the most important metrics for evaluating data mining on big data. Stream data flows from the point of generation into a message queue and then, through a data source, into Flink for computation. Because network transmission speeds and the computing performance of nodes differ, the order in which records enter the computing framework can partially diverge from the order of their event times. For stream data whose degree of disorder is uncertain, the traditional watermark mechanism for window operations cannot balance the real-time performance and the accuracy of the results. To solve this problem, a stream data microcluster model is established. Based on the local out-of-order degree of stream data event times, a local out-of-order degree algorithm computes a value representing the disorder of the stream at the current moment. A dynamic watermark adjustment strategy is then designed to adjust the watermark according to this degree of disorder. Finally, the strategy, based on event-time windows, is implemented in the Apache Flink framework. Experimental results show that it can effectively balance the accuracy and real-time performance of window operations on stream data whose degree of disorder fluctuates or is uncertain.

Key words: Apache Flink, watermark, heterogeneous environment, event time
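
The abstract does not give the microcluster model or the local out-of-order degree algorithm, so the following Python sketch only illustrates the general idea of adapting a watermark delay to observed disorder. It rests on assumptions that are not from the paper: an event's lateness is taken to be the gap between the largest event time seen so far and the event's own time, and the watermark delay is the largest lateness within a small sliding history. The class name `DynamicWatermark` and the parameters `history` and `min_delay` are hypothetical.

```python
from collections import deque


class DynamicWatermark:
    """Illustrative only: NOT the paper's algorithm (see assumptions above)."""

    def __init__(self, history=3, min_delay=0):
        self.max_event_time = None                   # largest event time seen so far
        self.recent_lateness = deque(maxlen=history)  # sliding history of lateness values
        self.min_delay = min_delay                   # lower bound on the watermark delay
        self.watermark = None                        # last emitted watermark

    def observe(self, event_time):
        """Record one event and return the current watermark."""
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time

        # Assumed disorder measure: how far this event lags the newest one.
        lateness = self.max_event_time - event_time
        self.recent_lateness.append(lateness)

        # The delay widens when recent events were late and shrinks
        # once late events fall out of the sliding history.
        delay = max(self.min_delay, max(self.recent_lateness))
        candidate = self.max_event_time - delay

        # A watermark must never move backwards.
        if self.watermark is None or candidate > self.watermark:
            self.watermark = candidate
        return self.watermark


def watermarks_for(event_times, history=3):
    """Convenience helper: feed a sequence of event times, collect watermarks."""
    wm = DynamicWatermark(history=history)
    return [wm.observe(t) for t in event_times]
```

With `history=3` and event times 10, 12, 11, 15, 9, 20, 21, 22, the emitted watermarks are 10, 12, 12, 14, 14, 14, 15, 22: the delay widens after the late event 9 and collapses once the stream is ordered again. A fixed delay would have to choose between the two extremes up front, which is the trade-off between accuracy and real-time performance that the abstract describes. In Flink itself, logic of this kind would sit inside a custom `WatermarkGenerator`; that wiring is omitted here.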