A dynamic watermark adjustment strategy in Flink cluster

Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (02): 237-245.

Lv He-xuan1,2,3, HUANG Shan1,2,3, Alkam·Zabibul1,2,3, WU Si-heng1,2,3, DUAN Xiao-dong1,2,3

(1. College of Computer Science and Engineering, Dalian Minzu University, Dalian 116600, Liaoning, China;
2. State Ethnic Affairs Commission Key Laboratory of Big Data Applied Technology, Dalian 116600, Liaoning, China;
3. Dalian Key Laboratory of Digital Technology for National Culture, Dalian 116600, Liaoning, China)

Received: 2022-09-14  Revised: 2022-10-28  Accepted: 2023-02-25  Online: 2023-02-25  Published: 2023-02-15
Funding: National Key Research and Development Program of China (2018YFB1004402)

Abstract

Timeliness and accuracy are the two most important metrics for big-data mining tasks. Stream data travels from its producers to a message queue and then through a data source into Flink for computation. Because network transmission speeds and node computing performance differ along the way, the order in which records enter the computing framework can deviate locally from the order of their event times. When the degree of disorder in the stream is not known in advance, the traditional watermark mechanism for window jobs cannot deliver timely and accurate results at the same time. To address this problem, a stream data micro-cluster model is established. A local out-of-order degree algorithm uses the local disorder of event times within a stream data micro-cluster to compute an out-of-order degree that represents the stream at the current moment. A dynamic watermark adjustment strategy is then designed so that the watermark is adjusted according to the degree of disorder of the stream. Finally, the dynamic watermark adjustment strategy for event-time windows is implemented in the Apache Flink framework. Experimental results show that, for stream data whose disorder is elastic or uncertain, the strategy effectively balances the accuracy and timeliness of window jobs.

Key words: Apache Flink, watermark, out-of-order stream data, event time
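
The idea summarized in the abstract, a watermark whose lag follows the disorder recently observed in the stream, can be illustrated with a small sketch. The abstract does not give the paper's micro-cluster model or its local out-of-order degree formula, so everything below is an assumption made for illustration: the class name `DynamicWatermark`, the parameter `cluster_size`, and the choice of "maximum lateness among the last few events" as the disorder measure are hypothetical and are not the authors' algorithm. The sketch uses plain Python rather than the Flink API.

```python
from collections import deque


class DynamicWatermark:
    """Illustrative sketch only (not the paper's algorithm).

    The watermark trails the largest event time seen so far by a delay
    equal to the maximum lateness among the most recent events.
    """

    def __init__(self, cluster_size=4, min_delay=0):
        # Hypothetical stand-in for a "micro-cluster": lateness of recent events.
        self.lateness = deque(maxlen=cluster_size)
        self.min_delay = min_delay
        self.max_event_time = None
        self.watermark = None

    def observe(self, event_time):
        """Record one arriving event and return the current watermark."""
        if self.max_event_time is None:
            self.max_event_time = event_time
        # Lateness: how far this event lags the largest event time seen before it.
        self.lateness.append(max(0, self.max_event_time - event_time))
        self.max_event_time = max(self.max_event_time, event_time)

        # Assumed disorder measure: worst lateness in the recent window.
        delay = max(self.min_delay, max(self.lateness))
        candidate = self.max_event_time - delay

        # A watermark must never move backwards.
        if self.watermark is None or candidate > self.watermark:
            self.watermark = candidate
        return self.watermark


if __name__ == "__main__":
    wm = DynamicWatermark(cluster_size=4)
    for t in [10, 12, 11, 15, 13, 20, 21, 22, 23, 24]:
        print(t, wm.observe(t))
```

In this sketch the delay grows while late events are still inside the recent window and shrinks back once the stream is ordered again, which is the trade-off between accuracy (larger delay) and timeliness (smaller delay) that the abstract describes.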

Lv He-xuan, HUANG Shan, Alkam·Zabibul, WU Si-heng, DUAN Xiao-dong. A dynamic watermark adjustment strategy in Flink cluster[J]. Computer Engineering & Science, 2023, 45(02): 237-245.
