• 中国计算机学会会刊
  • 中国科技核心期刊
  • 中文核心期刊

计算机工程与科学 ›› 2023, Vol. 45 ›› Issue (02): 237-245.

• 高性能计算 • 上一篇    下一篇

Flink水位线动态调整策略

吕鹤轩1,2,3,黄山1,2,3,艾力卡木·再比布拉1,2,3,吴思衡1,2,3,段晓东1,2,3    

  1. (1.大连民族大学计算机科学与工程学院,辽宁 大连 116600;2.大数据应用技术国家民委重点实验室,辽宁 大连 116600;
    3.大连市民族文化数字技术重点实验室,辽宁 大连 116600)

  • 收稿日期:2022-09-14 修回日期:2022-10-28 接受日期:2023-02-25 出版日期:2023-02-25 发布日期:2023-02-15
  • 基金资助:
    国家重点研发计划(2018YFB1004402)

A dynamic watermark adjustment strategy in Flink cluster

Lv He-xuan1,2,3,HUANG Shan1,2,3,Alkam·Zabibul1,2,3,WU Si-heng1,2,3,DUAN Xiao-dong1,2,3    

  1. (1.College of Computer Science and Engineering,Dalian Minzu University,Dalian 116600;
    2.State Ethnic Affairs Commission Key Laboratory of Big Data Applied Technology,Dalian 116600;
    3.Dalian Key Laboratory of Digital Technology for National Culture,Dalian 116600,China)
  • Received:2022-09-14 Revised:2022-10-28 Accepted:2023-02-25 Online:2023-02-25 Published:2023-02-15

摘要: 衡量大数据的数据挖掘性能有2个最重要的任务指标:一是实时性,二是准确性。流数据从数据产生到消息队列再通过数据源流入Flink进行计算,这个过程中因为网络传输速度不同,不同节点的计算性能不同等原因,流数据进入计算框架的先后顺序和数据产生的事件时间顺序会有局部乱序的现象。面对窗口作业的传统水位线机制在不确定乱序程度的流数据情况下无法同时兼顾作业结果的实时性和准确性。针对这个问题,建立了流数据微簇模型。通过局部乱序度算法,根据流数据微簇的流数据事件时间局部乱序程度计算出可以代表当前时刻流数据的乱序度。设计了水位线动态调整策略,使水位线根据流数据的乱序程度动态调整大小。最后,在Apache Flink框架中对基于事件时间窗口的水位线动态调整策略进行了实现。实验结果表明,弹性或不确定乱序流数据条件下,基于事件时间窗口的水位线动态调整策略可以有效地同时兼顾窗口作业的准确性和实时性。

关键词: Apache Flink, 水位线;乱序流数据, 事件时间

Abstract: Two of the most important task metrics that measure data-mining performance specific to big data: one is real-time and the other is accuracy. The stream data flows from data generation to message queue and then into Flink through data source for calculation. In this process, due to different network transmission speed and different computing performance of different nodes, the sequence of stream data entering the computing framework and the time sequence of events generated by data will be partially out of order. The traditional watermark mechanism for window-facing operations cannot consider the real-time performance and accuracy of the operation results in the case of streaming data with uncertain out-of-order degree. To solve this problem, a stream data microcluster model is established. Based on the local out-of-order degree of stream data event time, the out-of-order degree of stream data representing the current moment is calculated by the local out-of-order degree algorithm. A dynamic watermark adjustment strategy is designed to adjust the watermark dynamically according to the degree of flow data disorder. Finally, the dynamic watermark adjustment strategy based on event time window is implemented in Apache Flink framework. Experimental results show that the dynamic watermark adjustment strategy based on event time window can effectively consider the accuracy and real-time performance of window operation under the condition of elastic or uncertain chaotic flow data. 

Key words: Apache Flink, watermark, heterogeneous environment, event time