• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science, 2024, Vol. 46, Issue 09: 1554-1565.

• High Performance Computing •

Research on key technologies of distributed training for Level2 market quotation factor mining

ZHAO Xin-bo (1,2,3), LU Zhong-hua (2)

  1. School of Public Security Information Technology and Intelligence, Criminal Investigation Police University of China, Shenyang, Liaoning 110854, China
  2. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  3. University of Chinese Academy of Sciences, Beijing 100049, China

  • Received: 2023-10-30 Revised: 2023-12-12 Accepted: 2024-09-25 Online: 2024-09-25 Published: 2024-09-19
  • Funding: Beijing Natural Science Foundation (4232039)

Abstract: Level2 market quotation data is the new generation of real-time market data products from the Shanghai and Shenzhen Stock Exchanges. As an upgraded version of basic market data, it is currently the market data in China with the highest information density, the richest information content, and the least thorough exploitation, and it is of significant value for uncovering potential risks in the securities market. However, existing research lacks risk measurement and computational analysis of the securities market based on Level2 data. Moreover, the full-market Level2 data set is large, and the deep learning models used to extract information from it are becoming increasingly complex; although hardware computing power keeps improving, long training times and low training efficiency remain unsolved. Therefore, based on Level2 market quotation data for the CSI 300 constituent stocks, this paper uses deep learning and other methods to mine high-frequency volatility factors and builds a high-frequency volatility prediction model based on TabNet and LightGBM. It also proposes Parallel_DE, a distributed training algorithm based on parallel differential evolution that is used for parameter calculation during distributed model training, and describes its scenario mapping scheme and overall process design in detail. Both pieces of work are validated on the authors' own distributed training platform. The experimental results show that the high-frequency volatility prediction model predicts realized volatility with high accuracy and outperforms the other methods compared to some degree; the Parallel_DE algorithm effectively reduces the error of local parameters on the test set while retaining some parameter diversity, so that well-performing deep learning models can be trained efficiently in a distributed manner. This paper thus provides technologies and methods based on Level2 market quotation data for risk identification in the securities market.
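
For readers unfamiliar with the prediction target, realized volatility over a window is conventionally defined as the square root of the sum of squared log returns within that window. The sketch below illustrates only that textbook definition; the abstract does not say which price series or sampling interval the authors use, so the function names and the sample prices are hypothetical.

```python
import math
from typing import List, Sequence


def log_returns(prices: Sequence[float]) -> List[float]:
    """Log return between each pair of consecutive prices."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


def realized_volatility(prices: Sequence[float]) -> float:
    """Square root of the sum of squared log returns over the window."""
    return math.sqrt(sum(r * r for r in log_returns(prices)))


# Hypothetical prices sampled within one window (illustrative values only,
# not data from the paper).
window = [100.0, 101.0, 100.0, 100.5]
rv = realized_volatility(window)
print(f"realized volatility: {rv:.6f}")
```

In a setting like the one the abstract describes, a model would presumably be trained to predict this quantity for a future window from features of the current one; that framing is an assumption here, not a detail stated in the abstract.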


Key words: Level2 market quotation, realized volatility, distributed training, differential evolution
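
As background for the differential evolution keyword, the following is a minimal sketch of one generation of classic differential evolution (the DE/rand/1/bin variant with greedy selection). It is not the paper's Parallel_DE algorithm, whose details the abstract does not give. Treating each individual as one worker's local parameter vector and using a test-set error as the fitness is an assumption made for illustration, and `toy_loss`, the population size, and the values of `f` and `cr` are all hypothetical.

```python
import random
from typing import Callable, List

Vector = List[float]


def de_generation(population: List[Vector],
                  loss: Callable[[Vector], float],
                  rng: random.Random,
                  f: float = 0.5,
                  cr: float = 0.9) -> List[Vector]:
    """One DE/rand/1/bin generation with greedy selection.

    Requires at least four individuals in the population.
    """
    dim = len(population[0])
    next_population = []
    for i, target in enumerate(population):
        # Pick three distinct individuals other than the target.
        others = [p for j, p in enumerate(population) if j != i]
        a, b, c = rng.sample(others, 3)
        # Mutation: perturb a by the scaled difference of b and c.
        mutant = [a[k] + f * (b[k] - c[k]) for k in range(dim)]
        # Binomial crossover; j_rand guarantees one mutant component.
        j_rand = rng.randrange(dim)
        trial = [mutant[k] if (rng.random() < cr or k == j_rand) else target[k]
                 for k in range(dim)]
        # Greedy selection: keep the trial only if it is no worse.
        next_population.append(trial if loss(trial) <= loss(target) else target)
    return next_population


def toy_loss(w: Vector) -> float:
    """Hypothetical stand-in for a model's error on a test set."""
    optimum = [1.0, -2.0, 0.5]
    return sum((wi - oi) ** 2 for wi, oi in zip(w, optimum))


rng = random.Random(0)
# Eight hypothetical "workers", each holding its own parameter vector.
population = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(8)]
initial_best = min(toy_loss(w) for w in population)
for _ in range(50):
    population = de_generation(population, toy_loss, rng)
final_best = min(toy_loss(w) for w in population)
print(f"best loss: {initial_best:.4f} -> {final_best:.4f}")
```

Because selection is greedy per individual, no individual's loss can increase from one generation to the next, while the population as a whole still holds several distinct parameter vectors. The sketch runs serially; in a distributed setting the loss evaluations could be spread across workers, which is again an assumption and not a description of the authors' platform.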