• 中国计算机学会会刊
  • 中国科技核心期刊
  • 中文核心期刊

计算机工程与科学 ›› 2023, Vol. 45 ›› Issue (08): 1383-1392.

• 高性能计算 • 上一篇    下一篇

海量地铁乘客轨迹相似性连接方法:以深圳地铁为例

王星苏1,熊文1,张瑞2   

  1. (1.云南师范大学信息学院,云南 昆明 650500;2.深圳北斗应用技术研究院有限公司,广东 深圳 518000)
  • 收稿日期:2022-06-09 修回日期:2022-09-23 接受日期:2023-08-25 出版日期:2023-08-25 发布日期:2023-08-18
  • 基金资助:
    国家自然科学基金(61862066)

A massive subway passenger trajectory similarity connection method:A case study of Shenzhen metro 

WANG Xing-su1,XIONG Wen1,ZHANG Rui2   

  1. (1.School of Information Science and Technology,Yunnan Normal University,Kunming 650500;
    2.Shenzhen Institute of Beidou Applied Technology,Shenzhen 518000,China)
  • Received:2022-06-09 Revised:2022-09-23 Accepted:2023-08-25 Online:2023-08-25 Published:2023-08-18

摘要: 当前主要的轨迹相似性连接方法都以GPS轨迹为研究对象。针对GPS轨迹的优化方法无法直接用于解决地铁乘客轨迹相似性连接的问题,充分利用地铁乘客轨迹的时空特征,借助轨迹的重复性和对称性,将轨迹从点序列转化为OD序列。以OD序列为基础的轨迹,长度是原轨迹的一半,对应的索引空间变小,后续的计算量也随之减少。着重研究了基于PPJoin+的轨迹连接算法在Spark平台上的设计与实现。在一个13结点Spark集群和一个包含500万个乘客轨迹集合(5.6亿条刷卡记录)的超大规模数据集上验证了该算法的有效性。实验结果显示,基于OD序列的PPJoin+算法的执行时间为14.0 min,比默认的点序列轨迹连接算法的节约了62.5%,比Dima连接算法的节约了78.2%,并表现出了良好的可扩展性。

关键词: 轨迹相似性, 地铁系统, PPJoin+, Spark

Abstract: The current main trajectory similarity connection methods are based on GPS trajectories. Optimization methods for GPS trajectories cannot be directly applied to the problem of connecting subway passenger trajectories. By fully utilizing the spatiotemporal characteristics of subway passenger tra- jectories and leveraging the trajectorys repetitiveness and symmetry, the trajectory is transformed from a point sequence to an OD sequence to reduce the trajectory length and save storage space. This paper focuses on the design and implementation of the trajectory connection algorithm based on PPJoin+ on the Spark platform. The method is validated on a 13-node Spark cluster and a large-scale dataset contain- ing 5 million passenger trajectories (560 million tap records collected in two consecutive months). The experimental results show that the PPJoin+ algorithm based on OD sequence only takes 14.0 minutes, which saves 62.5% of the execution time compared to the default point sequence trajectory connection algorithm and 78.2% of the execution time compared to the Dima connection algorithm, and exhibits good scalability.

Key words: trajectory similarity, metro system, PPJoin+, Spark ,