  • Official journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2026, Vol. 48 ›› Issue (3): 521-530.

• Graphics and Images •

A 3D human pose estimation method integrating semantic graph convolutional network and self-attention mechanism

TONG Lijing, YING Yizhuo, CAO Nan

  1. (School of Artificial Intelligence and Computer Science, North China University of Technology, Beijing 100144, China)

  • Published: 2026-03-25 Online: 2026-03-25
  • Funding:
    Young Top-Notch Talent Training Program for Beijing Municipal Universities (CIT&TCD201904009)

Abstract: Three-dimensional (3D) human pose estimation methods often struggle to capture the global features of human joint sequences, which limits estimation accuracy. To address this problem, a 3D human pose estimation method that integrates a semantic graph convolutional network with a self-attention mechanism is proposed. First, to improve feature extraction when mapping a 2D human pose sequence to a 3D human pose sequence, a self-attention mechanism is incorporated into the semantic graph convolutional network, so that spatial features are extracted by fusing local and global features. Second, the channel-mixing module of the MLP-Mixer network is improved by introducing a semantic graph convolutional network and a U-shaped MLP structure for temporal feature extraction. Finally, 3D human pose estimation is performed based on the fused features from 2D human images and the extracted temporal features. In experiments on the Human3.6M dataset, the proposed method is compared with current mainstream 3D human pose estimation methods. Relative to the second-best method, it reduces the mean error metrics MPJPE and PA-MPJPE by approximately 4.5 mm and 0.2 mm, respectively, which supports the effectiveness of the proposed method.

Key words: 3D human pose estimation; semantic graph convolutional network; MLP-Mixer model; self-attention mechanism
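
The abstract reports results in MPJPE and PA-MPJPE without restating how these metrics are defined. The sketch below follows their commonly used definitions: MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and PA-MPJPE is the same error after a rigid Procrustes alignment (scale, rotation, translation) of the prediction to the ground truth. It is an illustration only, not the authors' evaluation code. The function names, the `(J, 3)` array layout, and the 17-joint skeleton in the usage lines are assumptions made for this example.

```python
import numpy as np


def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-joint position error for one pose.

    pred, gt: arrays of shape (J, 3), in the same unit (e.g. mm).
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def procrustes_align(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Least-squares similarity alignment of pred onto gt (Umeyama/Kabsch)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g

    # The SVD of the 3x3 cross-covariance yields the optimal rotation.
    u, s, vt = np.linalg.svd(p.T @ g)

    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    signs = np.array([1.0, 1.0, d])
    rot = vt.T @ np.diag(signs) @ u.T

    # Optimal isotropic scale for the centred point sets.
    scale = (s * signs).sum() / (p ** 2).sum()

    return scale * p @ rot.T + mu_g


def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """MPJPE after Procrustes alignment of the prediction to the ground truth."""
    return mpjpe(procrustes_align(pred, gt), gt)


if __name__ == "__main__":
    # Synthetic poses; a 17-joint skeleton is assumed for illustration.
    rng = np.random.default_rng(0)
    gt_pose = rng.normal(scale=300.0, size=(17, 3))
    pred_pose = gt_pose + rng.normal(scale=20.0, size=(17, 3))
    print("MPJPE   :", mpjpe(pred_pose, gt_pose))
    print("PA-MPJPE:", pa_mpjpe(pred_pose, gt_pose))
```

Because the alignment removes global scale, rotation, and translation, PA-MPJPE measures only the error in the articulated pose itself. For a dataset-level score, the per-pose values would be averaged over all evaluated frames.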