  • Journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Current Issue

    • High Performance Computing
      NM-SpMM: A semi-structured sparse matrix multiplication algorithm for domestic heterogeneous vector processors
      JIANG Jing-fei, HE Yuan-hong, XU Jin-wei, XU Shi-yao, QIAN Xi-fu
      2024, 46(07): 1141-1150.
      Deep neural networks have achieved excellent results in natural language processing, computer vision, and other fields. With the growing scale of data processed by intelligent applications and the rapid development of large models, ever more is demanded of deep neural network inference performance. The N:M semi-structured sparsity scheme has become a popular technique for balancing computing-power demand against application quality. The domestic heterogeneous vector processor FT-M7032 offers ample room for exploiting data-level and instruction-level parallelism in intelligent model processing. To address the challenge of computing N:M semi-structured sparse models with varying sparsity patterns, a flexibly configurable sparse matrix multiplication algorithm, NM-SpMM, is proposed for the FT-M7032. NM-SpMM designs an efficient compressed offset address (COA) sparse encoding format, which shields sparse data access from the choice of semi-structured parameters. Building on COA, NM-SpMM applies fine-grained optimizations to sparse matrix multiplication along its different dimensions. Experimental results on a single FT-M7032 core show that NM-SpMM achieves a 1.73x-21.00x speedup over dense matrix multiplication, and 0.04x-1.04x the performance of an NVIDIA V100 GPU running cuSPARSE.
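      For readers unfamiliar with the scheme, a minimal NumPy sketch of N:M pruning is given below; it illustrates only the generic sparsity pattern (here 2:4), not the paper's COA encoding or its FT-M7032 kernels:

```python
import numpy as np

def nm_prune(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Zero out all but the n largest-magnitude values in every group
    of m consecutive elements along each row (e.g. 2:4 sparsity)."""
    rows, cols = weights.shape
    assert cols % m == 0, "column count must be divisible by m"
    groups = weights.reshape(rows, cols // m, m)
    # Indices of the (m - n) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(groups), axis=-1)[:, :, : m - n]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=-1)
    return pruned.reshape(rows, cols)

w = np.random.randn(4, 8).astype(np.float32)
print(nm_prune(w, n=2, m=4))  # every 4-element group keeps 2 nonzeros
```

      With n=2 and m=4 this yields the common 2:4 pattern targeted by semi-structured hardware support.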

      A codelet model based on MLIR
      LI Jin-xi, YIN Shou-yi, WEI Shao-jun, HU Yang
      2024, 46(07): 1151-1157.
      Thanks to the instruction set architecture (ISA), the software and hardware communities have been able to develop independently for years. However, with the advent of multi-core accelerators, the sequential programming model based on the von Neumann architecture is running into trouble. Rooted in a sequential execution model, the ISA lacks support for parallel multi-core hardware, so the ISA alone can no longer decouple software from hardware. A new program execution model (PXM) is needed to bridge sequentially executed programming platforms and parallel multi-core hardware backends in the end-to-end compilation of neural networks, and to further exploit the optimization opportunities offered by parallel hardware. This paper proposes a codelet model as such a PXM, providing a general abstraction for lowering sequentially executed programs onto parallel hardware and further decoupling the software frontend from the hardware backend on top of the instruction set. To ensure reusability, this paper implements the codelet model as a codelet dialect within the MLIR compiler framework proposed by Google. MLIR aims to integrate fragmented compiler ecosystems and improve the reusability of frontend-to-backend integration flows; the codelet model implemented in MLIR in this paper can further enhance the reusability of the MLIR ecosystem.


      MiniBranRAP: A branch-minimizing parallel algorithm for coarse-grid matrix computation in AMG solvers
      DU Hao, MAO Run-zhang, DENG Yun-tong, HUANG Si-lu, XU Xiao-wen
      2024, 46(07): 1158-1166.
      Algebraic multigrid (AMG) is one of the most commonly used algorithms for solving large-scale sparse linear systems in scientific and engineering computing and industrial simulation. For each grid level in the Setup phase, AMG must compute the coarse-grid matrix Ac = RAP as the product of three sparse matrices: the restriction operator R, the current fine-grid matrix A, and the interpolation operator P. This triple product has become the main bottleneck in the parallel performance of AMG. This paper first shows that the performance bottleneck of the RAP parallel algorithm in mainstream AMG solvers stems from the quadratic complexity of its branch judgments. It then exploits the row-ordered structure of the CSR sparse matrix format to propose a RAP parallel algorithm, MiniBranRAP, whose branch-judgment count grows only linearly. The algorithm is integrated into the JXPAMG solver, and its effectiveness is verified on practical examples. Numerical results show that, for 6 typical examples from practical applications, the JXPAMG solver based on MiniBranRAP speeds up the Setup phase by an average of 3.3 times and by up to 9.3 times on 28 processors compared with the latest version of Hypre's BoomerAMG solver.
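      For reference, the Galerkin triple product computed in the Setup phase can be written in a few lines with SciPy's sparse matrices; this sketches only the operation Ac = RAP itself (with a toy interpolation operator), not MiniBranRAP's branch-minimizing kernel:

```python
import numpy as np
import scipy.sparse as sp

n_fine, n_coarse = 8, 4
A = sp.random(n_fine, n_fine, density=0.3, format="csr", random_state=0)
# Toy interpolation: each coarse point feeds two fine points.
P = sp.csr_matrix(np.kron(np.eye(n_coarse), np.ones((2, 1))))
R = P.T.tocsr()          # common choice: restriction as the transpose of P

Ac = R @ A @ P           # coarse-grid operator, the RAP bottleneck
print(Ac.toarray())
```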

      Exploration of memory page size for high-density flash memory
      YU Ding-cui, LUO Long-fei, SONG Yun-peng, LI Wen-tong, SHI Liang
      2024, 46(07): 1167-1174.
      In recent years, solid-state drives (SSDs) have developed rapidly towards higher bandwidth and larger capacity. To expand SSD capacity, the size of flash memory pages has increased from 4 KB to 16 KB. However, operating systems still issue read/write requests to SSDs at a 4 KB memory-page granularity, making it difficult for applications to fully utilize the high bandwidth of SSDs. Increasing the memory page size to align the granularity of I/O requests issued by the operating system with the SSD's flash read/write operations is a potential solution. This paper delves into the effects of memory page size on system I/O performance and SSD lifetime for the first time: the memory page size is set to 16 KB, benchmarks are run, and the results are compared with those obtained using 4 KB memory pages. The key findings are as follows: (1) 16 KB memory pages exhibit better read performance; (2) the write granularity of applications determines the performance of 16 KB memory pages; (3) 16 KB memory pages amplify the impact of invalid data within pages on SSD lifetime.
      An SpMV method for irregularly shaped sparse matrices
      SHI Yu, DONG Pan, ZHANG Li-jun
      2024, 46(07): 1175-1184.
      Sparse matrix-vector multiplication (SpMV) is one of the key operators in high performance computing and also has significant applications in emerging deep learning domains. Existing research on SpMV often focuses on square sparse matrices, while irregularly shaped sparse matrices (with unequal numbers of rows and columns) still lack in-depth exploration. Unequal row and column counts give these matrices storage characteristics different from those of square sparse matrices, leaving room for further optimization. Therefore, this paper establishes an SpMV performance model for irregularly shaped sparse matrices and shows that the performance bottleneck lies in insufficient bandwidth for data exchange between cache and memory. The paper then carries out two optimizations: (1) Based on the commonly used CSR storage format, a new RCSR storage format is proposed, which transforms and compresses a performance-limiting array of the CSR format, making SpMV more efficient; (2) An optimized SpMV algorithm based on RCSR is designed in conjunction with the SIMD instruction-set extension of domestic processors. Tests on regular and irregular sparse matrices are conducted on domestic Phytium processors. For regular sparse matrices, combining the RCSR format, SIMD instructions, and OpenMP parallelization increases GFLOPS by 83.35% on average. For irregular sparse matrices, the performance improvement is related to the row-to-column ratio: the more the numbers of rows and columns differ, the more pronounced the optimization effect.
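      As a baseline for this discussion, a plain CSR SpMV kernel is sketched below in NumPy; the row-pointer array `indptr` is, by the abstract's description, the kind of performance-limiting array that RCSR transforms and compresses, though the exact RCSR layout is not reproduced here:

```python
import numpy as np

def csr_spmv(indptr, indices, data, x):
    """y = A @ x for a CSR matrix A with row pointers `indptr`,
    column indices `indices`, and nonzero values `data`."""
    y = np.zeros(len(indptr) - 1, dtype=data.dtype)
    for row in range(len(y)):
        start, end = indptr[row], indptr[row + 1]   # nonzeros of this row
        y[row] = data[start:end] @ x[indices[start:end]]
    return y

# Tiny 2x3 irregular (non-square) example: [[1, 0, 2], [0, 3, 0]]
indptr  = np.array([0, 2, 3])
indices = np.array([0, 2, 1])
data    = np.array([1.0, 2.0, 3.0])
print(csr_spmv(indptr, indices, data, np.array([1.0, 1.0, 1.0])))  # [3. 3.]
```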

      A ROB compression method for RISC-V superscalar processors
      WANG Jie, FU Dan-yang
      2024, 46(07): 1185-1192.
      The RISC-V instruction set is flexible and extensible, and the vector extension is one of its standard extensions. When implementing the vector extension, a vector instruction must be split into multiple microinstructions. If each microinstruction occupies its own reorder buffer (ROB) entry, information is stored redundantly and the number of instructions executed in parallel (in-flight instructions) in the CPU shrinks, hurting processor performance. Based on decoupling the storage of instructions and microinstructions in the ROB, a new queue, the RAB, stores per-microinstruction information such as the rename mapping of each destination register, while each ROB entry stores only the information common to the microinstructions derived from its corresponding instruction. The ROB and RAB respectively control the commit and rollback (walk) of instructions and microinstructions, reducing storage redundancy and alleviating the pressure caused by splitting vector instructions into many microinstructions. On top of this method, this paper also implements ROB compression for scalar instructions, increasing the maximum number of in-flight instructions for the same number of ROB entries. Simulation results show that the method effectively improves processor performance.

      A learned indexing method for massive high-dimensional data based on partitioned hierarchical graphs
      HUA Yue-lin, ZHOU Xiao-lei, FAN Qiang, WANG Fang-xiao, YAN Hao
      2024, 46(07): 1193-1201.
      Learned indexing is key to approximate nearest neighbor search over massive high-dimensional data. However, existing learned indexing techniques are confined to individual partitions and rely on constructing neighborhood graphs. As data dimensionality and scale grow, such indexes struggle to judge partition-boundary data accurately, construction time complexity rises, and scalability suffers. To address these problems, a learned indexing method based on partitioned hierarchical graphs, PBO-HNSW, is proposed. The method redistributes partition-boundary data and constructs distributed graph index structures in parallel, effectively addressing the challenges of approximate nearest neighbor search. Experimental results show that PBO-HNSW achieves millisecond-scale index construction on millions of high-dimensional data points. At a recall of 0.93, the construction time of PBO-HNSW is only 36.4% of that of baseline methods.
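      The partition-then-index idea can be sketched with scikit-learn's KMeans for partitioning and the hnswlib library standing in for the per-partition hierarchical graph; PBO-HNSW's boundary-data redistribution and parallel construction are not reproduced:

```python
import numpy as np
import hnswlib
from sklearn.cluster import KMeans

dim, n = 64, 10_000
data = np.random.rand(n, dim).astype(np.float32)

# Step 1: split the data into partitions.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(data)

# Step 2: build one small hierarchical graph index per partition.
indexes = {}
for c in range(4):
    ids = np.where(km.labels_ == c)[0]
    idx = hnswlib.Index(space="l2", dim=dim)
    idx.init_index(max_elements=len(ids), ef_construction=200, M=16)
    idx.add_items(data[ids], ids)
    indexes[c] = idx

# Step 3: route a query to its nearest partition's graph.
q = np.random.rand(1, dim).astype(np.float32)
c = km.predict(q)[0]
labels, dists = indexes[c].knn_query(q, k=5)
print(labels)
```

      Routing only to the nearest partition is the simplest policy; handling queries near partition boundaries is exactly where methods like PBO-HNSW add their machinery.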


      DSP design for 56 Gb/s high-speed SerDes receiver
      HU Xiao-yue, WANG Qiang, LV Fang-xu, XU Chao-long, ZHANG Jin
      2024, 46(07): 1202-1209.
      The high-speed serial interface chip is an important IP in high-performance interconnect network communication. To combat the high bit error rate caused by severe channel attenuation over the long transmission distances of backplane communication carrying 56 Gb/s four-level pulse amplitude modulation (PAM4) signals, this paper proposes a DSP design for 56 Gb/s high-speed SerDes receivers. The DSP adopts a 64-lane parallel structure and processes the digitized signal from the receiver through a 16-tap feed-forward equalizer (FFE) and a decision feedback equalizer (DFE). By using the K-means clustering algorithm to generate dynamically adapting DFE decision levels, combined with the least mean square (LMS) algorithm, it handles equalization under channel attenuation ranging from 15 to 35 dB. To verify the algorithm's performance, an experimental platform based on an analog front-end chip and a field programmable gate array (FPGA) was constructed. The experimental results indicate that at 56 Gb/s, with channel attenuation of 15-35 dB at 14 GHz, the bit error rate is below 5e-10.
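      The K-means step can be illustrated in isolation: clustering equalized sample amplitudes into four groups recovers the PAM4 levels, and midpoints between the sorted cluster centers give the three slicer thresholds. A sketch on synthetic samples (not the paper's hardware datapath):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
ideal = np.array([-3.0, -1.0, 1.0, 3.0])             # nominal PAM4 levels
samples = rng.choice(ideal, size=5000) + 0.2 * rng.standard_normal(5000)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(samples.reshape(-1, 1))
levels = np.sort(km.cluster_centers_.ravel())        # adapted DFE decision levels
thresholds = (levels[:-1] + levels[1:]) / 2          # 3 slicer thresholds
print("levels:", levels, "thresholds:", thresholds)
```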


      A switching method for model inference serving oriented to serverless computing
      WEN Xin, ZENG Tao, LI Chun-bo, XU Zi-chen
      2024, 46(07): 1210-1217.
      The development of large-scale models has led to the widespread application of model inference services, and building stable, reliable architectural support for them has become a focus for cloud service providers. Serverless computing is a cloud computing paradigm with fine-grained resource granularity and a high level of abstraction. It offers on-demand billing and elastic scalability, which can effectively improve the computational efficiency of model inference services. However, model inference service workflows are multi-stage, so a single serverless computing framework can hardly guarantee optimal execution of every stage. The key problem is therefore how to exploit the performance characteristics of different serverless computing frameworks to switch model inference service workflows online and reduce overall execution time. This paper studies the switching of model inference services across serverless computing frameworks. Firstly, a pre-trained model is used to construct model inference service functions and derive the performance characteristics of heterogeneous serverless computing frameworks. Secondly, a machine learning technique is employed to build a binary classification model over those performance characteristics, enabling online switching of the model inference service framework. Finally, a test platform is built to generate model inference service workflows and evaluate the online switching framework prototype. Preliminary experimental results indicate that, compared with a single serverless computing framework, the online switching prototype reduces the execution time of model inference service workflows by up to 57%.
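      A minimal sketch of the switching decision follows, with hypothetical per-stage features and labels standing in for the measured performance characteristics; the real system's feature set and classifier are those of the paper, not this toy:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical features per inference stage:
# [input MB, model MB, cold start (0/1)]
X = rng.random((500, 3)) * np.array([100.0, 500.0, 1.0])
X[:, 2] = (X[:, 2] > 0.5).astype(float)
# Hypothetical label: framework B wins on large, warm invocations.
y = ((X[:, 0] > 50) & (X[:, 2] == 0)).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
stage = np.array([[80.0, 300.0, 0.0]])
print("route to framework", "B" if clf.predict(stage)[0] else "A")
```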

      A method for constructing performance analysis models of high performance applications based on a random forest classifier
      CHAI Xu-qing, QIAO Yi-hang, FAN Li-lin
      2024, 46(07): 1218-1228.
      Traditional performance analysis methods for high performance applications suffer from shortcomings such as extra overhead during analysis and inaccurate results, costing users more time and domain knowledge. To address these issues, this paper casts program performance analysis as a multi-class classification problem over an unbalanced, small-sample, high-dimensional dataset. 500 performance records covering seven types of metrics, such as the number of process switches, memory utilization, and disk I/O load, are collected during program runtime; after preprocessing steps such as PCA dimensionality reduction, a performance-problem analysis model is trained with a random forest classifier. Experimental validation shows that the model can identify five types of performance issues, including excessive memory utilization and heavy disk I/O load. To evaluate the model's guidance, performance data generated by the HotSpot3D and LU-Decomposition programs at runtime is collected, and the two validation programs are optimized at the runtime level and the compilation level according to the model's output. Experimental results indicate that the proposed method effectively guides program performance optimization, with speedups of 1.056 and 5.657 for the two programs, respectively.
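      The modeling pipeline described above, PCA preprocessing followed by a random forest multi-class classifier, can be sketched as follows; the data here is synthetic, standing in for the 500 collected samples of seven metric types:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 7))         # 7 metrics per runtime sample (synthetic)
y = rng.integers(0, 5, size=500)          # 5 performance-problem classes (synthetic)

model = make_pipeline(
    PCA(n_components=4),                  # dimensionality reduction
    RandomForestClassifier(n_estimators=200, random_state=0),
)
print(cross_val_score(model, X, y, cv=5).mean())   # meaningless on random labels
```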

      Computer Network and Information Security
      An anti-forensic detection model based on causality calculation
      DU Fang, JIAO Jian, JIAO Li-bo
      2024, 46(07): 1229-1236.
      In modern network attacks, attackers often use various anti-forensics techniques to conceal their tracks. Among them, data erasure is particularly harmful: attackers can delete or destroy data, wiping out attack evidence and disrupting the forensics process. Because erasure activity is itself covert, it is hard to detect. This paper proposes an anti-forensics check module (AFCM) built on causality-based provenance-tracing technology. The model generates an alert provenance graph from alert information and computes an anomaly score for each path in the graph from attack-behavior characteristics. Through further filtering and aggregation, it ultimately produces the attack path. The experimental results show that this model can effectively trace anti-forensics erasure activities and better distinguish data-erasure attack activities from normal activities.
      A data heterogeneity processing method based on asynchronous hierarchical federated learning
      GUO Chang-hao, TANG Xiang-yun, WENG Yu
      2024, 46(07): 1237-1244.
      In the era of ubiquitous Internet of Things devices, vast amounts of data with varying distributions and volumes are continuously generated, making data heterogeneity pervasive. For federated learning on intelligent IoT devices, traditional synchronous federated learning mechanisms fall short in tackling the non-IID data distribution problem, and are further plagued by single points of failure and the complexity of maintaining a global clock. Asynchronous mechanisms, in turn, may introduce additional communication overhead and model staleness under non-IID data. To offer a more flexible solution, an asynchronous hierarchical federated learning method is proposed. Initially, the BIRCH algorithm is employed to analyze the data distribution across IoT nodes and form clusters. Subsequently, data within these clusters is dissected and validated to identify nodes with high data quality, and nodes from high-quality clusters are disaggregated and reorganized into lower-quality clusters, forming new, optimized clusters. Finally, model training proceeds in two stages, with intra-cluster aggregation followed by global aggregation. The proposed approach is evaluated on the MNIST dataset. The results show that, compared with the baseline set by the classical FedAVG method, the proposed approach converges faster on non-IID data and improves model accuracy by more than 15%.
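      As a sketch of the first step, BIRCH clustering over per-node data-distribution statistics (here, hypothetical label histograms) groups nodes into clusters; the paper's quality validation and node reshuffling are omitted:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Hypothetical per-node label histograms (20 nodes, 10 classes), drawn
# from Dirichlet distributions of varying skew to mimic non-IID data.
nodes = np.vstack([
    rng.dirichlet(alpha=np.full(10, a), size=10)
    for a in (0.1, 10.0)                 # skewed vs. near-uniform nodes
])

clusters = Birch(n_clusters=3).fit_predict(nodes)
print(clusters)                          # cluster id per node
```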

      A network traffic prediction model based on improved northern goshawk optimization for stochastic configuration network
      WANG Kun, LI Shao-bo, HE Ling, ZHOU Peng
      2024, 46(07): 1245-1255.
      Network traffic prediction, as a critical technology, can assist in rationally allocating network resources, optimizing network performance, and providing efficient network services. As network environments evolve, network traffic has grown more diverse and complex. To improve prediction accuracy, a network traffic prediction model based on improved northern goshawk optimization for stochastic configuration networks (CNGO-SCN) is proposed. The stochastic configuration network, as a supervised incremental model, has significant advantages in large-scale data regression and prediction problems, but its accuracy depends on the selection of certain hyperparameters. To address this, the northern goshawk optimization algorithm is used to find optimal values for the regularization parameter and scaling factor that govern the network's performance. Because the initial population distribution of the northern goshawk optimization algorithm yields poor-quality individuals, chaotic logistic mapping is introduced to improve the quality of initial solutions. The optimized model is applied to real traffic datasets from the UK academic network, the core network of a European city, and the interface of a network collaborative manufacturing cloud platform built by a partner enterprise, and is compared with various neural network models to verify its network traffic prediction capability. Experimental results show that the model predicts more accurately than other neural networks and handles complex real-world data better, with prediction error reduced by 0.9% to 99.7%.
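      The chaotic-initialization idea is easy to show on its own: a logistic map x_{k+1} = r·x_k(1 − x_k) with r = 4 generates a well-spread sequence in (0, 1) that seeds the initial population in place of uniform random draws. A sketch under those standard assumptions:

```python
import numpy as np

def logistic_chaos_population(pop_size, dim, lower, upper, r=4.0, x0=0.7):
    """Initialize a population with a logistic chaotic map, then scale
    each value from (0, 1) into the search bounds [lower, upper]."""
    x = x0
    pop = np.empty((pop_size, dim))
    for i in range(pop_size):
        for j in range(dim):
            x = r * x * (1.0 - x)        # logistic map iteration
            pop[i, j] = lower + x * (upper - lower)
    return pop

print(logistic_chaos_population(pop_size=5, dim=3, lower=-1.0, upper=1.0))
```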

      Graphics and Images
      Cross-camera re-identification of moving objects based on metric learning
      KANG Yu, SHI Ke-hao, CHEN Jia-yi, CAO Yang, XU Zhen-yi
      2024, 46(07): 1256-1268.
      In recent years, pollution from diesel vehicle exhaust emissions in China has become increasingly severe. To improve the atmospheric environment, diesel vehicles emitting black smoke must be monitored. In urban traffic scenes, however, black smoke vehicles are often hard to identify from rear-view videos due to factors such as vehicles occluding one another, and the severe lack of relevant data greatly limits data-driven methods. To address these problems, this paper proposes a black-smoke diesel vehicle re-identification model for cross-camera scenes. An IBN module is introduced into the feature extraction network to improve its robustness to appearance changes in diesel vehicle images, and a metric-learning loss function based on the Hausdorff distance is designed to measure feature differences, increasing inter-class distance and reducing the impact of occluded samples during optimization. Benchmark datasets for diesel vehicle re-identification across multiple scenes are then constructed, and the proposed method is evaluated on them. The experimental results show that the proposed method achieves an accuracy of 83.79%, demonstrating high recognition performance.
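      The Hausdorff distance underlying the proposed metric-learning loss can be computed with SciPy; the sketch below measures the symmetric Hausdorff distance between two feature sets and is only the distance itself, not the full training loss:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

rng = np.random.default_rng(0)
feats_a = rng.standard_normal((32, 128))   # features of one identity
feats_b = rng.standard_normal((32, 128))   # features of another identity

# Symmetric Hausdorff distance: max of the two directed distances.
d = max(directed_hausdorff(feats_a, feats_b)[0],
        directed_hausdorff(feats_b, feats_a)[0])
print(d)
```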

      Low-altitude remote sensing image object detection based on improved YOLOv7 network
      ZHANG Yong-zhi, HE Ke-ren, GE Jue
      2024, 46(07): 1269-1277.
      To address the bottlenecks in low-altitude remote sensing image object detection caused by small target scales, complex and variable backgrounds, and limited computing resources, a new detection method named SimAM_YOLOv7 is proposed based on an improved YOLOv7 network. Firstly, redundant parameters are reduced through tensor-train decomposition. Secondly, a parameter-free attention module is introduced to strengthen the network's focus on targets. Then, efficient intersection over union (EIoU) is used to optimize the positioning loss, reducing the offset between predicted and prior boxes. Furthermore, the classification loss is improved with Focal Loss to counter the imbalance between positive and negative samples. Experiments on a real-world low-altitude remote sensing dataset demonstrate that, compared to the YOLOv7 baseline, the proposed method increases mAP50 by 4.63% and mAP50:95 by 3.94% while reducing the parameter count by 3.27M, fully validating its effectiveness and superiority.
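      For reference, the EIoU positioning loss mentioned above augments IoU with center-distance, width, and height penalties, each normalized by the enclosing box; a sketch for single boxes in (x1, y1, x2, y2) format:

```python
def eiou_loss(box_p, box_g, eps=1e-7):
    """EIoU = 1 - IoU + center term + width term + height term."""
    x1 = max(box_p[0], box_g[0]); y1 = max(box_p[1], box_g[1])
    x2 = min(box_p[2], box_g[2]); y2 = min(box_p[3], box_g[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    wp, hp = box_p[2] - box_p[0], box_p[3] - box_p[1]
    wg, hg = box_g[2] - box_g[0], box_g[3] - box_g[1]
    iou = inter / (wp * hp + wg * hg - inter + eps)
    # Smallest enclosing box dimensions.
    cw = max(box_p[2], box_g[2]) - min(box_p[0], box_g[0])
    ch = max(box_p[3], box_g[3]) - min(box_p[1], box_g[1])
    # Squared distance between box centers.
    rho2 = (((box_p[0] + box_p[2]) - (box_g[0] + box_g[2])) ** 2 +
            ((box_p[1] + box_p[3]) - (box_g[1] + box_g[3])) ** 2) / 4.0
    return (1 - iou
            + rho2 / (cw ** 2 + ch ** 2 + eps)    # center-distance penalty
            + (wp - wg) ** 2 / (cw ** 2 + eps)    # width penalty
            + (hp - hg) ** 2 / (ch ** 2 + eps))   # height penalty

print(eiou_loss([0, 0, 2, 2], [1, 1, 3, 3]))
```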

      Pedestrian detection based on multi-scale features and mutual supervision
      XIAO Zhen-jiu, LI Si-qi, QU Hai-cheng
      2024, 46(07): 1278-1285.
      Aiming at the high miss rate and low accuracy of pedestrian detection in crowded scenes, a pedestrian detection network based on multi-scale features and mutual supervision is proposed. To extract pedestrian features effectively in complex scenes, a network combining a PANet feature pyramid with mixed dilated convolutions is used. A mutually supervised head-body detection network is then designed, in which head bounding boxes and full-body bounding boxes supervise each other to produce more accurate detections. The proposed network achieves 13.5% MR-2 on the CrowdHuman dataset, an improvement of 3.6% over the YOLOv5 network, together with a 3.5% improvement in average precision (AP). On the CityPersons dataset, it achieves 48.2% MR-2, a 2.3% improvement over YOLOv5, together with a 2.8% improvement in AP. The results indicate that the proposed network performs well in densely crowded scenes.

      Artificial Intelligence and Data Mining
      One-off three-way sequential pattern mining
      YANG Shi-qi, WU You-xi, GENG Meng, LI Yan
      2024, 46(07): 1286-1295.
      One-off sequential pattern mining aims to mine repetitive sequential patterns with gap constraints from sequences. However, current methods do not consider users' degree of interest and treat each character in the sequence equally, which yields many redundant patterns that are uninteresting to users. To solve this problem, this paper formulates the one-off three-way sequential pattern (OTP) mining problem by introducing the concept of three-way decisions, and proposes an efficient algorithm, OTPM, to solve it. For support calculation, OTPM builds on depth-first search with a backtracking strategy and exploits the characteristics of three-way patterns to compute pattern support efficiently. For candidate generation, OTPM uses a pattern-join strategy to reduce the number of candidate patterns. In addition, OTPM adopts a parallelization scheme that improves mining efficiency by fully exploiting the multi-core performance of modern processors. Finally, experimental results verify the significance of the OTP mining problem and the efficiency of the OTPM algorithm.


      A review of named entity recognition research
      DING Jian-ping, LI Wei-jun, LIU Xue-yang, CHEN Xu
      2024, 46(07): 1296-1310.
      Named entity recognition (NER), as a core task in natural language processing, finds extensive applications in information extraction, question answering systems, machine translation, and more. Firstly, rule-based, dictionary-based, and statistical machine learning methods are described and summarized. Subsequently, an overview of deep learning based NER models is presented, covering supervised, distantly supervised, and Transformer-based approaches. In particular, recent advances in the Transformer architecture and related models in natural language processing are elucidated, such as Transformer-based masked language modeling and autoregressive language modeling, including BERT, T5, and GPT. Furthermore, data transfer learning and model transfer learning methods applied to NER are briefly discussed. Finally, the challenges faced by NER tasks and future development trends are summarized.


      RIB-NER: A span-based Chinese named entity recognition model
      TIAN Hong-peng, WU Jing-wei
      2024, 46(07): 1311-1320.
      Named entity recognition is an important foundation for many downstream tasks in natural language processing, and Chinese, as a major international language, is unique in many aspects. Traditionally, Chinese named entity recognition models use sequence labeling mechanisms that require conditional random fields to capture label dependencies, an approach that is prone to label misclassification. To address this problem, a span-based named entity recognition model called RIB-NER is proposed. Firstly, RoBERTa serves as the embedding layer, providing character-level embeddings with richer contextual semantic and lexical information. Secondly, IDCNN is used with parallel convolution kernels to strengthen positional information between characters, tying neighboring characters more closely together; a BiLSTM network is integrated at the same time to capture context. Finally, a biaffine model scores candidate start and end tokens in the sentence, and these scores are used to identify spans. Experiments on the MSRA and Weibo corpora show that the model can accurately identify entity boundaries, achieving F1 scores of 95.11% and 73.94%, respectively, and better recognition performance than traditional deep learning approaches.
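      A minimal PyTorch sketch of biaffine span scoring follows (hidden sizes and the omission of separate start/end MLPs are simplifications, not the paper's exact configuration): the score for span (i, j) combines a bilinear term between the two token representations with a linear term over their concatenation:

```python
import torch
import torch.nn as nn

class BiaffineSpanScorer(nn.Module):
    def __init__(self, hidden: int, num_labels: int):
        super().__init__()
        self.U = nn.Parameter(torch.randn(num_labels, hidden, hidden) * 0.01)
        self.W = nn.Linear(2 * hidden, num_labels)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        """h: (seq_len, hidden) -> scores: (seq_len, seq_len, num_labels),
        where scores[i, j] rates the span starting at i and ending at j."""
        bilinear = torch.einsum("ih,lhk,jk->ijl", h, self.U, h)
        seq = h.size(0)
        pair = torch.cat(
            [h.unsqueeze(1).expand(seq, seq, -1),
             h.unsqueeze(0).expand(seq, seq, -1)], dim=-1)
        return bilinear + self.W(pair)

scorer = BiaffineSpanScorer(hidden=16, num_labels=5)
print(scorer(torch.randn(8, 16)).shape)   # torch.Size([8, 8, 5])
```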


      A short text semantic matching strategy based on BERT sentence vector and differential attention
      WANG Qin-chen, DUAN Li-guo, WANG Jun-shan, ZHANG Hao-yan, GAO Hao
      2024, 46(07): 1321-1330.
      Short text semantic matching is a core issue in natural language processing and is widely used in automatic question answering, search engines, and other fields. Most previous work considers only the similar parts between texts while ignoring the parts that differ, so the model cannot fully exploit the key information for judging whether texts match. To address this, this paper proposes a short text semantic matching strategy based on BERT sentence vectors and differential attention. BERT vectorizes the sentence pairs, a BiLSTM encodes them, and a multi-head differential attention mechanism is introduced to obtain attention weights that capture the intention differences between each word vector and the global semantic information of the text. A one-dimensional convolutional neural network then reduces the dimension of the sentence pairs' semantic feature vectors, and finally the word and sentence vectors are concatenated and fed into a fully connected layer to compute the semantic matching degree between the two sentences. Experiments on the LCQMC and BQ datasets show that this strategy effectively extracts textual semantic difference information, enabling the model to achieve better results.