Computer Engineering & Science

Performance evaluation and analysis of vectorized SpMV algorithm based on scratchpad memory

ZHANG Zong-mao, DONG De-zun, WANG Zi-cong, CHANG Jun-sheng, ZHANG Xiao-yun, WANG Shao-cong

2024, 46(9): 1521-1528. doi:

Abstract ( 491 )

PDF (1120KB) ( 934 ) 　　

Scratchpad memory (SPM), as an on-chip high-speed memory with a simple structure, fixed access latency, and direct software control, has been widely used in modern processor design. Sparse matrix vector multiplication (SpMV) is one of the critical kernel computation functions in high performance computing, artificial intelligence, and other application domains. In traditional multi-level cache processors, the irregular access operations of dense input vectors during the computation of the SpMV algorithm often lead to a significant number of cache misses, thereby affecting the execution efficiency of the SpMV algorithm. To evaluate the performance impact of scratchpad memory on the SpMV vector algorithm, this paper utilizes ARMs scalable vector extension (SVE) instructions to vectorize the SpMV algorithm based on the compressed sparse row (CSR) format. It stores the hot data, namely the dense input vectors, in the scratchpad memory and conducts a performance analysis of the SpMV vector algorithm on ARM-based processors integrated with scratchpad memory. This paper conducts experiments on 2 562 sparse matrices from real-world applications using the gem5 simulator. The experimental results show that, compared to traditional processor architectures, running the SpMV vector algorithm on the processor architecture integrated with scratchpad memory can achieve a maximum speedup of 7.45 times and an average speedup of 1.11 times.

A scalable parallel structured matrix multiplication algorithm framework

LI Sheng-guo, LIAO Xia, YU Heng-biao, HUANG Chun, JIANG Hao, LU Xi-yan, WANG Hua-lin, CHENG Li-zhi

2024, 46(9): 1529-1538. doi:

Abstract ( 516 )

PDF (1176KB) ( 701 ) 　　

Structured matrices play an important role in scientific computing and engineering applications, such as Cauchy, Toeplitz, Vandermonde, and Hankel matrices. Although these matrices are dense, they can be expressed with only O(n) parameters (generators), where n is the dimension of the matrix. The core idea of the algorithm in this paper is to use matrix generators to explicitly construct local matrix blocks of each process, thereby reducing communication overhead. Additionally, by leveraging the numerical low-rank property of these matrix blocks. This paper further minimize computational overhead. Consequently, the proposed parallel structured matrix multiplication algorithm framework can simultaneously reduce both computational and communication costs, making it suitable for matrix multiplication algorithms like Cannon, Fox, and PUMMA. Extensive numerical tests were conducted on the Tianhe-2 supercomputer, and the results demonstrate that the proposed algorithm achieves an 8.96× speedup compared to the PDGEMM function in ScaLAPACK.

An urgent memory reuse algorithm for computational graphs

CAO Bo-jun, QIAN Ru-yi, XU Yuan-chao

2024, 46(9): 1539-1546. doi:

Abstract ( 439 )

PDF (807KB) ( 549 ) 　　

The limited device memory capacity restricts the further expansion of deep neural network models, and memory reuse is one of the few methods to save memory usage without introducing additional overhead. Intermediate tensors in the computational graph occupy the majority of memory space and are the primary optimization targets for memory reuse algorithms. Existing typical memory reuse algorithms, including the large tensor priority algorithm and the short lifetime priority algorithm, only consider a single characteristic, focusing solely on whether the lifetimes of tensors overlap, while ignoring the relative positional relationship between the lifetimes of adjacent tensors. As the computational graph becomes more complex, the exploitation of memory reuse becomes less sufficient. To address this issue, a new memory reuse algorithm, UMR, is proposed. By deeply analyzing the relative positional relationship between the lifetimes of adjacent tensors in the graph and promptly reusing them, UMR obtains more opportunities for memory reuse. The algorithm is evaluated based on real inference models in MLPerf, and the results show that the memory reuse rate of the UMR algorithm is not lower than that of existing mainstream algorithms and can achieve the theoretical optimum for memory reuse in the model. Evaluations of the algorithm based on relatively complex computational graphs indicate that compared to the large tensor priority and short lifetime priority algorithms, UMR saves up to 21.6% and 18.7% of memory usage, with average savings of 6.5% and 13.2%, respectively.

A neural network pruning and quantization algorithm for hardware deployment

WANG Peng, ZHANG Jia-cheng, FAN Yu-yang,

2024, 46(9): 1547-1553. doi:

Abstract ( 623 )

PDF (1042KB) ( 900 ) 　　

Due to their superior performance, deep neural networks have been widely applied in fields such as image recognition and object detection. However, they contain a large number of parameters and require immense computational power, posing challenges for deployment on mobile edge devices that require low latency and low power consumption. To address this issue, a compression algorithm that replaces multiplication operations with bit-shifting and addition is proposed. This algorithm compresses neural network parameters to low bit-widths through pruning and quantization. This algorithm reduces the hardware deployment difficulty under limited multiplication resources, meets the requirements of low latency and low power consumption on mobile edge devices, and improves operational efficiency. Experiments conducted on classical neural networks with the ImageNet dataset revealed that when the neural network parameters were compressed to 4 bits, the accuracy remained essentially unchanged compared to the full-precision neural network. Furthermore, for ResNet18, ResNet50, and GoogleNet, the Top-1/Top-5 accuracies even improved by 0.38%/0.22%, 0.35%/0.21%, and 1.14%/0.57%, respectively. When testing the eighth convolutional layer of VGG16 deployed on Zynq7035, the results showed that the compressed network reduced the inference time by 51.1% and power consumption by 46.7%, while using 43% fewer DSP resources.

Research on key technologies of distributed training for Level2 market quotation factor mining

ZHAO Xin-bo, LU Zhong-hua

2024, 46(9): 1554-1565. doi:

Abstract ( 400 )

PDF (1977KB) ( 972 ) 　　

Level2 market quotation data is the new generation of real-time market data products from the Shanghai and Shenzhen Stock Exchanges. Serving as an enhanced version of basic market data, it currently has the highest information density, the greatest amount of information, and the most insufficient mining in China. The data is of significant value in identifying potential risks in the securities market, but existing research lacks risk measurement and analysis based on it. Moreover, the scale of Level2 market quotation data in the entire market is large, and the deep learning models used to extract information are becoming increasingly complex. Although hardware computing power is constantly developing and improving, it still cannot solve problems such as long training time and low efficiency. Therefore, based on Level2 market quotation data of CSI 300, deep learning and other methods are used to mine high-frequency volatility factors, and builds a high-frequency volatility prediction model based on TabNet and LightGBM. At the same time, a distributed training algorithm Parallel_DE based on parallel differential evolution is proposed for parameter calculation in the process of model distributed training, its scene mapping scheme and overall process design are elaborated. The above two work are fully verified based on the proposed distributed training platform. The experimental results show that the high-frequency volatility prediction model can predict the realized volatility with high precision, and the effect has certain advantages compared with other methods; the Parallel_DE algorithm can effectively reduce the error of local parameters on the test set while retaining the diversity of parameters to a certain extent, so as to efficiently and distributedly train a deep learning model with excellent performance. This paper provides valuable technologies and methodologies for leveraging Level2 market quotation data in risk identification within the securities market.

Spatial decomposition coloring and vectorization of molecular dynamics simulation of silicon crystal based on OpenMP

FU You, HAN Hao, SUN Yue-jiao, LIANG Jian-guo, YE Yu-xi, HUA Rong

2024, 46(9): 1566-1575. doi:

Abstract ( 306 )

PDF (1081KB) ( 583 ) 　　

As one of the hot spots in virtual process engineering research in materials, the molecular dynamics of silicon crystal is simulated using Tersoff multi-body potential. In the multi-body potential, the interaction between particles is computationally intensive, and there are dependencies between data. Efficient and accurate large-scale simulation on parallel architectures faces two challenges: write conflict and low computational efficiency. To address these issues, a series of optimization methods tailored for silicon crystal molecular dynamics applications have been implemented based on the OpenMP shared-memory programming model to enhance simulation efficiency: (1)In the process of large-scale thread-level parallel simulation, the spatial decomposition graph coloring idea is used to eliminate the data dependence between particles and effectively solve the problem of write conflict； (2)For the core computing program segment, the overall vectorization is used to improve the computational efficiency, and the series estimation is used to realize the transcendental function to achieve the parallel optimization of multi-body potential Tersoff on multi-core processors. The experimental results show that the Tersoff potential has good optimization potential on the X86 platform, and the spatial decomposition graph coloring and vectorization are feasible and scalable for silicon crystal application, which can effectively solve the write conflict caused by data intersection and computationally intensive optimization problems, and the speedup of final result can reach 23.17.

OMCI model similarity computation based on graph neural networks

YUAN Jia-wei, ZHAO Jin

2024, 46(9): 1576-1586. doi:

Abstract ( 395 )

PDF (922KB) ( 857 ) 　　

The optical network unit management and control interface (OMCI) is a crucial protocol for interconnectivity between the optical line terminal (OLT) and optical network unit (ONU) in Gigabit-capable passive optical networks (GPON) systems. When addressing OMCI interoperability issues, developers often need to conduct exception analysis on OMCI service models. However, due to the complexity of OMCI domain knowledge, directly analyzing OMCI service models can be extremely challenging, time-consuming, and laborious for inexperienced developers. To address these challenges, the OMCI model exception analysis method based on graph neural networks (GNNs) is proposed. This method leverages graph similarity computation algorithms to search for similar OMCI models from a database as references, compares the differences, and identifies anomalies. Firstly, real OMCI data is structured into graph data. Subsequently, the similarity graph neural network (SimGNN) is improved by integrating graph isomorphism networks and self-attention pooling for fast graph similarity computation. Finally, the similarity scores between each graph in the OMCI graph database and the anomalous graph data are calculated, and the most similar OMCI service model graphs are recommended based on the score ranking. Experimental results show that the improved graph similarity computation model outperforms the baseline model on the OMCI dataset used in this study. Moreover, it proves effective in practical applications, offering valuable assistance in analyzing OMCI interoperability issues.

Multi-feature-based log event anomaly detectionYU Jia-ni,HU Zhao-xia,JIANG Cong-feng

(School of Computer Science, Hangzhou Dianzi University, Hangzhou , China)

2024, 46(9): 1587-1597. doi:

Abstract ( 500 )

PDF (1539KB) ( 808 ) 　　

anomaly detection;log event;multi-features;attention mechanism

Information hiding algorithm in compressive sensing encrypted domain based on homomorphism addition

LI Ming, XIN Xin

2024, 46(9): 1598-1605. doi:

Abstract ( 420 )

PDF (1790KB) ( 648 ) 　　

Information hiding ensures necessary security of massive data in the cloud and Internet of things environments. Although traditional encryption can protect the privacy of the image effectively, it cannot protect the image in other aspects such as copyright and integrity. Therefore, information hiding in encrypted domain is required and challenged. An information hiding scheme in homomorphic encrypted domain is proposed. Firstly, the homomorphism of compressive sensing is explored, and it is found that doubling the measurements obtained by compressive sensing is equivalent to directly extending the original signal before compressive sensing. Secondly, by using homomorphic addition, information hiding in the compressive sensing encrypted domain is realized based on differential extension. The experimental simulation results show that the proposed algorithm has satisfactory performances in both privacy protection and information hiding, and the embedding capacity is higher than the related state of art works.

A secure federated learning scheme based on differential privacy and model clustering

XIAO Di, YU Zhu-yang, LI Min, WANG Lian

2024, 46(9): 1606-1615. doi:

Abstract ( 675 )

PDF (1150KB) ( 719 ) 　　

Model security and clients privacy are urgent challenges to be addressed in federated learning. In order to simultaneously tackle these challenges, a federated learning scheme based on differential privacy and model clustering is proposed. Local differential privacy is introduced in clients updates to protect clients privacy by disrupting the parameters. To ensure precise clustering of noisy model updates, cosine gradient is defined for the first time to cluster noisy model updates. Based on the clustering results, malicious models are accurately identified and filtered. Finally, global differential privacy is introduced to resist potential backdoor attacks. The noise boundary of global noise is obtained by theoretical analysis and it is proved that the total noise introduced by our scheme is lower than that introduced by the classical model security scheme. The experimental results demonstrate that our scheme can achieve the expected goals in terms of accuracy, robustness and privacy.

Mobile monocular depth estimation based on multi-scale feature fusion

CHEN Lei, LIANG Zheng-you, SUN Yu, CAI Jun-min

2024, 46(9): 1616-1524. doi:

Abstract ( 532 )

PDF (876KB) ( 715 ) 　　

The current depth estimation model based on depth learning has a large number of parameters, which is difficult to adapt to mobile devices. To address this issue, a lightweight depth estimation method with multi-scale feature fusion that can be deployed on mobile devices is proposed. Firstly, MobileNetV2 is used as the backbone to extract features of four scales. Then, by constructing skip connection paths from the encoder to the decoder, the features of the four scales are fused, fully utilizing the combined positional information from lower layers and semantic information from higher layers. Finally, the fused features are processed through convolutional layers to produce high-precision depth images. After training and testing on NYU Depth Dataset V2, the experimental results show that the proposed model achieves advanced performance with an evaluation index of δ1 up to 0.812 while only having 1.6×106 parameters numbers. Additionally, it only takes 0.094 seconds to infer a single image on the Kirin 980 CPU of a mobile device, demonstrating its practical application value.

A military image set captioning method based on image and text relevance and context guidance

MEI Yun-hong, LIU Mao-fu,

2024, 46(9): 1625-1634. doi:

Abstract ( 357 )

PDF (1839KB) ( 751 ) 　　

Traditional image captioning methods do not generate explanatory description texts due to the lack of a priori knowledge of the real world, while the accuracy of the generated description texts is not high in some specialized fields. To address these problems, the military news image set captioning task is proposed, and a military news image set dataset is also constructed. The task has two key challenges: the description information is derived from the whole image set and the corresponding news articles; the semantics learned by the model is not sufficient. A military news image set captioning method based on image and text relevance and context guidance (ITRCG) is further proposed. Based on ITRCG, cross-modal information interaction is realized, the model is guided to learn more complete semantics, and named entity generation is assisted by label cleaning. Experimental validation is conducted on the constructed military news image set dataset, and the results show that ITRCG can effectively improve the quality of the description text and achieve improvements in all evaluation metrics.

An improved whale optimization algorithm based on multiple strategies

DAI Chun-yu, MA Lian-jie, JIANG Han-cun, LI Hong-shuang

2024, 46(9): 1635-1647. doi:

Abstract ( 521 )

PDF (1433KB) ( 569 ) 　　

To address the issues of the standard whale optimization algorithm, including slow convergence speed, imbalance between exploration and exploitation, lack of information exchange among the population, and susceptibility to local optima, an improved algorithm is proposed. Firstly, the Tent chaotic mapping is employed to enhance the uniformity of the initial population distribution. Secondly, a nonlinear convergence factor is introduced to improve the algorithms global search ability in the early stage and local exploration ability in the middle and late stages, coordinating the transition mechanism between search and exploitation. Then, the average position vector of the population is introduced into the random search process, effectively addressing the lack of information exchange between individuals and the population. Next, an adaptive inertia weight is introduced into the position update formula to enhance the convergence speed and accuracy of the algorithm. Finally, the Cauchy operator is utilized to perform mutation perturbation on individuals trapped in local optima. Simulation experiments were conducted on 15 benchmark test functions to evaluate the improved algorithm. The experimental results demonstrate that the improved whale optimization algorithm possesses excellent performance, and the effectiveness of the improved algorithm is proven through the Wilcoxon rank-sum test

A deep neural network-enhanced pairwise bilinear factorization machine model

ZHOU Qi, ZHOU Ning-ning

2024, 46(9): 1648-1659. doi:

Abstract ( 371 )

PDF (1794KB) ( 726 ) 　　

The neural network-enhanced factorization machine (FM) model, which can capture more high-order feature interactions and improve the accuracy of predictions, has become a research hotspot in the field of recommendation algorithms. Aiming at the problem that existing models do not comprehensively consider high-order interaction features and original low-order features when modeling the interactions between users and items, and in order to improve the models ability to model user preferences, this paper proposes a new deep neural network-enhanced pairwise bilinear factorization machine model, DeepPRBFM, by combining depth neural networks and pairwise learning. This model adopts a bilinear structure with a pair of inputs containing positive and negative samples, utilizes multi-layer ResNet to preserve low-order features, enhances the interaction of high-order features with DNN, and employs a pairwise ranking-based loss function. Moreover, in the bilinear structure, increasing the proportion of negative samples can not only significantly alleviate the cold start problem of the recommendation system but also improve the prediction performance of the model. Experiments conducted on two real-world datasets show that the proposed model achieves higher recommendation accuracy and outperforms other models in objective metrics such as HR and NDCG.

Flight path planning based on improved genetic algorithm and multi-objective optimization model

AN Yuan-yuan, MA Xiao-ning

2024, 46(9): 1660-1666. doi:

Abstract ( 598 )

PDF (608KB) ( 739 ) 　　

For the existing path planning models, it is difficult to solve the problem of optimal path planning under different aircraft types and transportation time conditions with a single cost planning. This paper combines aircraft type configuration, transportation time and system cost, and establishes an optimization model of hub-and-spoke airline network through the location of hub cities, the flow distribution from non-hub city nodes to hub city nodes, the flight time of the fleet and the fleet size. The entropy method is used to establish the chromosome selection mechanism, and the adaptive crossover rate is introduced to improve the genetic algorithm. The improved algorithm (IGA) is used to optimize the location distribution of the best route and hub nodes. Our method is compared with the traditional genetic algorithm, bee colony algorithm and grey Wolf algorithm. The research indicates that combining different aircraft type configurations and transportation times is superior to single-cost path planning. By optimizing and solving the hub-and-spoke airline network model with the improved algorithm, the total system cost is reduced by 3.41×1010, providing a reference for the reasonable allocation of fleet resources.

A multimodal multi-label classification method based on hypergraph

LU Bin, FAN Qiang, ZHOU Xiao-lei, YAN Hao , WANG Fang-xiao,

2024, 46(9): 1667-1674. doi:

Abstract ( 525 )

PDF (1031KB) ( 930 ) 　　

Label classification aims to select the most relevant subset of labels from a set of labels to tag an instance, which has become a hot issue in the field of artificial intelligence. Traditional multi- label learning methods mainly focus on identifying single-modal data, with limited research on mining high-order correlation between multi-modal data. To address the issue of insufficient representation of high-order correlations between multi-modal data in multi-label scenarios, this paper proposed a multi-modal multi-label classification method based on hypergraphs. The hypergraph model is introduced to model the high-order correlations of multi-modal data, and the fusion of multi-modal features and hyperedge convolution operation are utilized to achieve the mining of multi-modal data relationships and feature recognition, thus improving the performance of multi-modal multi-label classification. Experiments were conducted on the movie genre classification task, and the proposed method was compared with traditional methods. The experimental results show that the proposed method outperforms the comparison methods in terms of accuracy, precision, and F1 score, demonstrating the effectiveness of the method.

Meta-learning based graph neural network cold start recommendation

WU Si-qi, ZHAO Qing-hua, YU Yu-chen

2024, 46(9): 1675-1684. doi:

Abstract ( 670 )

PDF (1751KB) ( 855 ) 　　

In order to overcome the limitation of the cold start problem in the recommendation process on the performance of new users or new project scenarios, a meta-learning based graph neural network cold start recommendation model, namely MetaNGCF, is proposed to improve the accuracy and diversity of user-to-recommendation. Firstly, a perceptual meta-learning structure with adaptive properties is proposed to construct a model with a hybrid user-project interaction graph and neural graph, which expresses user behavior and project knowledge in a unified way. This structure incorporates an adaptive weighted loss strategy to correct the meta-learning paths in real time, in order to avoid the damage caused by noisy tasks on the model. Secondly, a clustering algorithm is applied to transform the high-dimensional feature space into a low-dimensional low-rank feature space. User preference learning is utilized to task aggregation layer gradient to encode the collaborative signals and automatically gene- ralize the higher-rank connectivity between users and projects, which in turn captures the NGCF general knowledge semantics. Finally, the results are validated in comparison with existing MetaHIN algorithm, and the results show that MetaNGCF has better performance on Recall@20 and NDCG@20.

A deep subspace clustering algorithm based on dual self-expression and the maximum entropy principle

LI Meng, LIU Zi-yi, SONG Yu-hang

2024, 46(9): 1685-1692. doi:

Abstract ( 426 )

PDF (843KB) ( 620 ) 　　

The deep subspace clustering algorithm utilizes deep neural networks to map the original input data to a latent space and employs the self-expression of the data as a measure of data similarity, effectively achieving clustering of high-dimensional data. However, such algorithms only focus on the self-expressive relationship in the latent space, resulting in their performance heavily relying on the quality of features extracted by the deep neural networks. Additionally, the regularization process ignores the connectivity within each subspace, affecting the performance of spectral clustering. To address these issues, a deep subspace clustering algorithm based on dual self-expression and the maximum entropy principle is proposed. This algorithm simultaneously learns the self-expressive relationships in both the latent space and the input space, guiding the deep neural network to obtain data representations suitable for subspace clustering. By maximizing the entropy of the similarity matrix, it ensures that elements within the same subspace are uniformly and densely distributed, thereby improving the performance of data clustering. Extensive experiments on five datasets verify the effectiveness of the proposed algorithm.

A point of interest recommendation model based on tracks and friend relationship of users

LIU Guo-qi, HE Ting-nian, RONG Yi-xuan, LI Zhuo-ran

2024, 46(9): 1693-1701. doi:

Abstract ( 461 )

PDF (1040KB) ( 678 ) 　　

Point of interest (POI) recommendation is one of the most important applications of location-based social networking (LBSN). Existing studies have proposed methods that utilize POI information and spatio-temporal information for recommendation, but the relevant auxiliary information has not been fully utilized, so it can not alleviate the problems of insufficient information and less travel data caused by short track check-in of users. To solve these problems, a friends relationship and self- attention recommendation model ATFR is proposed. The prediction model consists of three parts: Firstly, the representation vector of friend relationship is obtained by graph embedding method and input into the neural network. Secondly, the user interest preference vector is obtained by GRU. Then, the self-attention mechanism is used to model the sequential and social effects of the user check-in sequence and selectively focus on the historical check-in records associated with the check-in sequence. Finally, the future interest points are recommended according to the interest points sorting list. Experimental results on two public datasets show that this model performs better than other algorithms and can be used to improve the quality of personalized interest recommendation services for websites and applications.

A high utility quantitative frequent pattern mining algorithm based on related degree

WANG Hui, LI Yan, DING Ding, WU Kun, HUANG Ya-ping,

2024, 46(9): 1702-1710. doi:

Abstract ( 287 )

PDF (811KB) ( 506 ) 　　

The high utility frequent pattern mining algorithm mines more important frequent patterns from the data by using the importance degree information. On this basis, the high utility quantitative frequent pattern mining algorithm further explores the quantitative relationship between data items, and thus has become a popular research topic in the field of data mining. RHUQI-Miner is proposed to improve the performance and practicability of the algorithm. Firstly, the concept of related degree is proposed, the item related degree structure is constructed according to the related degree, and a pruning optimization strategy is given to find frequent patterns with higher related degree, reducing redundancy and invalid frequent patterns. Secondly, the fixed pattern length strategy is used to modify the utility information of the item in the mining process, so that the algorithm can control the length of the output frequent pattern according to the actual data situation, and further improve the performance and practicability of the algorithm. The experimental results show that RHUQI-Miner can effectively reduce the time and memory consumption in the mining process, which can provide data support for differentiated and precise maintenance strategies.

Current Issue

Author center

Review center

Online journal