Computer Engineering & Science

GPU-accelerated RTL simulation with loop unrolling

TIAN Xi, LI Tun, CHENG Yue, PI Yan, ZOU Hongji

2025, 47(2): 191-199. doi:

Abstract ( 620 )

PDF (1661KB) ( 624 ) 　　

With the development of open-source and agile hardware design methodologies, efficient RTL (register-transfer level) simulation support has become increasingly important. GPUs can accelerate RTL simulation by exploiting its structural-level and stimulus-level parallelism. However, because sequential designs contain feedback loops, achieving data-level parallelism within a single testbench remains a significant challenge. This paper proposes a novel method for accelerating RTL simulation on GPUs. Its core techniques are the identification and unfolding of feedback loops in RTL designs and an RTL circuit partitioning technique built on that unfolding. Together, circuit partitioning and loop unfolding exploit GPU parallelism to accelerate RTL simulation through both structural parallelism and data parallelism within a single testbench. Experimental results show that the proposed method achieves a speedup of 1.2 to 107.1 times over traditional GPU-based RTL simulation methods, and of 2.2 to 14 times over ESSENT, the fastest RTL simulator currently available.

Optimization of isoline and isosurface extraction algorithm based on domestic heterogeneous many-core processors

ZHANG Yuanyin, XIAO Minguang, LIU Zhiyong, WENG Lingling, CHEN Zhiguang, LU Yutong

2025, 47(2): 200-209. doi:

Abstract ( 621 )

PDF (1993KB) ( 850 ) 　　

The MT-3000 is a domestic heterogeneous many-core processor designed by the National University of Defense Technology for the next generation of supercomputers. It has superior computing power and can effectively accelerate data processing in visualization. Isoline and isosurface extraction is the most common geometric visualization method for scalar field data. However, existing extraction algorithms typically target general-purpose CPUs or GPUs. On MT-3000 processors, their computing efficiency is low because of the limited on-chip cache space, the restricted memory-access bandwidth of the cores, and similar constraints. In addition, because of the processor's unique programming model, existing software and methods cannot run on MT-3000 processors directly. To make full use of the computing power of domestic supercomputing systems in the field of visualization, this paper implements new parallel versions of the grid scan algorithm for isoline extraction and the marching cubes algorithm for isosurface extraction, based on the hardware characteristics of the MT-3000. Techniques such as vector instructions and pipelining are used to better fit the many-core architecture and thereby improve performance. The experimental results show a speedup of more than 4 times, and the execution time of both algorithms decreases nearly linearly as the number of cores increases, which demonstrates the scalability of the algorithms.
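As a minimal illustration of the marching cubes step mentioned in this abstract, the sketch below computes the 8-bit case index of a single cubic cell, which is the per-cell classification that the standard algorithm performs before looking up triangles. It is not the paper's MT-3000 implementation: the function names, the corner ordering, and the "below the isovalue sets the bit" convention are assumptions made for this example.

```python
def cube_case_index(corner_values, isovalue):
    """Return the 8-bit marching cubes case index for one cubic cell.

    corner_values holds the scalar field sampled at the 8 cell corners.
    Assumed convention: bit i is set when corner i lies below the isovalue.
    """
    if len(corner_values) != 8:
        raise ValueError("a cube cell has exactly 8 corners")
    index = 0
    for bit, value in enumerate(corner_values):
        if value < isovalue:
            index |= 1 << bit
    return index


def cell_is_crossed(corner_values, isovalue):
    """A cell contributes triangles only if its corners straddle the isovalue."""
    return cube_case_index(corner_values, isovalue) not in (0, 255)
```

Each cell is classified independently of its neighbours, which is why this step lends itself to being spread across many cores.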

Research on key technologies of Geant4 integration based on FastCAE

YU Haohao, TANG Bin

2025, 47(2): 210-218. doi:

Abstract ( 443 )

PDF (1584KB) ( 501 ) 　　

The Monte Carlo application toolkit Geant4 is primarily used for simulating the physical processes of particle transport in matter. It is widely applied in fields such as space applications, radiation medicine, and accelerator physics. However, the default user interface of Geant4 is simple, and the input files and commands can be cumbersome, resulting in poor usability. Firstly, leveraging the open-source FastCAE platform, which integrates pre- and post-processing, key technical research was conducted to integrate the Geant4 solver. This research includes the development of a simulation software solution that integrates geometry modeling, mesh generation, solver computation, and post-processing visualization. Secondly, to address the issue of converting geometric models into physical geometries within the integration, two file conversion methods, namely “geometry” and “mesh”, were developed. Additionally, Geant4 calculation results were visualized by converting the result files in vtu and vtp formats into vtk files. Lastly, a proton therapy case study was implemented to carry out the complete Geant4 integration process, demonstrating the effectiveness and usability of the developed mesh conversion methods and visualization techniques. The proposed solution can improve the efficiency of Geant4 development and accelerate the process of productization.

High-performance processor design based on dynamic timing slack exploitation

LIAN Zihan, HE Weifeng

2025, 47(2): 219-227. doi:

Abstract ( 464 )

PDF (1037KB) ( 546 ) 　　

Conventional synchronous circuit design methods determine the operating frequency based on the critical path identified through static timing analysis. However, the critical path is not exercised in every cycle, which leaves dynamic timing slack between the critical path and the path that is actually activated. Therefore, a high-performance processor design method based on instruction-level timing slack exploitation is proposed, aiming to maximize the exploitation of dynamic timing slack for performance improvement. An automated timing analysis platform is built to obtain instruction timing. A timing encoding strategy is designed to transmit timing information to the hardware through instruction encoding without increasing hardware overhead. Additionally, a timing decoding and arbitration circuit is designed at the hardware level to adjust the clock cycle according to the instruction timing encoding, thereby achieving instruction-level dynamic timing slack exploitation. The proposed method is verified by simulation on a superscalar processor based on the RISC-V instruction set. The results show that, compared to traditional design methods, this method can achieve a maximum performance improvement of 31%.

Design and implementation of a high-radix SRT cube root algorithm with radix-4

ZHAO Caihong, LIU Zixuan, ZHOU Jiantao

2025, 47(2): 228-237. doi:

Abstract ( 402 )

PDF (1085KB) ( 857 ) 　　

The SRT cube root algorithm plays a significant role in fields such as multimedia and computer graphics. Although existing algorithms can accelerate computation by increasing the radix, they still face issues such as a lack of initialization processing, complex design of the quotient digit selection table, and implementation difficulties. This paper designs and implements an SRT cube root algorithm with radix-4. Firstly, a high-radix SRT cube root initialization algorithm is proposed to ensure the feasibility of subsequent iterative calculations; a quotient digit selection table is designed for the radix-4 SRT cube root algorithm to provide the necessary conditions for quotient digit selection; and the on-the-fly conversion algorithm is optimized to avoid multiple carries during the conversion process. Secondly, the above radix-4 SRT cube root algorithm is improved and implemented with the PyRTL tool, effectively mitigating the implementation challenges of high-radix SRT cube root algorithms. Finally, a comparison with the existing radix-2 SRT cube root algorithm demonstrates the effectiveness and superiority of the proposed algorithm.

rtTorTIM: A real-time Tor traffic identification method based on multi-modal feature fusion and Stacking ensemble learning

WANG Yufei, LIU Qiang, ZHANG Weizhen, WU Xiaojie, LI Jiawen, WANG Yuheng

2025, 47(2): 238-246. doi:

Abstract ( 626 )

PDF (988KB) ( 809 ) 　　

The Tor network, as a representative anonymous network, offers strong privacy protection while also providing a breeding ground for cybercriminal activities. Therefore, research on real-time and high-precision identification of Tor network traffic is of great practical significance. To address the weak generalization and poor real-time performance of existing research, a Tor network traffic identification method called rtTorTIM, based on multi-modal feature fusion and Stacking ensemble learning, is proposed. Specifically, the method first extracts features from three modalities of Tor network traffic (host-level, stream-level, and packet-level) and then constructs a feature dataset. Random forest, linear regression, and K-nearest neighbor methods are subsequently selected as base learners, along with a linear neural network for decision fusion, to construct a two-layer Stacking traffic classifier. Comparative experimental results on the ISCX Tor 2016 public dataset show that the accuracy, precision, and recall of the rtTorTIM method are all 99% in Tor traffic identification, while the method also demonstrates better real-time classification performance.

An RFID mutual authentication protocol based on a novel confusion operation

JIA Haozhou, XU Peng, WANG Danchen, XU Yang

2025, 47(2): 247-255. doi:

Abstract ( 429 )

PDF (1485KB) ( 627 ) 　　

To address the privacy and security issues in RFID systems, an ultra-lightweight RFID authentication protocol based on a novel confusion operation is proposed. This protocol achieves low complexity and high security by using only simple bitwise XOR, circular left rotation, and a newly proposed ultra-lightweight grouping cyclic operation. Additionally, since the messages exchanged during the protocol interaction are generated from random numbers, attackers cannot crack the messages by exhaustive search. Analysis and verification results show that the protocol can effectively resist common types of network attacks and has advantages in terms of computation and storage costs.
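The two standard primitives named in this abstract, bitwise XOR and circular left rotation, can be sketched on fixed-width words as follows. This is a generic illustration only: the helper names and the explicit `width` parameter are assumptions, and the paper's new grouping cyclic operation is not reproduced here because the abstract does not define it.

```python
def rotl(value, shift, width):
    """Circularly rotate a width-bit word left by shift positions."""
    mask = (1 << width) - 1
    shift %= width
    value &= mask
    # Bits shifted out on the left re-enter on the right.
    return ((value << shift) | (value >> (width - shift))) & mask


def xor_words(a, b, width):
    """Bitwise XOR of two words, truncated to width bits."""
    return (a ^ b) & ((1 << width) - 1)
```

Both operations reduce to a few shifts and gates, which is consistent with the abstract's emphasis on low computation cost.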

Research on intrusion detection method based on SAE and WGAN

LIU Yongmin, XU Cheng, HUANG Hao, ZHANG Qianlei, ZHAO Junjie

2025, 47(2): 256-264. doi:

Abstract ( 650 )

PDF (1180KB) ( 500 ) 　　

In recent years, the rapid development of machine learning (ML) and deep learning (DL) technologies has led to increasing research on their application in intrusion detection systems (IDS). However, current datasets in the field of intrusion detection suffer from feature redundancy and an imbalance in the number of samples across different attack categories. To solve these problems, a network anomaly detection method based on a stacked autoencoder (SAE) and a Wasserstein generative adversarial network (WGAN) is proposed. Firstly, to address the problem of feature redundancy, this paper employs the encoding-hidden layer-decoding concept of SAEs for data dimensionality reduction. This approach refines the features and extracts lower-dimensional features that are more suitable for classification. Secondly, to tackle the issue of sample imbalance (limited data volume and diversity), the processed data is used as input for the generator in the WGAN model. The generative capabilities of the generative adversarial network are used for sample augmentation, thereby compensating for the lack of certain types of samples during the training of the classification model. Finally, a random forest (RF) classification model is used for detection. Experimental results on the NSL-KDD dataset show that the SAE-WGAN-RF model, which is based on the proposed method, achieves an F1-Score of 95.58%, Recall of 96.54%, and Precision of 96.03%, representing significant improvements over common classical algorithms.

Privacy-preserving gene testing based on deep neural network

HUANG Ying, TANG Min

2025, 47(2): 265-275. doi:

Abstract ( 545 )

PDF (1471KB) ( 696 ) 　　

Deep neural networks (DNNs) are powerful and widely used for gene testing tasks in biomedical fields. Building a reliable DNN model requires a large number of valid medical samples, while in reality, highly private biological data are usually stored in a decentralized manner. Existing solutions struggle to achieve both data security and high model accuracy when dealing with such distributed, large-scale, and complex learning tasks. To mitigate this problem, a novel privacy-preserving scheme based on the DNN model is proposed, which combines multiple data sources and quickly constructs a high-precision gene testing model. Firstly, the mask matrix is combined with functional encryption for inner product to eliminate the approximate substitution strategies required in schemes such as fully homomorphic encryption and secret sharing, thereby achieving consistency between privacy-preserving and centralized DNN training. Secondly, a non-interactive DNN training mode is constructed to resist the inference attacks caused by leakage of global model parameters, ensuring the security of the data. Experimental results on real medical datasets demonstrate the correctness, effectiveness, and high accuracy of the proposed scheme.

PCB surface defect dataset and detection based on YOLOv5s-P6SE

LIANG Tairan, JIANG Shixin, LI Quanzhou, OUYANG Bin, Lv Shengping

2025, 47(2): 276-287. doi:

Abstract ( 766 )

PDF (2409KB) ( 1037 ) 　　

To address the demand for surface defect detection in PCB production, a defect classification standard encompassing 11 categories was established based on actual workshop conditions, and images of real PCB surface defects were collected to construct a dataset named Dataset_PCBSD, containing 3,239 images with 4,672 defective objects. A new PCB surface defect detection model, YOLOv5s-P6SE, was developed based on improvements to YOLOv5s. To enhance detection accuracy, a P6 detection layer for detecting extremely large objects was added to YOLOv5s, along with the SE attention module and soft non-maximum suppression post-processing. Experimental results show that YOLOv5s-P6SE achieves a 5.5% improvement in mean average precision (mAP) compared to the baseline model YOLOv5s. Additionally, YOLOv5s-P6SE outperforms Faster R-CNN, SSD, the PCB defect detection model YOLOv4-MN3, and the DETR model RT-DETR-L in terms of both mAP and model size. It also achieves a better balance between mAP and model size than YOLOv8s.
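The soft non-maximum suppression step mentioned in this abstract can be sketched as follows. This is a minimal pure-Python illustration of the general technique, not the authors' code: the Gaussian decay variant, the `sigma` and `score_threshold` defaults, and the `(x1, y1, x2, y2)` box layout are all assumptions, since the abstract does not say which variant or settings were used.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian soft-NMS: decay overlapping scores instead of deleting boxes."""
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        # Pick the highest-scoring box that is still in play.
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_box, best_score = remaining.pop(best)
        kept.append((best_box, best_score))
        decayed = []
        for box, score in remaining:
            # Heavier overlap with the chosen box means a stronger penalty.
            score *= math.exp(-(iou(best_box, box) ** 2) / sigma)
            if score >= score_threshold:
                decayed.append((box, score))
        remaining = decayed
    return kept
```

Unlike hard NMS, a heavily overlapped box survives with a reduced score, which can help when defects sit close together.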

Robust image hiding by invertible generative adversarial network

XU Tianyou, GAO Guangyong

2025, 47(2): 288-297. doi:

Abstract ( 472 )

PDF (2022KB) ( 1056 ) 　　

The purpose of image hiding is to hide a secret image in a cover image so that the secret image remains imperceptible to the human eye but can be restored when needed. Previous image hiding methods are limited in hiding capacity and robustness, and they are often susceptible to distortion during transmission. Therefore, this paper proposes a model called RIHIGAN. It uses the same network in its forward and backward processes to achieve image hiding and restoration. In the invertible network module, the model's image reconstruction ability is enhanced by combining attention mechanisms. On the basis of the invertible network, the architecture of generative adversarial networks is introduced. At the same time, the structure of the discriminator is improved by incorporating residual blocks to enhance its discrimination ability. The experimental results show that RIHIGAN effectively enhances robustness while maintaining recovery rate and invisibility.

Focusing paradigm prompt learning of segment anything for unsupervised video object segmentation

SHEN Yonghui, BU Dongxu, ZHANG Shengyu, SONG Huihui

2025, 47(2): 298-307. doi:

Abstract ( 419 )

PDF (2618KB) ( 494 ) 　　

unsupervised video object segmentation; focusing learning; segment anything model

Cephalometric anatomical landmark localization model based on appearance token and landmark token

LU Gang, XIAO Jinmei, WANG Xiangwen, JIANG Yun, LIN Xianghong

2025, 47(2): 308-316. doi:

Abstract ( 484 )

PDF (1195KB) ( 552 ) 　　

Existing deep learning models are still unable to accurately and reliably locate anatomical landmark points on 2D cephalometric X-ray images. To address this issue, this paper proposes a localization model for cephalometric measurement based on appearance tokens and landmark tokens. Firstly, fixed-size image patches of different resolutions are sampled from the original image and input into a feature extraction network to extract multi-scale features. Then, these features are converted into appearance tokens through linear projection and, together with landmark tokens, input into a relational reasoning layer. This allows the landmark tokens to learn the intrinsic relationships between the appearance tokens and the landmarks in the inference layer. Finally, through multiple iterative inferences, the model moves the initial points from coarse to fine in a cascaded manner towards the target. Compared with advanced baseline models, the proposed model demonstrates superior localization performance on public cephalometric X-ray images.

A PCB defect detection algorithm based on improved ESP-YOLO

WANG Haiqun, WANG Bingnan, GE Chao

2025, 47(2): 317-326. doi:

Abstract ( 556 )

PDF (1093KB) ( 1191 ) 　　

Defect detection of PCBs is a crucial means of ensuring their quality. To avoid missed and false detections and to increase the speed of PCB defect detection, an improved ESP-YOLO algorithm for PCB defect detection is proposed. This algorithm incorporates the ESP network structure, using ESP blocks for downsampling, and improves the feature extraction module by adopting a lighter network structure, thereby solving the problem that large PCB defect detection models are difficult to deploy. Additionally, the parameter-free attention mechanism SimAM is introduced to increase the algorithm’s focus on targets in complex environments without increasing the number of network parameters, addressing missed PCB defect detections caused by complex backgrounds. Furthermore, the RFB multi-scale feature extraction module is introduced to expand the model’s receptive field and improve its multi-scale feature extraction capability, addressing missed detections caused by varying defect sizes. A learnable parametric feature fusion module, BiFPN, is also introduced to enhance the feature representation ability of the fused feature map. Experimental results show that the ESP-YOLO algorithm has 5.32×10^6 parameters and 11.2 GFLOPs, reductions of 23.8% and 29.1%, respectively, compared to the original YOLOv5s algorithm. The average accuracy is 97.8%, an improvement of 3.2% over the original algorithm.
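The reduction percentages in this abstract can be turned back into implied baseline figures with one line of arithmetic. The sketch below assumes the 23.8% and 29.1% figures are relative reductions from the original YOLOv5s values; the resulting baseline numbers are back-calculated from the abstract and rounded, not quoted from the paper, and the function name is hypothetical.

```python
def baseline_from_reduction(reduced_value, reduction_percent):
    """Recover the original value, given the value after a relative reduction."""
    return reduced_value / (1.0 - reduction_percent / 100.0)


# Figures stated in the abstract for ESP-YOLO.
esp_params = 5.32e6   # parameter count
esp_gflops = 11.2     # GFLOPs

# Implied YOLOv5s baseline (derived here, assuming relative reductions).
baseline_params = baseline_from_reduction(esp_params, 23.8)
baseline_gflops = baseline_from_reduction(esp_gflops, 29.1)

print(f"implied baseline parameters: {baseline_params / 1e6:.2f} x 10^6")
print(f"implied baseline GFLOPs: {baseline_gflops:.1f}")
```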

A clustering algorithm based on the multi-level density center graph

LU Jianyun, SHAO Junming

2025, 47(2): 327-335. doi:

Abstract ( 469 )

PDF (1852KB) ( 357 ) 　　

Density-based clustering partitions a dataset according to the density relationships among data objects. By determining the membership relationships between low-density objects and density-center objects within the dataset, density-based clustering can effectively handle clusters of various sizes, shapes, and densities. However, because of variable densities, noise, and complex distributions within datasets, accurately estimating the local density of data objects and determining the number of clusters through density centers remain challenges that require further research. To address these issues, a clustering algorithm based on the multi-level density center graph (CMDCG) is proposed. Firstly, the local density of each data object is calculated using information entropy based on its neighborhood. Secondly, the membership relationships of each data object are statistically analyzed according to its local density and neighborhood space, and density centers are determined. Finally, multi-level density centers are obtained by varying the neighborhood space, and a graph structure is constructed from the membership relationships among these multi-level density centers. The connected components of the graph are identified as initial clusters, and other data objects are assigned to these initial clusters based on their membership relationships. Experimental results on both synthetic and real datasets demonstrate that the CMDCG algorithm can accurately identify the number of clusters and form correct initial clusters, with clustering results that are robust to varying densities and noise.
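The final step described in this abstract, taking the connected components of the density center graph as initial clusters, can be sketched with a small union-find routine. This is an illustration of that single step under assumed inputs: nodes stand for density centers numbered from 0, and `edges` stands for whatever membership links the graph construction produced. The entropy-based density estimate and the graph construction itself are not reproduced.

```python
def connected_components(num_nodes, edges):
    """Group nodes 0..num_nodes-1 into connected components via union-find."""
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())
```

The number of components returned is the number of initial clusters, so the cluster count follows from the graph rather than being supplied as a parameter.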

Trajectory-user linking based on contextual global spatial graph

HOU Xuan, LIANG Zhizhen, ZHANG Lei, LIU Bailong, ZHANG Xuefei

2025, 47(2): 336-348. doi:

Abstract ( 413 )

PDF (2388KB) ( 697 ) 　　

Trajectory-user linking (TUL) refers to determining the user to whom a target trajectory belongs and has become an important trajectory data mining task. Although deep learning-based models have made significant progress in TUL research, existing approaches mainly focus on the basic spatiotemporal features of individual trajectory points, neglecting global spatial correlation, contextual information, and users' multi-periodic movement patterns, which results in low TUL accuracy. To address this, a trajectory-user linking model based on a contextual global spatial graph (CGSG-TUL) is proposed. For location embedding, a contextual global spatial graph is constructed from historical trajectories, incorporating contextual information such as the proximity relationships and categories of all locations. This effectively models the spatial correlations of locations. For time encoding, the timestamps of check-ins are encoded at different time scales to capture users' multi-periodic movement patterns. Experimental results on two real datasets, Foursquare-NYK and Foursquare-TKY, demonstrate that CGSG-TUL outperforms the state-of-the-art baseline model GNNTUL, with an average improvement of 2.50% and 2.72% in terms of ACC@1 and Macro-F1.
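Encoding check-in timestamps at several time scales, as this abstract describes, might look like the sketch below. The specific scales chosen here (hour of day, day of week, weekend flag) and the field names are assumptions for illustration; the abstract does not state which scales CGSG-TUL uses.

```python
from datetime import datetime


def encode_timestamp(ts):
    """Map one check-in timestamp to features at several (assumed) time scales."""
    day_of_week = ts.weekday()  # Monday is 0, Sunday is 6
    return {
        "hour_of_day": ts.hour,          # daily periodicity
        "day_of_week": day_of_week,      # weekly periodicity
        "is_weekend": day_of_week >= 5,  # workday versus weekend pattern
    }


# Example: a check-in on Monday, 6 January 2025 at 14:30.
features = encode_timestamp(datetime(2025, 1, 6, 14, 30))
```

Each field exposes a different period of the same timestamp, so a model can pick up routines that repeat daily as well as ones that repeat weekly.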

A commonsense question answering method based on multi-source knowledge infusion

ZHU Jiajun, BAO Meikai, ZHANG Kai, LIU Ye, LIU Qi

2025, 47(2): 349-360. doi:

Abstract ( 429 )

PDF (1404KB) ( 1054 ) 　　

Commonsense question answering aims to have models answer questions that require human commonsense knowledge. One category of methods for this task retrieves relevant knowledge to assist the model in answering commonsense questions. This category of methods is mainly divided into two steps: knowledge retrieval and knowledge inference. Knowledge retrieval refers to retrieving the knowledge associated with the question, while knowledge inference refers to using the retrieved knowledge to answer commonsense questions. One of the challenges facing commonsense question answering is therefore how to find appropriate external knowledge to help answer the question. Many existing commonsense question answering models rely on a single source of external knowledge, but it is difficult for a single source to cover all the required knowledge comprehensively. To address this problem, this paper proposes a commonsense question answering method based on multi-source knowledge infusion. Firstly, to cope with the knowledge coverage problem during knowledge retrieval, pretrained language models are used to integrate knowledge from multiple sources (including structured and unstructured knowledge) into a unified knowledge representation. Secondly, to make full use of the semantic relations embedded in structured knowledge during knowledge inference, the model identifies entity concepts and the relationship paths between entities in the context to construct an entity relationship graph, and then uses a graph attention network to model this graph. Finally, the evidence information in the entity relationship graph and the entity knowledge representations are used to reason and answer the questions.
The experimental results on the CommonsenseQA dataset show that the accuracy of the commonsense question answering method based on multi-source knowledge infusion is 79.20% and 75.02% on the verification set and test set, respectively, which exceeds the best baseline models. This verifies the effectiveness of the multi-source knowledge infusion method in commonsense question answering tasks.

A citation recommendation method based on dual-channel heterogeneous hypergraph neural networks

LI Ruihong, LI Xiaohong, YAO Jin, WANG Shanshan

2025, 47(2): 361-369. doi:

Abstract ( 452 )

PDF (839KB) ( 491 ) 　　

Existing citation recommendation methods primarily focus on modeling binary relationships using graph structures and do not sufficiently represent the diversity of node types and interaction relationships. To address this issue, a citation recommendation method based on dual-channel heterogeneous hypergraph neural networks is proposed. Firstly, a heterogeneous graph is constructed, and convolutional neural networks (CNNs) and Transformers are used to encode the local and global semantic features, respectively, of each node in the heterogeneous graph, obtaining structural representations of the target node on the heterogeneous graph channel. Secondly, multiple types of hyperedges are designed to expand heterogeneous data information. Thirdly, a hypergraph is used to encode interactions between nodes, and a hypergraph neural network is employed to capture potential complex high-order semantic relationships in the hypergraph, obtaining semantic representations of the target node on the hypergraph channel. Finally, information from the two channels is aggregated to obtain the final semantic representation of the target node. The correlation between the target paper node and candidate paper nodes is calculated to generate a citation recommendation list. Experimental results on the DBLP and PubMed datasets demonstrate that the proposed method can effectively improve the quality of citation recommendations and achieve better recommendation outcomes.
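The last step of this abstract, scoring candidate papers against the target paper and producing a ranked list, can be sketched as below. Cosine similarity is used here purely as an assumed stand-in for the "correlation" the abstract mentions; the function names, the dictionary of candidate embeddings, and the toy vectors are hypothetical, and the dual-channel encoders that would produce real embeddings are not reproduced.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(target, candidates, top_k=3):
    """Rank candidate papers by similarity to the target representation."""
    ranked = sorted(
        candidates.items(),
        key=lambda item: cosine(target, item[1]),
        reverse=True,
    )
    return [paper_id for paper_id, _ in ranked[:top_k]]
```

For example, `recommend([1.0, 0.0], {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}, top_k=2)` ranks paper "a" ahead of "c" and leaves out "b".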

A hierarchical clustering algorithm based on partitioning natural neighborhood graph

CAI Fapeng, FENG Ji, YANG Degang, CHEN Zhongshang

2025, 47(2): 370-380. doi:

Abstract ( 386 )

PDF (3115KB) ( 752 ) 　　

The natural neighborhood graph can adaptively identify data with different shapes, sizes, and dimensions. However, some small clusters cannot be correctly identified by the algorithm when dealing with data of uneven density and complex structure. To address this issue, a hierarchical clustering algorithm based on natural neighborhood graph partitioning (HC-PNNG) is proposed. The algorithm first constructs natural sparse graphs using the natural neighbor relationship. Subsequently, it hierarchically merges the natural sparse graphs based on the similarity between graphs, thereby achieving more universally applicable hierarchical clustering results. Comparative experiments were conducted on synthetic and real datasets, comparing the proposed algorithm with the latest clustering algorithms. The results demonstrate that the proposed algorithm significantly outperforms other clustering algorithms, verifying its effectiveness.
