60th Anniversary Special Feature — Prospective Commentary
-
Discussion on the development of HPC-AI converged high performance computing technologies
- LU Xicheng, YANG Bo, LIU Jie, HUANG Libo, CHEN Xinhai
-
2026, 48(4): 571-579.
-
The technological evolution of high-performance computing (HPC) has always been closely intertwined with strategic demands in fields such as national defense and military affairs, fundamental science, and industrial engineering. Its development can be broadly divided into four key stages: dedicated vector machines, massively parallel computers, heterogeneous parallel computers, and HPC-AI converged computers. Each stage has brought advances in system architecture, software ecosystems, and application paradigms. Currently, HPC is undergoing a profound paradigm shift driven by artificial intelligence. “AI for Science” has emerged as a new scientific research paradigm, in which the high-performance, high-precision characteristics of scientific computing and the high-performance, mixed-precision characteristics of intelligent computing are converging deeply. This convergence poses formidable challenges to underlying computing architectures in terms of precision coordination, data exchange, and I/O pattern adaptation. Looking ahead to the development of HPC-AI converged computing technologies, the competitive focus is shifting from peak floating-point performance alone toward a comprehensive consideration of data movement efficiency, energy-performance ratio, and system scalability. Tighter integration among computing units, more efficient data flow, and more unified programming abstractions will become crucial features of next-generation HPC systems. The CPU-SIMT converged computing architecture, a promising HPC-AI converged computing architecture, employs a solution combining “converged computing architecture + hierarchical interconnection networks + converged parallel storage”. This solution is expected to break through the “communication wall” bottleneck in tightly coupled HPC-AI converged computing applications, offering a new technological pathway for building next-generation HPC systems and efficiently supporting applications under the emerging “AI for Science” computing paradigm.
High Performance Computing
-
HSI:A high-bandwidth and low-latency protocol conversion mechanism for multiple chiplets
- WANG Yong, YANG Qianming, FU Wenwen, WANG Yongwen
-
2026, 48(4): 580-589.
-
Chiplet technology has emerged as a promising approach to extend Moore’s Law and enhance chip performance due to its low cost, high yield, and high integration density. Currently, research on inter-chiplet transmission primarily focuses on high-speed interconnect interfaces, while protocol conversion from the network-on-chip (NoC) to chiplet interfaces remains underexplored, creating a bottleneck for transmission latency and bandwidth between chiplets. This paper proposes a high-bandwidth and low-latency protocol conversion mechanism, named HSI. HSI employs a combined polling scheduling strategy to read multiple types of Flits from the NoC, thereby reducing transmission latency and enhancing bandwidth. It uses a multi-slice packet format to encapsulate Flits, improving effective bandwidth utilization, and adopts a multi-write single-read queue structure to support parallel memory access for multiple Flits, reducing parsing latency. To validate HSI, this paper implements and verifies the mechanism with the mainstream CHI network protocol and the UCIe chiplet interface protocol. The results demonstrate that HSI achieves a transmission bandwidth of up to 512 Gbit/s, which is compatible with the transmission bandwidth of 32-lane UCIe and the memory access bandwidth of DDR5.0. Moreover, the transmission latency for a single Flit is only 6.05 ns, while the average transmission latency for burst Flit streams ranges from 1.2 to 1.7 ns.
-
Depth-driven graph partitioning for critical path delay optimization
- YU Xuewen, CHEN Haiyan, HUANG Pengcheng
-
2026, 48(4): 590-598.
-
In microprocessor design, critical path delay is a key factor limiting increases in clock frequency and performance. Ever-increasing design complexity challenges traditional optimization methods. To address this issue, an automated critical path delay optimization strategy based on depth-driven graph partitioning is proposed, and the corresponding algorithm is implemented. The delay optimization problem is modeled as a directed acyclic graph (DAG) partitioning and selection problem. Starting from the logic netlist produced in the semi-custom design flow, the strategy uses depth-driven graph partitioning to identify and select a set of sub-circuit structures with optimization potential. These sub-circuits then undergo logical reconstruction, and the corresponding logic cells in the netlist are replaced accordingly. Experimental results demonstrate that the proposed algorithm can optimize circuits produced by electronic design automation (EDA) tools, effectively reducing the logical depth along critical paths. It thus provides an effective strategy for optimizing critical path delay at limited cost, with the aim of improving microprocessor performance.
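As a rough illustration of the quantity being reduced, the logical depth of a netlist modeled as a DAG can be computed with a topological traversal. This is only a sketch of the depth measure, not the partitioning and selection algorithm from the paper; the `logic_depth` helper and the toy netlist are hypothetical and introduced purely for illustration.

```python
from collections import deque


def logic_depth(fanout):
    """Return {node: depth} for a DAG given as {node: [successor, ...]}.

    Depth is the number of edges on the longest path from any primary input
    (a node without predecessors) to the node. Hypothetical helper.
    """
    nodes = set(fanout)
    for succs in fanout.values():
        nodes.update(succs)

    # Count incoming edges so nodes can be processed in topological order.
    indegree = {n: 0 for n in nodes}
    for succs in fanout.values():
        for s in succs:
            indegree[s] += 1

    depth = {n: 0 for n in nodes}
    ready = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while ready:
        n = ready.popleft()
        visited += 1
        for s in fanout.get(n, []):
            depth[s] = max(depth[s], depth[n] + 1)
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)

    if visited != len(nodes):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return depth


# Toy netlist (assumed data): inputs a, b, c feed gates g1..g3; g3 drives y.
netlist = {
    "a": ["g1"], "b": ["g1", "g2"], "c": ["g2"],
    "g1": ["g3"], "g2": ["g3"], "g3": ["y"],
}
depths = logic_depth(netlist)
critical_depth = max(depths.values())
```

A depth-driven method would presumably use such per-node depths to decide which sub-circuits lie on deep paths and are worth restructuring; how the paper scores and selects partitions is not stated in the abstract.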
-
An efficient large language model inference method for bandwidth-constrained digital signal processors
- CHEN Yang, YANG Xi, SU Huayou, CHEN Kangkang
-
2026, 48(4): 599-607.
-
With the rise of large language models (LLMs), the parameter scale of neural network models has grown exponentially, reaching the order of hundreds of billions or even trillions, posing immense challenges to the computing power and bandwidth of computational devices for model inference tasks. To achieve high-performance LLM inference on low-bandwidth devices, this study focuses on bandwidth-constrained, long-vector digital signal processor (DSP) architectures, designing and implementing efficient LLM inference methods. It proposes a tensor shape-aware low-precision matrix multiplication method that fully leverages the DSP’s computational capabilities while reducing memory access pressure. Additionally, it introduces a data dependency-based operator fusion method to minimize the transmission of intermediate temporary data and employs a deferred operator execution method to enhance the core execution efficiency of DSP devices. Experimental results demonstrate that this approach effectively improves the inference performance of large models on bandwidth-constrained DSP devices. Compared to conventional implementations, the optimized inference method achieves a speedup ranging from 1.4 to 2.3 times. Furthermore, compared to multi-core ARM CPUs and Intel Xeon Gold CPUs with higher memory bandwidth, LLM inference achieves speedups of 2.5 times and 1.5 times, respectively, with the same number of cores.
-
A high-speed multiplier based on reconfigurable low-power processing
- CHEN Yifan, YANG Yuheng, JIANG Yanfeng, CAI Mengye
-
2026, 48(4): 608-616.
-
To address the high latency and high power consumption of traditional radix-4 Booth-encoded multipliers, this paper presents a low-power, high-speed multiplier based on an improved Booth encoding scheme. The multiplier employs an improved radix-4 Booth encoding method and uses an advance zero encoding module to mitigate the power losses caused by conventional encoding. Additionally, a preprocessing approach is adopted to add sign extension bits, thereby reducing critical path delay. By optimizing the generation rules for the partial product array, the number of compressors is reduced. Furthermore, through enhancements to the compressor structure and the adoption of a reconfigurable compression design, the critical path is shortened and the overall power consumption of the compression tree is reduced. The designed multiplier is implemented in a 180 nm process and synthesized with Design Compiler. For a 32-bit multiplier employing this architecture, the critical path delay is 6.73 ns, the circuit area is 116 736 μm², and the overall power consumption, obtained from 5 000 randomly generated sets of numbers, is 13 838 μW.
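For readers unfamiliar with the baseline, the conventional radix-4 Booth recoding that the paper sets out to improve can be sketched in software as follows. This is the textbook scheme, not the improved encoding or the advance zero encoding module described above; the function names are hypothetical.

```python
def booth_radix4_digits(multiplier, bits=32):
    """Recode a signed two's-complement multiplier into radix-4 Booth digits.

    Returns digits d_k in {-2, -1, 0, 1, 2}, least significant first, such
    that multiplier == sum(d_k * 4**k). `bits` must be even.
    """
    if bits % 2:
        raise ValueError("bits must be even for radix-4 recoding")
    y = multiplier & ((1 << bits) - 1)  # two's-complement bit pattern
    digits = []
    prev = 0  # implicit bit y[-1] = 0
    for i in range(0, bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        # Overlapping triplet (y[i+1], y[i], y[i-1]) selects the digit.
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(multiplicand, multiplier, bits=32):
    """Multiply by summing one partial product per Booth digit."""
    digits = booth_radix4_digits(multiplier, bits)
    return sum(d * multiplicand * 4 ** k for k, d in enumerate(digits))


digits = booth_radix4_digits(6, bits=4)      # [-2, 2], since -2 + 2*4 == 6
product = booth_multiply(7, -3, bits=8)      # -21
zero_digits = booth_radix4_digits(0x00010000, bits=32).count(0)
```

Recoding halves the number of partial products (16 for a 32-bit multiplier), and every zero digit corresponds to a partial product that contributes nothing, which is presumably the kind of case a zero-detection stage can exploit to save switching power.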
Computer Network and Information Security
-
A reinforcement learning-based method for generating adversarial examples against PE malware
- ZHANG Chaoran, MA Yuqi, ZHANG Sanfeng, YANG Wang
-
2026, 48(4): 617-627.
-
This paper proposes a reinforcement learning-based method for generating adversarial examples against PE malware. First, it treats the generation of adversarial examples for PE malware as a sequence-to-sequence generation task: sequences are modeled on an offline reinforcement learning dataset, and the sequence generation capability of the Transformer is leveraged by predicting an action at each step to build the sequence incrementally. Furthermore, an information transmission mechanism is introduced to facilitate cross-episode information transfer during the reinforcement learning process, enhancing data efficiency. Experimental results demonstrate that the PE malware adversarial examples generated by this method achieve a higher evasion rate than those of the compared methods and exhibit transferability.
-
Multivariate time series anomaly detection based on multi-view feature contrastive learning
- QIU Hongrui, WANG Chaoqun, LIU Yi, LUO Yu
-
2026, 48(4): 628-639.
-
To address the challenges of complex temporal dependencies, scarcity of anomalous samples, and the underutilization of frequency-domain information from time-series data in existing models for multivariate time series anomaly detection, this paper proposes a multivariate time series anomaly detection model based on multi-view feature contrastive learning. The model constructs dual feature channels by learning representations of both time-domain and frequency-domain information, and employs a pure contrastive loss to guide the learning process. Additionally, a block-based strategy and graph attention mechanism are adopted in the design of the time-domain channel, while the analysis of temporal variations is extended to a two-dimensional space in the frequency-domain channel, utilizing a multi-scale convolutional module to further enhance the representational capacity of time series data, thereby improving anomaly detection accuracy. Experiments on five publicly available multivariate time series datasets demonstrate that the proposed model achieves superior performance in multidimensional time series anomaly detection tasks.
-
A malicious user detection method based on three-way decision in mobile crowdsensing
- LI Zhiwen, WAN Zixuan, ZHAO Guosheng, LIAO Yiwei
-
2026, 48(4): 640-649.
-
Malicious users pose a significant security threat to mobile crowdsensing networks, severely impacting their service performance and data quality. However, existing binary (black-and-white) malicious user detection methods lack mechanisms for handling suspicious users, leaving persistent security vulnerabilities. To address this issue, this paper proposes a malicious user detection method based on three-way decision. First, an evaluation probability function is constructed using user behavior, data quality, and user recommendations as evaluation metrics. Then, the three-way decision method is employed to classify users into three categories: trustworthy users, suspicious users, and malicious users. Finally, the grey correlation analysis method is used to dynamically identify malicious users among the suspicious ones. Simulation experiments demonstrate that the proposed detection method performs well in terms of accuracy, false positive rate, and false negative rate, effectively enhancing the security performance of mobile crowdsensing networks.
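The three-way split can be sketched with the usual pair of thresholds from three-way decision theory. The sketch assumes the evaluation probability is expressed as a probability of being malicious; the thresholds `alpha` and `beta`, their default values, and the function name are hypothetical and not taken from the paper.

```python
def three_way_classify(p_malicious, alpha=0.7, beta=0.3):
    """Classify a user from an (assumed) probability of being malicious.

    p >= alpha        -> "malicious"    (reject region)
    p <= beta         -> "trustworthy"  (accept region)
    beta < p < alpha  -> "suspicious"   (boundary region, decision deferred)
    """
    if not 0.0 <= beta < alpha <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= beta < alpha <= 1")
    if not 0.0 <= p_malicious <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p_malicious >= alpha:
        return "malicious"
    if p_malicious <= beta:
        return "trustworthy"
    return "suspicious"


# Assumed example scores for three users.
scores = {"u1": 0.05, "u2": 0.55, "u3": 0.92}
labels = {user: three_way_classify(p) for user, p in scores.items()}
```

Users that land in the boundary region are the ones the paper then re-examines with grey correlation analysis; that second stage is not shown here.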
-
A reversible hidden bamboo slips image watermarking algorithm based on mean value theorem of divided difference
- LIU Xueyan, LI Xiliang, QI Yujiao, JIA Bolong
-
2026, 48(4): 650-658.
-
Bamboo slips image watermarking is a key technology for copyright protection, tampering detection, and integrity protection of image data. However, most image watermarking keys currently lack supervision by an authoritative institution, and existing schemes fail to achieve a good balance between perceptual transparency and security. This paper takes the digital images of bamboo slips provided by Gansu Bamboo Slips Museum as the research object and proposes a reversible hidden watermarking algorithm based on the mean value theorem of divided difference. The watermark key is generated jointly by the authoritative department and the bamboo slips department. In view of the uneven distribution of texture and color of bamboo slips after excavation, the split watermark key information is embedded into the three RGB channels of the bamboo slips image based on the mean value theorem of divided difference. When the least significant bit (LSB) is replaced, the security and perceptual transparency of the watermark key information are further improved by using Logistic mapping. The experimental results show that the watermark embedded by the proposed algorithm has good invisibility, and the algorithm locates tampered positions in the attacked image with high accuracy.
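The combination of LSB replacement with a Logistic map can be illustrated with a generic sketch: a chaotic bit stream derived from the map scrambles the watermark bits before they overwrite pixel LSBs. This shows only that generic step, not the divided-difference embedding or the key-splitting scheme of the paper; the seed `x0`, the parameter `mu`, and all function names are assumptions.

```python
def logistic_bits(n, x0=0.3141, mu=3.99):
    """Generate n pseudo-random bits from the logistic map x <- mu*x*(1-x)."""
    bits = []
    x = x0
    for _ in range(n):
        x = mu * x * (1.0 - x)
        bits.append(1 if x >= 0.5 else 0)
    return bits


def embed_lsb(pixels, watermark_bits, x0=0.3141, mu=3.99):
    """Replace the LSB of the first len(watermark_bits) pixels.

    Each watermark bit is XORed with a logistic-map bit first, so the
    embedded LSBs look random without the key (x0, mu).
    """
    if len(watermark_bits) > len(pixels):
        raise ValueError("not enough pixels to carry the watermark")
    key = logistic_bits(len(watermark_bits), x0, mu)
    out = list(pixels)
    for i, (w, k) in enumerate(zip(watermark_bits, key)):
        out[i] = (out[i] & ~1) | (w ^ k)
    return out


def extract_lsb(pixels, n_bits, x0=0.3141, mu=3.99):
    """Recover the watermark bits with the same key used for embedding."""
    key = logistic_bits(n_bits, x0, mu)
    return [(p & 1) ^ k for p, k in zip(pixels, key)]


channel = [120, 121, 37, 200, 255, 0, 64, 99]   # assumed 8-bit channel values
mark = [1, 0, 1, 1, 0, 0, 1, 0]
stego = embed_lsb(channel, mark)
recovered = extract_lsb(stego, len(mark))
```

Each pixel changes by at most one gray level, which is why LSB embedding is hard to see. Plain LSB replacement does not by itself restore the original pixels; the reversibility claimed in the abstract comes from the paper's own construction.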
-
Forest areas remote sensing image extraction algorithm with superpixel-based fuzzy C-means
- FENG Dandan, WANG Xiaopeng
-
2026, 48(4): 659-666.
-
Due to factors such as tree species and growing environments, forest areas in remote sensing images exhibit uneven distribution and holes, making it difficult to extract them accurately with traditional fuzzy C-means (FCM) clustering algorithms. To address this issue, a superpixel-based fuzzy C-means method for forest area extraction from remote sensing images is proposed. First, a GAN-based morphological composite filter is employed to fill holes in the forest area remote sensing images. Second, multi-scale morphology is used to move clustering from individual pixels to superpixels, reducing the complexity of the clustering algorithm. Finally, histogram-based fuzzy C-means clustering is applied to the superpixel blocks to extract forest area information. Experimental results on optical forest area remote sensing images demonstrate that the proposed algorithm outperforms several other FCM algorithms on metrics such as segmentation accuracy, normalized mutual information, F1-score, and Kappa coefficient, achieving a highest accuracy (ACC) of 89.05% and an F1-score of 93.95%.
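For reference, the standard FCM iteration that such methods build on alternates between a membership update and a center update. The sketch below runs plain FCM on one-dimensional gray values; it does not include the superpixel, histogram, or GAN-based filtering stages of the paper, and the function names, initialization, and sample data are assumptions.

```python
def fcm_memberships(values, centers, m=2.0, eps=1e-9):
    """Membership u[j][i] of value j in cluster i (each row sums to 1)."""
    exponent = 2.0 / (m - 1.0)
    rows = []
    for x in values:
        # eps avoids division by zero when a value coincides with a center.
        dists = [abs(x - c) + eps for c in centers]
        rows.append([
            1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
            for d_i in dists
        ])
    return rows


def fuzzy_c_means_1d(values, n_clusters=2, m=2.0, n_iter=50):
    """Plain fuzzy C-means on scalar values; returns (centers, memberships)."""
    lo, hi = min(values), max(values)
    # Deterministic initialization: centers spread evenly over the data range.
    centers = [lo + (hi - lo) * (i + 0.5) / n_clusters for i in range(n_clusters)]
    for _ in range(n_iter):
        u = fcm_memberships(values, centers, m)
        centers = [
            sum((row[i] ** m) * x for row, x in zip(u, values))
            / sum(row[i] ** m for row in u)
            for i in range(n_clusters)
        ]
    return centers, fcm_memberships(values, centers, m)


gray_values = [10, 12, 11, 200, 198, 202]   # assumed toy data: two clear groups
centers, memberships = fuzzy_c_means_1d(gray_values)
```

Clustering superpixels or histogram bins instead of individual pixels shrinks the number of items in the inner loops, which is the complexity reduction the abstract refers to.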
-
Color image compression based on multi-grouping absolute moment block truncation coding and sort mapping
- ZHANG Mengtao, XIONG Lizhi
-
2026, 48(4): 667-675.
-
Digital images, as an important carrier, have been applied in various fields. The generation of many color images occupies a large amount of storage space and network bandwidth. Therefore, color image compression has become a key technology. Absolute Moment Block Truncation Coding (AMBTC), as one of the classic image compression schemes, has been widely studied. However, in existing related schemes, the visual quality and compression rate of reconstructed images are relatively low. To address this problem, a color image compression method based on multi-grouping absolute moment block truncation coding (MGAMBTC) and sort mapping is proposed. The sort mapping algorithm is proposed by utilizing the feature of multiple quantization levels in MGAMBTC. By reordering the quantization levels and mapping them onto the first few bits of the bitmap, the bitmap is compressed. This scheme achieves higher visual quality of reconstructed images than other schemes at the same bit rate. At the same time, the effectiveness of the sort mapping algorithm is demonstrated in experiments.
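As background, classic AMBTC encodes each image block as two quantization levels plus a one-bit-per-pixel bitmap. The sketch below shows that baseline for a single grayscale block; it is not the multi-grouping variant (MGAMBTC) or the sort mapping step proposed in the paper, and the function names are hypothetical.

```python
def ambtc_encode(block):
    """Encode one block (list of rows) as (low, high, bitmap).

    Pixels >= block mean map to bit 1 and are represented by `high`, the
    rounded mean of that group; the rest map to bit 0 and `low`.
    """
    pixels = [p for row in block for p in row]
    mean = sum(pixels) / len(pixels)
    high_px = [p for p in pixels if p >= mean]   # never empty: max >= mean
    low_px = [p for p in pixels if p < mean]
    high = round(sum(high_px) / len(high_px))
    # A flat block has no pixel below the mean; reuse `high` in that case.
    low = round(sum(low_px) / len(low_px)) if low_px else high
    bitmap = [[1 if p >= mean else 0 for p in row] for row in block]
    return low, high, bitmap


def ambtc_decode(low, high, bitmap):
    """Rebuild the block from the two levels and the bitmap."""
    return [[high if bit else low for bit in row] for row in bitmap]


block = [[10, 12], [200, 202]]               # assumed 2x2 example block
low, high, bitmap = ambtc_encode(block)      # low=11, high=201
reconstructed = ambtc_decode(low, high, bitmap)
```

For a 4×4 block of 8-bit pixels, this baseline stores 2 × 8 + 16 = 32 bits instead of 128. Schemes with more quantization levels per block trade extra bits for better visual quality, which is where compressing the bitmap becomes worthwhile.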
-
Research on traffic sign recognition algorithm in complex weather conditions
- WANG Haiqun, ZHAO Tao, WANG Bingnan, CHAO Shuai
-
2026, 48(4): 676-688.
-
Traffic sign images captured in complex weather conditions suffer from reduced clarity and increased recognition difficulty, making it challenging for existing algorithms to accurately identify them. To address this issue, an improved traffic sign recognition algorithm based on YOLOv8 is proposed. Firstly, according to the idea of residual learning, a feature map enhancement module is designed to replace the residual block of C2f in the backbone network to improve the feature extraction ability of the backbone network. Secondly, on the basis of coordinate attention (CA), features are grouped and 3×3 convolution branches are added to realize cross-spatial information aggregation, which realizes the capture of finer features and makes the model focus more on the target area rather than the background. Then, the hybrid pooling is used to optimize the spatial pyramid pooling network to improve the feature expression ability of the model. Finally, in order to enhance the expression ability of the target multi-scale features, a multi-scale feature fusion network based on feature recombination and double-branch downsampling is designed to effectively promote the information interaction between different levels of features. Experiments were carried out on the self-made complex weather traffic sign dataset SWTSD. The mean average precision reaches 90.4%, outperforming the baseline algorithm by 3.9%, and the FPS reaches 109.4, which can meet the real-time requirements.
-
Scene text detection based on feature enhancement and adaptively multi-scale feature fusion
- LI Qiong, QI Changshi, XIE Kai
-
2026, 48(4): 689-698.
-
To address the issue of inaccurate text region localization caused by diverse text forms and complex backgrounds in natural scenes, this paper proposes a text detection algorithm based on feature enhancement and adaptive multi-scale feature fusion. First, the residual network is improved to reduce the loss of semantic information. Second, coordinate attention is embedded into the extracted features to suppress redundant background information and improve attention to text regions, thereby enhancing the ability to locate text boundaries. Third, an adaptive multi-scale feature fusion module is incorporated to integrate learned spatial location weights into feature maps at different scales, enabling more comprehensive fusion of multi-scale feature information. Finally, a differentiable binarization algorithm is used to generate text detection results. To verify the effectiveness of the algorithm, experiments were conducted on the publicly available datasets ICDAR2015, MSRA-TD500, and Total-Text, achieving F1-scores of 88.1%, 87.7%, and 86.3%, respectively. The experimental results demonstrate that the algorithm exhibits good robustness and generalization in text detection.
-
Low-quality steel stamp character detection and recognition based on adaptive feature fusion
- Lv Shujing, LOU Pengjie, PENG Shiquan, ZHAO Chunlong, LIU Yundan, Lv Yue
-
2026, 48(4): 699-708.
-
To address the challenges of stamp character detection on metal products, such as character tilt, blurriness, inconsistent fonts, and interference from rust stains, a character detection model based on adaptive feature fusion, named YOLO-CHAR, is proposed. The model employs the MobileNet feature extraction network to dynamically adjust the weights of channel features, enhancing its ability to capture key features. At the feature fusion layer, it uses the generalized feature pyramid network (GFPN) structure and the simplified attention module (SimAM) to flexibly capture multi-scale features and strengthen feature fusion. Based on this character detection model, a detection and recognition system for low-quality stamp characters on train wheelsets is designed and implemented. The system has been put into use, achieving an overall daily average recognition accuracy of over 92% for wheelsets, which meets on-site operational requirements.
Artificial Intelligence and Data Mining
-
Design and FPGA implementation of a high-precision frequency offset estimation algorithm
- HUANG Yinjian, ZHENG Longhao, TANG Lijun
-
2026, 48(4): 709-717.
-
Based on a study of the performance of the Rife and Quinn algorithms, an improved algorithm is proposed to address the fluctuating accuracy and weak noise immunity of traditional frequency offset estimation algorithms. The algorithm combines the precision advantage of the Rife algorithm at large frequency offset factors with the stability of the Quinn algorithm. It also employs multi-spectral-line interpolation with added weighting coefficients to overcome the problem of misjudging the correction direction when the actual frequency is close to a quantized frequency point. Experimental results demonstrate that the proposed algorithm maintains high frequency estimation accuracy even under low signal-to-noise ratio (SNR) conditions, with more stable overall performance and a closer approach to the Cramér-Rao lower bound (CRLB) than other similar algorithms. Finally, the algorithm was deployed on an FPGA platform, and the results were compared with the actual signal frequency, revealing a maximum root mean square error (RMSE) of approximately 16 Hz.
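The classic Rife interpolation that serves as one starting point can be sketched as follows: locate the DFT peak, pick the larger of its two neighbors to decide the correction direction, and interpolate from the magnitude ratio. This is the textbook estimator applied to a noise-free complex tone (an assumption made to keep the example short), not the improved weighted multi-line algorithm of the paper; the function names are hypothetical.

```python
import cmath
import math


def dft(samples):
    """Direct O(N^2) DFT, adequate for a short illustrative signal."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]


def rife_estimate(samples, fs):
    """Estimate the frequency of a single complex tone with Rife interpolation."""
    n = len(samples)
    mags = [abs(v) for v in dft(samples)]
    k0 = max(range(n), key=mags.__getitem__)          # peak spectral line
    if mags[k0] == 0.0:
        raise ValueError("signal has no energy")
    left, right = mags[(k0 - 1) % n], mags[(k0 + 1) % n]
    # The larger neighbor indicates on which side of the peak the tone lies.
    direction, neighbor = (1, right) if right >= left else (-1, left)
    delta = neighbor / (mags[k0] + neighbor)          # fractional bin offset
    return (k0 + direction * delta) * fs / n


fs, n = 64.0, 64
true_freq = 10.3                                      # assumed test tone, in Hz
tone = [cmath.exp(2j * math.pi * true_freq * t / fs) for t in range(n)]
estimate = rife_estimate(tone, fs)
```

When the true frequency sits almost exactly on a DFT bin, the two neighbors have nearly equal and very small magnitudes, so noise can flip the chosen direction. That is the misjudgment the abstract refers to.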
-
Heuristic for CDCL solver based on community structure
-
2026, 48(4): 718-730.
-
Industrial SAT (satisfiability) instances exhibit community structure, yet neither the variable state independent decaying sum (VSIDS) heuristic nor the learning rate branching (LRB) heuristic leverages it effectively. To address this issue, a branch optimization algorithm based on community reward, named Cr, is proposed. The core principle of the Cr algorithm is to increase the activity scores of variables within the same community, thereby focusing the search on local solution spaces to reduce restart and backtracking costs and ultimately improve solving efficiency. First, variables are classified into bridge variables and internal variables based on their connectivity within the community. Subsequently, communities are categorized into three types according to the variables they contain. Then, focusing on variable types and community types, different approaches are explored to increase the activity scores of variables within the same community. Whether variables belong to the same community is determined with respect to the current community, which is in turn determined by the variable with the highest activity score. Variable types and the current community are the two key factors in the Cr algorithm. In preliminary experiments using the Minisat, Maplesat, and Glucose solvers, the impact of these two factors on solving efficiency was analyzed through the global learning rate (GLR), the Incr sequence, and the reward factor α proposed in this paper. Furthermore, based on the analysis results, the Cr algorithm was applied to the advanced SAT solver lstech_maple. Experiments demonstrate that leveraging community structures can effectively enhance the efficiency of advanced SAT solvers. To explain the potential role of communities in conflict-driven clause learning (CDCL) search, the community continuity index (CCI) is proposed, and its role is interpreted in conjunction with the literal block distance (LBD) metric.
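One way to picture a community reward on top of a VSIDS-style bump is sketched below. It is a hypothetical reading of the description above, not the Cr algorithm itself: the abstract does not say how rewards differ between bridge and internal variables or between community types, so the sketch applies a single extra bump scaled by an assumed reward factor `alpha`.

```python
def bump_with_community_reward(activity, community, conflict_vars,
                               inc=1.0, alpha=0.5):
    """Bump conflict variables, rewarding those in the current community.

    activity:      {var: score}, updated in place
    community:     {var: community id} (assumed to be precomputed)
    conflict_vars: variables involved in the latest conflict
    The current community is taken to be that of the variable with the
    highest activity before the bump. Returns the current community id.
    """
    leader = max(activity, key=activity.get)
    current = community[leader]
    for v in conflict_vars:
        activity[v] += inc                       # ordinary VSIDS-style bump
        if community[v] == current:
            activity[v] += alpha * inc           # assumed community reward
    return current


# Assumed toy state: variables 1 and 2 share community "A", variable 3 is in "B".
activity = {1: 3.0, 2: 1.0, 3: 1.0}
community = {1: "A", 2: "A", 3: "B"}
current = bump_with_community_reward(activity, community, conflict_vars=[2, 3])
```

After the call, variable 2 outranks variable 3 even though both took part in the same conflict, which nudges subsequent branching decisions toward the community the search is already working in.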
-
A collaborative filtering recommendation algorithm fusing ROUSTIDA and improved probabilistic intuitionistic fuzzy clustering
- ZHANG Yanju, WU Yixuan, CHEN Zerong
-
2026, 48(4): 731-742.
-
Fuzzy clustering measures the ambiguity of user reviews and groups similar users into the same cluster, which can improve scalability and address data sparsity issues in traditional collaborative filtering algorithms. However, existing collaborative filtering algorithms based on fuzzy clustering often overlook the problems of cluster center initialization and fuzzy set weighting, leading to unstable clustering results and an inability to fully utilize review information, which in turn affects recommendation accuracy. To address these issues, this paper proposes a collaborative filtering recommendation algorithm fusing ROUSTIDA and improved probabilistic intuitionistic fuzzy clustering. The algorithm fills in missing data based on attribute reduction rules from rough set theory and the principle of minimizing the difference between the missing matrix and the similarity matrix, thereby reducing data sparsity. It introduces a density function-based initialization method for selecting cluster centers, mitigating the high sensitivity of fuzzy clustering to initial cluster centers. During clustering computation, it separately calculates the probability weights of membership and non-membership degrees, as well as the correlation coefficients of hesitation degrees, using a weighted probabilistic Euclidean distance as the proximity function for clustering to filter out relevant neighbor sets. This approach retains more user review information during the clustering process. Experimental results on the MovieLens 100K and Jester datasets demonstrate that, compared to other fuzzy clustering-based recommendation algorithms such as UFCM and FCM-Slope One, the proposed algorithm achieves lower mean absolute error (MAE) and root mean square error (RMSE) values, indicating superior recommendation accuracy.
-
Sound event detection & localization based on saliency detector and decay mask self-attention module
- WANG Chunli, CHEN Shanli, LIU Suqian, ZHAO Xiaochun
-
2026, 48(4): 743-751.
-
A novel acoustic module is proposed that combines a saliency detector with multi-head self-attention equipped with a decay mask. It helps the model focus on spatial information when performing sound event localization and detection tasks. First, the saliency detector concentrates on highly salient regions within local information, so that the model pays more attention to categories with rich information content. Second, a decay mask is introduced into the multi-head self-attention module, enabling the model to focus more on local information. Additionally, adaptive constraints are incorporated to diversify the attention heads. Experimental results demonstrate that the proposed model outperforms the baseline models. Compared with models that fuse Transformer and multi-scale architectures, the proposed model exhibits superior detection and localization performance. Finally, when video information is leveraged as additional data to enhance performance, the model demonstrates excellent overall capabilities.
-
A multi-attribute network public opinion prediction method based on big data
- PALIDAN Muhetaer, GUO Wenqiang, LU Chong
-
2026, 48(4): 752-760.
-
To quantitatively analyze the ability of social media network public opinion control, a network public opinion risk prediction method based on multi-attribute decision-making and comprehensive weight analysis is proposed. Firstly, web crawling methods are employed for data collection, and anti-interference matched filtering methods are used to clean the collected network public opinion data. Secondly, based on the preprocessed network media public opinion data, a multi-attribute comprehensive decision object model is constructed to obtain multiple quantifiable attribute sets, and word segmentation technology is used to decompose the text data into words. Based on the segmentation results, the association rules between the evolution of public opinion risks and people’s preferences are explored, and then the degree of association is calculated. Finally, the degree of association is fed as input into the BERT pre-trained vector model to obtain the directed feature values of network public opinion risks. By leveraging the evolutionary characteristics of network public opinion risks, predictions of their evolution are achieved. Simulation results demonstrate that the proposed method exhibits strong optimization capabilities in predicting the evolution of network public opinion risks. The F1 comprehensive evaluation metric has improved compared to the standard methods, enhancing the accuracy of public opinion classification. Moreover, the prediction accuracy for the evolution of public opinion risks reached 97.6%.