  • Official journal of the China Computer Federation (CCF)
  • China Science and Technology Core Journal
  • Chinese Core Journal

Current Issue

      High Performance Computing
      A hardware offloading structure and method for ultra long vector reduction operation based on direct memory access and dynamic shared buffer
      XU Jinbo, DAI Yi, JIAN Jie
      2025, 47(04): 571-581. doi:
      MPI (Message Passing Interface) collective communication enhances system performance by organizing multiple processes across multiple computing nodes to collaboratively complete a series of communication operations. Among these, reduction operations on ultra-long operand vectors are widely used in high performance computing and AI (Artificial Intelligence) computations. This paper proposes a hardware offloading structure and method for ultra-long vector reduction operations based on DMA (Direct Memory Access) and dynamic shared buffers. It achieves control over the hardware offloading process for collective communication through a dedicated hardware communication sequence trigger mechanism. The DMA transmission protocol is employed to enhance the software-hardware transmission efficiency of reduction operands. An on-chip dynamic shared buffer storage structure is introduced to achieve flexible and efficient caching of a large number of operands. By deploying an on-chip ALU (Arithmetic Logic Unit) array, computations are performed directly within the network chip. Experimental results demonstrate significant acceleration compared to both non-offloaded MPI methods and the original offloading method used in Tianhe, especially when dealing with longer reduction vectors.
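
      As a point of reference, the sketch below shows the kind of host-side collective such an offload accelerates, written with mpi4py and NumPy; it does not model the DMA transfers, the dynamic shared buffer, or the NIC ALU array described in the abstract, and the vector size is purely illustrative.

```python
# Host-side sketch of an ultra-long-vector reduction; with the paper's offload, the
# whole collective would be triggered as a hardware communication sequence and the
# element-wise sums computed by the ALU array inside the network chip.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n = 1 << 24                          # ultra-long operand vector (illustrative size)
local = np.full(n, rank, dtype=np.float64)
result = np.empty_like(local)

# Element-wise sum across all ranks.
comm.Allreduce(local, result, op=MPI.SUM)

if rank == 0:
    print(result[:4])                # each element equals the sum of all ranks
```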


      Design and FPGA implementation of lightweight convolutional neural network hardware acceleration
      LI Zhenqi, WANG Qiang, QI Xingyun, LAI Mingche, ZHAO Yankang, LU Yihang, LI Yuan
      2025, 47(04): 582-591. doi:
      In recent years, convolutional neural networks (CNNs) have achieved remarkable results in fields such as computer vision. However, CNNs typically have complex network structures and substantial computational requirements, making it difficult to implement them on portable devices with limited computational resources and power consumption. FPGAs, with their high parallelism, energy efficiency, and reconfigurability, have emerged as one of the most effective computing platforms for accelerating CNN inference on portable devices. This paper proposes a CNN accelerator that can be configured for different network structures, and optimizes its latency and power consumption through three aspects: data reuse, pipeline optimization based on row buffers, and low-latency convolution based on adder trees. Taking the YOLOv2-tiny lightweight network model as an example, a real-time target detection system was built on the Navigator ZYNQ-7020 development board. The experimental results show that the design meets the low hardware and power requirements of portable devices, with 88% resource utilization and 2.959 W power consumption, and achieves a detection speed of 3.91 fps at an image resolution of 416×256.
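
      To illustrate the adder-tree idea mentioned above, the following sketch sums the partial products of one convolution window by pairwise reduction, mirroring how an FPGA adder tree shortens the critical path from K*K serial additions to log2(K*K) levels; it is a software illustration only, and the function names are ours, not the paper's.

```python
# Illustrative software model of low-latency convolution via an adder tree.
def adder_tree_sum(values):
    """Sum a list by pairwise reduction, mirroring a hardware adder tree."""
    level = list(values)
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # odd element is carried to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]

def conv_window(window, kernel):
    """One 3x3 output pixel: multiply element-wise, then reduce with the adder tree."""
    products = [w * k for row_w, row_k in zip(window, kernel) for w, k in zip(row_w, row_k)]
    return adder_tree_sum(products)

window = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
print(conv_window(window, kernel))    # -> -6
```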

      High-power multiphase power supply technology based on domestic devices
      JIA Chunbo, CHEN Guang, YAO Xinan, LI Baofeng
      2025, 47(04): 592-600. doi:
      High-performance computing has entered the post-exascale era, which imposes stringent requirements on power supply technology for high-performance processors, including high power, low voltage, and fast response. However, current domestic digital multi-phase controllers and DrMOS power devices cannot meet the requirement of delivering high power at low voltage. To address this issue, a "1 drive 2" engineering design solution is proposed, in which each phase of the digital multi-phase controller drives two DrMOS devices, effectively doubling the power delivery capability. Through device selection, parameter setting, and feedback equalization techniques, the scheme controls ripple noise, dynamic response, and multi-phase current balancing so that it meets engineering specification requirements. This paper details the principle and implementation of the scheme and builds a verification system to demonstrate its feasibility and effectiveness.

      A survey of memory pool systems based on emerging memory-semantic interconnect protocols
      HONG Wentao, WU Lizhou, ZHANG Jintao, MENG Fanfeng, OU Yang, WANG Zicong, XIAO Nong
      2025, 47(04): 601-611. doi:
      In the era of big data, applications in various data centers, such as AI and cloud computing, have increasingly urgent needs for storing and computing large-scale data, while the access overhead of massive data has become a major bottleneck limiting system performance. In addition, existing data center architectures suffer from low memory utilization and limited memory expansion capabilities. Memory pool systems based on emerging memory-semantic interconnect protocols offer a range of characteristics, including high bandwidth, low energy consumption, large capacity, and scalability, providing new insights into addressing these issues and exerting a significant impact on future data center architectures. This paper discusses and compares the features and operating modes of five emerging memory-semantic interconnect protocols: OpenCAPI, Gen-Z, CCIX, NVLink, and CXL. It analyzes their roles in constructing memory pool systems and further explores their application research. In both industry and academia, CXL currently attracts the most attention and holds the best development prospects among memory-semantic interconnect technologies. Therefore, we specifically emphasize the characteristics, advantages, and research status of CXL. Finally, we analyze the challenges still faced by memory pool systems based on emerging memory-semantic interconnect protocols and offer prospects for future research in this direction.

      Implementation of high-speed AES based on FPGA and improvement of MixColumn
      SHEN Jinshang, ZHANG Qingshun, SONG Tierui
      2025, 47(04): 612-620. doi:
      A high-speed communication implementation scheme for AES based on FPGA is proposed. By splitting the encryption process into a 30-stage parallel pipeline structure, communication speed and encryption efficiency can be improved. At the same time, based on the GF(2^8) finite-field arithmetic rules of the MixColumn step in AES and the parallel structure of FPGAs, an intermediate cross-MixColumn structure is designed. This structure effectively reduces the computational delay and area of the MixColumn and inverse MixColumn parts, and improves encryption efficiency. From the perspective of logical algebra, the differences in computational resource usage between the traditional MixColumn structure, newer MixColumn structures, and the intermediate cross-computing structure are analyzed. Finally, verification results on Xilinx's XC5VSX240T chip show that the proposed scheme achieves a throughput of 60.928 Gbps and an encryption efficiency of 14.875 Mbps/LUT.
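
      For comparison with the optimized structure, a plain software reference of the standard AES MixColumn step over GF(2^8) is sketched below; it reproduces the textbook FIPS-197 arithmetic, not the paper's intermediate cross-MixColumn design, and is useful mainly as a correctness baseline when checking an FPGA implementation.

```python
# Reference implementation of the standard AES MixColumn transformation.
def xtime(b):
    """Multiply by x (i.e. 0x02) in GF(2^8) with the AES reduction polynomial 0x11B."""
    return ((b << 1) ^ 0x1B) & 0xFF if b & 0x80 else (b << 1) & 0xFF

def mul3(b):
    return xtime(b) ^ b               # 0x03 * b = (0x02 * b) xor b

def mix_column(col):
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]

# FIPS-197 test column: [0xdb, 0x13, 0x53, 0x45] -> [0x8e, 0x4d, 0xa1, 0xbc]
print([hex(x) for x in mix_column([0xDB, 0x13, 0x53, 0x45])])
```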

      Computer Network and Information Security
      An isolated sets based parallel Louvain algorithm for community detection
      LI Shijie, LIU Yang, TANG Jintao, QIE Hang
      2025, 47(04): 621-633. doi:
      To apply the popular Louvain algorithm used in community detection to large-scale graph networks, researchers have proposed a series of parallel Louvain algorithms. However, these parallel algorithms face two challenges: delay caused by information synchronization and the community label exchange problem. To address these challenges, this paper innovatively introduces the concept of isolated sets and partitions the graph network based on the characteristics of isolated sets. On this basis, a parallel Louvain algorithm based on isolated sets is proposed. This algorithm allows for parallel computation and updating of vertex information without generating synchronization delays or requiring community label exchanges. Furthermore, to address the limitation of the long tail effect in data processing inherent in the isolated sets parallel algorithm, an improved fusion algorithm based on hash tables is proposed, which further enhances computational efficiency. Experimental results show that the parallel algorithm and fusion algorithm based on isolated sets have good speedup ratios and higher modularity compared to traditional algorithms.
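
      A minimal sketch of the isolated-set idea as we read it from the abstract: if vertices are partitioned into sets whose members share no edge (obtainable, for instance, by greedy coloring), each set can be processed by parallel workers without synchronization delays or community label exchanges, because moves inside one set never touch each other's neighborhoods. The code below is illustrative, not the paper's algorithm.

```python
# Partition vertices into independent ("isolated") sets by greedy coloring.
from collections import defaultdict

def greedy_isolated_sets(adjacency):
    color = {}
    for v in adjacency:                                   # any fixed vertex order
        used = {color[u] for u in adjacency[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    sets = defaultdict(list)
    for v, c in color.items():
        sets[c].append(v)
    return list(sets.values())

graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
for s in greedy_isolated_sets(graph):
    # each list below could be handed to parallel workers without label exchange
    print(s)
```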

      Research and application of multi-dimensional encrypted data aggregation technology for smart oil and gas exploration and development system
      ZHANG Xiaojun, ZHANG Hao, LI Xingpeng, ZHANG Jingwei
      2025, 47(04): 634-643. doi:
      Industrial Internet of Things (IIoT) technologies enable intelligent oil and gas exploration and development systems to accelerate data convergence and break through the barrier of information islands. At the same time, protecting the confidentiality, integrity, and authenticity of data in the exploration and development process has become increasingly important. To this end, this paper proposes a multi-dimensional encrypted data aggregation scheme for intelligent oil and gas exploration and development systems. The scheme combines the superincreasing sequence technique, modifies a homomorphic encryption algorithm, and designs random blinding secret parameters, so that even if the decryption key is leaked, the important data transmitted by terminal equipment cannot be intercepted. The trusted center generates corresponding private keys according to the real identity of each communication entity, so that entities can flexibly negotiate authenticated session keys and compute message authentication codes based on a hash function. The control center can perform lightweight integrity verification of the aggregated ciphertext sent from the data integration platform server, decrypt the aggregated value, obtain the average values of status parameters, and achieve real-time supervision and regulation. Security analysis and performance evaluation show the security and efficiency of the scheme when deployed in intelligent oil and gas exploration and development systems.
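
      The superincreasing-sequence packing mentioned above can be illustrated with a toy example (the homomorphic encryption, blinding, and message authentication codes from the scheme are omitted): each device folds its d-dimensional reading into a single integer, and the per-dimension sums can still be recovered from the aggregated value.

```python
# Toy illustration of multi-dimensional aggregation via a superincreasing sequence.
def make_superincreasing(d, max_sum):
    """Each term exceeds max_sum times the sum of previous terms, so dimensions never overlap."""
    seq = [1]
    for _ in range(d - 1):
        seq.append(max_sum * sum(seq) + 1)
    return seq

def pack(reading, seq):
    """Fold one device's d-dimensional reading into a single integer."""
    return sum(a * x for a, x in zip(seq, reading))

def unpack(total, seq):
    """Recover the per-dimension sums from the aggregated packed value."""
    sums = []
    for a in reversed(seq):
        sums.append(total // a)
        total %= a
    return list(reversed(sums))

seq = make_superincreasing(3, max_sum=10 * 100)        # at most 10 devices, each value < 100
readings = [[5, 7, 9], [1, 2, 3], [4, 4, 4]]           # three terminal devices
aggregate = sum(pack(r, seq) for r in readings)        # the paper performs this under encryption
print(unpack(aggregate, seq))                          # -> [10, 13, 16]
```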

      Information freshness analysis of nonlinear information state update system based on nonlinear energy harvesting
      XUE Kailai, JIA Xiangdong, HAN Xianghua, NIU Xiayang, ZHANG Liang
      2025, 47(04): 644-654. doi:
      This paper addresses the trade-off between the nonlinear age of information (AoI) and the energy efficiency (EE) of time-sensitive amplify-and-forward (AF) relay-assisted IoT systems over Nakagami-m fading channels. Firstly, an AF relay transmission model is proposed to find the end-to-end approximate signal-to-noise ratio (SNR). Secondly, the end-to-end block error probability of packet transmission is derived by considering outdated channel state information (CSI) in the Nakagami-m fading channel. Finally, considering both nonlinear energy harvesting (EH) and outdated CSI, statistical descriptions are derived for the time the sensor needs to fully charge its battery and for the intervals between update packet deliveries. As a result, a closed-form expression for the average AoI is obtained, and a trade-off model between the nonlinear AoI and the EE is established by jointly exploiting outdated CSI and the nonlinearity of the EH circuit. The experimental results show that when the packet length is between 200 and 250 bits and the source transmit power is 35 dBm, the AoI-EE trade-off reaches its optimum.
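
      For orientation, closed-form average-AoI derivations of this kind typically start from the standard sawtooth-area decomposition below, where Y_i is the interval between consecutive successful deliveries (here governed by the battery charging time and the block error probability) and T_i the system time of the i-th delivered update; the paper's exact expression under nonlinear EH and outdated CSI is not reproduced here.

```latex
\bar{\Delta}
  \;=\; \lim_{T \to \infty} \frac{1}{T} \int_{0}^{T} \Delta(t)\, \mathrm{d}t
  \;=\; \frac{\mathbb{E}\!\left[ Y_i T_{i-1} \right] + \tfrac{1}{2}\, \mathbb{E}\!\left[ Y_i^{2} \right]}
             {\mathbb{E}\!\left[ Y_i \right]}
  \;\overset{T \,\perp\, Y}{=}\;
  \mathbb{E}[T] + \frac{\mathbb{E}\!\left[ Y^{2} \right]}{2\, \mathbb{E}[Y]} .
```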

      A representation knowledge distillation-based WiFi gesture recognition method
      GONG Haocheng, ZHU Hai, HUANG Zifei, YANG Mingze, ZHANG Kaiyu, WU Fei
      2025, 47(04): 655-666. doi:
      With the rapid development of artificial intelligence and wireless sensing technologies, WiFi gesture recognition has emerged as a research area attracting significant attention. Current research enhances the robustness of models across different data domains and reduces the reliance on retraining by extracting domain-independent features from channel state information (CSI) in the form of the body-coordinate velocity profile (BVP), enabling high accuracy in both intra-domain and cross-domain recognition. However, in practical scenarios, converting collected CSI signals into BVP requires substantial computational resources, falling short of the real-time and scalability requirements of production environments. Additionally, traditional models lack the capability to capture global features and long-term dependencies when dealing with large and complex datasets. To address these issues, a representation knowledge distillation-based WiFi gesture recognition (RKD-WGR) framework is proposed. RKD-WGR uses BVP data as input to a teacher model that guides a student model taking CSI data as input; this transfers the BVP inference capability into the student model while allowing the CSI branch to learn complementary information missing from BVP. Meanwhile, to improve recognition performance and strengthen knowledge transfer from the teacher to the student, a 3D WiFi Transformer (3DWiT) is introduced as the teacher model, which leverages the spatio-temporal information of BVP to enrich the teacher's knowledge and enhance its transfer capability. Experimental results on the Widar 3.0 dataset demonstrate that, without using BVP and relying solely on CSI, the accuracy reaches 97.1% for six gesture classes, 96.5% for ten gesture classes, and 89.5% for 22 gesture classes. These results validate the effectiveness of the proposed framework and model.
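
      A generic knowledge-distillation loss in PyTorch is sketched below, assuming a teacher trained on BVP and a student fed raw CSI as the abstract describes; the exact RKD-WGR loss terms, weighting, and the 3DWiT teacher architecture are not reproduced here, so the function and parameter names are illustrative.

```python
# Response + representation distillation loss for a BVP teacher and a CSI student.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, student_feat, teacher_feat,
                      labels, T=4.0, alpha=0.5, beta=0.1):
    # Response distillation: match softened class distributions of teacher and student.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # Hard-label term: the student still learns gestures directly from CSI labels.
    hard = F.cross_entropy(student_logits, labels)
    # Representation term: pull intermediate CSI features toward the BVP features.
    rep = F.mse_loss(student_feat, teacher_feat)
    return alpha * soft + (1 - alpha) * hard + beta * rep

# Toy shapes: batch of 8, 6 gesture classes, 128-dim intermediate features.
s_logits, t_logits = torch.randn(8, 6), torch.randn(8, 6)
s_feat, t_feat = torch.randn(8, 128), torch.randn(8, 128)
labels = torch.randint(0, 6, (8,))
print(distillation_loss(s_logits, t_logits, s_feat, t_feat, labels))
```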

      A method for maximizing the impact of social networks with network structure adaptability
      WANG Xiaojie, HOU Xiaojing, XU Chun, ZHANG Lei
      2025, 47(04): 667-676. doi:
      Influence maximization (IM) has been extensively studied in social network analysis and mining; it aims to find a seed set of k nodes that maximizes the coverage of influence spread under a specific propagation model. Existing studies rarely consider the influence of network structure on information propagation, and IM algorithms are typically not adaptive to networks with different structures. To solve this problem, this paper studies the IM problem with network structure adaptability and analyzes the influence of network structure on information propagation. Firstly, according to the relation between network structure and the propagation process, three allocation strategies are proposed to adapt to different network types. Secondly, with node influence measured at the community scale, an initial seed node set is constructed. Finally, the initial seed node set is adjusted and optimized to further improve the quality of the seed nodes. Experiments on real and synthetic datasets with different structures show that the proposed algorithm achieves better performance. The study also finds that a greater average distance between seed nodes does not necessarily lead to a wider influence spread, which revises the conventional view of the average seed-node distance when considering the propagation overlap problem.
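
      For context, the classic greedy baseline for IM under the independent cascade (IC) model is sketched below; the paper's structure-adaptive allocation strategies and community-scale influence measure are not modeled here, and the propagation probability and graph are illustrative.

```python
# Greedy influence maximization under the independent cascade model (Monte Carlo estimate).
import random

def ic_spread(adj, seeds, p=0.1, runs=200):
    """Average number of activated nodes under the IC model, estimated by simulation."""
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v in adj.get(u, []):
                    if v not in active and random.random() < p:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / runs

def greedy_im(adj, k, p=0.1):
    seeds = []
    for _ in range(k):
        best = max((v for v in adj if v not in seeds),
                   key=lambda v: ic_spread(adj, seeds + [v], p))
        seeds.append(best)
    return seeds

adj = {0: [1, 2], 1: [2, 3], 2: [3], 3: [4], 4: []}
print(greedy_im(adj, k=2))
```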

      Graphics and Images
      3D axial Transformer model for kidney tumor segmentation in CT images
      ZHANG Jinlong, WU Min, SUN Yubao
      2025, 47(04): 677-685. doi:
      Automatic segmentation of kidneys and their tumor regions in CT image sequences can provide quantitative references for radiotherapy and chemotherapy planning. Currently, Transformer-based kidney tumor segmentation models have attracted widespread attention, especially when used in conjunction with the U-Net model and its variants. Existing Transformer-based segmentation networks typically learn features within local windows of individual slices, resulting in insufficient representation of intra-slice spatial information and inter-slice axial information. To address this issue, a three-dimensional axial Transformer module is proposed, which decomposes the complex coupling of the three dimensions into alternating axial attentions, integrating both intra-slice and inter-slice axial correlation information. Based on this module, a two-stage encoder-decoder network for kidney tumor segmentation, ATrans UNet (Axial Transformer UNet), is constructed, incorporating multi-scale features and residual learning. On the KiTS19 dataset, the Dice similarity coefficients for kidney and kidney tumor segmentation are 96.43% and 81.04%, respectively, an improvement in average Dice score of 8.40% over 2D-UNet and 4.84% over 3D-UNet.
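
      The axial decomposition can be sketched compactly in PyTorch: instead of full 3D self-attention over all D*H*W tokens, attention is applied along one axis at a time, which is what couples intra-slice and inter-slice (axial) information at much lower cost. Layer names and shapes below are illustrative, not the paper's ATrans UNet definition.

```python
# Self-attention applied separately along the D, H, and W axes of a 3D feature volume.
import torch
import torch.nn as nn

class AxialAttention3D(nn.Module):
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.ModuleList([nn.MultiheadAttention(channels, heads, batch_first=True)
                                   for _ in range(3)])

    def forward(self, x):                                  # x: (B, C, D, H, W)
        b, c, d, h, w = x.shape
        for axis, attn in zip((2, 3, 4), self.attn):
            others = [i for i in (2, 3, 4) if i != axis]
            perm = [0, *others, axis, 1]                   # chosen axis becomes the sequence dim
            inv = [perm.index(i) for i in range(5)]        # permutation that undoes `perm`
            seq = x.permute(*perm).reshape(-1, x.shape[axis], c)
            out, _ = attn(seq, seq, seq)                   # attention along one axis only
            x = out.reshape(b, x.shape[others[0]], x.shape[others[1]],
                            x.shape[axis], c).permute(*inv)
        return x

x = torch.randn(1, 32, 8, 16, 16)                          # (batch, channels, depth, height, width)
print(AxialAttention3D(32)(x).shape)                       # -> torch.Size([1, 32, 8, 16, 16])
```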

      Image encryption and FPGA implementation based on 3D chaotic system
      YAN Shaohui, JIANG Jiawei, CUI Yu
      2025, 47(04): 686-694. doi:
      This paper implements the application of a chaotic system to image encryption on a field-programmable gate array (FPGA). Based on an improved Bao chaotic system, the chaotic system is discretized using the improved Euler method, and the hardware design is carried out in Verilog. The accuracy of the chaotic system relative to the software design is verified through register transfer level (RTL) circuits and ModelSim timing simulation. The discretized chaotic sequences are used on the FPGA for image encryption and the corresponding key-based decryption, and the feasibility of the encryption scheme is verified through a video graphics array (VGA) interface. This study implements image encryption with a chaotic system at the hardware level, laying the foundation for further application and implementation of chaotic encryption technology on FPGAs.
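
      A software sketch of the same pipeline is given below: discretize a 3D chaotic system with the improved (Heun) Euler method, quantize the trajectory into a byte keystream, and XOR it with the image. A classic Lorenz system stands in for the improved Bao system, whose exact equations are not given in the abstract, and the quantization rule is illustrative.

```python
# Chaotic keystream via improved-Euler (Heun) discretization, then XOR image encryption.
import numpy as np

def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def heun_keystream(n_bytes, state=(0.1, 0.2, 0.3), h=0.01, skip=1000):
    state = np.array(state, dtype=np.float64)
    out = []
    for i in range(skip + n_bytes):
        k1 = lorenz(state)
        k2 = lorenz(state + h * k1)
        state = state + 0.5 * h * (k1 + k2)      # improved Euler / Heun step
        if i >= skip:                            # discard transient iterations
            out.append(int(abs(state[0]) * 1e6) % 256)
    return np.array(out, dtype=np.uint8)

image = np.random.randint(0, 256, size=(4, 4), dtype=np.uint8)
ks = heun_keystream(image.size).reshape(image.shape)
cipher = image ^ ks                              # encryption
print(np.array_equal(cipher ^ ks, image))        # decryption with the same key -> True
```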

      An improved marine animal object detection algorithm based on YOLOv8n: DPSC-YOLO
      LIANG Jiajie, XU Huiying, ZHU Xinzhong, WANG Shumeng, LIU Ziyang, LI Chen
      2025, 47(04): 695-705. doi:
      In complex marine environments, deep learning-based object detection algorithms face challenges such as difficult feature extraction and missed detections caused by blurred captured images and complex backgrounds. Marine object detection algorithms therefore need to be more efficient and deliver better performance. To address this, an improved marine animal detection algorithm based on YOLOv8n, named DPSC-YOLO, is proposed. The DCNv2 module is introduced into the backbone network to adapt to geometric variations of objects by enhancing spatial modeling capability. Spatial pyramid pooling faster cross stage partial channel (SPPFCSPC) is added at the end of the backbone network to reduce computational complexity while maintaining the model's receptive field. An F2 small-object detection head is added to the neck network and combined with the other three scales, using four detection layers with different receptive fields to improve the accuracy of detecting extremely small objects. The CoT-Attention mechanism is integrated into the C2f module of the neck network to better exploit contextual information between adjacent keys and dynamically adjust attention allocation based on data characteristics. Experimental results show that DPSC-YOLO improves mAP@0.5 by 1.1% and mAP@0.5:0.95 by 4.6% compared with YOLOv8n, with only a slight increase in parameters and computational complexity, demonstrating that DPSC-YOLO is better suited to object detection tasks in complex marine environments.

      Artificial Intelligence and Data Mining
      BigFlow: A service system for cross-center collaborative analysis of scientific data
      ZHU Xiaojie, CHENG Zhenjing, WANG Huajin, YANG Gang, TIAN Yao, FAN Dongwei, MI Linying, LIANG Zhaoji,
      2025, 47(04): 706-717. doi:
      The integration of big data technology and scientific data has spawned numerous new paradigms for scientific research and created a widespread need for cross-center collaborative analysis of scientific data. However, such analysis faces significant technical challenges, including inefficient cross-center data transfer, difficulties in cross-framework heterogeneous computing, and low efficiency in cross-center job scheduling, while also requiring trustworthiness throughout the analysis process. To address these challenges, a scientific data cross-center collaborative analysis service system called BigFlow has been developed. The system's cross-center collaborative analysis capabilities have been tested and validated in scenarios such as large-scale astronomical star catalog cross-matching and the identification of check dam locations in the Yellow River basin.

      Named entity recognition of crop diseases and pests fusing dual dictionary
      ZHU Xiping, GAO Ang, XIAO Lijuan
      2025, 47(04): 718-727. doi:
      To address the domain specificity, imbalance, and nested entities in crop pest and disease data, which lead to low recognition accuracy for general-purpose models, a crop disease and pest entity recognition model incorporating a dual-dictionary approach is proposed. Firstly, the original character data and vocabulary data are fed into the LE-RoBERTa module and the GC-SoftLexicon module, respectively, and two independent character vectors are obtained after enhancement processing. Then, the fused character vectors are input into a BiLSTM encoding layer and a CRF decoding layer to obtain the optimal entity sequence. Experimental results show that the model achieves an F1-score of 95.56% on the constructed crop disease and pest entity dataset, effectively recognizing crop disease and pest entities.

      Research on dynamic graph generation model based on deep adversarial network
      ZHANG Mengyuan, DUAN Yang, WANG Binbin, ZHANG Lei, WU Yi, LIU Chang, GUO Naiwang, CHENG Dawei
      2025, 47(04): 728-739. doi:
      In recent years, the problem of graph generation has received widespread attention. By learning the distribution of real graphs, graph generation techniques can produce synthetic graphs with similar characteristics, which are widely used in fields such as e-commerce and power networks. In practical applications, most graphs are dynamic, with their topological structures changing over time. However, existing graph generators are primarily designed for static graphs and neglect the temporal characteristics of graphs. Additionally, current dynamic graph generation models generally suffer from long training times, making it difficult to handle large-scale dynamic graphs. To address these issues, a novel GAN-based model, called dynamic graph generative adversarial network (DGGAN), is proposed. The model's encoder employs a graph self-attention mechanism for parallel computation, thereby improving training efficiency, and a gating mechanism is used to control information flow, helping the model learn and memorize key information more effectively. Comprehensive experimental evaluations of DGGAN and representative graph generation methods were conducted on six dynamic graph datasets. The results demonstrate that DGGAN outperforms existing models in both the quality of the generated graphs and efficiency.
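
      The kind of gating the abstract refers to can be illustrated with a small PyTorch module that blends the previous hidden state with information from a new graph snapshot; DGGAN's actual encoder, decoder, and adversarial training loop are not reproduced, and the module name is ours.

```python
# Illustrative gated update for controlling information flow across time steps.
import torch
import torch.nn as nn

class GatedUpdate(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h_prev, h_new):
        # z in (0, 1) decides, per feature, how much new snapshot information to keep.
        z = torch.sigmoid(self.gate(torch.cat([h_prev, h_new], dim=-1)))
        return z * h_new + (1 - z) * h_prev

h_prev, h_new = torch.randn(4, 64), torch.randn(4, 64)
print(GatedUpdate(64)(h_prev, h_new).shape)      # -> torch.Size([4, 64])
```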

      Research on pattern aware sampling algorithm
      SHEN Lingzhen, WANG Xin, SHI Junhao, WANG Lu
      2025, 47(04): 740-750. doi:
      With the rapid growth of graph data, traditional analysis techniques struggle to keep up, particularly in frequent pattern mining tasks where traditional algorithms risk exhausting computational resources. Graph sampling effectively reduces data volume and computation cost, making it a crucial research direction in graph data analysis. However, existing graph sampling algorithms have limitations in supporting frequent pattern mining tasks, because they fail to incorporate the key attributes of graph data alongside structural features, resulting in lower sampling quality. Therefore, this paper proposes a pattern-aware sampling (PAS) algorithm that considers both the high-frequency structures and the key attributes of the graph. PAS uses neighborhoods (local features) and high-frequency single-edge patterns (global features) to weight the nodes and edges of the graph, and then performs a biased walk on the weighted graph to complete the sampling task. Experiments demonstrate that, compared with other baseline algorithms, PAS achieves superior performance on multiple metrics and can mine top-B frequent patterns highly consistent with those of the original graph; at a sampling ratio of only 0.20, the accuracy reaches up to 94%.
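
      A small sketch of the biased-walk step PAS relies on is shown below: once nodes and edges have been weighted by local neighborhoods and high-frequency single-edge patterns, the sampler walks the graph choosing each next edge in proportion to its weight. The weighting itself is the paper's contribution and is only stubbed here with placeholder values.

```python
# Biased random walk on a weighted graph until enough nodes have been sampled.
import random

def biased_walk_sample(adj, edge_weight, start, target_nodes):
    """adj: node -> neighbor list; edge_weight: (u, v) -> weight (symmetric)."""
    sampled, current = {start}, start
    while len(sampled) < target_nodes:
        nbrs = adj[current]
        weights = [edge_weight[(current, v)] for v in nbrs]
        current = random.choices(nbrs, weights=weights, k=1)[0]
        sampled.add(current)
    return sampled

adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
w = {}
for u, nbrs in adj.items():
    for v in nbrs:
        w[(u, v)] = 2.0 if (u, v) in [(1, 2), (2, 1)] else 1.0   # stub pattern weight
print(biased_walk_sample(adj, w, start=0, target_nodes=3))
```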

      Construction of Mongolian-Chinese pseudo-parallel corpus enhanced by noisy data
      TIAN Yonghong, ZHANG Junjin, SONG Zheyu
      2025, 47(04): 751-760. doi:
      Neural machine translation (NMT), as the mainstream approach to machine translation, has achieved excellent performance in general translation tasks. However, its translation quality relies heavily on large-scale parallel corpora, and for low-resource languages the scarcity of corpora poses a significant challenge. Data augmentation techniques can effectively alleviate this data scarcity. Therefore, a pseudo-parallel corpus is constructed by introducing noisy data into back translation. Firstly, the corpus text is pre-processed. Secondly, back translation and back translation combined with noisy data are carried out. Thirdly, text similarity matching is performed. Finally, plain back translation is compared with back translation combined with noisy data. Experiments show that back translation combined with noisy data effectively improves the performance of low-resource machine translation: its translations achieve a 1.10% BLEU improvement over back translation alone and a 1.96% improvement over not using back translation at all.
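
      The noise injection applied to back-translated synthetic source sentences can be sketched as below; word dropout and local swaps are common choices in the noisy back-translation literature, and the abstract does not specify which noise types the authors use, so the probabilities and operations here are illustrative.

```python
# Simple noise injection for synthetic source sentences produced by back translation.
import random

def add_noise(tokens, p_drop=0.1, p_swap=0.1):
    out = [t for t in tokens if random.random() > p_drop]          # random word drop
    i = 0
    while i + 1 < len(out):
        if random.random() < p_swap:                               # local swap of adjacent words
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out

random.seed(0)
synthetic_source = "the model is trained on back translated sentences".split()
print(" ".join(add_noise(synthetic_source)))
```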