• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2022, Vol. 44 ›› Issue (07): 1152-1161.

• High Performance Computing •

A parallel large dataset generator based on MPI

GE Xu-ran¹, LIU Yang¹, CHEN Zhi-guang², XIAO Nong¹

  1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
  2. School of Computer, Sun Yat-sen University, Guangzhou 510006, China
  • Received: 2021-11-19 Revised: 2022-01-06 Accepted: 2022-07-25 Online: 2022-07-25 Published: 2022-07-25

Abstract: Research on speeding up big data processing and analysis algorithms is often limited by the size of the available datasets. When the data volume is too small, an algorithm's communication time often exceeds its actual computation time, so the real effect of an optimization cannot be verified. To address this, a large dataset generator is designed to provide benchmark datasets for parallel big data processing and analysis algorithms running on supercomputers. First, a parallel random number generator is built using MPI. On this basis, artificial datasets of controllable scale and complexity are generated, mainly classification and clustering datasets, regression datasets, manifold learning datasets, and factorization datasets, among others. In addition, an I/O system is designed for the generator. It provides interfaces for reading and writing datasets in parallel through MPI-I/O, defines the rules for distributing and mapping a dataset across processes, and supports data access between nodes through point-to-point communication. Experimental results show that the parallel large dataset generator improves both the efficiency and the scale of data generation, and provides high-quality, large-scale test datasets for big data processing and analysis algorithms.

Key words: MPI, large dataset generator, I/O system, parallel big data processing algorithm, algorithm test
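
The abstract mentions distribution and mapping rules between processes and a parallel random number generator, but does not specify either. The following is only a minimal sketch of one common choice, assuming a contiguous block distribution of rows and one independently seeded random stream per process. The function names (block_range, owner_of, generate_local_rows) are hypothetical and are not taken from the paper. Ranks are simulated in a single Python process; in an actual MPI program the rank and the process count would come from MPI_Comm_rank and MPI_Comm_size.

```python
import random


def block_range(n_rows, n_procs, rank):
    """Return the half-open [start, stop) range of global rows owned by `rank`.

    Assumed rule: rows are split into contiguous blocks, and the first
    (n_rows % n_procs) ranks receive one extra row each.
    """
    base, extra = divmod(n_rows, n_procs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def owner_of(row, n_rows, n_procs):
    """Map a global row index to (owning rank, local index on that rank)."""
    base, extra = divmod(n_rows, n_procs)
    boundary = extra * (base + 1)  # rows held by the ranks with one extra row
    if row < boundary:
        return row // (base + 1), row % (base + 1)
    return extra + (row - boundary) // base, (row - boundary) % base


def generate_local_rows(seed, rank, n_rows, n_procs, n_features):
    """Generate this rank's block of Gaussian feature rows.

    Assumption: each rank derives its own stream from (seed, rank), so the
    output is reproducible for a fixed seed and process count. This is a
    simple illustration, not a statistically vetted parallel generator.
    """
    start, stop = block_range(n_rows, n_procs, rank)
    rng = random.Random(f"{seed}:{rank}")
    return [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for _ in range(start, stop)]


if __name__ == "__main__":
    n_rows, n_procs = 10, 4
    for rank in range(n_procs):
        local = generate_local_rows(42, rank, n_rows, n_procs, n_features=3)
        print(rank, block_range(n_rows, n_procs, rank), len(local))
    print(owner_of(7, n_rows, n_procs))
```

With a rule like owner_of, a process that needs a row it does not hold can work out which rank to ask and at which local offset, which is the kind of lookup that point-to-point access between nodes relies on.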