• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2014, Vol. 36 ›› Issue (06): 1018-1022.

• Articles •

Research on parallelizing the TFIDF algorithm based on Hadoop

WANG Jingyu1,2,ZHAO Weiyan2   

  1. School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
  2. College of Information Engineering, Inner Mongolia University of Science and Technology, Baotou 014010, China
  • Received: 2012-12-22  Revised: 2013-02-25  Online: 2014-06-25  Published: 2014-06-25

Abstract:

To improve the efficiency of text classification on large data sets during training and testing, a TFIDF text classification algorithm based on the Hadoop distributed platform is proposed, and its implementation process is given. The parallelized TFIDF algorithm is implemented with the MapReduce programming model and takes word positions into account. Comparative experiments between the improved TFIDF algorithm and the traditional serial algorithm are conducted in both standalone mode and cluster mode. The experimental results show that the improved TFIDF text classification algorithm achieves high-speed classification of massive data and improves performance.

Key words: text classification; MapReduce; parallelization; TFIDF algorithm
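
As an illustration of the approach summarized in the abstract, the following is a minimal sketch of TFIDF expressed as map and reduce steps, simulated in a single Python process. It is not the authors' implementation: the abstract does not specify how word positions are weighted, so the `POSITION_WEIGHT` table (title terms counted twice as heavily as body terms), the function names, and the three-job split are all assumptions made for illustration. On a Hadoop cluster each map and reduce step would run as distributed tasks rather than as local function calls.

```python
import math
from collections import defaultdict

# Hypothetical position weights: the abstract says word positions are
# considered but does not give the scheme, so these values are assumptions.
POSITION_WEIGHT = {"title": 2.0, "body": 1.0}


def map_term_counts(doc_id, fields):
    """Map step: emit ((term, doc_id), weight) for every token in a document."""
    for position, text in fields.items():
        for term in text.lower().split():
            yield (term, doc_id), POSITION_WEIGHT[position]


def reduce_sum(pairs):
    """Reduce step: sum the values that share a key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def tfidf(corpus):
    """Compute position-weighted TFIDF for {doc_id: {position: text}}."""
    # Job 1: position-weighted term frequency per (term, document).
    mapped = [kv for doc_id, fields in corpus.items()
              for kv in map_term_counts(doc_id, fields)]
    weighted_tf = reduce_sum(mapped)

    # Job 2: document frequency per term; each (term, document) counts once.
    doc_freq = reduce_sum((term, 1) for (term, _doc) in weighted_tf)

    # Job 3: combine both results, using idf = log(N / df).
    n_docs = len(corpus)
    return {
        (term, doc_id): tf * math.log(n_docs / doc_freq[term])
        for (term, doc_id), tf in weighted_tf.items()
    }
```

Calling `tfidf` on a dictionary such as `{"d1": {"title": "...", "body": "..."}}` returns one score per (term, document) pair. A term that appears in every document scores zero, while a term that appears in a title is boosted by the assumed position weight.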