Research on parallelizing the TFIDF algorithm based on Hadoop

J4 ›› 2014, Vol. 36 ›› Issue (06): 1018-1022.


WANG Jingyu1,2,ZHAO Weiyan2

(1.School of Computer and Communication Engineering,University of Science and Technology Beijing,Beijing 100083;
2.College of Information Engineering,Inner Mongolia University of Science and Technology,Baotou 014010,China)

Received: 2012-12-22 Revised: 2013-02-25 Online: 2014-06-25 Published: 2014-06-25

Abstract:

To improve the efficiency of training and testing text classification algorithms on large data sets, a TFIDF text classification algorithm based on the Hadoop distributed platform is proposed, and its implementation process is given. The parallelized TFIDF text classification algorithm, which takes word locations into account, is implemented with the MapReduce programming model. Comparative experiments between the improved TFIDF algorithm and the traditional serial algorithm are conducted in both standalone mode and cluster mode. The experimental results show that the improved TFIDF text classification algorithm achieves high-speed classification of massive data with improved performance.

Key words: text classification; MapReduce; parallelization; TFIDF algorithm
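
As a rough illustration of how TFIDF weighting decomposes into map and reduce steps, the following is a minimal single-process sketch in Python. It is not the authors' Hadoop implementation: the function names (`map_term_counts`, `reduce_sum`, `tfidf`) are hypothetical, tokenization is plain whitespace splitting, the standard weight tf × log(N / df) is assumed, and the word-location weighting mentioned in the abstract is omitted because the abstract does not specify it.

```python
import math
from collections import defaultdict


def map_term_counts(doc_id, text):
    """Map step: emit ((term, doc_id), 1) for every token in a document."""
    for term in text.lower().split():
        yield (term, doc_id), 1


def reduce_sum(pairs):
    """Reduce step: sum the values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def tfidf(docs):
    """Compute TFIDF weights for {doc_id: text} via three map/reduce passes."""
    # Pass 1: occurrences of each term in each document.
    counts = reduce_sum(
        pair
        for doc_id, text in docs.items()
        for pair in map_term_counts(doc_id, text)
    )
    # Pass 2: total number of tokens in each document.
    doc_len = reduce_sum((doc_id, n) for (_term, doc_id), n in counts.items())
    # Pass 3: number of documents that contain each term.
    doc_freq = reduce_sum((term, 1) for (term, _doc_id) in counts)

    n_docs = len(docs)
    return {
        (term, doc_id): (n / doc_len[doc_id]) * math.log(n_docs / doc_freq[term])
        for (term, doc_id), n in counts.items()
    }
```

Each pass only needs key-grouped sums, which is why the computation maps naturally onto chained MapReduce jobs; on a cluster the grouping done here by `reduce_sum` would be handled by the framework's shuffle phase.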

WANG Jingyu1,2,ZHAO Weiyan2. Research on parallelizing the
TFIDF algorithm based on Hadoop [J]. J4, 2014, 36(06): 1018-1022.
