• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2014, Vol. 36 ›› Issue (06): 1018-1022.

• Papers •

Research on parallelizing the TFIDF algorithm based on Hadoop

WANG Jingyu1,2, ZHAO Weiyan2

  1. (1. School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083;
    2. College of Information Engineering, Inner Mongolia University of Science and Technology, Baotou 014010, China)
  • Received: 2012-12-22 Revised: 2013-02-25 Online: 2014-06-25 Published: 2014-06-25
  • Supported by:

    National Natural Science Foundation of China (61163025); Natural Science Foundation of Inner Mongolia (2012MS0912); Scientific Research Project of the Inner Mongolia Department of Education (Njzy12110)

Abstract:

Text classification algorithms are inefficient to train and test on large data sets when run on a single machine. To address this problem, a TFIDF text classification algorithm based on the Hadoop distributed platform is proposed, and the concrete flow of its implementation is given. Using the MapReduce programming model, a parallelized TFIDF text classification algorithm that takes the position of words within a document into account is implemented and compared with the traditional serial algorithm, with experiments run in both standalone and cluster modes. The experimental results show that the parallelized TFIDF text classification algorithm classifies massive data quickly and effectively and improves the algorithm's performance.

Key words: text classification; MapReduce; parallelization; TFIDF algorithm
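
The abstract names the technique but gives neither the formulas nor the position-weighting scheme. As a rough illustration only, the sketch below simulates the map and reduce stages in plain Python rather than on Hadoop. It assumes the standard TF-IDF definition (relative term frequency times log inverse document frequency) and a hypothetical rule that words in a document's title count double. The names `TITLE_WEIGHT`, `map_term_counts`, `reduce_sum`, and `tfidf` are invented for this example and are not taken from the paper, and the sketch covers only the term weighting, not the classifier built on top of it.

```python
import math
from collections import defaultdict

# Hypothetical weight for words found in the title; the abstract does not
# say how word position is actually weighted.
TITLE_WEIGHT = 2.0


def map_term_counts(doc_id, title, body):
    """Map step: emit ((term, doc_id), weight) for every word occurrence."""
    for word in title.lower().split():
        yield (word, doc_id), TITLE_WEIGHT
    for word in body.lower().split():
        yield (word, doc_id), 1.0


def reduce_sum(pairs):
    """Shuffle and reduce step: group pairs by key and sum their values."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def tfidf(docs):
    """Compute position-weighted TF-IDF for docs = {doc_id: (title, body)}."""
    # Job 1: position-weighted count for each (term, document) pair.
    counts = reduce_sum(
        pair
        for doc_id, (title, body) in docs.items()
        for pair in map_term_counts(doc_id, title, body)
    )
    # Job 2: total weight per document, used to normalise term frequency.
    doc_totals = reduce_sum((doc_id, w) for (_, doc_id), w in counts.items())
    # Job 3: number of documents that contain each term.
    doc_freq = reduce_sum((term, 1) for (term, _) in counts)
    n_docs = len(docs)
    return {
        (term, doc_id): (w / doc_totals[doc_id]) * math.log(n_docs / doc_freq[term])
        for (term, doc_id), w in counts.items()
    }
```

Each `reduce_sum` call stands in for one MapReduce job. On a real cluster the three jobs would be chained, with the output of one written to the distributed file system and read as the input of the next.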