• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

Two data compression methods for recommender systems

LIU Bo1,LIU Xiaoguang1,WANG Gang1,WU Di2   

  1. College of Computer and Control Engineering, Nankai University, Tianjin 300350, China
  2. Bytedance Inc., Beijing 100085, China
     
  • Received: 2016-07-11 Revised: 2016-09-13 Online: 2016-11-25 Published: 2016-11-25

Abstract:

An enormous amount of training data is generated on Headlines Today's servers. These data are formatted for machine learning. We observe that no common data compression method fully satisfies the business requirement of a better compression ratio. We present two compression methods for the training data from Headlines Today's servers: hierarchical cluster compression (HCC) and hash recoding compression (HRC). HCC combined with Gzip quadruples the compression speed relative to Gzip alone, which indicates that the first proposed method effectively increases compression speed while preserving the compression ratio. HRC combined with Snappy halves the compression ratio in comparison with Snappy alone, which shows that HRC reduces the compression ratio while lowering the compression speed as little as possible. Overall, either method is worthwhile for reducing operating costs, improving the efficiency of business processes, and providing a better user experience.

Key words: hierarchical cluster compression, hash recoding compression, dictionary compression, training data, Gzip, Snappy
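
The abstract compares methods on two quantities: compression ratio and compression speed. As a rough illustration of how such a baseline might be measured, the sketch below times Gzip from the Python standard library on synthetic records. It does not implement HCC or HRC, which the abstract does not describe in detail. The function name `measure_gzip`, the record layout, and the definition of ratio as compressed size divided by original size (so that smaller is better, consistent with "halves the compression ratio") are all assumptions made for this example.

```python
from __future__ import annotations

import gzip
import time


def measure_gzip(data: bytes, level: int = 6) -> tuple[float, float]:
    """Return (ratio, speed) for Gzip on `data`.

    ratio: compressed size / original size (assumed definition; smaller is better).
    speed: input megabytes compressed per second.
    """
    start = time.perf_counter()
    compressed = gzip.compress(data, compresslevel=level)
    elapsed = time.perf_counter() - start

    # Sanity check: the compression must be lossless.
    assert gzip.decompress(compressed) == data

    ratio = len(compressed) / len(data)
    speed = len(data) / max(elapsed, 1e-9) / 1e6
    return ratio, speed


# Hypothetical tab-separated training records; the real data format is not
# given in the abstract.
sample = b"".join(
    b"user=%d\tfeature=%d\tlabel=%d\n" % (i % 97, i % 13, i % 2)
    for i in range(20000)
)

ratio, speed = measure_gzip(sample)
print(f"ratio={ratio:.3f}, speed={speed:.1f} MB/s")
```

Running the same measurement on data before and after a preprocessing step would give the kind of comparison the abstract reports, although the absolute numbers depend heavily on the data and the hardware.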