Computer Engineering & Science ›› 2021, Vol. 43 ›› Issue (03): 407-415.

• High Performance Computing •

Identity calibration of E-commerce big data based on long short-term memory network

LIU Ya-bo, WU Qiu-xuan

  1. (School of Automation Engineering, Hangzhou Dianzi University, Hangzhou 310018, China)
  • Received: 2020-03-26 Revised: 2020-05-08 Accepted: 2021-03-25 Online: 2021-03-25 Published: 2021-03-26

Abstract: E-commerce big data on government procurement platforms covers a wide variety of products and follows no uniform writing format. As a result, traditional models that try to identify records of the same product suffer from low accuracy, slow speed, low sample utilization, and insufficient generalization ability. An identity calibration model based on the long short-term memory (LSTM) network is proposed. It consists of three sub-models connected in series: word segmentation, importance ranking, and similarity calculation. First, the word segmentation sub-model preprocesses the e-commerce big data to obtain a discriminative keyword sequence. Next, the LSTM importance ranking sub-model selects the keyword sequence that best characterizes the product information. Finally, the LSTM similarity calculation sub-model identifies records of the same product in the given data. In addition, binary search, GloVe word vectorization, and word-sequence semantic verification are introduced to improve calibration speed, training sample utilization, and calibration generalization ability, respectively. Experimental results show that, when processing e-commerce big data across different product categories, the proposed model achieves high identity calibration accuracy on easily confused samples.


Key words: E-commerce big data, long short-term memory, importance ranking, similarity calculation
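
The three-stage pipeline described in the abstract (segmentation, importance ranking, similarity calculation, with binary search to narrow the candidates) can be illustrated with a minimal sketch. This is not the authors' implementation. The two LSTM sub-models are replaced by simple stand-ins (an inverse-document-frequency score for importance ranking and cosine similarity for the similarity calculation). GloVe vectors and semantic verification are omitted, and the way binary search is applied (a lookup in a sorted keyword list) is an assumption. All function names, the sample catalog, and the threshold are hypothetical.

```python
import bisect
import math
import re
from collections import Counter


def segment(title):
    """Stand-in for the word segmentation sub-model: lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


def rank_keywords(tokens, doc_freq, n_docs, top_k=3):
    """Stand-in for the LSTM importance ranking sub-model.

    Keeps the top_k rarest tokens (smoothed IDF); ties are broken alphabetically.
    """
    def score(tok):
        return math.log((n_docs + 1) / (doc_freq[tok] + 1))
    return sorted(set(tokens), key=lambda t: (-score(t), t))[:top_k]


def similarity(a, b):
    """Stand-in for the LSTM similarity sub-model: cosine similarity of token counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def build_index(titles):
    """Precompute keywords per title and a sorted (keyword, title_id) list."""
    tokenized = [segment(t) for t in titles]
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    keys = [rank_keywords(toks, doc_freq, len(titles)) for toks in tokenized]
    postings = sorted((kw, i) for i, ks in enumerate(keys) for kw in ks)
    return {"titles": titles, "doc_freq": doc_freq, "keys": keys, "postings": postings}


def find_same_product(query, index, threshold=0.6):
    """Return catalog titles judged to describe the same product as the query."""
    postings = index["postings"]
    q_keys = rank_keywords(segment(query), index["doc_freq"], len(index["titles"]))
    candidates = set()
    for kw in q_keys:
        # Binary search for the first posting whose keyword is >= kw,
        # then scan forward over the postings that share this keyword.
        pos = bisect.bisect_left(postings, (kw,))
        while pos < len(postings) and postings[pos][0] == kw:
            candidates.add(postings[pos][1])
            pos += 1
    return [index["titles"][i] for i in sorted(candidates)
            if similarity(q_keys, index["keys"][i]) >= threshold]


# Hypothetical catalog with two listings of one product and one confusable model.
CATALOG = [
    "Lenovo ThinkPad X1 Carbon 14-inch laptop",
    "ThinkPad X1 Carbon laptop by Lenovo, 14 inch",
    "HP LaserJet Pro M404dn printer",
    "Lenovo ThinkPad T14 laptop",
]
index = build_index(CATALOG)
print(find_same_product("lenovo thinkpad x1 carbon laptop 14 inch", index))
```

In this sketch the sorted keyword list means only titles sharing at least one important keyword with the query are scored, which is one plausible way a binary search could reduce the number of similarity computations.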