[1]Grishman R,Sundheim B.Message understanding conference-6:A brief history[C]∥Proc of the 16th Conference on Computational Linguistics,1996:466-471.
[2]Zong Cheng-qing.Statistical natural language processing[M].2nd Edition.Beijing:Tsinghua University Press,2013.(in Chinese)
[3]McCallum A,Li W.Early results for named entity recognition with conditional random fields,feature induction and web-enhanced lexicons[C]∥Proc of the 7th Conference on Natural Language Learning at HLT-NAACL,2003:188-191.
[4]Craven M,Kumlien J.Constructing biological knowledge bases by extracting information from text sources[C]∥Proc of the Intelligent Systems in Molecular Biology,1999:77-86.
[5]Bunescu R,Mooney R J.Relational Markov networks for collective information extraction[C]∥Proc of the ICML-2004 Workshop on Statistical Relational Learning and Its Connections to Other Fields,2004:1.
[6]Minkov E,Wang R C,Cohen W W.Extracting personal names from email:Applying named entity recognition to informal text[C]∥Proc of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing,2005:443-450.
[7]Ritter A,Clark S,Etzioni O.Named entity recognition in tweets:An experimental study[C]∥Proc of the Conference on Empirical Methods in Natural Language Processing,2011:1524-1534.
[8]Tkachenko M,Simanovsky A.Named entity recognition:Exploring features[C]∥Proc of KONVENS,2012:118-127.
[9]Brown P F,Desouza P V,Mercer R L,et al.Class-based n-gram models of natural language[J].Computational Linguistics,1992,18(4):467-479.
[10]Turian J,Ratinov L,Bengio Y.Word representations:A simple and general method for semi-supervised learning[C]∥Proc of the 48th Annual Meeting of the Association for Computational Linguistics,2010:384-394.
[11]Hinton G E,Salakhutdinov R R.Reducing the dimensionality of data with neural networks[J].Science,2006,313(5786):504-507.
[12]Berger A L,Pietra V J D,Pietra S A D.A maximum entropy approach to natural language processing[J].Computational Linguistics,1996,22(1):39-71.
[13]Melli G,Romming C.An overview of the CPROD1 contest on consumer product recognition within user generated postings and normalization against a large product catalog[C]∥Proc of the ICDM-2012 Workshop on Consumer Product Contest,2012:861-864.
[14]Hinton G E.Learning distributed representations of concepts[C]∥Proc of the 8th Annual Conference of the Cognitive Science Society,1986:1.
[15]Pennington J,Socher R,Manning C D.GloVe:Global vectors for word representation[C]∥Proc of the Empirical Methods in Natural Language Processing,2014:1532-1543.
[16]Soricut R,Och F.Unsupervised morphology induction using word embeddings[C]∥Proc of the 2015 Annual Conference of the North American Chapter of the ACL,2015:1627-1637.
[17]Collobert R,Weston J,Bottou L,et al.Natural language processing (almost) from scratch[J].The Journal of Machine Learning Research,2011,12(4):2493-2537.
[18]Mnih A,Hinton G.Three new graphical models for statistical language modelling[C]∥Proc of the 24th International Conference on Machine Learning,2007:641-648.
[19]Mnih A,Hinton G E.A scalable hierarchical distributed language model[C]∥Proc of the Neural Information Processing Systems,2009:1081-1088.
[20]Li Hang.Statistical learning methods[M].Beijing:Tsinghua University Press,2012.(in Chinese)
[21]Borthwick A.A maximum entropy approach to named entity recognition[D].New York:New York University,1999.
[22]Lu Ming,Kang Yu-jie,Yu Neng-hai.Basic grammar rule and maximum entropy based hybrid model for named entity recognition[J].Journal of Chinese Computer Systems,2012,33(3):537-541.(in Chinese)
Appended Chinese-language references:
[2]宗成庆.统计自然语言处理[M].第2版.北京:清华大学出版社,2013.
[20]李航.统计学习方法[M].北京:清华大学出版社,2012.
[22]陆铭,康雨洁,俞能海.简约语法规则和最大熵模型相结合的混合实体识别[J].小型微型计算机系统,2012,33(3):537-541.