• Journal of the China Computer Federation (CCF)
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

• Artificial Intelligence and Data Mining


A new proof of the equivalence between random walk sequences and sentences in representation learning

SUN Yan1,3,4,5, SUN Mao-song1,2, ZHAO Hai-xing1,3,4, YE Zhong-lin1,3,4

(1. School of Computer, Qinghai Normal University, Xining 810016, China;
    2. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China;
    3. Qinghai Provincial Key Laboratory of Tibetan Information Processing and Machine Translation, Xining 810008, China;
    4. Key Laboratory of the Education Ministry for Tibetan Information Processing, Xining 810008, China;
    5. School of Computer, Qinghai Nationalities University, Xining 810007, China)

     
• Received: 2019-07-11; Revised: 2019-09-16; Online: 2020-02-25; Published: 2020-02-25
  • Supported by: the National Natural Science Foundation of China (11661069), the Program for Changjiang Scholars and Innovative Research Team in University (IRT_15R40), and the Applied Basic Research Project of the Qinghai Science and Technology Department (2019ZJ7012)

Abstract:

Representation learning, in machine learning, maps information that carries association relationships into a low-dimensional vector space through a shallow neural network. The goal of word representation learning is to map the relationships between words and their context words into a low-dimensional representation vector space, while the goal of network representation learning is to map the relationships between network nodes and their context nodes into a low-dimensional representation vector space. Word vectors are the output of word representation learning, and node representation vectors are the output of network representation learning. DeepWalk uses a random walk strategy to obtain walk sequences over the network nodes and treats them as the sentences of the word2vec model; it then extracts node pairs with a sliding window and feeds them into the neural network for training. word2vec and DeepWalk share the same underlying model and optimization method, namely the Skip-Gram model with negative sampling, which is referred to as SGNS. Existing research shows that both word representation learning and network representation learning algorithms based on the SGNS model implicitly factorize a target feature matrix. Perozzi et al. observed that word frequencies obey Zipf's law and that node degrees in a network obey a power-law distribution, and on this basis argued that random walk sequences in a network are equivalent to sentences in a language model. However, declaring sentences equivalent to random walk sequences merely because both follow power-law distributions is not sufficient. Therefore, based on the theory that SGNS implicitly factorizes a target feature matrix, this paper designs two comparative experiments that use singular value decomposition and matrix completion to perform node classification tasks on three public datasets, and the experiments confirm the equivalence between sentences and random walk sequences.
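To make the pipeline concrete, here is a minimal Python sketch (illustrative only, not code from the paper) of the DeepWalk input stage the abstract describes: random walks over a toy graph become "sentences", and a sliding window reads off the (center, context) node pairs that SGNS trains on. The toy graph, walk length, and window size are assumptions chosen for the example.

```python
import random

# Toy undirected graph as an adjacency list (illustrative assumption).
graph = {
    0: [1, 2],
    1: [0, 2],
    2: [0, 1, 3],
    3: [2, 4, 5],
    4: [3, 5],
    5: [3, 4],
}

def random_walk(graph, start, length, rng):
    """One truncated random walk: DeepWalk's analogue of a sentence."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

def sliding_window_pairs(walk, window):
    """(center, context) pairs, exactly what word2vec reads off a sentence."""
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, walk[j]

rng = random.Random(42)
walks = [random_walk(graph, v, length=8, rng=rng)
         for v in graph for _ in range(10)]
pairs = [p for w in walks for p in sliding_window_pairs(w, window=2)]
print(walks[0])    # one walk "sentence" over node IDs
print(pairs[:5])   # training pairs for SGNS
```

Because this input stage is identical for sentences and for walk sequences, any difference between the two can only come from the statistics of the sequences themselves, which is what the paper's experiments probe.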

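The comparison rests on the result that SGNS implicitly factorizes a shifted positive pointwise mutual information (SPPMI) matrix. Below is a minimal numpy sketch of that view, assuming a toy co-occurrence count matrix and shift k: build SPPMI from counts and factorize it with truncated SVD, the first of the paper's two comparison methods (matrix completion would take the place of the SVD step).

```python
import numpy as np

def sppmi(counts, k=5):
    """Shifted positive PMI: max(PMI - log k, 0), the matrix SGNS implicitly factorizes."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0          # zero counts contribute nothing
    return np.maximum(pmi - np.log(k), 0.0)

def svd_embeddings(m, dim):
    """Rank-dim truncated SVD; rows of U * sqrt(S) serve as node/word vectors."""
    u, s, _ = np.linalg.svd(m, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])

# Toy co-occurrence counts, e.g. tallied from (center, context) pairs above.
counts = np.array([[0., 4., 2., 0.],
                   [4., 0., 3., 1.],
                   [2., 3., 0., 5.],
                   [0., 1., 5., 0.]])
vectors = svd_embeddings(sppmi(counts, k=2), dim=2)
print(vectors.shape)   # (4, 2): one embedding per node (or word)
```

Whether the counts are tallied over sentences or over random walk sequences, the same factorization applies; feeding both through this route and comparing downstream node classification is the core of the paper's experimental design.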

Key words: word vector, shifted positive pointwise mutual information, sentence, random walk sequence