• Journal of the China Computer Federation (CCF)
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

• Artificial Intelligence and Data Mining


A new proof of the equivalence between random walk sequences and sentences in representation learning

SUN Yan1,3,4,5, SUN Mao-song1,2, ZHAO Hai-xing1,3,4, YE Zhong-lin1,3,4

(1. School of Computer, Qinghai Normal University, Xining 810016, China;
    2. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China;
    3. Qinghai Provincial Key Laboratory of Tibetan Information Processing and Machine Translation, Xining 810008, China;
    4. Key Laboratory of the Education Ministry for Tibetan Information Processing, Xining 810008, China;
    5. School of Computer, Qinghai Nationalities University, Xining 810007, China)

     
• Received: 2019-07-11; Revised: 2019-09-16; Online: 2020-02-25; Published: 2020-02-25
  • Supported by: the National Natural Science Foundation of China (11661069), the Program for Changjiang Scholars and Innovative Research Team in University (IRT_15R40), and the Applied Basic Research Project of the Qinghai Science and Technology Department (2019ZJ7012)

Abstract:

Representation learning, in machine learning, maps information that carries association relationships into a low-dimensional vector space through a shallow neural network. The goal of word representation learning is to map the relationships between words and their context words into a low-dimensional representation vector space, while the goal of network representation learning is to map the relationships between network nodes and their context nodes into a low-dimensional representation vector space. Word vectors are the output of word representation learning, and node representation vectors are the output of network representation learning. DeepWalk uses a random walk strategy to obtain walk sequences over the network nodes and treats them as the sentences of the word2vec model; it then extracts node pairs with a sliding window and feeds them into the neural network for training. word2vec and DeepWalk share the same underlying model and optimization method, namely the Skip-Gram model with negative sampling, which is referred to as SGNS. Existing research shows that both word representation learning and network representation learning algorithms based on the SGNS model implicitly factorize a target feature matrix. Perozzi et al. observed that word frequencies obey Zipf's law and that node degrees in a network obey a power-law distribution, and on this basis argued that random walk sequences in a network are equivalent to sentences in a language model. However, declaring sentences equivalent to random walk sequences merely because both follow power-law distributions is not sufficient. Therefore, based on the theory that SGNS implicitly factorizes a target feature matrix, this paper designs two comparative experiments that use singular value decomposition and matrix completion to perform node classification tasks on three public datasets, and the experiments confirm the equivalence between sentences and random walk sequences.
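To make the pipeline concrete, here is a minimal Python sketch (illustrative only, not code from the paper) of the DeepWalk input stage the abstract describes: random walks over a toy graph become "sentences", and a sliding window reads off the (center, context) node pairs that SGNS trains on. The toy graph, walk length, and window size are assumptions chosen for the example.

```python
import random

# Toy undirected graph as an adjacency list (illustrative assumption).
graph = {
    0: [1, 2],
    1: [0, 2],
    2: [0, 1, 3],
    3: [2, 4, 5],
    4: [3, 5],
    5: [3, 4],
}

def random_walk(graph, start, length, rng):
    """One truncated random walk: DeepWalk's analogue of a sentence."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

def sliding_window_pairs(walk, window):
    """(center, context) pairs, exactly what word2vec reads off a sentence."""
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, walk[j]

rng = random.Random(42)
walks = [random_walk(graph, v, length=8, rng=rng)
         for v in graph for _ in range(10)]
pairs = [p for w in walks for p in sliding_window_pairs(w, window=2)]
print(walks[0])    # one walk "sentence" over node IDs
print(pairs[:5])   # training pairs for SGNS
```

Because this input stage is identical for sentences and for walk sequences, any difference between the two can only come from the statistics of the sequences themselves, which is what the paper's experiments probe.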

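The comparison rests on the result that SGNS implicitly factorizes a shifted positive pointwise mutual information (SPPMI) matrix. Below is a minimal numpy sketch of that view, assuming a toy co-occurrence count matrix and shift k: build SPPMI from counts and factorize it with truncated SVD, the first of the paper's two comparison methods (matrix completion would take the place of the SVD step).

```python
import numpy as np

def sppmi(counts, k=5):
    """Shifted positive PMI: max(PMI - log k, 0), the matrix SGNS implicitly factorizes."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0          # zero counts contribute nothing
    return np.maximum(pmi - np.log(k), 0.0)

def svd_embeddings(m, dim):
    """Rank-dim truncated SVD; rows of U * sqrt(S) serve as node/word vectors."""
    u, s, _ = np.linalg.svd(m, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])

# Toy co-occurrence counts, e.g. tallied from (center, context) pairs above.
counts = np.array([[0., 4., 2., 0.],
                   [4., 0., 3., 1.],
                   [2., 3., 0., 5.],
                   [0., 1., 5., 0.]])
vectors = svd_embeddings(sppmi(counts, k=2), dim=2)
print(vectors.shape)   # (4, 2): one embedding per node (or word)
```

Whether the counts are tallied over sentences or over random walk sequences, the same factorization applies; feeding both through this route and comparing downstream node classification is the core of the paper's experimental design.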

Key words: word vector, shifted positive pointwise mutual information, sentence, random walk sequence