一种基于网络拓扑结构的社交网络链接预测方法研究

 2022-01-21 10:01

论文总字数:68390字

摘 要

社交网络中的链接预测是指预测节点之间产生链接的可能性,这正是信息科学中最基本的问题——缺失信息的还原和预测。链接预测相关研究是目前计算机科学关注的热点,链接预测在社会网络演化分析和推荐系统等应用中都起着重要,已成为分析社会网络的有力工具。

经典链接预测的方法主要是利用节点信息(例如文本信息,邻居信息等)来进行链接预测的。社交网络中的每一个节点都具有一些属性,利用节点真实的属性信息可以得到很好的链接预测效果,但是很多情况下这些信息的获取非常困难,不能保证信息的真实性,从不同网络获得的信息也不尽相同。而节点间的拓扑结构较容易获得,具有较高的真实性,并且利用拓扑结构进行链接预测具有较好的通用性。但是经典链接预测方法只使用了节点共同邻居,节点度等信息,对节点的拓扑结构信息的利用还不够深入。为此,本文提出了一种新的基于拓扑结构的链接预测方法,更好地利用了节点的拓扑结构信息。此外,不同网络之间会存在一定的区别,促成链接形成的拓扑结构特征也不尽相同。为了合理地利用节点的拓扑结构信息,本文引入了监督学习方法,对节点的拓扑结构信息进行学习,得到特定网络的促成链接形成的拓扑结构特征。本文的主要贡献包括:

  1. 提出了一种寻找节点间拓扑结构信息的方法。本文将路径作为拓扑结构信息的来源,根据“三度影响力原则”以及“六度分隔理论”,假定当节点之间的路径长度超过6时,节点之间不会形成链接,即只考虑路径长度小于等于6的节点对。对于和间的任意路径,连接路径上相邻的节点,就得到了和间的一个拓扑结构。由于节点路径长度不超过6,可以用一个7个节点的图来表示这个拓扑结构,称之为7-子图。
  2. 提出一种基于优化节点配置文件(OVCP)特征抽取,结合监督学习进行链接预测的方法。OVCP提供了一种将子图转换为地址的方法,对于任意一个7-子图,可以算出其地址,并将其作为特征向量的一个特征。为每一个可能形成链接的节点对和设置一个向量,向量的每一个特征的值表示该特征对应的7-子图在节点对和之间的路径上出现的次数。将一段特定时间内形成链接的节点对作为正例,未形成链接的节点对作为反例,然后再将正例和反例的特征向量进行训练并生成模型,最后再用该模型在验证集上验证算法的有效性。
  3. 提出一种基于重叠社区发现算法的节点对选择方法。寻找需要预测的节点对是一个非常重要的步骤,最简单的方法是认为任意两个节点都可能会形成链接,这种方法的复杂度是,但事实上大部分节点对都不会形成链接,因此这种选择方法效率非常低下,对不可能形成链接的节点对进行链接预测会浪费了非常多不必要的计算。为了避免将不可能形成链接的节点对进行计算,本文提出利用快速社区发现算法和重叠社区发现算法将网络分成多个社区,并假定只有社区内的节点对才会形成链接,大大降低了算法的复杂度。但由于快速社区发现算法忽略了社区间形成的链接,丢失了部分可能会形成链接的节点对,而重叠社区算法可以保留社区间可能形成的链接,认为一个节点可以加入多个社区,与现实生活相符,保证了社区挖掘的真实性与准确性。实验结果表明引入重叠社区算法进行社区划分可在提高效率的同时获得较好的链接预测效果。

在上述工作的基础上,本文实现了该链接预测方法,在DBLP、Facebook Friendship和Facebook Wall等真实数据集的实验结果表明该方法能获得优秀的AUC精确度,远高于传统的基于拓扑结构的链接预测方法。

关键词:社交网络,链接预测,拓扑结构,优化节点配置文件,重叠社区发现算法,监督学习

A Social Network Link Prediction Method

Based on Network Topology

Abstract

Link prediction in social network means how to predict the possibility of a link between two unconnected nodes by observing current nodes and network information. It deals with the fundamental problem of information science: restoration and prediction of unknown information. Link prediction plays an important role in analysis on social network evolution and recommend system, which make it a powerful tool to analyze social network.

The traditional link prediction methods mainly base on nodes information, such as text information and neighbor information. Most nodes in social network have some attributes. We can get great prediction results if we can make full use of these attributes. But in most case, it’s hard to crawl these attributes of these nodes and we can’t make sure that all attributes are authentic. What’s more, nodes in different networks have different attributes. Nevertheless, the topology between nodes is easy to get and we can guarantee its authenticity. Besides that, it’s universal to predict links if we choose topology-based methods. But the traditional methods predict links according to common neighbors and nodes degrees only. The research on making use of topology is not thoroughly enough. So in this paper, we propose a new link prediction method based on topology, which makes full use of topology between nodes. There are some differences in different network. So the features that promote the formation of link between nods are different. Main contributions of this paper are:

  1. Propose a method to find topological structures between nodes. In this paper, we consider paths as topological structures. According to the "Three Degrees of Influence Rule" and the "Six Degrees of Separation", we assume that two node will not connect to each other if the length of the path between them is longer than six. So we ignore the paths longer than six. For a path between and, we complete the link between any two nodes on the path. So we can get a topological structure with more information. The length of the path in any topological structure is shorter than or equal to six, so we can store it by a graph with seven nodes, which is called 7-subgraph.
  2. Propose a supervised method based on OVCP and supervised learning. OVCP provides a way to convert a 7-subgraph to an address. For a 7-subgraph, we can calculate its address and treat it as a feature of feature vector. We set a feature vector for each nodes pair and. The value of a feature represents the times of a 7-subgraph on the path of and. If two nodes connected to each other in a time interval, we will set the vector of them as true case. Otherwise, we will set it as false case. Then, we use the vector list and case list to train the supervised learning model. At last, we apply the model to the validation set and validate the effectiveness of our method.
  3. Propose a nodes pair choosing method based on overlapping community detection algorithm. It is an important step to find nodes pair. The simplest way is to consider that any two nodes will be likely to connect to each other whose complexity is. In this paper, we use fast unfolding community detection algorithm and overlapping community detection algorithm to split the original network. These two algorithms can split the network into several communities. We assume that only the nodes pair in the same community will connect to each other, which reduces the complexity. However, the fast unfolding community detection algorithm will ignore some possible link between nodes in different community, but the overlapping community detection algorithm won’t, which is consistent with the reality. Experiments indicate that the overlapping community detection algorithm can achieve higher AUC efficiently.

With these works mentioned above, this paper implements a link prediction method based on topology. The result in DBLP dataset, Facebook Friendship dataset and Facebook Wall dataset indicates that the method can achieve excellent AUC, which is higher than the traditional link prediction methods based on topology.

Keywords:Social Network, Link Prediction, Topology, OVCP, Overlapping Community Detection, Supervised Learning

目 录

摘 要 I

Abstract II

第一章 绪论 1

1.1 研究背景和目的 1

1.2 国内外研究现状 2

1.2.1 链接预测方法 2

1.2.2 拓扑结构的研究 4

1.2.3 节点配置文件 4

1.3 主要工作及贡献 5

1.3.1 本文主要工作 5

1.3.2 本文主要贡献 6

1.4 论文组织结构 7

第二章 问题定义与分析 8

2.1 问题定义 8

2.2 评价标准 10

2.4 本章小结 11

第三章 特征抽取 12

3.1 VCP方法 12

3.2 OVCP方法 17

3.2.1 过渡节点与7-子图 17

3.2.2 7-子图的特征抽取 19

3.3 复杂度分析 22

3.4 本章小结 22

第四章 社区发现 23

4.1 快速社区发现算法 23

4.2 重叠社区发现算法 25

4.3 节点对选择复杂度分析 28

4.3 本章小结 28

第五章 监督学习 29

5.1 支持向量机 29

5.2 朴素贝叶斯分类器 30

5.3 训练模型 31

5.4 评价验证集 31

5.5 本章小结 31

第六章 实验结果及分析 32

6.1 系统实现 32

6.2 数据集分析和处理 32

6.3 实验结果及分析 34

剩余内容已隐藏,请支付后下载全文,论文总字数:68390字

您需要先支付 80元 才能查看全部内容!立即支付

该课题毕业论文、开题报告、外文翻译、程序设计、图纸设计等资料可联系客服协助查找;