Total thesis length: 14,562 characters

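Contents

1 Introduction

1.1 Outlier Data Mining Techniques

1.2 Research Background and Significance

1.3 Research Status at Home and Abroad

1.4 Structure of This Thesis

2 Definition of Information Entropy

3 Overview of the SLOM Algorithm

3.1 Related Definitions

3.2 Analysis of the SLOM Algorithm

4 Overview of the LDOF Algorithm

4.1 Related Definitions

4.2 Analysis of the LDOF Algorithm

5 Information Entropy Weighting Algorithm

5.1 Definition of Information Entropy Weighting

5.2 Information Entropy Weighting Algorithm

5.3 Advantages of the Information Entropy Weighting Algorithm

6 Experimental Analysis

6.1 Random Data Experiment

6.1.1 Twenty Randomly Generated Data Points

6.1.2 Experimental Procedure

6.1.3 Analysis of Experimental Results

6.2 Two-Cluster Data Experiment

6.2.2 Introduction to the FCM Algorithm

6.2.3 Experimental Procedure

6.2.4 Analysis of Experimental Results

6.3 Summary of Experiments

7 Conclusion

References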

Research on Outlier Detection Based on Information Entropy Weighting

冯超

ABSTRACT: Outlier detection is one of the important research directions in data mining. Outlier detection techniques find the data that are inconsistent with the majority of a data set, and they are widely used in intrusion detection. This paper studies algorithms for outlier data mining. It adds an information entropy calculation to the SLOM and LDOF algorithms: when distances are computed, information entropy theory is used to compute a weighted distance. This improves the accuracy of outlier detection at the cost of some additional space, which stores the attribute weight vector computed from the information entropy of each attribute. Finally, experiments are carried out to support this claim.

Keywords: Outlier; Information entropy; Weighting; SLOM; LDOF

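1 Introduction

1.1 Outlier Data Mining Techniques

Outlier data can be defined as data that differ from most of the data in a data set and deviate from the characteristics of the data set as a whole. Outlier data mining is the task of finding the samples in a data set that lie far from its center. Many algorithms for outlier data mining exist, including the SLOM, LDOF, and FCM algorithms used in the comparative experiments later in this thesis, as well as the WSRFCM algorithm proposed in reference [12], which improves on FCM. Each of these algorithms has its own strengths and weaknesses, so the task at hand is to improve on known algorithms so that the accuracy of outlier data mining can be raised further.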

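1.2 Research Background and Significance

Outlier detection is one of the important research directions in data mining. Outlier detection techniques can find, within a large body of data, the outlying data that are inconsistent with the majority. These techniques are widely used in practice, for example in network intrusion detection and in detecting malicious credit card overdrafts. These application areas reach deep into everyday life and are closely tied to the general public.

In addition, when large volumes of data are processed, outlier detection techniques make the fullest use of automated processing, effectively reducing both the workload of manual data analysis and its reliance on subjective experience.

In summary, research on outlier data mining algorithms has considerable practical significance.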

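1.3 Research Status at Home and Abroad

At present, many researchers both in China and abroad are studying outlier data mining, and they have proposed many algorithms for outlier detection: the clustering-related FCM algorithm, algorithms related to weighted association rules, and algorithms related to weight values. Over time, researchers have not only developed new algorithms but have also kept working on improving existing ones to obtain better outlier detection results.

The research in this thesis proposes improvements to existing outlier detection algorithms, with the aim of raising their detection accuracy and effectively improving the algorithms.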