Total thesis length: 22,683 characters

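Abstract

In an era of rapidly growing information, people receive a great deal of information every day, some of it useful and some not. As the Internet has developed, more and more people have come to read news on the web. Different readers have different preferences: some follow only domestic news, some care about international affairs, and some have no particular preference. Sorting news by category would save readers a good deal of trouble. However, the volume of online news is so large that manual classification would require too much work and would not be timely. We therefore use the support vector machine (SVM), a machine learning technique, to do this work.

The support vector machine is a commonly used method for document categorization. In this thesis we use a web crawler to collect a number of domestic and international news articles from online news portals, segment the text into words with the NLPIR2015 software, and extract features from the segmented text with the TFIDF algorithm and the document frequency algorithm. Combined with other processing steps such as stop-word removal, this yields a feature vector for each text. Finally, we train and test the support vector machine, then tune its parameters for improvement and analysis.

Keywords: support vector machine, document categorization, TFIDF algorithm, document frequency algorithm, stop words, feature vector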

The Design and Implementation of a Document Categorization System Based on Support Vector Machine

Abstract

In the era of information explosion, people face a wide variety of information every day, some of it useful and some not. With the development of the Internet, more and more people have begun to read news online. Different people have different preferences: some pay attention only to domestic news, some are concerned with changes in the international situation, and some have no preference. If news is sorted into categories, it is more convenient for readers. However, the volume of online news is so large that relying on manual work alone would be laborious and too slow. Therefore, we consider using artificial intelligence technology to classify online news automatically, and we use the support vector machine (SVM) to complete this work.

The support vector machine (SVM) was first proposed by Cortes and Vapnik in 1995. It shows many unique advantages in solving small-sample, nonlinear, and high-dimensional pattern recognition problems, and it can also be applied to other machine learning problems such as function fitting.

The support vector machine is based on VC dimension theory from statistical learning and on the principle of structural risk minimization. Given limited sample information, it seeks the best compromise between model complexity (i.e., the learning accuracy on the given training samples) and learning ability (i.e., the ability to recognize arbitrary samples without error), in order to obtain the best generalization ability.

Because the support vector machine focuses on the VC dimension, its ability to solve a problem does not depend on the dimensionality of the samples (even samples with 10 thousand dimensions), which makes it well suited to document categorization. This is why we choose the support vector machine for document categorization.

KEYWORDS: support vector machine, document categorization, TFIDF algorithm, Document Frequency algorithm, stop words, feature vector

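The Design and Implementation of a Document Categorization System Based on Support Vector Machine

Abstract (Chinese)

Abstract (English)

Chapter 1 Introduction

1.1 Research Background and Significance

1.2 Related Work

1.2.1 Document Categorization

1.2.2 Support Vector Machines in Document Categorization

1.2.3 The TFIDF Feature Extraction Algorithm in Document Categorization

1.2.4 The Document Frequency Algorithm in Document Categorization

1.3 Functional Analysis

1.3.1 Functional Decomposition

1.3.2 Functional Description

1.4 Organization of This Thesis

Chapter 2 Related Technologies

2.1 Word Segmentation

2.2 Feature Extraction

2.2.1 Stop-Word Removal

2.2.2 TFIDF Algorithm

2.2.3 Document Frequency Algorithm

2.3 Support Vector Machine

Chapter 3 System Design and Implementation

3.1 Web Crawler

3.2 Word Segmentation

3.3 Feature Extraction

3.3.1 Feature Extraction Steps

3.3.2 Workflow

3.3.3 Functions and Data Structures

3.4 Support Vector Machine

3.4.1 Model Design and Workflow

3.4.2 Functions and Data Structures

Chapter 4 Analysis of Experimental Results

4.1 Results

4.2 Analysis of Results

4.3 Evaluation

Acknowledgments

References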

Chapter 1 Introduction

1.1 Research Background and Significance

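The Internet is one of the greatest inventions of the last century. As commercial networks developed and large numbers of companies moved online, commercial applications on the Internet grew enormously, the Internet offered users more and more services, and it spread and developed rapidly. Over time, more and more people came into contact with the Internet, which now benefits every aspect of daily life. In the 21st century, humanity entered the era of information explosion, and the volume of news grew rapidly. Numerous news portals were founded and attracted large numbers of users, and people gradually became accustomed to getting their news from the Internet. Different people, however, have different preferences: some follow only domestic news, some care about international affairs, and some have no particular preference. Sorting news by category would save readers a good deal of trouble. Yet the volume of online news is so large that manual classification would require too much work and would not be timely. If machine learning could classify news automatically, it would save much manual labor and deliver results promptly. In-depth research into methods for intelligent document categorization is therefore of considerable significance.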
