
 2022-05-21 10:05


Abstract









In the 21st century, the world has entered the information age. Much of the information on the Internet is continually updated and changes over time. On highly time-sensitive sites, past news updates are often not preserved, so users who want a news corpus for a particular field from an earlier period frequently lack the channels and methods to obtain one.

Existing corpora on the Internet are very large and store text drawn from books, magazines, and the wider Internet. However, they generally do not collect material systematically for a specific field: their sources are mixed and their content is disorganized. Existing corpus systems also have overly long update cycles, so they cannot update and supplement highly time-sensitive online content in time. In addition, some corpora lack basic language processing of the raw text, such as word segmentation and part-of-speech tagging, which makes subsequent analysis of corpus features inconvenient.

This project designs and implements a natural language processing system for a dynamic chronological corpus, focusing on content published by colleges and universities, in the field of education, at home and abroad. The main tasks of the project are:

Based on the page structure of the home pages of official university websites at home and abroad, design an HTML parsing method based on the document tree, and write a crawler script that fetches and parses the titles and body text of those home pages.

Analyze and compare several mainstream Chinese word segmentation algorithms, select and implement the N-Gram segmentation model, and perform simple part-of-speech tagging on the segmentation results.

Build a user-friendly front-end website for searching the chronological corpus collected from university home pages at home and abroad, showing each entry's source, release time, position in the document, Chinese word segmentation results, and part-of-speech tagging results.
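As a rough illustration of the second task, the following minimal Python sketch shows one way a Bi-Gram model can drive Chinese word segmentation: it counts unigrams and bigrams over a pre-segmented training corpus, then uses dynamic programming to pick the segmentation of a sentence with the highest bigram probability. The function names (train_bigram, log_prob, segment), the add-one smoothing, the max_len cap, and the tiny hand-made training corpus are all assumptions made for this illustration; they are not taken from the project's actual implementation or dataset.

```python
import math
from collections import Counter

BOS = "<s>"  # hypothetical sentence-start marker


def train_bigram(segmented_sentences):
    """Count unigrams and bigrams over pre-segmented sentences (lists of words)."""
    unigrams, bigrams = Counter(), Counter()
    for words in segmented_sentences:
        prev = BOS
        unigrams[BOS] += 1
        for word in words:
            unigrams[word] += 1
            bigrams[(prev, word)] += 1
            prev = word
    return unigrams, bigrams


def log_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed log P(word | prev); the smoothing choice is an assumption."""
    vocab_size = len(unigrams)
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))


def segment(text, unigrams, bigrams, max_len=4):
    """Return the segmentation of `text` with the highest bigram probability."""
    # best[i] maps the last word of a segmentation of text[:i]
    # to (log probability, list of words) for the best such segmentation.
    best = [dict() for _ in range(len(text) + 1)]
    best[0][BOS] = (0.0, [])
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            # Unknown multi-character strings are skipped; single characters
            # are always allowed so every input has at least one segmentation.
            if len(word) > 1 and word not in unigrams:
                continue
            for prev, (score, words) in best[start].items():
                cand = score + log_prob(prev, word, unigrams, bigrams)
                if word not in best[end] or cand > best[end][word][0]:
                    best[end][word] = (cand, words + [word])
    return max(best[len(text)].values(), key=lambda item: item[0])[1]


if __name__ == "__main__":
    # Toy training data, invented for this example only.
    corpus = [
        ["南京", "大学", "发布", "新闻"],
        ["大学", "发布", "招生", "新闻"],
        ["南京", "大学", "招生"],
    ]
    uni, bi = train_bigram(corpus)
    print(segment("南京大学发布招生新闻", uni, bi))
```

Keeping one best candidate per "last word" at each position is what makes this a bigram search rather than a unigram one: the score of the next word depends only on the word immediately before it, so that is the only state the search needs to remember.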

Keywords: Web crawler; chronological corpus; Chinese word segmentation; N-Gram; Web system

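Table of Contents

Abstract


Chapter 1 Introduction

1.1 Foreword

1.2 Background and Significance of the Topic

1.3 Main Work of the Thesis

1.4 Organization of the Thesis

1.5 Chapter Summary

Chapter 2 Overall System Design

2.1 Modules and Their Functions

2.2 System Architecture Diagram

2.3 Chapter Summary

Chapter 3 Web Crawler Module

3.1 Overview of the Development Environment

3.1.1 The Python Language and Its Native Crawler Libraries

3.1.2 The BeautifulSoup4 HTML/XML Parsing Library

3.2 Analysis of Crawler Principles

3.3 Fetching Page Source Code with the Requests Library

3.4 Analysis of the Page HTML Structure

3.5 Parsing HTML Based on the Document Tree

3.5.1 Identifying the Entry class of the String Objects to Extract

3.5.2 Parsing String Objects via the Entry class

3.5.3 Excluding Noise from the Result Set

3.6 Chapter Summary

Chapter 4 Database Module

4.1 Choosing the MySQL Database

4.2 Storage Requirements Analysis

4.3 Preventing Duplicate Database Inserts

4.3.1 Three Ways to Prevent Duplicate Inserts in MySQL

4.3.2 Using the MD5 Hash of Paragraph Content as the Primary Key

4.4 Creating Django Models and Mapping Them to MySQL via the ORM

4.5 Chapter Summary

Chapter 5 Corpus Analysis Module

5.1 Preface

5.2 Current State of Research at Home and Abroad

5.3 Performance Comparison of Word Segmentation Models

5.4 Choosing N-Gram as the System's Segmentation Model

5.5 Introduction to and Application of the N-Gram Model

5.5.1 Principles of the N-Gram Model

5.5.2 Applying Bi-Gram to Chinese Word Segmentation

5.6 Implementing the Bi-Gram Model for Chinese Word Segmentation

5.6.1 Dataset Source

5.6.2 Preprocessing the Dataset

5.6.3 Training on the Corpus to Obtain the Bi-Gram Distribution Matrix

5.6.4 Predicting the Test Corpus with Bi-Gram

5.7 Part-of-Speech Tagging with the Maximum Probability Method

5.8 Chapter Summary

Chapter 6 Web System Design

6.1 Overview of the Development Environment

6.1.1 The Django Web Framework

6.1.2 The haystack Full-Text Search Framework

6.1.3 The BootStrap Front-End Framework

6.2 Django Web Back-End Development

6.2.1 Development Environment and Software Versions

6.2.2 Installing and Configuring Django

6.2.3 Configuring Global URLs

6.2.4 Adding the haystack Full-Text Search Component

6.2.5 Adding the APScheduler Scheduled-Task Module for Periodic Automatic Crawling

6.3 Chapter Summary

Chapter 7 System Testing and Results

7.1 Crawler Testing

7.2 Corpus Analysis Testing

7.2.1 Evaluation Metrics for Segmentation Performance

7.2.2 Analysis of Segmentation Results

7.2.3 Analysis of Part-of-Speech Tagging Results

7.3 Web Module Testing

7.4 Chapter Summary

Chapter 8 Summary and Outlook

8.1 Summary of Work

8.2 Future Work

8.3 Chapter Summary

References

Acknowledgments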

Chapter 1 Introduction

1.1 Foreword



