
 2022-02-06 06:02


Abstract


Keywords: digital data, Web, data scraping tool, HTML, database

Design and Implementation of a Web-Based Crawler


The world now runs on digital data, and the Web is the largest collection of it. The Web grows around the clock, so the principal problem is searching this huge store for specific information. Deciding whether a web page is relevant to a search topic is difficult. Many techniques exist for judging relevance, but if the user's perspective is taken as the key issue guiding the search, semantic-based web crawlers are unsurpassed. Such crawlers map relevance with the help of a lexical database: the crawler uses the senses provided by the lexical database to discover relatedness between the search query and the web page being searched. A focused web crawler helps estimate the similarity of a web page to the search query without downloading that page, and thus saves the bandwidth required to download an HTML page. This paper proposes and discusses one such approach to implementing a semantic-based focused web crawler.

KEY WORDS: digital data, Web, crawler, HTML, database
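
The idea in the abstract, scoring a link against the query through a lexical database before the page is fetched, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy `LEXICON` stands in for a real lexical database, anchor text is used as the only evidence available before download, and Jaccard overlap of sense sets is one simple relatedness measure among many. It is not the method of the thesis itself.

```python
import heapq

# Hypothetical toy lexical database: word -> set of sense labels.
# A real crawler would query an actual lexical database instead.
LEXICON = {
    "car": {"vehicle", "automobile"},
    "automobile": {"vehicle", "automobile"},
    "truck": {"vehicle"},
    "banana": {"fruit"},
}


def senses(word):
    """Return the senses of a word; unknown words act as their own sense."""
    key = word.lower()
    return LEXICON.get(key, {key})


def sense_set(text):
    """Union of the senses of every whitespace-separated word in text."""
    result = set()
    for word in text.split():
        result |= senses(word)
    return result


def relatedness(query, text):
    """Jaccard overlap between the sense sets of the query and the text."""
    q, t = sense_set(query), sense_set(text)
    if not q or not t:
        return 0.0
    return len(q & t) / len(q | t)


def rank_links(query, links):
    """Order (url, anchor_text) pairs by relatedness, most related first.

    Only the anchor text is scored, so no page is downloaded to decide
    which link deserves to be fetched first.
    """
    heap = [(-relatedness(query, anchor), url) for url, anchor in links]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


if __name__ == "__main__":
    candidates = [
        ("http://example.com/a", "banana recipes"),
        ("http://example.com/b", "automobile reviews"),
        ("http://example.com/c", "truck sales"),
    ]
    for url in rank_links("car", candidates):
        print(url)
```

A crawler built this way would fetch the highest-ranked links first and could skip links whose score falls below a chosen threshold, which is where the bandwidth saving comes from.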


Abstract (Chinese)

Abstract

Chapter 1 Introduction

1.1 Research Background

1.2 Current State of Research in China and Abroad

1.3 Main Research Content

1.4 Thesis Structure

Chapter 2 Related Technical Foundations

2.1 Web Page Crawling Strategies

2.2 Choice of Database

2.3 Crawling Applications

Chapter 3 Requirements Analysis of the Data Scraping Tool

3.1 System Implementation Goals

3.2 Functional Requirements

3.2.1 Information Acquisition Module

3.2.2 Information Processing Module

3.2.3 Information Storage Module

3.3 Non-Functional Requirements

Chapter 4 Design and Implementation of the Data Scraping Tool

4.1 System Design Principles

4.2 System Architecture

4.3 Detailed System Design and Implementation

4.3.1 Fetching HTML Pages

4.3.2 Processing HTML Pages

4.3.3 Keyword Matching

4.3.4 Storing Data in the Database

4.3.5 Database Management

4.3.6 Inter-Thread Communication

Chapter 5 Test Results of the Data Scraping Tool

5.1 Development Environment

5.2 Scraping Tool Run Results

Chapter 6 Conclusion

6.1 Summary of This Work

6.2 Open Problems and Outlook

Acknowledgements

References

  1. Introduction




