Total thesis length: 26,163 characters

Abstract

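A web crawler is a common tool for collecting web page data. Depending on the crawling strategy and the crawling target, crawlers fall into different types, such as topical web crawlers, incremental web crawlers, and deep web crawlers. An incremental web crawler fetches only newly created pages or pages that have been updated. It avoids re-crawling data and reduces storage overhead, and therefore improves crawling efficiency.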

The main work of this thesis is as follows:

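The key technologies of web crawlers are studied, and the strengths and weaknesses of several information extraction techniques, URL de-duplication mechanisms, and page update prediction algorithms are compared. Regular expressions and an algorithm based on the line-block distribution function are chosen for information extraction, a Cuckoo filter is used to implement URL de-duplication, and a self-tuning page update frequency algorithm is used to predict the update time of channel list pages.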
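To make the URL de-duplication step concrete, the following is a minimal Python sketch of a Cuckoo filter. It is not the implementation used in this thesis: the class name `CuckooFilter`, the 16-bit fingerprint, the SHA-1 based hashing, and every default parameter are assumptions chosen for readability.

```python
import hashlib
import random


class CuckooFilter:
    """Illustrative cuckoo filter; names and parameters are assumptions."""

    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500, seed=0):
        # A power-of-two bucket count keeps the XOR-derived alternate index
        # in range and makes the index mapping its own inverse.
        if num_buckets <= 0 or num_buckets & (num_buckets - 1):
            raise ValueError("num_buckets must be a power of two")
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self._rng = random.Random(seed)

    @staticmethod
    def _hash(data):
        return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

    def _fingerprint(self, url):
        # 16-bit fingerprint in 1..65535 (0 is avoided on purpose).
        return self._hash(b"fp:" + url.encode("utf-8")) % 0xFFFF + 1

    def _index(self, url):
        return self._hash(url.encode("utf-8")) % self.num_buckets

    def _alt_index(self, index, fp):
        # Partial-key cuckoo hashing: the other bucket depends only on the
        # current bucket and the fingerprint, not on the original URL.
        return (index ^ self._hash(fp.to_bytes(2, "big"))) % self.num_buckets

    def add(self, url):
        fp = self._fingerprint(url)
        i1 = self._index(url)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a resident fingerprint and relocate it.
        i = self._rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self._rng.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # Too full; the last evicted fingerprint is dropped.

    def __contains__(self, url):
        fp = self._fingerprint(url)
        i1 = self._index(url)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

    def remove(self, url):
        fp = self._fingerprint(url)
        i1 = self._index(url)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False

    def seen(self, url):
        """Return True if url was probably seen before; otherwise record it."""
        if url in self:
            return True
        self.add(url)
        return False
```

A crawler would call `seen(url)` before scheduling a request and skip the URL when it returns `True`. As with a Bloom filter, a small false-positive rate is possible. Unlike a Bloom filter, stored entries can be deleted, which is useful when a URL should become eligible for crawling again.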

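The Scrapy crawling framework is improved: its URL filtering, URL de-duplication, and URL scheduling are extended so that it supports incremental crawling. A set of news list pages is selected as the test set, and the results show that the improved Scrapy crawler achieves a higher crawl success rate and lower time overhead.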
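The idea of incremental discovery on a list page can be sketched in plain Python, without Scrapy. The function name `discover_new_links`, the `allow` pattern, and the use of an in-memory `set` for seen URLs are all hypothetical and only illustrate how filtering and de-duplication combine. They do not reproduce the Scrapy extensions described here.

```python
import re
from urllib.parse import urljoin

# Simplified href extractor; real pages may need a proper HTML parser.
HREF_RE = re.compile(r'<a\s[^>]*?href=["\']([^"\']+)["\']', re.IGNORECASE)


def discover_new_links(list_page_url, html, seen, allow=r"/news/"):
    """Return links on a list page that were not seen in earlier passes."""
    allow_re = re.compile(allow)
    fresh = []
    for href in HREF_RE.findall(html):
        url = urljoin(list_page_url, href)  # resolve relative links
        if not allow_re.search(url):
            continue  # URL filtering: keep only article-like URLs
        if url in seen:
            continue  # URL de-duplication: already crawled or queued
        seen.add(url)
        fresh.append(url)
    return fresh
```

On the first pass every matching link is returned. On later passes over the same list page only links that have appeared since the previous pass are returned, so the crawler downloads only the increment.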

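The incremental web crawler implemented in this thesis supports both single-machine crawling and crawling in a distributed environment. As Internet data grows explosively, it can meet the demand for concurrent collection, and it therefore satisfies the requirements for web data collection in the current big data context.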

Keywords: web crawler, URL de-duplication, web page update prediction, incremental discovery

DESIGN AND IMPLEMENTATION OF INCREMENTAL DISCOVERY ALGORITHM FOR TARGET WEBSITES AND CHANNELS IN WEB CRAWLER

Abstract

A web crawler is a common tool for collecting web data. Depending on the crawling strategy and the crawling target, crawlers can be divided into different types, for example topical web crawlers, incremental web crawlers, and deep web crawlers. An incremental web crawler fetches only newly generated or updated pages. It avoids re-crawling duplicate data, which reduces storage overhead and improves crawling efficiency.

The main work of this thesis is as follows:

The key technologies of web crawlers are studied, and the strengths and weaknesses of different information extraction techniques, URL de-duplication mechanisms, and page update frequency prediction algorithms are compared. We select regular expressions and an algorithm based on the line-block distribution function to extract URLs and content, use a Cuckoo filter to implement URL de-duplication, and use a self-tuning algorithm to predict the update frequency of channel list pages.
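The exact self-tuning rule is not given in this section. As a rough illustration of the general idea, the sketch below shortens the revisit interval of a list page when a visit finds new content and lengthens it otherwise. The class name `RevisitScheduler`, the multiplicative factors, and the interval bounds are assumptions, not values taken from the thesis.

```python
from dataclasses import dataclass


@dataclass
class RevisitScheduler:
    """Adaptive revisit interval for one list page (illustrative only)."""

    interval: float = 3600.0       # current revisit interval, in seconds
    min_interval: float = 300.0    # never poll more often than this
    max_interval: float = 86400.0  # never wait longer than this
    shrink: float = 0.5            # factor applied when the page changed
    grow: float = 1.5              # factor applied when it did not change

    def update(self, changed: bool) -> float:
        """Adjust the interval after a visit and return the new value."""
        factor = self.shrink if changed else self.grow
        candidate = self.interval * factor
        # Clamp so that a long run of hits or misses cannot push the
        # interval outside the configured bounds.
        self.interval = min(self.max_interval, max(self.min_interval, candidate))
        return self.interval
```

A page that changes on every visit converges to `min_interval`, while a static page drifts toward `max_interval`. This keeps polling effort focused on active channels.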

The Scrapy crawling framework is improved. We write extensions for URL filtering, URL de-duplication, and URL scheduling so that the framework supports incremental crawling. A set of news list pages is selected as the test set, and the results show that the improved Scrapy crawler has a higher crawl success rate and lower time overhead.

The incremental web crawler implemented in this thesis supports both single-machine crawling and crawling in a distributed environment. As Internet data grows explosively, it can meet the demand for concurrent collection, and it therefore satisfies the requirements for web data collection in the current big data context.

Keywords: web crawler, URL de-duplication, web page update frequency prediction, incremental discovery

Introduction

1.1 Research Background

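A web crawler is a program that extracts web page data automatically. Following a set of crawling rules, it traverses the Internet along hyperlinks and downloads the data of each page it reaches. It is also known figuratively as a web spider, web ant, web chaser, or web robot. Web crawlers usually serve as a component of search engines, where they refresh stored page content and index the content of other pages so that users can search more efficiently. Because a crawler can download web page data, and that data can create commercial value for an industry or an enterprise once it has been analyzed and processed, research and development related to web crawlers has grown rapidly.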
