
 2022-11-21 09:11


Abstract

As computer technology continues to advance, new programming languages keep emerging, and Python and HTML are among the most prominent. Compared with earlier, widely used high-level languages such as Java and C, Python offers more practical modules and libraries; it gives up some low-level features, but it makes small-scale projects more convenient to develop. HTML is likewise widely used on the front end of websites. Combined with CSS, this markup language enriches the content and appearance of web pages and has supported the development of user-friendly e-commerce systems.

With the rapid growth of the Internet, search engines play an increasingly important role in web search services. The web crawler is an important component of a search engine system: it collects pages from the web and so supplies the material that the search engine indexes. Faced with the explosive growth of online information, centralized stand-alone crawlers have long been unable to keep up with its scale, and high-performance distributed crawler systems have become a focus of research in information collection. This thesis studies the principles of web crawlers, distributed design, the main modules, bottlenecks, and crawler solutions. Its main tasks are:

1. Design and implement a URL queue based on the Mercator model, giving the system politeness and priority functions.

2. Provide solutions for large-scale URL deduplication, DNS resolution, page fetching, and the related bottlenecks.

3. Define index files and data files for effective page storage and management, and build a file-based page storage system.

Based on this work, the thesis designs and implements a high-performance distributed web crawler system. According to the experiments, the crawler system not only achieves high page-fetching efficiency, high concurrency, and stable operation, but also shows good scalability, fault tolerance, and load balancing.

Keywords: Python; HTML; data mining; data analysis


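A Python-Based Web Crawler

1 Introduction

1.1 Background and Purpose of the Graduation Project

1.2 Research Status at Home and Abroad

1.3 Structure and Content of the Thesis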

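2 Related Technologies

2.1 The Python Language

2.1.1 Origin and Development History of Python

2.1.2 Principles of Python

2.1.3 Features of Python

2.1.4 Shortcomings of Python

2.2 URL

2.2.1 Definition of URL

2.2.2 Comparison of URI and URL, with Examples

2.2.3 Components of a URL

2.3 HTML

2.3.1 Definition

2.3.2 Principles of HTML

2.3.3 Features of HTML

2.4 Development Tools

2.4.1 Chrome

2.4.2 PyCharm CE

2.4.3 Terminal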

3 System Requirements

3.1 Analysis of the Input Module

3.2 Content to Be Scraped

3.3 Local Output

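4 Project Analysis and Implementation

4.1 Input Module

4.1.1 Target Website URL

4.1.2 The urllib2 Module

4.1.3 Request Disguise

4.2 Scraping Module

4.2.1 URL and HTML

4.2.2 The Beautiful Soup Library

4.2.3 Scraping Method

4.3 Output Module

4.4 Final Result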

5 Project Testing

5.1 Testing for Incorrect Scraping Results

5.2 Testing for Incorrect Display Results

5.3 Network Connection Testing

6 Conclusion

6.1 Gains and Growth

6.2 Shortcomings and Outlook

7 Acknowledgements

References

1 Introduction





