
 2022-10-17 09:10


Abstract





With the growth of the Internet, web pages contain ever more information, and the formats of the data they carry are becoming increasingly complex. Tables are a common form of data in web pages because they describe the relationships among data items explicitly, which makes them an important target for data extraction. In practice, however, web pages are designed mainly for users to browse and query, so their tabular data is difficult to extract and reuse. This thesis therefore designs and implements software for web page table detection and content extraction, which automatically detects, extracts, and manages table information in web pages.

The system is built mainly on .NET crawling technology and comprises three functional modules: a web page parsing module, a table identification and data extraction module, and a database connection and storage module. The user enters a URL; the system checks whether the corresponding page contains tables and, if so, extracts the table data for querying and display. The system is developed in the VS2012 environment using the .NET three-tier architecture, with a SQL Server database as the storage back end so that extracted data can be queried and displayed conveniently.

Keywords: .NET crawler; regular expressions; tabular data structure; database technology

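Contents

Chapter 1 Introduction

1.1 Background and Significance

1.2 Research Status

1.3 Main Research Content

1.4 Thesis Organization

Chapter 2 Related Technologies

2.1 Key Technologies

2.2 Development Tools

Chapter 3 System Requirements Analysis and Overall Design

3.1 Business Process Analysis

3.2 Functional Requirements Analysis

3.3 Non-Functional Requirements Analysis

3.4 Environment Requirements Analysis

3.5 Overall System Design

Chapter 4 Detailed System Design and Implementation

4.1 Design and Implementation of the Web Page Parsing Submodule

4.2 Design and Implementation of the Table Identification and Data Extraction Submodule

4.3 Design and Implementation of the Database Connection and Storage Module

Chapter 5 System Testing

5.1 Test Plan

5.2 Functional Testing

5.3 Non-Functional Testing

5.4 Chapter Summary

Conclusion

Acknowledgments

References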

Introduction

1.1 Background and Significance

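With the rapid development of the Internet, we have truly entered an era of information explosion and networked data. Across the many kinds of web pages, table-structured data accounts for the vast majority: roughly 52% of HTML sites that use the Table tag build entire pages on a table structure, for example train timetables, shopping sites, and option pages. Although some tables are used only for page layout, most are used to hold data. The tabular form lets users grasp the relationships among pieces of information more intuitively and makes the data easier to understand. As more and more people use table structures in web pages, research on web information extraction has increasingly focused on recognizing table structure.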


1.2 Research Status




