search
HomeWeb Front-endHTML Tutorial爬虫的理论知识储备_html/css_WEB-ITnose

参考资料:汪海:Python网络爬虫W3School HTML教程《计算机网络第二版》 谢希仁

网络爬虫,是一中按照一定的规则,自动地抓取万维网信息的程序或脚本。爬虫通过网页的链接地址来寻找网页并获取网页内容,再根据网页中其他链接不断循环爬取。

1 浏览网页的过程

浏览网页的过程其实就是浏览器作为一个浏览的“客户端”,向服务器端发送了 一次请求,把服务器端的文件“抓”到本地,再进行解释、展现。

  • 使用统一资源定位符URL来标志万维网上的各种文档,并使每一个文档在整个因特网的范围内具有唯一的标识符URL。
  • 通过超文本传送协议HTTP来实现万维网上各种连接,使用TCP连接进行可靠的传送。
  • 使用超文本标记语言HTML使得网页设计者可以很方便地用链接从本页面的某处链接到任意网页,并在自己主机屏幕上显示。

2 统一资源定位符URL

URL是用来表示从因特网上得到的资源位置和访问这些资源的方法。URL给资源的位置提供一种抽象的识别方法,并用这种方法给资源定位。只要能够对资源定位,系统就可以对资源进行各种操作,如存取、更新、替换和查找其属性。URL相当于一个文件名在网络范围的扩展。因此,URL是与因特网相连的机器上的任何可访问对象的指针。由于访问不同对象使用的协议不同,URL还能之处读取某个对象时所使用的协议。URL的一般形式为:

 <协议>://<主机>:<端口>/<路径>

协议是指用哪种协议获取该万维网文档,如http,ftp;主机是指该网络文档所在主机的域名;端口和路径有时可以省略。对万维网的网点访问使用HTTP协议,HTTP的默认端口号是80,通常可省略。若在省略文件的路径,则URL就指到因特网上的某个主页。如: www.baidu.com。

3 超文本传送协议HTTP

HTTP协议定义了浏览器怎样向万维网服务器请求万维网文档,以及服务器怎样把文档传送给浏览器。下图给出了万维网的大致工作过程。

万维网工作过程

HTTP规定在HTTP客户与HTTP服务器之间的每次交互,都由一个ASCII码穿构成的请求和一个“MIME-like”的响应组成,HTTP报文通常都使用TCP连接传送。

HTTP有两类报文:请求报文(从客户向服务器发送请求报文)和响应报文(从服务器到客户的回答)。HTTP请求报文和响应报文都是由三部分组成,两种报文格式的区别就是开始行不同。

  1. 开始行 用于区分是请求报文还是响应报文。开始行在两种报文中分别叫请求行状态行
  2. 首部行 用来说明浏览器或报文主题的一些信息。
  3. 实体主体 在请求报文中一般不用该字段,而在响应报文中也可能没有该字段。

请求行只有三个内容,即方法、请求资源URL和HTTP的版本。下表给出了请求报文中常用的几种方法。

方法 意义
GET 请求读取URL标志的信息
OPTION 请求一些选项的信息
HEAD 请求读取URL标志信息的首部
POST 给服务器添加信息,如注释
PUT 在致命的URL下存储一个文档
DELETE 删除致命的URL所标志的资源
CONNECT 用于代理服务器
GET http://www.bilibili.com/video/douga.html  HTTP/1.1

下面是一个请求报文的例子

请求报文

4 超文本标记语言HTML

HTML指的是超文本标记语言,是使用标记标签来描述网页的。

HTML标签是由尖括号包围的关键词,比如。HTML标签通常是成对出现的,标签对中的第一个标签是开始标签,第二个是结束标签,比如

HTML文档包含HTML标签和纯文本,也称为网页。Web 浏览器的作用是读取 HTML 文档,并以网页的形式显示出它们。浏览器不会显示 HTML 标签,而是使用标签来解释页面的内容。

四个基本的标签

  • -

    等:定义HTML 标题。
  • :定义HTML 段落。

  • :定义HTML 链接。
  • 爬虫的理论知识储备_html/css_WEB-ITnose:定义HTML 图像。
  • :HTML分组标签,定义文档中的分区或节。
    <h1 id="This-is-a-heading">This is a heading</h1><h2 id="This-is-a-heading">This is a heading</h2><h3 id="This-is-a-heading">This is a heading</h3><p>This is a paragraph.</p><p>This is another paragraph.</p><a href="http://www.w3school.com.cn">This is a link</a><img  src="/static/imghwm/default1.png"  data-src="w3school.jpg"  class="lazy"    style="max-width:90%"  style="max-width:90%" / alt="爬虫的理论知识储备_html/css_WEB-ITnose" >

    HTML 元素指的是从开始标签(start tag)到结束标签(end tag)的所有代码。元素的内容是开始标签与结束标签之间的内容。大多数 HTML 元素可以嵌套(可以包含其他 HTML 元素),HTML 文档由嵌套的 HTML 元素构成。如下例包含3个HTML元素。

    <html>    <body>        <p>This is my first paragraph.</p>    </body></html>

    HTML 属性:HTML 标签可以拥有属性,属性提供了有关 HTML 元素的更多的信息,属性总是以名称/值对的形式出现,比如:name="value",属性总是在 HTML 元素的开始标签中规定;属性值应该始终被包括在引号内,双引号是最常用的,不过使用单引号也没有问题。

    HTML 链接由标签定义,链接的地址在 href 属性中指定:This is a link

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
From Text to Websites: The Power of HTMLFrom Text to Websites: The Power of HTMLApr 13, 2025 am 12:07 AM

HTML is a language used to build web pages, defining web page structure and content through tags and attributes. 1) HTML organizes document structure through tags, such as,. 2) The browser parses HTML to build the DOM and renders the web page. 3) New features of HTML5, such as, enhance multimedia functions. 4) Common errors include unclosed labels and unquoted attribute values. 5) Optimization suggestions include using semantic tags and reducing file size.

Understanding HTML, CSS, and JavaScript: A Beginner's GuideUnderstanding HTML, CSS, and JavaScript: A Beginner's GuideApr 12, 2025 am 12:02 AM

WebdevelopmentreliesonHTML,CSS,andJavaScript:1)HTMLstructurescontent,2)CSSstylesit,and3)JavaScriptaddsinteractivity,formingthebasisofmodernwebexperiences.

The Role of HTML: Structuring Web ContentThe Role of HTML: Structuring Web ContentApr 11, 2025 am 12:12 AM

The role of HTML is to define the structure and content of a web page through tags and attributes. 1. HTML organizes content through tags such as , making it easy to read and understand. 2. Use semantic tags such as, etc. to enhance accessibility and SEO. 3. Optimizing HTML code can improve web page loading speed and user experience.

HTML and Code: A Closer Look at the TerminologyHTML and Code: A Closer Look at the TerminologyApr 10, 2025 am 09:28 AM

HTMLisaspecifictypeofcodefocusedonstructuringwebcontent,while"code"broadlyincludeslanguageslikeJavaScriptandPythonforfunctionality.1)HTMLdefineswebpagestructureusingtags.2)"Code"encompassesawiderrangeoflanguagesforlogicandinteract

HTML, CSS, and JavaScript: Essential Tools for Web DevelopersHTML, CSS, and JavaScript: Essential Tools for Web DevelopersApr 09, 2025 am 12:12 AM

HTML, CSS and JavaScript are the three pillars of web development. 1. HTML defines the web page structure and uses tags such as, etc. 2. CSS controls the web page style, using selectors and attributes such as color, font-size, etc. 3. JavaScript realizes dynamic effects and interaction, through event monitoring and DOM operations.

The Roles of HTML, CSS, and JavaScript: Core ResponsibilitiesThe Roles of HTML, CSS, and JavaScript: Core ResponsibilitiesApr 08, 2025 pm 07:05 PM

HTML defines the web structure, CSS is responsible for style and layout, and JavaScript gives dynamic interaction. The three perform their duties in web development and jointly build a colorful website.

Is HTML easy to learn for beginners?Is HTML easy to learn for beginners?Apr 07, 2025 am 12:11 AM

HTML is suitable for beginners because it is simple and easy to learn and can quickly see results. 1) The learning curve of HTML is smooth and easy to get started. 2) Just master the basic tags to start creating web pages. 3) High flexibility and can be used in combination with CSS and JavaScript. 4) Rich learning resources and modern tools support the learning process.

What is an example of a starting tag in HTML?What is an example of a starting tag in HTML?Apr 06, 2025 am 12:04 AM

AnexampleofastartingtaginHTMLis,whichbeginsaparagraph.StartingtagsareessentialinHTMLastheyinitiateelements,definetheirtypes,andarecrucialforstructuringwebpagesandconstructingtheDOM.

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Best Graphic Settings
3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. How to Fix Audio if You Can't Hear Anyone
3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
WWE 2K25: How To Unlock Everything In MyRise
4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

DVWA

DVWA

Damn Vulnerable Web App (DVWA) is a PHP/MySQL web application that is very vulnerable. Its main goals are to be an aid for security professionals to test their skills and tools in a legal environment, to help web developers better understand the process of securing web applications, and to help teachers/students teach/learn in a classroom environment Web application security. The goal of DVWA is to practice some of the most common web vulnerabilities through a simple and straightforward interface, with varying degrees of difficulty. Please note that this software

VSCode Windows 64-bit Download

VSCode Windows 64-bit Download

A free and powerful IDE editor launched by Microsoft

MinGW - Minimalist GNU for Windows

MinGW - Minimalist GNU for Windows

This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.

ZendStudio 13.5.1 Mac

ZendStudio 13.5.1 Mac

Powerful PHP integrated development environment

WebStorm Mac version

WebStorm Mac version

Useful JavaScript development tools