(.*?)
.*?(.*?)
.*?We usually need to parse web crawled pages to get the data we need. By analyzing the combined structure of HTML tags, we can extract useful information contained in web pages. In Python, there are three common ways to parse HTML: regular expression parsing, XPath parsing, and CSS selector parsing.
Understanding the basic structure of HTML page is a prerequisite before explaining the HTML parsing method. When we open a website in a browser and select the "Show web page source code" menu item through the right-click menu of the mouse, we can see the HTML code corresponding to the web page. HTML code usually consists of tags, attributes, and text. The label carries the content displayed on the page, the attributes supplement the label information, and the text is the content displayed by the label. The following is a simple HTML page code structure example:
<!DOCTYPE html> <html> <head> <!-- head 标签中的内容不会在浏览器窗口中显示 --> <title>这是页面标题</title> </head> <body> <!-- body 标签中的内容会在浏览器窗口中显示 --> <h2 id="这是一级标题">这是一级标题</h2> <p>这是一段文本</p> </body> </html>
In this HTML page code example, is the document type declaration, <code> The
tag is the root tag of the entire page, and
are sub-tags of the
tag, placed in The content under the
tag will be displayed in the browser window. This part of the content is the main body of the web page; the content under the
tag will not be displayed in the browser window. It is displayed in the browser window, but it contains important meta-information of the page, usually called the header of the web page. The general code structure of an HTML page is as follows:
<!DOCTYPE html> <html> <head> <!-- 页面的元信息,如字符编码、标题、关键字、媒体查询等 --> </head> <body> <!-- 页面的主体,显示在浏览器窗口中的内容 --> </body> </html>
tags, cascading style sheets (CSS) and JavaScript are the three basic components that make up an HTML page. Tags are used to carry the content to be displayed on the page, CSS is responsible for rendering the page, and JavaScript is used to control the interactive behavior of the page. To parse HTML pages, you can use XPath syntax, which is originally a query syntax for XML. It can extract content or tag attributes in tags based on the hierarchical structure of HTML tags. In addition, you can also use CSS selectors to locate pages. Elements are the same as rendering page elements using CSS.
XPath is a syntax for finding information in XML (eXtensible Markup Language) documents. XML is similar to HTML and is a tag language that uses tags to carry data. The difference The reason is that XML tags are extensible and customizable, and XML has stricter syntax requirements. XPath uses path expressions to select nodes or node sets in XML documents. The nodes mentioned here include elements, attributes, text, namespaces, processing instructions, comments, root nodes, etc.
XPath path expression is similar to file path syntax, you can use "/" and "//" to select nodes. When selecting the root node, you can use a single slash "/"; when selecting a node at any position, you can use a double slash "//". For example, "/bookstore/book" means selecting all book sub-nodes under the root node bookstore, and "//title" means selecting the title node at any position.
XPath can also use predicates to filter nodes. Nested expressions within square brackets can be numbers, comparison operators, or function calls that serve as predicates. For example, "/bookstore/book[1]" means selecting the first child node book of bookstore, and "//book[@lang]" means selecting all book nodes with the lang attribute.
XPath functions include string, mathematical, logical, node, sequence and other functions. These functions can be used to select nodes, calculate values, convert data types and other operations. For example, the "string-length(string)" function can return the length of the string, and the "count(node-set)" function can return the number of nodes in the node set.
Below we use an example to illustrate how to use XPath to parse the page. Suppose we have the following XML file:
<?xml version="1.0" encoding="UTF-8"?> <bookstore> <book> <title lang="eng">Harry Potter</title> <price>29.99</price> </book> <book> <title lang="zh">Learning XML</title> <price>39.95</price> </book> </bookstore>
For this XML file, we can use the XPath syntax as shown below to get the nodes in the document.
Path expression | Result |
---|---|
Select the root element bookstore. Note: If the path starts with a forward slash ( / ), this path always represents an absolute path to an element! | |
Selects all book child elements regardless of their position in the document. | |
Select all attributes named lang. | |
Select the first child node book of bookstore. |
选择器 | 结果 |
---|---|
div.content | 选取 class 为 content 的 div 元素。 |
h2 | 选取所有的 h2 元素。 |
div.footer p | 选取 class 为 footer 的 div 元素下的所有 p 元素。 |
[href] | 选取所有具有 href 属性的元素。 |
用正则表达式可以解析 HTML 页面,从而实现文本的匹配、查找和替换。使用 re 模块可以进行 Python 的正则表达式解析。
下面我们通过一个例子来说明如何使用正则表达式对页面进行解析。假设我们有如下的 HTML 代码:
<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <title>这是页面标题</title> </head> <body> <div class="content"> <h2 id="这是一级标题">这是一级标题</h2> <p>这是一段文本</p> </div> <div class="footer"> <p>版权所有 © 2021</p> </div> </body> </html>
我们可以使用如下所示的正则表达式来选取页面元素。
import re html = ''' <!DOCTYPE html> <html> <head> <meta charset="utf-8"> <title>这是页面标题</title> </head> <body> <div class="content"> <h2 id="这是一级标题">这是一级标题</h2> <p>这是一段文本</p> </div> <div class="footer"> <p>版权所有 © 2021</p> </div> </body> </html> ''' pattern = re.compile(r'.*?', re.S) match = re.search(pattern, html) if match: title = match.group(1) text = match.group(2) print(title) print(text)(.*?)
.*?(.*?)
.*?
以上代码中,我们使用 re 模块的 compile 方法来编译正则表达式,然后使用 search 方法来匹配 HTML 代码。在正则表达式中,“.*?”表示非贪婪匹配,也就是匹配到第一个符合条件的标签就停止匹配,而“re.S”表示让“.”可以匹配包括换行符在内的任意字符。最后,我们使用 group 方法来获取匹配的结果。
The above is the detailed content of How to parse HTML pages with Python crawler. For more information, please follow other related articles on the PHP Chinese website!
本篇文章带大家了解一下HTML(超文本标记语言),介绍一下HTML的本质,HTML文档的结构、HTML文档的基本标签和图像标签、列表、表格标签、媒体元素、表单,希望对大家有所帮助!
不算。html是一种用来告知浏览器如何组织页面的标记语言,而CSS是一种用来表现HTML或XML等文件样式的样式设计语言;html和css不具备很强的逻辑性和流程控制功能,缺乏灵活性,且html和css不能按照人类的设计对一件工作进行重复的循环,直至得到让人类满意的答案。
总结了一些web前端面试(笔试)题分享给大家,本篇文章就先给大家分享HTML部分的笔试题(附答案),大家可以自己做做,看看能答对几个!
HTML5中画布标签是“<canvas>”。canvas标签用于图形的绘制,它只是一个矩形的图形容器,绘制图形必须通过脚本(通常是JavaScript)来完成;开发者可利用多种js方法来在canvas中绘制路径、盒、圆、字符以及添加图像等。
在html中,document是文档对象的意思,代表浏览器窗口的文档;document对象是window对象的子对象,所以可通过“window.document”属性对其进行访问,每个载入浏览器的HTML文档都会成为Document对象。
html5废弃了dir列表标签。dir标签被用来定义目录列表,一般和li标签配合使用,在dir标签对中通过li标签来设置列表项,语法“<dir><li>列表项值</li>...</dir>”。HTML5已经不支持dir,可使用ul标签取代。
3种取消方法:1、给td元素添加“border:none”无边框样式即可,语法“td{border:none}”。2、给td元素添加“border:0”样式,语法“td{border:0;}”,将td边框的宽度设置为0即可。3、给td元素添加“border:transparent”样式,语法“td{border:transparent;}”,将td边框的颜色设置为透明即可。
AI-powered app for creating realistic nude photos
Online AI tool for removing clothes from photos.
Undress images for free
AI clothes remover
Generate AI Hentai for free.
This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.
Damn Vulnerable Web App (DVWA) is a PHP/MySQL web application that is very vulnerable. Its main goals are to be an aid for security professionals to test their skills and tools in a legal environment, to help web developers better understand the process of securing web applications, and to help teachers/students teach/learn in a classroom environment Web application security. The goal of DVWA is to practice some of the most common web vulnerabilities through a simple and straightforward interface, with varying degrees of difficulty. Please note that this software
Easy-to-use and free code editor
God-level code editing software (SublimeText3)
The most popular open source editor