


使用Python中的正则表达式处理html文件
finditer方法是一种全匹配方法。已经使用过findall方法的话,该方法将返回由多个匹配字符串组成的列表。对于多个匹配项,finditer会按顺序返回一个迭代器,每个迭代生成一个匹配对象。这些匹配对象可通过for循环访问,在下面的代码中,因此组1可以被打印。
您需要撰写 Python 正则表达式,以便在 HTML 文本文件中识别特定的模式。将代码添加到STARTER脚本为这些模式编译RE(将它们分配给有意义的变量名称),并将这些RE应用于文件的每一行,打印出找到的匹配项。
1.编写识别HTML标签的模式,然后将其打印为“TAG:TAG string”(例如“TAG:b”代表标签)。为了简单起见,假设左括号和右括号每个标记的()将始终出现在同一行文本中。第一次尝试可能使regex“<.>”其中“.”是与任何字符匹配的预定义字符类符号。尝试找出这一点,找出为什么这不是一个好的解决方案。编写一个更好的解决方案,解决这个问题
2.修改代码,使其区分开头和结尾标记(例如p与/p)打印OPENTAG和CLOSETAG
import sys, re #------------------------------ testRE = re.compile('(logic|sicstus)', re.I) testI = re.compile('<[A-Za-z]>', re.I) testO = re.compile('<[^/](\S*?)[^>]*>') testC = re.compile('</(\S*?)[^>]*>') with open('RGX_DATA.html') as infs: linenum = 0 for line in infs: linenum += 1 if line.strip() == '': continue print(' ', '-' * 100, '[%d]' % linenum, '\n TEXT:', line, end='') m = testRE.search(line) if m: print('** TEST-RE:', m.group(1)) mm = testRE.finditer(line) for m in mm: print('** TEST-RE:', m.group(1)) index= testI.finditer(line) for i in index: print('Tag:',i.group().replace('<', '').replace('>', '')) open1= testO.finditer(line) for m in open1: print('opening:',m.group().replace('<', '').replace('>', '')) close1= testC.finditer(line) for n in close1: print('closing:',n.group().replace('<', '').replace('>', ''))
请注意,有些HTML标签有参数,例如:
<table border=1 cellspacing=0 cellpadding=8>
成功查找到并打印标记标签,确保启用带参数和不带参数的标记模式。现在扩展您的代码,以便打印两个打开的标签标签和参数,例如:
OPENTAG: table
PARAM: border=1
PARAM: cellspacing=0
PARAM: cellpadding=8
open1= testO.finditer(line) for m in open1: #print('opening:',m.group().replace('<', '').replace('>', '')) firstm= m.group().replace('<', '').replace('>', '').split() num = 0 for otherm in firstm: if num == 0: print('opening:',otherm) else: print('pram:',otherm) num+= 1
在正则表达式中,可以使用反向引用来指示匹配早期部分的子字符串,应再次出现正则表达式的。格式为\N(其中N为正整数),并返回到第N个匹配的文本正则表达式组。例如,正则表达式,如:r" (\w+) \1 仅当与组(\w+)完全匹配的字符串再次出现时才匹配 backref\1出现的位置。这可能与字符串“踢”匹配.例如,“the”出现两次。使用反向引用编写一个模式,当一行包含成对的open和关闭标签,例如在粗体中.
考虑到我们可能想要创建一个执行HTML剥离的脚本,即一个HTML文件,并返回一个纯文本文件,所有HTML标记都已从中删除出来这里我们不打算这样做,而是考虑一个更简单的例子,即删除我们在输入数据文件的任何行中找到的HTML标记。
如果您已经定义了一条RE来识别HTML标签,您应该可以将生成的文本输出为STRIPPED,并将其打印在屏幕上。。
import sys, re #------------------------------ # PART 1: # Key thing is to avoid matching strings that include # multiple tags, e.g. treating '<p><b>' as a single # tag. Can do this in several ways. Firstly, use # non-greedy matching, so get shortest possible match # including the two angle brackets: tag = re.compile('</?(.*?)>') # The above treats the '/' of a close tag as a separate # optional component - so that this doesn't turn up as # part of the match '.group(1)', which is meant to return # the tag label. # Following alternative solution uses a negated character # class to explicitly prevent this including '>': tag = re.compile('</?([^>]+)>') # Finally, following version separates finding the tag # label string from any (optional) parameters that might # also appear before the close angle bracket: tag = re.compile(r'</?(\w+\b)([^>]+)?>') # Note that use of '\b' (as word boundary anchor) here means # we must mark the regex string as a 'raw' string (r'..'). #------------------------------ # PART 2: # Following closeTag definition requires first first char # after the open angle bracket to be '/', while openTag # definition excludes this by requiring first char to be # a 'word char' (\w): openTag = re.compile(r'<(\w[^>]*)>') closeTag = re.compile(r'</([^>]*)>') # Following revised definitions are more carefully stated # for correct extraction of tag label (separately from # any parameters: openTag = re.compile(r'<(\w+\b)([^>]+)?>') closeTag = re.compile(r'</(\w+\b)\s*>') #------------------------------ # PART 3: # Above openTag definition will already get the string # encompassing any parameters, and return it as # m.group(2), i.e. defn: openTag = re.compile(r'<(\w+\b)([^>]+)?>') # If assume that parameters are continuous non-whitespace # chars separated by whitespace chars, then we can divide # them up using split - and that's how we handle them # here. (In reality, parameter strings can be a lot more # messy than this, but we won't try to deal with that.) #------------------------------ # PART 4: openCloseTagPair = re.compile(r'<(\w+\b)([^>]+)?>(.*?)</\1\s*>') # Note use of non-greedy matching for the text falling # *between* the open/close tag pair - to avoid false # results where have two similar tag pairs on same line. #------------------------------ # PART 5: URLS # This is quite tricky. The URL expressions in the file # are of two kinds, of which the first is a string # between double quotes ("..") which may include # whitespace. For this case we might have a regex: url = re.compile('href=("[^">]+")', re.I) # The second case does not have quotes, and does not # allow whitespace, consisting of a continuous sequence # of non-whitespace material (that ends when you reach a # space or close bracket '>'). This might be: url = re.compile('href=([^">\s]+)', re.I) # We can combine these two cases as follows, and still # get the expression back as group(1): url = re.compile(r'href=("[^">]+"|[^">\s]+)', re.I) # Note that I've done nothing here to exclude 'mailto:' # links as being accepted as URLS. #------------------------------ with open('RGX_DATA.html') as infs: linenum = 0 for line in infs: linenum += 1 if line.strip() == '': continue print(' ', '-' * 100, '[%d]' % linenum, '\n TEXT:', line, end='') # PART 1: find HTML tags # (The following uses 'finditer' to find ALL matches # within the line) mm = tag.finditer(line) for m in mm: print('** TAG:', m.group(1), ' + [%s]' % m.group(2)) # PART 2,3: find open/close tags (+ params of open tags) mm = openTag.finditer(line) for m in mm: print('** OPENTAG:', m.group(1)) if m.group(2): for param in m.group(2).split(): print(' PARAM:', param) mm = closeTag.finditer(line) for m in mm: print('** CLOSETAG:', m.group(1)) # PART 4: find open/close tag pairs appearing on same line mm = openCloseTagPair.finditer(line) for m in mm: print("** PAIR [%s]: \"%s\"" % (m.group(1), m.group(3))) # PART 5: find URLs: mm = url.finditer(line) for m in mm: print('** URL:', m.group(1)) # PART 6: Strip out HTML tags (note that .sub will do all # possible substitutions, unless number is limited by count # keyword arg - which is fortunately what we want here) stripped = tag.sub('', line) print('** STRIPPED:', stripped, end = '')
The above is the detailed content of How to use regular expressions in Python to process html files. For more information, please follow other related articles on the PHP Chinese website!

本篇文章带大家了解一下HTML(超文本标记语言),介绍一下HTML的本质,HTML文档的结构、HTML文档的基本标签和图像标签、列表、表格标签、媒体元素、表单,希望对大家有所帮助!

不算。html是一种用来告知浏览器如何组织页面的标记语言,而CSS是一种用来表现HTML或XML等文件样式的样式设计语言;html和css不具备很强的逻辑性和流程控制功能,缺乏灵活性,且html和css不能按照人类的设计对一件工作进行重复的循环,直至得到让人类满意的答案。

总结了一些web前端面试(笔试)题分享给大家,本篇文章就先给大家分享HTML部分的笔试题(附答案),大家可以自己做做,看看能答对几个!

HTML5中画布标签是“<canvas>”。canvas标签用于图形的绘制,它只是一个矩形的图形容器,绘制图形必须通过脚本(通常是JavaScript)来完成;开发者可利用多种js方法来在canvas中绘制路径、盒、圆、字符以及添加图像等。

html5废弃了dir列表标签。dir标签被用来定义目录列表,一般和li标签配合使用,在dir标签对中通过li标签来设置列表项,语法“<dir><li>列表项值</li>...</dir>”。HTML5已经不支持dir,可使用ul标签取代。

在html中,document是文档对象的意思,代表浏览器窗口的文档;document对象是window对象的子对象,所以可通过“window.document”属性对其进行访问,每个载入浏览器的HTML文档都会成为Document对象。

html5支持boolean值属性;boolean值属性指是属性值为true或者false的属性,如input元素中的disabled属性,不使用该属性表示值为flase,不禁用元素,使用该属性可以不设置属性值表示值为true,禁用元素。


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

SAP NetWeaver Server Adapter for Eclipse
Integrate Eclipse with SAP NetWeaver application server.

MinGW - Minimalist GNU for Windows
This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.

VSCode Windows 64-bit Download
A free and powerful IDE editor launched by Microsoft

MantisBT
Mantis is an easy-to-deploy web-based defect tracking tool designed to aid in product defect tracking. It requires PHP, MySQL and a web server. Check out our demo and hosting services.

mPDF
mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),
