I've been struggling for a while now trying to get the correct regular expression for the following task:
I want to extract the data from the table tags in an HTML file using Python. My approach is to apply the following recursively (the HTML lines between tags are stored as strings):
s = "
s = re.sub('<{1}(is not '<' nor '>').*>{1}', '', s)
My question is how to implement the part in brackets. Thanks.
I tried:
import re
test_str = '<td style="color:blue">Hello</td>'
test_str = re.sub('<{1}^[<>].*>{1}','',test_str)
print(test_str)
You can see that my test string remains the same. What did I do wrong?
I expect the code above to strip the opening tag first. I would then feed the result back into the same method, which strips the remaining tag and gives me "Hello".
P粉348088995 2023-09-15 09:00:18
To negate a character class, place ^ immediately after the opening [. Additionally, you do not need to specify {1} for characters that occur once.
test_str = re.sub('<[^<>]*>', '', test_str)
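To see why the original attempt leaves the string untouched, here is a minimal sketch that runs both patterns on the test string from the question. The variable names `broken` and `fixed` are only illustrative. Outside of square brackets, ^ is a start-of-string anchor rather than a negation, so the original pattern can never match.

```python
import re

test_str = '<td style="color:blue">Hello</td>'

# Original attempt: outside brackets, ^ anchors to the start of the string.
# After '<' has been consumed we are no longer at the start, so nothing matches
# and re.sub returns the input unchanged.
broken = re.sub('<{1}^[<>].*>{1}', '', test_str)

# Corrected: [^<>]* matches any run of characters other than '<' and '>',
# so each tag is matched and removed separately.
fixed = re.sub('<[^<>]*>', '', test_str)

print(broken)  # <td style="color:blue">Hello</td>
print(fixed)   # Hello
```

Because re.sub replaces every non-overlapping match, both the opening and the closing tag are removed in a single call, so there is no need to feed the result back in.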
However, please note that it is more appropriate to use a dedicated HTML parser like BeautifulSoup instead of regular expressions to get data from HTML.
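As a rough illustration of the parser-based approach, the sketch below uses Python's standard-library html.parser module instead of BeautifulSoup, so that it runs without installing anything. `TextExtractor` and `extract_text` are hypothetical names chosen for this example. With BeautifulSoup installed, `BeautifulSoup(html, 'html.parser').get_text()` should give a similar result.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags around them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text found between tags.
        self.chunks.append(data)


def extract_text(html):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)


print(extract_text('<td style="color:blue">Hello</td>'))  # Hello
```

Unlike the regular expression, a parser also copes with attribute values that contain > characters.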