When writing a CSDN web crawler in Python, I have always extracted page titles with the regular expression (?<=\<title\>).*?(?=\<), but it does not work on CSDN. Looking at the CSDN page source, the title is broken across several lines, so the original regular expression no longer matches.
So the question is: when a page's title contains newlines like this, how can it be extracted with a regular expression?
PS:
I don't want to use XPath or BeautifulSoup; I only want a regular-expression solution.
CSDN does have an anti-crawler mechanism, but that is not why I couldn't extract the title.
Thank you all.
Following @caimaoy's method, I changed the regular expression to (?<=\<title\>)(?:.|\n)*?(?=\<). With that change, the title is extracted perfectly.
Thank you all again.
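For anyone who lands here with the same problem, here is a minimal sketch of the change described above. The sample_html string is made up for illustration and is not CSDN's actual markup; it only assumes that the title text sits on its own line inside the title tag.

```python
import re

# Hypothetical page source: the <title> text is wrapped onto separate lines.
sample_html = "<html><head><title>\n    Example post - CSDN\n</title></head></html>"

# '.' does not match '\n' by default, so the original pattern finds nothing.
single_line = re.findall(r"(?<=<title>).*?(?=<)", sample_html)

# (?:.|\n) matches any character, including a newline.
multi_line = re.findall(r"(?<=<title>)(?:.|\n)*?(?=<)", sample_html)

# Strip the surrounding whitespace and newlines from each captured title.
titles = [t.strip() for t in multi_line]
print(single_line, titles)
```

The strip() call is an addition of mine, not part of the original fix; without it the captured title keeps its leading and trailing newlines.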
仅有的幸福 2017-06-22 11:53:43
re.M: multi-line mode.
For writing multi-line matches yourself, see http://python3-cookbook.readt...
曾经蜡笔没有小新 2017-06-22 11:53:43
Add a flag to the expression:
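import re
title = '......'  # the page source goes here
# re.S makes '.' match newlines too, so the match can span lines
print(re.findall(r'(?<=\<title\>).+?(?=\<)', title, re.S))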
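Since both re.M and re.S came up in this thread, here is a small sketch comparing the two flags on the same pattern. The sample_html string is a made-up example, not real CSDN markup. As far as Python's re module goes, re.M only changes where ^ and $ match, while re.S is the flag that lets '.' cross a newline.

```python
import re

# Hypothetical page source with the title split across lines.
sample_html = "<title>\nExample post\n</title>"
pattern = r"(?<=<title>).+?(?=<)"

# re.M only changes where ^ and $ match; '.' still stops at '\n'.
with_m = re.findall(pattern, sample_html, re.M)

# re.S (DOTALL) lets '.' match '\n', so the match can span lines.
with_s = re.findall(pattern, sample_html, re.S)

print(with_m)
print([t.strip() for t in with_s])
```

So for a title that wraps onto new lines, re.S (or the (?:.|\n) workaround from the question) is the one that helps; re.M alone leaves the result empty.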