
Python - The title of the web page contains a newline. How to extract it using regular expressions?

While writing a CSDN web crawler in Python, I have always extracted page titles with the regular expression (?<=\<title\>).+?(?=\<), but it does not work on CSDN. Looking at the CSDN page source, the text inside the <title> element is broken across several lines.

As a result, the original regular expression no longer matches. So the question is: when a page's title contains newlines like this, how can it be extracted with a regular expression?

PS:

  1. I don't want to use XPath or BeautifulSoup; I only want a regular-expression solution.

  2. CSDN does have an anti-crawler mechanism, but that is not the reason I failed to extract the title.

Thank you all.

Update: following @caimaoy's method, I changed the regular expression to (?<=\<title\>)(?:.|\n)+?(?=\<). After this change, the title is extracted perfectly. Thank you all again.
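
A minimal sketch of how a pattern of this kind behaves. The HTML string is a made-up stand-in for the CSDN page source, and the lazy quantifier `+?` is an assumption, since the quantifier is not legible in the post above. The group `(?:.|\n)` matches any character including a newline, so no flag is needed.

```python
import re

# Hypothetical stand-in for the page source: the title text sits on its own line.
html = "<html><head><title>\n    Example blog post - CSDN\n</title></head></html>"

# (?:.|\n) matches any character, newline included; +? keeps the match lazy
# so it stops at the first "<" after the title text.
pattern = r"(?<=\<title\>)(?:.|\n)+?(?=\<)"

match = re.search(pattern, html)
# Strip the surrounding newlines and indentation that were part of the match.
title = match.group(0).strip() if match else None
print(title)
```

The match itself still contains the leading and trailing newlines, which is why the sketch calls `strip()` on the result.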

女神的闺蜜爱上我  2649 days ago  923

All replies (2)

  • 仅有的幸福  2017-06-22 11:53:43

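    1. Use a regex flag. Note that re.M (multi-line mode) only changes how ^ and $ match; the flag that lets . match newlines is re.S (re.DOTALL)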

    2. Write the multi-line matching yourself in the pattern: http://python3-cookbook.readt...
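
The difference between the two flags can be checked with a short experiment. This is only a sketch: the HTML fragment is invented, and the pattern assumes the `.+?` form of the regular expression from the question.

```python
import re

# Hypothetical page fragment with the title text on its own line.
html = "<title>\nSome article\n</title>"
pattern = r"(?<=\<title\>).+?(?=\<)"

no_flag = re.findall(pattern, html)          # . stops at "\n", so nothing matches
multiline = re.findall(pattern, html, re.M)  # re.M only changes ^ and $
dotall = re.findall(pattern, html, re.S)     # re.S lets . match "\n" as well

print(no_flag, multiline, dotall)
```

Only the re.S call returns a match here, which suggests re.M on its own would not solve the problem described in the question.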

  • 曾经蜡笔没有小新  2017-06-22 11:53:43

    Add the re.S flag to the expression, so that . also matches newlines:

    import re
    title = '......'  # HTML source of the page
    # re.S (re.DOTALL) makes . match newline characters too
    print(re.findall(r'(?<=\<title\>).+?(?=\<)', title, re.S))
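
If the extracted title should come back as a single clean line, the flag approach can be wrapped in a small helper. This is a sketch under assumptions, not the poster's code: the name `extract_title` is hypothetical, a capture group replaces the lookarounds, and re.I is added only to tolerate an upper-case `<TITLE>` tag.

```python
import re

# Capture group instead of lookarounds; re.S lets . span newlines.
# re.I and the [^>]* for tag attributes are assumptions for this sketch.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.S | re.I)

def extract_title(html):
    """Return the title with runs of whitespace collapsed, or None if absent."""
    m = TITLE_RE.search(html)
    if m is None:
        return None
    # split() with no arguments splits on any whitespace, newlines included.
    return " ".join(m.group(1).split())

print(extract_title("<head><title>\n  Python regex\n  notes\n</title></head>"))
```

Collapsing whitespace with `split()` and `join()` removes the newlines and indentation that the raw match would otherwise carry.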
