[python tutorial] Web page text and content image extraction algorithm
Regular expressions are usually used to extract content when crawling a single website. However, page structures differ so much from one website to another that no single regular expression can match them all. The author of "General Web Page Text Extraction Algorithm Based on Line Block Distribution Function" summarized the general methods for extracting article text from web pages, proposed a text extraction algorithm based on line block distribution, and provided implementations in PHP, Java, and other languages. The algorithm rests on two observations: (1) Text area density: after all HTML tags are removed, the text area has a higher character density and fewer runs of consecutive blank lines; (2) Line block length: content in non-text areas is, on average, shorter within each tag (line block). The algorithm steps are as follows:
1. Remove all tags, including styles, JavaScript content, etc., but retain the original line breaks (\n).
2. Split the web page content by lines, define the line block $block_i$ as the combined text of lines $[i, i + blockSize]$, and give the distribution function of line block length as a function of the line number:
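As a rough illustration of steps 1 and 2, here is a minimal sketch using only the standard library. The names `strip_tags`, `block_lengths`, and `block_size` are hypothetical and do not come from the original implementations. Three assumptions are worth stating: tags are stripped with simple regular expressions rather than a full HTML parser, the length of a line is taken as its count of non-whitespace characters, and the window $[i, i + blockSize]$ is read as a closed interval (so it covers `block_size + 1` lines).

```python
import re


def strip_tags(html: str) -> str:
    """Step 1 (sketch): drop markup but keep line breaks in the remaining text."""
    flags = re.IGNORECASE | re.DOTALL
    # Remove HTML comments.
    html = re.sub(r"<!--.*?-->", "", html, flags=flags)
    # Remove script and style elements together with their content.
    html = re.sub(r"<(script|style)\b.*?</\1\s*>", "", html, flags=flags)
    # Remove every remaining tag; the text between tags is left untouched.
    return re.sub(r"<[^>]*>", "", html)


def block_lengths(text: str, block_size: int = 3) -> list[int]:
    """Step 2 (sketch): length of line block i for each starting line i."""
    # Assumption: a line's length is its number of non-whitespace characters.
    lengths = [sum(1 for ch in line if not ch.isspace())
               for line in text.split("\n")]
    # Assumption: block i spans lines i .. i + block_size inclusive.
    return [sum(lengths[i:i + block_size + 1])
            for i in range(len(lengths) - block_size)]
```

Calling `block_lengths(strip_tags(page_source))` gives one value per starting line, which is the distribution the steps describe. If the window is meant to be half-open, only the slice bound and the `range` limit need to change.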