Python text content extraction
When you open a web page, the main text of the article is usually surrounded by navigation, advertisements and other information. This post explains how to extract the main text of an article from a web page and filter out the irrelevant parts.
This method is based on text density. The original idea comes from the "General Web Page Text Extraction Algorithm Based on Line Block Distribution Function" of Harbin Institute of Technology. This article makes some minor modifications based on this.
Convention:
This article computes its statistics line by line over the web page. It therefore assumes that the page source has not been compressed, that is, the page still has normal line breaks.
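Whether a page meets this assumption can be checked roughly before running the extraction. The sketch below is only an illustration: the function name `has_normal_line_breaks` and the threshold of 1000 characters per line are assumptions of this example, not part of the original method.

```python
def has_normal_line_breaks(html, max_avg_line_length=1000):
    """Heuristic check that the HTML is not minified onto a few long lines.

    The function name and the default threshold are hypothetical,
    chosen for illustration only.
    """
    lines = [line for line in html.split('\n') if line.strip()]
    if not lines:
        return False
    average = sum(len(line) for line in lines) / len(lines)
    return average <= max_avg_line_length


# A page with one element per line, and the same kind of markup on one line.
page = "<html>\n<body>\n<p>First paragraph.</p>\n<p>Second paragraph.</p>\n</body>\n</html>"
minified = "<html><body>" + "<p>text</p>" * 500 + "</body></html>"
print(has_normal_line_breaks(page))
print(has_normal_line_breaks(minified))
```

A page that fails such a check would need line breaks restored before the line-based statistics mean anything.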
On some news pages the text of the article is relatively short, but a video is embedded in it. I therefore give video a higher weight; the same applies to pictures. There is a shortcoming here: the weight of a picture should depend on its displayed size, but the method in this article does not do that. Because advertisements, navigation and other content that is not part of the article usually appear as hyperlinks, the text inside hyperlinks is given a weight of zero.
It is assumed here that the main text is contiguous and is not interrupted by content that does not belong to it. Extracting the main text therefore comes down to finding where it begins and where it ends.
Steps:
1. Remove the CSS, JavaScript, comments, META tags and INS tags from the web page, then remove the blank lines.
2. Calculate a value for each line of the processed page.
3. Find the start and end positions of the maximum-sum contiguous subsequence of the per-line values obtained above.
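The third step is the classic maximum-subarray problem: given one value per line, find the contiguous run of lines whose values add up to the largest total. A minimal sketch using Kadane's algorithm follows. The function name `max_sum_run` and the sample values are made up for illustration; the complete code in this article uses its own `sum_max` function for the same purpose.

```python
def max_sum_run(values):
    """Return (left, right) so that values[left:right] has the largest sum."""
    best_sum = float('-inf')
    best_left, best_right = 0, 0
    cur_sum, cur_left = 0, 0
    for index, value in enumerate(values):
        if cur_sum <= 0:
            # A non-positive running sum cannot help: start a new run here.
            cur_sum, cur_left = value, index
        else:
            cur_sum += value
        if cur_sum > best_sum:
            best_sum = cur_sum
            best_left, best_right = cur_left, index + 1
    return best_left, best_right


# Hypothetical per-line values: navigation and footer lines are negative,
# article lines are positive.
line_values = [-8, -8, -3, 120, 95, -8, 210, -8, -8]
left, right = max_sum_run(line_values)
print(left, right)                   # 3 7
print(sum(line_values[left:right]))  # 417
```

Note that the single weak line inside the article (the `-8` at index 5) is absorbed into the run, because the lines around it are strong enough to outweigh it.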
The second step needs to be explained:
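For each line, we need to calculate a value. The value is calculated as follows:
x1: the number of picture tags IMG in the line; each one is equivalent to text with a length of 50 characters (the weight given).
x2: the number of video tags EMBED in the line; each one is equivalent to text with a length of 1000 characters.
x3: the length of the text inside the hyperlink tags A in the line; it is given zero weight.
x4: the length of the remaining text in the line.
Value of each line = 50 * x1 + 1000 * x2 + x4 - 8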
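A note on the -8: because we are looking for a maximum-sum subsequence, lines with little or no text must end up with negative values, so a positive number is subtracted from every line. How large this number should be is a matter of experience; 8 is an empirical choice.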
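As an illustration of the per-line value, the sketch below scores a few sample lines. It is a simplified stand-in for the complete code, not a copy of it: the name `line_value`, the constant names and the sample lines are assumptions of this example, and the regular expressions are deliberately naive.

```python
import re

IMG_WEIGHT = 50      # one <img> counts as 50 characters of text
EMBED_WEIGHT = 1000  # one <embed> counts as 1000 characters of text
OFFSET = 8           # empirical constant subtracted from every line


def line_value(line):
    """Value of one HTML line: 50 * x1 + 1000 * x2 + x4 - 8."""
    x1 = len(re.findall(r'<img\b[^>]*>', line, re.I))
    x2 = len(re.findall(r'<embed\b[^>]*>', line, re.I))
    # Drop links together with their text, so link text gets zero weight.
    without_links = re.sub(r'<a\b[^>]*>.*?</a>', '', line, flags=re.I | re.S)
    # x4: length of the remaining text once all other tags are stripped.
    x4 = len(re.sub(r'<[^>]+>', '', without_links).strip())
    return IMG_WEIGHT * x1 + EMBED_WEIGHT * x2 + x4 - OFFSET


samples = [
    '<li><a href="/news">News</a> <a href="/sports">Sports</a></li>',
    '<p>Short.</p>',
    '<p>This sentence is long enough to count as body text.</p>',
    '<p><img src="photo.jpg"></p>',
]
for sample in samples:
    print(line_value(sample), sample)
```

The navigation line and the very short line come out negative, while the body sentence and the picture line come out positive, which is what lets the maximum-sum search separate the article from its surroundings.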
Complete code
# coding: utf-8
import re


def remove_js_css(content):
    """Remove <script>, <style> and <ins> elements, HTML comments and <meta> tags."""
    patterns = (
        r'<script.*?</script>',
        r'<style.*?</style>',
        r'<!--.*?-->',
        r'<meta.*?>',
        r'<ins.*?</ins>',
    )
    for pattern in patterns:
        content = re.sub(pattern, '', content, flags=re.I | re.S)
    return content


def remove_empty_line(content):
    """Remove whitespace-only lines and collapse consecutive newlines."""
    s = re.sub(r'^\s+$', '', content, flags=re.M | re.S)
    return re.sub(r'\n+', '\n', s)


def remove_any_tag(s):
    """Strip every tag and return the remaining text."""
    return re.sub(r'<[^>]+>', '', s).strip()


def remove_any_tag_but_a(s):
    """Return (length of link text, length of all text) for an HTML fragment."""
    # \b keeps tags such as <abbr> or <address> from being treated as links.
    links = re.findall(r'<a\b[^>]*>(.*?)</a>', s, re.I | re.S)
    # Strip tags nested inside links so both lengths count plain text only.
    text_a = remove_any_tag(''.join(links))
    text_b = remove_any_tag(s)
    return len(text_a), len(text_b)


def remove_image(s, n=50):
    """Replace each <img> tag with n placeholder characters (its weight)."""
    return re.sub(r'<img.*?>', 'a' * n, s, flags=re.I | re.S)


def remove_video(s, n=1000):
    """Replace each <embed> tag with n placeholder characters (its weight)."""
    return re.sub(r'<embed.*?>', 'a' * n, s, flags=re.I | re.S)


def sum_max(values):
    """Return (left, right) so that values[left:right] has the maximum sum."""
    best_sum = float('-inf')
    best_left, best_right = 0, 0
    cur_sum, cur_left = 0, 0
    for index, value in enumerate(values):
        if cur_sum <= 0:
            # A non-positive running sum cannot help: start a new run here.
            cur_sum, cur_left = value, index
        else:
            cur_sum += value
        if cur_sum > best_sum:
            best_sum = cur_sum
            best_left, best_right = cur_left, index + 1
    return best_left, best_right


def method_1(content, k=1):
    """Score the page in groups of k lines and locate the main text.

    Returns (left, right, start_offset, end_offset), where left and right
    are line indices and the offsets are character positions in content.
    """
    if not content:
        return None, None, None, None
    tmp = content.split('\n')
    group_value = []
    for i in range(0, len(tmp), k):
        group = '\n'.join(tmp[i:i + k])
        group = remove_image(group)
        group = remove_video(group)
        text_a, text_b = remove_any_tag_but_a(group)
        # Length of the non-link text minus the empirical offset of 8.
        group_value.append((text_b - text_a) - 8)
    left, right = sum_max(group_value)
    # Convert group indices back to line indices (a no-op when k == 1).
    left, right = left * k, min(right * k, len(tmp))
    return left, right, len('\n'.join(tmp[:left])), len('\n'.join(tmp[:right]))


def extract(content):
    """Return the lines of the page that make up the main text."""
    content = remove_empty_line(remove_js_css(content))
    left, right, x, y = method_1(content)
    return '\n'.join(content.split('\n')[left:right])
The entry point is the last function, extract; it calls the other functions in turn.