
[PYTHON Tutorial] Extract article abstracts

黄舟
2017-02-07 16:11:18

In a blog system's article list, each article's title is usually shown together with an abstract, so that the content is presented more effectively and readers can decide what to read.

Article content may be plain text, but on the web it is more often HTML. In either format, the abstract is generally taken from the beginning of the article and truncated to a specified number of characters.

Plain text summary

First we extract a plain-text summary. A plain-text document is just one long string, so its summary is a simple slice (the code below targets Python 2, where `unicode` is the text type):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Get a summary of the TEXT-format document"""

def get_summary(text, count):
    u"""Get the first `count` characters from `text`

    >>> text = u'Welcome 这是一篇关于Python的文章'
    >>> get_summary(text, 12) == u'Welcome 这是一篇'
    True
    """
    assert(isinstance(text, unicode))
    return text[0:count]

if __name__ == '__main__':
    import doctest
    doctest.testmod()
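
For readers on Python 3, where `str` is already Unicode and the `unicode` name no longer exists, the same slice can be sketched as follows. This is an illustrative variant, not the original function: the `suffix` parameter and the explicit exceptions are assumptions added here, and the suffix is appended only when the text was actually shortened.

```python
def get_summary(text: str, count: int, suffix: str = "") -> str:
    """Return the first `count` characters of `text`.

    `suffix` is a hypothetical addition (not in the original function);
    it is appended only when the text was actually truncated.
    """
    if not isinstance(text, str):
        raise TypeError("text must be str")
    if count < 0:
        raise ValueError("count must be non-negative")
    if len(text) <= count:
        # Short texts are returned unchanged, without a suffix.
        return text
    return text[:count] + suffix


print(get_summary("Welcome 这是一篇关于Python的文章", 12, "..."))
```

Slicing counts code points, so each Chinese character counts as one, just as in the Python 2 version operating on `unicode`.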

HTML summary

HTML documents contain a large number of tags. Tags are markup instructions and usually appear in pairs. Naive text truncation can break the HTML document structure, for example by cutting a tag in half or dropping a closing tag, so the summary renders incorrectly in the browser.
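
A small illustration of the problem, written for Python 3. The sample markup is made up for this example; the point is only that slicing by character count knows nothing about tags.

```python
html = "<p>Hello <b>world</b></p>"

# Cutting by character count ignores the markup entirely.
cut_mid_tag = html[:11]    # stops inside the opening <b> tag
cut_unclosed = html[:15]   # stops inside the text, leaving <p> and <b> open

print(cut_mid_tag)
print(cut_unclosed)
```

In the first case the fragment ends with an incomplete tag; in the second, both `<p>` and `<b>` are never closed, so any markup that follows the summary on the page may inherit their formatting.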

To truncate the content while respecting the structure of the HTML document, you need to parse the document. In Python 2 this can be done with the standard library module HTMLParser (renamed html.parser in Python 3).

One of the simplest approaches is to ignore the HTML tags altogether and extract only the text inside them. The following Python implementation does that:
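
As a rough sketch of how the parser works, the Python 3 example below subclasses `HTMLParser` from `html.parser` and records each callback it receives. The `EventLogger` class and its `events` list are hypothetical names used only for illustration; `handle_starttag`, `handle_endtag`, and `handle_data` are the standard hook methods.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every parser callback (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for each opening tag, e.g. <p>
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called for each closing tag, e.g. </p>
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Called for the text between tags
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<p>Hi <b>guys</b></p>")
logger.close()
for event in logger.events:
    print(event)
```

Because the text arrives through `handle_data` separately from the tags, a summary extractor can count and collect characters without ever touching the markup.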

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Get a raw summary of the HTML-format document"""

from HTMLParser import HTMLParser

class SummaryHTMLParser(HTMLParser):
    """Parse HTML text to get a summary

    >>> text = u'<p>Hi guys:</p><p>This is a example using SummaryHTMLParser.</p>'
    >>> parser = SummaryHTMLParser(10)
    >>> parser.feed(text)
    >>> parser.get_summary(u'...')
    u'<p>Higuys:Thi...</p>'
    """
    def __init__(self, count):
        HTMLParser.__init__(self)
        self.count = count
        self.summary = u''

    def feed(self, data):
        """Only accept unicode `data`"""
        assert(isinstance(data, unicode))
        HTMLParser.feed(self, data)

    def handle_data(self, data):
        more = self.count - len(self.summary)
        if more > 0:
            # Remove possible whitespaces in `data`
            data_without_whitespace = u''.join(data.split())

            self.summary += data_without_whitespace[0:more]

    def get_summary(self, suffix=u'', wrapper=u'p'):
        return u'<{0}>{1}{2}</{0}>'.format(wrapper, self.summary, suffix)

if __name__ == '__main__':
    import doctest
    doctest.testmod()
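
A possible Python 3 port of the same idea is sketched below, assuming the `html.parser` module in place of `HTMLParser` and `str` in place of `unicode`. One detail differs from the original and is an addition of this sketch: because the Python 3 parser decodes character references such as `&lt;` by default, the collected text is passed through `html.escape` before being wrapped, so that it cannot reintroduce markup.

```python
import html
from html.parser import HTMLParser


class SummaryHTMLParser(HTMLParser):
    """Collect up to `count` non-whitespace characters of text, ignoring tags."""

    def __init__(self, count):
        super().__init__()
        self.count = count
        self.summary = ""

    def handle_data(self, data):
        more = self.count - len(self.summary)
        if more > 0:
            # Remove all whitespace in `data`, as the original version does.
            self.summary += "".join(data.split())[:more]

    def get_summary(self, suffix="", wrapper="p"):
        # Escaping is an addition of this sketch, not part of the original.
        text = html.escape(self.summary)
        return "<{0}>{1}{2}</{0}>".format(wrapper, text, suffix)


parser = SummaryHTMLParser(10)
parser.feed("<p>Hi guys:</p><p>This is a example using SummaryHTMLParser.</p>")
parser.close()
print(parser.get_summary("..."))  # <p>Higuys:Thi...</p>
```

Stripping all whitespace suits Chinese text, where spaces carry no meaning, but it also joins English words together, as the output shows. For English content, collapsing runs of whitespace to a single space may be a better fit.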
