WBOY

Jun 24, 2016, 11:15 AM

The Detail of Extracting & Curating Articles


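Recently my work has once again involved extracting the main text of news articles from HTML pages. Many third-party libraries address this problem, but each of them falls short on some feature or does not work well for Chinese. This article therefore uses three packages, Newspaper, python-goose, and python-readability, to walk through some details of news extraction. Because I was reading the source code as I went, I also noted a few unrelated programming techniques, which may make the article look disorganized; if you only care about extracting article text, you can skip those parts.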

A brief look at the Newspaper source

Code structure

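```
├── api.py
├── article.py            all functionality is wrapped in here
├── cleaners.py           cleans the HTML page
├── configuration.py      configuration
├── extractors.py         core logic such as body-text extraction
├── images.py             image handling, which I ignore here
├── __init__.py
├── mthreading.py         multithreading module; the author's own thread pool, used for downloading HTML
├── network.py            wraps requests for downloading HTML
├── nlp.py                simple NLP features such as keyword extraction
├── outputformatters.py   output formatting
├── parsers.py            wraps lxml and provides convenience methods
├── resources             data files
├── settings.py
├── source.py
├── text.py               word handling, used when computing scores
├── urls.py               URL helper methods
├── utils.py
├── version.py
└── videos
```

There is quite a lot in the code. It supports many languages and, I forgot to mention, uses jieba for Chinese word segmentation. The sections below follow the example given in the documentation.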

download

This part uses requests, which the author wrapped with a multithreading layer. I pulled the thread pool out to experiment with it; the version below is a Python 3 implementation with a small timing test.

```
import queue
import time
import traceback
from threading import Thread

import requests


class Worker(Thread):
    """Runs tasks from a shared queue until it stays empty for timeout_seconds."""

    def __init__(self, tasks, timeout_seconds):
        super().__init__()
        print(self.name)
        self.tasks = tasks
        self.timeout = timeout_seconds
        self.daemon = True
        self.start()

    def run(self):
        while True:
            try:
                func, args, kargs = self.tasks.get(timeout=self.timeout)
                print(":".join((self.name, args[0])))
            except queue.Empty:
                # Extra thread allocated, no job, exit gracefully
                break
            try:
                func(*args, **kargs)
            except Exception:
                traceback.print_exc()
            self.tasks.task_done()


class ThreadPool:
    def __init__(self, num_threads, timeout_seconds):
        self.tasks = queue.Queue(num_threads)
        for _ in range(num_threads):
            Worker(self.tasks, timeout_seconds)

    def add_task(self, func, *args, **kargs):
        self.tasks.put((func, args, kargs))

    def wait_completion(self):
        self.tasks.join()


urls = [
    'http://www.baidu.com',
    'http://midday.me',
    'http://94fzb.com',
    'http://jd.com',
    'http://tianmao.com',
]


def task(url):
    return requests.get(url)


def test_thread_pool():
    pool = ThreadPool(2, 10)
    start = time.time()
    for url in urls:
        pool.add_task(task, url)
    pool.wait_completion()
    print("thread pool spent: ", time.time() - start)


def test_single_thread():
    start = time.time()
    for url in urls:
        task(url)
    print("single thread spent: ", time.time() - start)


if __name__ == '__main__':
    test_single_thread()
    test_thread_pool()
```

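Body-text extraction

Extracting the title and author is of little use to me, and the author extraction only supports English anyway. From what I saw it is rule-based; in fact, all of the parsing is rule-based. Some statistical methods are used, but at bottom it is still rules, so there are bound to be cases it does not handle. Body-text extraction consists of the following steps: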

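1. Clean out some unneeded tags.
2. Compute the root node that contains the body text.
3. Use the output formatter (outputformatters.py) to output the text of the node selected in step 2.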

The core is step 2. The code is in the calculate_best_node method of ContentExtractor in extractors.py. Step 2 breaks down as follows:

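1. Select all p, pre, and td tags.
2. Remove link-dense tags. This uses the is_highlink_density method: if (number of words inside all a tags) / (number of words in the candidate tag) > 1 / (number of a tags), the tag is considered link-dense and is discarded.
3. Compute the node score, which has two parts. One part is the number of stopwords the node contains; the resources folder holds a word list for each language. This list is not really a blacklist; it works more like a whitelist. The other part is the boost score, which treats tags near the beginning and the end of the article differently: paragraphs near the beginning receive a larger bonus, and when there are more than 15 candidate nodes, the nodes in the last quarter receive lower scores (the author's explanation is that they may be comments).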
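The link-density test in step 2 can be sketched in a few lines. This is only an illustration of the inequality as stated above, not the library's code: the function name `is_high_link_density` and its three count parameters are hypothetical, the handling of the zero cases is an assumption, and how words are counted is left to the caller.

```python
def is_high_link_density(link_words: int, total_words: int, link_count: int) -> bool:
    """Return True when link_words / total_words > 1 / link_count."""
    if link_count == 0:
        # Assumption: a node without links is never link-dense.
        return False
    if total_words == 0:
        # Assumption: links with no countable text are treated as link-dense.
        return True
    return link_words / total_words > 1 / link_count


# A navigation-like block: 6 links that cover all 12 words.
print(is_high_link_density(link_words=12, total_words=12, link_count=6))
# A body paragraph: 100 words, 2 links covering 10 of them.
print(is_high_link_density(link_words=10, total_words=100, link_count=2))
```

Because the threshold is 1 / (number of a tags), a node with many links is discarded even when the links cover only a modest share of its words.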

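The position-based bonus is computed as follows, where starting_boost keeps increasing:

```
boost_score = float((1.0 / starting_boost) * 50)
```

There is also a check on whether a node can be boosted at all. The logic tests whether it is a p tag that contains more than 5 stopwords (which I find puzzling). The Chinese stopwords are stored in stopwords-zh.txt; I looked through it and they are all common words, only 125 of them (I could not find any description of where the list came from). Below is the author's explanation of this check: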

A lot of times the first paragraph might be the caption under an image so we'll want to make sure if we're going to boost a parent node that it should be connected to other paragraphs, at least for the first n paragraphs so we'll want to make sure that the next sibling is a paragraph and has at least some substantial weight to it.
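To see how quickly the bonus from `boost_score = float((1.0 / starting_boost) * 50)` decays, here is a small sketch. It assumes that `starting_boost` begins at 1.0 and grows by 1 after each boosted paragraph; the start value and step are assumptions for illustration, as are the function name `boost_scores` and the list of boolean flags that stands in for the boostable check.

```python
def boost_scores(boostable, first_bonus=50.0):
    """Return the position bonus for each paragraph, in document order."""
    starting_boost = 1.0  # assumed initial value
    scores = []
    for is_boostable in boostable:
        if is_boostable:
            scores.append((1.0 / starting_boost) * first_bonus)
            starting_boost += 1  # assumed step: each boosted paragraph lowers the next bonus
        else:
            scores.append(0.0)
    return scores


# Five paragraphs in order; the third fails the boostable check.
scores = boost_scores([True, True, False, True, True])
print(scores)
```

Under these assumptions the first boosted paragraph gets 50, the second 25, and later ones progressively less, so the bonus matters mostly for the opening of the article.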

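Each node's score is added to its parent node and to its parent's parent. Finally, the highest-scoring of these parent nodes is chosen as the result.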
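The propagation step can be sketched with the standard library alone. Everything named here is hypothetical: the tiny `STOPWORDS` set, the `best_node` function, and the sample markup. The sketch also gives the grandparent only half of each score; that weighting is an assumption of this sketch, chosen so that an outer container does not automatically outscore the inner one, and the boost score is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical stopword list; the library ships one list per language.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "it", "that", "was"}


def stopword_count(text):
    return sum(1 for word in text.lower().split() if word in STOPWORDS)


def best_node(root, grandparent_weight=0.5):
    """Score every <p>, push the score up two levels, return the top ancestor."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    scores = {}
    for p in root.iter("p"):
        score = stopword_count("".join(p.itertext()))
        parent = parent_of.get(p)
        if parent is None:
            continue
        scores[parent] = scores.get(parent, 0) + score
        grandparent = parent_of.get(parent)
        if grandparent is not None:
            # Assumed weighting: the grandparent receives a reduced share.
            scores[grandparent] = scores.get(grandparent, 0) + score * grandparent_weight
    if not scores:
        return root
    return max(scores, key=scores.get)


SAMPLE = (
    "<html><body>"
    "<div id='nav'><p>Home</p></div>"
    "<div id='article'>"
    "<p>It was the best of times and it was the worst of times</p>"
    "<p>That is a story of the city and of the people in it</p>"
    "</div>"
    "</body></html>"
)

winner = best_node(ET.fromstring(SAMPLE))
print(winner.tag, winner.get("id"))
```

In this sample the two article paragraphs contribute to the same div, so it outscores both the navigation block and the enclosing body.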

python-readability

Compared with Newspaper, the python-readability code is clearer and better organized. The example uses the summary() method to get the result:

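```
readable_article = Document(html).summary()
```

The summary method also takes a parameter, html_partial, which specifies whether the returned result includes the wrapping html tags. This implementation works somewhat differently: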

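1. Remove unneeded tags such as JavaScript and CSS.
2. Convert all div tags to p tags.
3. Score all p tags based on the tag's class attribute and tag name (the scoring rules are shown in the code below).
4. Select the highest-scoring candidate, format it, and return it.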

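The scoring rules have two parts: the first is based on the class attribute, and the second on the tag name. Different tags are given different weights: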

```
def score_node(self, elem):
    content_score = self.class_weight(elem)
    name = elem.tag.lower()
    if name == "div":
        content_score += 5
    elif name in ["pre", "td", "blockquote"]:
        content_score += 3
    elif name in ["address", "ol", "ul", "dl", "dd", "dt", "li", "form"]:
        content_score -= 3
    elif name in ["h1", "h2", "h3", "h4", "h5", "h6", "th"]:
        content_score -= 5
    return {
        'content_score': content_score,
        'elem': elem
    }
```

The class_weight method scores a node by the values of its class and id attributes, using the following rules:

```
def class_weight(self, e):
    weight = 0
    for feature in [e.get('class', None), e.get('id', None)]:
        if feature:
            if REGEXES['negativeRe'].search(feature):
                weight -= 25
            if REGEXES['positiveRe'].search(feature):
                weight += 25
            if self.positive_keywords and self.positive_keywords.search(feature):
                weight += 25
            if self.negative_keywords and self.negative_keywords.search(feature):
                weight -= 25
    if self.positive_keywords and self.positive_keywords.match('tag-' + e.tag):
        weight += 25
    if self.negative_keywords and self.negative_keywords.match('tag-' + e.tag):
        weight -= 25
    return weight
```

It relies on these predefined regular expressions:

```
REGEXES = {
    'unlikelyCandidatesRe': re.compile('combx|comment|community|disqus|extra|foot|header|menu|remark|rss|shoutbox|sidebar|sponsor|adbreak|agegate|pagination|pager|popup|tweet|twitter', re.I),
    'okMaybeItsACandidateRe': re.compile('and|article|body|column|main|shadow', re.I),
    'positiveRe': re.compile('article|body|content|entry|hentry|main|page|pagination|post|text|blog|story', re.I),
    'negativeRe': re.compile('combx|comment|com|contact|foot|footer|footnote|masthead|media|meta|outbrain|promo|related|scroll|shoutbox|sidebar|sponsor|shopping|tags|tool|widget', re.I),
    # 'divToPElementsRe': re.compile('   (this entry and the rest of the dict are cut off in the source)
}
```
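To get a feel for these weights, here is a standalone sketch that applies only the positiveRe and negativeRe patterns quoted above to a few class and id values. It is a simplification: the standalone `class_weight(class_attr, id_attr)` signature is hypothetical, and the optional positive_keywords and negative_keywords checks are left out.

```python
import re

POSITIVE_RE = re.compile(
    'article|body|content|entry|hentry|main|page|pagination|post|text|blog|story',
    re.I)
NEGATIVE_RE = re.compile(
    'combx|comment|com|contact|foot|footer|footnote|masthead|media|meta|outbrain|'
    'promo|related|scroll|shoutbox|sidebar|sponsor|shopping|tags|tool|widget',
    re.I)


def class_weight(class_attr=None, id_attr=None):
    """Score a node from its class and id values: +25 per positive hit, -25 per negative hit."""
    weight = 0
    for feature in (class_attr, id_attr):
        if feature:
            if NEGATIVE_RE.search(feature):
                weight -= 25
            if POSITIVE_RE.search(feature):
                weight += 25
    return weight


examples = [
    ("article-body", None),   # positive class only
    ("sidebar", None),        # negative class only
    ("post", "comments"),     # positive class, negative id
    ("welcome", None),        # 'com' matches as a substring
]
for class_attr, id_attr in examples:
    print(class_attr, id_attr, class_weight(class_attr, id_attr))
```

Because the patterns are matched with search rather than on whole words, a short alternative such as com also hits unrelated names like welcome, which is one way these fixed rules can misfire.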

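Summary

Both implementations turn out to be rule-based. Writing rules like these requires a fair familiarity with HTML, but the rules are ultimately fixed. I have seen other implementations that use simple machine learning algorithms, though their results may not be much better than the approaches described here.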

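Directions for improvement

After reading a lot of news, I found that for most articles the first paragraph serves as the summary: reading only the first paragraph gives a rough idea of what the story is about. "First paragraph" here does not mean the first paragraph in the strict sense, but the opening portion of the article. Automatic summarization does not produce results better than this first paragraph. So, to meet my own needs, I apply some similar rules to the result returned by python-readability and take the first paragraph as the summary. Some adjustments can also be made for news in a Chinese-language setting. These are all things that can be improved in practice.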
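One possible shape for the first-paragraph rule, sketched with the standard library only. The names `ParagraphCollector` and `first_paragraph` and the `min_chars` threshold are hypothetical, and the input is assumed to be well-formed HTML of the kind returned by summary(); the threshold would need tuning for Chinese text, where a sentence takes far fewer characters.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text of each top-level <p> element in document order."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._depth = 0
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._buffer).strip()
                if text:
                    self.paragraphs.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def first_paragraph(summary_html, min_chars=20):
    """Return the first paragraph with at least min_chars characters."""
    collector = ParagraphCollector()
    collector.feed(summary_html)
    collector.close()
    for text in collector.paragraphs:
        if len(text) >= min_chars:
            return text
    # Fall back to the first paragraph of any length, or an empty string.
    return collector.paragraphs[0] if collector.paragraphs else ""


sample = (
    "<div><p>Photo: Reuters</p>"
    "<p>The city council approved the new transit plan on Monday.</p>"
    "<p>More details follow.</p></div>"
)
print(first_paragraph(sample))
```

The length threshold is what makes this "not the first paragraph in the strict sense": short captions or bylines at the top are skipped in favor of the first substantial paragraph.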
