I need to use bs4 to parse an HTML page and have to write a lot of extraction statements, probably dozens, in this form:
twitter_url = summary_soup.find('a','twitter_url').get('href')
facebook_url = summary_soup.find('a','facebook_url').get('href')
linkedin_url = summary_soup.find('a','linkedin_url').get('href')
name = summary_soup.find('p', class_='name').find('a').string
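But any of these statements can raise an exception, and wrapping each one in try/except would be too tedious. Is there a good way to handle every statement so that a failure assigns None without interrupting the program?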
ringa_lee2017-04-18 09:05:01
I asked a small question in the comments on your question. If you can answer it, everyone will understand your needs better.
If you don't want to think too hard and just want to avoid the errors that can occur when calling get, there is a sneaky shortcut. If there aren't too many odd cases to handle, you could try:
twitter_url = (summary_soup.find('a','twitter_url') or {}).get('href')
If bs's find does not find anything, it returns None. Here we first use or as a trick so that get never fails: when find returns None, the expression falls back to an empty dict. Then, because a dict's get behaves like a bs tag's get, the missing case is handled and the variable is assigned None.
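The trick above can be checked with a small self-contained sketch. The sample markup and URL are made up for illustration, and html.parser is used here only so that the example does not depend on lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: only the twitter link is present.
html = '<div><a class="twitter_url" href="https://twitter.com/example">t</a></div>'
summary_soup = BeautifulSoup(html, 'html.parser')

# find() returns a Tag on a match and None otherwise;
# "or {}" swaps None for an empty dict, whose get() returns None.
twitter_url = (summary_soup.find('a', 'twitter_url') or {}).get('href')
facebook_url = (summary_soup.find('a', 'facebook_url') or {}).get('href')

print(twitter_url)
print(facebook_url)
```

Note that this only protects the final get; a chained find after a failed find still needs something like the fake-data approach.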
If you want something more robust, @prolifes's suggestions are very helpful.
As for how to cheat with a chained find: the trick, as you might guess, is fake data:
from bs4 import BeautifulSoup
html = '<p class="name"><a href="www.hello.com">hello world</a></p>'
emptysoup = BeautifulSoup('<a></a>', 'xml')
soup = BeautifulSoup(html, 'xml')
name = (soup.find('p', class_='name') or emptysoup).find('a').string
print(name)
name = (soup.find('p', class_='nam') or emptysoup).find('a').string
print(name)
Result:
hello world
None
Questions I answered: Python-QA
大家讲道理2017-04-18 09:05:01
I don't think the problem is the large number of exceptions; it is how the code is written. Let me make a bold guess, taking this statement as an example:
twitter_url = summary_soup.find('a','twitter_url').get('href')
I think the likely cause of the error is that summary_soup.find('a','twitter_url') finds no element and returns None; you then call get('href') on that None, which is bound to fail.
If that is the cause, it is easy to handle. Split it into two statements:
twitter_url_a = summary_soup.find('a','twitter_url')
twitter_url = twitter_url_a.get('href') if twitter_url_a else None
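With dozens of similar links, the same two-step check could be run in a loop instead of being written out per field. This is only a sketch: the sample markup is invented, and the urls dict is a hypothetical name not used in the question:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: only the twitter link is present.
html = '<div><a class="twitter_url" href="https://twitter.com/example">t</a></div>'
summary_soup = BeautifulSoup(html, 'html.parser')

urls = {}
for css_class in ('twitter_url', 'facebook_url', 'linkedin_url'):
    link = summary_soup.find('a', css_class)
    # Only call get() when find() actually returned a tag.
    urls[css_class] = link.get('href') if link is not None else None

print(urls)
```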
PHP中文网2017-04-18 09:05:01
bs4's chained calls are very convenient, so I wrapped the soup in a class:
from bs4 import BeautifulSoup

class MY_SOUP():
    '''
    Wrapper class: a missing element becomes MY_SOUP(None) instead of None
    '''
    def __init__(self, soup):
        self.soup = soup
        if soup is not None and soup.string:
            self.string = soup.string.strip()
        else:
            self.string = None

    def find(self, *args, **kw):
        # Always return a wrapper so that chained calls keep working
        if self.soup is None:
            return MY_SOUP(None)
        return MY_SOUP(self.soup.find(*args, **kw))

    def find_all(self, *args, **kw):
        if self.soup is None:
            return []
        return self.soup.find_all(*args, **kw)

    def get_text(self):
        if self.soup is not None:
            return self.soup.get_text().strip()
        return None

    def get(self, *args, **kw):
        if self.soup is not None:
            return self.soup.get(*args, **kw)
        return None

# html is the page source being parsed
soup = BeautifulSoup(html, 'lxml')
summary_soup = soup.find('p', class_='summary')
# Wrap summary_soup in my wrapper class
summary_soup = MY_SOUP(summary_soup)
# No more None errors
twitter_url = summary_soup.find('a','twitter_url').get('href')
facebook_url = summary_soup.find('a','facebook_url').get('href')
linkedin_url = summary_soup.find('a','linkedin_url').get('href')
name = summary_soup.find('p', class_='name').find('a').string
...
Based on @prolifes's suggestion.
ringa_lee2017-04-18 09:05:01
Write a helper function for the calls that may raise, and put the try inside that function.
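One way this could look is sketched below. The helper name safe and the choice of caught exceptions are assumptions made for illustration, not something given in the answer:

```python
from bs4 import BeautifulSoup

def safe(extract, default=None):
    """Call extract() and return default if it fails on a missing element."""
    try:
        return extract()
    except (AttributeError, TypeError, KeyError):
        return default

html = '<p class="name"><a href="www.hello.com">hello world</a></p>'
soup = BeautifulSoup(html, 'html.parser')

# The lambda delays evaluation so that the failure happens inside safe().
name = safe(lambda: soup.find('p', class_='name').find('a').string)
missing = safe(lambda: soup.find('p', class_='nam').find('a').string)

print(name)
print(missing)
```

Each extraction stays on one line, and the try/except is written only once.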
PHPz2017-04-18 09:05:01
If any statement can raise, that reflects how the parsing code was written. When parsing HTML, try to consider the possible cases as comprehensively as you can, then wrap all the parsing statements in a single try/except, catch the errors, and write them to a log. Only when you have crawled more and more pages and the errors stop appearing can you say the parsing statements are well written.
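A rough sketch of that approach follows. The function name parse_summary, the logger name, and the sample pages are all hypothetical, and only two fields are extracted to keep it short:

```python
import logging
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger('parser')  # hypothetical logger name

def parse_summary(html):
    """Parse one page; return a dict of fields, or None if the layout was unexpected."""
    try:
        soup = BeautifulSoup(html, 'html.parser')
        return {
            'twitter_url': soup.find('a', 'twitter_url').get('href'),
            'name': soup.find('p', class_='name').find('a').string,
        }
    except AttributeError:
        # Log the full traceback so the failing statement can be fixed later.
        logger.exception('unexpected page layout')
        return None

good = ('<a class="twitter_url" href="https://twitter.com/example">t</a>'
        '<p class="name"><a href="#">hello</a></p>')
ok = parse_summary(good)
failed = parse_summary('<p>nothing here</p>')
```

Unlike the per-statement approaches, one missing element here discards the whole page's result, so the log is what tells you which case the parsing code still needs to cover.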