python - Replacing data in crawler output

Question

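{code...} As shown in the screenshot, I want to extract the source website and the time separately, but split() has no effect. The separator prints like spaces, yet matching or replacing a space does not work. In the page source it is &nbsp;&nbsp;. How can I match and replace it so that I get the two values separately?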

PHPz · Answer

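I found that the whitespace separating the fields in each line uses two different encodings, even though they look identical. Here is a lightly rewritten version of your code: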

# coding: utf-8
# Python 2 code (print statements, byte strings).
import re
import requests
from bs4 import BeautifulSoup

code = '000917电广传媒'

def getinfo(code, page):
    baseurl = 'http://news.baidu.com/ns?word=title%3A%28{}%29&pn={}&cl=2&ct=0&tn=newstitle&rn=20&ie=utf-8&bt=0&et=0'.format(code, 10*(page-1))
    wd = requests.get(baseurl).content
    soup = BeautifulSoup(wd, 'lxml')
    title = soup.select('.c-title > a ')
    resource = soup.select('p .c-title-author')
    # Encode each "source + time" string to UTF-8 bytes before splitting.
    resource1 = [i.text.encode('utf-8') for i in resource]
    for i in resource1:
        # Split on either byte sequence: "\xa0" followed by "2016", or two
        # UTF-8 encoded non-breaking spaces ("\xc2\xa0" each).
        l = re.split("\xa02016|\xc2\xa0\xc2\xa0", i)
        print l[0]  # source website
        print l[1]  # time

getinfo(code, 1)

The output is:

中金在线
1小时前
金投网
2016年12月15日 21:31
同花顺金融网
2016年12月15日 21:37
金投网
2016年12月15日 21:08
每日经济新闻
2016年12月15日 20:22
金融界
2016年12月15日 17:23
潇湘晨报数字报
2016年12月13日 03:00
新浪
2016年12月12日 22:48
财新网
2016年12月12日 20:50
同花顺金融服务网
2016年12月12日 18:29
新浪财经
2016年12月12日 17:39
金融界
2016年12月12日 17:35
金投网
2016年12月08日 02:20
金投网
2016年12月08日 02:20
中国经济网
2016年11月29日 07:00
新浪财经
2016年11月29日 02:54
潇湘晨报数字报
2016年11月28日 02:45
证券时报
2016年11月26日 02:48
同花顺金融服务网
2016年11月25日 22:37
同花顺金融网
2016年11月25日 19:48
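
The claim that the separators only look like spaces can be checked by listing the code point of every whitespace character in a line. A minimal Python 3 sketch, assuming a hypothetical sample string shaped like the output above; the helper name `describe_whitespace` is not from the original code:

```python
import unicodedata

def describe_whitespace(text):
    """Return (code point, Unicode name) for each whitespace character in text."""
    return [
        ("U+{:04X}".format(ord(ch)), unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ch.isspace()
    ]

# Hypothetical sample: a source name, two non-breaking spaces, then a time.
sample = "金投网\xa0\xa02016年12月15日 21:31"

for code_point, name in describe_whitespace(sample):
    print(code_point, name)

# Splitting on an ordinary space only finds the space inside the time,
# so the source name and the date stay glued together.
pieces = sample.split(" ")
```

If the listing shows NO-BREAK SPACE (U+00A0), splitting on " " cannot separate the source from the time. The separator then has to be written as "\xa0", or as the bytes "\xc2\xa0" when working on UTF-8 encoded data.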

PHP中文网 · Answer

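The timestamps in the data all begin with a digit, for example "59分钟" (59 minutes) and "201x年" (a year in the 2010s), so why not try splitting at the first digit?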
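
A rough Python 3 sketch of this idea, assuming every line has the form "source, separator, time" and that the time always starts with a digit. `split_at_first_digit` is a hypothetical helper name, and the sample strings are modeled on the output shown earlier in the thread:

```python
import re

def split_at_first_digit(text):
    """Split a 'source + time' string at its first digit.

    Returns (source, time); time is '' when the text contains no digit.
    """
    match = re.search(r"\d", text)
    if match is None:
        return text.strip(), ""
    # str.strip() also removes non-breaking spaces (U+00A0) in Python 3.
    return text[:match.start()].strip(), text[match.start():].strip()

# Hypothetical samples in the same shape as the crawler output.
source, when = split_at_first_digit("金投网\xa0\xa02016年12月15日 21:31")
recent_source, recent_when = split_at_first_digit("中金在线\xa0\xa01小时前")
```

This does not depend on which kind of whitespace separates the fields, but it would cut a source name in the wrong place if the name itself contained a digit.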

巴扎黑 · Answer

# coding=utf-8
import requests
from bs4 import BeautifulSoup

code = '000917电广传媒'

def getinfo(code, page):
    baseurl = 'http://news.baidu.com/ns?word=title%3A%28{}%29&pn={}&cl=2&ct=0&tn=newstitle&rn=20&ie=utf-8&bt=0&et=0'.format(code, 10*(page-1))
    wd = requests.get(baseurl).content

    soup = BeautifulSoup(wd, 'lxml')
    for text in soup.find_all('p', class_='result title', id=True):
        for i in text.find_all('p', class_='c-title-author'):
            # get_text() returns a Unicode str, so '\xa0\xa0' matches
            # two non-breaking spaces (U+00A0) directly.
            print(i.get_text().split('\xa0\xa0'))

getinfo(code, 1)

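Under Python 3 this splits correctly, because str is Unicode there and '\xa0' denotes the non-breaking space itself.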
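
For reuse, the split can be wrapped so that it always yields exactly two fields. A sketch for Python 3, assuming the separator is one or more non-breaking spaces as in the answers here; `split_source_time` is a hypothetical helper name, and the sample strings are made up to resemble the crawler output:

```python
import re

def split_source_time(text):
    """Split a 'source + time' string on its first run of U+00A0.

    Returns (source, time); time is '' when no separator is present.
    """
    # maxsplit=1 keeps any later non-breaking spaces inside the time field.
    parts = re.split("\xa0+", text.strip(), maxsplit=1)
    source = parts[0].strip()
    when = parts[1].strip() if len(parts) > 1 else ""
    return source, when

# Bytes (for example the result of .encode('utf-8')) must be decoded to
# str first; otherwise the separator is the byte pair b'\xc2\xa0'.
raw = "中金在线\xa0\xa01小时前".encode("utf-8")
source, when = split_source_time(raw.decode("utf-8"))
```

Returning a fixed-size tuple avoids the IndexError that indexing the split result with [1] would raise on a line without a separator.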
