Python crawler - Extracting web page information with a Python crawler

Question

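The URL to crawl is: http://www.xici.net.co/nn/1. Above is the HTML content of the page; I need to extract the IP address, port, location, whether the proxy is high-anonymity, and the two time values. Below is the Python I wrote, but it only gets part of the way there. Any pointers would be appreciated, thanks. {code...} The result looks something like the following...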

高洛峰 · Answer

The following code solves it. Thanks, everyone, for your answers.

import requests
from bs4 import BeautifulSoup


def getInfo(url):
    """Return a list of rows; each row is a list of cleaned proxy fields."""
    proxy_info = []
    page_code = requests.get(url).text
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(page_code, 'html.parser')
    table_soup = soup.find('table')
    # The first <tr> is the header row, so skip it.
    proxy_list = table_soup.find_all('tr')[1:]
    for tr in proxy_list:
        td_list = tr.find_all('td')
        ip = td_list[2].string
        port = td_list[3].string
        # The location is either plain text or wrapped in an <a> tag.
        location = td_list[4].string or td_list[4].find('a').string
        anonymity = td_list[5].string
        proxy_type = td_list[6].string
        speed = td_list[7].find('p', {'class': 'bar'})['title']
        connect_time = td_list[8].find('p', {'class': 'bar'})['title']
        validate_time = td_list[9].string

        # Strip surrounding whitespace; fields that are None are left as-is.
        fields = [ip, port, location, anonymity, proxy_type, speed, connect_time, validate_time]
        proxy_info.append([f.strip() if f else f for f in fields])

    return proxy_info

if __name__ == '__main__':
    url = 'http://www.xici.net.co/nn/1'
    proxy_info = getInfo(url)
    for row in proxy_info:
        # Python 3 print: fields separated by spaces, one proxy per line.
        print(*row)

大家讲道理 · Answer

Use XPath to find them, with lxml doing the parsing.
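A minimal sketch of the XPath route, assuming lxml is installed. The `SAMPLE` markup and the `parse_rows` helper are hypothetical; the sample is laid out only to match the column indexes used in the accepted answer (IP in cell 2, port in cell 3, and so on), not copied from the real page.

```python
from lxml import html

# Hypothetical sample, shaped to match the column indexes in the answer above.
SAMPLE = """
<table>
  <tr><th></th><th>#</th><th>IP</th><th>Port</th><th>Location</th><th>Anonymity</th></tr>
  <tr>
    <td></td><td>1</td><td>10.0.0.1</td><td>8080</td>
    <td><a href="#">Beijing</a></td><td>high</td>
  </tr>
  <tr>
    <td></td><td>2</td><td>10.0.0.2</td><td>3128</td>
    <td> Shanghai </td><td>high</td>
  </tr>
</table>
"""


def parse_rows(page_source):
    """Return (ip, port, location, anonymity) tuples, one per data row."""
    tree = html.fromstring(page_source)
    rows = []
    # tr[td] selects only rows that contain <td> cells, which skips the header row.
    for tr in tree.xpath('.//tr[td]'):
        cells = tr.xpath('./td')
        # text_content() flattens nested tags such as <a>, so a location
        # wrapped in a link is handled the same way as plain text.
        ip, port, location, anonymity = (
            cells[i].text_content().strip() for i in (2, 3, 4, 5)
        )
        rows.append((ip, port, location, anonymity))
    return rows


if __name__ == '__main__':
    for row in parse_rows(SAMPLE):
        print(*row)
```

For the live page you would pass `requests.get(url).text` to `parse_rows` instead of `SAMPLE`, and the indexes may need adjusting to whatever the page actually serves.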

伊谢尔伦 · Answer

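I suspect the regular expression is where the problem is.

First, look at the document structure:

Each <tr>...</tr> tag contains one complete row, and each <td>...</td> tag holds a single item.

I suggest using a regular expression that starts at the <tr> tag and parses each <td> tag in turn.

Roughly like this: r'(.*?(.*?).......)'

The (.*?) here is the parsed-out IP address, and the later fields work the same way.

It's a bit tedious to write, but it should work reliably.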
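As a rough sketch of the regex idea, using only the standard library: the pattern quoted above lost its tags, so the patterns below are my own assumption of what was meant, not the original. This version works in two stages (match each row, then each cell inside it) instead of one long pattern. `SAMPLE` and `parse_rows` are hypothetical names, and the sample markup only mirrors the column indexes from the accepted answer.

```python
import re

# Hypothetical sample, shaped to match the column indexes in the accepted answer.
SAMPLE = """
<table>
  <tr><th></th><th>#</th><th>IP</th><th>Port</th><th>Location</th><th>Anonymity</th></tr>
  <tr>
    <td></td><td>1</td><td>10.0.0.1</td><td>8080</td>
    <td><a href="#">Beijing</a></td><td>high</td>
  </tr>
  <tr>
    <td></td><td>2</td><td>10.0.0.2</td><td>3128</td>
    <td> Shanghai </td><td>high</td>
  </tr>
</table>
"""

# re.S lets "." match newlines, because rows and cells span several lines.
ROW_RE = re.compile(r'<tr\b[^>]*>(.*?)</tr>', re.S)
CELL_RE = re.compile(r'<td\b[^>]*>(.*?)</td>', re.S)
TAG_RE = re.compile(r'<[^>]+>')  # any leftover tag inside a cell, e.g. <a>


def parse_rows(page_source):
    """Return (ip, port, location, anonymity) tuples, one per data row."""
    rows = []
    for row_html in ROW_RE.findall(page_source):
        # Drop nested tags and surrounding whitespace from every cell.
        cells = [TAG_RE.sub('', c).strip() for c in CELL_RE.findall(row_html)]
        if len(cells) < 6:
            continue  # header row (only <th> cells) or a malformed row
        rows.append((cells[2], cells[3], cells[4], cells[5]))
    return rows


if __name__ == '__main__':
    for row in parse_rows(SAMPLE):
        print(*row)
```

This holds up only as long as the markup stays regular; nested tables or unclosed tags would break it, which is the usual argument for a real parser.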

Actually, using BeautifulSoup would be much simpler.

大家讲道理 · Answer

Using re to process HTML is a bit much. Go with XPath.

大家讲道理 · Answer

I recommend BeautifulSoup.

大家讲道理 · Answer

BeautifulSoup is a good choice; hand-written regular expressions are also less elegant.

PHPz · Answer

...Scrapy, of course.

迷茫 · Answer

Scrapy...
