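The proxy IP site I want to scrape is: http://ip84.com
The above is the page's HTML content.
I need to extract the IP address, port, location, whether the proxy is high-anonymity, and the two time fields.
Below is the Python I wrote, but it only does part of the job. Any pointers would be appreciated.
Thanks.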
import re
import urllib
a = raw_input('input url:')
s = urllib.urlopen(a)
s1 = s.read()
def getinfo(aaa):
    #reg = re.compile(r'(?<![\.\d])(?:\d{1,3}\.){3}\d{1,3}(?![\.\d])')
    #reg = re.compile(r'<td>(\d+)\.(\d+)\.(\d+)\.(\d+)</td>\s*<td>(\d+)</td>\s*<td>([/u4e00-/u9fa5]+)</td>')
    reg = re.compile(r'<td>(\w+)</td>\s*<td>([\u4e00-\u9fa5]+)</td>')
    l = re.findall(reg, aaa)
    print l
getinfo(s1)
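The result should look something like the following; it does not have to be a table.
|IP|Port|Location|Anonymity|Type|Speed|Connect time|Verification time|
|-|-|-|-|-|-|-|-|
|122.89.9.70|80|Taiwan|High anonymity|HTTP|1.27 s|0.325 s|15-08-28 16:30|
|123.69.48.45|8080|Nanjing, Jiangsu|High anonymity|HTTPS|1.07 s|0.5 s|15-08-28 17:30|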
天蓬老师2017-04-17 17:43:39
Hello! I recommend using requests and BeautifulSoup for the parsing. Here is my code (Python 3) and its output:
from bs4 import BeautifulSoup
import requests

r = requests.get("http://ip84.com")
content = r.text
soup = BeautifulSoup(content, "html.parser")
# Select every <table> whose class is "list"
ListTable = soup.find_all("table", class_="list")
for table in ListTable:
    ListTr = table.find_all("tr")
    for tr in ListTr:
        try:
            ListTd = tr.find_all("td")
            ipaddr = str(ListTd[0].get_text()).strip()
            port = str(ListTd[1].get_text()).strip()
            zone = str(ListTd[2].get_text()).strip().replace("\n", "")  # location
            nmd = str(ListTd[3].get_text()).strip()    # anonymity level
            xy = str(ListTd[4].get_text()).strip()     # protocol type
            speed = str(ListTd[5].get_text()).strip()
            time = str(ListTd[6].get_text()).strip()
            print(ipaddr + " " + port + " " + zone + " " + nmd + " " + xy + " " + speed + " " + time)
        except IndexError:
            # Rows with fewer than seven <td> cells (such as a header row) end up here
            print("---------------------------------------------")
Execution result:
Good luck! ^_<
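If you would rather stay with regular expressions, as in the question, here is a minimal Python 3 sketch that uses only the standard library and collects all eight columns, including both time fields. The `SAMPLE_HTML` markup is an assumption reconstructed from the table in the question, not the real page source, and the names `FIELDS` and `parse_proxy_rows` are made up for illustration. The actual markup on ip84.com may differ, so the patterns may need adjusting.

```python
import re

# Hypothetical sample markup modelled on the table in the question;
# the real page's HTML may be structured differently.
SAMPLE_HTML = """
<table class="list">
<tr><th>ip</th><th>端口号</th><th>位置</th><th>是否高匿</th>
<th>类型</th><th>速度</th><th>连接时间</th><th>验证时间</th></tr>
<tr>
<td>122.89.9.70</td>
<td>80</td>
<td>台湾</td>
<td>高匿</td>
<td>HTTP</td>
<td>1.27秒</td>
<td>0.325秒</td>
<td>15-08-28 16:30</td>
</tr>
<tr>
<td>123.69.48.45</td>
<td>8080</td>
<td><a href="#">江苏南京</a></td>
<td>高匿</td>
<td>HTTPS</td>
<td>1.07秒</td>
<td>0.5秒</td>
<td>15-08-28 17:30</td>
</tr>
</table>
"""

# Illustrative field names, in the column order shown in the question.
FIELDS = ("ip", "port", "location", "anonymity", "protocol",
          "speed", "connect_time", "verified_at")

ROW_RE = re.compile(r"<tr[^>]*>(.*?)</tr>", re.S)   # one table row
CELL_RE = re.compile(r"<td[^>]*>(.*?)</td>", re.S)  # one data cell
TAG_RE = re.compile(r"<[^>]+>")                     # any nested tag


def parse_proxy_rows(html):
    """Return one dict per data row that has at least len(FIELDS) cells."""
    rows = []
    for row_html in ROW_RE.findall(html):
        cells = []
        for cell_html in CELL_RE.findall(row_html):
            text = TAG_RE.sub("", cell_html)    # drop nested tags such as <a>
            cells.append(" ".join(text.split()))  # collapse whitespace and newlines
        if len(cells) < len(FIELDS):
            continue  # header rows (<th> only) or incomplete rows
        rows.append(dict(zip(FIELDS, cells)))
    return rows


rows = parse_proxy_rows(SAMPLE_HTML)
for row in rows:
    print(" | ".join(row[name] for name in FIELDS))
```

Matching each row first and then its cells avoids one long pattern that breaks as soon as a single cell has an attribute or nested tag. For anything beyond a simple, regular table, an HTML parser such as BeautifulSoup is still the more robust choice.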