Home  >  Article  >  Backend Development  >  Python crawls other web pages

Python crawls other web pages

零到壹度
零到壹度Original
2018-03-30 10:38:352273browse

This article mainly shares with you a Python request method for crawling other web pages. It has a good reference value and I hope it will be helpful to everyone. Let’s follow the editor to take a look, I hope it can help everyone.

To put it simply, it is to find the hyperlink 'href' in the web page, then convert the relative URL into an absolute URL, and use a for loop to access it

import requestsfrom bs4 import BeautifulSoup#将字符串转换为Python对象import pandas as pd
url = 'http://www.runoob.com/html/html-tutorial.html'r= requests.get(url)
html=r.text.encode(r.encoding).decode()
soup =BeautifulSoup(html,'lxml')#html放到beatifulsoup对象中l=[x.text for x in soup.findAll('h2')]#提取次标题中所有的文字df = pd.DataFrame(l,columns =[url])#将l变为DataFrame文件,列名为URLx=soup.findAll('a')[1]#查看第二个元素x.has_attr('href')#判断是都有href字符x.attrs['href']#获得超链接 attrs函数返回字典links = [i for i in soup.findAll('a')if i.has_attr('href')and i.attrs['href'][0:5]== '/html']#用if来做一个筛选relative_urls= set([i.attrs['href'] for i in links])
absolute_urls={'http://www.runoob.com'+i for i in relative_urls}
absolute_urls.discard(url)#删除当前所在的urlfor i in absolute_urls:
    ri= requests.get(i)
    soupi =BeautifulSoup(ri.text.encode(ri.encoding),'lxml')
    li=[x.text for x in soupi.findAll('h2')]
    dfi = pd.DataFrame(l,columns =[i])
    df = df.join(dfi,how='outer')
df

Related recommendations :

Python crawls simple web pages

##Python crawler crawls Tencent News

Python crawls Taobao product information

The above is the detailed content of Python crawls other web pages. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn