
Python code example: scraping the w3school jQuery tutorial and saving it locally

By Y2J (original article), 2017-04-27 11:42:03

This article walks through how to use Python to scrape the jQuery tutorial on w3school and save it locally. It should make a useful reference, so let's take a look.

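I've been busy job hunting lately, and in my spare time I pick up small crawler projects to practice writing code. I know I'm still a beginner, but practice is what it takes; as the saying goes, diligence is the path up the mountain of books. If any of you know of an open testing position, feel free to refer me: I can do automation, functional, and interface testing.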

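First, let's pin down the requirement. Plenty of people like to read up on some technology whenever they have a spare moment. Say I want to look at jQuery syntax, but I have no network connection and no e-book on my phone; that is frustrating. No need to worry, that need can be met right here. The requirement is to get the jQuery syntax material, and I know a site that has it, so let's analyze that site.

www.w3school.com.cn/jquery/jquery_syntax.asp is the URL of the syntax page, and http://www.w3school.com.cn/jquery/jquery_intro.asp is the URL of the introduction page. Comparing several such URLs shows that they all share the prefix www.w3school.com.cn/jquery. Next, let's work out how to collect the full set of pages from the site itself: there is a course menu on the right-hand side of the page, so let's examine that.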

Let's look at the links in that menu. Each one is a site-relative path, so we can join it with http://www.w3school.com.cn to form a complete URL.
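A minimal sketch of that joining step, assuming the hrefs are site-relative paths such as /jquery/jquery_intro.asp. It uses urllib.parse.urljoin from the standard library rather than plain string concatenation; to_absolute and BASE_URL are names made up for this illustration and are not part of the article's script.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.w3school.com.cn'  # site root named in the article


def to_absolute(hrefs, base=BASE_URL):
    """Resolve each site-relative href against the site root."""
    return [urljoin(base, href) for href in hrefs]


if __name__ == '__main__':
    for full_url in to_absolute(['/jquery/index.asp', '/jquery/jquery_intro.asp']):
        print(full_url)
```

urljoin produces a single slash between host and path whether or not the base ends with one, which removes one thing to get wrong when concatenating by hand.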

Here is the code:

import urllib.request
from bs4 import BeautifulSoup


def head():
    # Send a browser-like User-Agent with every request.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }
    return headers


def parse_url(url):
    # Fetch a page and decode the response bytes as GB2312.
    hea = head()
    request = urllib.request.Request(url, headers=hea)
    html = urllib.request.urlopen(request).read().decode('gb2312')
    return html


def url_s():
    # Collect every link, plus the menu text, from the element with id="course".
    url = 'http://www.w3school.com.cn/jquery/index.asp'
    html = parse_url(url)
    soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
    me = soup.find_all(id='course')
    m_url_text = []
    m_url = []
    for link in me:
        m_url_text.extend(link.text.split('\n'))  # one menu entry per line
        for a in link.find_all('a'):
            m_url.append(a.get('href'))
    return m_url, m_url_text
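One fragile spot in parse_url is the strict decode('gb2312') call: if a page contains any byte sequence outside GB2312, it raises UnicodeDecodeError and the whole run stops. A hedged sketch of a more forgiving decode follows; it assumes the pages are in a GB-family encoding and uses GB18030, which covers the GB2312 character set. decode_page is a hypothetical helper, not something the original script defines.

```python
def decode_page(raw_bytes, encoding='gb18030'):
    """Decode page bytes, substituting U+FFFD for undecodable bytes instead of raising."""
    return raw_bytes.decode(encoding, errors='replace')


if __name__ == '__main__':
    sample = 'jQuery 简介'.encode('gb2312')  # stand-in for downloaded page bytes
    print(decode_page(sample))
```

With this helper, parse_url could return decode_page(urllib.request.urlopen(request).read()) instead of calling decode('gb2312') directly.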

With this in place, calling the url_s function gives us all of the links:

(['/jquery/index.asp', '/jquery/jquery_intro.asp', '/jquery/jquery_install.asp', '/jquery/jquery_syntax.asp', '/jquery/jquery_selectors.asp', '/jquery/jquery_events.asp', '/jquery/jquery_hide_show.asp', '/jquery/jquery_fade.asp', '/jquery/jquery_slide.asp', '/jquery/jquery_animate.asp', '/jquery/jquery_stop.asp', '/jquery/jquery_callback.asp', '/jquery/jquery_chaining.asp', '/jquery/jquery_dom_get.asp', '/jquery/jquery_dom_set.asp', '/jquery/jquery_dom_add.asp', '/jquery/jquery_dom_remove.asp', '/jquery/jquery_css_classes.asp', '/jquery/jquery_css.asp', '/jquery/jquery_dimensions.asp', '/jquery/jquery_traversing.asp', '/jquery/jquery_traversing_ancestors.asp', '/jquery/jquery_traversing_descendants.asp', '/jquery/jquery_traversing_siblings.asp', '/jquery/jquery_traversing_filtering.asp', '/jquery/jquery_ajax_intro.asp', '/jquery/jquery_ajax_load.asp', '/jquery/jquery_ajax_get_post.asp', '/jquery/jquery_noconflict.asp', '/jquery/jquery_examples.asp', '/jquery/jquery_quiz.asp', '/jquery/jquery_reference.asp', '/jquery/jquery_ref_selectors.asp', '/jquery/jquery_ref_events.asp', '/jquery/jquery_ref_effects.asp', '/jquery/jquery_ref_manipulation.asp', '/jquery/jquery_ref_attributes.asp', '/jquery/jquery_ref_css.asp', '/jquery/jquery_ref_ajax.asp', '/jquery/jquery_ref_traversing.asp', '/jquery/jquery_ref_data.asp', '/jquery/jquery_ref_dom_element_methods.asp', '/jquery/jquery_ref_core.asp', '/jquery/jquery_ref_prop.asp'], ['jQuery 教程', '', 'jQuery 教程', 'jQuery 简介', 'jQuery 安装', 'jQuery 语法', 'jQuery 选择器', 'jQuery 事件', '', 'jQuery 效果', '',
'jQuery 隐藏/显示', 'jQuery 淡入淡出', 'jQuery 滑动', 'jQuery 动画', 'jQuery stop()', 'jQuery Callback', 'jQuery Chaining', '', 'jQuery HTML', '', 'jQuery 获取', 'jQuery 设置', 'jQuery 添加', 'jQuery 删除', 'jQuery CSS 类', 'jQuery css()', 'jQuery 尺寸', '', 'jQuery 遍历', '', 'jQuery 遍历', 'jQuery 祖先', 'jQuery 后代', 'jQuery 同胞', 'jQuery 过滤', '', 'jQuery AJAX', '', 'jQuery AJAX 简介', 'jQuery 加载', 'jQuery Get/Post', '', 'jQuery 杂项', '', 'jQuery noConflict()', '', 'jQuery 实例', '', 'jQuery 实例', 'jQuery 测验', '', 'jQuery 参考手册', '', 'jQuery 参考手册', 'jQuery 选择器', 'jQuery 事件', 'jQuery 效果', 'jQuery 文档操作', 'jQuery 属性操作', 'jQuery CSS 操作', 'jQuery Ajax', 'jQuery 遍历', 'jQuery 数据', 'jQuery DOM 元素', 'jQuery 核心', 'jQuery 属性', '', ''])

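These are all the links, together with the names of the tutorial sections taken from the menu. The next step is to build the full URLs, using plain str concatenation: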

['http://www.w3school.com.cn//jquery/index.asp', 'http://www.w3school.com.cn//jquery/jquery_intro.asp', 'http://www.w3school.com.cn//jquery/jquery_install.asp', 'http://www.w3school.com.cn//jquery/jquery_syntax.asp', 'http://www.w3school.com.cn//jquery/jquery_selectors.asp', 'http://www.w3school.com.cn//jquery/jquery_events.asp', 'http://www.w3school.com.cn//jquery/jquery_hide_show.asp', 'http://www.w3school.com.cn//jquery/jquery_fade.asp', 'http://www.w3school.com.cn//jquery/jquery_slide.asp', 'http://www.w3school.com.cn//jquery/jquery_animate.asp', 'http://www.w3school.com.cn//jquery/jquery_stop.asp', 'http://www.w3school.com.cn//jquery/jquery_callback.asp', 'http://www.w3school.com.cn//jquery/jquery_chaining.asp', 'http://www.w3school.com.cn//jquery/jquery_dom_get.asp', 'http://www.w3school.com.cn//jquery/jquery_dom_set.asp', 'http://www.w3school.com.cn//jquery/jquery_dom_add.asp', 'http://www.w3school.com.cn//jquery/jquery_dom_remove.asp', 'http://www.w3school.com.cn//jquery/jquery_css_classes.asp', 'http://www.w3school.com.cn//jquery/jquery_css.asp', 'http://www.w3school.com.cn//jquery/jquery_dimensions.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing_ancestors.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing_descendants.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing_siblings.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing_filtering.asp', 'http://www.w3school.com.cn//jquery/jquery_ajax_intro.asp', 'http://www.w3school.com.cn//jquery/jquery_ajax_load.asp', 'http://www.w3school.com.cn//jquery/jquery_ajax_get_post.asp', 'http://www.w3school.com.cn//jquery/jquery_noconflict.asp',
'http://www.w3school.com.cn//jquery/jquery_examples.asp', 'http://www.w3school.com.cn//jquery/jquery_quiz.asp', 'http://www.w3school.com.cn//jquery/jquery_reference.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_selectors.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_events.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_effects.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_manipulation.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_attributes.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_css.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_ajax.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_traversing.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_data.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_dom_element_methods.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_core.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_prop.asp']
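The title list returned by url_s contains empty strings and entries that appear to be section headings, so it does not line up one-to-one with the link list. One way around that is to record each link together with its own anchor text. The sketch below does this with the standard library's html.parser instead of BeautifulSoup, purely so that it runs without third-party packages. The sample markup is invented for illustration (the real page's structure is an assumption), and MenuLinkParser and menu_links are hypothetical names.

```python
from html.parser import HTMLParser

VOID_TAGS = {'br', 'img', 'hr', 'input', 'meta', 'link'}  # tags with no closing tag


class MenuLinkParser(HTMLParser):
    """Collect (href, text) pairs for every <a> inside the element with a given id."""

    def __init__(self, container_id='course'):
        super().__init__()
        self.container_id = container_id
        self.depth = 0            # nesting depth inside the container; 0 means outside
        self.current_href = None  # href of the <a> currently open, if any
        self.current_text = []
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self.depth:
            self.depth += 1
        elif attrs.get('id') == self.container_id:
            self.depth = 1
        if self.depth and tag == 'a' and 'href' in attrs:
            self.current_href = attrs['href']
            self.current_text = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag == 'a' and self.current_href is not None:
            text = ''.join(self.current_text).strip()
            self.pairs.append((self.current_href, text))
            self.current_href = None
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)


def menu_links(html, container_id='course'):
    """Return a list of (href, title) tuples found inside the container."""
    parser = MenuLinkParser(container_id)
    parser.feed(html)
    parser.close()
    return parser.pairs


# Invented sample markup, modelled loosely on the output shown in the article.
SAMPLE = '''
<div id="course">
<h2>jQuery 教程</h2>
<ul>
<li><a href="/jquery/index.asp">jQuery 教程</a></li>
<li><a href="/jquery/jquery_intro.asp">jQuery 简介</a></li>
</ul>
</div>
<div id="other"><a href="/elsewhere.asp">not in the menu</a></div>
'''

if __name__ == '__main__':
    for href, title in menu_links(SAMPLE):
        print(href, title)
```

With BeautifulSoup the same idea is shorter: inside the loop over link.find_all('a'), read a.get('href') and a.text together and append them as one tuple.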

Now that we have all the URLs, let's analyze the body of each article.

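Inspecting the pages shows that all of the body text sits inside a single element with id=maincontent. So we simply parse the id=maincontent tag on each page, extract the corresponding text, and save it.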
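For the saving step, it can help to name each file after its section title rather than a bare number. Some menu titles contain characters such as "/" (for example 'jQuery Get/Post') that are not valid in file names, so they need cleaning first. The following is a small sketch under that assumption; safe_filename and save_text are hypothetical helpers, and the naming scheme is only one possible choice.

```python
import os
import re


def safe_filename(title, index, extension='.txt'):
    """Build a file name such as '03_jQuery_语法.txt' from a menu title."""
    # Collapse path separators, reserved characters and whitespace into '_'.
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', '_', title.strip()).strip('_')
    return '%02d_%s%s' % (index, cleaned or 'untitled', extension)


def save_text(text, title, index, directory='.'):
    """Write the extracted page text as UTF-8 and return the path used."""
    path = os.path.join(directory, safe_filename(title, index))
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return path


if __name__ == '__main__':
    print(safe_filename('jQuery Get/Post', 27))
```

Keeping the numeric prefix preserves the menu order when the files are listed in a directory.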

The complete code is therefore as follows:

import urllib.request
from bs4 import BeautifulSoup


def head():
    # Send a browser-like User-Agent with every request.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }
    return headers


def parse_url(url):
    # Fetch a page and decode the response bytes as GB2312.
    hea = head()
    request = urllib.request.Request(url, headers=hea)
    html = urllib.request.urlopen(request).read().decode('gb2312')
    return html


def url_s():
    # Collect every link, plus the menu text, from the element with id="course".
    url = 'http://www.w3school.com.cn/jquery/index.asp'
    html = parse_url(url)
    soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
    me = soup.find_all(id='course')
    m_url_text = []
    m_url = []
    for link in me:
        m_url_text.extend(link.text.split('\n'))  # one menu entry per line
        for a in link.find_all('a'):
            m_url.append(a.get('href'))
    return m_url, m_url_text


def xml():
    # Turn the relative links into absolute URLs by string concatenation.
    # Each href already starts with '/', so the result contains '//jquery/...'.
    url, url_text = url_s()
    url_jque = []
    for link in url:
        url_jque.append('http://www.w3school.com.cn/' + link)
    return url_jque


def xiazai():
    # "xiazai" is pinyin for "download": save each page's main text to <index>.txt.
    urls = xml()
    for i, url in enumerate(urls):
        html = parse_url(url)
        soup = BeautifulSoup(html, 'html.parser')
        me = soup.find_all(id='maincontent')
        with open('%s.txt' % i, 'wb') as f:
            for h in me:
                f.write(h.text.encode('utf-8'))
        print(i)  # progress: index of the page just saved


if __name__ == '__main__':
    xiazai()

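Result: running the script writes one numbered text file per page (0.txt, 1.txt, and so on) into the current working directory.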

That completes the scraping work. What remains is minor touch-ups; the main pieces are all in place.
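One such touch-up could be pausing between requests and retrying a page that fails, so that a single network error does not abort the whole run. The sketch below is one possible approach, not part of the original script: fetch_all is a hypothetical helper, and the delay and retry counts are arbitrary defaults. It accepts any fetch function, so the article's parse_url can be passed in unchanged.

```python
import time


def fetch_all(urls, fetch, delay=1.0, retries=2, sleep=time.sleep):
    """Fetch each URL in turn, pausing after every attempt and retrying failures.

    `fetch` is any callable that takes a URL and returns the page text.
    Returns (pages, failed): a dict of url -> text, and a list of URLs given up on.
    """
    pages = {}
    failed = []
    for url in urls:
        for attempt in range(retries + 1):
            try:
                pages[url] = fetch(url)
            except (OSError, UnicodeDecodeError):
                # urllib.error.URLError is a subclass of OSError.
                if attempt == retries:
                    failed.append(url)
            else:
                break  # success: move on to the next URL
            finally:
                sleep(delay)  # be polite to the server after every attempt
    return pages, failed
```

In the article's script this could be called as pages, failed = fetch_all(xml(), parse_url), after which xiazai would parse the text in pages instead of fetching each URL itself.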

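Writing a crawler in Python is actually quite simple: as long as we can analyze a site's elements and find the pattern they all share, we can work through the problem and solve it.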
