首頁  >  文章  >  後端開發  >  對知乎內容使用爬蟲爬取數據,為什麼會遇到403問題?

對知乎內容使用爬蟲爬取數據,為什麼會遇到403問題?

WBOY
WBOY原創
2016-08-17 10:01:254290瀏覽

我想抓取知乎上用戶的關注信息,如查看A關注了哪些人,通過www.zhihu.com/people/XXX/followees這個頁面來獲得followee的列表,但是在抓取中遇到了403問題。
1.爬蟲只是為了蒐集用戶關注信息,用於學術研究,絕非商業或其他目的
2.使用PHP,利用curl構造請求,使用simple_html_dom來解析文檔
3.在用戶的關注者(Followees)列表,應該是使用Ajax進行動態加載更多的followees,於是我想直接爬接口的數據,通過firebug查看到,加載更多的關注者似乎是通過zhihu.com/ node/ProfileF 進行的,並且post的資料有_xsrf,method,parmas,於是我在模擬保持登入的情況下,對這個連結提交請求,並帶有post過去的所需要的參數,但是回傳的是403。
4.但是我同樣模擬登入的情況下,可以解析到如讚同數、感謝數這些不需要Ajax的資料
5.我使用curl_setopt($ch, CURLOPT_HTTPHEADER, $header );來設定請求頭,使其與我在瀏覽器中提交的請求的請求頭一致,但是這樣任然導致403錯誤
6.我嘗試打印出curl的請求頭與瀏覽器發出的請求頭進行比較,但是沒有找到正確的方式(百度出的curl_getinfo()似乎打印出的相應報文)
7.有許多人曾因為沒有設定User-Agent或X-Requested-With遭遇403,但是我在5中描述設定請求頭時都設定了
8 .如果敘述不詳需要貼出代碼,我可以貼出代碼
9.這個爬蟲是我畢設的一部分,需要獲取數據來進行接下來的工作,如1所說,爬取數據純粹是為了學術研究

回覆內容:

如果有防火牆功能的伺服器,連續抓取可能會被幹掉,除非你有很多代理伺服器。或最簡單用adsl不斷重新撥號更換ip 你先找個瀏覽器,研究一下request的HTTP Header再來抓 這兩天剛好做了一個抓取用戶的關注著和追隨者的的爬蟲在抓數據,使用的是Python。這裡給你一段python的程式碼,你可以對著程式碼看一下你的程式碼問題。
403應該就是請求的時候一些數據發錯了,下面的代碼中涉及到一個打開的文本,文本中的內容是用戶的id,文本裡面的內容樣式我截了圖放在最後面。
<code class="language-python3"><span class="c">#encoding=utf8</span>
<span class="kn">import</span> <span class="nn">urllib2</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">from</span> <span class="nn">bs4</span> <span class="k">import</span> <span class="n">BeautifulSoup</span>

<span class="n">Default_Header</span> <span class="o">=</span> <span class="p">{</span><span class="s">'X-Requested-With'</span><span class="p">:</span> <span class="s">'XMLHttpRequest'</span><span class="p">,</span>
                  <span class="s">'Referer'</span><span class="p">:</span> <span class="s">'http://www.zhihu.com'</span><span class="p">,</span>
                  <span class="s">'User-Agent'</span><span class="p">:</span> <span class="s">'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; '</span>
                                <span class="s">'rv:39.0) Gecko/20100101 Firefox/39.0'</span><span class="p">,</span>
                  <span class="s">'Host'</span><span class="p">:</span> <span class="s">'www.zhihu.com'</span><span class="p">}</span>
<span class="n">_session</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">session</span><span class="p">()</span>
<span class="n">_session</span><span class="o">.</span><span class="n">headers</span><span class="o">.</span><span class="n">update</span><span class="p">(</span><span class="n">Default_Header</span><span class="p">)</span> 
<span class="n">resourceFile</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">'/root/Desktop/UserId.text'</span><span class="p">,</span><span class="s">'r'</span><span class="p">)</span>
<span class="n">resourceLines</span> <span class="o">=</span> <span class="n">resourceFile</span><span class="o">.</span><span class="n">readlines</span><span class="p">()</span>
<span class="n">resultFollowerFile</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">'/root/Desktop/userIdFollowees.text'</span><span class="p">,</span><span class="s">'a+'</span><span class="p">)</span>
<span class="n">resultFolloweeFile</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">'/root/Desktop/userIdFollowers.text'</span><span class="p">,</span><span class="s">'a+'</span><span class="p">)</span>

<span class="n">BASE_URL</span> <span class="o">=</span> <span class="s">'https://www.zhihu.com/'</span>
<span class="n">CAPTURE_URL</span> <span class="o">=</span> <span class="n">BASE_URL</span><span class="o">+</span><span class="s">'captcha.gif?r=1466595391805&type=login'</span>
<span class="n">PHONE_LOGIN</span> <span class="o">=</span> <span class="n">BASE_URL</span> <span class="o">+</span> <span class="s">'login/phone_num'</span>

<span class="k">def</span> <span class="nf">login</span><span class="p">():</span>
    <span class="sd">'''登录知乎'''</span>
    <span class="n">username</span> <span class="o">=</span> <span class="s">''</span><span class="c">#用户名</span>
    <span class="n">password</span> <span class="o">=</span> <span class="s">''</span><span class="c">#密码,注意我这里用的是手机号登录,用邮箱登录需要改一下下面登录地址</span>
    <span class="n">cap_content</span> <span class="o">=</span> <span class="n">urllib2</span><span class="o">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">CAPTURE_URL</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
    <span class="n">cap_file</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">'/root/Desktop/cap.gif'</span><span class="p">,</span><span class="s">'wb'</span><span class="p">)</span>
    <span class="n">cap_file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">cap_content</span><span class="p">)</span>
    <span class="n">cap_file</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
    <span class="n">captcha</span> <span class="o">=</span> <span class="n">raw_input</span><span class="p">(</span><span class="s">'capture:'</span><span class="p">)</span>
    <span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="s">"phone_num"</span><span class="p">:</span><span class="n">username</span><span class="p">,</span><span class="s">"password"</span><span class="p">:</span><span class="n">password</span><span class="p">,</span><span class="s">"captcha"</span><span class="p">:</span><span class="n">captcha</span><span class="p">}</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_session</span><span class="o">.</span><span class="n">post</span><span class="p">(</span><span class="n">PHONE_LOGIN</span><span class="p">,</span> <span class="n">data</span><span class="p">)</span>
    <span class="nb">print</span> <span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">json</span><span class="p">())[</span><span class="s">'msg'</span><span class="p">]</span>
    
<span class="k">def</span> <span class="nf">readFollowerNumbers</span><span class="p">(</span><span class="n">followerId</span><span class="p">,</span><span class="n">followType</span><span class="p">):</span>
    <span class="sd">'''读取每一位用户的关注者和追随者,根据type进行判断'''</span>
    <span class="nb">print</span> <span class="n">followerId</span>
    <span class="n">personUrl</span> <span class="o">=</span> <span class="s">'https://www.zhihu.com/people/'</span> <span class="o">+</span> <span class="n">followerId</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
    <span class="n">xsrf</span> <span class="o">=</span><span class="n">getXsrf</span><span class="p">()</span>
    <span class="n">hash_id</span> <span class="o">=</span> <span class="n">getHashId</span><span class="p">(</span><span class="n">personUrl</span><span class="p">)</span>
    <span class="n">headers</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">Default_Header</span><span class="p">)</span>
    <span class="n">headers</span><span class="p">[</span><span class="s">'Referer'</span><span class="p">]</span><span class="o">=</span> <span class="n">personUrl</span> <span class="o">+</span> <span class="s">'/follow'</span><span class="o">+</span><span class="n">followType</span>
    <span class="n">followerUrl</span> <span class="o">=</span> <span class="s">'https://www.zhihu.com/node/ProfileFollow'</span><span class="o">+</span><span class="n">followType</span><span class="o">+</span><span class="s">'ListV2'</span>
    <span class="n">params</span> <span class="o">=</span> <span class="p">{</span><span class="s">"offset"</span><span class="p">:</span><span class="mi">0</span><span class="p">,</span><span class="s">"order_by"</span><span class="p">:</span><span class="s">"created"</span><span class="p">,</span><span class="s">"hash_id"</span><span class="p">:</span><span class="n">hash_id</span><span class="p">}</span>
    <span class="n">params_encode</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">params</span><span class="p">)</span>
    <span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="s">"method"</span><span class="p">:</span><span class="s">"next"</span><span class="p">,</span><span class="s">"params"</span><span class="p">:</span><span class="n">params_encode</span><span class="p">,</span><span class="s">'_xsrf'</span><span class="p">:</span><span class="n">xsrf</span><span class="p">}</span>
    
    <span class="n">signIndex</span> <span class="o">=</span> <span class="mi">20</span>
    <span class="n">offset</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">while</span> <span class="n">signIndex</span> <span class="o">==</span> <span class="mi">20</span><span class="p">:</span>
        <span class="n">params</span><span class="p">[</span><span class="s">'offset'</span><span class="p">]</span> <span class="o">=</span> <span class="n">offset</span>
        <span class="n">data</span><span class="p">[</span><span class="s">'params'</span><span class="p">]</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">params</span><span class="p">)</span>
        <span class="n">followerUrlJSON</span> <span class="o">=</span> <span class="n">_session</span><span class="o">.</span><span class="n">post</span><span class="p">(</span><span class="n">followerUrl</span><span class="p">,</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">,</span><span class="n">headers</span> <span class="o">=</span> <span class="n">headers</span><span class="p">)</span>
        <span class="n">signIndex</span> <span class="o">=</span> <span class="nb">len</span><span class="p">((</span><span class="n">followerUrlJSON</span><span class="o">.</span><span class="n">json</span><span class="p">())[</span><span class="s">'msg'</span><span class="p">])</span>
        <span class="n">offset</span> <span class="o">=</span> <span class="n">offset</span> <span class="o">+</span> <span class="n">signIndex</span>
        <span class="n">followerHtml</span> <span class="o">=</span>  <span class="p">(</span><span class="n">followerUrlJSON</span><span class="o">.</span><span class="n">json</span><span class="p">())[</span><span class="s">'msg'</span><span class="p">]</span>
        <span class="k">for</span> <span class="n">everHtml</span> <span class="ow">in</span> <span class="n">followerHtml</span><span class="p">:</span>
            <span class="n">everHtmlSoup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">everHtml</span><span class="p">)</span>
            <span class="n">personId</span> <span class="o">=</span>  <span class="n">everHtmlSoup</span><span class="o">.</span><span class="n">a</span><span class="p">[</span><span class="s">'href'</span><span class="p">]</span>
            <span class="n">resultFollowerFile</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">personId</span><span class="o">+</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
            <span class="nb">print</span> <span class="n">personId</span>
            
    
<span class="k">def</span> <span class="nf">getXsrf</span><span class="p">():</span>
    <span class="sd">'''获取用户的xsrf这个是当前用户的'''</span>
    <span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">_session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">BASE_URL</span><span class="p">)</span><span class="o">.</span><span class="n">content</span><span class="p">)</span>
    <span class="n">_xsrf</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'input'</span><span class="p">,</span><span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">'name'</span><span class="p">:</span><span class="s">'_xsrf'</span><span class="p">})[</span><span class="s">'value'</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">_xsrf</span>
    
<span class="k">def</span> <span class="nf">getHashId</span><span class="p">(</span><span class="n">personUrl</span><span class="p">):</span>
    <span class="sd">'''这个是需要抓取的用户的hashid,不是当前登录用户的hashid'''</span>
    <span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">_session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">personUrl</span><span class="p">)</span><span class="o">.</span><span class="n">content</span><span class="p">)</span>
    <span class="n">hashIdText</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'script'</span><span class="p">,</span> <span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">'data-name'</span><span class="p">:</span> <span class="s">'current_people'</span><span class="p">})</span>
    <span class="k">return</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">hashIdText</span><span class="o">.</span><span class="n">text</span><span class="p">)[</span><span class="mi">3</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">login</span><span class="p">()</span>
    <span class="n">followType</span> <span class="o">=</span> <span class="nb">input</span><span class="p">(</span><span class="s">'请配置抓取类别:0-抓取关注了谁 其它-被哪些人关注'</span><span class="p">)</span>
    <span class="n">followType</span> <span class="o">=</span> <span class="s">'ees'</span> <span class="k">if</span> <span class="n">followType</span> <span class="o">==</span> <span class="mi">0</span> <span class="k">else</span> <span class="s">'ers'</span>
    <span class="k">for</span> <span class="n">followerId</span> <span class="ow">in</span> <span class="n">resourceLines</span><span class="p">:</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">readFollowerNumbers</span><span class="p">(</span><span class="n">followerId</span><span class="p">,</span><span class="n">followType</span><span class="p">)</span>
            <span class="n">resultFollowerFile</span><span class="o">.</span><span class="n">flush</span><span class="p">()</span>
        <span class="k">except</span><span class="p">:</span>
            <span class="k">pass</span>
   
<span class="k">if</span> <span class="n">__name__</span><span class="o">==</span><span class="s">'__main__'</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>
</code>
無非就是那些, useragent,referer,token,ooe 覺得可能會是 2 個原因造成的:
  1. 沒帶 cookies
  2. _xsrf 或 hash_id 錯誤
這個問題我來回答下吧,知乎在「_xsrf」這個欄位搞了個小動作,並不是首頁頁面取到的那個_xsrf 的值,而是在登入成功後透過cookie傳回的那個「_xsrf 」的值,所以你需要取得正確的這個值,不然一直會報403錯誤(我是在Post提問時發現的,相信你遇到的問題類似,直接上代碼):

///
// / 知乎提問
///

/// 提問標題
/// 詳細內容
/ //
//遍歷cookie,取得_xsrf 的值
var list = GetAllCookies(cookie);
foreach (var item in list)
{
if (item.Name == "_xsrf")
{
if (item.Name == "_xsrf")
=
x item.Value;
break;
}
}
//發文
var FaTiePostUrl = "
http://www.



zhihu.com/question/addvar


"; ;
。 = nhp.PostResultHtml(FaTiePostUrl, cookie, "http://www.zhihu.com/", FaTiePostStr);
}


///
/// 遍歷CsumContainer
///
/// 遍歷CsumContainer
///
///
public static List GetAllCookies(CookieContainer cc)
{
List lstCookies = new Listookies = new List ();

Hashtable table = (Hashtable)cc.GetType().InvokeMember("m_domainTable",
System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.GetFieldp. Instance, null, cc, new object[] { });

foreach (object pathList in table.Values)
{
SortedList lstCookieCol = (SortedList)pathList.GetType().InvokeMember("m_Col = (SortedList)pathList.GetType().InvokeMember("m_p." .BindingFlags.NonPublic | System.Reflection.BindingFlags.GetField
| System.Reflection.BindingFlags.Instance, null, pathList, new object[] { });
foreach (CookieCollection colCookie inlst colCookies colCookies) lstCookies.Add(c);
}
return lstCookies;🎜 } 修改header的X-Forwarded-For字段偽裝ip 真的很巧,昨天晚上才剛遇到了這個問題。原因可能有很多,我只說我遇到的,僅供參考,提供一個想法。我爬取的是新浪微博,使用了代理。出現403是因為造訪時網站拒絕,我在瀏覽器上操作也是一樣,隨便看裡面幾個網頁就會出現403,不過刷新幾次就好了。在程式碼中實作就是多請求幾次。 看了樓上的答案,瞬間被鎮住了。大牛真多,不過我建議題主去問李開復好了~哈哈 話說接口是怎麼抓到的...為何我用firebug抓不到接口..chrome的network也抓不到接口🎜話說直接請求followees也可以直接獲取到,剩下的也就是正則了
陳述:
本文內容由網友自願投稿,版權歸原作者所有。本站不承擔相應的法律責任。如發現涉嫌抄襲或侵權的內容,請聯絡admin@php.cn