Home > Article > Backend Development > Why do you encounter 403 problems when using a crawler to crawl data on Zhihu content?
I want to capture the follow information of users on Zhihu, such as checking who A follows, and get the list of followees through the page www.zhihu.com/people/XXX/followees, but I encountered a 403 problem during the capture.
1. The crawler is only for collecting user attention information, for academic research, and is not for commercial or other purposes.
2. Use PHP, use curl to construct requests, and use simple_html_dom to parse documents. 3. In the user's followers (Followees) The list should use Ajax to dynamically load more followees, so I want to directly crawl the interface data. I can see through firebug that loading more followers seems to be done through
http://www.zhihu.com/ node/ProfileFolloweesListV2, and the post data includes _xsrf, method, parmas, so I submitted a request to this link while simulating that I stayed logged in, with the parameters required for the post. , but 403 is returned. 4. But when I also simulate login, I can parse data that does not require Ajax, such as the number of likes and thanks.
5. I use curl_setopt($ch, CURLOPT_HTTPHEADER, $header); to set the request header so that The request headers are consistent with the request headers I submitted in the browser, but this still resulted in a 403 error
6. I tried to print out the curl request headers and compare them with the request headers sent by the browser, but did not find the correct way (Baidu The curl_getinfo() seems to print out the corresponding message)
7. Many people have encountered 403 because User-Agent or X-Requested-With was not set, but I set it when setting the request header as described in 5
8 .If the description is unclear and you need to post the code, I can post the code
9. This crawler is part of my graduation project and I need to obtain data for the next work. As 1 said, crawling data is purely for academic research
Reply content:
<code class="language-python3"><span class="c">#encoding=utf8</span>
<span class="kn">import</span> <span class="nn">urllib2</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">from</span> <span class="nn">bs4</span> <span class="k">import</span> <span class="n">BeautifulSoup</span>
<span class="n">Default_Header</span> <span class="o">=</span> <span class="p">{</span><span class="s">'X-Requested-With'</span><span class="p">:</span> <span class="s">'XMLHttpRequest'</span><span class="p">,</span>
<span class="s">'Referer'</span><span class="p">:</span> <span class="s">'http://www.zhihu.com'</span><span class="p">,</span>
<span class="s">'User-Agent'</span><span class="p">:</span> <span class="s">'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; '</span>
<span class="s">'rv:39.0) Gecko/20100101 Firefox/39.0'</span><span class="p">,</span>
<span class="s">'Host'</span><span class="p">:</span> <span class="s">'www.zhihu.com'</span><span class="p">}</span>
<span class="n">_session</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">session</span><span class="p">()</span>
<span class="n">_session</span><span class="o">.</span><span class="n">headers</span><span class="o">.</span><span class="n">update</span><span class="p">(</span><span class="n">Default_Header</span><span class="p">)</span>
<span class="n">resourceFile</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">'/root/Desktop/UserId.text'</span><span class="p">,</span><span class="s">'r'</span><span class="p">)</span>
<span class="n">resourceLines</span> <span class="o">=</span> <span class="n">resourceFile</span><span class="o">.</span><span class="n">readlines</span><span class="p">()</span>
<span class="n">resultFollowerFile</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">'/root/Desktop/userIdFollowees.text'</span><span class="p">,</span><span class="s">'a+'</span><span class="p">)</span>
<span class="n">resultFolloweeFile</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">'/root/Desktop/userIdFollowers.text'</span><span class="p">,</span><span class="s">'a+'</span><span class="p">)</span>
<span class="n">BASE_URL</span> <span class="o">=</span> <span class="s">'https://www.zhihu.com/'</span>
<span class="n">CAPTURE_URL</span> <span class="o">=</span> <span class="n">BASE_URL</span><span class="o">+</span><span class="s">'captcha.gif?r=1466595391805&type=login'</span>
<span class="n">PHONE_LOGIN</span> <span class="o">=</span> <span class="n">BASE_URL</span> <span class="o">+</span> <span class="s">'login/phone_num'</span>
<span class="k">def</span> <span class="nf">login</span><span class="p">():</span>
<span class="sd">'''登录知乎'''</span>
<span class="n">username</span> <span class="o">=</span> <span class="s">''</span><span class="c">#用户名</span>
<span class="n">password</span> <span class="o">=</span> <span class="s">''</span><span class="c">#密码,注意我这里用的是手机号登录,用邮箱登录需要改一下下面登录地址</span>
<span class="n">cap_content</span> <span class="o">=</span> <span class="n">urllib2</span><span class="o">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">CAPTURE_URL</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="n">cap_file</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">'/root/Desktop/cap.gif'</span><span class="p">,</span><span class="s">'wb'</span><span class="p">)</span>
<span class="n">cap_file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">cap_content</span><span class="p">)</span>
<span class="n">cap_file</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="n">captcha</span> <span class="o">=</span> <span class="n">raw_input</span><span class="p">(</span><span class="s">'capture:'</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="s">"phone_num"</span><span class="p">:</span><span class="n">username</span><span class="p">,</span><span class="s">"password"</span><span class="p">:</span><span class="n">password</span><span class="p">,</span><span class="s">"captcha"</span><span class="p">:</span><span class="n">captcha</span><span class="p">}</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">_session</span><span class="o">.</span><span class="n">post</span><span class="p">(</span><span class="n">PHONE_LOGIN</span><span class="p">,</span> <span class="n">data</span><span class="p">)</span>
<span class="nb">print</span> <span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">json</span><span class="p">())[</span><span class="s">'msg'</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">readFollowerNumbers</span><span class="p">(</span><span class="n">followerId</span><span class="p">,</span><span class="n">followType</span><span class="p">):</span>
<span class="sd">'''读取每一位用户的关注者和追随者,根据type进行判断'''</span>
<span class="nb">print</span> <span class="n">followerId</span>
<span class="n">personUrl</span> <span class="o">=</span> <span class="s">'https://www.zhihu.com/people/'</span> <span class="o">+</span> <span class="n">followerId</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
<span class="n">xsrf</span> <span class="o">=</span><span class="n">getXsrf</span><span class="p">()</span>
<span class="n">hash_id</span> <span class="o">=</span> <span class="n">getHashId</span><span class="p">(</span><span class="n">personUrl</span><span class="p">)</span>
<span class="n">headers</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">Default_Header</span><span class="p">)</span>
<span class="n">headers</span><span class="p">[</span><span class="s">'Referer'</span><span class="p">]</span><span class="o">=</span> <span class="n">personUrl</span> <span class="o">+</span> <span class="s">'/follow'</span><span class="o">+</span><span class="n">followType</span>
<span class="n">followerUrl</span> <span class="o">=</span> <span class="s">'https://www.zhihu.com/node/ProfileFollow'</span><span class="o">+</span><span class="n">followType</span><span class="o">+</span><span class="s">'ListV2'</span>
<span class="n">params</span> <span class="o">=</span> <span class="p">{</span><span class="s">"offset"</span><span class="p">:</span><span class="mi">0</span><span class="p">,</span><span class="s">"order_by"</span><span class="p">:</span><span class="s">"created"</span><span class="p">,</span><span class="s">"hash_id"</span><span class="p">:</span><span class="n">hash_id</span><span class="p">}</span>
<span class="n">params_encode</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">params</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="s">"method"</span><span class="p">:</span><span class="s">"next"</span><span class="p">,</span><span class="s">"params"</span><span class="p">:</span><span class="n">params_encode</span><span class="p">,</span><span class="s">'_xsrf'</span><span class="p">:</span><span class="n">xsrf</span><span class="p">}</span>
<span class="n">signIndex</span> <span class="o">=</span> <span class="mi">20</span>
<span class="n">offset</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">while</span> <span class="n">signIndex</span> <span class="o">==</span> <span class="mi">20</span><span class="p">:</span>
<span class="n">params</span><span class="p">[</span><span class="s">'offset'</span><span class="p">]</span> <span class="o">=</span> <span class="n">offset</span>
<span class="n">data</span><span class="p">[</span><span class="s">'params'</span><span class="p">]</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">params</span><span class="p">)</span>
<span class="n">followerUrlJSON</span> <span class="o">=</span> <span class="n">_session</span><span class="o">.</span><span class="n">post</span><span class="p">(</span><span class="n">followerUrl</span><span class="p">,</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">,</span><span class="n">headers</span> <span class="o">=</span> <span class="n">headers</span><span class="p">)</span>
<span class="n">signIndex</span> <span class="o">=</span> <span class="nb">len</span><span class="p">((</span><span class="n">followerUrlJSON</span><span class="o">.</span><span class="n">json</span><span class="p">())[</span><span class="s">'msg'</span><span class="p">])</span>
<span class="n">offset</span> <span class="o">=</span> <span class="n">offset</span> <span class="o">+</span> <span class="n">signIndex</span>
<span class="n">followerHtml</span> <span class="o">=</span> <span class="p">(</span><span class="n">followerUrlJSON</span><span class="o">.</span><span class="n">json</span><span class="p">())[</span><span class="s">'msg'</span><span class="p">]</span>
<span class="k">for</span> <span class="n">everHtml</span> <span class="ow">in</span> <span class="n">followerHtml</span><span class="p">:</span>
<span class="n">everHtmlSoup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">everHtml</span><span class="p">)</span>
<span class="n">personId</span> <span class="o">=</span> <span class="n">everHtmlSoup</span><span class="o">.</span><span class="n">a</span><span class="p">[</span><span class="s">'href'</span><span class="p">]</span>
<span class="n">resultFollowerFile</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">personId</span><span class="o">+</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
<span class="nb">print</span> <span class="n">personId</span>
<span class="k">def</span> <span class="nf">getXsrf</span><span class="p">():</span>
<span class="sd">'''获取用户的xsrf这个是当前用户的'''</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">_session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">BASE_URL</span><span class="p">)</span><span class="o">.</span><span class="n">content</span><span class="p">)</span>
<span class="n">_xsrf</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'input'</span><span class="p">,</span><span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">'name'</span><span class="p">:</span><span class="s">'_xsrf'</span><span class="p">})[</span><span class="s">'value'</span><span class="p">]</span>
<span class="k">return</span> <span class="n">_xsrf</span>
<span class="k">def</span> <span class="nf">getHashId</span><span class="p">(</span><span class="n">personUrl</span><span class="p">):</span>
<span class="sd">'''这个是需要抓取的用户的hashid,不是当前登录用户的hashid'''</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">_session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">personUrl</span><span class="p">)</span><span class="o">.</span><span class="n">content</span><span class="p">)</span>
<span class="n">hashIdText</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'script'</span><span class="p">,</span> <span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">'data-name'</span><span class="p">:</span> <span class="s">'current_people'</span><span class="p">})</span>
<span class="k">return</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">hashIdText</span><span class="o">.</span><span class="n">text</span><span class="p">)[</span><span class="mi">3</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
<span class="n">login</span><span class="p">()</span>
<span class="n">followType</span> <span class="o">=</span> <span class="nb">input</span><span class="p">(</span><span class="s">'请配置抓取类别:0-抓取关注了谁 其它-被哪些人关注'</span><span class="p">)</span>
<span class="n">followType</span> <span class="o">=</span> <span class="s">'ees'</span> <span class="k">if</span> <span class="n">followType</span> <span class="o">==</span> <span class="mi">0</span> <span class="k">else</span> <span class="s">'ers'</span>
<span class="k">for</span> <span class="n">followerId</span> <span class="ow">in</span> <span class="n">resourceLines</span><span class="p">:</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">readFollowerNumbers</span><span class="p">(</span><span class="n">followerId</span><span class="p">,</span><span class="n">followType</span><span class="p">)</span>
<span class="n">resultFollowerFile</span><span class="o">.</span><span class="n">flush</span><span class="p">()</span>
<span class="k">except</span><span class="p">:</span>
<span class="k">pass</span>
<span class="k">if</span> <span class="n">__name__</span><span class="o">==</span><span class="s">'__main__'</span><span class="p">:</span>
<span class="n">main</span><span class="p">()</span>
</code>
It’s nothing more than those, useragent, referrer, token, cookie
I think it may be caused by 2 reasons: