
Python crawls Google search results

高洛峰
2016-10-18 10:46:35

Some time ago I studied how to use Python to crawl search engine results. I ran into quite a few problems along the way and have recorded them all here, in the hope that anyone who hits the same issues later won't have to take the same detours.

1. Choosing a search engine

Choosing a good search engine means getting more accurate search results. I have used four: Google, Bing, Baidu, and Yahoo!. As a programmer, Google was my first choice. But when my beloved Google returned me nothing but a pile of js code, with none of the search results I wanted, I moved over to the Bing camp. After using it for a while I found that Bing's results were not ideal for my problem. Just as I was about to despair, Google saved me. It turns out that, to accommodate users whose browsers have js disabled, Google offers another way to search. See the following search URL:

https://www.google.com.hk/search?hl=en&q=hello

hl specifies the search language and q is the keyword you want to search for. Thanks to Google, this search results page contains exactly the content I want to crawl.

PS: Many of the methods found online for crawling Google search results with Python use https://ajax.googleapis.com/ajax/services/search/web... . Note that Google no longer recommends this method; see https://developers.google.com/web-search/docs/. Google now provides a Custom Search API instead, but the API is limited to 100 requests per day; if you need more, you have to pay.
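If the 100-requests-per-day quota is enough for you, a minimal call to the Custom Search JSON API might look like the sketch below. This is not the scraping approach used in the rest of this article; YOUR_API_KEY and YOUR_CX are placeholders you obtain from the Google developer console, and the field names follow the Custom Search JSON response format.

import json
import urllib
import urllib2

def custom_search(query, api_key='YOUR_API_KEY', cx='YOUR_CX'):
    # build the Custom Search request; key and cx are placeholders from your developer console
    params = urllib.urlencode({'key': api_key, 'cx': cx, 'q': query})
    url = 'https://www.googleapis.com/customsearch/v1?%s' % params
    data = json.loads(urllib2.urlopen(url).read())
    # each result item carries a title, a link and a snippet
    return [(item['title'], item['link']) for item in data.get('items', [])]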

2. Python crawls and analyzes web pages

Crawling web pages with Python is very convenient. Without further ado, here is the code:

import urllib2

def search(self, queryStr):
    # URL-encode the query so spaces and special characters are safe in the URL
    queryStr = urllib2.quote(queryStr)
    url = 'https://www.google.com.hk/search?hl=en&q=%s' % queryStr
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    # html now holds the raw source of the search results page
    html = response.read()
    results = self.extractSearchResults(html)

The html variable at the end holds the source code of the search results page we crawled. Anyone who has used Python will know that it provides both the urllib and urllib2 modules, and both deal with URL requests, but they offer different capabilities: urllib only accepts a URL, while urllib2 also accepts an instance of the Request class, which lets you set the headers of the request. That means you can disguise your user agent, among other things (we will use this below).

Now that we can use Python to crawl a web page and save it, we can extract the search results we want from the source code. Python ships with an HTMLParser module, but it is relatively cumbersome to use. Here I recommend BeautifulSoup, a very handy web page parsing package. Its documentation describes its usage in detail, so I won't repeat it here.
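As a rough illustration of what the extractSearchResults method used above might look like with BeautifulSoup, here is a minimal sketch. The assumption that each result title sits inside an h3 tag wrapping a link reflects Google's no-js results page at the time of writing; Google changes its markup regularly, so you may need to adjust the selectors.

from bs4 import BeautifulSoup

def extractSearchResults(self, html):
    # assumption: each result title is wrapped in an <h3> containing an <a href=...>
    results = []
    soup = BeautifulSoup(html, 'html.parser')
    for h3 in soup.find_all('h3'):
        a = h3.find('a')
        if a and a.get('href'):
            results.append({'title': a.get_text(), 'url': a['href']})
    return results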

The code above works fine for a small number of queries, but if you want to run thousands of them it no longer does. Google detects the source of your requests: if we use a machine to fetch Google's search results frequently, Google will soon block our IP and return a 503 Error page. That is not the result we want, so we keep exploring.
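One way to notice the block without letting the script crash is to catch the HTTP error and back off. This is a sketch of my own, not part of the original code; the 60-second pause is an arbitrary choice.

import time
import urllib2

def fetch(url):
    try:
        return urllib2.urlopen(urllib2.Request(url)).read()
    except urllib2.HTTPError as e:
        # Google answers 503 when it decides a machine is hammering it
        if e.code == 503:
            time.sleep(60)    # arbitrary back-off, tune to taste
            return None
        raise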

As mentioned earlier, urllib2 lets us set the headers of a URL request and disguise our user agent. Simply put, the user agent is a special string used by client applications such as browsers; it is sent to the server with every HTTP request a browser (or email client, or search engine spider) makes, so the server knows which browser (email client, search engine spider) the user is accessing it with. Sometimes, to achieve our goal, we have to tell the server a well-intentioned lie: that it is not a machine doing the visiting.

So, our code looks like this:

import random
import urllib2

user_agents = ['Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0',
               'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0',
               'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533+ (KHTML, like Gecko) Element Browser 5.0',
               'IBM WebExplorer /v0.94',
               'Galaxy/1.0 [en] (Mac OS X 10.5.6; U; en)',
               'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
               'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
               'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25',
               'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1468.0 Safari/537.36',
               'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; TheWorld)']

def search(self, queryStr):
    queryStr = urllib2.quote(queryStr)
    url = 'https://www.google.com.hk/search?hl=en&q=%s' % queryStr
    request = urllib2.Request(url)
    # pick one of the ten user agent strings at random and attach it to the request
    index = random.randint(0, 9)
    user_agent = user_agents[index]
    request.add_header('User-agent', user_agent)
    response = urllib2.urlopen(request)
    html = response.read()
    results = self.extractSearchResults(html)

Don't be put off by the user_agents list: it is simply ten user agent strings, which lets us vary our disguise. If you need more user agents, see UserAgentString.

Before sending the request we randomly pick one of the user agent strings, then use the request's add_header method to disguise our user agent.

By disguising the user agent we can keep crawling search engine results. If that still isn't enough, I recommend sleeping for a random interval between every two queries. This slows the crawl down, but it lets you keep fetching results continuously; and if you have multiple IPs available, the crawl speed goes up as well.
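A minimal sketch of the random pause between queries follows. The 10-to-30-second range is just an example value, and searcher stands in for an instance of whatever class holds the search method above.

import random
import time

for queryStr in ['hello', 'world', 'python']:
    results = searcher.search(queryStr)    # 'searcher' is a hypothetical instance of your crawler class
    time.sleep(random.uniform(10, 30))     # random pause so the requests look less machine-like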

