
Python crawls Google search results

高洛峰
2016-10-18 10:46:35

Some time ago I studied how to use Python to crawl search engine results. I ran into quite a few problems along the way and have recorded them all here, in the hope that anyone who hits the same issues later won't have to take the same detours.

1. Choosing a search engine

Choosing a good search engine means getting more accurate search results. I have used four: Google, Bing, Baidu, and Yahoo!. As a programmer, Google was my first choice. But when my beloved Google returned me nothing but a pile of js code, with none of the search results I wanted, I moved over to the Bing camp. After using it for a while I found that Bing's results were not ideal for my problem. Just as I was about to despair, Google saved me. It turns out that, to accommodate users whose browsers have js disabled, Google offers another way to search. See the following search URL:

https://www.google.com.hk/search?hl=en&q=hello

hl specifies the search language and q is the keyword you want to search for. Thanks to Google, this search results page contains exactly the content I want to crawl.

PS: Many of the methods found online for crawling Google search results with Python use https://ajax.googleapis.com/ajax/services/search/web... . Note that Google no longer recommends this method; see https://developers.google.com/web-search/docs/. Google now provides a Custom Search API instead, but the API is limited to 100 requests per day; if you need more, you have to pay.
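If the 100-requests-per-day quota is enough for you, a minimal call to the Custom Search JSON API might look like the sketch below. This is not the scraping approach used in the rest of this article; YOUR_API_KEY and YOUR_CX are placeholders you obtain from the Google developer console, and the field names follow the Custom Search JSON response format.

import json
import urllib
import urllib2

def custom_search(query, api_key='YOUR_API_KEY', cx='YOUR_CX'):
    # build the Custom Search request; key and cx are placeholders from your developer console
    params = urllib.urlencode({'key': api_key, 'cx': cx, 'q': query})
    url = 'https://www.googleapis.com/customsearch/v1?%s' % params
    data = json.loads(urllib2.urlopen(url).read())
    # each result item carries a title, a link and a snippet
    return [(item['title'], item['link']) for item in data.get('items', [])]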

2. Python crawls and analyzes web pages

Crawling web pages with Python is very convenient. Without further ado, here is the code:

import urllib2

def search(self, queryStr):
    # URL-encode the query so spaces and special characters are safe in the URL
    queryStr = urllib2.quote(queryStr)
    url = 'https://www.google.com.hk/search?hl=en&q=%s' % queryStr
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    # html now holds the raw source of the search results page
    html = response.read()
    results = self.extractSearchResults(html)

The html variable at the end holds the source code of the search results page we crawled. Anyone who has used Python will know that it provides both the urllib and urllib2 modules, and both deal with URL requests, but they offer different capabilities: urllib only accepts a URL, while urllib2 also accepts an instance of the Request class, which lets you set the headers of the request. That means you can disguise your user agent, among other things (we will use this below).

Now that we can use Python to crawl a web page and save it, we can extract the search results we want from the source code. Python ships with an HTMLParser module, but it is relatively cumbersome to use. Here I recommend BeautifulSoup, a very handy web page parsing package. Its documentation describes its usage in detail, so I won't repeat it here.
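As a rough illustration of what the extractSearchResults method used above might look like with BeautifulSoup, here is a minimal sketch. The assumption that each result title sits inside an h3 tag wrapping a link reflects Google's no-js results page at the time of writing; Google changes its markup regularly, so you may need to adjust the selectors.

from bs4 import BeautifulSoup

def extractSearchResults(self, html):
    # assumption: each result title is wrapped in an <h3> containing an <a href=...>
    results = []
    soup = BeautifulSoup(html, 'html.parser')
    for h3 in soup.find_all('h3'):
        a = h3.find('a')
        if a and a.get('href'):
            results.append({'title': a.get_text(), 'url': a['href']})
    return results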

The code above works fine for a small number of queries, but if you want to run thousands of them it no longer does. Google detects the source of your requests: if we use a machine to fetch Google's search results frequently, Google will soon block our IP and return a 503 Error page. That is not the result we want, so we keep exploring.
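One way to notice the block without letting the script crash is to catch the HTTP error and back off. This is a sketch of my own, not part of the original code; the 60-second pause is an arbitrary choice.

import time
import urllib2

def fetch(url):
    try:
        return urllib2.urlopen(urllib2.Request(url)).read()
    except urllib2.HTTPError as e:
        # Google answers 503 when it decides a machine is hammering it
        if e.code == 503:
            time.sleep(60)    # arbitrary back-off, tune to taste
            return None
        raise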

As mentioned earlier, urllib2 lets us set the headers of a URL request and disguise our user agent. Simply put, the user agent is a special string used by client applications such as browsers; it is sent to the server with every HTTP request a browser (or email client, or search engine spider) makes, so the server knows which browser (email client, search engine spider) the user is accessing it with. Sometimes, to achieve our goal, we have to tell the server a well-intentioned lie: that it is not a machine doing the visiting.

So, our code looks like this:

import random
import urllib2

user_agents = ['Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0',
               'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0',
               'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533+ (KHTML, like Gecko) Element Browser 5.0',
               'IBM WebExplorer /v0.94',
               'Galaxy/1.0 [en] (Mac OS X 10.5.6; U; en)',
               'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
               'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
               'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25',
               'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1468.0 Safari/537.36',
               'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; TheWorld)']

def search(self, queryStr):
    queryStr = urllib2.quote(queryStr)
    url = 'https://www.google.com.hk/search?hl=en&q=%s' % queryStr
    request = urllib2.Request(url)
    # pick one of the ten user agent strings at random and attach it to the request
    index = random.randint(0, 9)
    user_agent = user_agents[index]
    request.add_header('User-agent', user_agent)
    response = urllib2.urlopen(request)
    html = response.read()
    results = self.extractSearchResults(html)

Don't be put off by the user_agents list: it is simply ten user agent strings, which lets us vary our disguise. If you need more user agents, see UserAgentString.

Before sending the request we randomly pick one of the user agent strings, then use the request's add_header method to disguise our user agent.

By disguising the user agent we can keep crawling search engine results. If that still isn't enough, I recommend sleeping for a random interval between every two queries. This slows the crawl down, but it lets you keep fetching results continuously; and if you have multiple IPs available, the crawl speed goes up as well.
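A minimal sketch of the random pause between queries follows. The 10-to-30-second range is just an example value, and searcher stands in for an instance of whatever class holds the search method above.

import random
import time

for queryStr in ['hello', 'world', 'python']:
    results = searcher.search(queryStr)    # 'searcher' is a hypothetical instance of your crawler class
    time.sleep(random.uniform(10, 30))     # random pause so the requests look less machine-like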

