search
HomeBackend DevelopmentPython Tutorial编写Python脚本来获取Google搜索结果的示例

前一段时间一直在研究如何用python抓取搜索引擎结果,在实现的过程中遇到了很多的问题,我把我遇到的问题都记录下来,希望以后遇到同样问题的童鞋不要再走弯路。

1. 搜索引擎的选取

  选择一个好的搜索引擎意味着你能够得到更准确的搜索结果。我用过的搜索引擎有四种:Google、Bing、Baidu、Yahoo!。 作为程序员,我首选Google。但当我看见我最爱的Google返回给我的全是一堆的js代码,根本没我想要的搜索结果。于是我转而投向了Bing的阵营,在用过一段时间后我发现Bing返回的搜索结果对于我的问题来说不太理想。正当我要绝望时,Google拯救了我。原来Google为了照顾那些禁止浏览器使用js的用户,还有另外一种搜索方式,请看下面的搜索URL:

https://www.google.com.hk/search?hl=en&q=hello

  hl指定要搜索的语言,q就是你要搜索的关键字。 好了,感谢Google,搜索结果页面包含我要抓取的内容。

  PS: 网上很多利用python抓取Google搜索结果还是利用 https://ajax.googleapis.com/ajax/services/search/web... 的方法。需要注意的是这个方法Google已经不再推荐使用了,见 https://developers.google.com/web-search/docs/ 。Google现在提供了Custom Search API, 不过API限制每天100次请求,如果需要更多则只能花钱买。

2. Python抓取并分析网页

  利用Python抓取网页很方便,不多说,见代码:

def search(self, queryStr):
   queryStr = urllib2.quote(queryStr)
   url = 'https://www.google.com.hk/search?hl=en&q=%s' % queryStr
   request = urllib2.Request(url)
   response = urllib2.urlopen(request)
   html = response.read()
   results = self.extractSearchResults(html)


  第6行的 html 就是我们抓取的搜索结果页面源码。使用过Python的同学会发现,Python同时提供了urllib 和 urllib2两个模块,都是和URL请求相关的模块,不过提供了不同的功能,urllib只可以接收URL,而urllib2可以接受一个Request类的实例来设置URL请求的headers,这意味着你可以伪装你的user agent 等(下面会用到)。

  现在我们已经可以用Python抓取网页并保存下来,接下来我们就可以从源码页面中抽取我们想要的搜索结果。Python提供了htmlparser模块,不过用起来相对比较麻烦,这里推荐一个很好用的网页分析包BeautifulSoup,关于BeautifulSoup的用法官网有详细的介绍,这里我不再多说。

  利用上面的代码,对于少量的查询还比较OK,但如果要进行上千上万次的查询,上面的方法就不再有效了, Google会检测你请求的来源,如果我们利用机器频繁爬取Google的搜索结果,不多久就Google会block你的IP,并给你返回503 Error页面。这不是我们想要的结果,于是我们还要继续探索

  前面提到利用urllib2我们可以设置URL请求的headers,  伪装我们的user agent。简单的说,user agent就是客户端浏览器等应用程序使用的一种特殊的网络协议, 在每次浏览器(邮件客户端/搜索引擎蜘蛛)进行 HTTP 请求时发送到服务器,服务器就知道了用户是使用什么浏览器(邮件客户端/搜索引擎蜘蛛)来访问的。 有时候为了达到一些目的,我们不得不去善意的欺骗服务器告诉它我不是在用机器访问你。

  于是,我们的代码就成了下面这个样子:

user_agents = ['Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0', \
     'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0', \
     'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533+ \
     (KHTML, like Gecko) Element Browser 5.0', \
     'IBM WebExplorer /v0.94', 'Galaxy/1.0 [en] (Mac OS X 10.5.6; U; en)', \
     'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)', \
     'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14', \
     'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) \
     Version/6.0 Mobile/10A5355d Safari/8536.25', \
     'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) \
     Chrome/28.0.1468.0 Safari/537.36', \
     'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; TheWorld)']
 def search(self, queryStr):
   queryStr = urllib2.quote(queryStr)
   url = 'https://www.google.com.hk/search?hl=en&q=%s' % queryStr
   request = urllib2.Request(url)
   index = random.randint(0, 9)
   user_agent = user_agents[index]
   request.add_header('User-agent', user_agent)
   response = urllib2.urlopen(request)
   html = response.read()
   results = self.extractSearchResults(html)


  不要被user_agents那个list吓到,那其实就是10个user agent 字符串,这么做是让我们伪装的更好一些,如果你需要更多的user agent 请看这里 UserAgentString。

17-19行表示随机选择一个user agent 字符串,然后用request 的add_header方法伪装一个user agent。

  通过伪装user agent能够让我们持续抓取搜索引擎结果,如果这样还不行,那我建议在每两次查询间随机休眠一段时间,这样会影响抓取速度,但是能够让你更持续的抓取结果,如果你有多个IP,那抓取的速度也就上来了。

  github上有本文所有源代码,需要的同学可从下面的网址下载:

https://github.com/meibenjin/GoogleSearchCrawler

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Learning Python: Is 2 Hours of Daily Study Sufficient?Learning Python: Is 2 Hours of Daily Study Sufficient?Apr 18, 2025 am 12:22 AM

Is it enough to learn Python for two hours a day? It depends on your goals and learning methods. 1) Develop a clear learning plan, 2) Select appropriate learning resources and methods, 3) Practice and review and consolidate hands-on practice and review and consolidate, and you can gradually master the basic knowledge and advanced functions of Python during this period.

Python for Web Development: Key ApplicationsPython for Web Development: Key ApplicationsApr 18, 2025 am 12:20 AM

Key applications of Python in web development include the use of Django and Flask frameworks, API development, data analysis and visualization, machine learning and AI, and performance optimization. 1. Django and Flask framework: Django is suitable for rapid development of complex applications, and Flask is suitable for small or highly customized projects. 2. API development: Use Flask or DjangoRESTFramework to build RESTfulAPI. 3. Data analysis and visualization: Use Python to process data and display it through the web interface. 4. Machine Learning and AI: Python is used to build intelligent web applications. 5. Performance optimization: optimized through asynchronous programming, caching and code

Python vs. C  : Exploring Performance and EfficiencyPython vs. C : Exploring Performance and EfficiencyApr 18, 2025 am 12:20 AM

Python is better than C in development efficiency, but C is higher in execution performance. 1. Python's concise syntax and rich libraries improve development efficiency. 2.C's compilation-type characteristics and hardware control improve execution performance. When making a choice, you need to weigh the development speed and execution efficiency based on project needs.

Python in Action: Real-World ExamplesPython in Action: Real-World ExamplesApr 18, 2025 am 12:18 AM

Python's real-world applications include data analytics, web development, artificial intelligence and automation. 1) In data analysis, Python uses Pandas and Matplotlib to process and visualize data. 2) In web development, Django and Flask frameworks simplify the creation of web applications. 3) In the field of artificial intelligence, TensorFlow and PyTorch are used to build and train models. 4) In terms of automation, Python scripts can be used for tasks such as copying files.

Python's Main Uses: A Comprehensive OverviewPython's Main Uses: A Comprehensive OverviewApr 18, 2025 am 12:18 AM

Python is widely used in data science, web development and automation scripting fields. 1) In data science, Python simplifies data processing and analysis through libraries such as NumPy and Pandas. 2) In web development, the Django and Flask frameworks enable developers to quickly build applications. 3) In automated scripts, Python's simplicity and standard library make it ideal.

The Main Purpose of Python: Flexibility and Ease of UseThe Main Purpose of Python: Flexibility and Ease of UseApr 17, 2025 am 12:14 AM

Python's flexibility is reflected in multi-paradigm support and dynamic type systems, while ease of use comes from a simple syntax and rich standard library. 1. Flexibility: Supports object-oriented, functional and procedural programming, and dynamic type systems improve development efficiency. 2. Ease of use: The grammar is close to natural language, the standard library covers a wide range of functions, and simplifies the development process.

Python: The Power of Versatile ProgrammingPython: The Power of Versatile ProgrammingApr 17, 2025 am 12:09 AM

Python is highly favored for its simplicity and power, suitable for all needs from beginners to advanced developers. Its versatility is reflected in: 1) Easy to learn and use, simple syntax; 2) Rich libraries and frameworks, such as NumPy, Pandas, etc.; 3) Cross-platform support, which can be run on a variety of operating systems; 4) Suitable for scripting and automation tasks to improve work efficiency.

Learning Python in 2 Hours a Day: A Practical GuideLearning Python in 2 Hours a Day: A Practical GuideApr 17, 2025 am 12:05 AM

Yes, learn Python in two hours a day. 1. Develop a reasonable study plan, 2. Select the right learning resources, 3. Consolidate the knowledge learned through practice. These steps can help you master Python in a short time.

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
1 months agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Best Graphic Settings
1 months agoBy尊渡假赌尊渡假赌尊渡假赌
Will R.E.P.O. Have Crossplay?
1 months agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

WebStorm Mac version

WebStorm Mac version

Useful JavaScript development tools

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

Atom editor mac version download

Atom editor mac version download

The most popular open source editor

DVWA

DVWA

Damn Vulnerable Web App (DVWA) is a PHP/MySQL web application that is very vulnerable. Its main goals are to be an aid for security professionals to test their skills and tools in a legal environment, to help web developers better understand the process of securing web applications, and to help teachers/students teach/learn in a classroom environment Web application security. The goal of DVWA is to practice some of the most common web vulnerabilities through a simple and straightforward interface, with varying degrees of difficulty. Please note that this software

Safe Exam Browser

Safe Exam Browser

Safe Exam Browser is a secure browser environment for taking online exams securely. This software turns any computer into a secure workstation. It controls access to any utility and prevents students from using unauthorized resources.