With the development of the Internet, we increasingly rely on search engines to obtain information. However, access to search engines such as Google is blocked or restricted in some countries and regions, which makes it difficult to obtain information. In such cases, a Google mirror site can be used instead. This article introduces how to use Scrapy to obtain Google mirror page data.
1. What is Google Mirroring
A Google mirror is a website that stores or proxies Google search results so that users can access them directly. By visiting such a site, users can get the same search results as they would from Google itself. These mirror sites are typically set up voluntarily by individuals or groups and usually have no official connection with Google.
2. Preparation work
Before using Scrapy to crawl data, we need to do some preparation. First, make sure that Python and the Scrapy framework are installed on your system. Second, we need the address of a Google mirror website. These mirror addresses tend to change frequently, so you may need to look up a current one. Here we take the website "https://g.cactus.tw/" as an example.
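If Scrapy is not installed yet, it can normally be installed with pip. The following commands are a minimal check, assuming a working Python 3 environment:
$ python --version
$ pip install scrapy
$ scrapy version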
3. Create a Scrapy project
After ensuring that the system environment and website address are ready, we can quickly create a Scrapy project through the Scrapy command line tool. The specific operations are as follows:
$ scrapy startproject google_mirror
This will create a project directory named google_mirror in the current directory. The directory structure is as follows:
google_mirror/
    scrapy.cfg
    google_mirror/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
Here, scrapy.cfg is the Scrapy project configuration file, and the inner google_mirror directory is the project's Python package. items.py, middlewares.py, pipelines.py and settings.py are core Scrapy files used, respectively, to define data models, write middleware, write item pipelines, and configure Scrapy's parameters. The spiders directory is where we write the crawler code.
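For reference, items.py could define an item class that mirrors the fields the spider extracts. This is an optional, hypothetical GoogleResultItem; the spider in this article simply yields plain dicts, so this step can be skipped:
# google_mirror/items.py (optional sketch)
import scrapy

class GoogleResultItem(scrapy.Item):
    # Fields matching the data extracted by the spider
    title = scrapy.Field()
    url = scrapy.Field()
    summary = scrapy.Field()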
4. Write crawler code
In the project directory, we can quickly create a Scrapy crawler through the command line tool. The specific operations are as follows:
$ cd google_mirror
$ scrapy genspider google g.cactus.tw
This will create a crawler named google in the spiders directory. We can write our crawling code in this crawler. The specific code is as follows:
import scrapy


class GoogleSpider(scrapy.Spider):
    name = 'google'
    allowed_domains = ['g.cactus.tw']
    start_urls = ['https://g.cactus.tw/search']

    def parse(self, response):
        # Each search result is rendered inside a div with class "g"
        results = response.css('div.g')
        for result in results:
            title = result.css('a::text').get()
            url = result.css('a::attr(href)').get()
            summary = result.css('div:nth-child(2) > div > div:nth-child(2) > span::text').get()
            yield {
                'title': title,
                'url': url,
                'summary': summary,
            }
This crawler requests the https://g.cactus.tw/search page and then extracts the title, URL and summary of each search result. We use the CSS selectors provided by Scrapy to locate the page elements. Note that start_urls here does not carry a search keyword; one way to add one is sketched below.
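As a minimal sketch, assuming the mirror accepts Google-style q= query parameters (an assumption that may differ per mirror site), the spider could override start_requests() to search for a specific keyword:
import scrapy


class GoogleKeywordSpider(scrapy.Spider):
    name = 'google_keyword'
    allowed_domains = ['g.cactus.tw']

    def start_requests(self):
        # Hypothetical keyword; the 'q' parameter assumes the mirror
        # follows Google's own query-string format.
        keyword = 'scrapy'
        yield scrapy.Request(
            url=f'https://g.cactus.tw/search?q={keyword}',
            callback=self.parse,
        )

    def parse(self, response):
        for result in response.css('div.g'):
            yield {
                'title': result.css('a::text').get(),
                'url': result.css('a::attr(href)').get(),
                'summary': result.css('span::text').get(),
            }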
5. Run the crawler
After writing the crawler code, we can run the crawler through the following command:
$ scrapy crawl google
Scrapy will execute the crawler code we wrote and output the crawled results. The output looks like this:
{'title': 'Scrapy | An open source web scraping framework for Python',
 'url': 'http://scrapy.org/',
 'summary': "Scrapy is an open source and collaborative web crawling framework for Python. In this post I'm sharing what motivated us to create it, why we think it is important, and what we have planned for the future."}
{'title': 'Scrapinghub: Data Extraction Services, Web Crawling & Scraping',
 'url': 'https://scrapinghub.com/',
 'summary': 'Scrapinghub is a cloud-based data extraction platform that helps companies extract and use data from the web. Our web crawling services are trusted by Fortune 500 companies and startups.'}
{'title': 'GitHub - scrapy/scrapy: Scrapy, a fast high-level web crawling & scraping framework for Python.',
 'url': 'https://github.com/scrapy/scrapy',
 'summary': 'Scrapy, a fast high-level web crawling & scraping framework for Python. - scrapy/scrapy'}
{'title': 'Scrapy Tutorial | Web Scraping Using Scrapy Python - DataCamp',
 'url': 'https://www.datacamp.com/community/tutorials/scraping-websites-scrapy-python',
 'summary': 'This tutorial assumes you already know how to code in Python. Web scraping is an automatic way to extract large amounts of data from websites. Since data on websites is unstructured, web scraping enables us to convert that data into structured form. This tutorial is all about using ...'}
...
Each of these records contains the title, URL and summary of one search result, and they can be processed and analyzed as needed. A simple way to save them to a file is shown below.
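If you want to keep the results for later processing, Scrapy's built-in feed export can write them to a file directly from the command line (the file name here is arbitrary):
$ scrapy crawl google -o results.json
For a single run this produces a JSON array that can be loaded with the standard library, for example:
import json

with open('results.json', encoding='utf-8') as f:
    results = json.load(f)
print(len(results), 'results')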
6. Summary
This article introduced how to use Scrapy to obtain Google mirror page data. We first looked at what a Google mirror is, and then wrote a Scrapy crawler to collect search result data. With Python and the Scrapy framework, large amounts of data can be collected quickly and efficiently. Of course, in practical applications we also need to respect the ethical and legal requirements surrounding data collection.