Scrapy is a Python crawling framework for fetching web pages and extracting structured data from them. In this article, we walk through a Scrapy example that crawls company information on LinkedIn.
First, our target is the company information on LinkedIn, so we need the URL of the page that shows it. Open the LinkedIn website, enter the company name in the search box, and select the "Company" option in the drop-down box to reach the company page. On this page we can see the company's basic information, number of employees, affiliated companies, and other details. Copy the URL of the page from the browser's developer tools for later use. The URL has the following structure:
https://www.linkedin.com/search/results/companies/?keywords=xxx
Here, keywords=xxx is the search keyword; xxx can be replaced with any company name.
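Company names often contain spaces or symbols such as "&", so the keyword should be URL-encoded before it takes the place of xxx. The following is a minimal sketch using only Python's standard library; the helper name build_search_url is hypothetical and is not part of the project created later in this article.

```python
from urllib.parse import urlencode

# Base of the search URL shown above.
SEARCH_BASE = "https://www.linkedin.com/search/results/companies/"


def build_search_url(company_name):
    """Return the company search URL for a keyword (hypothetical helper)."""
    # urlencode escapes spaces and reserved characters in the keyword.
    return SEARCH_BASE + "?" + urlencode({"keywords": company_name})


if __name__ == "__main__":
    print(build_search_url("apple"))
    print(build_search_url("Johnson & Johnson"))
```

A list of such URLs could then be used as the spider's start_urls instead of a single hard-coded address.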
Next, we need to create a Scrapy project. Enter the following command on the command line:
scrapy startproject linkedin
This command will create a Scrapy project named linkedin in the current directory.
After creating the project, enter the following command in the project root directory to create a new crawler:
scrapy genspider company_spider www.linkedin.com
This creates a spider file named company_spider.py in the project's spiders directory, with the crawl restricted to the LinkedIn domain.
In Spider, we need to configure some basic information, such as the URL to be crawled and how to parse the data in the page. Add the following code to the company_spider.py file you just created:
import scrapy


class CompanySpider(scrapy.Spider):
    name = "company"
    allowed_domains = ["linkedin.com"]
    start_urls = [
        "https://www.linkedin.com/search/results/companies/?keywords=apple"
    ]

    def parse(self, response):
        pass

In the above code, we have only defined the site URL to be crawled and an empty parsing function; the actual extraction logic is still missing. Now we need to write the parse function to capture and process LinkedIn company information.
In the parse function, we can use XPath or CSS selectors to parse the HTML. Basic information on the LinkedIn company page, such as the company name, can be extracted with the following XPath:
//*[@class="org-top-card-module__name ember-view"]/text()
This XPath selects every element whose class attribute is exactly "org-top-card-module__name ember-view" and returns its text.
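Because the predicate compares the whole class attribute as a string, an element that lists the same classes in a different order is not selected. The sketch below illustrates this with the standard library's ElementTree, which supports the same [@attribute='value'] predicate, rather than with Scrapy's own selectors. The markup and the name select_names are made up for illustration; real LinkedIn pages differ.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup for illustration only.
SAMPLE = """
<div>
  <h1 class="org-top-card-module__name ember-view">Example Corp</h1>
  <h1 class="ember-view org-top-card-module__name">Reordered Corp</h1>
</div>
"""

NAME_CLASS = "org-top-card-module__name ember-view"


def select_names(markup):
    """Return the text of elements whose class attribute equals NAME_CLASS."""
    root = ET.fromstring(markup)
    # The predicate is an exact string comparison on the attribute value.
    matches = root.findall(f".//*[@class='{NAME_CLASS}']")
    return [element.text for element in matches]


if __name__ == "__main__":
    # Only the first heading has the classes in the expected order.
    print(select_names(SAMPLE))
```

If the page's class order is not stable, a CSS selector on a single class name, as used for the other fields in the spider, is usually less brittle.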
The following is the complete company_spider.py file:
import scrapy


class CompanySpider(scrapy.Spider):
    name = "company"
    allowed_domains = ["linkedin.com"]
    start_urls = [
        "https://www.linkedin.com/search/results/companies/?keywords=apple"
    ]

    def parse(self, response):
        # Get the company name
        company_name = response.xpath(
            '//*[@class="org-top-card-module__name ember-view"]/text()'
        )
        # Get the company summary; default to "" so strip() never runs on None
        company_summary = response.css(
            '.org-top-card-summary__description::text'
        ).extract_first(default="").strip()
        # Get the company category tags
        company_tags = response.css(
            '.org-top-card-category-list__top-card-category::text'
        ).extract()
        company_tags = ','.join(company_tags)
        # Get the employee information
        employees_section = response.xpath(
            '//*[@class="org-company-employees-snackbar__details-info"]'
        )
        employees_current = employees_section.xpath('.//li[1]/span/text()').extract_first()
        employees_past = employees_section.xpath('.//li[2]/span/text()').extract_first()

        # Data processing: fall back to "N/A" when a value is missing
        company_name = company_name.extract_first()
        company_summary = company_summary if company_summary else "N/A"
        company_tags = company_tags if company_tags else "N/A"
        employees_current = employees_current if employees_current else "N/A"
        employees_past = employees_past if employees_past else "N/A"

        # Print the scraped results
        print('Company Name: ', company_name)
        print('Company Summary: ', company_summary)
        print('Company Tags: ', company_tags)
        print('Current Employees: ', employees_current)
        print('Past Employees: ', employees_past)

In the above code, we use XPath and CSS selectors to extract the company name, company profile, tags, and employee information from the page, then apply some basic data processing and print the results.
The code for crawling and processing the LinkedIn company page is now complete. Next, we need to run Scrapy to execute the crawler. Note that the name passed to scrapy crawl is the spider's name attribute ("company"), not the file name. Enter the following command on the command line:
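The data processing step can also be factored into small helpers, so that a missing element never causes strip() to be called on None. This is only a sketch: the names or_default and join_tags are hypothetical and are not Scrapy functions.

```python
def or_default(value, default="N/A"):
    """Return the stripped value, or the default when it is missing or blank."""
    if value is None:
        return default
    value = value.strip()
    return value if value else default


def join_tags(tags, default="N/A"):
    """Join non-empty tags with commas, or return the default when none remain."""
    cleaned = [tag.strip() for tag in tags if tag and tag.strip()]
    return ",".join(cleaned) if cleaned else default


if __name__ == "__main__":
    print(or_default(None))
    print(or_default("  Consumer electronics  "))
    print(join_tags([" Technology ", "", "Hardware"]))
```

Inside parse, each extracted value would be passed through or_default, and the list returned by extract() through join_tags.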
scrapy crawl company
After this command is executed, Scrapy requests the LinkedIn page, runs the parse function on the response, and prints the extracted results.
Summary
The above is how to use Scrapy to crawl LinkedIn company information. With the Scrapy framework, we can collect data at scale and process and transform it in the same pipeline, which saves time and effort and makes data collection more efficient.