
Scrapy case analysis: How to crawl company information on LinkedIn

王林
2023-06-23 10:04:40

Scrapy is a Python-based crawler framework for fetching web pages and extracting structured data from them. In this article, we walk through a Scrapy case in detail to show how to crawl company information on LinkedIn.

  1. Determine the target URL

First, our target is the company information on LinkedIn, so we need the URL of the page that exposes it. Open the LinkedIn website, enter the company name in the search box, and select the "Company" option in the drop-down box to open the company results. From there we can see the company's basic information, number of employees, affiliated companies, and other details. Copy the URL of this page from the browser's address bar or developer tools for later use. The structure of this URL is:

https://www.linkedin.com/search/results/companies/?keywords=xxx

Here, keywords=xxx is the search keyword, and xxx can be replaced with any company name.
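As a minimal sketch of how such a URL could be built for an arbitrary company name, the snippet below uses the standard library's `urlencode` so that spaces and special characters in the name are escaped. The helper name `build_search_url` and the second company name are illustrative assumptions, not part of the original project.

```python
from urllib.parse import urlencode

# Base path of the company search page, taken from the URL structure above.
SEARCH_URL = "https://www.linkedin.com/search/results/companies/"


def build_search_url(company_name):
    """Return the search URL for a company name, with the name URL-encoded."""
    return SEARCH_URL + "?" + urlencode({"keywords": company_name})


print(build_search_url("apple"))
# Names containing spaces or "&" are escaped rather than breaking the query string.
print(build_search_url("Johnson & Johnson"))
```

The resulting strings could be placed in the spider's start_urls list in place of a hard-coded URL.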

  2. Create a Scrapy project

Next, we need to create a Scrapy project. Enter the following command on the command line:

scrapy startproject linkedin

This command will create a Scrapy project named linkedin in the current directory.

  3. Create a crawler

After creating the project, enter the following command in the project root directory to create a new crawler:

scrapy genspider company_spider www.linkedin.com

This creates a spider file named company_spider.py whose crawl is restricted to the www.linkedin.com domain.

  4. Configure Scrapy

In the spider, we need to configure some basic information, such as the URL to crawl and how to parse the data in the page. Put the following code in the company_spider.py file you just created:

import scrapy

class CompanySpider(scrapy.Spider):
    name = "company"
    allowed_domains = ["linkedin.com"]
    start_urls = [
        "https://www.linkedin.com/search/results/companies/?keywords=apple"
    ]

    def parse(self, response):
        pass

In the above code, we have only defined the site URL to be crawled and an empty parsing function; the spider does not extract anything yet. Next, we need to write the parse function to capture and process LinkedIn company information.

  5. Write the parsing function

In the parse function, we write the code that captures and processes LinkedIn company information. We can use XPath or CSS selectors to parse the HTML. The company name, for example, can be extracted with the following XPath:

//*[@class="org-top-card-module__name ember-view"]/text()

This XPath selects any element whose class attribute is exactly "org-top-card-module__name ember-view" and returns its text.
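To illustrate what the exact-match predicate implies, here is a minimal sketch that uses Python's standard-library `xml.etree.ElementTree` in place of Scrapy's selectors, so it runs without Scrapy installed. The markup and company names are made up for illustration; only the class string comes from the XPath above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup: the second heading carries an extra class.
SAMPLE = """
<div>
  <h1 class="org-top-card-module__name ember-view">Example Corp</h1>
  <h1 class="org-top-card-module__name ember-view active">Other Corp</h1>
</div>
""".strip()

root = ET.fromstring(SAMPLE)

# [@class='...'] compares the whole attribute string, so the element
# with the additional "active" class is not selected.
exact = root.findall(".//*[@class='org-top-card-module__name ember-view']")
exact_names = [el.text for el in exact]

# Matching on a single class token is more tolerant of extra classes.
token_names = [
    el.text
    for el in root.iter()
    if "org-top-card-module__name" in el.get("class", "").split()
]

print(exact_names)
print(token_names)
```

In a Scrapy spider, a CSS selector such as `.org-top-card-module__name::text` behaves like the token match, which may be more robust if the page adds or reorders classes.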

The following is the complete company_spider.py file:

import scrapy

class CompanySpider(scrapy.Spider):
    name = "company"
    allowed_domains = ["linkedin.com"]
    start_urls = [
        "https://www.linkedin.com/search/results/companies/?keywords=apple"
    ]

    def parse(self, response):
        # Get the company name
        company_name = response.xpath('//*[@class="org-top-card-module__name ember-view"]/text()')

        # Get the company summary (default to '' so .strip() never runs on None)
        company_summary = response.css('.org-top-card-summary__description::text').extract_first(default='').strip()

        # Get the company category tags
        company_tags = response.css('.org-top-card-category-list__top-card-category::text').extract()
        company_tags = ','.join(company_tags)

        # Get the company employee information
        employees_section = response.xpath('//*[@class="org-company-employees-snackbar__details-info"]')
        employees_current = employees_section.xpath('.//li[1]/span/text()').extract_first()
        employees_past = employees_section.xpath('.//li[2]/span/text()').extract_first()

        # Data processing: fall back to "N/A" for missing values
        company_name = company_name.extract_first()
        company_summary = company_summary if company_summary else "N/A"
        company_tags = company_tags if company_tags else "N/A"
        employees_current = employees_current if employees_current else "N/A"
        employees_past = employees_past if employees_past else "N/A"

        # Print the scraped results
        print('Company Name: ', company_name)
        print('Company Summary: ', company_summary)
        print('Company Tags: ', company_tags)
        print('\nEmployee Information\nCurrent: ', employees_current)
        print('Past: ', employees_past)

In the above code, we use XPath and CSS selectors to extract the company name, summary, tags, and employee information from the page, then apply some basic data processing and print the results.
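The data-processing step can also be factored into small pure functions, which makes the "N/A" fallback easy to test without running a crawl. The sketch below is one possible arrangement; the names `clean` and `build_item` and the dictionary keys are assumptions made for illustration, not part of the original spider.

```python
def clean(value, default="N/A"):
    """Strip surrounding whitespace; return the default for None or empty text."""
    if value is None:
        return default
    value = value.strip()
    return value if value else default


def build_item(name, summary, tags, current, past):
    """Assemble the extracted fields into one dictionary of cleaned values."""
    # Drop empty or whitespace-only tags before joining them with commas.
    tags_joined = ",".join(t.strip() for t in tags if t and t.strip())
    return {
        "company_name": clean(name),
        "company_summary": clean(summary),
        "company_tags": clean(tags_joined),
        "employees_current": clean(current),
        "employees_past": clean(past),
    }


# Hypothetical values standing in for what the selectors might return.
item = build_item("  Example Corp ", None, [" Tech ", ""], "100", None)
print(item)
```

A Scrapy parse method can yield a dictionary like this instead of printing it, which lets Scrapy's feed exports and item pipelines handle the output.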

  6. Run Scrapy

We have now finished writing the spider that crawls and processes the LinkedIn company information page. Next, we run Scrapy to execute it. Enter the following command on the command line from the project root directory:

scrapy crawl company

After executing this command, Scrapy will begin to crawl and process the data in the LinkedIn company information page, and output the crawl results.

Summary

The above is how to use Scrapy to crawl LinkedIn company information. With the Scrapy framework, we can carry out large-scale data scraping and process and transform the data in the same place, which saves time and improves data collection efficiency.
