Scrapy Case Study: How to Scrape Company Information from LinkedIn


王林

Jun 23, 2023 am 10:04 AM


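Scrapy is a Python-based crawling framework for collecting information from the web quickly and conveniently. In this article, we walk through a Scrapy case study that shows how to scrape company information from LinkedIn.

Determine the Target URL

First, we need to be clear that our target is company information on LinkedIn, so we need the URL of the LinkedIn page that shows it. Open the LinkedIn website, type a company name into the search box, and choose the "Companies" option from the drop-down to reach the company page. This page shows the company's basic information, employee count, affiliated companies, and so on. Then copy the page's URL from the browser's developer tools (or the address bar) for later use. The URL has the following structure: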

https://www.linkedin.com/search/results/companies/?keywords=xxx

Here, keywords=xxx is the search keyword; xxx can be replaced with any company name.
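The keyword substitution described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original spider: the helper name build_search_url and the constant BASE_URL are assumptions introduced here, and urlencode from the standard library is used so that names containing spaces or symbols are escaped correctly.

```python
from urllib.parse import urlencode

# Base of the company search URL shown above.
BASE_URL = "https://www.linkedin.com/search/results/companies/"


def build_search_url(company_name):
    """Return the search URL for a company name (hypothetical helper)."""
    # urlencode escapes spaces, ampersands, and other reserved characters.
    query = urlencode({"keywords": company_name})
    return f"{BASE_URL}?{query}"


print(build_search_url("apple"))
print(build_search_url("Apple Inc"))
```

A list of such URLs could then be assigned to the spider's start_urls instead of hard-coding a single keyword.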

Create a Scrapy Project

Next, we need to create a Scrapy project. Enter the following command on the command line:

scrapy startproject linkedin

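This command creates a Scrapy project named linkedin in the current directory.

Create the Spider

After creating the project, enter the following command in the project root directory to create a new spider: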

scrapy genspider company_spider www.linkedin.com

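This creates a spider named company_spider, in the file company_spider.py, restricted to the www.linkedin.com domain.

Configure Scrapy

In the spider, we need to configure some basic information, such as the URL to scrape and how to parse the data on the page. Add the following code to the company_spider.py file you just created: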

import scrapy

class CompanySpider(scrapy.Spider):
    name = "company"
    allowed_domains = ["linkedin.com"]
    start_urls = [
        "https://www.linkedin.com/search/results/companies/?keywords=apple"
    ]

    def parse(self, response):
        pass

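The code above only defines the URL to scrape and an empty parse function; it does not yet contain the spider's actual logic. Note that name is set to "company", which is the name used later to run the spider. Now we need to write the parse function that scrapes and processes the LinkedIn company information.

Write the Parse Function

In the parse function, we write the code that scrapes and processes the LinkedIn company information. We can use XPath or CSS selectors to parse the HTML. The company name on the LinkedIn company page can be extracted with the following XPath: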

//*[@class="org-top-card-module__name ember-view"]/text()

This XPath selects every element whose class attribute is exactly "org-top-card-module__name ember-view" and returns its text.
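To see what an exact class match means in practice, here is a small self-contained sketch. It is an assumption-laden stand-in, not Scrapy itself: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no /text() step, so the text is read from the element), and the SAMPLE markup and the extract_names helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up markup: only the first heading carries the full class string.
SAMPLE = """
<div>
  <h1 class="org-top-card-module__name ember-view">Example Corp</h1>
  <h1 class="org-top-card-module__name">Other Corp</h1>
</div>
"""


def extract_names(markup):
    """Return the text of elements whose class attribute matches exactly."""
    root = ET.fromstring(markup.strip())
    # [@class="..."] compares the whole attribute value, not single class names.
    matches = root.findall('.//*[@class="org-top-card-module__name ember-view"]')
    return [el.text.strip() for el in matches if el.text]


names = extract_names(SAMPLE)
print(names)  # ['Example Corp']
```

The second heading is skipped because its class attribute is a different string, which is why a selector of this kind can stop matching when the page's class names change.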

Below is the complete company_spider.py file:

import scrapy

class CompanySpider(scrapy.Spider):
    name = "company"
    allowed_domains = ["linkedin.com"]
    start_urls = [
        "https://www.linkedin.com/search/results/companies/?keywords=apple"
    ]

    def parse(self, response):
        # Company name
        company_name = response.xpath('//*[@class="org-top-card-module__name ember-view"]/text()')

        # Company summary; default to '' so .strip() never runs on None
        company_summary = response.css('.org-top-card-summary__description::text').extract_first(default='').strip()

        # Company category tags, joined into one comma-separated string
        company_tags = response.css('.org-top-card-category-list__top-card-category::text').extract()
        company_tags = ','.join(company_tags)

        # Employee information (current and past)
        employees_section = response.xpath('//*[@class="org-company-employees-snackbar__details-info"]')
        employees_current = employees_section.xpath('.//li[1]/span/text()').extract_first()
        employees_past = employees_section.xpath('.//li[2]/span/text()').extract_first()

        # Data processing: fall back to "N/A" for fields that were not found
        company_name = company_name.extract_first()
        company_summary = company_summary if company_summary else "N/A"
        company_tags = company_tags if company_tags else "N/A"
        employees_current = employees_current if employees_current else "N/A"
        employees_past = employees_past if employees_past else "N/A"

        # Print the scraped results
        print('Company Name: ', company_name)
        print('Company Summary: ', company_summary)
        print('Company Tags: ', company_tags)
        print('\nEmployee Information\nCurrent: ', employees_current)
        print('Past: ', employees_past)

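The code above uses XPath and CSS selectors to extract the company name, summary, tags, and employee information from the page, then applies some basic data processing and prints the results.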
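The "N/A" fallback used in the data-processing step can also be factored into a small helper. The following is only a sketch under stated assumptions: or_na and build_record are hypothetical names that do not appear in the original spider, and the sample values are invented. It returns a dictionary rather than printing, which may be easier to reuse.

```python
def or_na(value):
    """Return the stripped text, or "N/A" when the value is None or blank."""
    if value is None:
        return "N/A"
    text = value.strip()
    return text if text else "N/A"


def build_record(name, summary, tags, employees_current, employees_past):
    """Collect the scraped fields into one dictionary (hypothetical helper)."""
    return {
        "name": or_na(name),
        "summary": or_na(summary),
        # Tags arrive as a list of strings; join them as the spider does.
        "tags": or_na(",".join(tag.strip() for tag in tags)),
        "employees_current": or_na(employees_current),
        "employees_past": or_na(employees_past),
    }


# Invented sample values standing in for selector results.
record = build_record("Example Corp", None, [], " 1,000 ", None)
print(record)
```

Inside parse, a dictionary like this could be yielded instead of printed, so that Scrapy can pass it on to its own output handling.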

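Run Scrapy

We have now finished the code that scrapes and processes the LinkedIn company page. Next, we run Scrapy to execute the spider. Enter the following command on the command line: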

scrapy crawl company

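After running this command, Scrapy starts scraping and processing the data on the LinkedIn company page and prints the results.

Summary

That is how to scrape LinkedIn company information with Scrapy. With the Scrapy framework, we can carry out large-scale data scraping and also process and transform the data, which saves time and effort and makes data collection more efficient.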
