介紹
在本文中,我們深入研究從 LinkedIn 提取和分析工作資料的過程,利用 Python、Nu shell 和 ChatGPT 的組合來簡化和增強我們的工作流程。
我將引導您完成我進行研究的步驟,向您展示如何使用這些技術來探索不同國家甚至其他領域的就業市場。透過結合這些工具和方法,您可以收集和分析數據,以獲得對您感興趣的任何就業市場的寶貴見解。
技術概述
Python
Python 因其多功能庫而被選中,特別是 linkedin_jobs_scraper 和 openai。這些套件簡化了作業資料的抓取和處理。
核殼
對 Nu shell 進行了實驗,以將其功能與傳統的 bash 堆疊進行比較。該實驗旨在探索其在處理和操作數據方面的潛在好處。
聊天GPT
ChatGPT 用於協助從收集的資料中提取特定的工作特徵,例如經驗年限、學位要求、技術堆疊、職位層級和核心職責。
資料擷取
啟動需要一些資料。 LinkedIn 是我想到的第一個網站,並且已經準備好使用 Python 套件。我複製了範例程式碼,對其進行了一些修改,並準備使用腳本來獲取包含職位描述清單的 JSON 檔案。來源如下:
import json import logging import os from threading import Lock from dotenv import load_dotenv # linkedin_jobs_scraper loads env statically # So dotenv should be loaded before imports load_dotenv() from linkedin_jobs_scraper import LinkedinScraper from linkedin_jobs_scraper.events import EventData, Events from linkedin_jobs_scraper.filters import ExperienceLevelFilters, TypeFilters from linkedin_jobs_scraper.query import Query, QueryFilters, QueryOptions CHROMEDRIVER_PATH = os.environ["CHROMEDRIVER_PATH"] RESULT_FILE_PATH = "result.json" KEYWORDS = ("Python", "PHP", "Java", "Rust") LOCATIONS = ("South Korea",) TYPE_FILTERS = (TypeFilters.FULL_TIME,) EXPERIENCE = (ExperienceLevelFilters.MID_SENIOR,) LIMIT = 500 logging.basicConfig(level=logging.INFO) log = logging.getLogger(__name__) def main(): result_lock = Lock() result = [] def on_data(data: EventData): with result_lock: result.append(data._asdict()) log.info( "[JOB]", data.title, data.company, len(data.description), ) def on_error(error): log.error("[ERROR]", error) def on_end(): log.info("Scraping finished") if not result: return with open(RESULT_FILE_PATH, "w") as f: json.dump(result, f) queries = [ Query( query=keyword, options=QueryOptions( limit=LIMIT, locations=[*LOCATIONS], filters=QueryFilters( type=[*TYPE_FILTERS], experience=[*EXPERIENCE], ), ), ) for keyword in KEYWORDS ] scraper = LinkedinScraper( chrome_executable_path=CHROMEDRIVER_PATH, headless=True, max_workers=len(queries), slow_mo=0.5, page_load_timeout=40, ) scraper.on(Events.DATA, on_data) scraper.on(Events.ERROR, on_error) scraper.on(Events.END, on_end) scraper.run(queries) if __name__ == "__main__": main()
為了下載 chrome 驅動程序,我製作了以下 bash 腳本:
#!/usr/bin/env bash stable_version=$(curl 'https://googlechromelabs.github.io/chrome-for-testing/LATEST_RELEASE_STABLE') driver_url=$(curl 'https://googlechromelabs.github.io/chrome-for-testing/known-good-versions-with-downloads.json' \ | jq -r ".versions[] | select(.version == \"${stable_version}\") | .downloads.chromedriver[0] | select(.platform == \"linux64\") | .url") wget "$driver_url" driver_zip_name=$(echo "$driver_url" | awk -F'/' '{print $NF}') unzip "$driver_zip_name" rm "$driver_zip_name"
我的 .env 檔案看起來像這樣:
CHROMEDRIVER_PATH="chromedriver-linux64/chromedriver" LI_AT_COOKIE=
linkedin_jobs_scraper 將作業序列化到以下 DTO:
class EventData(NamedTuple): query: str = '' location: str = '' job_id: str = '' job_index: int = -1 # Only for debug link: str = '' apply_link: str = '' title: str = '' company: str = '' company_link: str = '' company_img_link: str = '' place: str = '' description: str = '' description_html: str = '' date: str = '' insights: List[str] = [] skills: List[str] = []
範例(為了更好的可讀性,描述已替換為 ...):
query | location | job_id | job_index | link | apply_link | title | company | company_link | company_img_link | place | description | description_html | date | insights | skills |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Python | South Korea | 3959499221 | 0 | https://www.linkedin.com/jobs/view/3959499221/?trk=flagship3_search_srp_jobs | Senior Python Software Engineer | Canonical | https://media.licdn.com/dms/image/v2/C560BAQEbIYAkAURcYw/company-logo_100_100/company-logo_100_100/0/1650566107463/canonical_logo?e=1734566400&v=beta&t=emb8cxAFwBnOGwJ8nTftd8ODTFDkC_5SQNz-Jcd8zRU | Seoul, Seoul, South Korea (Remote) | ... | ... | [Remote Full-time Mid-Senior level, Skills: Python (Programming Language), Computer Science, 8 more, See how you compare to 18 applicants. Try Premium for RSD0, , Am I a good fit for this job?, How can I best position myself for this job?, Tell me more about Canonical] | [Back-End Web Development, Computer Science, Engineering Documentation, Kubernetes, Linux, MLOps, OpenStack, Python (Programming Language), Technical Documentation, Web Services] |
Was generated with the following nu shell command:
# Replaces description of a job with elipsis def hide-description [] { update description { |row| '...' } | update description_html { |row| '...' } } cat result.json | from json | first | hide-description | to md --pretty
Last steps before analysis
We already have several ready to use features (title and skills), but I want more:
- Years of experience
- Degree
- Tech stack
- Position
- Responsibilities
So let's add them with help of ChatGPT!
import json import logging import os from dotenv import load_dotenv from linkedin_jobs_scraper.events import EventData from openai import OpenAI from tqdm import tqdm load_dotenv() client = OpenAI( api_key=os.environ["OPENAI_API_KEY"], ) with open("result.json", "rb") as f: jobs = json.load(f) parsed_descriptions = [] for job in tqdm(jobs): job = EventData(**job) chat_completion = client.chat.completions.create( model="gpt-4o-mini", messages=[ { "role": "user", "content": """ Process given IT job description. Output only raw JSON with the following fields: - Experience (amount of years or null) - Degree requirement (str if found else null) - Tech stack (array of strings) - Position (middle, senior, lead, manager, other (describe it)) - Core responsibilites (array of strings) Output will be passed directrly to the Python's `json.loads` function. So DO NOT APPLY MARKDOWN FORMATTING Example: ``` { "experience": 5, "degree": "bachelor", "stack": ["Python", "FastAPI", "Docker"], "position": "middle", "responsibilities": ["Deliver features", "break production"] } ``` Here is a job description: """ + "\n\n" + job.description_html, } ], ) content = chat_completion.choices[0].message.content try: if not content: print("Empty result from ChatGPT") continue result = json.loads(content) except json.decoder.JSONDecodeError as e: logging.error(e, chat_completion) continue result["job_id"] = job.job_id parsed_descriptions.append(result) with open("job_descriptions_analysis.json", "w") as f: json.dump(parsed_descriptions, f)
Do not forget to add OPENAI_API_KEY to the .env file
Now we can merge by job_id results with data from LinkedIn:
cat job_descriptions_analysis.json | from json | merge (cat result.json | from json) | to json | save full.json
Our data is ready to analyze!
cat full.json | from json | columns ╭────┬──────────────────╮ │ 0 │ experience │ │ 1 │ degree │ │ 2 │ stack │ │ 3 │ position │ │ 4 │ responsibilities │ │ 5 │ job_id │ │ 6 │ query │ │ 7 │ location │ │ 8 │ job_index │ │ 9 │ link │ │ 10 │ apply_link │ │ 11 │ title │ │ 12 │ company │ │ 13 │ company_link │ │ 14 │ company_img_link │ │ 15 │ place │ │ 16 │ description │ │ 17 │ description_html │ │ 18 │ date │ │ 19 │ insights │ │ 20 │ skills │ ╰────┴──────────────────╯
Analysis
For the start
let df = cat full.json | from json
Now we can see technologies frequency:
$df | get 'stack' | flatten | uniq --count | sort-by count --reverse | first 20 | to md --pretty
value | count |
---|---|
Python | 185 |
Java | 70 |
AWS | 65 |
Kubernetes | 61 |
SQL | 54 |
C++ | 46 |
Docker | 42 |
Linux | 41 |
React | 37 |
Kotlin | 34 |
JavaScript | 30 |
C | 30 |
Kafka | 28 |
TypeScript | 26 |
GCP | 25 |
Azure | 24 |
Tableau | 22 |
Hadoop | 21 |
Spark | 21 |
R | 20 |
With Python:
$df | filter-by-intersection 'stack' ['python'] | get 'stack' | flatten | where $it != 'Python' # Exclude python itself | uniq --count | sort-by count --reverse | first 10 | to md --pretty
value | count |
---|---|
Java | 44 |
AWS | 43 |
SQL | 40 |
Kubernetes | 36 |
Docker | 27 |
C++ | 26 |
Linux | 24 |
R | 20 |
GCP | 20 |
C | 18 |
Without Python:
$df | filter-by-intersection 'stack' ['python'] --invert | get 'stack' | flatten | uniq --count | sort-by count --reverse | first 10 | to md --pretty
value | count |
---|---|
React | 31 |
Java | 26 |
Kubernetes | 25 |
TypeScript | 23 |
AWS | 22 |
Kotlin | 21 |
C++ | 20 |
Linux | 17 |
Docker | 15 |
Next.js | 15 |
The most of the jobs require Python, but there are some front-end, Java and C++ jobs
Magic filter-by-intersection function is a custom one and allow filtering list values that include given set of elements:
# Filters rows by intersecting given `column` with `requirements` # Case insensitive and works only if ALL requirements exist in a `column` value # If `--invert` then works as symmetric difference def filter-by-intersection [ column: string requirements: list<string> --invert (-i) ] { let required_stack = $requirements | par-each { |el| str downcase } let required_len = if $invert { 0 } else { ($requirements | length )} $in | filter { |row| $required_len == ( $row | get $column | par-each { |el| str downcase } | where ($it in $requirements) | length ) } } </string>
What about experience and degree requirement for each position in Python?
$df | filter-by-intersection 'stack' ['python'] | group-by 'position' --to-table | insert 'group_size' { |group| $group.items | length } | where 'group_size' >= 10 | insert 'experience' { |group| $group.items | get 'experience' | uniq --count | sort-by 'count' --reverse | update 'value' { |row| if $row.value == null { 0 } else { $row.value }} | rename --column { 'value': 'years' } | first 3 } | insert 'degree_requirement' { |group| $group.items | each { |row| $row.degree != null } | uniq --count | sort-by 'value' | rename --column { 'value': 'required' } } | sort-by 'group_size' --reverse | select 'group' 'group_size' 'experience' 'degree_requirement'
Output:
╭───┬────────┬────────────┬───────────────────────┬──────────────────────────╮ │ # │ group │ group_size │ experience │ degree_requirement │ ├───┼────────┼────────────┼───────────────────────┼──────────────────────────┤ │ 0 │ senior │ 83 │ ╭───┬───────┬───────╮ │ ╭───┬──────────┬───────╮ │ │ │ │ │ │ # │ years │ count │ │ │ # │ required │ count │ │ │ │ │ │ ├───┼───────┼───────┤ │ ├───┼──────────┼───────┤ │ │ │ │ │ │ 0 │ 5 │ 30 │ │ │ 0 │ false │ 26 │ │ │ │ │ │ │ 1 │ 0 │ 11 │ │ │ 1 │ true │ 57 │ │ │ │ │ │ │ 2 │ 7 │ 11 │ │ ╰───┴──────────┴───────╯ │ │ │ │ │ ╰───┴───────┴───────╯ │ │ │ 1 │ other │ 14 │ ╭───┬───────┬───────╮ │ ╭───┬──────────┬───────╮ │ │ │ │ │ │ # │ years │ count │ │ │ # │ required │ count │ │ │ │ │ │ ├───┼───────┼───────┤ │ ├───┼──────────┼───────┤ │ │ │ │ │ │ 0 │ 0 │ 8 │ │ │ 0 │ false │ 12 │ │ │ │ │ │ │ 1 │ 5 │ 1 │ │ │ 1 │ true │ 2 │ │ │ │ │ │ │ 2 │ 3 │ 1 │ │ ╰───┴──────────┴───────╯ │ │ │ │ │ ╰───┴───────┴───────╯ │ │ │ 2 │ lead │ 12 │ ╭───┬───────┬───────╮ │ ╭───┬──────────┬───────╮ │ │ │ │ │ │ # │ years │ count │ │ │ # │ required │ count │ │ │ │ │ │ ├───┼───────┼───────┤ │ ├───┼──────────┼───────┤ │ │ │ │ │ │ 0 │ 0 │ 5 │ │ │ 0 │ false │ 6 │ │ │ │ │ │ │ 1 │ 10 │ 4 │ │ │ 1 │ true │ 6 │ │ │ │ │ │ │ 2 │ 5 │ 1 │ │ ╰───┴──────────┴───────╯ │ │ │ │ │ ╰───┴───────┴───────╯ │ │ │ 3 │ middle │ 10 │ ╭───┬───────┬───────╮ │ ╭───┬──────────┬───────╮ │ │ │ │ │ │ # │ years │ count │ │ │ # │ required │ count │ │ │ │ │ │ ├───┼───────┼───────┤ │ ├───┼──────────┼───────┤ │ │ │ │ │ │ 0 │ 3 │ 4 │ │ │ 0 │ false │ 4 │ │ │ │ │ │ │ 1 │ 5 │ 3 │ │ │ 1 │ true │ 6 │ │ │ │ │ │ │ 2 │ 2 │ 2 │ │ ╰───┴──────────┴───────╯ │ │ │ │ │ ╰───┴───────┴───────╯ │ │ ╰───┴────────┴────────────┴───────────────────────┴──────────────────────────╯
Extraction of the most common requirements wasn't as easy as previous steps. So I've met a classification problem, and I'm going to describe my solution in the next chapter of this article.
Conclusion
We successfully extracted and analyzed job data from LinkedIn using the linkedin_jobs_scraper package. Responsibilities in the actual dataset are too sparse and need better processing to make functional classes that will help in CV creation. But the given steps already help me a lot with monitoring and applying to the jobs in half-auto mode.
以上是探索軟體工程師的就業市場的詳細內容。更多資訊請關注PHP中文網其他相關文章!

Tomergelistsinpython,YouCanusethe操作員,estextMethod,ListComprehension,Oritertools

在Python3中,可以通過多種方法連接兩個列表:1)使用 運算符,適用於小列表,但對大列表效率低;2)使用extend方法,適用於大列表,內存效率高,但會修改原列表;3)使用*運算符,適用於合併多個列表,不修改原列表;4)使用itertools.chain,適用於大數據集,內存效率高。

使用join()方法是Python中從列表連接字符串最有效的方法。 1)使用join()方法高效且易讀。 2)循環使用 運算符對大列表效率低。 3)列表推導式與join()結合適用於需要轉換的場景。 4)reduce()方法適用於其他類型歸約,但對字符串連接效率低。完整句子結束。

pythonexecutionistheprocessoftransformingpypythoncodeintoExecutablestructions.1)InternterPreterReadSthecode,ConvertingTingitIntObyTecode,whepythonvirtualmachine(pvm)theglobalinterpreterpreterpreterpreterlock(gil)the thepythonvirtualmachine(pvm)

Python的關鍵特性包括:1.語法簡潔易懂,適合初學者;2.動態類型系統,提高開發速度;3.豐富的標準庫,支持多種任務;4.強大的社區和生態系統,提供廣泛支持;5.解釋性,適合腳本和快速原型開發;6.多範式支持,適用於各種編程風格。

Python是解釋型語言,但也包含編譯過程。 1)Python代碼先編譯成字節碼。 2)字節碼由Python虛擬機解釋執行。 3)這種混合機制使Python既靈活又高效,但執行速度不如完全編譯型語言。

UseeAforloopWheniteratingOveraseQuenceOrforAspecificnumberoftimes; useAwhiLeLoopWhenconTinuingUntilAcIntiment.forloopsareIdealForkNownsences,而WhileLeleLeleLeleLeleLoopSituationSituationsItuationsItuationSuationSituationswithUndEtermentersitations。

pythonloopscanleadtoerrorslikeinfiniteloops,modifyingListsDuringteritation,逐個偏置,零indexingissues,andnestedloopineflinefficiencies


熱AI工具

Undresser.AI Undress
人工智慧驅動的應用程序,用於創建逼真的裸體照片

AI Clothes Remover
用於從照片中去除衣服的線上人工智慧工具。

Undress AI Tool
免費脫衣圖片

Clothoff.io
AI脫衣器

Video Face Swap
使用我們完全免費的人工智慧換臉工具,輕鬆在任何影片中換臉!

熱門文章

熱工具

SAP NetWeaver Server Adapter for Eclipse
將Eclipse與SAP NetWeaver應用伺服器整合。

SublimeText3 英文版
推薦:為Win版本,支援程式碼提示!

SecLists
SecLists是最終安全測試人員的伙伴。它是一個包含各種類型清單的集合,這些清單在安全評估過程中經常使用,而且都在一個地方。 SecLists透過方便地提供安全測試人員可能需要的所有列表,幫助提高安全測試的效率和生產力。清單類型包括使用者名稱、密碼、URL、模糊測試有效載荷、敏感資料模式、Web shell等等。測試人員只需將此儲存庫拉到新的測試機上,他就可以存取所需的每種類型的清單。

SublimeText3 Mac版
神級程式碼編輯軟體(SublimeText3)

Safe Exam Browser
Safe Exam Browser是一個安全的瀏覽器環境,安全地進行線上考試。該軟體將任何電腦變成一個安全的工作站。它控制對任何實用工具的訪問,並防止學生使用未經授權的資源。