搜索
首页后端开发Python教程Exploring Job Market for Software Engineers

Exploring Job Market for Software Engineers

Introduction

In this article, we dive into the process of extracting and analyzing job data from LinkedIn, leveraging a combination of Python, Nu shell, and ChatGPT to streamline and enhance our workflow.

I’ll walk you through the steps I took to carry out my research, showing how you can use these techniques to explore job markets in different countries or even in other fields. By combining these tools and methods, you can gather and analyze data to gain valuable insights into any job market you're interested in.

Technologies overview

Python

Python was chosen for its versatile libraries, particularly linkedin_jobs_scraper and openai. These packages streamlined the scraping and processing of job data.

Nu Shell

Nu shell was experimented with to compare its functionality against the traditional bash stack. This experiment aimed to explore its potential benefits in handling and manipulating data.

ChatGPT

ChatGPT was employed to assist in the extraction of specific job features from the collected data, such as years of experience, degree requirements, tech stack, position levels, and core responsibilities.

Data extraction

To start some data is required. LinkedIn was the first website that came to my mind and there was ready to use Python package. I've copied example code, modified it a little and got ready to use script to get a JSON file with a list of job descriptions. Here it's source:

import json
import logging
import os
from threading import Lock

from dotenv import load_dotenv

# linkedin_jobs_scraper loads env statically
# So dotenv should be loaded before imports
load_dotenv()

from linkedin_jobs_scraper import LinkedinScraper
from linkedin_jobs_scraper.events import EventData, Events
from linkedin_jobs_scraper.filters import ExperienceLevelFilters, TypeFilters
from linkedin_jobs_scraper.query import Query, QueryFilters, QueryOptions

CHROMEDRIVER_PATH = os.environ["CHROMEDRIVER_PATH"]

RESULT_FILE_PATH = "result.json"
KEYWORDS = ("Python", "PHP", "Java", "Rust")
LOCATIONS = ("South Korea",)
TYPE_FILTERS = (TypeFilters.FULL_TIME,)
EXPERIENCE = (ExperienceLevelFilters.MID_SENIOR,)
LIMIT = 500

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)


def main():
    result_lock = Lock()
    result = []

    def on_data(data: EventData):
        with result_lock:
            result.append(data._asdict())

        log.info(
            "[JOB]",
            data.title,
            data.company,
            len(data.description),
        )

    def on_error(error):
        log.error("[ERROR]", error)

    def on_end():
        log.info("Scraping finished")

        if not result:
            return

        with open(RESULT_FILE_PATH, "w") as f:
            json.dump(result, f)

    queries = [
        Query(
            query=keyword,
            options=QueryOptions(
                limit=LIMIT,
                locations=[*LOCATIONS],
                filters=QueryFilters(
                    type=[*TYPE_FILTERS],
                    experience=[*EXPERIENCE],
                ),
            ),
        )
        for keyword in KEYWORDS
    ]

    scraper = LinkedinScraper(
        chrome_executable_path=CHROMEDRIVER_PATH,
        headless=True,
        max_workers=len(queries),
        slow_mo=0.5,
        page_load_timeout=40,
    )

    scraper.on(Events.DATA, on_data)
    scraper.on(Events.ERROR, on_error)
    scraper.on(Events.END, on_end)

    scraper.run(queries)


if __name__ == "__main__":
    main()

To download chrome driver I've made the following bash script:

#!/usr/bin/env bash
stable_version=$(curl 'https://googlechromelabs.github.io/chrome-for-testing/LATEST_RELEASE_STABLE')
driver_url=$(curl 'https://googlechromelabs.github.io/chrome-for-testing/known-good-versions-with-downloads.json' \
    | jq -r ".versions[] | select(.version == \"${stable_version}\") | .downloads.chromedriver[0] | select(.platform == \"linux64\") | .url")
wget "$driver_url"
driver_zip_name=$(echo "$driver_url" | awk -F'/' '{print $NF}')
unzip "$driver_zip_name"
rm "$driver_zip_name"

And my .env file looks like that:

CHROMEDRIVER_PATH="chromedriver-linux64/chromedriver"
LI_AT_COOKIE=

linkedin_jobs_scraper serializes jobs to the following DTO:

class EventData(NamedTuple):
    query: str = ''
    location: str = ''
    job_id: str = ''
    job_index: int = -1  # Only for debug
    link: str = ''
    apply_link: str = ''
    title: str = ''
    company: str = ''
    company_link: str = ''
    company_img_link: str = ''
    place: str = ''
    description: str = ''
    description_html: str = ''
    date: str = ''
    insights: List[str] = []
    skills: List[str] = []

Example sample (description was replaced with ... for better readability):

query location job_id job_index link apply_link title company company_link company_img_link place description description_html date insights skills
Python South Korea 3959499221 0 https://www.linkedin.com/jobs/view/3959499221/?trk=flagship3_search_srp_jobs Senior Python Software Engineer Canonical https://media.licdn.com/dms/image/v2/C560BAQEbIYAkAURcYw/company-logo_100_100/company-logo_100_100/0/1650566107463/canonical_logo?e=1734566400&v=beta&t=emb8cxAFwBnOGwJ8nTftd8ODTFDkC_5SQNz-Jcd8zRU Seoul, Seoul, South Korea (Remote) ... ... [Remote Full-time Mid-Senior level, Skills: Python (Programming Language), Computer Science, +8 more, See how you compare to 18 applicants. Try Premium for RSD0, , Am I a good fit for this job?, How can I best position myself for this job?, Tell me more about Canonical] [Back-End Web Development, Computer Science, Engineering Documentation, Kubernetes, Linux, MLOps, OpenStack, Python (Programming Language), Technical Documentation, Web Services]

Was generated with the following nu shell command:

# Replaces description of a job with elipsis
def hide-description [] {
    update description { |row| '...' } 
    | update description_html { |row| '...' } 
}

cat result.json 
| from  json 
| first 
| hide-description
| to md --pretty 

Last steps before analysis

We already have several ready to use features (title and skills), but I want more:

  • Years of experience
  • Degree
  • Tech stack
  • Position
  • Responsibilities

So let's add them with help of ChatGPT!

import json
import logging
import os

from dotenv import load_dotenv
from linkedin_jobs_scraper.events import EventData
from openai import OpenAI
from tqdm import tqdm

load_dotenv()

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
)

with open("result.json", "rb") as f:
    jobs = json.load(f)

parsed_descriptions = []

for job in tqdm(jobs):
    job = EventData(**job)
    chat_completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": """
                    Process given IT job description. 
                    Output only raw JSON with the following fields:
                        - Experience (amount of years or null)
                        - Degree requirement (str if found else null)
                        - Tech stack (array of strings)
                        - Position (middle, senior, lead, manager, other (describe it))
                        - Core responsibilites (array of strings)

                    Output will be passed directrly to the
                    Python's `json.loads` function. So DO NOT APPLY MARKDOWN FORMATTING
                    Example:
                    ```


                    {
                        "experience": 5, 
                        "degree": "bachelor", 
                        "stack": ["Python", "FastAPI", "Docker"], 
                        "position": "middle",
                        "responsibilities": ["Deliver features", "break production"]
                    }


                    ```

                    Here is a job description:
                """
                + "\n\n"
                + job.description_html,
            }
        ],
    )

    content = chat_completion.choices[0].message.content
    try:
        if not content:
            print("Empty result from ChatGPT")
            continue
        result = json.loads(content)
    except json.decoder.JSONDecodeError as e:
        logging.error(e, chat_completion)
        continue

    result["job_id"] = job.job_id
    parsed_descriptions.append(result)

with open("job_descriptions_analysis.json", "w") as f:
    json.dump(parsed_descriptions, f)

Do not forget to add OPENAI_API_KEY to the .env file

Now we can merge by job_id results with data from LinkedIn:

cat job_descriptions_analysis.json 
| from json 
| merge (cat result.json | from json)
| to json
| save full.json

Our data is ready to analyze!

cat full.json | from json | columns
╭────┬──────────────────╮
│  0 │ experience       │
│  1 │ degree           │
│  2 │ stack            │
│  3 │ position         │
│  4 │ responsibilities │
│  5 │ job_id           │
│  6 │ query            │
│  7 │ location         │
│  8 │ job_index        │
│  9 │ link             │
│ 10 │ apply_link       │
│ 11 │ title            │
│ 12 │ company          │
│ 13 │ company_link     │
│ 14 │ company_img_link │
│ 15 │ place            │
│ 16 │ description      │
│ 17 │ description_html │
│ 18 │ date             │
│ 19 │ insights         │
│ 20 │ skills           │
╰────┴──────────────────╯

Analysis

For the start

let df = cat full.json | from json

Now we can see technologies frequency:

$df
| get 'stack' 
| flatten 
| uniq --count 
| sort-by count --reverse 
| first 20 
| to md --pretty
value count
Python 185
Java 70
AWS 65
Kubernetes 61
SQL 54
C++ 46
Docker 42
Linux 41
React 37
Kotlin 34
JavaScript 30
C 30
Kafka 28
TypeScript 26
GCP 25
Azure 24
Tableau 22
Hadoop 21
Spark 21
R 20

With Python:

$df
| filter-by-intersection 'stack' ['python']
| get 'stack' 
| flatten 
| where $it != 'Python' # Exclude python itself
| uniq --count 
| sort-by count --reverse 
| first 10
| to md --pretty
value count
Java 44
AWS 43
SQL 40
Kubernetes 36
Docker 27
C++ 26
Linux 24
R 20
GCP 20
C 18

Without Python:

$df
| filter-by-intersection 'stack' ['python'] --invert
| get 'stack' 
| flatten 
| uniq --count 
| sort-by count --reverse 
| first 10
| to md --pretty
value count
React 31
Java 26
Kubernetes 25
TypeScript 23
AWS 22
Kotlin 21
C++ 20
Linux 17
Docker 15
Next.js 15

The most of the jobs require Python, but there are some front-end, Java and C++ jobs

Magic filter-by-intersection function is a custom one and allow filtering list values that include given set of elements:

# Filters rows by intersecting given `column` with `requirements`
# Case insensitive and works only if ALL requirements exist in a `column` value
# If `--invert` then works as symmetric difference
def filter-by-intersection [
    column: string
    requirements: list<string>
   --invert (-i)
] {
    let required_stack = $requirements | par-each { |el| str downcase }
    let required_len = if $invert { 0 } else { ($requirements | length )}
    $in
    | filter { |row| 
        $required_len == (
            $row 
            | get $column 
            | par-each { |el| str downcase } 
            | where ($it in $requirements) 
            | length
        )
    }
}

What about experience and degree requirement for each position in Python?

$df
| filter-by-intersection 'stack' ['python'] 
| group-by 'position' --to-table
| insert 'group_size' { |group| $group.items | length } 
| where 'group_size' >= 10
| insert 'experience' { |group| 
    $group.items 
    | get 'experience'
    | uniq --count  
    | sort-by 'count' --reverse 
    | update 'value' { |row| if $row.value == null { 0 } else { $row.value }}
    | rename --column { 'value': 'years' }
    | first 3 
} 
| insert 'degree_requirement' { |group| 
    $group.items 
    | each { |row| $row.degree != null } 
    | uniq --count 
    | sort-by 'value'
    | rename --column { 'value': 'required' }
}
| sort-by 'group_size' --reverse 
| select 'group' 'group_size' 'experience' 'degree_requirement'

Output:

╭───┬────────┬────────────┬───────────────────────┬──────────────────────────╮
│ # │ group  │ group_size │      experience       │    degree_requirement    │
├───┼────────┼────────────┼───────────────────────┼──────────────────────────┤
│ 0 │ senior │         83 │ ╭───┬───────┬───────╮ │ ╭───┬──────────┬───────╮ │
│   │        │            │ │ # │ years │ count │ │ │ # │ required │ count │ │
│   │        │            │ ├───┼───────┼───────┤ │ ├───┼──────────┼───────┤ │
│   │        │            │ │ 0 │     5 │    30 │ │ │ 0 │ false    │    26 │ │
│   │        │            │ │ 1 │     0 │    11 │ │ │ 1 │ true     │    57 │ │
│   │        │            │ │ 2 │     7 │    11 │ │ ╰───┴──────────┴───────╯ │
│   │        │            │ ╰───┴───────┴───────╯ │                          │
│ 1 │ other  │         14 │ ╭───┬───────┬───────╮ │ ╭───┬──────────┬───────╮ │
│   │        │            │ │ # │ years │ count │ │ │ # │ required │ count │ │
│   │        │            │ ├───┼───────┼───────┤ │ ├───┼──────────┼───────┤ │
│   │        │            │ │ 0 │     0 │     8 │ │ │ 0 │ false    │    12 │ │
│   │        │            │ │ 1 │     5 │     1 │ │ │ 1 │ true     │     2 │ │
│   │        │            │ │ 2 │     3 │     1 │ │ ╰───┴──────────┴───────╯ │
│   │        │            │ ╰───┴───────┴───────╯ │                          │
│ 2 │ lead   │         12 │ ╭───┬───────┬───────╮ │ ╭───┬──────────┬───────╮ │
│   │        │            │ │ # │ years │ count │ │ │ # │ required │ count │ │
│   │        │            │ ├───┼───────┼───────┤ │ ├───┼──────────┼───────┤ │
│   │        │            │ │ 0 │     0 │     5 │ │ │ 0 │ false    │     6 │ │
│   │        │            │ │ 1 │    10 │     4 │ │ │ 1 │ true     │     6 │ │
│   │        │            │ │ 2 │     5 │     1 │ │ ╰───┴──────────┴───────╯ │
│   │        │            │ ╰───┴───────┴───────╯ │                          │
│ 3 │ middle │         10 │ ╭───┬───────┬───────╮ │ ╭───┬──────────┬───────╮ │
│   │        │            │ │ # │ years │ count │ │ │ # │ required │ count │ │
│   │        │            │ ├───┼───────┼───────┤ │ ├───┼──────────┼───────┤ │
│   │        │            │ │ 0 │     3 │     4 │ │ │ 0 │ false    │     4 │ │
│   │        │            │ │ 1 │     5 │     3 │ │ │ 1 │ true     │     6 │ │
│   │        │            │ │ 2 │     2 │     2 │ │ ╰───┴──────────┴───────╯ │
│   │        │            │ ╰───┴───────┴───────╯ │                          │
╰───┴────────┴────────────┴───────────────────────┴──────────────────────────╯

Extraction of the most common requirements wasn't as easy as previous steps. So I've met a classification problem, and I'm going to describe my solution in the next chapter of this article.

Conclusion

We successfully extracted and analyzed job data from LinkedIn using the linkedin_jobs_scraper package. Responsibilities in the actual dataset are too sparse and need better processing to make functional classes that will help in CV creation. But the given steps already help me a lot with monitoring and applying to the jobs in half-auto mode.

以上是Exploring Job Market for Software Engineers的详细内容。更多信息请关注PHP中文网其他相关文章!

声明
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系admin@php.cn
Python中的合并列表:选择正确的方法Python中的合并列表:选择正确的方法May 14, 2025 am 12:11 AM

Tomergelistsinpython,YouCanusethe操作员,estextMethod,ListComprehension,Oritertools

如何在Python 3中加入两个列表?如何在Python 3中加入两个列表?May 14, 2025 am 12:09 AM

在Python3中,可以通过多种方法连接两个列表:1)使用 运算符,适用于小列表,但对大列表效率低;2)使用extend方法,适用于大列表,内存效率高,但会修改原列表;3)使用*运算符,适用于合并多个列表,不修改原列表;4)使用itertools.chain,适用于大数据集,内存效率高。

Python串联列表字符串Python串联列表字符串May 14, 2025 am 12:08 AM

使用join()方法是Python中从列表连接字符串最有效的方法。1)使用join()方法高效且易读。2)循环使用 运算符对大列表效率低。3)列表推导式与join()结合适用于需要转换的场景。4)reduce()方法适用于其他类型归约,但对字符串连接效率低。完整句子结束。

Python执行,那是什么?Python执行,那是什么?May 14, 2025 am 12:06 AM

pythonexecutionistheprocessoftransformingpypythoncodeintoExecutablestructions.1)InternterPreterReadSthecode,ConvertingTingitIntObyTecode,whepythonvirtualmachine(pvm)theglobalinterpreterpreterpreterpreterlock(gil)the thepythonvirtualmachine(pvm)

Python:关键功能是什么Python:关键功能是什么May 14, 2025 am 12:02 AM

Python的关键特性包括:1.语法简洁易懂,适合初学者;2.动态类型系统,提高开发速度;3.丰富的标准库,支持多种任务;4.强大的社区和生态系统,提供广泛支持;5.解释性,适合脚本和快速原型开发;6.多范式支持,适用于各种编程风格。

Python:编译器还是解释器?Python:编译器还是解释器?May 13, 2025 am 12:10 AM

Python是解释型语言,但也包含编译过程。1)Python代码先编译成字节码。2)字节码由Python虚拟机解释执行。3)这种混合机制使Python既灵活又高效,但执行速度不如完全编译型语言。

python用于循环与循环时:何时使用哪个?python用于循环与循环时:何时使用哪个?May 13, 2025 am 12:07 AM

useeAforloopWheniteratingOveraseQuenceOrforAspecificnumberoftimes; useAwhiLeLoopWhenconTinuingUntilAcIntiment.ForloopSareIdeAlforkNownsences,而WhileLeleLeleLeleLoopSituationSituationSituationsItuationSuationSituationswithUndEtermentersitations。

Python循环:最常见的错误Python循环:最常见的错误May 13, 2025 am 12:07 AM

pythonloopscanleadtoerrorslikeinfiniteloops,modifyingListsDuringteritation,逐个偏置,零indexingissues,andnestedloopineflinefficiencies

See all articles

热AI工具

Undresser.AI Undress

Undresser.AI Undress

人工智能驱动的应用程序,用于创建逼真的裸体照片

AI Clothes Remover

AI Clothes Remover

用于从照片中去除衣服的在线人工智能工具。

Undress AI Tool

Undress AI Tool

免费脱衣服图片

Clothoff.io

Clothoff.io

AI脱衣机

Video Face Swap

Video Face Swap

使用我们完全免费的人工智能换脸工具轻松在任何视频中换脸!

热门文章

热工具

适用于 Eclipse 的 SAP NetWeaver 服务器适配器

适用于 Eclipse 的 SAP NetWeaver 服务器适配器

将Eclipse与SAP NetWeaver应用服务器集成。

SublimeText3 英文版

SublimeText3 英文版

推荐:为Win版本,支持代码提示!

SecLists

SecLists

SecLists是最终安全测试人员的伙伴。它是一个包含各种类型列表的集合,这些列表在安全评估过程中经常使用,都在一个地方。SecLists通过方便地提供安全测试人员可能需要的所有列表,帮助提高安全测试的效率和生产力。列表类型包括用户名、密码、URL、模糊测试有效载荷、敏感数据模式、Web shell等等。测试人员只需将此存储库拉到新的测试机上,他就可以访问到所需的每种类型的列表。

SublimeText3 Mac版

SublimeText3 Mac版

神级代码编辑软件(SublimeText3)

安全考试浏览器

安全考试浏览器

Safe Exam Browser是一个安全的浏览器环境,用于安全地进行在线考试。该软件将任何计算机变成一个安全的工作站。它控制对任何实用工具的访问,并防止学生使用未经授权的资源。