
Importing CSV Data with Multiple Python Processes

高洛峰 · Original · 2017-02-28 09:13:39 · 1759 views

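A while ago I helped a colleague import CSV data into MySQL. There were two large CSV files: one of 3 GB with 21 million records, and one of 7 GB with 35 million records. At this scale a simple single-process, single-threaded import takes a very long time, so in the end we implemented the import with multiple processes. I won't walk through the whole process in detail, but here are the key points: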

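Insert in batches rather than one row at a time
To speed up insertion, do not create the indexes beforehand
Use a producer-consumer model: the main process reads the file and several worker processes perform the inserts
Keep the number of workers under control so that MySQL is not put under too much pressure
Watch out for exceptions caused by dirty data
The source data is GBK-encoded, so remember to convert it to UTF-8
Use click to wrap everything as a command-line tool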
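The first two points can be illustrated in isolation with a minimal sketch. It uses the standard-library `sqlite3` module as a stand-in for MySQL so that it runs anywhere; the helper names `batched` and `load_in_batches`, the `demo` table, and the index name are all hypothetical and not part of the original script.

```python
import sqlite3
from itertools import islice


def batched(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_in_batches(conn, table, cols, rows, batch_size=5000):
    """Insert rows batch by batch; return the number of batches sent."""
    sql = 'INSERT INTO "{}" ({}) VALUES ({})'.format(
        table, ', '.join(cols), ', '.join(['?'] * len(cols)))
    batches = 0
    for chunk in batched(rows, batch_size):
        conn.executemany(sql, chunk)  # one call per batch, not per row
        conn.commit()
        batches += 1
    return batches


# Hypothetical demo table (sqlite3 stands in for MySQL here).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE demo (id INTEGER, name TEXT)')
rows = ((i, 'name-%d' % i) for i in range(12))
n_batches = load_in_batches(conn, 'demo', ['id', 'name'], rows, batch_size=5)

# The index is created only after the data has been loaded.
conn.execute('CREATE INDEX idx_demo_id ON demo (id)')
total = conn.execute('SELECT COUNT(*) FROM demo').fetchone()[0]
```

With a real MySQL server the placeholder style and the quoting differ (`%s` and backticks instead of `?` and double quotes), but the batching logic stays the same.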

The full implementation is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import codecs
import csv
import logging
import multiprocessing
import os
import warnings

import click
import MySQLdb
import sqlalchemy

warnings.filterwarnings('ignore', category=MySQLdb.Warning)

# Number of records per batch insert
BATCH = 5000

DB_URI = 'mysql://root@localhost:3306/example?charset=utf8'

engine = sqlalchemy.create_engine(DB_URI)


def get_table_cols(table):
  sql = 'SELECT * FROM `{table}` LIMIT 0'.format(table=table)
  res = engine.execute(sql)
  return res.keys()


def insert_many(table, cols, rows, db_engine):
  sql = 'INSERT INTO `{table}` ({cols}) VALUES ({marks})'.format(
      table=table,
      cols=', '.join(cols),
      marks=', '.join(['%s'] * len(cols)))
  # Each row is passed as its own parameter set, so the whole batch is
  # sent as a single executemany-style insert.
  db_engine.execute(sql, *rows)
  logging.info('process %s inserted %s rows into table %s', os.getpid(), len(rows), table)


def insert_worker(table, cols, queue):
  rows = []
  # Each child process creates its own engine object
  db_engine = sqlalchemy.create_engine(DB_URI)
  while True:
    row = queue.get()
    if row is None:
      # End-of-work signal: flush whatever is left, then exit
      if rows:
        insert_many(table, cols, rows, db_engine)
      break

    rows.append(row)
    if len(rows) == BATCH:
      insert_many(table, cols, rows, db_engine)
      rows = []


def insert_parallel(table, reader, w=10):
  cols = get_table_cols(table)

  # Data queue: the main process reads the file and puts rows on the queue,
  # and the worker processes take rows off it.
  # Bound the queue size so that slow consumers do not let rows pile up
  # and use too much memory.
  queue = multiprocessing.Queue(maxsize=w*BATCH*2)
  workers = []
  for i in range(w):
    p = multiprocessing.Process(target=insert_worker, args=(table, cols, queue))
    p.start()
    workers.append(p)
    logging.info('starting # %s worker process, pid: %s...', i + 1, p.pid)

  dirty_data_file = './{}_dirty_rows.csv'.format(table)
  xf = open(dirty_data_file, 'w')
  writer = csv.writer(xf, delimiter=reader.dialect.delimiter)

  for line in reader:
    # Record and skip dirty rows: field count does not match the column count
    if len(line) != len(cols):
      writer.writerow(line)
      continue

    # Replace the string 'NULL' with None
    clean_line = [None if x == 'NULL' else x for x in line]

    # Put the row on the queue
    queue.put(tuple(clean_line))
    if reader.line_num % 500000 == 0:
      logging.info('put %s tasks into queue.', reader.line_num)

  xf.close()

  # Send every worker the end-of-work signal
  logging.info('send close signal to worker processes')
  for i in range(w):
    queue.put(None)

  for p in workers:
    p.join()


def convert_file_to_utf8(f, rv_file=None):
  if not rv_file:
    name, ext = os.path.splitext(f)
    if isinstance(name, unicode):
      name = name.encode('utf8')
    rv_file = '{}_utf8{}'.format(name, ext)
  logging.info('start to process file %s', f)
  with open(f) as infd:
    with open(rv_file, 'w') as outfd:
      lines = []
      loop = 0
      chunk = 200000
      # The header line is only stripped of BOM bytes and whitespace;
      # it is written through without being transcoded.
      first_line = infd.readline().strip(codecs.BOM_UTF8).strip() + '\n'
      lines.append(first_line)
      for line in infd:
        clean_line = line.decode('gb18030').encode('utf8')
        clean_line = clean_line.rstrip() + '\n'
        lines.append(clean_line)
        # Write out in chunks to keep memory use bounded
        if len(lines) == chunk:
          outfd.writelines(lines)
          lines = []
          loop += 1
          logging.info('processed %s lines.', loop * chunk)

      outfd.writelines(lines)
      logging.info('processed %s lines.', loop * chunk + len(lines))


@click.group()
def cli():
  logging.basicConfig(level=logging.INFO,
                      format='%(asctime)s - %(levelname)s - %(name)s - %(message)s')


@cli.command('gbk_to_utf8')
@click.argument('f')
def convert_gbk_to_utf8(f):
  convert_file_to_utf8(f)


@cli.command('load')
@click.option('-t', '--table', required=True, help='table name')
@click.option('-i', '--filename', required=True, help='input file')
@click.option('-w', '--workers', default=10, help='number of workers, default 10')
def load_fac_day_pro_nos_sal_table(table, filename, workers):
  with open(filename) as fd:
    fd.readline()  # skip header
    reader = csv.reader(fd)
    insert_parallel(table, reader, w=workers)


if __name__ == '__main__':
  cli()
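
The conversion function in the script relies on Python 2 string semantics (`unicode`, `str.decode`). Under Python 3 the same idea might be sketched roughly as below, letting `open()` do the decoding and encoding. This is an illustrative sketch, not part of the original script: the function name `transcode_to_utf8` and the sample file names are hypothetical, and the source file is assumed to be readable as GB18030 (a superset of GBK).

```python
import os
import tempfile


def transcode_to_utf8(src, dst, src_encoding='gb18030'):
    """Stream src to dst line by line as UTF-8; return the line count."""
    count = 0
    # newline='' keeps the original line endings visible so that we can
    # normalise them ourselves with rstrip() + '\n'.
    with open(src, encoding=src_encoding, newline='') as infd, \
         open(dst, 'w', encoding='utf-8', newline='') as outfd:
        for line in infd:
            outfd.write(line.rstrip() + '\n')
            count += 1
    return count


# Small round trip on a temporary file standing in for the real CSV.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, 'sample.csv')
dst = os.path.join(tmpdir, 'sample_utf8.csv')
with open(src, 'wb') as f:
    f.write('id,name\r\n1,中文\r\n'.encode('gbk'))

n_lines = transcode_to_utf8(src, dst)
with open(dst, 'rb') as f:
    converted = f.read()
```

Because the file is processed one line at a time, memory use stays flat even for multi-gigabyte inputs, which is the same goal the chunked `writelines` calls serve in the script.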

That is everything I wanted to share in this article. I hope you find it useful.
