
python crawler - a multithreaded Python crawler: with daemon=False the main program cannot exit, with daemon=True it exits

The code was tested under Python 2.7 and runs as-is.

Could an expert point me in the right direction? I have been reading about daemon for a long time but still cannot figure it out. Please also take a look at the comments I left under shomy's answer, where I added some details.

Problem description

If line 54 of mutiple.py is changed to worker.daemon = False, all the images still download, but afterwards the program hangs there and never exits.

$ python mutiple.py
Downloaded 253 images in total
Took 57.710124015808105s
...now it is stuck; the only way out is kill -9

Next I ran $ pstree -h | grep python, and clearly the main thread and its child threads have not exited. Why is that? queue.join() was called and returned, and the print statements ran successfully, so the child threads should have finished their work.

python(6591)-+-{python}(6596)
            |-{python}(6597)
            |-{python}(6598)
            |-{python}(6599)
            |-{python}(6600)
            |-{python}(6601)
            |-{python}(6602)
            '-{python}(6603)

The code of mutiple.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from Queue import Queue
from threading import Thread
from time import time
from itertools import chain
from download import setup_download_dir, get_links, download_link


class DownloadWorker(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # Get the work from the queue and expand the tuple
            item = self.queue.get()
            if item is None:
                break
            directory, link = item
            download_link(directory, link)
            self.queue.task_done()


def main():
    ts = time()

    url1 = 'http://www.toutiao.com/a6333981316853907714'
    url2 = 'http://www.toutiao.com/a6334459308533350658'
    url3 = 'http://www.toutiao.com/a6313664289211924737'
    url4 = 'http://www.toutiao.com/a6334337170774458625'
    url5 = 'http://www.toutiao.com/a6334486705982996738'
    download_dir = setup_download_dir('thread_imgs')
    # Create a queue to communicate with the worker threads
    queue = Queue()

    links = list(chain(
        get_links(url1),
        get_links(url2),
        get_links(url3),
        get_links(url4),
        get_links(url5),
    ))

    # Create 8 worker threads
    for x in range(8):
        worker = DownloadWorker(queue)
        # Setting daemon to True will let the main thread exit even though the
        # workers are blocking
        worker.daemon = True
        worker.start()

    # Put the tasks into the queue as a tuple
    for link in links:
        queue.put((download_dir, link))

    # Causes the main thread to wait for the queue to finish processing all
    # the tasks
    queue.join()
    print u'Downloaded {} images in total'.format(len(links))
    print u'Took {}s'.format(time() - ts)


if __name__ == '__main__':
    main()

"""
Downloaded 253 images in total
Took 57.710124015808105s
"""

The code of download.py:

#!/usr/bin/env python
import os
import requests
from pathlib import Path
from bs4 import BeautifulSoup


def get_links(url):
    '''
    return the links in a list
    '''
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")
    return [img.attrs.get('src') for img in
            soup.find_all('p', class_='img-wrap')
            if img.attrs.get('src') is not None]


def download_link(directory, link):
    '''
    download the img by the link and save it
    '''
    img_name = '{}.jpg'.format(os.path.basename(link))
    download_path = directory / img_name
    r = requests.get(link)
    with download_path.open('wb') as fd:
        fd.write(r.content)


def setup_download_dir(directory):
    '''
    set the dir and create a new dir if not exists
    '''
    download_dir = Path(directory)
    if not download_dir.exists():
        download_dir.mkdir()
    return download_dir

While a program runs, it executes a main thread. If the main thread creates a child thread, the two run independently along separate paths. When the main thread finishes and wants to exit, it checks whether its child threads have finished; if a child thread has not finished, the main thread waits for it before exiting. Sometimes, though, what we want is for the process to exit as soon as the main thread is done, whether or not the child threads have finished, and that is what setDaemon(True) is for.
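Given that behavior, the non-daemon version can be made to exit cleanly by letting the workers actually finish: DownloadWorker already breaks out of its loop when it receives None, so after queue.join() the main thread can feed one sentinel per worker and then join them. A minimal sketch of the tail of main(), assuming the workers are collected in a list named workers when they are created (the original code discards the references):

# keep references to the 8 workers when creating them
workers = []
for x in range(8):
    worker = DownloadWorker(queue)
    worker.daemon = False   # non-daemon: the process waits for them
    worker.start()
    workers.append(worker)

for link in links:
    queue.put((download_dir, link))

queue.join()   # returns once every queued task is marked done via task_done()

# the workers are still blocked in queue.get(); one None apiece makes
# each of them take the `if item is None: break` branch and return
for _ in workers:
    queue.put(None)
for worker in workers:
    worker.join()
# every non-daemon thread has now exited, so the process terminates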

迷茫 · 2763 days ago

All replies (1)

  • ringa_lee · 2017-04-18 09:50:59

    My understanding is as follows:

    1. setDaemon(True) marks a thread as a daemon thread: when the main thread exits, daemon child threads are killed along with it.

    2. queue.join() blocks the main thread until every item put into the queue has been marked done with task_done(); only then does the main thread continue. (It waits for the tasks, not for the threads themselves.)

    3. The threading module provides no way to kill a thread from the outside.

    Putting the three points above together: with setDaemon(False) the main thread waits for the child threads to exit, but the workers never exit, because each of them is blocked forever inside queue.get() once the queue is empty. That is why the process gets stuck; see the toy demonstration below.
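    A toy demonstration of that blocking, assuming Python 2 to match the question's code: the non-daemon thread below never returns from q.get(), so the process keeps running after the main thread finishes; flip daemon to True and the same program exits immediately.

    from Queue import Queue
    from threading import Thread
    from time import sleep

    q = Queue()
    t = Thread(target=q.get)  # blocks forever: nothing is ever put
    t.daemon = False          # flip to True and the process can exit
    t.start()
    sleep(1)
    print 'main thread done'  # this prints, yet the process keeps running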
