Scraping Data with urllib2 and BeautifulSoup in Python and Saving It to MongoDB

WBOY

Jun 13, 2016, 09:24 AM

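Beautiful Soup is a Python library for parsing HTML and XML. It builds a parse tree from a document and lets you navigate, search, and modify that tree in whatever way suits you. It copes well with malformed markup and still produces a usable parse tree.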

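The example uses the urllib2 and bs4 modules to scrape data from an HTML page. The fields collected are the title, the content, the stock names, the stock IDs, the publication time, and the number of onlookers.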
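Several of these fields (stock names, stock IDs, publication time, onlooker count) are pulled out of plain text with regular expressions. The following is a minimal sketch of that step using only the standard library. The helper name `parse_item_text` and the sample strings are hypothetical, and the 8-hour shift assumes the page displays UTC+8 local time.

```python
# -*- coding: utf-8 -*-
import datetime
import re


def parse_item_text(stocks_text, date_text, onlooker_text):
    """Hypothetical helper: turn the raw text of one item into typed fields."""
    # Runs of CJK characters are taken to be stock names.
    stock_name = re.findall(u"[\u4e00-\u9fa5]+", stocks_text)
    # Runs of digits are taken to be stock IDs.
    stock_id = re.findall(r"\d+", stocks_text)
    # Keep the span from the first to the last word character, then parse it.
    stamp = re.search(r"\w+.*\w+", date_text).group()
    local_time = datetime.datetime.strptime(stamp, "%Y-%m-%d %H:%M")
    # Assumption: the page shows UTC+8 local time; shift it to UTC for storage.
    update_time = local_time - datetime.timedelta(hours=8)
    # The first run of digits in the onlooker text is used as the count.
    onlooker = int(re.search(r"\d+", onlooker_text).group())
    return {
        "stock_name": stock_name,
        "stock_id": stock_id,
        "update_time": update_time,
        "onlooker": onlooker,
    }


# Made-up sample strings standing in for the text of one scraped item.
record = parse_item_text(
    u"中国平安(601318) 贵州茅台(600519)",
    u" 2016-06-13 09:24 ",
    u"围观 128",
)
```

Note that the regular expression escapes need their backslashes (`\u4e00`, `\d`, `\w`). Without them the patterns match the literal letters `u`, `d`, and `w` instead.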

Example code:

# -*- coding: utf-8 -*-
# Python 2 script: urllib2 does not exist in Python 3.
import time
from bs4 import BeautifulSoup
import urllib2
import pymongo
import re
import datetime

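def update():
    # Connect to MongoDB (pymongo.Connection was removed in PyMongo 3; MongoClient replaces it)
    connection = pymongo.MongoClient('192.168.1.2', 27017)
    # Use the test_hq database; it is created on first write if it does not exist
    db = connection.test_hq
    for i in soup.find_all("div", class_="item"):
        # Start a fresh document for every item so no field leaks between items
        datas = {}
        # Use the HTML page name as the document _id
        datas['_id'] = str(i.h2.a['href']).split('/')[-1].split('.')[0]
        # Title
        datas['title'] = i.h2.get_text()
        # URL of the full article behind the title
        url2 = i.h2.a['href']
        html2 = urllib2.urlopen(url2)
        html_doc2 = html2.read()
        soup2 = BeautifulSoup(html_doc2, 'html.parser')
        # Article content, read from the tag whose name attribute is "description"
        datas['content'] = soup2.find(attrs={"name": "description"})['content']
        stocks_text = i.find(class_="stocks").get_text()
        # Names of the affected stocks, stored as an array (MongoDB supports array fields)
        datas['stock_name'] = re.findall(u"[\u4e00-\u9fa5]+", stocks_text)
        # IDs of the affected stocks, in the same order as the names
        datas['stock_id'] = re.findall(r"\d+", stocks_text)
        # Publication time, shifted back 8 hours (presumably UTC+8 local time to UTC,
        # which is how MongoDB stores dates)
        datas['update_time'] = datetime.datetime.strptime(re.search(r"\w+.*\w+", i.find(class_="fl date").span.get_text()).group(), '%Y-%m-%d %H:%M') - datetime.timedelta(hours=8)
        # Number of onlookers
        datas['onlooker'] = int(re.search(r"\d+", i.find(class_="icons ic-wg").get_text()).group())
        # Insert the document, or overwrite the existing one with the same _id
        # (replace_one with upsert=True stands in for the deprecated save())
        db.test.replace_one({'_id': datas['_id']}, datas, upsert=True)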

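def get_data():
    # Use the newest item's HTML page name to decide whether there is an update
    title = str(soup.h2.a['href']).split('/')[-1].split('.')[0]
    try:
        with open('update.txt', 'r') as f:
            last_title = f.readline().strip()
    except IOError:
        # First run: update.txt does not exist yet
        last_title = ''
    if title == last_title:
        print 'currently no update', title
    else:
        with open('update.txt', 'w') as f:
            f.write(title)
        update()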

if __name__ == '__main__':
    while True:
        url = 'http://www.ipython.me/qingbao/'
        html = urllib2.urlopen(url)
        html_doc = html.read()
        soup = BeautifulSoup(html_doc, 'html.parser')
        get_data()
        # Refresh every 30 seconds
        time.sleep(30)
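
The update check in the script compares the newest page ID with the one stored in update.txt. The following is a minimal sketch of that logic in isolation, using only the standard library. The function name `is_new`, the temporary state file, and the sample IDs are hypothetical. Unlike the original listing, it treats a missing state file as "nothing seen yet" instead of raising an error.

```python
import os
import tempfile


def is_new(latest_id, state_path):
    """Return True and record latest_id if it differs from the stored ID."""
    last_id = ''
    # A missing state file means no ID has been recorded yet.
    if os.path.exists(state_path):
        with open(state_path) as f:
            last_id = f.readline().strip()
    if latest_id == last_id:
        return False
    # Remember the new ID so the next poll can compare against it.
    with open(state_path, 'w') as f:
        f.write(latest_id)
    return True


# Usage with a temporary state file and made-up page IDs.
state_path = os.path.join(tempfile.mkdtemp(), 'update.txt')
first = is_new('1001', state_path)   # no state file yet
second = is_new('1001', state_path)  # same ID as the last poll
third = is_new('1002', state_path)   # a newer page has appeared
```

Only the first and third calls report an update, so a polling loop built this way would trigger the scrape once per new page rather than on every refresh.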
