
python - Questions about find(), method chaining, etc. in a web scraping example

An example from Web Scraping with Python:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

random.seed(datetime.datetime.now())
def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html)
    return bsObj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)

Question 1: return bsObj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

In this line, why can findAll be called right after find, i.e. why can it be written in the form XXX.find().findAll()?

Question 2: newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
In this line, what allows a construct like links[...].attrs[...]? What is the principle behind being able to write it this way?

I'm new to this, any guidance is appreciated. Thanks!

大家讲道理 asked 2898 days ago

Replies (2)

  • ringa_lee 2017-04-18 09:55:50

    It is recommended to use the Shenjianshou cloud crawler (http://www.shenjianshou.cn). Crawlers are written and run entirely in the cloud, so there is no development environment to configure, and development and deployment are fast.

    A complex crawler can be implemented in just a few lines of JavaScript, and the platform also provides many helper features: anti-crawler countermeasures, JS rendering, data publishing, chart analysis, hotlink protection, and so on. Shenjianshou handles these problems that commonly come up while developing crawlers for you.

  • 大家讲道理 2017-04-18 09:55:50

    find() returns a Tag object, which is itself a parsed chunk of HTML, so you can call find() or findAll() on the result again; that is why the chained form works.
    Likewise, once you index into a list, the value you get back can be used directly according to its own type, for example:

    a = ['ab',1,[1,2]]
    a[0].upper() # 'AB'
    a[2].append(1) # a == ['ab',1,[1,2,1]]
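    To connect this back to the BeautifulSoup code in the question, here is a minimal sketch (the HTML fragment and variable names are invented for illustration) showing both points: find() returns a Tag on which findAll() can be called again, and each element of the returned list is a Tag whose attributes are stored in the .attrs dictionary:

    from bs4 import BeautifulSoup

    # A made-up HTML fragment, only for demonstration
    html = '<div id="bodyContent"><a href="/wiki/Apple">Apple</a> <a href="/wiki/Pear">Pear</a></div>'
    bsObj = BeautifulSoup(html, "html.parser")

    div = bsObj.find("div", {"id": "bodyContent"})   # find() returns a Tag object
    links = div.findAll("a")                         # a Tag supports findAll(), so chaining works

    first = links[0]               # indexing the result list gives back a Tag
    print(first.attrs)             # {'href': '/wiki/Apple'} -- .attrs is a plain dict
    print(first.attrs["href"])     # /wiki/Apple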
    
