Python: how do I get a tree of all XPaths on a website?

Approach 1

While trying to get a hierarchical tree of all the XPaths on a website (https://startpagina.nl) with Python, I first tried to get the XPaths under the /html/body branch using the following:

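from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://startpagina.nl'

driver = webdriver.Firefox()
driver.get(url)

# find_elements_by_xpath() was removed in Selenium 4.3; use find_elements(By.XPATH, ...)
test = driver.find_elements(By.XPATH, '//*')
print(len(test))
# quit() ends the whole browser session, not just the current window
driver.quit()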

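Following @Prophet's answer, this produces a list of all elements on the page. However, I have not yet worked out how to get the XPath of each element, or how to arrange them into a tree structure.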
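One way to get an XPath per element, sketched here with only the standard library: parse the page HTML with `html.parser` and track the stack of open elements, counting same-named siblings to build positional steps. The class name `XPathCollector` and the function `collect_xpaths` are hypothetical helpers, not part of Selenium or any library. This parser does not apply the error correction a browser does, so for malformed HTML the paths may differ from what the browser reports.

```python
from html.parser import HTMLParser

# HTML void elements never have a closing tag, so they are not kept on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class XPathCollector(HTMLParser):
    """Hypothetical helper: records an absolute XPath for every start tag."""

    def __init__(self):
        super().__init__()
        self.xpaths = []     # absolute XPaths in document order
        self._stack = []     # XPath of each currently open element
        self._counts = [{}]  # _counts[i + 1] counts the children of _stack[i]

    def handle_starttag(self, tag, attrs):
        seen = self._counts[-1]
        seen[tag] = seen.get(tag, 0) + 1
        parent = self._stack[-1] if self._stack else ""
        path = f"{parent}/{tag}[{seen[tag]}]"
        self.xpaths.append(path)
        if tag not in VOID_TAGS:
            self._stack.append(path)
            self._counts.append({})

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name, if there is one.
        for i in range(len(self._stack) - 1, -1, -1):
            last_step = self._stack[i].rsplit("/", 1)[1]
            if last_step.split("[")[0] == tag:
                del self._stack[i:]
                del self._counts[i + 1:]
                break


def collect_xpaths(html):
    """Return the absolute XPath of every element in the given HTML string."""
    parser = XPathCollector()
    parser.feed(html)
    parser.close()
    return parser.xpaths


if __name__ == "__main__":
    sample = "<html><body><div>a</div><div><p>b<br>c</p></div></body></html>"
    for xpath in collect_xpaths(sample):
        print(xpath)
```

With the Selenium session above, `driver.page_source` could be passed to `collect_xpaths` in place of the sample string. Every step carries an explicit index (`/html[1]/body[1]`), which is valid XPath but more verbose than the form browsers usually show.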

And using /html/body/div[6] instead yields a result of length 1 rather than a tree.

Approach 2

Following @Micheal Kay's answer, I tried to "walk the XML" with the following Python code:

import requests
from bs4 import BeautifulSoup
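# xml.etree.cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically.
# BeautifulSoup's "xml" parser still requires lxml to be installed, but it does not need to be imported here.
import xml.etree.ElementTree as ET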
 
 
unformatted_filename = "first.xml"
# Use a separate output file so the raw download is not overwritten
formatted_filename = "first_formatted.xml"
 
# Get XML from url.
resp = requests.get("https://startpagina.nl")
# resp = requests.get('https://stackoverflow.com')
with open(unformatted_filename, "wb") as foutput:
    foutput.write(resp.content)
 
# Improve XML formatting
with open(unformatted_filename) as fp:
    soup = BeautifulSoup(fp, "xml")
    print(f"soup={soup}")
    with open(formatted_filename, "w") as f:
        f.write(soup.prettify())
 
 
# Parse XML
tree = ET.parse(formatted_filename, parser=ET.XMLParser(encoding="utf-8"))
root = tree.getroot()
for child in root:
    # The bare expression "child.tag, child.attrib" had no effect; print it instead
    print(child.tag, child.attrib)
 
tree = ET.parse(formatted_filename)
for elem in tree.iter():  # getiterator() was removed in Python 3.9
    if elem.tag:
        print("my name:")
        print("\t" + elem.tag)
    if elem.text:
        print("my text:")
        print("\t" + (elem.text).strip())
    if elem.attrib.items():
        print("my attributes:")
        for key, value in elem.attrib.items():
            print("\t" + "\t" + key + " : " + value)
    if list(elem):  # use elem.getchildren() for python2.6 or before
        print("my no of child: %d" % len(list(elem)))
    else:
        print("No child")
    if elem.tail:
        print("my tail:")
        print("\t" + "%s" % elem.tail.strip())
    print("$$$$$$$$$$")

However, I have not yet worked out how to get the XPath of each individual element.
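A minimal sketch of how the XPath of each element could be derived while walking an `ElementTree`: recurse from the root, carry the parent's path down, and add a positional index only when a tag occurs more than once among its siblings. The function name `iter_xpaths` is hypothetical. It assumes the document parsed as well-formed XML and has no namespaces (namespaced tags appear as `{uri}tag`, which is not a valid XPath step). Since every element in a parsed document has exactly one parent, the structure has no cycles and the recursion terminates.

```python
import xml.etree.ElementTree as ET


def iter_xpaths(elem, path=None, depth=0):
    """Yield (depth, xpath) for elem and all of its descendants, depth-first."""
    if path is None:
        path = "/" + elem.tag
    yield depth, path

    # Count same-named siblings so that only repeated tags get an index.
    totals = {}
    for child in elem:
        totals[child.tag] = totals.get(child.tag, 0) + 1

    seen = {}
    for child in elem:
        seen[child.tag] = seen.get(child.tag, 0) + 1
        step = child.tag
        if totals[child.tag] > 1:
            step += f"[{seen[child.tag]}]"
        yield from iter_xpaths(child, f"{path}/{step}", depth + 1)


if __name__ == "__main__":
    # Small inline document used only for illustration.
    root = ET.fromstring("<html><head/><body><div/><div><p/></div></body></html>")
    for depth, xpath in iter_xpaths(root):
        print("  " * depth + xpath)
```

In the code above, `tree.getroot()` could be passed to `iter_xpaths` instead of the inline sample.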

Question

So I would like to ask:

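How can I get a tree of all the XPaths on a website using Python? (I also wonder whether this tree can be cyclic, though I expect I will find out once I know how to obtain it.)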

Expected output

Based on browsing the HTML manually, I would expect the output to look something like this:

| /html
 
|-- //*[@id="browser-upgrade-notification"]
 
|-- //*[@id="app"]
 
|-- /html/head
 
|-- /html/body
|--/-- /html/body/noscript
|--/-- /html/body/div[2]
 
|--/-- /html/body/header/section
|--/--/-- /html/body/header/section/div
|--/--/--/-- /html/body/header/section/div/div[1]
....

This would be an example of the tree listing.
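A listing in this style could be produced from a list of absolute XPaths by deriving each entry's depth from the number of `/` separators. The sketch below uses a hypothetical function `format_xpath_tree`. It assumes plain absolute paths such as `/html/body/div[2]`, so id-based shortcuts like `//*[@id="app"]`, or predicates that themselves contain a `/`, would need separate handling.

```python
def format_xpath_tree(xpaths):
    """Render absolute XPaths as a listing with one '--' marker per level below the root."""
    lines = []
    for xpath in xpaths:
        depth = xpath.count("/") - 1            # "/html" is at depth 0
        marker = "|" + "/".join(["--"] * depth)  # e.g. depth 2 gives "|--/--"
        lines.append(f"{marker} {xpath}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_xpath_tree([
        "/html",
        "/html/head",
        "/html/body",
        "/html/body/noscript",
    ]))
```

The input is expected in document order (parents before their children), which is the order a depth-first walk of the parsed page yields.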

P粉155832941 · 379 days ago · 443
