Home  >  Q&A  >  body text

Python implementation: How to get the tree structure of all XPaths in the website?

method one

When trying to get a hierarchical tree of all xpaths in a website (https://startpagina.nl) using Python, I first tried to get the xpaths of branches using: /html/body:

from selenium import webdriver

url = 'https://startpagina.nl'

driver = webdriver.Firefox()
driver.get(url)

test = driver.find_elements_by_xpath('//*')
print(len(test))
driver.close()

Based on @Prophet's answer, this will generate a list of all elements in the website. However, I haven't figured out how to get the xpath of these elements, nor how to sort them into a tree structure.

And the /html/body/div[6] option generates a tree of length 1 instead.

Method Two

Based on @Micheal Kay's answer, I tried "traversing xml" using the following Python code:

import requests
from bs4 import BeautifulSoup
import xml.etree.cElementTree as ET
from lxml import etree


unformatted_filename = "first.xml"
formatted_filename = "first.xml"

# Get XML from url.
resp = requests.get("https://startpagina.nl")
# resp = requests.get('https://stackoverflow.com')
with open(unformatted_filename, "wb") as foutput:
    foutput.write(resp.content)

# Improve XML formatting
with open(unformatted_filename) as fp:
    soup = BeautifulSoup(fp, "xml")
    print(f"soup={soup}")
    with open(formatted_filename, "w") as f:
        f.write(soup.prettify())


# Parse XML
tree = ET.parse(formatted_filename, parser=ET.XMLParser(encoding="utf-8"))
root = tree.getroot()
for child in root:
    child.tag, child.attrib

tree = ET.parse(formatted_filename)
for elem in tree.getiterator():
    if elem.tag:
        print("my name:")
        print("\t" + elem.tag)
    if elem.text:
        print("my text:")
        print("\t" + (elem.text).strip())
    if elem.attrib.items():
        print("my attributes:")
        for key, value in elem.attrib.items():
            print("\t" + "\t" + key + " : " + value)
    if list(elem):  # use elem.getchildren() for python2.6 or before
        print("my no of child: %d" % len(list(elem)))
    else:
        print("No child")
    if elem.tail:
        print("my tail:")
        print("\t" + "%s" % elem.tail.strip())
    print("$$$$$$$$$$")

However, I haven't figured out how to get the xpath of the individual elements.

question

So I want to ask:

How to use Python to get the tree of all xpaths in the website? (I wonder if the tree is cyclic, although I hope I will know once I figure out how to obtain the tree.).

Expected output

Based on manually browsing HTML: I want the output to look like this:

| /html

|-- //*[@id="browser-upgrade-notification"]

|-- //*[@id="app"]

|-- /html/head

|-- /html/body
|--/-- /html/body/noscript
|--/-- /html/body/div[2]

|--/-- /html/body/header/section
|--/--/-- /html/body/header/section/div
|--/--/--/-- /html/body/header/section/div/div[1]
....

This will be an example of a tree list.

P粉155832941P粉155832941240 days ago314

reply all(1)I'll reply

  • P粉127901279

    P粉1279012792024-02-22 13:34:15

    The total number of XPaths that select one or more elements is unlimited (for example, it will include paths like /a/b/../b/../b/../b) , but if you restrict yourself to paths of the form /a[i]/b[j]/c[k], then the number of paths equals the number of elements, and the "tree" of XPaths is the same as the original XML Tree isomorphism.

    If you want different paths without numeric predicates, such as /a/b/c, /a/b/d, then the easiest way is probably to iterate over the XML document , get the path of each element (in this form) and eliminate duplicates. If you want a tree structure instead of a simple list of paths, use nested maps/dictionaries to build it.

    The reason it complains about /html/body/ is that a legal XPath expression cannot contain the trailing /.

    reply
    0
  • Cancelreply