I'm trying to scrape a website using BeautifulSoup in Python. All data is ingested, including all links I try to access. However, when I use the .findAll() function, it only returns part of the link I'm looking for. That is to say, only the links in the following xpath are returned
/html/body/div[1]/div/div[2]/div/div[2]/div[1]
This will ignore links in /html/body/div[1]/div/div[2]/div/div[2]/div[2] /html/body/div[1]/div/div[2]/div/div[2]/div[3] etc
import requests from bs4 import BeautifulSoup url = "https://www.riksdagen.se/sv/ledamoter-och-partier/ledamoterna/" response = requests.get(url) content = BeautifulSoup(response.content, "html.parser") mp_pages = [] mps = content.findAll(attrs = {'class': 'sc-907102a3-0 sc-e6d2fd61-0 gOAsvA jBTDjv'}) for x in mps: mp_pages.append(x.get('href')) print(mp_pages)
I want all links to be appended to the mp_pages list, but it only goes to one parent (the links starting with A) and seems to stop at the last child instead of continuing.
I've seen similar questions where the answer was to use selenium due to javascript, but since all the links are within the content it doesn't make sense.
P粉5534287802023-09-15 11:25:57
The data you see on the page is stored as Json in elements. To parse it you can use next example:
import json import requests import pandas as pd from bs4 import BeautifulSoup url = 'https://www.riksdagen.se/sv/ledamoter-och-partier/ledamoterna/' soup = BeautifulSoup(requests.get(url).content, 'html.parser') data = json.loads(soup.select_one('#__NEXT_DATA__').text) # print(json.dumps(data, indent=4)) all_data = [] for c in data['props']['pageProps']['contentApiData']['commissioners']: all_data.append((f'{c["callingName"]} {c["surname"]}', c['url'])) df = pd.DataFrame(all_data, columns=['Name', 'URL']) print(df)
Print:
Name URL 0 Fredrik Ahlstedt https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/fredrik-ahlstedt_8403346f-0f0c-4d48-bbd0-f6b43b368873/ 1 Emma Ahlström Köster https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/emma-ahlstrom-koster_e09d9076-28c7-4583-a17f-7a776de7f01f/ 2 Alireza Akhondi https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/alireza-akhondi_4099ff9c-5d27-4605-b018-98fb229d94fa/ 3 Anders Alftberg https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/anders-alftberg_f0d945f3-9449-458e-ba40-1a0da1a72303/ 4 Leila Ali Elmi https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/leila-ali-elmi_5997ba96-4f01-46f4-8bd8-e1411a9d503b/ 5 Janine Alm Ericson https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/janine-alm-ericson_7e408079-a5cd-432a-a30e-fd61fd15c65a/ 6 Ann-Sofie Alm https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/ann-sofie-alm_f91f6a86-591c-449c-b3dd-1fdaa86338cd/ 7 Sofia Amloh https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/sofia-amloh_359e75f3-519e-49d7-b155-ada488e621ea/ 8 Andrea Andersson Tay https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/andrea-andersson-tay_352b875d-e44d-43f5-bf93-e507770c12de/ ...and so on.