I'm trying to learn how to extract data from this url: https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview
However, the problem is that the URL doesn't change when I switch pages, so I'm not sure how to enumerate or loop over them. Since the page holds about 3,000 sales records, I'm looking for a better approach.
This is my starting code. It's very simple, but I'd appreciate any help or tips you can offer. I think I might need to switch to another package, but I'm not sure which one; maybe BeautifulSoup?
```python
import requests
import pandas as pd

url = "https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview"
html = requests.get(url).content
df_list = pd.read_html(html, header=1)[0]
df_list = df_list.drop([0, 1, 2])  # drop unneeded rows
```
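As an aside, `pd.read_html` parses every `<table>` in the HTML and returns a list of DataFrames, and `header=1` picks the second table row as the column names. A minimal offline sketch (the table contents here are made up for illustration, not real auction data):

```python
import pandas as pd

# A tiny HTML table standing in for the auction-results page (illustrative data only).
html = """
<table>
  <tr><th>ignore</th><th>ignore</th></tr>
  <tr><th>Parcel ID</th><th>Winning Bid</th></tr>
  <tr><td>0001234</td><td>$500.00</td></tr>
  <tr><td>0005678</td><td>$750.00</td></tr>
</table>
"""

# header=1 makes the second row the header, like the question's code.
df = pd.read_html(html, header=1)[0]
print(list(df.columns))  # ['Parcel ID', 'Winning Bid']
print(len(df))           # 2
```

This only helps once you have the right HTML in hand, though; it doesn't solve the pagination problem by itself.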
P粉600845163  2024-02-18 09:42:37
To get data from more pages, you can use the following example:
```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

# The page switches pages via a POST form, so send the form fields yourself:
data = {
    "folder": "auctionResults",
    "LoginID": "00",
    "pageNum": "1",
    "orderBy": "AdvNum",
    "orderDir": "asc",
    "justFirstCertOnGroups": "1",
    "doSearch": "true",
    "itemIDList": "",
    "itemSetIDList": "",
    "interest": "",
    "premium": "",
    "itemSetDID": "",
}

url = "https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview"

all_data = []
for data["pageNum"] in range(1, 3):  # <-- increase the number of pages here
    soup = BeautifulSoup(requests.post(url, data=data).content, "html.parser")
    for row in soup.select("#searchResults tr")[2:]:  # skip the two header rows
        tds = [td.text.strip() for td in row.select("td")]
        all_data.append(tds)

columns = [
    "SEQ NUM",
    "Tax Year",
    "Notice",
    "Parcel ID",
    "Face Amount",
    "Winning Bid",
    "Sold To",
]

df = pd.DataFrame(all_data, columns=columns)
# print the last 10 rows of the dataframe:
print(df.tail(10).to_markdown())
```
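The `range(1, 3)` above only fetches two pages. Since the site reportedly holds around 3,000 rows, one option is to keep posting increasing `pageNum` values until a page comes back with no result rows. A hedged sketch of that stopping rule (`fetch_page` is a stand-in for the `requests.post` call above, so the loop can be demonstrated without hitting the network):

```python
from bs4 import BeautifulSoup

def scrape_all_pages(fetch_page, max_pages=1000):
    """Collect result rows from numbered pages until one comes back empty.

    fetch_page(page_num) should return the HTML of that page, e.g. by
    wrapping requests.post(url, data={**data, "pageNum": str(page_num)}).
    """
    all_rows = []
    for page_num in range(1, max_pages + 1):
        soup = BeautifulSoup(fetch_page(page_num), "html.parser")
        rows = soup.select("#searchResults tr")[2:]  # skip the two header rows
        if not rows:  # empty page -> we're past the last page of results
            break
        for row in rows:
            all_rows.append([td.text.strip() for td in row.select("td")])
    return all_rows

# Offline demo: two fake pages of data, then an empty page (illustrative only).
def fake_fetch(page_num):
    if page_num > 2:
        return '<table id="searchResults"><tr></tr><tr></tr></table>'
    return f'''<table id="searchResults">
        <tr><th>h</th></tr><tr><th>h</th></tr>
        <tr><td>parcel-{page_num}</td><td>$100</td></tr>
    </table>'''

rows = scrape_all_pages(fake_fetch)
print(rows)  # [['parcel-1', '$100'], ['parcel-2', '$100']]
```

To use it against the real site, pass a function that does the `requests.post` with the form data shown above; whether the server really returns an empty `#searchResults` table past the last page is an assumption worth checking before relying on it.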
Prints: