Download PDF files using Python's Requests and BeautifulSoup

王林
2023-08-30 15:25:06

Requests and Beautiful Soup are Python libraries that, used together, can find and download PDFs and other files linked from a web page. The Requests library sends HTTP requests and receives the responses. The Beautiful Soup library parses the HTML in a response so that the link to the downloadable PDF can be extracted. In this article, we will learn how to download a PDF using Requests and Beautiful Soup in Python.

Install dependencies

Before using the Requests and Beautiful Soup libraries in Python, we need to install them with the pip command. To install both libraries, run the following commands in the terminal.

pip install requests
pip install beautifulsoup4
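
As a quick check that the installation worked, both packages can be imported and asked for their version. This is only a minimal sketch; note that the beautifulsoup4 package is imported under the name bs4, and the versions printed depend on what pip installed.

```python
import bs4
import requests

# Both packages expose their version as a string attribute.
print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)
```

If either import raises ModuleNotFoundError, the package was probably installed for a different Python interpreter than the one running the script.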

Use Requests and Beautiful Soup to download a PDF

To download a PDF from the internet, first use the Requests library to fetch the web page that links to the PDF file. We can then use Beautiful Soup to parse the HTML response and extract the link to the PDF file. Because that link is often relative, it is combined with the URL of the page to get the full URL of the PDF file. Finally, we send a second GET request to that URL with Requests and save the response body to a file.
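
Combining the page URL with the extracted link is the step most likely to go wrong, because a link can be relative to the page, relative to the site root, or already absolute. The sketch below shows how urljoin from the standard library's urllib.parse module handles all three cases. The page URL and href values are hypothetical and chosen only for illustration.

```python
from urllib.parse import urljoin

# Hypothetical page URL and href values, for illustration only.
page_url = "https://example.com/docs/index.html"

hrefs = [
    "report.pdf",                          # relative to the page's directory
    "/files/report.pdf",                   # relative to the site root
    "https://cdn.example.com/report.pdf",  # already absolute
]

# urljoin resolves each href against the page URL, as a browser would.
resolved = [urljoin(page_url, href) for href in hrefs]
for href, absolute in zip(hrefs, resolved):
    print(f"{href!r} -> {absolute}")
```

Plain string concatenation of the page URL and the link would give a wrong address in all three of these cases, which is why urljoin is the safer choice.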

Example

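In the code below, replace "https://example.com/page-with-pdf" with the valid URL of a page that contains a link to the PDF file.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Step 1: Fetch the page that links to the PDF
url = 'https://example.com/page-with-pdf'
response = requests.get(url, timeout=30)

if response.status_code == 200:
   # Step 2: Parse the HTML and find the first link that ends in .pdf
   soup = BeautifulSoup(response.text, 'html.parser')
   link = None
   for a in soup.find_all('a', href=True):
      if a['href'].lower().endswith('.pdf'):
         link = a['href']
         break

   if link is None:
      print('Error: no PDF link found on the page.')
   else:
      # Step 3: Resolve the link against the page URL and download the PDF
      pdf_url = urljoin(url, link)
      pdf_response = requests.get(pdf_url, timeout=30)

      if pdf_response.status_code == 200:
         # Write the raw bytes of the response to a local file
         with open('document.pdf', 'wb') as f:
            f.write(pdf_response.content)
         print('PDF downloaded successfully.')
      else:
         print('Error:', pdf_response.status_code)
else:
   print('Error:', response.status_code)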

Output

PDF downloaded successfully.
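
The example above downloads a single PDF and holds the whole file in memory before writing it. A page may link to several PDFs, and some of them may be large, so the sketch below collects every PDF link on a page and streams a download to disk in chunks. The helper names find_pdf_links and download_file, the sample HTML, and the URLs are all hypothetical and not part of either library. The parsing step runs on an inline HTML string, so it can be tried without a network connection.

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def find_pdf_links(html, base_url):
    """Return absolute URLs for every <a href> whose path ends in .pdf."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(base_url, anchor["href"])
        # Check only the URL path, so query strings such as ?v=2 are ignored.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            links.append(absolute)
    return links


def download_file(url, path, chunk_size=8192):
    """Stream a URL to disk in chunks instead of loading it all into memory."""
    with requests.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()  # raise an exception on 4xx/5xx responses
        with open(path, "wb") as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)


# Hypothetical sample page, used so the parsing step needs no network call.
sample_html = """
<a href="/about">About</a>
<a href="files/guide.pdf">Guide</a>
<a href="https://example.com/reports/2023.PDF?v=2">Report</a>
"""
pdf_links = find_pdf_links(sample_html, "https://example.com/docs/")
print(pdf_links)
```

With a real page, each collected URL could then be passed to download_file, for example download_file(pdf_links[0], 'guide.pdf'). A file name ending in .pdf does not guarantee that the server actually returns a PDF, so checking the Content-Type header of the response is a reasonable extra safeguard.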

Conclusion

In this article, we discussed how to download PDF files from the internet using the Requests and Beautiful Soup libraries in Python. We use Requests to fetch a page that contains a link to a PDF file, Beautiful Soup to parse the page and extract that link, and Requests again to download the PDF and save it to disk.

Statement:
This article is reproduced from tutorialspoint.com.