This Python script is designed to generate clean, printable PDFs from web pages. It utilizes Selenium WebDriver with Chrome to navigate to specified URLs, apply some basic cleanup to the page structure (optional), and then generates a PDF using Chrome's built-in PDF printing capabilities.
- Headless Chrome: Operates without a visible browser window (can be disabled for debugging).
- @media print Optimization: Takes advantage of websites' print stylesheets for cleaner PDFs.
- Customizable: Easily modify the script to remove specific elements or adjust page settings.
- Error Handling: Includes basic error handling for timeouts and other exceptions.
This script was used to download and convert the VISA pages from the Australian Department of Home Affairs website (https://immi.homeaffairs.gov.au) into PDFs.
-
Clone the Repository:
git clone https://your-repository-url.git cd your-repository-name
-
Install Dependencies:
pip install -r requirements.txt
-
Download Chrome Driver:
- Visit the Chrome for Testing website.
- Download the ChromeDriver version that matches your Chrome browser installation.
- Place the ChromeDriver executable (
chromedriver
) in your project's root folder (or modify thechrome_driver_path
variable in the script accordingly).
-
Verify Chrome Version:
- Open your Chrome browser and type
chrome://version/
in the address bar. - Make sure the version of Chrome Driver matches the version of your Chrome browser.
- Open your Chrome browser and type
-
Modify URL List (Optional):
Edit the
urls
array in the script to include the web pages you want to convert. -
Customize Cleaning (Optional):
The script includes example functions (
remove_elements_by_id
,remove_elements_by_class
) to demonstrate how to remove specific elements from the page before generating the PDF. Modify or add your own functions to tailor the cleaning process. -
Run the Script:
python your_script_name.py
The generated PDFs will be saved in a folder named generated_pdfs
in your project directory.
- Print Media Optimization: This script works best with websites that are well-optimized for printing (using
@media print
CSS). If a website's print view isn't ideal, you'll need to customize the cleaning process to remove unwanted elements. - DOM Manipulation: You can use Selenium's powerful DOM manipulation capabilities to interact with and modify the page content before generating the PDF. Refer to the Selenium documentation for more information.
- Error Handling: While the script includes basic error handling, it's recommended to add more robust error checks and logging for production use.