Scraping Interface is a cross-platform desktop application written in Python with the PyQt5 library. It provides a graphical interface for web scraping that lets users extract information from web pages. Its main features are:
- Web Scraping: Extract online data using a browser-like interface.
- Dynamic Browsing: Browse the web with Chromium and perform standard actions like navigation, page reloads and searching.
- XPath Selection: Highlight and select elements on web pages using generalized XPath expressions.
- Table Preview: Select data from sites and view it in a table format for easy extraction.
- Pagination Support: Extract data from multiple pages with consistent structures, including automatic handling of pagination buttons.
- Data Export: Save scraped data in popular formats such as Excel, CSV, JSON, or XML.
- Template Management: Save and load scraping configurations for reuse, allowing quick access to previously configured selections.
- Authentication Support: Securely store and use encrypted login credentials to access authenticated web pages.
- CAPTCHA Handling: Handle CAPTCHA-protected pages so that data extraction is not interrupted.
- Process Monitoring: Track and manage scraping processes, with progress indicators for ongoing tasks.
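The README does not describe how the application generalizes XPath expressions, so the following is only a minimal sketch of one common approach: given the XPaths of two selected elements, drop the positional index wherever the two paths differ, so that the result matches every sibling of the same kind. The function name `generalize_xpath` and the sample paths are hypothetical and are not taken from the project's code.

```python
import re

# One XPath step, e.g. "li[2]" -> name "li", index "2"; "div" -> no index.
_STEP = re.compile(r"^(?P<name>[^\[\]]+)(?:\[(?P<index>\d+)\])?$")


def generalize_xpath(first: str, second: str) -> str:
    """Merge two absolute XPaths into one that matches both elements.

    Hypothetical helper: steps that differ only in their positional index
    lose the index, so the result also matches the other siblings.
    """
    steps_a = first.strip("/").split("/")
    steps_b = second.strip("/").split("/")
    if len(steps_a) != len(steps_b):
        raise ValueError("XPaths have different depths")

    merged = []
    for a, b in zip(steps_a, steps_b):
        if a == b:
            merged.append(a)
            continue
        match_a, match_b = _STEP.match(a), _STEP.match(b)
        if not match_a or not match_b or match_a["name"] != match_b["name"]:
            raise ValueError(f"cannot merge steps {a!r} and {b!r}")
        merged.append(match_a["name"])  # drop the differing index
    return "/" + "/".join(merged)


# Two links picked from the same list (sample paths, for illustration only).
first = "/html/body/div[1]/ul/li[1]/a"
second = "/html/body/div[1]/ul/li[3]/a"
print(generalize_xpath(first, second))  # /html/body/div[1]/ul/li/a
```

Selecting two items of a list in this way yields an expression that covers the whole list, which is what makes a column of the preview table possible from only a couple of clicks.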
To install and run the Scraping Interface application from source, follow these steps:
- Clone the repository:
git clone https://github.com/gonzalopezgil/scraping-interface.git
- Change into the project directory:
cd scraping-interface
- Install the required dependencies:
pip install -r requirements.txt
- Run the application:
python main.py
Once the application is running, a typical session looks like this:
- Launch the application. The interface has four tabs: Home, Browser, Processes and Settings.
- Use the Browser tab to navigate, search and interact with sites.
- When you're ready to extract data, navigate to the desired web page and click the "Scrape" button. The program will display the extracted data in a table for preview.
- Customize your selection using generalized XPath expressions and modify the table as needed.
- Configure pagination settings, save templates for future use, and choose the data export format for each scraping process.
- Monitor the scraping processes in the Processes tab, and manage them by stopping them, stepping in when a manual action is required, or opening the output files.
- Adjust application settings, including browser preferences and language settings, in the Settings tab.
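To illustrate the export step in the workflow above, here is a minimal sketch that writes a small preview table to CSV and JSON using only the Python standard library. The `headers` and `rows` values and the `to_csv` and `to_json` helpers are hypothetical; the application's own export code may be organized differently, and the Excel and XML formats are left out of this sketch.

```python
import csv
import io
import json

# Hypothetical preview table: column headers plus one list per row.
headers = ["title", "price"]
rows = [["Book A", "12.50"], ["Book B", "8.99"]]


def to_csv(headers, rows) -> str:
    """Serialize the table as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(headers, rows) -> str:
    """Serialize the table as a JSON array with one object per row."""
    records = [dict(zip(headers, row)) for row in rows]
    return json.dumps(records, indent=2, ensure_ascii=False)


print(to_csv(headers, rows))
print(to_json(headers, rows))
```

Keeping the table as headers plus rows means each export format is just a different serializer over the same data.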
Contributions to the Scraping Interface project are welcome! If you encounter any issues or have suggestions for improvements, please open an issue or submit a pull request.
This project is licensed under the MIT License. Feel free to use, modify, and distribute the code.