GetPostContent

GetPostContent is a Python script that automates the process of scraping content from a list of URLs and saving the extracted content into structured .docx files. The content is organized based on HTML tags like <h1>, <h2>, <p>, etc., and includes the source URL at the top of each document.

Features

Automatic URL Processing: The script processes multiple URLs provided in a list.
Content Structuring: Extracted content is structured into headings, paragraphs, and lists.
Error Handling: The script includes retry logic for handling network errors and incomplete responses.
File Naming: Each .docx file is named based on the <h1> heading of the corresponding URL.

Requirements

Ensure you have the following installed:

Python 3.7+
Required Python packages:
- requests
- beautifulsoup4
- python-docx

You can install the necessary packages using pip:

pip install requests beautifulsoup4 python-docx

Setup

Clone the Repository: Clone this repository to your local machine.

git clone https://github.com/andersonkevin/getpostcontent.git

Navigate to the Project Directory:

cd getpostcontent

Create a Virtual Environment (optional but recommended):

python3 -m venv env
source env/bin/activate  # On Windows, use `env\Scripts\activate`

Install Dependencies:

pip install -r requirements.txt

Usage

Add URLs: Open the script file (getpostcontent.py) and add the URLs you want to scrape to the urls list.

urls = [
    'https://www.example.com/article-1',
    'https://www.example.com/article-2',
    # Add more URLs here
]

Run the Script:

python getpostcontent.py

The script will process each URL, extract the content, and save it as a .docx file named after the article's <h1> title. The source URL will be included at the top of each document.

Error Handling

The script includes retry logic to handle network issues, such as incomplete responses or connection errors.
If a URL cannot be processed after multiple attempts, the script will skip it and continue with the next URL.

Contributing

Contributions are welcome! If you'd like to improve the script or add new features, please fork the repository and submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For any questions or issues, please open an issue on GitHub or contact the repository owner.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
getpostcontent.py		getpostcontent.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GetPostContent

Features

Requirements

Setup

Usage

Error Handling

Contributing

License

Contact

About

Releases

Packages

Languages

License

andersonkevin/getpostcontent

Folders and files

Latest commit

History

Repository files navigation

GetPostContent

Features

Requirements

Setup

Usage

Error Handling

Contributing

License

Contact

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages