GitHub Action to encourage behavior changes from AI/ML scrapers that disrespect robots.txt
- ⬆️ Top of Document
- 🏗️ Requirements
- ⚡ Quick Start
- 🧰 Usage
- 🗒 Notes
- 📈 Contributing
- 📇 Attribution
- ⚖️ Licensing
Access to GitHub Actions if using on GitHub, or expected environment variables assigned manually prior to running the `entrypoint.sh` script.
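When running outside GitHub Actions, the inputs have to be provided as environment variables before invoking `entrypoint.sh`. A minimal sketch, assuming the `INPUT_<NAME>` naming convention GitHub applies to Docker-container action inputs; check `entrypoint.sh` for the variable names it actually reads:

```shell
# Hypothetical variable names: GitHub exposes action inputs to container
# actions as INPUT_<NAME>; verify against entrypoint.sh before relying on this.
export INPUT_BS=512
export INPUT_COUNT=10000
export INPUT_DESTINATION=_site/assets/ai/bait.zip
# ./entrypoint.sh   # then run the action's script with the variables in place
echo "bait would be written to $INPUT_DESTINATION"
```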
Include, and modify, the following within your repository's workflow that publishes to GitHub Pages:
```yaml
- name: Make something nasty for bots
  uses: gha-utilities/ai-bait@v0.0.4
  with:
    bs: 512
    count: 10000
    destination: _site/assets/ai/bait.zip
```
⚠️ Be sure to update your `robots.txt` to disallow all user-agents from any path that `destination` resolves to!
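The `bs` and `count` inputs behave like their `dd` counterparts: the uncompressed payload is roughly `bs × count` bytes (512 × 10000 ≈ 5 MB above), and a run of zeros compresses to almost nothing. A rough sketch of the general technique, using `gzip` for illustration, not the action's actual `entrypoint.sh`:

```shell
# Sketch only: shows why block-size x count of zero bytes makes cheap bait.
bs=512
count=10000
dd if=/dev/zero of=bait.bin bs="$bs" count="$count" status=none
gzip -9 -k bait.bin            # zeros compress extremely well
wc -c bait.bin bait.bin.gz     # ~5 MB raw vs a few KB compressed
```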
Reference the code of this repository within your own workflow, e.g. `.github/workflows/github-pages.yaml`:
```yaml
on:
  push:
    branches: [ gh-pages ]

permissions:
  contents: read
  pages: write
  id-token: write

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout source
        uses: actions/checkout@v4
        with:
          fetch-depth: 1
          fetch-tags: true
          ref: ${{ github.head_ref }}

      # ↓ Do some site building here ↓
      - name: Setup pages
        uses: actions/configure-pages@v5.0.0

      - name: Build pages
        uses: actions/jekyll-build-pages@v1
      # ↑ Do some site building here ↑

      - name: Make _sweet_ for bots
        uses: gha-utilities/ai-bait@v0.0.4
        with:
          bs: 10g
          count: 10000
          destination: _site/assets/ai/bait.zip

      - name: Upload artifact
        uses: actions/upload-pages-artifact@v3.0.1

  deploy:
    runs-on: ubuntu-latest
    needs: build
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4.0.5
```
This repository may not be feature complete and/or fully functional; Pull Requests that add features or fix bugs are certainly welcomed.
To prevent causing issues with legitimate/authorized web-scrapers and other tools, be sure to keep your site's `robots.txt` up to date! The simplest option is to tell all robots not to access the file(s) generated by AI Bait:
```
User-agent: *
Disallow: /assets/ai/bait.zip
```
To prevent duplicate deployments caused by default `gh-pages` branch behavior, be sure to update the repository Settings → Pages → GitHub Pages → Build and deployment → Source configuration to use "GitHub Actions":

`https://github.com/<ACCOUNT>/<REPO>/settings/pages`
To mitigate allowed bots accidentally scraping files generated by AI Bait that may be linked to within a site, leverage the `rel="nofollow"` HTML attribute, e.g.:

```html
<a href="assets/ai/bait.zip" rel="nofollow">assets/ai/bait.zip</a>
```
Options for contributing to AI Bait and gha-utilities
Start by making a Fork of this repository to an account that you have write permissions for.
- Add a remote for the fork URL. The URL syntax is `git@github.com:<NAME>/<REPO>.git`, e.g.

```shell
cd ~/git/hub/gha-utilities/ai-bait
git remote add fork git@github.com:<NAME>/ai-bait.git
```
- Commit your changes and push to your fork, e.g. to fix an issue...

```shell
cd ~/git/hub/gha-utilities/ai-bait
git commit -F- <<'EOF'
:bug: Fixes #42 Issue

**Edits**

- `<SCRIPT-NAME>` script, fixes some bug reported in issue
EOF

git push fork main
```
Note, the `-u` option may be used to set `fork` as the default remote, e.g. `git push -u fork main`; however, this will also default to the `fork` remote for pulling too! Meaning that pulling updates from `origin` must be done explicitly, e.g. `git pull origin main`.
- Then on GitHub submit a Pull Request through the Web-UI; the URL syntax is `https://github.com/<NAME>/<REPO>/pull/new/<BRANCH>`
Note: to decrease the chances of your Pull Request needing modifications before being accepted, please check the dot-github repository for detailed contributing guidelines.
Thanks for even considering it!
Via Liberapay you may financially support this project on a repeating basis.
Regardless of whether you're able to financially support projects such as AI Bait that gha-utilities maintains, please consider sharing projects that are useful with others, because one of the goals of maintaining Open Source repositories is to provide value to the community.
- GitHub -- `github-utilities/make-readme`
- GitHub Docs -- Metadata syntax for GitHub Actions: `outputs` for composite actions
- Stack Overflow -- How does one make a zip-bomb
GitHub Action to encourage behavior changes from AI/ML scrapers that disrespect robots.txt
Copyright (C) 2024 S0AndS0
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published
by the Free Software Foundation, version 3 of the License.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
For further details review the full-length version of the AGPL-3.0 License.