GitHub Action to encourage behavior changes from AI/ML scrapers that disrespect robots.txt

gha-utilities/ai-bait

AI Bait

GitHub Action to encourage behavior changes from AI/ML scrapers that disrespect robots.txt




Requirements

Access to GitHub Actions if running on GitHub; otherwise, manually assign the expected environment variables before running the entrypoint.sh script.
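As a rough sketch of running outside GitHub Actions: GitHub passes `with:` inputs to Docker-based actions as `INPUT_<NAME>` environment variables, so the names below (`INPUT_BS`, `INPUT_COUNT`, `INPUT_DESTINATION`) are an assumption derived from that convention, not something confirmed from this repository. Check entrypoint.sh for the variables it actually reads.

```shell
#!/bin/sh
# Assumption: entrypoint.sh reads its inputs as INPUT_<NAME> variables,
# the convention GitHub Actions uses for Docker-based actions.
export INPUT_BS='512'
export INPUT_COUNT='10000'
export INPUT_DESTINATION='_site/assets/ai/bait.zip'

# Make sure the directory that will hold the generated file exists
mkdir -p "$(dirname "${INPUT_DESTINATION}")"

# Run the script only when executed from a clone of the repository
if [ -x ./entrypoint.sh ]; then
  ./entrypoint.sh
else
  echo 'entrypoint.sh not found; run this from a clone of the repository' >&2
fi
```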


Quick Start

Include, and modify, the following within your repository's workflow that publishes to GitHub Pages:

      - name: Make something nasty for bots
        uses: gha-utilities/ai-bait@v0.0.4
        with:
          bs: 512
          count: 10000
          destination: _site/assets/ai/bait.zip

⚠️ Be sure to update your robots.txt to disallow all user-agents from any path that destination resolves to!
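For a rough sense of what the `bs` and `count` inputs above produce, the following sketch assumes they mirror the `dd` options of the same names (block size in bytes and number of blocks). That mapping is an assumption based on the input names, and `expanded_size` is a hypothetical helper, not part of this Action.

```shell
#!/bin/sh
# Hypothetical helper: uncompressed size in bytes, assuming the
# inputs behave like dd's bs (bytes per block) and count (blocks).
expanded_size() {
  echo $(( $1 * $2 ))
}

# Quick Start values: 512 bytes per block, 10000 blocks
expanded_size 512 10000   # prints 5120000
```

Under that assumption, the Quick Start values expand to roughly 5 MB, so larger values are needed for the file to be costly to unpack.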


Usage

Reference the code of this repository within your own workflow...

Example GitHub Pages -- Jekyll

.github/workflows/github-pages.yaml

on:
  push:
    branches: [ gh-pages ]

permissions:
  contents: read
  pages: write
  id-token: write

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout source
        uses: actions/checkout@v4
        with:
          fetch-depth: 1
          fetch-tags: true
          ref: ${{ github.head_ref }}

      # ↓ Do some site building here ↓
      - name: Setup pages
        uses: actions/configure-pages@v5.0.0
      - name: Build pages
        uses: actions/jekyll-build-pages@v1
      # ↑ Do some site building here ↑

      - name: Make _sweet_ for bots
        uses: gha-utilities/ai-bait@v0.0.4
        with:
          bs: 10g
          count: 10000
          destination: _site/assets/ai/bait.zip

      - name: Upload artifact
        uses: actions/upload-pages-artifact@v3.0.1

  deploy:
    runs-on: ubuntu-latest
    needs: build

    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}

    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4.0.5

Notes

This repository may not be feature complete and/or fully functional. Pull Requests that add features or fix bugs are certainly welcome.

To avoid causing issues for legitimate/authorized web-scrapers, and other tools, be sure to keep your site's robots.txt up to date! The simplest option is to tell all robots not to access the file(s) generated by AI Bait:

User-agent: *
Disallow: /assets/ai/bait.zip
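A workflow step could guard against forgetting this rule. The sketch below is a minimal check, not part of AI Bait: `robots_disallows` is a hypothetical helper that looks for an exact `Disallow:` line, and it does not verify which `User-agent` group the line belongs to or handle CRLF line endings.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the robots.txt file in $1
# contains an exact "Disallow: <path>" line for the path in $2.
robots_disallows() {
  grep -qxF "Disallow: $2" "$1"
}

# Demonstration against a temporary robots.txt
tmp_robots="$(mktemp)"
printf 'User-agent: *\nDisallow: /assets/ai/bait.zip\n' > "${tmp_robots}"

if robots_disallows "${tmp_robots}" '/assets/ai/bait.zip'; then
  echo 'bait path is disallowed'
else
  echo 'robots.txt is missing a Disallow rule for the bait path' >&2
fi

rm -f "${tmp_robots}"
```

In a real workflow the first argument would be the built robots.txt, for example a file under `_site/`, and a failed check would typically stop the job before deployment.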

To prevent duplicate deployments caused by default gh-pages branch behavior, be sure to update the repository Settings → Pages → GitHub Pages → Build and deployment → Source configuration to use "GitHub Actions"

https://github.com/<ACCOUNT>/<REPO>/settings/pages

To reduce the chance of allowed bots accidentally scraping files generated by AI Bait that may be linked to within a site, set the rel HTML attribute to nofollow on such links, eg;

<a href="assets/ai/bait.zip" rel="nofollow">assets/ai/bait.zip</a>


Contributing

Options for contributing to AI Bait and gha-utilities


Forking

Start by making a Fork of this repository to an account that you have write permissions for.

  • Add remote for fork URL. The URL syntax is git@github.com:<NAME>/<REPO>.git...
cd ~/git/hub/gha-utilities/ai-bait

git remote add fork git@github.com:<NAME>/ai-bait.git
  • Commit your changes and push to your fork, eg. to fix an issue...
cd ~/git/hub/gha-utilities/ai-bait

git commit -F- <<'EOF'
:bug: Fixes #42 Issue

**Edits**

- `<SCRIPT-NAME>` script, fixes some bug reported in issue
EOF

git push fork main

Note, the -u option may be used to set fork as the default remote, eg. git push -u fork main. However, this also makes fork the default remote for pulling, meaning that pulling updates from origin must be done explicitly, eg. git pull origin main

  • Then on GitHub submit a Pull Request through the Web-UI, the URL syntax is https://github.com/<NAME>/<REPO>/pull/new/<BRANCH>

Note; to decrease the chances of your Pull Request needing modifications before being accepted, please check the dot-github repository for detailed contributing guidelines.


Sponsor

Thanks for even considering it!

Via Liberapay you may sponsor on a repeating basis.

Regardless of whether you're able to financially support projects such as AI Bait that gha-utilities maintains, please consider sharing projects that are useful with others, because one of the goals of maintaining Open Source repositories is to provide value to the community.


Attribution


License

GitHub Action to encourage behavior changes from AI/ML scrapers that disrespect robots.txt
Copyright (C) 2024 S0AndS0

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published
by the Free Software Foundation, version 3 of the License.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program.  If not, see <https://www.gnu.org/licenses/>.

For further details, review the full-length version of the AGPL-3.0 License.
