Releases: simonw/s3-ocr

0.6.3

10 Aug 04:43
391e66c
  • Pages with no OCR text on them are now recorded as rows with empty strings, instead of being skipped entirely. #23
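The effect of this change on an indexed database can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's own code: the `pages` table, its columns, and the `save_pages` helper are hypothetical names chosen for the example.

```python
import sqlite3


def save_pages(db, path, page_count, page_texts):
    """Store one row per page, including pages that have no OCR text.

    `page_texts` maps page number -> text (hypothetical input shape).
    Pages missing from the mapping are stored with an empty string.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages "
        "(path TEXT, page INTEGER, text TEXT, PRIMARY KEY (path, page))"
    )
    for page in range(1, page_count + 1):
        # An empty string keeps the row, so blank pages stay visible
        text = page_texts.get(page, "")
        db.execute(
            "INSERT OR REPLACE INTO pages (path, page, text) VALUES (?, ?, ?)",
            (path, page, text),
        )
    db.commit()
```

With rows present for every page, page counts and page numbering in the database line up with the original document even when some pages are blank.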

0.6.2

09 Aug 20:35
ef546f2
  • Fixed bug where commands were sometimes not properly registered. #26

0.6.1

09 Aug 19:40
d107f0c
  • Now requires click>=8.0, which should avoid a bug where installing the tool on a machine with an older version of Click already present caused the commands to fail to register. #25
  • s3-ocr --help now includes links to the documentation and changelog.

0.6

07 Aug 17:42
a581534
  • s3-ocr start now automatically pauses and then retries if Textract complains that there are too many jobs running. This can be turned into an early exit with an error message using the new --no-retry option. #21
  • New s3-ocr start --dry-run option for displaying what would happen without starting the OCR process. #22
  • Textract now runs in the same region as the S3 bucket it is writing to, avoiding an error. #24
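The pause-and-retry behavior described above can be sketched roughly like this. It is an illustration only: `TooManyJobsError`, `start_with_retry`, and the 30-second pause are hypothetical stand-ins, not names or values taken from s3-ocr or boto3.

```python
import time


class TooManyJobsError(Exception):
    """Hypothetical stand-in for the error raised when the job limit is hit."""


def start_with_retry(start_job, retry=True, pause_seconds=30, sleep=time.sleep):
    """Call start_job(); on TooManyJobsError, pause and then try again.

    With retry=False the error propagates immediately, which mirrors the
    early-exit behavior described for --no-retry.
    """
    while True:
        try:
            return start_job()
        except TooManyJobsError:
            if not retry:
                raise
            # Wait before retrying so running jobs have time to finish
            sleep(pause_seconds)
```

Passing `sleep` as a parameter is a design choice for the sketch: it lets the pause be replaced in tests so they do not actually wait.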

0.5

19 Jul 02:35
fdd73f4
  • Ability to run OCR against just the PDF files contained within a specific folder in the S3 bucket, using s3-ocr start my-bucket --prefix my-prefix/. #20
  • New command: s3-ocr dedupe my-bucket scans the bucket for new files that duplicate files which have already been OCR'd, then writes out job results that reuse the existing OCR output so the duplicates are not processed a second time. #19
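One way duplicate detection of this kind could work is by comparing a per-file content fingerprint. The sketch below assumes such a fingerprint is available for each object (an S3 ETag would be one candidate); the release notes do not say how s3-ocr identifies duplicates, so the `find_duplicates` helper and its input shapes are hypothetical.

```python
def find_duplicates(objects, already_processed):
    """Map each unprocessed key to a processed key with the same fingerprint.

    `objects` is a list of (key, fingerprint) pairs and `already_processed`
    is a set of keys that already have OCR results. Both are assumptions.
    """
    processed_by_fingerprint = {}
    for key, fingerprint in objects:
        if key in already_processed:
            # Keep the first processed key seen for each fingerprint
            processed_by_fingerprint.setdefault(fingerprint, key)

    duplicates = {}
    for key, fingerprint in objects:
        if key not in already_processed and fingerprint in processed_by_fingerprint:
            duplicates[key] = processed_by_fingerprint[fingerprint]
    return duplicates
```

Each entry in the returned mapping names a new file and the existing file whose OCR results it could reuse.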

0.4

30 Jun 21:03
46712e9
  • New command: s3-ocr inspect-job <job_id> returns information about the status of a specific job. #15
  • Added a live demo at s3-ocr-demo.datasette.io. #16

0.3

30 Jun 00:44

First non-alpha release.

  • Breaking change: the order of arguments for s3-ocr index <bucket> <database_file> has been swapped, for consistency with other commands. #9
  • Breaking change: the start command no longer defaults to processing every .pdf file in the bucket. It now accepts a list of keys; use the --all option to process every PDF file. #10
  • New s3-ocr fetch <bucket> <path> command for fetching the raw OCR JSON data for that file. #7
  • New s3-ocr text <bucket> <path> command for outputting just the extracted OCR text for a specified file. #8
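The difference between the raw OCR JSON and the extracted text can be illustrated with a short sketch. It assumes the JSON follows Textract's block structure, in which each block carries a `BlockType` and `LINE` blocks carry a `Text` value; the `extract_text` helper is a hypothetical name, not part of s3-ocr.

```python
def extract_text(blocks):
    """Join the text of LINE blocks, one output line per block.

    PAGE and WORD blocks are skipped so each line of text appears once.
    """
    return "\n".join(
        block["Text"] for block in blocks if block.get("BlockType") == "LINE"
    )
```

For example, `extract_text([{"BlockType": "LINE", "Text": "Hello"}])` returns `"Hello"`.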

0.2a0

29 Jun 19:35
Pre-release
  • New s3-ocr index database.db name-of-bucket command for creating a SQLite database containing the OCR results that have been written to the bucket. #2

0.1a0

29 Jun 02:53
Pre-release
  • s3-ocr start <bucket> command for triggering OCR runs using Textract for every PDF file in a bucket. #1
  • s3-ocr status <bucket> command for checking on the status of the ongoing OCR tasks.