Releases: simonw/s3-ocr
0.6.3
0.6.2
0.6.1
0.6
- `s3-ocr start` now automatically pauses and then retries if Textract complains that there are too many jobs running. This can be turned into an early exit with an error message using the new `--no-retry` option. #21
- New `s3-ocr start --dry-run` option for displaying what would happen without starting the OCR process. #22
- Textract now runs in the same region as the S3 bucket it is writing to, avoiding an error. #24
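
The pause-and-retry behaviour of `s3-ocr start` can be pictured with a small sketch. This is not the tool's actual code: `TooManyJobsError`, `start_job_with_retry`, the `start_job` callable and the 30 second pause are all hypothetical names and values chosen for illustration.

```python
import time


class TooManyJobsError(Exception):
    """Hypothetical stand-in for the 'too many jobs running' error."""


def start_job_with_retry(start_job, key, no_retry=False, pause_seconds=30,
                         sleep=time.sleep):
    """Call start_job(key), pausing and retrying while the job limit is hit.

    With no_retry=True the first limit error is re-raised immediately,
    which mirrors the early exit described for --no-retry.
    """
    while True:
        try:
            return start_job(key)
        except TooManyJobsError:
            if no_retry:
                raise
            # Assumed fixed pause; the real tool may wait differently.
            sleep(pause_seconds)
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without actually waiting.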
0.5
- Ability to run OCR against just the PDF files contained within a specific folder in the S3 bucket, using `s3-ocr start my-bucket --prefix my-prefix/`. #20
- New command: `s3-ocr dedupe my-bucket`, which scans the bucket for any new files that are duplicates of files that have already been OCRd, and writes out job results that reuse the existing OCR results so those files are not processed a second time in the future. #19
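
As a rough illustration of the deduplication idea, the sketch below maps each not-yet-processed file to an already-processed file with identical content. It assumes a content fingerprint per key (for example an S3 ETag) is available; the release notes do not say how `s3-ocr dedupe` detects duplicates, so the fingerprint choice and the `plan_dedupe` function are hypothetical.

```python
def plan_dedupe(fingerprint_by_key, processed_keys):
    """Return {new_key: processed_key} for new files that duplicate
    a file which has already been OCRd.

    fingerprint_by_key: dict of object key -> content fingerprint
                        (assumed here; e.g. an ETag).
    processed_keys:     set of keys that already have OCR results.
    """
    # Remember one already-processed key per fingerprint.
    processed_by_fingerprint = {}
    for key in sorted(processed_keys):
        fingerprint = fingerprint_by_key.get(key)
        if fingerprint is not None:
            processed_by_fingerprint.setdefault(fingerprint, key)

    # Any unprocessed key sharing a fingerprint can reuse those results.
    reuse = {}
    for key, fingerprint in fingerprint_by_key.items():
        if key not in processed_keys and fingerprint in processed_by_fingerprint:
            reuse[key] = processed_by_fingerprint[fingerprint]
    return reuse
```

For example, if `a.pdf` has been processed and `b.pdf` shares its fingerprint, the plan pairs `b.pdf` with `a.pdf`, and files with no processed twin are left out.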
0.4
0.3
First non-alpha release.
- Breaking change: the order of arguments for `s3-ocr index <bucket> <database_file>` has been swapped, for consistency with other commands. #9
- Breaking change: the `start` command no longer defaults to processing every `.pdf` file in the bucket. It now accepts a list of keys, or use the `--all` option to process every PDF file. #10
- New `s3-ocr fetch <bucket> <path>` command for fetching the raw OCR JSON data for that file. #7
- New `s3-ocr text <bucket> <path>` command for outputting just the extracted OCR text for a specified file. #8
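
The difference between the raw OCR JSON and the extracted text can be sketched as follows. Textract text detection responses contain a `Blocks` list whose entries carry a `BlockType` (such as `PAGE`, `LINE` or `WORD`) and, for lines and words, a `Text` value. The `extract_text` helper is hypothetical, and how s3-ocr itself stores and combines the JSON is not stated here.

```python
def extract_text(textract_response):
    """Join the text of every LINE block in a Textract-style response.

    WORD blocks are skipped because their text already appears
    inside the LINE blocks that contain them.
    """
    blocks = textract_response.get("Blocks", [])
    lines = [
        block["Text"]
        for block in blocks
        if block.get("BlockType") == "LINE" and "Text" in block
    ]
    return "\n".join(lines)
```

A response with no `LINE` blocks yields an empty string.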