Refactor file loading/parsing; add support for spreadsheets as input (#…
caufieldjh authored Aug 27, 2024
2 parents 5e41327 + d21492a commit f98914b
Showing 3 changed files with 239 additions and 131 deletions.
28 changes: 22 additions & 6 deletions docs/functions.md
@@ -38,7 +38,17 @@ Use the option `--inputfile` to specify a path to a file containing input text.

For the `extract` command, this may be a single file or a directory of files.

In the latter case, all .txt files will be assumed to be input, and the path will *not* be parsed recursively.
In the latter case, all files in the following formats will be assumed to be input:

```txt
".csv", ".tsv", ".txt", ".od", ".odf", ".ods", ".pdf", ".xls", ".xlsx"
```

The path will *not* be parsed recursively.
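
The directory handling described above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not OntoGPT's actual implementation: the helper `find_input_files` and the constant `VALID_SUFFIXES` are hypothetical names, and case-insensitive suffix matching is an assumption.

```python
from pathlib import Path

# Suffixes listed in the documentation above; the constant name is hypothetical.
VALID_SUFFIXES = {".csv", ".tsv", ".txt", ".od", ".odf", ".ods", ".pdf", ".xls", ".xlsx"}


def find_input_files(inputpath: str) -> list[Path]:
    """Return input files for a path that is a single file or a directory.

    Hypothetical helper: a directory is scanned one level deep only
    (no recursion), keeping files whose suffix is in VALID_SUFFIXES.
    A single file is returned as-is (an assumption of this sketch).
    """
    path = Path(inputpath)
    if path.is_file():
        return [path]
    # iterdir() lists only direct children, so subdirectories are not descended into.
    return sorted(
        p for p in path.iterdir()
        if p.is_file() and p.suffix.lower() in VALID_SUFFIXES
    )
```

In this sketch, a file such as `inputs/sub/notes.txt` would be skipped when `inputs/` is passed, which matches the non-recursive behavior the text describes.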

When parsing PDF files, use the `use-pdf` option as described below.

When parsing tabular files like tsv or xlsx, you may specify exact columns to load with the `selectcols` option as described below.

### template

@@ -86,11 +96,7 @@ Disable it with `--no-recurse`.

Use the option `use-pdf` to specify whether to extract text from a PDF.

This is done through the `pymupdf` package, which also supports extracting text from EPUB, MOBI, DOCX, and more.

See <https://pymupdf.readthedocs.io/en/latest/about.html#about-feature-matrix> for the full list.

Extraction from these file types is off by default.
This is done through the `pymupdf` package.

Example:

@@ -186,6 +192,16 @@ Including an instruction like the following anecdotally helps to avoid parsing f
--system-message "You are going to extract information from text in the specified format. You will not deviate from the format; do not provide results in JSON format."
```

### selectcols

Use the option `selectcols` to specify the exact columns to use when parsing tabular files as input.

Example:

```bash
ontogpt extract -t food -i inputs/myfile.tsv -o output.yaml --selectcols cheeses,grapes,flavors
```
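
To make the effect of column selection concrete, here is a minimal Python sketch using only the standard library. It is not taken from OntoGPT's code: the helper `load_selected_columns` is hypothetical, it handles only delimited text files (not spreadsheets such as `.xlsx`), and choosing the delimiter from the file suffix is an assumption.

```python
import csv
from pathlib import Path


def load_selected_columns(path: str, selectcols: str) -> list[dict[str, str]]:
    """Read a CSV/TSV file and keep only the named columns.

    Hypothetical helper: `selectcols` is a comma-separated string such as
    "cheeses,grapes,flavors", mirroring the CLI example above.
    """
    wanted = [c.strip() for c in selectcols.split(",") if c.strip()]
    # Assumption: tab-delimited for .tsv, comma-delimited otherwise.
    delimiter = "\t" if Path(path).suffix.lower() == ".tsv" else ","
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        # Fail early if a requested column is absent from the header row.
        missing = [c for c in wanted if c not in (reader.fieldnames or [])]
        if missing:
            raise KeyError(f"Columns not found in {path}: {missing}")
        return [{c: row[c] for c in wanted} for row in reader]
```

Calling `load_selected_columns("inputs/myfile.tsv", "cheeses,grapes,flavors")` would return one dictionary per row containing only those three columns, with any other columns in the file dropped.
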
## Functions
### categorize-mappings