Releases: aryn-ai/sycamore
v0.1.29
This Sycamore release contains small bug fixes and enhancements.
What's Changed
- when there's no table structure, take the token bbox for the cell bbox by @HenryL27 in #1061
- Disable use of scroll in OpenSearch reader when running KNN queries. by @austintlee in #1062
- Binarize OCR Image to Improve Performance by @karanataryn in #1063
- Fix `split_elements` for table elements with no `elem.table` attribute by @MarkLindblad in #1064
- Fix Extract Schema Empty Return by @karanataryn in #1067
- Bump version to v0.1.29. by @bsowell in #1068
Full Changelog: v0.1.28...v0.1.29
v0.1.28
This release changes doc_ids from UUIDs to NanoIDs, adds some document title functionality, and improves stability and performance.
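As a rough illustration of the UUID-to-NanoID change, the sketch below generates an ID in the general Nano ID style (21 characters from a 64-character URL-safe alphabet) using only the Python standard library. The function name `make_nanoid` and the constant `ALPHABET` are hypothetical; the exact format Sycamore uses for doc_ids is not stated in these notes and may differ.

```python
import secrets
import string
import uuid

# 64 URL-safe characters, matching the default Nano ID alphabet.
# Assumption: Sycamore's real doc_id format may add prefixes or use another size.
ALPHABET = string.ascii_letters + string.digits + "_-"


def make_nanoid(size: int = 21) -> str:
    """Return a random ID of `size` characters drawn from ALPHABET."""
    return "".join(secrets.choice(ALPHABET) for _ in range(size))


if __name__ == "__main__":
    print(len(str(uuid.uuid4())))  # length of a canonical UUID string
    print(len(make_nanoid()))      # length of the shorter NanoID-style string
```

Some vector stores accept only UUIDs as identifiers, which is presumably why this release also converts DocIDs back to UUIDs for Qdrant and Weaviate (#1038).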
What's Changed
- adding one shot prompting along with multimodal request by @Soeb-aryn in #1023
- Fix query-ui dependency on boto3 and re-lock. by @mdwelsh in #1028
- Updated NTSB queries and ground truth for CIDR-25 paper. by @mdwelsh in #1026
- Add streaming support and tests for query-server. by @mdwelsh in #1027
- Supply element types in output from MarkedMerger. by @alexaryn in #1031
- Fix SummarizeData so that downstream .materialize operations will work. by @mdwelsh in #1030
- add nanoid by @HenryL27 in #1034
- Removed duplicate code in query execution. by @akarshgupta7 in #1035
- Convert docids from UUID to NanoID. by @alexaryn in #1032
- Use NanoIDs in file_scan. by @alexaryn in #1036
- extract table properties prompt & bug fix by @Soeb-aryn in #1037
- Convert DocIDs to UUIDs for Qdrant & Weaviate; unit tests. by @alexaryn in #1038
- heuristics to get title from section headers by @Soeb-aryn in #1033
- updating function in pdf_miner class by @Soeb-aryn in #1041
- Added ragas to compute string metrics for evaluation. by @akarshgupta7 in #1039
- Fix sort so that it works with an unspecified or None default_value. by @eric-anderson in #1040
- Added correctness score to the metrics. by @akarshgupta7 in #1043
- Query planner improvements by @baitsguy in #1046
- Fix materialize to tolerate an empty input directory in ray mode by @eric-anderson in #1045
- PR fix by @baitsguy in #1047
- disable vectorsearch rerank by default in query by @baitsguy in #1048
- vectorsearch planner prompt changes by @baitsguy in #1049
- Make OpenAIEmbedder serializable after client has been initialized. by @bsowell in #1050
- Rename Embedding in ElasticSearch Notebook by @karanataryn in #1051
- Add deformable table extractor by @HenryL27 in #1053
- Add helper for thread local variables that can be used to add metadata to the output stream by @eric-anderson in #1052
- Propagate element level llm_filter output to doc.properties by @baitsguy in #1054
- Handle military clock time (0800) in time standardizer. by @alexaryn in #1056
- Fix incorrect docstring for promote-certain-elements-to-title feature by @MarkLindblad in #1057
- adding parameter for API in sdk and remote_partitioner by @Soeb-aryn in #1042
- bump sycamore version to 0.1.28 by @HenryL27 in #1058
- bump aryn sdk version to 0.1.10 by @HenryL27 in #1059
- don't die if box is None in try_draw_boxes by @HenryL27 in #1060
New Contributors
- @akarshgupta7 made their first contribution in #1035
Full Changelog: v0.1.27...v0.1.28
v0.1.27
This Sycamore release includes a variety of small bug fixes and improvements.
What's Changed
- Bump `aryn-sdk` version to 0.1.9 from 0.1.8 by @MarkLindblad in #1011
- Add plan validation by @baitsguy in #1001
- Sort retrieval docs by score properties if they exist by @baitsguy in #1012
- Add 120k max chars (default) for summarize_data by @baitsguy in #1013
- Queryeval docset write fix by @baitsguy in #1014
- Add notebook file for OpenSearch example by @jonfritz in #1015
- Fix up NTSB queries for query-eval tool. by @mdwelsh in #1016
- Rename from APS to DocParse by @karanataryn in #1017
- enable JSONifying tables by @HenryL27 in #1018
- Fix `aryn-sdk`'s `convert_image_element` example by @MarkLindblad in #1019
- Fix DocParse chunking example in `aryn-sdk` by @MarkLindblad in #1021
- blacksmith.sh: Migrate workflows to Blacksmith by @blacksmith-sh in #1020
- Revert Unit Tests to GitHub Actions by @karanataryn in #1025
- Bump version to 0.1.27. by @bsowell in #1024
Full Changelog: v0.1.26...v0.1.27
v0.1.26
This release includes several stability and reliability improvements.
What's Changed
- skip flaky test by @HenryL27 in #956
- Fix mypy warnings. by @mdwelsh in #947
- Work around hang observed during vcrpy recording. by @alexaryn in #950
- Postprocessing to modify plans returned by llm planner; minor issues with query-ui by @amolvdeshpande in #882
- bump sdk to 0.1.7 by @HenryL27 in #961
- Add HeaderAugmenterMerger by @dhruvkaliraman7 in #946
- Update docs to reflect OpenAIPropertyExtractor->LLMPropertyextractor by @bsowell in #964
- Couple of minor fixes and tweaks to the table merger. by @bsowell in #963
- Enable use_elements in query.summarize_data by @baitsguy in #966
- Fix typo in syntax in docstring for Summarize Images by @jonfritz in #967
- Add missing `tokenizer` argument in `MarkBreakByTokens` docstring by @MarkLindblad in #969
- Add Lots of Connector Unit Tests by @karanataryn in #957
- Add OCR Evaluation Code by @karanataryn in #685
- Fixed query tag check by @baitsguy in #968
- Fix SDK Threshold Bug by @karanataryn in #970
- Add score to each document in OpenSearch query result. by @bsowell in #971
- Fix HeaderAugmenterMerger by @MarkLindblad in #973
- Refactor `mark_bbox_preset` to expose function outside `DocSet` by @MarkLindblad in #972
- Fix `mark_bbox_preset`'s `MarkDropHeaderFooter` parameter by @MarkLindblad in #975
- OpenSearch improvements by @baitsguy in #974
- Adding a separate installation instructions page by @AbhijitP-009 in #977
- Union OCR / PDFMiner Tokens with Table Outputs by @karanataryn in #976
- Make Table Code More Robust by @karanataryn in #979
- fix divide by zero in align_headers by @HenryL27 in #978
- Allow for returning query traces on cached query executions. by @mdwelsh in #959
- Add Enhance Table Option to SDK by @karanataryn in #980
- Bump SDK Version by @karanataryn in #981
- Update Lockfiles by @karanataryn in #920
- Add query planning strategy objects by @baitsguy in #982
- Move tokenized data to device by @baitsguy in #983
- Update vectorsearch query test by @baitsguy in #984
- Integration test for Sycamore Query demo. by @mdwelsh in #985
- Add Closure of Client Connections for Connectors by @karanataryn in #989
- Work around lack of resource module on Windows. by @alexaryn in #962
- Update README.md by @karanataryn in #990
- Merge in Fixes from Luna Demo Deployment by @karanataryn in #992
- Add table-chunker by @dhruvkaliraman7 in #993
- chore: Added back to top , contributors section and star history graph by @samarth29jc in #987
- Return the list of documents referenced in a Luna query. by @mdwelsh in #995
- Sync Locks across all Directories by @karanataryn in #988
- Remove unused code (`_batchify`) by @MarkLindblad in #887
- Don't try to put footers in columns by @HenryL27 in #998
- Docprep notebook testing by @sohamkasar19 in #996
- Add expected documents in query-eval tool by @baitsguy in #997
- Move Aryn DocParse Docs Out of Sycamore by @karanataryn in #994
- Remove seed from rewrite prompt by @baitsguy in #1000
- Fix OpenAI reduce methods to handle Azure deployment names. by @bsowell in #1002
- Add support for custom source parameter for remote Aryn Partitioner by @MarkLindblad in #1003
- Fix mixed samples for schema extraction. by @mdwelsh in #1004
- updating extract table prop by @Soeb-aryn in #1005
- Update Opensearch domain in docprep notebook testing (GHA) by @sohamkasar19 in #1006
- Improve suggested install command by @HenryL27 in #1007
- Fix augment_text docstring by @HenryL27 in #1008
- Add support for using Aryn DocParse chunking from `aryn-sdk` by @MarkLindblad in #1010
- Update sycamore to 0.1.26 by @HenryL27 in #1009
New Contributors
- @amolvdeshpande made their first contribution in #882
- @samarth29jc made their first contribution in #987
Full Changelog: v0.1.25...v0.1.26
v0.1.25
This Sycamore release includes numerous bug fixes for connectors and other transforms. It also includes support for Anthropic LLMs via Amazon Bedrock.
What's Changed
- Sycamore Query evaluation tool. by @mdwelsh in #912
- Luna client local schema (take 2) by @dtecuci in #919
- Fix small bug in client. by @mdwelsh in #923
- Fix DuckDB Spelling Error by @karanataryn in #924
- Make OpenSearchSchema a proper Pydantic model. by @mdwelsh in #922
- Fix typo by @Yashbhatt786 in #927
- Bugfixes: DocumentSource enum serialization and missing element_id in old data by @baitsguy in #928
- Bug fixes: remove kwargs in docset.rerank, sycamore query codegen by @baitsguy in #932
- Add Table Merger by @dhruvkaliraman7 in #880
- Basic Bedrock LLM client. by @mdwelsh in #931
- Accept query plan examples in config by @baitsguy in #934
- Evaluate query plans in query-eval by @baitsguy in #936
- Add local mode support for json scan and json document scan by @bohou-aryn in #925
- Handle Drawing Missing Tables and Cells by @karanataryn in #938
- Support LLM selection in Sycamore Query Client. by @mdwelsh in #935
- Crop To Bbox Error by @karanataryn in #939
- Add plan correctness metrics summary + K in TopK optional by @baitsguy in #940
- don't embed the empty string with openai by @HenryL27 in #943
- Support SummarizeImages with non-OpenAI LLMs. by @bsowell in #941
- Add support for tags and notes. by @mdwelsh in #942
- Create LLMSchemaExtractor and LLMPropertyExtractor classes. by @bsowell in #945
- Don't run embedded weaviate in the unit tests by @HenryL27 in #951
- fix empty strings in section headers by @HenryL27 in #948
- Select pages by @bsowell in #937
- Fixup notebook tests by @eric-anderson in #933
- Use pytest-xdist for unit tests. by @mdwelsh in #952
- Update standardizer.py by @jonfritz in #944
- Fix bugs in Unflattening Data by @karanataryn in #930
- fix materialize bug with s3 filesystem by @eric-anderson in #954
- Bump version to 0.1.25. by @bsowell in #955
New Contributors
- @Yashbhatt786 made their first contribution in #927
Full Changelog: v0.1.24...v0.1.25
v0.1.24
This Sycamore release includes several bug fixes in the Weaviate and DuckDB connectors and in several of the example notebooks. Thanks to @Dnaynu for contributing to the Sycamore documentation!
What's Changed
- fix asdict in the reader too. duh by @HenryL27 in #907
- Add text representation for empty tables by @dhruvkaliraman7 in #909
- Refactor logical plan serialization. by @mdwelsh in #905
- microperformance improvement by @HenryL27 in #906
- Bugfix: Handle opensearch reader doc reconstruction when no parent doc in results by @baitsguy in #908
- Fix bug in entity extraction. by @eric-anderson in #911
- added ability to read schema from file by @dtecuci in #904
- Enable copying of the hash context. by @alexaryn in #910
- Add option to extract line-based bounding boxes from pdfminer. by @bsowell in #874
- Support random sample in local mode. by @bsowell in #913
- Opensearch kwargs fix by @baitsguy in #914
- Fix Typo in NTSB Demo by @karanataryn in #917
- Update using_jupyter.md by @jonfritz in #902
- Docs: Typo Fix by @Dnaynu in #918
- Update DuckDB Reader to Package Change by @karanataryn in #916
- Make metadata-extraction.ipynb work by @eric-anderson in #915
- Bump Sycamore version to 0.1.24. by @bsowell in #921
Full Changelog: v0.1.23...v0.1.24
v0.1.23
This is a small release that fixes a bug in the Weaviate writer and includes a few other bug fixes and documentation improvements.
What's Changed
- fix bug in weaviate writer causing api keys to be of wrong type by @HenryL27 in #893
- Expose local easyocr kwargs by @baitsguy in #894
- Fix PDFMiner Output Parsing by @karanataryn in #890
- Allow passing custom ocr object to arynpartitioner by @baitsguy in #895
- Update Elasticsearch Port by @karanataryn in #896
- Update Merger Parameters in Docs by @sohamkasar19 in #897
- Fix Elasticsearch Docs by @karanataryn in #899
- Cleanup Docs by @karanataryn in #900
- Add smaller pdfminer bboxs to large detr bboxs by doing iob and not iou by @dhruvkaliraman7 in #901
- Fix anonymous reading in materialize and add rate limited logging. by @eric-anderson in #898
- Bump version to v0.1.23. by @bsowell in #903
Full Changelog: v0.1.22...v0.1.23
v0.1.22
This Sycamore release includes support for Python 3.12, a connector for the Qdrant vector database, and many bug fixes and enhancements. Thanks to @Anush008 for contributing the Qdrant support!
What's Changed
- bump sdk to 0.1.4 by @HenryL27 in #823
- Fix issue with empty tool response leading to hallucinations. by @mdwelsh in #818
- Fix bug where prompt is modified by OpenAIEntityExtractor. by @mdwelsh in #824
- Fix poetry.lock with missing dependency. by @mdwelsh in #825
- Query trace viewer for Luna demo, and better PDF previews. by @mdwelsh in #828
- Batch Processing Bug Fix by @karanataryn in #829
- Get local mode working 1/n by @eric-anderson in #826
- Changing titles for some posts by @AbhijitP-009 in #827
- Transform to convert Document into Markdown. by @alexaryn in #811
- Fix query trace viewer. by @mdwelsh in #830
- Ingest more fields into OpenSearch schema for NTSB demo. by @mdwelsh in #834
- Fix bug with trace view. by @mdwelsh in #833
- Improved sorting of elements by bbox for one and two columns. by @alexaryn in #801
- Make PDFMiner Pipelined by @karanataryn in #807
- Fix error message on None value passed to DateTimeStandardizer. by @mdwelsh in #835
- Sundry improvements while using luna in a customer. by @eric-anderson in #832
- fix to pass string to tokenizer by @Soeb-aryn in #831
- Some improvements to query plans for Luna demo. by @mdwelsh in #836
- Update requires_modules type annotations to work with mypy. by @bsowell in #837
- Lazily Set Table Text Representation by @karanataryn in #839
- Have Luna use .keyword field for path field. by @mdwelsh in #841
- Add a simple logical query plan compare function by @baitsguy in #840
- Improve luna property handling by @eric-anderson in #842
- Add support for Python 3.12. by @bsowell in #838
- Fix Luna UI to show query plan operators. by @mdwelsh in #847
- bugfix to extract text summaries(dont just randomly assert) by @RitxmSaha in #848
- Ignore bad tables by @MarkLindblad in #849
- Add support for caching intermediate results of Luna queries. by @mdwelsh in #850
- add read.opensearch(reconstruct_document =True) option by @baitsguy in #845
- Fold in query-demo capability to query-ui. by @mdwelsh in #852
- Define parallelism on nodes by @eric-anderson in #853
- Basic documentation for APS markdown option. by @alexaryn in #854
- Implement output_format in Aryn SDK partition_file(). by @alexaryn in #857
- Add `local-inference` extra to `sycamore-ai` dependency in `apps/query-ui`. by @mdwelsh in #859
- Super basic FastAPI wrapper to Sycamore Query. by @mdwelsh in #855
- Support output_format in ArynPartitioner. by @alexaryn in #858
- Fix tile cannot extend outside image by @dhruvkaliraman7 in #856
- Support Jupyter saving to S3 by @eric-anderson in #860
- Add PaddleOCR and Refactor Text Extraction by @karanataryn in #745
- Fix broken test. by @mdwelsh in #863
- Get Local Mode working 2/n by @eric-anderson in #861
- Remove package-mode by @eric-anderson in #865
- Add similarity scoring and rerank transform by @baitsguy in #864
- adding docs for AssignDocProperties, Standardizer and ExtractTableProperties by @Soeb-aryn in #866
- Add newline before text elements. by @alexaryn in #862
- handle file paths in the sdk by @HenryL27 in #869
- Add packaging library to aryn-sdk pyproject.toml. by @bsowell in #870
- Do some escaping of special Markdown characters. by @alexaryn in #867
- fix type annotation for file by @HenryL27 in #871
- Element ordering and test improvements by @baitsguy in #872
- Test fixes and more local mode by @baitsguy in #873
- Add a few more files to .gitignore. by @bsowell in #875
- feat: Qdrant support by @Anush008 in #821
- Get llm_filter to support document structure + similarity sorting for elements by @baitsguy in #876
- Add documentation for Sycamore Query. by @mdwelsh in #878
- Move loaddata script to query-ui. by @mdwelsh in #877
- Remove deprecated query-demo UI. by @mdwelsh in #881
- Adjust Pinecone Docs for Clarity by @karanataryn in #883
- Add source_mode parameter to AutoMaterialize. by @bsowell in #885
- add optimization from training development by @HenryL27 in #886
- Fix documentation link, sentence grammar by @MarkLindblad in #879
- Clean Up Text Extraction by @karanataryn in #868
- Fix Parameter Error in Docs by @karanataryn in #888
- Enable document model in sycamore.query + query-ui improvements by @baitsguy in #884
- Fix parallelism bug. by @eric-anderson in #889
- fix issue when packages and containers do not align at all -> max([]) by @HenryL27 in #891
- Bump version to 0.1.22. by @bsowell in #892
Full Changelog: v0.1.21...v0.1.22
v0.1.21
This Sycamore release contains Aryn Partitioning Service client updates to support the new auto-threshold feature and add support for Microsoft Word (.doc and .docx) and Microsoft PowerPoint (.ppt and .pptx) files. It also contains a variety of bug fixes and stability improvements.
What's Changed
- Fix Lib/Sycamore README by @karanataryn in #771
- Allow custom SycamoreQueryClient in query-ui + cleanup by @baitsguy in #772
- Sycamore changes to support new NTSB demo. by @mdwelsh in #774
- improving ExtractTableProperties and standardizer transforms by @Soeb-aryn in #773
- add materialize to transform toc by @eric-anderson in #779
- Fix Bugs in Sycamore Pipeline by @karanataryn in #777
- New NTSB Luna demo UI. by @mdwelsh in #778
- neo4j, refactor pipeline to not auto resolve entities + add support for images in pipeline. by @RitxmSaha in #766
- Fix issue with duplicate widget keys. by @mdwelsh in #780
- Bug fixes in query path by @baitsguy in #781
- Add querydemo to pyproject.toml. by @mdwelsh in #783
- A few Luna demo fixes. by @mdwelsh in #784
- Bugfixes for context_vars by @baitsguy in #785
- Fix Local Mode Read Bug by @karanataryn in #786
- Make reorder_elements more like sorted() so we can use key= by @alexaryn in #787
- Add new OpenSearch writer notebook by @jonfritz in #788
- Fix function signature reading in contextvars by @baitsguy in #789
- Various Luna Demo fixes. by @mdwelsh in #790
- Making changes to docs. Better titles etc. by @AbhijitP-009 in #793
- Update our container support by @eric-anderson in #782
- Add natural language result flag. by @mdwelsh in #794
- Make Element Class More Robust by @karanataryn in #797
- updated docs to explain the new default threshold setting for ArynPartitioner by @dtecuci in #795
- Add support for pushing query filters down to OpenSearch. by @mdwelsh in #796
- neo4j writer docs by @RitxmSaha in #798
- Docs for nms change (take 2) by @dtecuci in #799
- Fix TableTransformer Bug by @karanataryn in #800
- Remove dead code. by @mdwelsh in #803
- Verify .map can run parallel classes by @eric-anderson in #802
- bugfix to extract graph entities by @RitxmSaha in #805
- Update default threshold values for ArynPartitioner. by @bsowell in #804
- Update type signatures for threshold in aryn_sdk. by @bsowell in #806
- Change Bounding Box Validity Assertion by @karanataryn in #808
- A few Luna demo tweaks. by @mdwelsh in #810
- Couple of Luna demo bugfixes. by @mdwelsh in #814
- Add QueryVectorDatabase to SycamoreQuery by @baitsguy in #813
- Add `.docx` documentation by @MarkLindblad in #812
- Ritam add example notebook by @RitxmSaha in #815
- query-ui: cosmetic changes by @baitsguy in #817
- Improved NTSB ingestion pipeline for Luna demo. by @mdwelsh in #816
- Bump version to v0.1.21. by @bsowell in #819
- Reverts README change to restore poetry build. by @bsowell in #820
- Fix typo scyamore -> sycamore. by @bsowell in #822
Full Changelog: v0.1.20...v0.1.21
v0.1.20
This release refactors Sycamore’s dependencies into extras, so that the dependencies for connectors and local inference (e.g. creating vector embeddings) are installed only when you ask for them. For example, to use the OpenSearch connector, run `pip install sycamore-ai[opensearch]`. To run a local vector embedding model, run `pip install sycamore-ai[local-inference]`. To do both, run `pip install sycamore-ai[opensearch,local-inference]`.
Also, this release includes performance and stability improvements.
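Because connector dependencies are now optional, a script may want to check up front whether the extras it needs are installed. The following is a minimal sketch of such a check, not part of Sycamore itself: the helper names `missing_extras` and `install_command` are hypothetical, and the mapping from each extra to a module it provides (`opensearchpy`, `torch`) is an assumption rather than something these notes confirm.

```python
import importlib.util

# Hypothetical mapping: extra name -> one top-level module that extra is
# assumed to install. Adjust to the extras and modules you actually use.
EXTRA_TO_MODULE = {
    "opensearch": "opensearchpy",
    "local-inference": "torch",
}


def missing_extras(extras, mapping=EXTRA_TO_MODULE):
    """Return the extras whose assumed module cannot be found for import."""
    missing = []
    for extra in extras:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(mapping[extra]) is None:
            missing.append(extra)
    return missing


def install_command(extras):
    """Build the pip command that installs sycamore-ai with the given extras."""
    joined = ",".join(extras)
    # Quoting the requirement keeps shells such as zsh from globbing the brackets.
    return f'pip install "sycamore-ai[{joined}]"'


if __name__ == "__main__":
    needed = missing_extras(["opensearch", "local-inference"])
    if needed:
        print("Missing extras; try:", install_command(needed))
```

Checking with `find_spec` avoids importing heavy packages just to see whether they are present.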
What's Changed
- Dependencies 1/n: Remove need to restart colab runtime by @bsowell in #728
- Don't require installing neo4j unless it's used. by @eric-anderson in #733
- Handle None cases for element.table = <> by @baitsguy in #735
- Fix materialize + S3 not working. by @eric-anderson in #734
- Fixed neo4j relationship property loading + added support for loading lists and dictionaries as properties by @RitxmSaha in #736
- Handle non-hashable data types in opensearch schema extractor by @baitsguy in #737
- docs: update README.md by @eltociear in #739
- Support concurrent libreoffice executions, fix bug to support s3 source paths in file_format_tools by @baitsguy in #741
- Fix calls to structured outputs so that they can be cached by @RitxmSaha in #738
- fix 'SycamorePartitioner' error message by @HenryL27 in #748
- Fix context test by @eric-anderson in #749
- Enforce the constraint that each cell is only in one spanning cell. by @bsowell in #754
- Add context_params decorator to read args from Context by @baitsguy in #747
- Remove unnecessary tracing code. by @mdwelsh in #752
- Dependencies 2/3: Move connectors to extras. by @bsowell in #740
- Allow any pinecone error on create index by @HenryL27 in #750
- Allow all Exceptions while creating Connector Targets by @karanataryn in #753
- Adding new ETL tutorial by @jonfritz in #751
- Add materialize to the ntsb loader for luna by @eric-anderson in #742
- Add Weaviate notebook by @karanataryn in #757
- Update get_started.rst by @jonfritz in #759
- Update pinecone.md by @jonfritz in #758
- added new document structure + tests by @RitxmSaha in #746
- Dependencies 3/3: Add partitioning extras. by @bsowell in #755
- Dependencies: Remove need to restart colab session for aryn-sdk by @bsowell in #756
- Default llm in transforms by @baitsguy in #760
- Improve materialize by @eric-anderson in #762
- adding neo4j s3 proxy for aura db + split_calls flag for entity and relationship extractor. by @RitxmSaha in #761
- Fix show_pages in Google Colab. by @bsowell in #763
- Jonfritz patch 3 tutorial by @jonfritz in #764
- Fix materialize to work even if it is re-executed on the same documents. by @eric-anderson in #765
- add clear_materialize(path=) by @eric-anderson in #767
- Jonfritz patch 3 consoledocs by @jonfritz in #768
- Update docs with more info on dependencies. by @bsowell in #769
- bump sycamore version to 0.1.20 by @HenryL27 in #770
New Contributors
- @eltociear made their first contribution in #739
Full Changelog: v0.1.19...v0.1.20