The format is based on Keep a Changelog.
Also check the Rust changelog.
0.6.9 (2024-11-20)
- Allow string literals as SplitMode (#245)
- Add sudachipy.Config and sudachipy.errors.SudachiError to default import (#260)
- Add support for Python 3.13 by updating the PyO3 dependency to v0.22 (#265, #276)
- Python 3.13t (no GIL) is not supported yet
- The -s option (system dictionary path) of the sudachi ubuild command is now required (#239)
- Migrate from setup.py install (#252)
- The -d option of the sudachi cli (which is a no-op) now warns (#278)
- Update the output of the sudachi dump subcommand (#277)
- Documentation fix/update (#247 by @t-yamamura, #250, #268)
- Change how Python errors are raised (#273)
- Fix clippy warnings without breaking changes (#263)
- Remove Python 3.7 and 3.8 support, as both versions have reached end of life (https://devguide.python.org/versions/) (#249, #281).
0.6.8 (2023-12-14)
- Produce builds for Python 3.12 (#236)
- Add a simple configuration API
- Add surface projections (#230)
- For chiTra compatibility, SudachiPy can now directly produce different tokens in the surface field.
- The original surface is accessible via the Morpheme.raw_surface() method
- It is possible to customize the projection dictionary-wise, via a Config object passed on dictionary creation, or for a single pre-tokenizer.
0.6.7 (2023-02-16)
- Produce builds for Python 3.11
- Add Dictionary.lookup() method which allows you to enumerate morphemes from the dictionary without performing analysis.
- Add boundary matching mode to regex oov handler
- macOS binary builds are now universal2 (arm+x64)
- Binary builds are universal2 (arm+x64)
- Caveat: we don't run tests on arm because there are no public arm instances, so builds may be broken without any warning
- Fixed invalid POS tags which appeared when using user-defined POS tags both in user dictionaries and OOV handlers. You are not affected by this bug if you did not use user-defined POS in OOV handlers.
- OOV handler plugins support user-defined POS, similar to Java version
- Added Regex OOV handler
- For details, see Java version changelog
- In Rust/Python, regexes do not support backtracking and backreferences
- maxLength setting defines maximum length in unicode codepoints, not in utf-8 bytes as in Java (will be changed to codepoints later)
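The difference between the two units of length matters mostly for non-ASCII text. The following is a minimal sketch in plain Python, not Sudachi code: the helper names, including `within_max_length`, are hypothetical and only illustrate how a codepoint-based limit differs from a UTF-8 byte-based one.

```python
def length_in_codepoints(text: str) -> int:
    # Python strings are sequences of Unicode codepoints,
    # so len() already counts codepoints.
    return len(text)


def length_in_utf8_bytes(text: str) -> int:
    # Encoding to UTF-8 first gives the byte length instead.
    return len(text.encode("utf-8"))


def within_max_length(text: str, max_length: int) -> bool:
    # Hypothetical check that mimics a limit expressed in codepoints.
    return length_in_codepoints(text) <= max_length


sample = "東京都"
print(length_in_codepoints(sample))   # 3 codepoints
print(length_in_utf8_bytes(sample))   # 9 bytes, 3 per character
print(within_max_length(sample, 3))   # True under a codepoint limit
```

With a limit of 3, the sample passes when counted in codepoints but would be rejected if the same limit were applied to UTF-8 bytes.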
- Remove Python 3.6 support which reached end-of-life status on 2021-12-23
- Print Debug feature is disabled now.
- -d option of sudachipy cli does nothing.
- sudachipy.Tokenizer will ignore the provided logger.
- Ref: [#76]
- Changed path resolution algorithm for resources #203
- Added set operations to PosMatcher #204
- Added pos_of() function to Dictionary which returns a POS tuple for a given POS id.
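As a rough illustration of the PosMatcher set operations, the sketch below combines two matchers with a union. It assumes that `Dictionary.pos_matcher()` accepts a predicate over the POS tuple and that matchers support the `|` operator. The helper name `content_word_surfaces` is hypothetical, and the POS strings assume a standard Sudachi dictionary.

```python
def content_word_surfaces(dictionary, morphemes):
    """Return surfaces of morphemes that are nouns or verbs.

    `dictionary` is expected to be a sudachipy.Dictionary and
    `morphemes` an iterable of sudachipy Morpheme objects.
    """
    # Build one matcher per POS class from a predicate on the POS tuple.
    nouns = dictionary.pos_matcher(lambda pos: pos[0] == "名詞")
    verbs = dictionary.pos_matcher(lambda pos: pos[0] == "動詞")
    # Set union: matches a morpheme if either matcher does.
    content = nouns | verbs
    return [m.surface() for m in morphemes if content(m)]


# Possible usage, assuming sudachipy and a dictionary package are installed:
#   from sudachipy import Dictionary
#   dictionary = Dictionary()
#   tokenizer = dictionary.create()
#   print(content_word_surfaces(dictionary, tokenizer.tokenize("寿司を食べる")))
```

Building the matchers once and reusing them avoids comparing POS strings for every morpheme.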
- Fixed analysis differences with 0.5.4
- Morpheme.part_of_speech method now returns a tuple of POS components instead of a list.
- Partial Dictionary Read
- It is possible to ask for a subset of morpheme fields instead of all fields
- Supported API: Dictionary.create(), Dictionary.pre_tokenizer()
- HuggingFace PreTokenizer support
- We provide a built-in HuggingFace-compatible pre-tokenizer
- API: Dictionary.pre_tokenizer()
- It is multithreading-compatible and supports customization
- Memory allocation reuse
- It is possible to reduce re-allocation overhead by using out parameters which accept MorphemeLists
- Supported API: Tokenizer.tokenize(), Morpheme.split()
- It is now a recommended way to use both those APIs
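The reuse pattern can be sketched as follows. This is a minimal example that assumes `Tokenizer.tokenize()` accepts an `out` keyword and writes its result into that list. The helper name `surfaces_per_text` is hypothetical.

```python
def surfaces_per_text(tokenizer, texts):
    """Tokenize each text while reusing a single MorphemeList.

    `tokenizer` is expected to be a sudachipy Tokenizer.
    """
    # Tokenizing an empty string yields an empty list to use as the buffer.
    buffer = tokenizer.tokenize("")
    results = []
    for text in texts:
        # The result is written into `buffer` instead of a new list.
        tokenizer.tokenize(text, out=buffer)
        # Copy out what is needed: the next call overwrites the buffer.
        results.append([m.surface() for m in buffer])
    return results


# Possible usage, assuming sudachipy and a dictionary package are installed:
#   from sudachipy import Dictionary
#   tokenizer = Dictionary().create()
#   print(surfaces_per_text(tokenizer, ["東京都へ行く", "寿司を食べる"]))
```

Because the buffer is overwritten on every call, anything that must outlive the loop iteration has to be extracted before the next call to `tokenize()`.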
- PosMatcher
- New API for checking if a morpheme has a POS tag from a set
- Strongly prefer using it instead of string comparison of POS components
- Performance
- Greatly decreased cost of accessing POS components
- len(Morpheme) now returns the length of the morpheme in Unicode codepoints. Use it instead of len(m.surface())
- Morpheme.split() has a new add_single parameter, which can be used to check whether the split has produced anything
- E.g. if m.split(SplitMode.A, out=res, add_single=False): handle_splits(res)
- add_single=True, which returns a list containing the current morpheme, is the current behavior
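A slightly fuller version of the add_single pattern might look like the sketch below. It assumes that, with `add_single=False`, `Morpheme.split()` leaves the output list empty (and therefore falsy) when the morpheme cannot be split further. The helper name `split_or_keep` is hypothetical.

```python
def split_or_keep(morpheme, mode, buffer):
    """Return surfaces of the sub-morphemes, or the original surface.

    `morpheme` is expected to be a sudachipy Morpheme, `mode` a SplitMode,
    and `buffer` a reusable MorphemeList.
    """
    # With add_single=False an unsplittable morpheme yields an empty list,
    # so the truthiness of the result tells us whether a split happened.
    if morpheme.split(mode, out=buffer, add_single=False):
        return [m.surface() for m in buffer]
    # Nothing was produced: fall back to the morpheme itself.
    return [morpheme.surface()]


# Possible usage, assuming sudachipy and a dictionary package are installed:
#   from sudachipy import Dictionary, SplitMode
#   tokenizer = Dictionary().create(mode=SplitMode.C)
#   buffer = tokenizer.tokenize("")
#   for m in tokenizer.tokenize("国家公務員"):
#       print(split_or_keep(m, SplitMode.A, buffer))
```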
- Morpheme/MorphemeList now have readable __repr__ and __str__
- The dict_type parameter of the Dictionary() constructor is deprecated. Use dict instead, which is a complete alias.
- Do not use mode parameter of Tokenizer.tokenize() method if you always tokenize with a single mode.
- Use the mode parameter of Dictionary.create() method instead.
- Support building dictionary
- sudachidict_* packages starting from 20210802.post1 are compatible with 0.6.0 release and will work as is
- From this version, SudachiPy is provided as a binding of the Rust implementation.
- See API reference page for all APIs.
- Since this is a release-candidate version, you need to explicitly specify the version to install.
- pip install sudachipy==0.6.0rc1
- You also need to install sudachidict_* beforehand, since installing it afterwards will overwrite this version.
- Module structure changed: all classes are now located in the root module.
- Import is now like: from sudachipy import Dictionary, Tokenizer
- You can still import them in the previous way (not recommended): from sudachipy.dictionary import Dictionary
- MorphemeList.empty now needs a sudachipy.Dictionary instance as its argument.
- This method is also marked as deprecated.
- MorphemeList.empty(dict)
- Users should not generate MorphemeList by themselves.
- Use Tokenizer.tokenize("") if you need an empty one.
- Morpheme.get_word_info()
- Users should not touch the raw WordInfo.
- Necessary fields are provided via Morpheme.
- Please create an issue if a field you need is not available on Morpheme.
Morpheme.split(mode)
- The API around this feature will change.
- See issue [#92].
- Some APIs are not supported.
- See API reference page for the full list of supported APIs.
- Most instance attributes are inaccessible.
- You cannot access Dictionary.grammar or Dictionary.lexicon.
Please see the Python version repository.