
Issue search results · repo:JuliaText/WordTokenizers.jl language:Julia


18 results

Hi, I hope you are doing great. Thank you for your effort in this package. Just to report that the dependency HTML_Entities prevents this package from working with PackageCompiler. This issue was fixed in this ...
  • AbrJA
  • Opened on Aug 13, 2024
  • #65

julia> Pkg.activate(temp=true) Activating new project at `/var/folders/4n/gvbmlhdc8xj973001s6vdyw00000gq/T/jl_iVQTya` julia> Pkg.add("WordTokenizers") Resolving package versions... some logging ...
  • ablaom
  • 2
  • Opened on Apr 8, 2024
  • #64

Several times the paragraphs have newlines copied from the source document (particularly when copied from a PDF), and these should be ignored when sentences are tokenized. This is the text taken from copying ...
  • sambitdash
  • Opened on Feb 26, 2021
  • #60

Hi @Ayushk4 - @oxinabox and @aviks suggested I ping you. I am interested in investigating and improving the sentence tokenizers part of WordTokenizers.jl. Would that be of interest to you if ...
  • TheCedarPrince
  • 2
  • Opened on Jan 18, 2021
  • #59

The tokenize function returns a vector of words (strings) when an input string is passed. It doesn't lowercase each word by default. For example: julia> text = "This is a this sentence" "This is a this sentence" ...
  • shikhargoswami
  • 3
  • Opened on Jan 4, 2021
  • #57

I get an InitError with WordTokenizers on Julia 1.5 when using WordTokenizers in a package, even if the package is almost empty. Here is an MWE: (@v1.5) pkg> generate TestToken # in shell $ cd TestToken/src ...
  • chengchingwen
  • 3
  • Opened on Aug 26, 2020
  • #55

I think it is better to release a new version of WordTokenizers with the Statistical Tokenizer. It also serves as a dependency for TextAnalysis.ALBERT.
  • tejasvaidhyadev
  • Opened on Aug 25, 2020
  • #52

We should benchmark against https://github.com/huggingface/tokenizers. I don't expect us to win, but it gives us a line to target.
  • oxinabox
  • Opened on Feb 5, 2020
  • #46

BERT and related models use statistical tokenization algorithms, which handle out-of-vocabulary words well in ML models. High-speed implementations of BPE, WordPiece, etc. would be good additions ...
  • Ayushk4
  • 20
  • Opened on Jan 30, 2020
  • #44

julia> WordTokenizers.split_sentences("This is a sentence.Laugh Out Loud. Keep coding. No. Yes! True! ohh!ya! me too.") 7-element Array{SubString{String},1}: "This is a sentence.Laugh Out Loud." "Keep coding." ...
  • oxinabox
  • 2
  • Opened on Oct 11, 2019
  • #38
ProTip! Restrict your search to the title by using the in:title qualifier.