A Rust SentencePiece tokenizer implementation

Tokenization into words or sub-word units is a key component of Natural Language Processing pipelines. Modern approaches such as Byte Pair Encoding (Sennrich et al., 2015), WordPiece, or SentencePiece (Kudo et al., 2018) segment rare words into sub-tokens…
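To make the sub-word idea concrete, here is a minimal Rust sketch of the greedy longest-match-first segmentation that WordPiece-style tokenizers use: a word absent from the vocabulary is split into the longest vocabulary entries available, with word-internal pieces marked by a `##` prefix. This is an illustrative toy under assumed names (`wordpiece_segment`, the tiny vocabulary), not the API of the crate the article describes.

```rust
use std::collections::HashSet;

/// Greedy longest-match-first segmentation, in the style of WordPiece.
/// Splits `word` into the longest vocabulary entries it can find,
/// marking word-internal pieces with a "##" continuation prefix.
/// Returns None if some part of the word cannot be matched at all.
fn wordpiece_segment(word: &str, vocab: &HashSet<String>) -> Option<Vec<String>> {
    let chars: Vec<char> = word.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        // Try the longest remaining span first, then shrink from the right.
        let mut end = chars.len();
        let mut found = None;
        while start < end {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                piece = format!("##{piece}");
            }
            if vocab.contains(&piece) {
                found = Some(piece);
                break;
            }
            end -= 1;
        }
        match found {
            Some(piece) => {
                pieces.push(piece);
                start = end;
            }
            // No prefix matched: this vocabulary cannot segment the word.
            None => return None,
        }
    }
    Some(pieces)
}

fn main() {
    let vocab: HashSet<String> = ["token", "##izer", "##ization", "un", "##seen"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    // "tokenization" is rare as a whole word, but splits into known sub-tokens.
    println!("{:?}", wordpiece_segment("tokenization", &vocab));
    // => Some(["token", "##ization"])
}
```

SentencePiece proper goes further, operating on raw text (spaces included) and choosing segmentations with a unigram language model rather than a greedy match, but the vocabulary-lookup loop above captures the core mechanism of sub-word splitting.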
