A Rust SentencePiece tokenizer implementation

Tokenization into words or sub-word units is a key component of Natural Language Processing pipelines. Modern approaches such as Byte Pair Encoding (Sennrich et al., 2015), WordPiece, or SentencePiece (Kudo et al., 2018) segment rare words into sub-tokens i…
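To make the idea of sub-word segmentation concrete, here is a minimal sketch in Rust of a greedy longest-match segmenter in the WordPiece style. It is not the article's SentencePiece implementation (SentencePiece segments text with a learned sub-word model rather than greedy matching); the vocabulary contents, the `##` continuation-piece prefix, and the `segment` function are illustrative assumptions.

```rust
use std::collections::HashSet;

/// Greedy longest-match sub-word segmentation (WordPiece-style sketch).
/// `vocab` holds known sub-word units; continuation pieces carry a "##" prefix.
/// Returns None if some span of the word cannot be matched.
fn segment(word: &str, vocab: &HashSet<String>) -> Option<Vec<String>> {
    let chars: Vec<char> = word.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        // Try the longest candidate first, shrinking until a vocab hit.
        let mut end = chars.len();
        let mut found = None;
        while end > start {
            let mut candidate: String = chars[start..end].iter().collect();
            if start > 0 {
                candidate = format!("##{}", candidate);
            }
            if vocab.contains(&candidate) {
                found = Some(candidate);
                break;
            }
            end -= 1;
        }
        match found {
            Some(piece) => {
                pieces.push(piece);
                start = end;
            }
            None => return None, // no sub-word covers this position
        }
    }
    Some(pieces)
}

fn main() {
    let vocab: HashSet<String> = ["token", "##izer", "##ize", "un", "##known"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    println!("{:?}", segment("tokenizer", &vocab));
}
```

Running `main` prints `Some(["token", "##izer"])`: the rare word "tokenizer" is covered by two in-vocabulary pieces instead of collapsing to a single unknown token, which is the point of all three sub-word schemes above.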

Similar

How to “Rewrite it in Rust”

In a previous article, we talked about how you can avoid rewriting a library in Rust when you don't need to. But what about the times when you really do need to? In most languages you'd need to rewrite the entire library from the ground up, waiting unti…
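The teaser suggests Rust allows a more gradual path. One common pattern (an assumption here, not necessarily the approach this article describes) is to rewrite a library one function at a time and expose each rewritten piece through Rust's C-compatible FFI, so the existing codebase keeps calling the same symbols. A minimal sketch, using a hypothetical `checksum` function:

```rust
// Hypothetical single function migrated from a C library to Rust.
// `#[no_mangle]` keeps the symbol name, and `extern "C"` gives it the
// C calling convention, so existing callers link against it unchanged.
#[no_mangle]
pub extern "C" fn checksum(data: *const u8, len: usize) -> u32 {
    // SAFETY: the caller must uphold the original C API's contract and
    // pass a valid pointer/length pair.
    let bytes = unsafe { std::slice::from_raw_parts(data, len) };
    bytes.iter().fold(0u32, |acc, &b| acc.wrapping_add(u32::from(b)))
}
```

Building the crate with `crate-type = ["cdylib"]` in Cargo.toml produces a shared library that can stand in for the original object file, letting the rewrite proceed function by function rather than all at once.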
