---
library_name: transformers
license: apache-2.0
language:
- en
tags:
- tokenizer
- bpe
- byte-level
- chatml
- tool-use
- code
- python
pipeline_tag: text-generation
datasets:
- nvidia/Nemotron-CC-HQ
- HuggingFaceTB/smoltalk
- sahil2801/CodeAlpaca-20k
---
# Daisy Tokenizer

Custom byte-level BPE tokenizer trained for the Daisy language model, optimized for Python code and instruction-following tasks.
## Details

| Property | Value |
|---|---|
| Vocabulary size | 49,152 |
| Algorithm | Byte-level BPE |
| Pre-tokenizer | Llama-3 style regex |
| Chat format | ChatML |
| Max length | 131,072 tokens |
| Training date | 2026-01-14 |
## Features

- **Python-optimized**: Trained on Python code for efficient tokenization
- **Tool calling**: Native support for `<|tool_call|>` / `<|tool_result|>` patterns
- **Inline computation**: Support for `<|python|>` / `<|output|>` for calculator-style reasoning
- **Chain-of-thought**: `<|think|>` tokens for reasoning blocks
- **No UNK tokens**: Byte-level fallback handles any Unicode input (see the sketch below)
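
The byte-level fallback can be checked directly: arbitrary Unicode input round-trips through the tokenizer without an unknown token ever appearing. A minimal sketch (repo id as in the usage section below):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jonathanmiddleton/daisy")

# Mixed-script input: accented Latin, an arrow, CJK, and an emoji.
text = "naïve café → 東京 🚀"

# Encode without special tokens so the round-trip is exact.
ids = tokenizer.encode(text, add_special_tokens=False)

# Byte-level BPE falls back to raw bytes, so decoding recovers the input
# and no unknown-token id is produced.
assert tokenizer.decode(ids) == text
assert tokenizer.unk_token_id is None or tokenizer.unk_token_id not in ids
```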
## Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<\|endoftext\|>` | 49131 | End of sequence / BOS |
| `<\|pad\|>` | 49132 | Padding token |
| `<\|im_start\|>` | 49133 | Start of message (ChatML) |
| `<\|im_end\|>` | 49134 | End of message (ChatML) |
| `<\|tool_call\|>` | 49135 | Start of tool call |
| `<\|/tool_call\|>` | 49136 | End of tool call |
| `<\|tool_result\|>` | 49137 | Start of tool result |
| `<\|/tool_result\|>` | 49138 | End of tool result |
| `<\|python\|>` | 49139 | Start of Python expression |
| `<\|/python\|>` | 49140 | End of Python expression |
| `<\|output\|>` | 49141 | Start of computed output |
| `<\|/output\|>` | 49142 | End of computed output |
| `<\|think\|>` | 49143 | Start of reasoning block |
| `<\|/think\|>` | 49144 | End of reasoning block |
| `<\|system\|>` | 49145 | System role marker |
| `<\|user\|>` | 49146 | User role marker |
| `<\|assistant\|>` | 49147 | Assistant role marker |
| `<\|reserved_0\|>` | 49148 | Reserved |
| `<\|reserved_1\|>` | 49149 | Reserved |
| `<\|reserved_2\|>` | 49150 | Reserved |
| `<\|reserved_3\|>` | 49151 | Reserved |
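
The IDs above can be verified against the loaded vocabulary. A minimal sketch using standard `transformers` tokenizer methods:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jonathanmiddleton/daisy")

# Look up each special token's id straight from the vocabulary.
for token in ["<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|tool_call|>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))

# All registered special tokens in one place.
print(tokenizer.all_special_tokens)
```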
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jonathanmiddleton/daisy")

# Plain text encoding
tokens = tokenizer.encode("Hello, world!")

# Chat formatting via the built-in ChatML template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there! How can I help you?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
```
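
When building a prompt for generation rather than rendering a finished conversation, the same template can append the assistant header and return token ids directly. A minimal sketch using standard `apply_chat_template` arguments:

```python
# add_generation_prompt=True appends the assistant turn opener
# (per the ChatML format below) so the model continues as the assistant.
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2?"}],
    add_generation_prompt=True,
    tokenize=True,
)
```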
## Chat Template Format

```
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_message}<|im_end|>
```
## Tool Calling Example

```
<|im_start|>assistant
Let me calculate that for you.
<|tool_call|>{"name": "calculator", "arguments": {"expression": "2 + 2"}}<|/tool_call|>
<|tool_result|>4<|/tool_result|>
The answer is 4.<|im_end|>
```
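
Because tool calls are bracketed by dedicated special tokens, they can be pulled out of generated text with plain string matching. A minimal sketch; the parsing helper is illustrative and not part of the tokenizer:

```python
import json
import re

def extract_tool_calls(generated_text: str) -> list[dict]:
    """Return the JSON payload of every <|tool_call|>...<|/tool_call|> span."""
    pattern = re.compile(r"<\|tool_call\|>(.*?)<\|/tool_call\|>", re.DOTALL)
    return [json.loads(payload) for payload in pattern.findall(generated_text)]

text = (
    "Let me calculate that for you.\n"
    '<|tool_call|>{"name": "calculator", "arguments": {"expression": "2 + 2"}}<|/tool_call|>'
)
calls = extract_tool_calls(text)
print(calls[0]["name"], calls[0]["arguments"])  # calculator {'expression': '2 + 2'}
```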
## Compression Ratios

Benchmarked against common tokenizers on Python code, prose, and instruction data:
### Python Code (SmolTalk self-oss-instruct, 504 samples)

| Tokenizer | Vocab Size | Chars/Token | Tokens |
|---|---|---|---|
| meta-llama/Llama-3.2-3B-Instruct | 128,000 | 4.391 | 88,644 |
| Qwen/Qwen2.5-1.5B-Instruct | 151,643 | 4.366 | 89,139 |
| HuggingFaceTB/SmolLM2-135M-Instruct | 49,152 | 3.906 | 99,650 |
| JonathanMiddleton/daisy | 49,131 | 3.766 | 103,349 |
| microsoft/phi-2 | 50,257 | 3.628 | 107,290 |
| openai-community/gpt2 | 50,257 | 3.152 | 123,467 |
### English Prose (FineWeb-Edu, 505 samples)

| Tokenizer | Vocab Size | Chars/Token | Tokens |
|---|---|---|---|
| meta-llama/Llama-3.2-3B-Instruct | 128,000 | 4.681 | 466,617 |
| JonathanMiddleton/daisy | 49,131 | 4.594 | 475,422 |
| openai-community/gpt2 | 50,257 | 4.584 | 476,460 |
| microsoft/phi-2 | 50,257 | 4.584 | 476,461 |
| Qwen/Qwen2.5-1.5B-Instruct | 151,643 | 4.563 | 478,607 |
| HuggingFaceTB/SmolLM2-135M-Instruct | 49,152 | 4.475 | 488,120 |
### Instructions (SmolTalk, 504 samples)

| Tokenizer | Vocab Size | Chars/Token | Tokens |
|---|---|---|---|
| meta-llama/Llama-3.2-3B-Instruct | 128,000 | 4.771 | 737,130 |
| Qwen/Qwen2.5-1.5B-Instruct | 151,643 | 4.731 | 743,360 |
| JonathanMiddleton/daisy | 49,131 | 4.487 | 783,803 |
| HuggingFaceTB/SmolLM2-135M-Instruct | 49,152 | 4.455 | 789,399 |
| microsoft/phi-2 | 50,257 | 4.437 | 792,658 |
| openai-community/gpt2 | 50,257 | 4.254 | 826,711 |
### Cross-Content Average (Chars/Token)

| Tokenizer | Python | Prose | Instruction | Average |
|---|---|---|---|---|
| meta-llama/Llama-3.2-3B-Instruct | 4.391 | 4.681 | 4.771 | 4.614 |
| Qwen/Qwen2.5-1.5B-Instruct | 4.366 | 4.563 | 4.731 | 4.554 |
| JonathanMiddleton/daisy | 3.766 | 4.594 | 4.487 | 4.282 |
| HuggingFaceTB/SmolLM2-135M-Instruct | 3.906 | 4.475 | 4.455 | 4.278 |
| microsoft/phi-2 | 3.628 | 4.584 | 4.437 | 4.216 |
| openai-community/gpt2 | 3.152 | 4.584 | 4.254 | 3.997 |
**Key findings**: Daisy achieves competitive compression with a ~49K vocabulary, leading the similar-sized (~50K-vocabulary) tokenizers on prose and instruction data and trailing only the much larger Llama-3.2 and Qwen2.5 vocabularies on the cross-content average, while maintaining strong Python performance.
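
Chars-per-token figures like those above can be reproduced with a short measurement script. A minimal sketch; the exact dataset slices and sample counts behind the published numbers are not included here, so the corpus below is only a placeholder:

```python
from transformers import AutoTokenizer

def chars_per_token(model_id: str, texts: list[str]) -> float:
    """Average characters per token over a list of documents."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(
        len(tokenizer.encode(t, add_special_tokens=False)) for t in texts
    )
    return total_chars / total_tokens

# Placeholder corpus; the published benchmarks sampled SmolTalk and FineWeb-Edu.
corpus = [
    "def add(a, b):\n    return a + b\n",
    "The quick brown fox jumps over the lazy dog.",
]
for model_id in ["jonathanmiddleton/daisy", "openai-community/gpt2"]:
    print(model_id, round(chars_per_token(model_id, corpus), 3))
```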
## Training Data

- **General text**: lehduong/nemotron-cc-hq (~60%)
- **Python code**: HuggingFaceTB/smoltalk, self-oss-instruct (~25%)
- **Instructions**: HuggingFaceTB/OpenHermes-2.5-H4, OpenHermes (~15%)
## License

Apache 2.0