---
library_name: transformers
license: apache-2.0
language:
- en
tags:
- tokenizer
- bpe
- byte-level
- chatml
- tool-use
- code
- python
pipeline_tag: text-generation
datasets:
- nvidia/Nemotron-CC-HQ
- HuggingFaceTB/smoltalk
- sahil2801/CodeAlpaca-20k
---

# Daisy Tokenizer

Custom byte-level BPE tokenizer trained for the Daisy language model, optimized for Python code and instruction-following tasks.

## Details

| Property            | Value               |
|---------------------|---------------------|
| **Vocabulary size** | 49,152              |
| **Algorithm**       | Byte-level BPE      |
| **Pre-tokenizer**   | Llama-3 style regex |
| **Chat format**     | ChatML              |
| **Max length**      | 131,072 tokens      |
| **Training date**   | 2026-01-14          |

## Features

- **Python-optimized**: Trained on Python code for efficient tokenization
- **Tool calling**: Native support for `<|tool_call|>` / `<|tool_result|>` patterns
- **Inline computation**: Support for `<|python|>` / `<|output|>` for calculator-style reasoning
- **Chain-of-thought**: `<|think|>` tokens for reasoning blocks
- **No UNK tokens**: Byte-level fallback handles any Unicode input

## Special Tokens

| Token                 | ID    | Purpose                    |
|-----------------------|-------|----------------------------|
| `<\|endoftext\|>`     | 49131 | End of sequence / BOS      |
| `<\|pad\|>`           | 49132 | Padding token              |
| `<\|im_start\|>`      | 49133 | Start of message (ChatML)  |
| `<\|im_end\|>`        | 49134 | End of message (ChatML)    |
| `<\|tool_call\|>`     | 49135 | Start of tool call         |
| `<\|/tool_call\|>`    | 49136 | End of tool call           |
| `<\|tool_result\|>`   | 49137 | Start of tool result       |
| `<\|/tool_result\|>`  | 49138 | End of tool result         |
| `<\|python\|>`        | 49139 | Start of Python expression |
| `<\|/python\|>`       | 49140 | End of Python expression   |
| `<\|output\|>`        | 49141 | Start of computed output   |
| `<\|/output\|>`       | 49142 | End of computed output     |
| `<\|think\|>`         | 49143 | Start of reasoning block   |
| `<\|/think\|>`        | 49144 | End of reasoning block     |
| `<\|system\|>`        | 49145 | System role marker         |
| `<\|user\|>`          | 49146 | User role marker           |
| `<\|assistant\|>`     | 49147 | Assistant role marker      |
| `<\|reserved_0\|>`    | 49148 | Reserved                   |
| `<\|reserved_1\|>`    | 49149 | Reserved                   |
| `<\|reserved_2\|>`    | 49150 | Reserved                   |
| `<\|reserved_3\|>`    | 49151 | Reserved                   |

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jonathanmiddleton/daisy")

# Basic encoding
tokens = tokenizer.encode("Hello, world!")

# Chat formatting
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there! How can I help you?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
```

## Chat Template Format

```
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_message}<|im_end|>
```
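For generation, the same template can be rendered with an open assistant turn. The sketch below assumes the bundled chat template follows the usual ChatML convention of ending the prompt with `<|im_start|>assistant` when `add_generation_prompt=True`; the token IDs checked at the end come from the Special Tokens table above.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jonathanmiddleton/daisy")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
]

# Render the prompt and leave an open assistant turn for the model to complete.
# Assumption: the template appends "<|im_start|>assistant\n" when
# add_generation_prompt=True, as in standard ChatML templates.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)

# The ChatML markers map to single token IDs (see the Special Tokens table).
print(tokenizer.convert_tokens_to_ids("<|im_start|>"))  # 49133
print(tokenizer.convert_tokens_to_ids("<|im_end|>"))    # 49134
```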
### Tool Calling Example

```
<|im_start|>assistant
Let me calculate that for you.
<|tool_call|>{"name": "calculator", "arguments": {"expression": "2 + 2"}}<|/tool_call|>
<|tool_result|>4<|/tool_result|>
The answer is 4.<|im_end|>
```

## Compression Ratios

Benchmarked against common tokenizers on Python code, prose, and instruction data:

### Python Code (SmolTalk self-oss-instruct, 504 samples)

| Tokenizer                           | Vocab Size | Chars/Token | Tokens      |
|-------------------------------------|------------|-------------|-------------|
| meta-llama/Llama-3.2-3B-Instruct    | 128,000    | 4.391       | 88,644      |
| Qwen/Qwen2.5-1.5B-Instruct          | 151,643    | 4.366       | 89,139      |
| HuggingFaceTB/SmolLM2-135M-Instruct | 49,152     | 3.906       | 99,650      |
| **JonathanMiddleton/daisy**         | **49,131** | **3.766**   | **103,349** |
| microsoft/phi-2                     | 50,257     | 3.628       | 107,290     |
| openai-community/gpt2               | 50,257     | 3.152       | 123,467     |

### English Prose (FineWeb-Edu, 505 samples)

| Tokenizer                           | Vocab Size | Chars/Token | Tokens      |
|-------------------------------------|------------|-------------|-------------|
| meta-llama/Llama-3.2-3B-Instruct    | 128,000    | 4.681       | 466,617     |
| **JonathanMiddleton/daisy**         | **49,131** | **4.594**   | **475,422** |
| openai-community/gpt2               | 50,257     | 4.584       | 476,460     |
| microsoft/phi-2                     | 50,257     | 4.584       | 476,461     |
| Qwen/Qwen2.5-1.5B-Instruct          | 151,643    | 4.563       | 478,607     |
| HuggingFaceTB/SmolLM2-135M-Instruct | 49,152     | 4.475       | 488,120     |

### Instructions (SmolTalk, 504 samples)

| Tokenizer                           | Vocab Size | Chars/Token | Tokens      |
|-------------------------------------|------------|-------------|-------------|
| meta-llama/Llama-3.2-3B-Instruct    | 128,000    | 4.771       | 737,130     |
| Qwen/Qwen2.5-1.5B-Instruct          | 151,643    | 4.731       | 743,360     |
| **JonathanMiddleton/daisy**         | **49,131** | **4.487**   | **783,803** |
| HuggingFaceTB/SmolLM2-135M-Instruct | 49,152     | 4.455       | 789,399     |
| microsoft/phi-2                     | 50,257     | 4.437       | 792,658     |
| openai-community/gpt2               | 50,257     | 4.254       | 826,711     |

### Cross-Content Average

| Tokenizer                           | Python    | Prose     | Instruction | Average   |
|-------------------------------------|-----------|-----------|-------------|-----------|
| meta-llama/Llama-3.2-3B-Instruct    | 4.391     | 4.681     | 4.771       | 4.614     |
| Qwen/Qwen2.5-1.5B-Instruct          | 4.366     | 4.563     | 4.731       | 4.554     |
| **JonathanMiddleton/daisy**         | **3.766** | **4.594** | **4.487**   | **4.282** |
| HuggingFaceTB/SmolLM2-135M-Instruct | 3.906     | 4.475     | 4.455       | 4.278     |
| microsoft/phi-2                     | 3.628     | 4.584     | 4.437       | 4.216     |
| openai-community/gpt2               | 3.152     | 4.584     | 4.254       | 3.997     |

**Key findings**: Daisy achieves competitive compression with a ~49K vocabulary. Among the similar-sized tokenizers tested (SmolLM2, phi-2, GPT-2), it compresses prose and instructions best and trails only SmolLM2 on Python code. On prose it ranks 2nd overall, behind only Llama-3.2's 128K-entry vocabulary, and its cross-content average is the highest of the sub-51K-vocabulary tokenizers tested.

## Training Data

- **General text**: lehduong/nemotron-cc-hq (~60%)
- **Python code**: HuggingFaceTB/smoltalk, self-oss-instruct (~25%)
- **Instructions**: HuggingFaceTB/OpenHermes-2.5-H4, OpenHermes (~15%)

## License

Apache 2.0
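## Appendix: Measuring Chars/Token

The compression tables above report characters per token over fixed sample sets. The exact benchmark script is not part of this card; the snippet below is a minimal sketch of how a figure of the same kind can be computed. The sample texts and the `add_special_tokens=False` choice are illustrative assumptions, not the original setup.

```python
from transformers import AutoTokenizer

def chars_per_token(tokenizer, texts):
    """Average characters per token over a corpus (higher = better compression)."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(
        len(tokenizer.encode(t, add_special_tokens=False)) for t in texts
    )
    return total_chars / total_tokens, total_tokens

# Illustrative samples only; the tables above use 504-505 documents per corpus
# drawn from SmolTalk, FineWeb-Edu, and self-oss-instruct.
samples = [
    "def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a\n",
    "The quick brown fox jumps over the lazy dog.",
]

for name in ["jonathanmiddleton/daisy", "openai-community/gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    ratio, n_tokens = chars_per_token(tok, samples)
    print(f"{name}: {ratio:.3f} chars/token ({n_tokens} tokens)")
```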