---
library_name: transformers
license: apache-2.0
language:
- en
tags:
- tokenizer
- bpe
- byte-level
- chatml
- tool-use
- code
- python
pipeline_tag: text-generation
datasets:
- nvidia/Nemotron-CC-HQ
- HuggingFaceTB/smoltalk
- sahil2801/CodeAlpaca-20k
---

# Daisy Tokenizer

Custom byte-level BPE tokenizer trained for the Daisy language model, optimized for Python code and instruction-following tasks.

## Details

| Property            | Value               |
|---------------------|---------------------|
| **Vocabulary size** | 49,152              |
| **Algorithm**       | Byte-level BPE      |
| **Pre-tokenizer**   | Llama-3 style regex |
| **Chat format**     | ChatML              |
| **Max length**      | 131,072 tokens      |
| **Training date**   | 2026-01-14          |

## Features

- **Python-optimized**: Trained on Python code for efficient tokenization
- **Tool calling**: Native support for `<|tool_call|>` / `<|tool_result|>` patterns
- **Inline computation**: Support for `<|python|>` / `<|output|>` for calculator-style reasoning
- **Chain-of-thought**: `<|think|>` tokens for reasoning blocks
- **No UNK tokens**: Byte-level fallback handles any Unicode input

## Special Tokens

| Token                 | ID    | Purpose                    |
|-----------------------|-------|----------------------------|
| `<\|endoftext\|>`     | 49131 | End of sequence / BOS      |
| `<\|pad\|>`           | 49132 | Padding token              |
| `<\|im_start\|>`      | 49133 | Start of message (ChatML)  |
| `<\|im_end\|>`        | 49134 | End of message (ChatML)    |
| `<\|tool_call\|>`     | 49135 | Start of tool call         |
| `<\|/tool_call\|>`    | 49136 | End of tool call           |
| `<\|tool_result\|>`   | 49137 | Start of tool result       |
| `<\|/tool_result\|>`  | 49138 | End of tool result         |
| `<\|python\|>`        | 49139 | Start of Python expression |
| `<\|/python\|>`       | 49140 | End of Python expression   |
| `<\|output\|>`        | 49141 | Start of computed output   |
| `<\|/output\|>`       | 49142 | End of computed output     |
| `<\|think\|>`         | 49143 | Start of reasoning block   |
| `<\|/think\|>`        | 49144 | End of reasoning block     |
| `<\|system\|>`        | 49145 | System role marker         |
| `<\|user\|>`          | 49146 | User role marker           |
| `<\|assistant\|>`     | 49147 | Assistant role marker      |
| `<\|reserved_0\|>`    | 49148 | Reserved                   |
| `<\|reserved_1\|>`    | 49149 | Reserved                   |
| `<\|reserved_2\|>`    | 49150 | Reserved                   |
| `<\|reserved_3\|>`    | 49151 | Reserved                   |

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jonathanmiddleton/daisy")

# Basic encoding
tokens = tokenizer.encode("Hello, world!")

# Chat formatting
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there! How can I help you?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
```

## Chat Template Format

```
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_message}<|im_end|>
```
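For generation, the same template can be rendered with an open assistant turn. The sketch below assumes the bundled chat template follows the usual ChatML convention of ending the prompt with `<|im_start|>assistant` when `add_generation_prompt=True`; the token IDs checked at the end come from the Special Tokens table above.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jonathanmiddleton/daisy")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
]

# Render the prompt and leave an open assistant turn for the model to complete.
# Assumption: the template appends "<|im_start|>assistant\n" when
# add_generation_prompt=True, as in standard ChatML templates.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)

# The ChatML markers map to single token IDs (see the Special Tokens table).
print(tokenizer.convert_tokens_to_ids("<|im_start|>"))  # 49133
print(tokenizer.convert_tokens_to_ids("<|im_end|>"))    # 49134
```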
### Tool Calling Example

```
<|im_start|>assistant
Let me calculate that for you.
<|tool_call|>{"name": "calculator", "arguments": {"expression": "2 + 2"}}<|/tool_call|>
<|tool_result|>4<|/tool_result|>
The answer is 4.<|im_end|>
```

## Compression Ratios

Benchmarked against common tokenizers on Python code, prose, and instruction data:

### Python Code (SmolTalk self-oss-instruct, 504 samples)

| Tokenizer                           | Vocab Size | Chars/Token | Tokens      |
|-------------------------------------|------------|-------------|-------------|
| meta-llama/Llama-3.2-3B-Instruct    | 128,000    | 4.391       | 88,644      |
| Qwen/Qwen2.5-1.5B-Instruct          | 151,643    | 4.366       | 89,139      |
| HuggingFaceTB/SmolLM2-135M-Instruct | 49,152     | 3.906       | 99,650      |
| **JonathanMiddleton/daisy**         | **49,131** | **3.766**   | **103,349** |
| microsoft/phi-2                     | 50,257     | 3.628       | 107,290     |
| openai-community/gpt2               | 50,257     | 3.152       | 123,467     |

### English Prose (FineWeb-Edu, 505 samples)

| Tokenizer                           | Vocab Size | Chars/Token | Tokens      |
|-------------------------------------|------------|-------------|-------------|
| meta-llama/Llama-3.2-3B-Instruct    | 128,000    | 4.681       | 466,617     |
| **JonathanMiddleton/daisy**         | **49,131** | **4.594**   | **475,422** |
| openai-community/gpt2               | 50,257     | 4.584       | 476,460     |
| microsoft/phi-2                     | 50,257     | 4.584       | 476,461     |
| Qwen/Qwen2.5-1.5B-Instruct          | 151,643    | 4.563       | 478,607     |
| HuggingFaceTB/SmolLM2-135M-Instruct | 49,152     | 4.475       | 488,120     |

### Instructions (SmolTalk, 504 samples)

| Tokenizer                           | Vocab Size | Chars/Token | Tokens      |
|-------------------------------------|------------|-------------|-------------|
| meta-llama/Llama-3.2-3B-Instruct    | 128,000    | 4.771       | 737,130     |
| Qwen/Qwen2.5-1.5B-Instruct          | 151,643    | 4.731       | 743,360     |
| **JonathanMiddleton/daisy**         | **49,131** | **4.487**   | **783,803** |
| HuggingFaceTB/SmolLM2-135M-Instruct | 49,152     | 4.455       | 789,399     |
| microsoft/phi-2                     | 50,257     | 4.437       | 792,658     |
| openai-community/gpt2               | 50,257     | 4.254       | 826,711     |

### Cross-Content Average

| Tokenizer                           | Python    | Prose     | Instruction | Average   |
|-------------------------------------|-----------|-----------|-------------|-----------|
| meta-llama/Llama-3.2-3B-Instruct    | 4.391     | 4.681     | 4.771       | 4.614     |
| Qwen/Qwen2.5-1.5B-Instruct          | 4.366     | 4.563     | 4.731       | 4.554     |
| **JonathanMiddleton/daisy**         | **3.766** | **4.594** | **4.487**   | **4.282** |
| HuggingFaceTB/SmolLM2-135M-Instruct | 3.906     | 4.475     | 4.455       | 4.278     |
| microsoft/phi-2                     | 3.628     | 4.584     | 4.437       | 4.216     |
| openai-community/gpt2               | 3.152     | 4.584     | 4.254       | 3.997     |

**Key findings**: Daisy achieves competitive compression with a ~49K vocabulary. Among the similar-sized tokenizers tested (SmolLM2, phi-2, GPT-2), it compresses prose and instructions best and trails only SmolLM2 on Python code. On prose it ranks 2nd overall, behind only Llama-3.2's 128K-entry vocabulary, and its cross-content average is the highest of the sub-51K-vocabulary tokenizers tested.

## Training Data

- **General text**: lehduong/nemotron-cc-hq (~60%)
- **Python code**: HuggingFaceTB/smoltalk, self-oss-instruct (~25%)
- **Instructions**: HuggingFaceTB/OpenHermes-2.5-H4, OpenHermes (~15%)

## License

Apache 2.0
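## Appendix: Measuring Chars/Token

The compression tables above report characters per token over fixed sample sets. The exact benchmark script is not part of this card; the snippet below is a minimal sketch of how a figure of the same kind can be computed. The sample texts and the `add_special_tokens=False` choice are illustrative assumptions, not the original setup.

```python
from transformers import AutoTokenizer

def chars_per_token(tokenizer, texts):
    """Average characters per token over a corpus (higher = better compression)."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(
        len(tokenizer.encode(t, add_special_tokens=False)) for t in texts
    )
    return total_chars / total_tokens, total_tokens

# Illustrative samples only; the tables above use 504-505 documents per corpus
# drawn from SmolTalk, FineWeb-Edu, and self-oss-instruct.
samples = [
    "def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a\n",
    "The quick brown fox jumps over the lazy dog.",
]

for name in ["jonathanmiddleton/daisy", "openai-community/gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    ratio, n_tokens = chars_per_token(tok, samples)
    print(f"{name}: {ratio:.3f} chars/token ({n_tokens} tokens)")
```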