---
library_name: transformers
license: apache-2.0
language:
- en
tags:
- tokenizer
- bpe
- byte-level
- chatml
- tool-use
- code
- python
pipeline_tag: text-generation
datasets:
- nvidia/Nemotron-CC-HQ
- HuggingFaceTB/smoltalk
- sahil2801/CodeAlpaca-20k
---
# Daisy Tokenizer

Custom byte-level BPE tokenizer trained for the Daisy language model, optimized for Python code and instruction-following tasks.
## Details

| Property | Value |
|---|---|
| Vocabulary size | 49,152 |
| Algorithm | Byte-level BPE |
| Pre-tokenizer | Llama-3 style regex |
| Chat format | ChatML |
| Max length | 131,072 tokens |
| Training date | 2026-01-14 |
## Features

- **Python-optimized**: Trained on Python code for efficient tokenization
- **Tool calling**: Native support for `<|tool_call|>` / `<|tool_result|>` patterns
- **Inline computation**: Support for `<|python|>` / `<|output|>` for calculator-style reasoning
- **Chain-of-thought**: `<|think|>` tokens for reasoning blocks
- **No UNK tokens**: Byte-level fallback handles any Unicode input (see the sketch below)
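
The byte-level fallback can be checked directly: arbitrary Unicode input round-trips through the tokenizer without an unknown token ever appearing. A minimal sketch (repo id as in the usage section below):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jonathanmiddleton/daisy")

# Mixed-script input: accented Latin, an arrow, CJK, and an emoji.
text = "naïve café → 東京 🚀"

# Encode without special tokens so the round-trip is exact.
ids = tokenizer.encode(text, add_special_tokens=False)

# Byte-level BPE falls back to raw bytes, so decoding recovers the input
# and no unknown-token id is produced.
assert tokenizer.decode(ids) == text
assert tokenizer.unk_token_id is None or tokenizer.unk_token_id not in ids
```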
## Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<\|endoftext\|>` | 49131 | End of sequence / BOS |
| `<\|pad\|>` | 49132 | Padding token |
| `<\|im_start\|>` | 49133 | Start of message (ChatML) |
| `<\|im_end\|>` | 49134 | End of message (ChatML) |
| `<\|tool_call\|>` | 49135 | Start of tool call |
| `<\|/tool_call\|>` | 49136 | End of tool call |
| `<\|tool_result\|>` | 49137 | Start of tool result |
| `<\|/tool_result\|>` | 49138 | End of tool result |
| `<\|python\|>` | 49139 | Start of Python expression |
| `<\|/python\|>` | 49140 | End of Python expression |
| `<\|output\|>` | 49141 | Start of computed output |
| `<\|/output\|>` | 49142 | End of computed output |
| `<\|think\|>` | 49143 | Start of reasoning block |
| `<\|/think\|>` | 49144 | End of reasoning block |
| `<\|system\|>` | 49145 | System role marker |
| `<\|user\|>` | 49146 | User role marker |
| `<\|assistant\|>` | 49147 | Assistant role marker |
| `<\|reserved_0\|>` | 49148 | Reserved |
| `<\|reserved_1\|>` | 49149 | Reserved |
| `<\|reserved_2\|>` | 49150 | Reserved |
| `<\|reserved_3\|>` | 49151 | Reserved |
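
The IDs above can be verified against the loaded vocabulary. A minimal sketch using standard `transformers` tokenizer methods:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jonathanmiddleton/daisy")

# Look up each special token's id straight from the vocabulary.
for token in ["<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|tool_call|>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))

# All registered special tokens in one place.
print(tokenizer.all_special_tokens)
```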
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jonathanmiddleton/daisy")

# Plain text encoding
tokens = tokenizer.encode("Hello, world!")

# Chat formatting via the built-in ChatML template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there! How can I help you?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
```
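
When building a prompt for generation rather than rendering a finished conversation, the same template can append the assistant header and return token ids directly. A minimal sketch using standard `apply_chat_template` arguments:

```python
# add_generation_prompt=True appends the assistant turn opener
# (per the ChatML format below) so the model continues as the assistant.
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2?"}],
    add_generation_prompt=True,
    tokenize=True,
)
```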
## Chat Template Format

```
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_message}<|im_end|>
```
## Tool Calling Example

```
<|im_start|>assistant
Let me calculate that for you.
<|tool_call|>{"name": "calculator", "arguments": {"expression": "2 + 2"}}<|/tool_call|>
<|tool_result|>4<|/tool_result|>
The answer is 4.<|im_end|>
```
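
Because tool calls are bracketed by dedicated special tokens, they can be pulled out of generated text with plain string matching. A minimal sketch; the parsing helper is illustrative and not part of the tokenizer:

```python
import json
import re

def extract_tool_calls(generated_text: str) -> list[dict]:
    """Return the JSON payload of every <|tool_call|>...<|/tool_call|> span."""
    pattern = re.compile(r"<\|tool_call\|>(.*?)<\|/tool_call\|>", re.DOTALL)
    return [json.loads(payload) for payload in pattern.findall(generated_text)]

text = (
    "Let me calculate that for you.\n"
    '<|tool_call|>{"name": "calculator", "arguments": {"expression": "2 + 2"}}<|/tool_call|>'
)
calls = extract_tool_calls(text)
print(calls[0]["name"], calls[0]["arguments"])  # calculator {'expression': '2 + 2'}
```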
## Compression Ratios

Benchmarked against common tokenizers on Python code, prose, and instruction data:
### Python Code (SmolTalk self-oss-instruct, 504 samples)

| Tokenizer | Vocab Size | Chars/Token | Tokens |
|---|---|---|---|
| meta-llama/Llama-3.2-3B-Instruct | 128,000 | 4.391 | 88,644 |
| Qwen/Qwen2.5-1.5B-Instruct | 151,643 | 4.366 | 89,139 |
| HuggingFaceTB/SmolLM2-135M-Instruct | 49,152 | 3.906 | 99,650 |
| JonathanMiddleton/daisy | 49,131 | 3.766 | 103,349 |
| microsoft/phi-2 | 50,257 | 3.628 | 107,290 |
| openai-community/gpt2 | 50,257 | 3.152 | 123,467 |
### English Prose (FineWeb-Edu, 505 samples)

| Tokenizer | Vocab Size | Chars/Token | Tokens |
|---|---|---|---|
| meta-llama/Llama-3.2-3B-Instruct | 128,000 | 4.681 | 466,617 |
| JonathanMiddleton/daisy | 49,131 | 4.594 | 475,422 |
| openai-community/gpt2 | 50,257 | 4.584 | 476,460 |
| microsoft/phi-2 | 50,257 | 4.584 | 476,461 |
| Qwen/Qwen2.5-1.5B-Instruct | 151,643 | 4.563 | 478,607 |
| HuggingFaceTB/SmolLM2-135M-Instruct | 49,152 | 4.475 | 488,120 |
### Instructions (SmolTalk, 504 samples)

| Tokenizer | Vocab Size | Chars/Token | Tokens |
|---|---|---|---|
| meta-llama/Llama-3.2-3B-Instruct | 128,000 | 4.771 | 737,130 |
| Qwen/Qwen2.5-1.5B-Instruct | 151,643 | 4.731 | 743,360 |
| JonathanMiddleton/daisy | 49,131 | 4.487 | 783,803 |
| HuggingFaceTB/SmolLM2-135M-Instruct | 49,152 | 4.455 | 789,399 |
| microsoft/phi-2 | 50,257 | 4.437 | 792,658 |
| openai-community/gpt2 | 50,257 | 4.254 | 826,711 |
### Cross-Content Average (Chars/Token)

| Tokenizer | Python | Prose | Instruction | Average |
|---|---|---|---|---|
| meta-llama/Llama-3.2-3B-Instruct | 4.391 | 4.681 | 4.771 | 4.614 |
| Qwen/Qwen2.5-1.5B-Instruct | 4.366 | 4.563 | 4.731 | 4.554 |
| JonathanMiddleton/daisy | 3.766 | 4.594 | 4.487 | 4.282 |
| HuggingFaceTB/SmolLM2-135M-Instruct | 3.906 | 4.475 | 4.455 | 4.278 |
| microsoft/phi-2 | 3.628 | 4.584 | 4.437 | 4.216 |
| openai-community/gpt2 | 3.152 | 4.584 | 4.254 | 3.997 |
**Key findings**: Daisy achieves competitive compression with a ~49K vocabulary, leading the similar-sized (~50K-vocabulary) tokenizers on prose and instruction data and trailing only the much larger Llama-3.2 and Qwen2.5 vocabularies on the cross-content average, while maintaining strong Python performance.
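
Chars-per-token figures like those above can be reproduced with a short measurement script. A minimal sketch; the exact dataset slices and sample counts behind the published numbers are not included here, so the corpus below is only a placeholder:

```python
from transformers import AutoTokenizer

def chars_per_token(model_id: str, texts: list[str]) -> float:
    """Average characters per token over a list of documents."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(
        len(tokenizer.encode(t, add_special_tokens=False)) for t in texts
    )
    return total_chars / total_tokens

# Placeholder corpus; the published benchmarks sampled SmolTalk and FineWeb-Edu.
corpus = [
    "def add(a, b):\n    return a + b\n",
    "The quick brown fox jumps over the lazy dog.",
]
for model_id in ["jonathanmiddleton/daisy", "openai-community/gpt2"]:
    print(model_id, round(chars_per_token(model_id, corpus), 3))
```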
## Training Data

- **General text**: lehduong/nemotron-cc-hq (~60%)
- **Python code**: HuggingFaceTB/smoltalk, self-oss-instruct (~25%)
- **Instructions**: HuggingFaceTB/OpenHermes-2.5-H4, OpenHermes (~15%)
## License

Apache 2.0