# Eve-2-MoE-IT-272M

*The Foundation for Nano-Scale Swarm Intelligence*
Eve-2-MoE-IT-272M is a 272M parameter instruction-tuned model designed as the foundational base for the Eve Swarm: a collection of hyper-specialized, CPU-deployable adapters.
Unlike massive generalist LLMs, Eve is built for deterministic-ish transformations. She is designed to be "overfitted" into specialists that perform one job perfectly (e.g., SQL generation, Git commits, JSON extraction) with negligible latency and cost.
Author: Anthony Maio / Public Outputs
## The Eve Swarm (Specialist Ecosystem)
This model serves as the parent for the following Full Fine-Tuned (FFT) specialists. All members were trained on an NVIDIA H200 SXM to ensure optimal embedding alignment. A sketch of how a router-plus-specialist dispatch could be wired together follows the table.
Note (Feb 2026): The base model was updated with 10B tokens of continued pretraining (see Training History below). All specialists below were trained on the v1 base weights and need to be retrained on the updated base to benefit from the improved foundation. The GGUF quantizations also need to be regenerated from the new IT weights.
| Specialist Model | Task | Dataset Source | Size | Final Loss (Training Samples) | Status |
|---|---|---|---|---|---|
| Eve-NanoFunction | Strict JSON Function Calling: produces valid JSON outputs from natural language. | glaive-function-calling-v2 | 272M | <0.4 (35k samples) | Needs retrain |
| Eve-NanoSummary | Conversation Summarization: condenses dialogues into concise summaries. | knkarthick/dialogsum | 272M | <1.0 (12.5k samples) | Needs retrain |
| Eve-NanoCommit | Git Diff → Commit Message: writes conventional commits from raw code diffs. | bigcode/commitpackft | 272M | <1.0 (20k samples) | Needs retrain |
| Eve-NanoExtract | Text → Structured Data: extracts parameters/entities into strict JSON schemas. | Salesforce/xlam-function-calling | 272M | <0.4 (20k samples) | Needs retrain |
| Eve-NanoSQL | Natural Language → SQL: converts questions to SQL using table context. | b-mc2/sql-create-context | 272M | <0.2 (25k samples) | Needs retrain |
| Eve-NanoPrompt | Prompt Expansion: expands simple ideas into rich image-generation prompts. | Stable-Diffusion-Prompts | 272M | <1.0 (15k samples) | Needs retrain |
| Eve-NanoRouter | Intent Classification: routes user queries to the correct swarm member. | bitext/customer-support | 272M | <0.3 (25k samples) | Needs retrain |
| Eve-NanoPII | PII Redaction: identifies and masks sensitive entities. | ai4privacy/pii-masking-200k | 272M | <0.1 (35k samples) | Needs retrain |
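To make the dispatch pattern concrete, here is a minimal sketch of how a router-plus-specialist pipeline could be wired together. The repo IDs, intent labels, and the User/Assistant prompt template below are illustrative assumptions, not published interfaces for the specialists.

```python
# Hypothetical swarm dispatch: Eve-NanoRouter classifies the intent, then the
# matching specialist produces the final output. Repo IDs and intent labels
# are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def load(repo_id):
    tok = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id, trust_remote_code=True, torch_dtype=torch.bfloat16
    )
    return tok, model

def run(tok, model, prompt, max_new_tokens=128):
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Return only the newly generated tokens, not the echoed prompt.
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

router = load("anthonym21/Eve-NanoRouter")            # hypothetical repo id
specialists = {
    "sql": load("anthonym21/Eve-NanoSQL"),            # hypothetical repo id
    "summary": load("anthonym21/Eve-NanoSummary"),    # hypothetical repo id
}

query = "How many orders were placed last week?"
intent = run(*router, f"User: {query}\nAssistant:").strip().lower()
tok, model = specialists.get(intent, specialists["sql"])  # crude fallback
print(run(tok, model, f"User: {query}\nAssistant:"))
```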
## Training History
### Run 1: Initial Pretraining (v1)
- Dataset: HuggingFaceFW/fineweb-edu (10B tokens)
- Hardware: NVIDIA H200 SXM (141GB VRAM)
- Result: Trained from scratch to a functional base model
### Run 2: Continued Pretraining (v2, current base)
- Dataset: HuggingFaceFW/finepdfs (eng_Latn subset), 10B tokens
- Hardware: NVIDIA A100 80GB (Google Colab)
- Duration: 14.7 hours
- Throughput: 189,295 tok/s
- Val perplexity: 36.8 → 32.3 (12.2% improvement)
- Result: Stronger base with improved document understanding from academic/scientific PDFs
### Run 3: Instruction Tuning (IT, this model)
- Dataset: yahma/alpaca-cleaned, ~52K instruction-response pairs, 3 epochs
- Hardware: NVIDIA H200 SXM (141GB VRAM)
- Method: Full Fine-Tuning with response-only loss masking (prompt tokens excluded from loss)
- Duration: ~1 hour
- Batch size: 128
- Peak LR: 2e-5 (cosine decay)
- WikiText-2 val ppl: 32.6 → 36.2 (slight increase expected; the model shifted toward the instruction-following distribution)
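For reference, a WikiText-2 perplexity number like the one above can be reproduced with a simple chunked evaluation. The snippet below is a generic recipe under assumed settings (wikitext-2-raw-v1 validation split, non-overlapping 2048-token chunks), not the exact evaluation script behind the figures on this card.

```python
# Sketch: estimate validation perplexity on WikiText-2 in non-overlapping
# 2048-token chunks. Dataset config and chunking strategy are assumptions.
import math
import torch
import torch.nn.functional as F
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "anthonym21/Eve-2-MoE-IT-272M"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="validation")["text"])
ids = tok(text, return_tensors="pt").input_ids

ctx = 2048
total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for start in range(0, ids.size(1) - 1, ctx):
        chunk = ids[:, start : start + ctx]
        logits = model(chunk).logits.float()
        # Next-token cross-entropy summed over the chunk.
        nll = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            chunk[:, 1:].reshape(-1),
            reduction="sum",
        )
        total_nll += nll.item()
        total_tokens += chunk.size(1) - 1

print("val ppl:", math.exp(total_nll / total_tokens))
```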
## Technical Specifications
### Architecture: Nano-MoE
Eve uses a DeepSeek-style Mixture-of-Experts architecture scaled down to the "Nano" range; a rough sketch of the routed-expert layer follows the spec list below.
- Total Parameters: 272M
- Active Parameters: ~80M (per token)
- Experts: 8 routed + 1 shared
- Top-K: 2
- Context Window: 2048 tokens
- Vocab: 50,304 (GPT-2 compatible)
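To make the routing concrete, here is an illustrative sketch of such a block: one shared expert that always runs, plus eight routed experts of which the top two (by router probability) are mixed per token. The hidden sizes are placeholders, not the model's real dimensions; the actual implementation ships as remote code in this repo.

```python
# Illustrative Nano-MoE feed-forward block: 1 shared expert + 8 routed experts,
# top-2 routing per token. Dimensions below are placeholders, not the real config.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(F.silu(self.up(x)))

class NanoMoEBlock(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_routed=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.routed = nn.ModuleList([Expert(d_model, d_ff) for _ in range(n_routed)])
        self.shared = Expert(d_model, d_ff)  # always active, regardless of routing
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)      # routing probabilities
        weights, idx = probs.topk(self.top_k, dim=-1)  # top-2 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        # Dense reference version: run every routed expert, keep only the top-k mix.
        # A real implementation dispatches tokens sparsely, which is why only a
        # fraction of the total parameters is active per token.
        expert_out = torch.stack([e(x) for e in self.routed], dim=1)  # (n_tokens, 8, d)
        mix = torch.zeros_like(probs).scatter(-1, idx, weights)       # (n_tokens, 8)
        return self.shared(x) + torch.einsum("te,ted->td", mix, expert_out)

print(NanoMoEBlock()(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```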
### Training Config (H200 SXM)
This model was trained using Full Fine-Tuning (FFT). We found that LoRA was insufficient for aligning the embeddings of such a small model; unfreezing all weights yielded significant performance gains. A sketch of roughly equivalent hyperparameters follows the list below. You don't need an H200; it's absurdly overkill. I love it.
- Hardware: NVIDIA H200 SXM (141GB VRAM)
- Method: Full Fine-Tuning (No PEFT/LoRA)
- Precision: bfloat16
- Batch Size: 128 (Global)
- Learning Rate: 2e-5 (Cosine Schedule) for the IT run; 5e-5 for specialist FFT
- Loss Masking: Response-only (user prompt tokens masked via a loss_mask tensor)
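For reference, a rough translation of these settings into Hugging Face TrainingArguments might look like the sketch below. It is an approximation of the listed hyperparameters, not the actual training script; in particular, the split between per-device batch size and gradient accumulation is an assumption.

```python
# Approximate FFT hyperparameters expressed as TrainingArguments. Only the
# 128 global batch, 2e-5 peak LR, cosine schedule, bf16, and 3 epochs come
# from this card; everything else is a placeholder.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="eve2-moe-it",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=8,   # 16 * 8 = 128 global batch
    learning_rate=2e-5,              # peak LR for the IT run (5e-5 for specialist FFT)
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    bf16=True,                       # bfloat16 precision
    logging_steps=10,
    save_strategy="epoch",
)
```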
## How to Tune Eve 2
If you want to train your own Eve specialist, follow these rules derived from our H200 experiments:
- Abandon LoRA: For a 272M model, LoRA restricts the embedding space too much. You have the VRAM; use Full Fine-Tuning.
- Mask User Prompts: You must use a collator that masks the prompt (loss only on response tokens); a minimal sketch follows this list. If the model calculates loss on the instruction, it wastes capacity learning English grammar instead of the task.
- Batch Size Matters: We saturated the H200 with `batch_size=128`. High batch sizes stabilize the gradients for these volatile small architectures.
- Dataset Quality > Quantity:
  - Bad: 100k rows of scraped web text.
  - Good: 10k rows of "Input → Ideal Output" pairs.
- Sweet Spot: 2 Epochs. Do not over-train; these models memorize quickly.
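The sketch below shows one way to implement the prompt-masking collator described above: prompt-token labels are set to -100 so the cross-entropy loss covers only the response. The field names and the User/Assistant template are Alpaca-style assumptions, not the exact collator used for Eve.

```python
# Minimal prompt-masking collator sketch: loss is computed only on response
# tokens. Field names ("instruction", "output") and the prompt template are
# assumptions; adapt them to your dataset.
import torch
from torch.nn.utils.rnn import pad_sequence

def mask_prompt_collate(batch, tokenizer, max_len=2048):
    input_ids, labels, attention = [], [], []
    for ex in batch:
        prompt_ids = tokenizer(f"User: {ex['instruction']}\nAssistant:",
                               add_special_tokens=False).input_ids
        response_ids = tokenizer(f" {ex['output']}{tokenizer.eos_token}",
                                 add_special_tokens=False).input_ids
        ids = (prompt_ids + response_ids)[:max_len]
        lab = ([-100] * len(prompt_ids) + response_ids)[:max_len]  # ignore prompt tokens
        input_ids.append(torch.tensor(ids))
        labels.append(torch.tensor(lab))
        attention.append(torch.ones(len(ids), dtype=torch.long))
    pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0
    return {
        "input_ids": pad_sequence(input_ids, batch_first=True, padding_value=pad_id),
        "labels": pad_sequence(labels, batch_first=True, padding_value=-100),
        "attention_mask": pad_sequence(attention, batch_first=True, padding_value=0),
    }
```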
## Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "anthonym21/Eve-2-MoE-IT-272M"
# Load with trust_remote_code=True for custom MoE architecture
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_id,
trust_remote_code=True,
torch_dtype=torch.bfloat16,
device_map="auto"
)
# Standard formatting
prompt = "User: Explain the concept of Semantic Quantization.\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.6)
print(tokenizer.decode(out[0], skip_special_tokens=True))
## GGUF Quantizations
Quantized versions are available at anthonym21/Eve-2-MoE-IT-272M-GGUF (a download sketch follows the table):
| Quantization | Filename | Size |
|---|---|---|
| Q8_0 | Eve-2-MoE-IT-272M-Q8_0.gguf | ~318 MB |
| Q4_K_M | Eve-2-MoE-IT-272M-Q4_K_M.gguf | ~204 MB |
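The snippet below fetches one of the quantized files with huggingface_hub; the returned path can then be loaded by any GGUF-compatible runtime such as llama.cpp. The repo ID and filename come from the table above.

```python
# Download a quantized GGUF from the companion repo, then point a
# GGUF-compatible runtime (e.g. llama.cpp) at the returned local path.
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="anthonym21/Eve-2-MoE-IT-272M-GGUF",
    filename="Eve-2-MoE-IT-272M-Q4_K_M.gguf",
)
print(gguf_path)
```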
## Citation
@misc{maio2026eve2moeit,
author = {Maio, Anthony D.},
title = {Eve-2-MoE-IT-272M: A Nano-MoE Foundation for Swarm Intelligence},
year = {2026},
publisher = {Maio, Anthony D.},
url = {https://huggingface.co/anthonym21/Eve-2-MoE-IT-272M}
}
Base model: anthonym21/Eve-2-MoE-272M