# Eve-2-MoE-IT-272M

*The Foundation for Nano-Scale Swarm Intelligence*
Eve-2-MoE-IT-272M is a 272M parameter instruction-tuned model designed as the foundational base for the Eve Swarm: a collection of hyper-specialized, CPU-deployable adapters.
Unlike massive generalist LLMs, Eve is built for deterministic-ish transformations. She is designed to be "overfitted" into specialists that perform one job perfectly (e.g., SQL generation, Git commits, JSON extraction) with negligible latency and cost.
Author: Anthony Maio / Public Outputs
## The Eve Swarm (Specialist Ecosystem)
This model serves as the parent for the following Full Fine-Tuned (FFT) specialists. All members were trained on an NVIDIA H200 SXM to ensure optimal embedding alignment. A sketch of how a router-plus-specialist dispatch could be wired together follows the table.
Note (Feb 2026): The base model was updated with 10B tokens of continued pretraining (see Training History below). All specialists below were trained on the v1 base weights and need to be retrained on the updated base to benefit from the improved foundation. The GGUF quantizations also need to be regenerated from the new IT weights.
| Specialist Model | Task | Dataset Source | Size | Final Loss (Training Samples) | Status |
|---|---|---|---|---|---|
| Eve-NanoFunction | Strict JSON Function Calling: produces valid JSON outputs from natural language. | glaive-function-calling-v2 | 272M | <0.4 (35k samples) | Needs retrain |
| Eve-NanoSummary | Conversation Summarization: condenses dialogues into concise summaries. | knkarthick/dialogsum | 272M | <1.0 (12.5k samples) | Needs retrain |
| Eve-NanoCommit | Git Diff → Commit Message: writes conventional commits from raw code diffs. | bigcode/commitpackft | 272M | <1.0 (20k samples) | Needs retrain |
| Eve-NanoExtract | Text → Structured Data: extracts parameters/entities into strict JSON schemas. | Salesforce/xlam-function-calling | 272M | <0.4 (20k samples) | Needs retrain |
| Eve-NanoSQL | Natural Language → SQL: converts questions to SQL using table context. | b-mc2/sql-create-context | 272M | <0.2 (25k samples) | Needs retrain |
| Eve-NanoPrompt | Prompt Expansion: expands simple ideas into rich image-generation prompts. | Stable-Diffusion-Prompts | 272M | <1.0 (15k samples) | Needs retrain |
| Eve-NanoRouter | Intent Classification: routes user queries to the correct swarm member. | bitext/customer-support | 272M | <0.3 (25k samples) | Needs retrain |
| Eve-NanoPII | PII Redaction: identifies and masks sensitive entities. | ai4privacy/pii-masking-200k | 272M | <0.1 (35k samples) | Needs retrain |
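To make the dispatch pattern concrete, here is a minimal sketch of how a router-plus-specialist pipeline could be wired together. The repo IDs, intent labels, and the User/Assistant prompt template below are illustrative assumptions, not published interfaces for the specialists.

```python
# Hypothetical swarm dispatch: Eve-NanoRouter classifies the intent, then the
# matching specialist produces the final output. Repo IDs and intent labels
# are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def load(repo_id):
    tok = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id, trust_remote_code=True, torch_dtype=torch.bfloat16
    )
    return tok, model

def run(tok, model, prompt, max_new_tokens=128):
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Return only the newly generated tokens, not the echoed prompt.
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

router = load("anthonym21/Eve-NanoRouter")            # hypothetical repo id
specialists = {
    "sql": load("anthonym21/Eve-NanoSQL"),            # hypothetical repo id
    "summary": load("anthonym21/Eve-NanoSummary"),    # hypothetical repo id
}

query = "How many orders were placed last week?"
intent = run(*router, f"User: {query}\nAssistant:").strip().lower()
tok, model = specialists.get(intent, specialists["sql"])  # crude fallback
print(run(tok, model, f"User: {query}\nAssistant:"))
```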
## Training History
### Run 1: Initial Pretraining (v1)
- Dataset: HuggingFaceFW/fineweb-edu (10B tokens)
- Hardware: NVIDIA H200 SXM (141GB VRAM)
- Result: Trained from scratch to a functional base model
### Run 2: Continued Pretraining (v2, current base)
- Dataset: HuggingFaceFW/finepdfs (eng_Latn subset), 10B tokens
- Hardware: NVIDIA A100 80GB (Google Colab)
- Duration: 14.7 hours
- Throughput: 189,295 tok/s
- Val perplexity: 36.8 → 32.3 (12.2% improvement)
- Result: Stronger base with improved document understanding from academic/scientific PDFs
### Run 3: Instruction Tuning (IT, this model)
- Dataset: yahma/alpaca-cleaned, ~52K instruction-response pairs, 3 epochs
- Hardware: NVIDIA H200 SXM (141GB VRAM)
- Method: Full Fine-Tuning with response-only loss masking (prompt tokens excluded from loss)
- Duration: ~1 hour
- Batch size: 128
- Peak LR: 2e-5 (cosine decay)
- WikiText-2 val ppl: 32.6 → 36.2 (slight increase expected; the model shifted toward the instruction-following distribution)
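For reference, a WikiText-2 perplexity number like the one above can be reproduced with a simple chunked evaluation. The snippet below is a generic recipe under assumed settings (wikitext-2-raw-v1 validation split, non-overlapping 2048-token chunks), not the exact evaluation script behind the figures on this card.

```python
# Sketch: estimate validation perplexity on WikiText-2 in non-overlapping
# 2048-token chunks. Dataset config and chunking strategy are assumptions.
import math
import torch
import torch.nn.functional as F
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "anthonym21/Eve-2-MoE-IT-272M"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="validation")["text"])
ids = tok(text, return_tensors="pt").input_ids

ctx = 2048
total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for start in range(0, ids.size(1) - 1, ctx):
        chunk = ids[:, start : start + ctx]
        logits = model(chunk).logits.float()
        # Next-token cross-entropy summed over the chunk.
        nll = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            chunk[:, 1:].reshape(-1),
            reduction="sum",
        )
        total_nll += nll.item()
        total_tokens += chunk.size(1) - 1

print("val ppl:", math.exp(total_nll / total_tokens))
```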
## Technical Specifications
### Architecture: Nano-MoE
Eve uses a DeepSeek-style Mixture-of-Experts architecture scaled down to the "Nano" range; a rough sketch of the routed-expert layer follows the spec list below.
- Total Parameters: 272M
- Active Parameters: ~80M (per token)
- Experts: 8 routed + 1 shared
- Top-K: 2
- Context Window: 2048 tokens
- Vocab: 50,304 (GPT-2 compatible)
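To make the routing concrete, here is an illustrative sketch of such a block: one shared expert that always runs, plus eight routed experts of which the top two (by router probability) are mixed per token. The hidden sizes are placeholders, not the model's real dimensions; the actual implementation ships as remote code in this repo.

```python
# Illustrative Nano-MoE feed-forward block: 1 shared expert + 8 routed experts,
# top-2 routing per token. Dimensions below are placeholders, not the real config.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(F.silu(self.up(x)))

class NanoMoEBlock(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_routed=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.routed = nn.ModuleList([Expert(d_model, d_ff) for _ in range(n_routed)])
        self.shared = Expert(d_model, d_ff)  # always active, regardless of routing
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)      # routing probabilities
        weights, idx = probs.topk(self.top_k, dim=-1)  # top-2 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        # Dense reference version: run every routed expert, keep only the top-k mix.
        # A real implementation dispatches tokens sparsely, which is why only a
        # fraction of the total parameters is active per token.
        expert_out = torch.stack([e(x) for e in self.routed], dim=1)  # (n_tokens, 8, d)
        mix = torch.zeros_like(probs).scatter(-1, idx, weights)       # (n_tokens, 8)
        return self.shared(x) + torch.einsum("te,ted->td", mix, expert_out)

print(NanoMoEBlock()(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```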
### Training Config (H200 SXM)
This model was trained using Full Fine-Tuning (FFT). We found that LoRA was insufficient for aligning the embeddings of such a small model; unfreezing all weights yielded significant performance gains. A sketch of roughly equivalent hyperparameters follows the list below. You don't need an H200; it's absurdly overkill. I love it.
- Hardware: NVIDIA H200 SXM (141GB VRAM)
- Method: Full Fine-Tuning (No PEFT/LoRA)
- Precision: bfloat16
- Batch Size: 128 (Global)
- Learning Rate: 2e-5 (Cosine Schedule) for the IT run; 5e-5 for specialist FFT
- Loss Masking: Response-only (user prompt tokens masked via a loss_mask tensor)
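For reference, a rough translation of these settings into Hugging Face TrainingArguments might look like the sketch below. It is an approximation of the listed hyperparameters, not the actual training script; in particular, the split between per-device batch size and gradient accumulation is an assumption.

```python
# Approximate FFT hyperparameters expressed as TrainingArguments. Only the
# 128 global batch, 2e-5 peak LR, cosine schedule, bf16, and 3 epochs come
# from this card; everything else is a placeholder.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="eve2-moe-it",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=8,   # 16 * 8 = 128 global batch
    learning_rate=2e-5,              # peak LR for the IT run (5e-5 for specialist FFT)
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    bf16=True,                       # bfloat16 precision
    logging_steps=10,
    save_strategy="epoch",
)
```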
## How to Tune Eve 2
If you want to train your own Eve specialist, follow these rules derived from our H200 experiments:
- Abandon LoRA: For a 272M model, LoRA restricts the embedding space too much. You have the VRAM; use Full Fine-Tuning.
- Mask User Prompts: You must use a collator that masks the prompt (loss only on response tokens); a minimal sketch follows this list. If the model calculates loss on the instruction, it wastes capacity learning English grammar instead of the task.
- Batch Size Matters: We saturated the H200 with `batch_size=128`. High batch sizes stabilize the gradients for these volatile small architectures.
- Dataset Quality > Quantity:
  - Bad: 100k rows of scraped web text.
  - Good: 10k rows of "Input → Ideal Output" pairs.
- Sweet Spot: 2 Epochs. Do not over-train; these models memorize quickly.
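The sketch below shows one way to implement the prompt-masking collator described above: prompt-token labels are set to -100 so the cross-entropy loss covers only the response. The field names and the User/Assistant template are Alpaca-style assumptions, not the exact collator used for Eve.

```python
# Minimal prompt-masking collator sketch: loss is computed only on response
# tokens. Field names ("instruction", "output") and the prompt template are
# assumptions; adapt them to your dataset.
import torch
from torch.nn.utils.rnn import pad_sequence

def mask_prompt_collate(batch, tokenizer, max_len=2048):
    input_ids, labels, attention = [], [], []
    for ex in batch:
        prompt_ids = tokenizer(f"User: {ex['instruction']}\nAssistant:",
                               add_special_tokens=False).input_ids
        response_ids = tokenizer(f" {ex['output']}{tokenizer.eos_token}",
                                 add_special_tokens=False).input_ids
        ids = (prompt_ids + response_ids)[:max_len]
        lab = ([-100] * len(prompt_ids) + response_ids)[:max_len]  # ignore prompt tokens
        input_ids.append(torch.tensor(ids))
        labels.append(torch.tensor(lab))
        attention.append(torch.ones(len(ids), dtype=torch.long))
    pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0
    return {
        "input_ids": pad_sequence(input_ids, batch_first=True, padding_value=pad_id),
        "labels": pad_sequence(labels, batch_first=True, padding_value=-100),
        "attention_mask": pad_sequence(attention, batch_first=True, padding_value=0),
    }
```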
## Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "anthonym21/Eve-2-MoE-IT-272M"
# Load with trust_remote_code=True for custom MoE architecture
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_id,
trust_remote_code=True,
torch_dtype=torch.bfloat16,
device_map="auto"
)
# Standard formatting
prompt = "User: Explain the concept of Semantic Quantization.\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.6)
print(tokenizer.decode(out[0], skip_special_tokens=True))
## GGUF Quantizations
Quantized versions are available at anthonym21/Eve-2-MoE-IT-272M-GGUF (a download sketch follows the table):
| Quantization | Filename | Size |
|---|---|---|
| Q8_0 | Eve-2-MoE-IT-272M-Q8_0.gguf | ~318 MB |
| Q4_K_M | Eve-2-MoE-IT-272M-Q4_K_M.gguf | ~204 MB |
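The snippet below fetches one of the quantized files with huggingface_hub; the returned path can then be loaded by any GGUF-compatible runtime such as llama.cpp. The repo ID and filename come from the table above.

```python
# Download a quantized GGUF from the companion repo, then point a
# GGUF-compatible runtime (e.g. llama.cpp) at the returned local path.
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="anthonym21/Eve-2-MoE-IT-272M-GGUF",
    filename="Eve-2-MoE-IT-272M-Q4_K_M.gguf",
)
print(gguf_path)
```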
## Citation
@misc{maio2026eve2moeit,
author = {Maio, Anthony D.},
title = {Eve-2-MoE-IT-272M: A Nano-MoE Foundation for Swarm Intelligence},
year = {2026},
publisher = {Maio, Anthony D.},
url = {https://huggingface.co/anthonym21/Eve-2-MoE-IT-272M}
}
Base model: anthonym21/Eve-2-MoE-272M