Eve-2 Swarm

Eve-2-MoE-IT-272M

The Foundation for Nano-Scale Swarm Intelligence

Eve-2-MoE-IT-272M is a 272M parameter instruction-tuned model designed as the foundational base for the Eve Swarm: a collection of hyper-specialized, CPU-deployable adapters.

Unlike massive generalist LLMs, Eve is built for deterministic-ish transformations. She is designed to be "overfitted" into specialists that perform one job perfectly (e.g., SQL generation, Git commits, JSON extraction) with negligible latency and cost.

Author: Anthony Maio / Public Outputs


The Eve Swarm (Specialist Ecosystem)

This model serves as the parent for the following Full Fine-Tuned (FFT) specialists. All members were trained on an NVIDIA H200 SXM to ensure optimal embedding alignment.

Note (Feb 2026): The base model was updated with 10B tokens of continued pretraining (see Training History below). All specialists below were trained on the v1 base weights and need to be retrained on the updated base to benefit from the improved foundation. The GGUF quantizations also need to be regenerated from the new IT weights.

| Specialist | Task | Dataset Source | Size | Loss (Samples) | Status |
| --- | --- | --- | --- | --- | --- |
| Eve-NanoFunction | Strict JSON Function Calling – produces valid JSON outputs from natural language. | glaive-function-calling-v2 | 272M | <0.4 (35k samples) | Needs retrain |
| Eve-NanoSummary | Conversation Summarization – condenses dialogues into concise summaries. | knkarthick/dialogsum | 272M | <1.0 (12.5k samples) | Needs retrain |
| Eve-NanoCommit | Git Diff → Commit Message – writes conventional commits from raw code diffs. | bigcode/commitpackft | 272M | <1.0 (20k samples) | Needs retrain |
| Eve-NanoExtract | Text → Structured Data – extracts parameters/entities into strict JSON schemas. | Salesforce/xlam-function-calling | 272M | <0.4 (20k samples) | Needs retrain |
| Eve-NanoSQL | Natural Language → SQL – converts questions to SQL using table context. | b-mc2/sql-create-context | 272M | <0.2 (25k samples) | Needs retrain |
| Eve-NanoPrompt | Prompt Expansion – expands simple ideas into rich image gen prompts. | Stable-Diffusion-Prompts | 272M | <1.0 (15k samples) | Needs retrain |
| Eve-NanoRouter | Intent Classification – routes user queries to the correct swarm member. | bitext/customer-support | 272M | <0.3 (25k samples) | Needs retrain |
| Eve-NanoPII | PII Redaction – identifies and masks sensitive entities. | ai4privacy/pii-masking-200k | 272M | <0.1 (35k samples) | Needs retrain |
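In a deployed swarm, Eve-NanoRouter's intent label decides which specialist handles a request. A minimal dispatch sketch follows; the intent labels and specialist repo ids are illustrative placeholders (only the base model id appears on this card), so adjust them to your actual checkpoints:

```python
# Hypothetical intent -> specialist mapping. The specialist repo ids below
# are illustrative placeholders, not confirmed published checkpoints.
SPECIALISTS = {
    "sql": "anthonym21/Eve-NanoSQL",
    "commit": "anthonym21/Eve-NanoCommit",
    "extract": "anthonym21/Eve-NanoExtract",
    "summarize": "anthonym21/Eve-NanoSummary",
}

BASE_MODEL = "anthonym21/Eve-2-MoE-IT-272M"


def dispatch(intent: str) -> str:
    """Map a router intent label to a swarm member, falling back to the base."""
    return SPECIALISTS.get(intent, BASE_MODEL)
```

The fallback matters: anything the router cannot classify should land on the generalist IT base rather than a narrow specialist.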

Training History

Run 1: Initial Pretraining (v1)

  • Dataset: HuggingFaceFW/fineweb-edu, 10B tokens
  • Hardware: NVIDIA H200 SXM (141GB VRAM)
  • Result: Trained from scratch to a functional base model

Run 2: Continued Pretraining (v2, current base)

  • Dataset: HuggingFaceFW/finepdfs (eng_Latn subset), 10B tokens
  • Hardware: NVIDIA A100 80GB (Google Colab)
  • Duration: 14.7 hours
  • Throughput: 189,295 tok/s
  • Val perplexity: 36.8 → 32.3 (a 12.2% improvement)
  • Result: Stronger base with improved document understanding from academic/scientific PDFs

Run 3: Instruction Tuning (IT, this model)

  • Dataset: yahma/alpaca-cleaned, ~52K instruction-response pairs, 3 epochs
  • Hardware: NVIDIA H200 SXM (141GB VRAM)
  • Method: Full Fine-Tuning with response-only loss masking (prompt tokens excluded from loss)
  • Duration: ~1 hour
  • Batch size: 128
  • Peak LR: 2e-5 (cosine decay)
  • WikiText-2 val ppl: 32.6 → 36.2 (slight increase expected; the model shifted toward the instruction-following distribution)

Technical Specifications

Architecture: Nano-MoE

Eve uses a DeepSeek-style Mixture-of-Experts architecture scaled down to the "Nano" range.

  • Total Parameters: 272M
  • Active Parameters: ~80M (per token)
  • Experts: 8 routed + 1 shared
  • Top-K: 2
  • Context Window: 2048 tokens
  • Vocab: 50,304 (GPT-2 compatible)
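A forward pass through one such layer can be sketched as below. This is a simplified illustration of top-2 routing over 8 routed experts plus 1 always-on shared expert; the hidden sizes are made up, and Eve's actual implementation lives in the model's remote code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NanoMoELayer(nn.Module):
    """Sketch of a DeepSeek-style MoE block: n routed experts plus one
    shared expert that sees every token; each token is dispatched to its
    top-k routed experts, weighted by renormalized router probabilities."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)

        def make_ffn():
            return nn.Sequential(
                nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
            )

        self.experts = nn.ModuleList(make_ffn() for _ in range(n_experts))
        self.shared = make_ffn()  # always-on expert, bypasses the router

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Router picks top-k experts per token.
        probs = F.softmax(self.router(x), dim=-1)
        weights, idx = probs.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize

        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = idx[:, k] == e
                if mask.any():
                    routed[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return self.shared(x) + routed
```

With top-2 of 8 routed experts plus the shared path, only a fraction of the FFN weights fire per token, which is how 272M total parameters yield ~80M active.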

Training Config (H200 SXM)

This model was trained using Full Fine-Tuning (FFT). We found that LoRA was insufficient for aligning the embeddings of such a small model; unfreezing all weights yielded significant performance gains. You don't need an H200; it's absurdly overkill. I love it.

  • Hardware: NVIDIA H200 SXM (141GB VRAM)
  • Method: Full Fine-Tuning (No PEFT/LoRA)
  • Precision: bfloat16
  • Batch Size: 128 (Global)
  • Learning Rate: 2e-5 (Cosine Schedule) for the IT run; 5e-5 for specialist FFT
  • Loss Masking: Response-only (masked user prompts via loss_mask tensor)
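The cosine schedule above takes only a few lines to reproduce. The warmup length and floor LR below are assumptions for illustration; the card only states the peak LR:

```python
import math


def cosine_lr(step, total_steps, peak_lr=2e-5, warmup_steps=100, min_lr=0.0):
    """Cosine LR decay with linear warmup. peak_lr=2e-5 matches the IT run;
    warmup_steps and min_lr are illustrative assumptions, not from the card."""
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In practice you would call this per optimizer step (or use an equivalent built-in scheduler) rather than hand-rolling it, but the curve is the same.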

How to Tune Eve 2

If you want to train your own Eve specialist, follow these rules derived from our H200 experiments:

  1. Abandon LoRA: For a 272M model, LoRA restricts the embedding space too much. You have the VRAM; use Full Fine-Tuning.
  2. Mask User Prompts: You must use a collator that masks the prompt (loss only on response tokens). If the model calculates loss on the instruction, it wastes capacity learning English grammar instead of the task.
  3. Batch Size Matters: We saturated the H200 with batch_size=128. High batch sizes stabilize the gradients for these volatile small architectures.
  4. Dataset Quality > Quantity:
    • Bad: 100k rows of scraped web text.
    • Good: 10k rows of "Input → Ideal Output" pairs.
    • Sweet Spot: 2 Epochs. Do not over-train; these models memorize quickly.
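Rule 2 (prompt masking) comes down to setting the prompt positions in the labels to -100, the index that HuggingFace's cross-entropy loss ignores. A minimal sketch, with batching and padding omitted:

```python
IGNORE_INDEX = -100  # HF convention: loss is not computed at these positions


def mask_prompt_labels(input_ids, prompt_len):
    """Build labels for response-only loss: every prompt token becomes
    IGNORE_INDEX so the model is only graded on the response tokens.
    A real collator would also handle padding and batch tensorization."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])
```

Without this mask, a 272M model burns capacity re-learning the instruction text instead of the input-to-output mapping you actually care about.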

Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "anthonym21/Eve-2-MoE-IT-272M"

# trust_remote_code=True is required for the custom MoE architecture
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Standard formatting
prompt = "User: Explain the concept of Semantic Quantization.\nAssistant:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.6)

print(tokenizer.decode(out[0], skip_special_tokens=True))
```

GGUF Quantizations

Quantized versions are available at anthonym21/Eve-2-MoE-IT-272M-GGUF:

| Quantization | Filename | Size |
| --- | --- | --- |
| Q8_0 | Eve-2-MoE-IT-272M-Q8_0.gguf | ~318 MB |
| Q4_K_M | Eve-2-MoE-IT-272M-Q4_K_M.gguf | ~204 MB |

Citation

```bibtex
@misc{maio2026eve2moeit,
  author = {Maio, Anthony D.},
  title = {Eve-2-MoE-IT-272M: A Nano-MoE Foundation for Swarm Intelligence},
  year = {2026},
  publisher = {Maio, Anthony D.},
  url = {https://huggingface.co/anthonym21/Eve-2-MoE-IT-272M}
}
```