# binomial-marks-1
An earnings-call NLP scorer that produces 23 structured signals per transcript.
Built by Binomial AI Research. Part of the specialist zoo: a roster of small, deployable AI models for quantitative finance. Each model is named after a thinker who shaped how markets are understood. marks-1 is named after Howard Marks (Oaktree), whose memos parse market sentiment, tone, and the gap between what's said and what's meant.
## Headline numbers
- ~80% of frontier-LLM consensus on topic-direction scoring (mean Spearman vs frontier panel: 0.674, vs the ceiling that frontier reasoners hit with each other: 0.838).
- Frontier parity on tone: marks-1 ↔ frontier mean Spearman 0.62 is statistically tied with frontier ↔ frontier 0.61 (DeepSeek included) and within 0.05 of the Western-frontier subset.
- F1 = 0.91 on the binary topic-mention heads, i.e. it agrees with the teacher 9 times out of 10 on whether a topic was discussed at all.
- 6 of 10 topics ≥ 0.71 Spearman with Claude Opus 4.7; `dividends` hits 0.84, only 0.05 below the frontier ↔ frontier ceiling of 0.89.
- ~50ms/call on CPU, sub-10ms on a modern GPU, ~12 calls/sec batched on A100/H100/B200, vs multi-second latency for a comparable LLM API call. Two orders of magnitude faster, deterministic, and runs offline.
- 23 outputs in a single forward pass: no chained LLM calls, no JSON parsing, no retry logic.
- 16,384-token context window covers ~p99 of earnings calls; conditioned on `(country, sector, ticker, quarter)` so the same words read correctly in context.
- Apache 2.0: deployable anywhere, no API key, no vendor lock-in.
## What it does
Given the text of an earnings call (with light metadata), binomial-marks-1 returns
23 structured numbers per call:
**10 topic-direction scores** (each: was the topic discussed? if so, what direction?)

| Topic | What −2 / +2 mean |
|---|---|
| `guidance` | lowered hard / raised significantly |
| `revenue_growth` | decelerating / accelerating |
| `margins` | compressing / expanding |
| `demand` | softening / strong |
| `buybacks` | paused or reduced / new or upsized |
| `dividends` | cut or skipped / raised or initiated |
| `m_and_a` | divestiture / strategic acquisition |
| `headcount` | layoffs / aggressive hiring |
| `macro_exposure` | clear headwind / clear tailwind |
| `competition` | losing share / gaining share |
**3 tone scores** (each: 1 to 5, low to high)

| Dimension | What it measures |
|---|---|
| `mgmt_confidence` | directness in prepared remarks (1 = uncertain "we hope" → 5 = "we will deliver X by Y") |
| `mgmt_defensiveness` | evasion in Q&A (1 = open → 5 = deflects, pivots, refuses to commit) |
| `analyst_skepticism` | analyst pushback (1 = congratulatory → 5 = re-asking the same question) |
The model is conditioned on country, sector, ticker, and quarter at inference, so the same words read differently in the right context: "margins compressing" in software isn't the same signal as in retail; "demand softening" in a Chinese consumer name isn't the same as in a US one. This conditioning is the difference between a generic sentiment scorer and one that reads earnings calls the way an analyst does.
Quants consume the 23 outputs as features in factor models, screening filters, or event-study triggers. The model outputs structure, not opinions; buy/sell logic is the consumer's responsibility.
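As a concrete example, a minimal screening filter over the output of the `score` helper shown in Quick start below (the thresholds here are illustrative, not recommendations):

```python
# Illustrative screening filter: flag calls where guidance was discussed,
# scored clearly negative, and management sounded defensive in the Q&A.
# The thresholds (-1.0, 4.0) are arbitrary examples, not recommendations.
def flag_guidance_risk(result: dict) -> bool:
    guidance = result["topics"]["guidance"]
    return (
        guidance["mentioned"]
        and guidance["score"] <= -1.0
        and result["mgmt_defensiveness"] >= 4.0
    )
```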
## Quick start

### One-liner via the convenience helper
```bash
pip install binomial-marks
```
```python
from binomial_marks import score

result = score(
    transcript="Operator: Welcome to NVIDIA's Q4 2025 earnings call...",
    ticker="NVDA",
    sector="Technology",
    country="US",
    year=2025, quarter=4,
)
# {
#   "topics": {
#     "guidance": {"mentioned": True, "mention_prob": 0.94, "score": +1.7},
#     "revenue_growth": {"mentioned": True, "mention_prob": 0.97, "score": +1.5},
#     ...
#   },
#   "mgmt_confidence": 4.6,
#   "mgmt_defensiveness": 1.4,
#   "analyst_skepticism": 1.8,
# }
```
### Direct via transformers
```python
from transformers import AutoTokenizer, AutoModel
import torch

tok = AutoTokenizer.from_pretrained("BinomialTechnologies/binomial-marks-1")
model = AutoModel.from_pretrained(
    "BinomialTechnologies/binomial-marks-1",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()

transcript = "Operator: Welcome to NVIDIA's Q4 2025 earnings call..."  # full call text
prefix = "[SECTOR: Technology] [COUNTRY: US] [TICKER: NVDA] [QUARTER: Q4 2025]\n\n"
inputs = tok(prefix + transcript, return_tensors="pt",
             truncation=True, max_length=16384).to("cuda")
with torch.no_grad():
    out = model.predict(**inputs)
# out["topic_score"]: shape (1, 10), the 10 topic directions
# out["tone_score"]: shape (1, 3), the 3 tone dimensions
```
### Batched
```python
from binomial_marks import MarksScorer

scorer = MarksScorer()  # loads the model once
results = scorer.score_batch([
    {"transcript": ..., "ticker": "NVDA", "sector": "Technology", "year": 2025, "quarter": 4},
    {"transcript": ..., "ticker": "AAPL", "sector": "Technology", "year": 2025, "quarter": 1},
])
```
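Downstream, batch results flatten naturally into a feature matrix. A sketch (the flattening scheme here is our own illustration, not part of the library):

```python
import pandas as pd

def to_features(calls: list, results: list) -> pd.DataFrame:
    """Flatten each 23-signal result into one row per (ticker, year, quarter)."""
    rows = []
    for call, r in zip(calls, results):
        row = {k: call[k] for k in ("ticker", "year", "quarter")}
        for topic, v in r["topics"].items():
            row[f"{topic}_prob"] = v["mention_prob"]
            row[f"{topic}_score"] = v["score"]
        for tone in ("mgmt_confidence", "mgmt_defensiveness", "analyst_skepticism"):
            row[tone] = r[tone]
        rows.append(row)
    return pd.DataFrame(rows)
```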
## Training
binomial-marks-1 is trained on 80,000+ earnings call transcripts spanning 2,700+
unique tickers across global markets (2012–2026), each tagged with country, sector, and
industry metadata. Labels are distilled from frontier reasoning models and the model is
benchmarked against the same set of frontier systems on a held-out 2,000-call sample.
The split is (ticker, year, quarter)-keyed; this is a pure NLP imitation task (labels
come from language models, not market outcomes), so a temporal split is unnecessary.
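A minimal sketch of what a deterministic (ticker, year, quarter)-keyed assignment can look like (illustrative only; the actual split code is not published):

```python
import hashlib

def split_of(ticker: str, year: int, quarter: int, eval_frac: float = 0.025) -> str:
    """Hash the (ticker, year, quarter) key so every row for the same call
    deterministically lands on the same side of the train/eval boundary."""
    key = f"{ticker}|{year}|Q{quarter}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10_000
    return "eval" if bucket < int(eval_frac * 10_000) else "train"
```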
## Eval: cross-LLM agreement on a 2,000-call benchmark
The benchmark is 2,000 calls held out from training, scored by five systems
(Grok-4.1-fast-reasoning, Claude Opus 4.7, GPT-5.5 low-reasoning, DeepSeek V4-Pro, and
marks-1 itself). Pairwise Spearman rank correlation across the 10 topic-direction
dimensions:
| | vs Opus | vs GPT-5.5 | vs Grok | vs DeepSeek |
|---|---|---|---|---|
| Opus 4.7 | — | 0.886 | 0.832 | 0.803 |
| GPT-5.5 | 0.886 | — | 0.871 | 0.827 |
| Grok | 0.832 | 0.871 | — | 0.807 |
| DeepSeek V4 | 0.803 | 0.827 | 0.807 | — |
| marks-1 | 0.697 | 0.696 | 0.677 | 0.627 |
| | Frontier ↔ Frontier (6 pairs) | marks-1 ↔ Frontier (4 pairs) |
|---|---|---|
| Mean topic-score Spearman | 0.838 | 0.674 |
| Mean tone Spearman | 0.61 (see note) | 0.62 |
| Mean `mentioned` MAE | 0.05 | 0.10 |
Note on tone: DeepSeek V4 reads management mood/aggression differently from Western frontier models (its tone Spearman vs the others is 0.50–0.55, vs Opus ↔ GPT-5.5 at 0.78). Excluding DeepSeek, frontier tone agreement is 0.72, and marks-1 still hits 0.67 against that subset.
marks-1 reproduces ≈80% of the agreement that frontier reasoners have with each other on financial NLP scoring, at a fraction of the inference cost (~50ms on CPU vs multi-second LLM API calls).
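For reference, the pairwise numbers above follow from per-call topic scores via scipy; a sketch (`scores` is a hypothetical mapping from system name to an (n_calls, 10) array, not something the library ships):

```python
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

def mean_pairwise_spearman(scores: dict) -> dict:
    """scores: system name -> (n_calls, 10) topic-score matrix.
    Returns the Spearman per system pair, averaged over the 10 topics."""
    out = {}
    for a, b in combinations(sorted(scores), 2):
        per_topic = []
        for t in range(scores[a].shape[1]):
            rho, _ = spearmanr(scores[a][:, t], scores[b][:, t])
            per_topic.append(rho)
        out[(a, b)] = float(np.mean(per_topic))
    return out
```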
### Per-topic Spearman vs. Claude Opus 4.7
| Topic | marks-1 ↔ Opus | Opus ↔ GPT-5.5 (ceiling) | Δ |
|---|---|---|---|
| `dividends` | 0.84 | 0.89 | −0.05 |
| `demand` | 0.82 | 0.94 | −0.12 |
| `revenue_growth` | 0.80 | 0.94 | −0.14 |
| `buybacks` | 0.77 | 0.94 | −0.17 |
| `guidance` | 0.76 | 0.91 | −0.15 |
| `m_and_a` | 0.71 | 0.83 | −0.12 |
| `macro_exposure` | 0.66 | 0.89 | −0.23 |
| `margins` | 0.63 | 0.91 | −0.28 |
| `competition` | 0.59 | 0.81 | −0.22 |
| `headcount` | 0.39 | 0.81 | −0.42 |
Headcount is the weakest dimension: the layoff/hiring signal is harder to parse than the direction-of-growth signals. v2 will revisit it.
## Inference
- Latency: ~50ms/call on CPU, sub-10ms on modern GPUs.
- Batched throughput (bf16, max_length=16384): ~12 calls/sec/instance on A100/H100/B200.
- Output is deterministic: the same input always returns the same 23 numbers.
- Context window: 16,384 tokens (~50k characters). Covers ~p99 of earnings calls.
For deployment: the model is a standard transformers model. Wrap in FastAPI, deploy on
HF Inference Endpoints, or run as a subprocess in your data pipeline.
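A minimal FastAPI wrapper might look like the sketch below. It reuses the `MarksScorer` batch call from Quick start; the request field names are illustrative, and this is a starting point, not a hardened service:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from binomial_marks import MarksScorer

app = FastAPI()
scorer = MarksScorer()  # load once at startup, reuse across requests

class CallRequest(BaseModel):
    transcript: str
    ticker: str
    sector: str
    country: str = "US"
    year: int
    quarter: int

@app.post("/score")
def score_call(req: CallRequest) -> dict:
    # One forward pass -> all 23 signals. Output is deterministic,
    # so responses are safe to cache on the request payload.
    return scorer.score_batch([req.model_dump()])[0]
```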
## Limitations and known gaps
- The `headcount` dimension is unreliable (Spearman 0.39 vs frontier, roughly 50% below the other 9 topics). Treat it with skepticism.
- Tone has rank-order signal, but absolute levels drift. Quants should normalize cross-sectionally rather than thresholding raw values; see the sketch after this list.
- English transcripts only. Non-English calls (translated) work but degrade. Top non-US training countries: GB, DE, FR, JP, SE, CH, CN.
- Truncates at 16,384 tokens. Covers ~p99 of calls; the very longest (Asian conglomerates with 8h+ analyst days) lose middle content via head+tail truncation.
- Pure NLP scorer, not an alpha model. Outputs are features; the trading rule is the consumer's responsibility.
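For the tone caveat above, a minimal cross-sectional normalization sketch. It assumes a pandas frame with one row per call (e.g. from the batch example); the column and grouping names are illustrative:

```python
import pandas as pd

TONE_COLS = ["mgmt_confidence", "mgmt_defensiveness", "analyst_skepticism"]

def zscore_by_quarter(df: pd.DataFrame, cols: list = TONE_COLS) -> pd.DataFrame:
    """Compare each call's tone against the same quarter's cross-section
    instead of thresholding raw levels, which drift in absolute terms."""
    g = df.groupby(["year", "quarter"])[cols]
    return (df[cols] - g.transform("mean")) / g.transform("std")
```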
## Tier
Tier 2 (research preview). This is v1 of the model. The eval against frontier LLMs is documented above; absolute calibration may shift in v2 with a larger label set. Production users should run their own validation against return data.
## Citation
```bibtex
@misc{binomialmarks2026,
  author       = {Binomial AI Research},
  title        = {binomial-marks-1: An earnings-call NLP scorer for quantitative finance},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/BinomialTechnologies/binomial-marks-1}},
}
```
## License
Apache 2.0. Use freely; we'd appreciate a citation if you build on it.