Auden-Voice

Auden-Voice is a general-purpose voice encoder trained to learn robust, transferable voice representations.

The model is trained using multi-task learning, where jointly optimizing speaker identification, emotion, gender, and age classification objectives leads to more general and transferable voice representations.


Model Details

  • Model type: Voice encoder
  • Architecture: Zipformer
  • Embedding dimension: 768
  • Number of parameters: ~156M
  • Framework: PyTorch
  • Output: Frame-level embeddings [B, T, D]
  • Pooling: User-defined (e.g., mean pooling for utterance-level embeddings; a pooling sketch follows this list)
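
Since pooling is left to the user, here is a minimal sketch of turning the [B, T, D] frame-level output into utterance vectors, assuming only the output shape and a vector of valid frame counts. Masked statistics (mean + standard deviation) pooling is shown as a common alternative to plain mean pooling; the function name and interface are illustrative, not part of the Auden API.

import torch

def masked_stats_pooling(frames: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    # frames: [B, T, D] frame-level encoder outputs; lengths: [B] valid frame counts
    T = frames.size(1)
    mask = (torch.arange(T, device=frames.device).unsqueeze(0) < lengths.unsqueeze(1)).unsqueeze(-1).float()  # [B, T, 1]
    denom = mask.sum(dim=1).clamp(min=1.0)                               # [B, 1]
    mean = (frames * mask).sum(dim=1) / denom                            # [B, D]
    var = ((frames - mean.unsqueeze(1)) ** 2 * mask).sum(dim=1) / denom  # [B, D]
    return torch.cat([mean, (var + 1e-8).sqrt()], dim=-1)                # [B, 2*D]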

Training

Training Strategy

Multi-task learning was found to produce the best representations. The model is jointly trained on the following tasks:

  • Speaker identification
  • Emotion classification
  • Gender classification
  • Age classification

This setup encourages the encoder to learn robust and general-purpose voice representations.
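
As a minimal sketch of what such a joint objective can look like, assuming one linear classification head per task on top of pooled encoder embeddings and equal loss weights; the head design, class counts, and loss weighting actually used in training are not specified here.

import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    # One linear classifier per task on top of pooled [B, D] encoder embeddings
    def __init__(self, embed_dim, num_classes):
        super().__init__()
        self.heads = nn.ModuleDict({task: nn.Linear(embed_dim, n) for task, n in num_classes.items()})

    def forward(self, pooled):
        return {task: head(pooled) for task, head in self.heads.items()}

# Illustrative class counts only, not the exact label sets used in training
heads = MultiTaskHeads(768, {"speaker": 5994, "emotion": 6, "gender": 2, "age": 4})
criterion = nn.CrossEntropyLoss()

def multitask_loss(pooled, labels):
    # Equal-weighted sum of per-task cross-entropy losses (weighting is an assumption)
    logits = heads(pooled)
    return sum(criterion(logits[task], labels[task]) for task in labels)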

Training Data

The model is trained on publicly available academic speech datasets, totaling approximately 2050 hours of audio.

Task                   | Dataset(s)                      | #Samples | Hours
Speaker Identification | VoxCeleb2                       | 974k     | 2026
Paralinguistic Tasks   | CREMA-D, RAVDESS, IEMOCAP, TESS | 18.3k    | 20

Training Code

Full training scripts and configurations are available at:
https://github.com/AudenAI/Auden/tree/main/examples/voice


Intended Use

This model is intended to be used as a general-purpose voice encoder for:

  • Speaker identification and verification
  • Speaker diarization (a clustering sketch follows this list)
  • Emotion, gender, and age classification
  • Audio–text and text–audio retrieval
  • Speech-related downstream tasks that benefit from pretrained voice embeddings
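
For example, pooled segment embeddings can be clustered into per-speaker groups for a simple diarization-style pipeline. The sketch below uses placeholder embeddings and SciPy agglomerative clustering with an illustrative distance threshold; it is not the pipeline behind the VoxConverse results reported later.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Placeholder segment embeddings; in practice, pool encoder outputs per speech
# segment (as in "How to Use" below) and stack them into an [N, D] array
seg_embeddings = np.random.randn(10, 768).astype(np.float32)
seg_embeddings /= np.linalg.norm(seg_embeddings, axis=1, keepdims=True)

# Average-linkage agglomerative clustering on cosine distance;
# the 0.4 distance threshold is illustrative and should be tuned on labeled data
Z = linkage(seg_embeddings, method="average", metric="cosine")
speaker_labels = fcluster(Z, t=0.4, criterion="distance")
print(speaker_labels)  # one cluster (pseudo-speaker) index per segment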

How to Use

Load the Encoder

from auden.auto.auto_model import AutoModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

encoder = AutoModel.from_pretrained("AudenAI/auden-encoder-voice")
encoder = encoder.to(device)
encoder.eval()  # inference mode

Extract Voice Embeddings

import torch.nn.functional as F

audio_files = ["/path/to/audio1.wav", "/path/to/audio2.wav"]
embeddings_list = []

for i, audio_file in enumerate(audio_files, start=1):
    # Compute input features for a single file (batch of size 1)
    x, x_lens = encoder.extract_feature([audio_file])
    x, x_lens = x.to(device), x_lens.to(device)

    with torch.no_grad():
        encoder_output = encoder(x, x_lens)
        frame_embeddings = encoder_output["encoder_out"]  # [B, T, D]

        # Length-masked mean pooling (example for speaker verification)
        T = frame_embeddings.size(1)
        mask = (torch.arange(T, device=device).unsqueeze(0) < x_lens.unsqueeze(1)).unsqueeze(-1).float()
        utterance_embedding = (frame_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

        print(f"🎵 Audio {i}:")
        print(f"   Frame embeddings shape: {frame_embeddings.shape}")
        print(f"   Utterance embedding shape: {utterance_embedding.shape}")

        embeddings_list.append(utterance_embedding)

embeddings = torch.cat(embeddings_list, dim=0)  # [N, D]
embeddings = F.normalize(embeddings, p=2, dim=-1)

# Cosine similarity between the two L2-normalized utterance embeddings
similarity = torch.dot(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {similarity:.4f}")
# Same/different decision with an illustrative threshold; tune on held-out data
print("Same speaker:", "YES" if similarity > 0.6 else "NO")


Expected Output

🎵 Audio 1:
   Frame embeddings shape: torch.Size([1, 97, 768])
   Utterance embedding shape: torch.Size([1, 768])

🎵 Audio 2: 
   Frame embeddings shape: torch.Size([1, 138, 768])
   Utterance embedding shape: torch.Size([1, 768])

Cosine similarity: 0.7234
Same speaker: YES
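
If extract_feature also accepts a list of several paths and returns padded features with per-item lengths (its signature above suggests this, but it is not confirmed here), the per-file loop can likely be collapsed into a single batched forward pass. A hedged sketch reusing encoder, device, audio_files, and F from above:

# Assumption: extract_feature pads a list of paths and returns per-item lengths
with torch.no_grad():
    x, x_lens = encoder.extract_feature(audio_files)
    x, x_lens = x.to(device), x_lens.to(device)
    frame_embeddings = encoder(x, x_lens)["encoder_out"]  # [B, T_max, D]

    T = frame_embeddings.size(1)
    mask = (torch.arange(T, device=device).unsqueeze(0) < x_lens.unsqueeze(1)).unsqueeze(-1).float()
    embeddings = (frame_embeddings * mask).sum(dim=1) / mask.sum(dim=1)  # [B, D]
    embeddings = F.normalize(embeddings, p=2, dim=-1)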

Performance

Task                   | Dataset              | Metric   | Score
Speaker Identification | VoxCeleb2            | Accuracy | 95.25%
Speaker Verification   | VoxCeleb1-O          | EER      | 3%
Speaker Diarization    | VoxConverse          | DER      | 17%
Age Classification     | CREMA-D              | Accuracy | 93.91%
Gender Classification  | CREMA-D              | Accuracy | 99.72%
Gender Classification  | RAVDESS              | Accuracy | 100%
Emotion Classification | CREMA-D              | Accuracy | 83.99%
Emotion Classification | RAVDESS              | Accuracy | 89.71%
Audio → Text Retrieval | ParaspeechCaps       | R@1      | 63.31
Text → Audio Retrieval | ParaspeechCaps       | R@1      | 61.69
LLM-QA Emotion         | AirBench-MELD        | Accuracy | 27.23%
LLM-QA Emotion         | AirBench-IEMOCAP     | Accuracy | 84.70%
LLM-QA Gender          | AirBench-MELD        | Accuracy | 81.58%
LLM-QA Gender          | AirBench-CommonVoice | Accuracy | 93.15%
LLM-QA Age             | AirBench-CommonVoice | Accuracy | 58.27%

Limitations

  • The model is trained primarily on English speech data and may not generalize well to other languages.
  • The model is not evaluated on generative tasks such as speech synthesis or voice conversion.
  • Utterance-level representations depend on the pooling strategy selected by the user.

Citation

If you use this model in your research, please cite:

@article{huo2025auden,
  title={Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding},
  author={Huo, Mingyue and Tseng, Wei-Cheng and Shao, Yiwen and Zhang, Hao and Yu, Dong},
  journal={arXiv preprint arXiv:2511.15145},
  year={2025}
}