BERTology
There is a growing field of study concerned with investigating the inner workings of large-scale transformers like BERT (which some call “BERTology”). Some good examples of this field are:
- BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick: https://arxiv.org/abs/1905.05950
- Are Sixteen Heads Really Better than One? by Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650
- What Does BERT Look At? An Analysis of BERT’s Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning: https://arxiv.org/abs/1906.04341
- CAT-probing: A Metric-based Approach to Interpret How Pre-trained Models for Programming Language Attend Code Structure: https://arxiv.org/abs/2210.04633
In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to help people access the inner representations, mainly adapted from the great work of Paul Michel (https://arxiv.org/abs/1905.10650):
- accessing all the hidden-states of BERT/GPT/GPT-2,
- accessing all the attention weights for each head of BERT/GPT/GPT-2,
- retrieving the heads' output values and gradients in order to compute a head importance score and prune heads, as explained in https://arxiv.org/abs/1905.10650 (see the sketches below).
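For instance, the hidden states and attention weights can be returned by passing `output_hidden_states=True` and `output_attentions=True` to the forward pass. A minimal sketch, assuming the `bert-base-uncased` checkpoint and an arbitrary example sentence:

```python
import torch
from transformers import AutoTokenizer, BertModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERTology probes the inner workings of BERT.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True, output_attentions=True)

# Tuple of (num_layers + 1) tensors of shape (batch_size, seq_len, hidden_size):
# the embedding output followed by the output of every layer.
print(len(outputs.hidden_states), outputs.hidden_states[-1].shape)

# Tuple of num_layers tensors of shape (batch_size, num_heads, seq_len, seq_len):
# the attention weights of every head in every layer.
print(len(outputs.attentions), outputs.attentions[0].shape)
```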
To help you understand and use these features, we have added a specific example script, bertology.py, which extracts information from and prunes a model pre-trained on GLUE.
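As a rough illustration of the pruning step alone (the script itself first estimates head importance from gradients before deciding which heads to remove), heads can be pruned with `prune_heads`, which takes a dictionary mapping layer indices to the head indices to remove. A minimal sketch with arbitrarily chosen heads, not the ones the script would select:

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# {layer index: list of head indices to prune}; these heads are arbitrary
# placeholders for illustration only.
model.prune_heads({0: [0, 2], 2: [1, 3]})

# The pruned heads are recorded in the configuration, so they stay pruned
# when the model is saved and reloaded.
print(model.config.pruned_heads)
```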