DeBERTa Model

class DeBERTa.deberta.DeBERTa(config=None, pre_trained=None)[source]

DeBERTa encoder. This module is composed of an input embedding layer followed by stacked transformer layers with disentangled attention.

  • config – A model config class instance with the configuration to build a new model. The schema is similar to BertConfig; for more details, please refer to ModelConfig.
  • pre_trained – The pre-trained DeBERTa model. It can be either the local path of a pre-trained DeBERTa model or the name of a released configuration, i.e. one of [base, large, base_mnli, large_mnli].

Load state from a previously loaded model state dictionary.

Parameters:state (dict, optional) – State dictionary as returned by torch.nn.Module.state_dict(), default: None. If it’s None, the pre-trained state loaded via the constructor is used to re-initialize the DeBERTa model.
forward(input_ids, attention_mask=None, token_type_ids=None, output_all_encoded_layers=True, position_ids=None, return_att=False)[source]
  • input_ids – a torch.LongTensor of shape [batch_size, sequence_length] with the word token indices in the vocabulary
  • attention_mask

    an optional parameter for input mask or attention mask.

    • If it’s an input mask, it is a torch.LongTensor of shape [batch_size, sequence_length] with values in [0, 1]. It is used when a sequence is shorter than the longest sequence in the current batch, which is the typical case for attention over a batch of variable-length sentences.
    • If it’s an attention mask, it is a torch.LongTensor of shape [batch_size, sequence_length, sequence_length]. In this case, it indicates which tokens in the sequence may be attended to by other tokens in the sequence.
  • token_type_ids – an optional torch.LongTensor of shape [batch_size, sequence_length] with the token type indices selected in [0, 1]. Type 0 corresponds to a sentence A token and type 1 corresponds to a sentence B token (see the BERT paper for more details).
  • output_all_encoded_layers – whether to output the results of all encoder layers, default: True

  • The output of the stacked transformer layers if output_all_encoded_layers=True, else the last layer of stacked transformer layers
  • Attention matrix of self-attention layers if return_att=True
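The two accepted mask forms are related: a [batch_size, sequence_length] input mask can be expanded into a [batch_size, sequence_length, sequence_length] attention mask. The pure-Python sketch below shows one plausible expansion, in which token i may attend to token j only when both are valid. The helper name expand_input_mask is hypothetical, and the library's own conversion may differ.

```python
def expand_input_mask(input_mask):
    """Expand a [B, N] 0/1 input mask into a [B, N, N] attention mask.

    Assumption for illustration: entry [b][i][j] is 1 only when both
    token i and token j of sample b are real (non-padding) tokens.
    """
    return [
        [[row[i] * row[j] for j in range(len(row))] for i in range(len(row))]
        for row in input_mask
    ]


# Same mask as in the forward() example: the second sample has one padding token.
input_mask = [[1, 1, 1], [1, 1, 0]]
attention_mask = expand_input_mask(input_mask)
print(attention_mask[1])
```

For the second sample, the row and column belonging to the padding token are all zeros, so the padding token neither attends nor is attended to.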


import torch
from DeBERTa.deberta import DeBERTa

# Batch of WordPiece token ids.
# Each sample is padded with zeros to the maximum length in the batch.
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
# Mask of valid input ids
attention_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])

# DeBERTa model initialized with pretrained base model
bert = DeBERTa(pre_trained='base')

encoder_layers = bert(input_ids, attention_mask=attention_mask)


class DeBERTa.deberta.NNModule(config, *inputs, **kwargs)[source]

An abstract class to handle weight initialization and a simple interface for downloading and loading pre-trained models.

Parameters:config (ModelConfig) – The model config to the module

Apply Gaussian(mean=0, std=`config.initializer_range`) initialization to the module.

Parameters:module (torch.nn.Module) – The module to apply the initialization.


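class MyModule(NNModule):
  def __init__(self, config):
    # Initialize the NNModule base class before registering sub-modules
    super().__init__(config)
    # Add construction instructions
    self.bert = DeBERTa(config)

    # Add other modules

    # Apply initialization (the Gaussian initializer described above)
    self.apply(self.init_weights)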
classmethod load_model(model_path, model_config=None, tag=None, no_cache=False, cache_dir=None, *inputs, **kwargs)[source]

Instantiate a sub-class of NNModule from a pre-trained model file.

  • model_path (str) –

    Path or name of the pre-trained model which can be either,

    • The path of pre-trained model
    • The pre-trained DeBERTa model name in DeBERTa GitHub releases, i.e. [base, base_mnli, large, large_mnli].

    If model_path is None or -, then the method will create a new sub-class without initializing from pre-trained models.

  • model_config (str) –

    The path of the model config file. If it’s None, the method will look for the config in the following order:

    1. [‘config’] in the model state dictionary.
    2. model_config.json alongside the model_path.

    If no config is found, the method will fail.

  • tag (str, optional) – The release tag of DeBERTa, default: None.
  • no_cache (bool, optional) – Disable local cache of downloaded models, default: False.
  • cache_dir (str, optional) – The cache directory used to save the downloaded models, default: None. If it’s None, then the models will be saved at $HOME/.~DeBERTa

Returns:The sub-class object.
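The config lookup order described for model_config can be sketched in plain Python. This is only an illustration of the documented order, not the library's code: the helper name resolve_config, the use of a plain dict for the state, and the file name pytorch.model.bin are all assumptions.

```python
import json
import os
import tempfile


def resolve_config(state, model_path, model_config=None):
    """Hypothetical helper that mirrors the documented lookup order."""
    # An explicit config path always wins.
    if model_config is not None:
        with open(model_config) as f:
            return json.load(f)
    # 1. ['config'] in the model state dictionary.
    if 'config' in state:
        return state['config']
    # 2. model_config.json next to the model file.
    candidate = os.path.join(os.path.dirname(model_path), 'model_config.json')
    if os.path.exists(candidate):
        with open(candidate) as f:
            return json.load(f)
    # Nothing found: fail, as the documentation states.
    raise FileNotFoundError('no model config found for ' + model_path)


# The state dictionary carries its own config.
from_state = resolve_config({'config': {'hidden_size': 1024}}, 'unused/path.bin')

# The config is picked up from model_config.json beside the model file.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, 'model_config.json'), 'w') as f:
        json.dump({'hidden_size': 768}, f)
    from_file = resolve_config({}, os.path.join(d, 'pytorch.model.bin'))
```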



class DeBERTa.deberta.DisentangledSelfAttention(config)[source]

Disentangled self-attention module

Parameters:config (str) – A model config class instance with the configuration to build a new model. The schema is similar to BertConfig; for more details, please refer to ModelConfig.
forward(hidden_states, attention_mask, return_att=False, query_states=None, relative_pos=None, rel_embeddings=None)[source]

Call the module

  • hidden_states (torch.FloatTensor) – Input states to the module, usually the output of the previous layer. It serves as the Q, K and V in Attention(Q,K,V).
  • attention_mask (torch.ByteTensor) – An attention mask matrix of shape [B, N, N], where B is the batch size and N is the maximum sequence length. Element [i,j] = 1 means that the i-th token in the input can attend to the j-th token.
  • return_att (bool, optional) – Whether to return the attention matrix.
  • query_states (torch.FloatTensor, optional) – The Q state in Attention(Q,K,V).
  • relative_pos (torch.LongTensor) – The relative position encoding between the tokens in the sequence. It’s of shape [B, N, N] with values ranging in [-max_relative_positions, max_relative_positions].
  • rel_embeddings (torch.FloatTensor) – The embedding of relative distances. It’s a tensor of shape [\(2 \times \text{max_relative_positions}\), hidden_size].
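An embedding table with 2 × max_relative_positions rows has valid row indices 0 to 2k - 1 (with k = max_relative_positions), so signed relative positions have to be shifted and clipped before they can index it. The sketch below assumes the shift-and-clip scheme described in the DeBERTa paper. The function name relative_index is hypothetical and not part of this API.

```python
def relative_index(rel_pos, max_relative_positions):
    """Map a signed relative position to a row of the relative embedding table.

    Assumed scheme: shift by k, then clip into [0, 2k - 1].
    """
    k = max_relative_positions
    return min(max(rel_pos + k, 0), 2 * k - 1)


# With k = 4 the table has 8 rows, indexed 0..7.
indices = [relative_index(r, 4) for r in (-10, -4, -3, 0, 3, 4, 10)]
print(indices)  # [0, 0, 1, 4, 7, 7, 7]
```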
DeBERTa.deberta.build_relative_position(query_size, key_size, device)[source]

Build relative position according to the query and key

We assume the absolute position of the query \(P_q\) ranges over (0, query_size) and the absolute position of the key \(P_k\) ranges over (0, key_size). The relative position from query to key is

\(R_{q \rightarrow k} = P_q - P_k\)

  • query_size (int) – the length of query
  • key_size (int) – the length of key

Returns:A tensor with shape [1, query_size, key_size]
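The formula \(R_{q \rightarrow k} = P_q - P_k\) can be worked through with nested lists instead of tensors. This is a sketch of the computation only: the function name is hypothetical, and the device argument of the real function is omitted.

```python
def build_relative_position_sketch(query_size, key_size):
    """Return nested lists of shape [1, query_size, key_size].

    Entry [0][q][k] is the relative position q - k.
    """
    return [[[q - k for k in range(key_size)] for q in range(query_size)]]


rel = build_relative_position_sketch(2, 3)
print(rel)  # [[[0, -1, -2], [1, 0, -1]]]
```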



class DeBERTa.deberta.ContextPooler(config)[source]


class DeBERTa.deberta.BertEncoder(config)[source]

Modified BertEncoder with relative position bias support


class DeBERTa.deberta.BertLayerNorm(size, eps=1e-12)[source]

LayerNorm module in the TF style (epsilon inside the square root).
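"Epsilon inside the square root" means the normalization divides by sqrt(variance + eps) rather than by sqrt(variance) + eps. A minimal pure-Python sketch over a single feature vector follows. The learned scale and bias of the real module are left out, and the function name is hypothetical.

```python
import math


def tf_style_layer_norm(x, eps=1e-12):
    """Normalize one feature vector with epsilon added inside the square root."""
    mean = sum(x) / len(x)
    # Biased (population) variance over the feature dimension.
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)  # eps is added before taking the root
    return [(v - mean) * inv_std for v in x]


y = tf_style_layer_norm([1.0, 2.0, 3.0])
```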


class DeBERTa.deberta.XSoftmax

Masked Softmax which is optimized for saving memory

  • input (torch.tensor) – The input tensor to which softmax is applied.
  • mask (torch.IntTensor) – The mask matrix, where 0 indicates that the element will be ignored in the softmax calculation.
  • dim (int) – The dimension along which softmax is applied.


import torch
from DeBERTa.deberta import XSoftmax
# Make a tensor
x = torch.randn([4,20,100])
# Create a mask
mask = (x>0).int()
y = XSoftmax.apply(x, mask, dim=-1)
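To make the mask semantics concrete, the following pure-Python sketch applies a masked softmax to a single row. It illustrates only the documented behaviour (positions with mask 0 are ignored) and not the memory optimization of XSoftmax. The function name is hypothetical, and returning zeros for masked positions is an assumption.

```python
import math


def masked_softmax(row, mask):
    """Softmax over the entries of row whose mask value is non-zero."""
    kept = [v for v, m in zip(row, mask) if m]
    if not kept:
        # Every position is masked out: return all zeros.
        return [0.0] * len(row)
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) if m else 0.0 for v, m in zip(row, mask)]
    total = sum(exps)
    return [e / total for e in exps]


probs = masked_softmax([1.0, 2.0, 3.0], [1, 0, 1])
```

The masked middle entry receives probability 0, and the remaining two probabilities sum to 1.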


class DeBERTa.deberta.StableDropout(drop_prob)[source]

Optimized dropout module for stabilizing the training

Parameters:drop_prob (float) – the dropout probabilities

Call the module

Parameters:x (torch.tensor) – The input tensor to apply dropout


DeBERTa.deberta.MaskedLayerNorm(layerNorm, input, mask=None)[source]

Masked LayerNorm, which applies a mask to the output of LayerNorm to avoid inaccurate updates to the LayerNorm module.

  • layerNorm (BertLayerNorm) – LayerNorm module or function
  • input (torch.tensor) – The input tensor
  • mask (torch.IntTensor) – The mask to be applied to the output of LayerNorm, where 0 indicates that the output of that element will be ignored, i.e. set to 0


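import torch
from DeBERTa.deberta import BertLayerNorm, MaskedLayerNorm

# Create a tensor b x n x d
x = torch.randn([1,10,100])
# Mask that keeps only the first three positions
m = torch.tensor([[1,1,1,0,0,0,0,0,0,0]], dtype=torch.int)
LayerNorm = BertLayerNorm(100)
y = MaskedLayerNorm(LayerNorm, x, m)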
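The effect of the mask can also be traced without torch. The sketch below normalizes each position of one sequence and then zeroes the masked positions, which is the behaviour described above. It is an illustration under assumptions: the function name is hypothetical, and the learned scale and bias are omitted.

```python
import math


def masked_layer_norm_sketch(rows, mask, eps=1e-12):
    """Layer-normalize each position, then set masked positions to zero.

    rows: list of feature vectors, one per position (shape [n, d]).
    mask: list of 0/1 flags, one per position (shape [n]).
    """
    out = []
    for row, keep in zip(rows, mask):
        if not keep:
            # Masked position: its output is ignored, i.e. set to 0.
            out.append([0.0] * len(row))
            continue
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        scale = 1.0 / math.sqrt(var + eps)
        out.append([(v - mean) * scale for v in row])
    return out


rows = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
normed = masked_layer_norm_sketch(rows, mask)
```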


class DeBERTa.deberta.GPT2Tokenizer(vocab_file=None, do_lower_case=True, special_tokens=None)[source]

A wrapper of GPT2 tokenizer with similar interface as BERT tokenizer

  • vocab_file (str, optional) –

    The local path of vocabulary package or the release name of vocabulary in DeBERTa GitHub releases, e.g. “bpe_encoder”, default: None.

    If it’s None, the vocabulary in the latest release is downloaded from GitHub. The vocabulary file is a state dictionary with three items, “dict_map”, “vocab” and “encoder”, which correspond to the three files used in RoBERTa, i.e. dict.txt, vocab.txt and encoder.json. The differences between our wrapped GPT2 tokenizer and the RoBERTa wrapped tokenizer are:
    • Special tokens: unlike RoBERTa, which uses <s> and </s> as the start and end tokens of a sentence, we use [CLS] and [SEP] as the start and end tokens of the input sentence, the same as BERT.
    • We remapped the token ids in our dictionary with regard to the new special tokens: [PAD] => 0, [CLS] => 1, [SEP] => 2, [UNK] => 3, [MASK] => 50264
  • do_lower_case (bool, optional) – Whether to convert inputs to lower case. Not used in GPT2 tokenizer.
  • special_tokens (list, optional) – List of special tokens to be added to the end of the vocabulary.
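The special-token ids listed above are enough to show how a model input is typically assembled. The sketch below wraps a list of token ids in [CLS] and [SEP], then pads it with [PAD]. The helper wrap_and_pad is hypothetical, and the ids 100 and 200 are placeholders rather than real vocabulary entries.

```python
# Special-token ids as documented for this tokenizer.
SPECIAL_TOKEN_IDS = {'[PAD]': 0, '[CLS]': 1, '[SEP]': 2, '[UNK]': 3, '[MASK]': 50264}


def wrap_and_pad(token_ids, max_len):
    """Return (input_ids, input_mask) for one sentence, padded to max_len."""
    ids = [SPECIAL_TOKEN_IDS['[CLS]']] + list(token_ids) + [SPECIAL_TOKEN_IDS['[SEP]']]
    if len(ids) > max_len:
        raise ValueError('sequence is longer than max_len')
    pad = max_len - len(ids)
    input_mask = [1] * len(ids) + [0] * pad
    return ids + [SPECIAL_TOKEN_IDS['[PAD]']] * pad, input_mask


input_ids, input_mask = wrap_and_pad([100, 200], 6)
print(input_ids)   # [1, 100, 200, 2, 0, 0]
print(input_mask)  # [1, 1, 1, 1, 0, 0]
```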

Adds a special token to the dictionary.

Parameters:token (str) – The new token/word to be added to the vocabulary.
Returns:The id of new token in the vocabulary.
add_symbol(word, n=1)[source]

Adds a word to the dictionary.

  • word (str) – The new token/word to be added to the vocabulary.
  • n (int, optional) – The frequency of the word.

The id of the new word.


Convert list of ids to tokens.

Parameters:ids (list) – list of ids
Returns:List of tokens

Convert list of tokens to ids.

Parameters:tokens (list) – list of tokens
Returns:List of ids

Decode list of tokens to text strings.

Parameters:tokens (list) – list of tokens.
Returns:Text string corresponding to the input tokens.


>>> tokenizer = GPT2Tokenizer()
>>> text = "Hello world!"
>>> tokens = tokenizer.tokenize(text)
>>> print(tokens)
['15496', '995', '0']

>>> tokenizer.decode(tokens)
'Hello world!'

Convert an input text to tokens.

Parameters:text (str) – input text to be tokenized.
Returns:A list of byte tokens where each token represents the byte id in the GPT2 byte dictionary


>>> tokenizer = GPT2Tokenizer()
>>> text = "Hello world!"
>>> tokens = tokenizer.tokenize(text)
>>> print(tokens)
['15496', '995', '0']


class DeBERTa.deberta.ModelConfig[source]

Configuration class to store the configuration of a DeBERTa model.


Size of the encoder layers and the pooler layer, default: 768.


Number of hidden layers in the Transformer encoder, default: 12.


Number of attention heads for each attention layer in the Transformer encoder, default: 12.


The size of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder, default: 3072.


The non-linear activation function (function or string) in the encoder and pooler. If string, “gelu”, “relu” and “swish” are supported, default: gelu.


The dropout probability for all fully connected layers in the embeddings, encoder, and pooler, default: 0.1.


The dropout ratio for the attention probabilities, default: 0.1.


The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048), default: 512.


The vocabulary size of the token_type_ids passed into DeBERTa model, default: -1.


The stddev of the _normal_initializer for initializing all weight matrices, default: 0.02.


Whether to use relative position encoding, default: False.


The range of relative positions, [-max_relative_positions, max_relative_positions]; default: -1, which means the value of max_position_embeddings is used.


The value used to pad input_ids, default: 0.


Whether to add absolute position embeddings to the content embeddings, default: True.


The type of relative position attention. It can be a combination of [p2c, c2p, p2p] joined with “|”, e.g. “p2c”, “p2c|c2p”, “p2c|c2p|p2p”; default: “None”.
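A “|”-separated value such as “p2c|c2p” is easiest to work with once it is split into its components. The sketch below shows one way to do that. The function name parse_pos_att_type is hypothetical, and treating “None” as an empty list is an assumption about how the default is interpreted.

```python
def parse_pos_att_type(value):
    """Split a relative-attention type string into its components."""
    allowed = ('p2c', 'c2p', 'p2p')
    # Assumption: None or the string "None" means no relative attention terms.
    if value is None or str(value).strip().lower() == 'none':
        return []
    parts = [p.strip().lower() for p in str(value).split('|') if p.strip()]
    unknown = [p for p in parts if p not in allowed]
    if unknown:
        raise ValueError('unsupported attention type(s): ' + ', '.join(unknown))
    return parts


print(parse_pos_att_type('p2c|c2p'))  # ['p2c', 'c2p']
print(parse_pos_att_type('None'))     # []
```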



class DeBERTa.deberta.PoolConfig(config=None)[source]

Configuration class to store the configuration of pool layer.

Parameters:config (ModelConfig) – The model config. The fields of the pool config will be initialized from the pooling field in the model config.

Size of the encoder layers and the pooler layer, default: 768.


The dropout rate applied on the output of [CLS] token.


The activation function of the projection layer, it can be one of [‘gelu’, ‘tanh’].
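The relationship between the model config and the pool config can be sketched with plain JSON handling: the pool settings live under a pooling key and are copied onto the pool config object. The class PoolConfigSketch is hypothetical and only mimics the described behaviour. The JSON text reuses the values from the example config file on this page.

```python
import json

CONFIG_TEXT = """
{
  "hidden_size": 768,
  "num_hidden_layers": 12,
  "num_attention_heads": 12,
  "intermediate_size": 3072,
  "pooling": {
    "hidden_size": 768,
    "hidden_act": "gelu",
    "dropout": 0.1
  }
}
"""


class PoolConfigSketch:
    """Hypothetical stand-in that copies the pooling field of a model config."""

    def __init__(self, model_config=None):
        pooling = (model_config or {}).get('pooling', {})
        for name, value in pooling.items():
            setattr(self, name, value)


model_config = json.loads(CONFIG_TEXT)
pool = PoolConfigSketch(model_config)
print(pool.hidden_size, pool.hidden_act, pool.dropout)
```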



# Here is the content of an example model config file in JSON format

      "hidden_size": 768,
      "num_hidden_layers": 12,
      "num_attention_heads": 12,
      "intermediate_size": 3072,
      "pooling": {
        "hidden_size":  768,
        "hidden_act": "gelu",
        "dropout": 0.1