In autoregressive models, such as GPTs, output tokens are generated sequentially, i.e. token by token, and each token's prediction is conditioned on the tokens that came before it. This approach enables the model to capture intricate patterns in language, but comes with the trade-off of slower generation, since each token must wait for all of the preceding ones.
For example, consider the sentence: "The quick brown fox jumps over the lazy dog."
In an autoregressive model, the process might look like this:
"The" -> model generates "quick"
"The quick" -> model generates "brown"
"The quick brown" -> model generates "fox" ...and so on.
Non-autoregressive models generate all output tokens in parallel, without strict sequential dependence. This can lead to faster generation but may sacrifice some contextual coherence.
For example, BERT (Bidirectional Encoder Representations from Transformers) uses bidirectional context modeling, which means that each token is influenced by both the tokens to its left and the tokens to its right in the input sequence. This is in contrast to autoregressive models, where each token is influenced only by the tokens to its left.
In a non-autoregressive model using the Masked Token Prediction approach, the process might look like this:
"The quick [MASK] jumps over the lazy [MASK]."
Model predicts the masked tokens in parallel: "The quick brown jumps over the lazy dog."
In this example, the masked tokens are all predicted simultaneously, leveraging the contextual information present in the input sentence. This parallelism can lead to faster generation compared to autoregressive models, where each token is generated based on the preceding ones.
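To make the parallel fill-in concrete, here is a rough sketch of masked-token prediction, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (both illustrative choices; the actual fills are model-dependent). A single forward pass produces logits for every position, so both masked slots are predicted at once rather than one after the other.

```python
# A minimal sketch of parallel masked-token prediction, assuming the
# Hugging Face transformers library and "bert-base-uncased"
# (illustrative choices; predictions are model-dependent).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "The quick brown [MASK] jumps over the lazy [MASK]."
inputs = tokenizer(text, return_tensors="pt")

# One forward pass scores every position, so both masks are
# filled simultaneously instead of sequentially.
logits = model(**inputs).logits
mask_positions = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)

print(tokenizer.decode(predicted_ids))  # e.g. "fox dog" (not guaranteed)
```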