What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?

Instruction tuning, LLMs, Zero-shot learning

Our experiments show that causal decoder-only models trained with an autoregressive language modeling objective exhibit the strongest zero-shot generalization after purely unsupervised pretraining. However, models with non-causal visibility on their input, trained with a masked language modeling objective and then multitask finetuned, perform best among our experiments.
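To make the architectural distinction concrete, the sketch below (not from the paper; the function names and mask convention are illustrative assumptions) contrasts the attention mask of a causal decoder, where each position attends only to earlier positions, with that of a non-causal (prefix-LM style) model, where positions within the input prefix attend to each other bidirectionally while generation remains autoregressive.

```python
# Minimal sketch, assuming boolean masks where True marks an allowed attention edge.
import numpy as np


def causal_mask(seq_len: int) -> np.ndarray:
    """Causal decoder mask: position i may attend only to positions j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))


def non_causal_prefix_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Non-causal (prefix-LM) mask: the input prefix is fully visible to itself,
    while positions after the prefix still attend causally."""
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True  # bidirectional visibility over the input
    return mask


if __name__ == "__main__":
    print(causal_mask(5).astype(int))
    print(non_causal_prefix_mask(5, prefix_len=3).astype(int))
```

In this framing, the two pretraining objectives compared above differ in which positions contribute to the loss: an autoregressive objective predicts every next token under the causal mask, whereas a masked language modeling objective predicts only masked-out input tokens under non-causal visibility.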