Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
Gehrmann, S., Deng, Y., and Rush, A. M. Bottom-up abstractive summarization. arXiv preprint arXiv:1808.10792, 2018.
Gillick, D., Brunk, C., Vinyals, O., and Subramanya, A. Multilingual language processing from bytes. arXiv preprint arXiv:1512.00103, 2015.
Gong, C., He, D., Tan, X., Qin, T., Wang, L., and Liu, T.-Y. FRAGE: Frequency-agnostic word representation. In Advances in Neural Information Processing Systems, pp. 1341–1352, 2018.
Grave, E., Joulin, A., and Usunier, N. Improving neural language models with a continuous cache. arXiv preprint arXiv:1612.04426, 2016.
He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Springer, 2016.
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M., Ali, M., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.
Hill, F., Bordes, A., Chopra, S., and Weston, J. The Goldilocks principle: Reading children’s books with explicit memory representations. arXiv preprint arXiv:1511.02301, 2015.
Hill, F., Cho, K., and Korhonen, A. Learning distributed representations of sentences from unlabelled data. arXiv preprint arXiv:1602.03483, 2016.
Hoang, L., Wiseman, S., and Rush, A. M. Entity tracking improves cloze-style reading comprehension. arXiv preprint arXiv:1810.02891, 2018.
Howard, J. and Ruder, S. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 328–339, 2018.
Jelinek, F. and Mercer, R. L. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, Amsterdam, The Netherlands: North-Holland, May 1980.
Jia, R. and Liang, P. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328, 2017.
Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., and Wu, Y. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
Kaiser, L., Gomez, A. N., Shazeer, N., Vaswani, A., Parmar, N., Jones, L., and Uszkoreit, J. One model to learn them all. arXiv preprint arXiv:1706.05137, 2017.
Karpathy, A., Johnson, J., and Fei-Fei, L. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015.
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. Skip-thought vectors. In Advances in Neural Information Processing Systems, pp. 3294–3302, 2015.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Kelcey, M., Devlin, J., et al. Natural Questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 2019.
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
Lample, G., Conneau, A., Denoyer, L., and Ranzato, M. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043, 2017.
Levesque, H., Davis, E., and Morgenstern, L. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, 2012.
Levy, O. and Goldberg, Y. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pp. 2177–2185, 2014.
Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser, L., and Shazeer, N. Generating Wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198, 2018.
McCann, B., Bradbury, J., Xiong, C., and Socher, R. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pp. 6294–6305, 2017.
McCann, B., Keskar, N. S., Xiong, C., and Socher, R. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.
Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119, 2013.
Nallapati, R., Zhou, B., Gulcehre, C., Xiang, B., et al. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023, 2016.
Paperno, D., Kruszewski, G., Lazaridou, A., Pham, Q. N., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fernández, R. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016.
Pennington, J., Socher, R., and Manning, C. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014.