Better models of human high-level visual cortex emerge from natural language supervision with a large and diverse dataset

Aria Y. Wang, Kendrick Kay, Thomas Naselaris, Michael J. Tarr, Leila Wehbe

Research output: Contribution to journal › Article › peer-review

6 Scopus citations

Abstract

High-performing neural networks for vision have dramatically advanced our ability to account for neural data in biological systems. Recently, further improvement in performance of these neural networks has been catalysed by joint training on images and natural language, increased dataset sizes and data diversity. We explored whether the same factors (joint training, dataset size and diversity) support similar improvements in the prediction of visual responses in the human brain. We used models pretrained with Contrastive Language-Image Pretraining (CLIP)—which learns image embeddings that best match text embeddings of image captions from diverse, large-scale datasets—to study visual representations. We built voxelwise encoding models based on CLIP image features to predict brain responses to real-world images. We found that ResNet50 with CLIP is a better model of high-level visual cortex, explaining up to R² = 79% of variance in voxel responses in held-out test data, a substantial increase over models trained only with image/label pairs (ImageNet-trained ResNet) or text (BERT). Comparisons across different model backbones ruled out network architecture as a factor in performance improvements. Comparisons across models that controlled for dataset size and data diversity demonstrated that language feedback along with large and diverse datasets are important factors in explaining neural responses in high-level visual brain regions. Visualizations of model embeddings and principal component analysis revealed that our models capture both global and fine-grained semantic dimensions represented within human visual cortex.
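The voxelwise encoding approach described in the abstract can be sketched as a regularized linear regression from image features to per-voxel responses, scored with R² on held-out data. The sketch below is a minimal illustration with synthetic data, not the paper's pipeline: the feature matrix stands in for CLIP image embeddings, the responses are simulated, and the shared ridge penalty `alpha` is an assumption for illustration.

```python
# Hedged sketch of a voxelwise encoding model: regress per-voxel
# responses on image features (stand-ins for CLIP embeddings) with
# ridge regression, then score held-out predictions with R^2.
# All data here are synthetic; real pipelines extract features from
# a pretrained network and fit per-voxel regularization.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_feat, n_vox = 200, 50, 64, 10

# Stand-in for image embeddings (rows = stimuli, columns = features).
X_train = rng.standard_normal((n_train, n_feat))
X_test = rng.standard_normal((n_test, n_feat))

# Synthetic voxel responses: linear in the features plus noise.
W_true = rng.standard_normal((n_feat, n_vox))
Y_train = X_train @ W_true + 0.1 * rng.standard_normal((n_train, n_vox))
Y_test = X_test @ W_true + 0.1 * rng.standard_normal((n_test, n_vox))

# Closed-form ridge solution, one weight column per voxel:
# W = (X^T X + alpha * I)^-1 X^T Y
alpha = 1.0  # illustrative regularization strength (assumption)
W = np.linalg.solve(X_train.T @ X_train + alpha * np.eye(n_feat),
                    X_train.T @ Y_train)

# Per-voxel R^2 on held-out data.
Y_pred = X_test @ W
ss_res = ((Y_test - Y_pred) ** 2).sum(axis=0)
ss_tot = ((Y_test - Y_test.mean(axis=0)) ** 2).sum(axis=0)
r2 = 1.0 - ss_res / ss_tot
print(r2.round(3))
```

Because the synthetic responses are genuinely linear in the features, held-out R² is high here; with real fMRI data the attainable R² is bounded by measurement noise, which is why per-voxel noise ceilings are typically estimated alongside.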

Original language: English (US)
Pages (from-to): 1415-1426
Number of pages: 12
Journal: Nature Machine Intelligence
Volume: 5
Issue number: 12
DOIs
State: Published - Dec 2023

Bibliographical note

Publisher Copyright:
© 2023, The Author(s), under exclusive licence to Springer Nature Limited.
