Pre-training Differentially Private Models with Limited Public Data

Zhiqi Bu, Xinwei Zhang, Sheng Zha, Mingyi Hong, George Karypis

Research output: Contribution to journal › Conference article › peer-review

Abstract

The superior performance of large foundation models relies on the use of massive amounts of high-quality data, which often contain sensitive, private, and copyrighted material that requires formal protection. While differential privacy (DP) is a prominent method for quantifying the degree of protection provided to the models, its application is commonly limited to the model fine-tuning stage, due to the performance degradation incurred when DP is applied during the pre-training stage. Consequently, DP is not yet capable of protecting a substantial portion of the data used during the initial pre-training process. In this work, we provide a theoretical understanding of the efficacy of DP training by analyzing the per-iteration loss improvement through the lens of the Hessian matrix for large neural networks. We make the key observation that the performance degradation of DP optimizers can be significantly mitigated by the use of limited public data, which leads to a novel DP continual pre-training strategy. Empirically, using only 10% public data and 90% private data, our strategy achieves DP accuracy of 41.5% on ImageNet-21k (with ϵ = 8), as well as non-DP accuracy of 55.7% and 60.0% on the downstream tasks Places365 and iNaturalist-2021, respectively, on par with state-of-the-art standard pre-training and substantially outperforming existing DP pre-trained models. Our DP pre-trained models are released in the fastDP library (https://github.com/awslabs/fast-differential-privacy/releases/tag/v2.1).
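The two-phase strategy described in the abstract can be illustrated with a minimal sketch of standard DP-SGD mechanics (per-sample gradient clipping plus Gaussian noise) preceded by a non-private warm-up on the small public split. The function names and the toy two-phase loop below are illustrative assumptions, not the authors' fastDP API, and the privacy accounting needed to certify a concrete ϵ is omitted.

```python
import numpy as np

def dp_sgd_step(w, per_sample_grads, clip_norm, noise_mult, lr, rng):
    """One DP-SGD step: clip each per-sample gradient to clip_norm,
    average, add Gaussian noise scaled by noise_mult, then descend."""
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    mean_grad = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm / len(per_sample_grads),
                       size=w.shape)
    return w - lr * (mean_grad + noise)

def sgd_step(w, per_sample_grads, lr):
    """Standard (non-private) SGD step, used on the public split."""
    return w - lr * np.mean(per_sample_grads, axis=0)

def continual_pretrain(w, public_batches, private_batches,
                       clip_norm, noise_mult, lr, rng):
    """Hypothetical two-phase sketch: ordinary SGD on the limited public
    data, then DP-SGD on the private data (roughly the 10%/90% split
    mentioned in the abstract)."""
    for grads in public_batches:
        w = sgd_step(w, grads, lr)
    for grads in private_batches:
        w = dp_sgd_step(w, grads, clip_norm, noise_mult, lr, rng)
    return w
```

With `noise_mult = 0` and a clipping norm large enough that no gradient is clipped, `dp_sgd_step` reduces exactly to `sgd_step`, which makes the sketch easy to sanity-check.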

Original language: English (US)
Journal: Advances in Neural Information Processing Systems
Volume: 37
State: Published - 2024
Event: 38th Conference on Neural Information Processing Systems, NeurIPS 2024 - Vancouver, Canada
Duration: Dec 9, 2024 – Dec 15, 2024

Bibliographical note

Publisher Copyright:
© 2024 Neural information processing systems foundation. All rights reserved.

