A privacy-preserving distributed filtering framework for NLP artifacts

Md Nazmus Sadat, Md Momin Al Aziz, Noman Mohammed, Serguei Pakhomov, Hongfang Liu, Xiaoqian Jiang

Research output: Contribution to journal › Article › peer-review

10 Scopus citations


Background: Sharing medical data is a major challenge in biomedicine and often hinders collaborative research. Because of privacy concerns, clinical notes cannot be shared directly. Considerable effort has been devoted to de-identifying clinical notes, but it remains very difficult to locate and scrub all sensitive elements from notes accurately and automatically. An alternative approach is to remove sentences that might contain sensitive terms related to personal information.

Methods: A previous study introduced a frequency-based filtering approach that removes sentences containing low-frequency bigrams, improving privacy protection without significantly decreasing utility. Our work extends this method to clinical notes held by distributed sources, taking security and privacy into account. We developed a novel secure protocol based on private set intersection and secure thresholding to identify uncommon and low-frequency terms, which can then be used to guide sentence filtering.

Results: Because the computational cost of the proposed framework depends mostly on the cardinality of the intersection of the sets and on the number of data owners, we evaluated the framework with respect to these two factors. Experimental results demonstrate that the proposed method is scalable across various experimental settings. We also evaluated the framework in terms of data utility; this evaluation shows that the proposed method retains enough information for data analysis.

Conclusion: This work demonstrates the feasibility of using homomorphic encryption to develop a secure and efficient multi-party protocol.
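The frequency-based filtering summarized in the Methods can be illustrated with a minimal plaintext sketch. It is not the authors' secure protocol: it pools bigram counts from several data owners in the clear, which is exactly the step that private set intersection and secure thresholding are meant to protect. The function names, the tokenizer, the toy notes, and the threshold value are all assumptions made for illustration.

```python
from collections import Counter
import re


def bigrams(sentence):
    """Return the set of lowercase word bigrams in one sentence."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return set(zip(tokens, tokens[1:]))


def pooled_bigram_counts(sites):
    """Count, for each bigram, how many sentences across all sites contain it.

    Assumption: counts are pooled in plaintext here; a secure protocol
    would compute the equivalent result without revealing each site's terms.
    """
    counts = Counter()
    for sentences in sites:
        for sentence in sentences:
            counts.update(bigrams(sentence))
    return counts


def filter_sentences(sentences, counts, threshold):
    """Keep only sentences whose bigrams all reach the frequency threshold."""
    return [s for s in sentences
            if all(counts[b] >= threshold for b in bigrams(s))]


# Hypothetical toy notes from two data owners.
site_a = ["Patient denies chest pain.", "Mr. Smith lives in Winnipeg."]
site_b = ["Patient denies chest pain.", "Patient denies chest pain today."]

counts = pooled_bigram_counts([site_a, site_b])
kept_a = filter_sentences(site_a, counts, threshold=2)  # hypothetical threshold
```

In this toy run the sentence containing a name and a city is dropped because its bigrams appear only once across both sites, while the common clinical phrase is kept. A sentence with fewer than two tokens has no bigrams and is kept by default, a choice a real pipeline might revisit.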

Original language: English (US)
Article number: 183
Journal: BMC Medical Informatics and Decision Making
Issue number: 1
State: Published - Sep 7 2019

Bibliographical note

Funding Information:
This work was funded in part by NIBIB U01 EB023685, NSERC Discovery Grants (RGPIN-2015-04147), NIH U01TR002062, and the University Research Grants Program (URGP) of the University of Manitoba. Xiaoqian Jiang was supported in part by CPRIT RR180012, the UT Stars award, and the National Institutes of Health (NIH) under award numbers U01TR002062, R01GM114612, R01GM118574, and R01GM124111.

Publisher Copyright:
© 2019 The Author(s).


  • Biomedical data security and privacy
  • Clinical notes de-identification
  • Homomorphic encryption

