Saudi Electronic University Patient Privacy during Survival Analysis Case Discussion

Please read the following study:

Bonomi, L., Jiang, X., & Ohno-Machado, L. (2020). Protecting patient privacy in survival analyses. Journal of the American Medical Informatics Association, 27(3), 366–375.

Discuss your response to this survival analysis study. Do you have the same concerns as the researchers regarding the patient privacy issues when presenting actuarial/survival analysis tables? Do you have other suggestions regarding protecting patient privacy within a study?

Journal of the American Medical Informatics Association, 27(3), 2020, 366–375
doi: 10.1093/jamia/ocz195
Advance Access Publication Date: 21 November 2019
Research and Applications
Protecting patient privacy in survival analyses
Luca Bonomi1, Xiaoqian Jiang2, and Lucila Ohno-Machado1,3
1Department of Biomedical Informatics, UC San Diego Health, University of California, San Diego, La Jolla, California, USA,
2School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA, and
3Division of Health Services Research and Development, VA San Diego Healthcare System, La Jolla, California, USA
Corresponding Author: Luca Bonomi, PhD, UCSD Health Department of Biomedical Informatics, University of California
San Diego, 9500 Gilman Dr., La Jolla, California 92093, USA; lbonomi@ucsd.edu
Received 15 July 2019; Revised 9 September 2019; Editorial Decision 6 October 2019; Accepted 18 October 2019
ABSTRACT
Objective: Survival analysis is the cornerstone of many healthcare applications in which the “survival” probability (eg, time free from a certain disease, time to death) of a group of patients is computed to guide clinical
decisions. It is widely used in biomedical research and healthcare applications. However, frequent sharing of
exact survival curves may reveal information about the individual patients, as an adversary may infer the presence of a person of interest as a participant of a study or of a particular group. Therefore, it is imperative to develop methods to protect patient privacy in survival analysis.
Materials and Methods: We develop a framework based on the formal model of differential privacy, which provides provable privacy protection against a knowledgeable adversary. We show the performance of privacy-protecting solutions for the widely used Kaplan-Meier nonparametric survival model.
Results: We empirically evaluated the usefulness of our privacy-protecting framework and the reduced privacy risk
for a popular epidemiology dataset and a synthetic dataset. Results show that our methods significantly reduce the
privacy risk when compared with their nonprivate counterparts, while retaining the utility of the survival curves.
Discussion: The proposed framework demonstrates the feasibility of conducting privacy-protecting survival
analyses. We discuss future research directions to further enhance the usefulness of our proposed solutions in
biomedical research applications.
Conclusion: The results suggest that our proposed privacy-protection methods provide strong privacy protections while preserving the usefulness of survival analyses.
Key words: data privacy, survival analysis, data sharing, Kaplan-Meier, actuarial
INTRODUCTION
Survival analysis aims at computing the “survival” probability (ie,
how long it takes for an event to happen) for a group of observations that contain information about individuals, including time to
event. In medical research, the primary interest of survival analysis
is in the computation and comparison of survival probabilities
across patient groups (eg, standard of care vs. intervention), in
which survival may refer, for example, to the time free from the
onset of a certain disease, time free from recurrence, and time to
death. Survival analysis provides important insights, among other
things, on the effectiveness of treatments, identification of risk,
biomarker utility, and hypothesis testing.1–10 Survival curves aggregate information from groups of interest and are easy to generate, interpret, compare, and publish online. Although aggregate
data can be protected by different approaches, such as rounding,11,12 binning,13 and perturbation,14 survival analysis models
have special characteristics that warrant the development of customized methods. Before describing our proposed solutions, we
briefly review how survival curves are derived and what their vulnerabilities are from a privacy perspective.
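As a rough illustration of the three generic protections named above (rounding, binning, and perturbation), the following Python sketch applies each one to a single aggregate value. The function names, the rounding base, the bin width, and the choice of Laplace noise for the perturbation step are illustrative assumptions and are not taken from the cited works.

```python
import math
import random
from typing import Optional, Tuple


def round_count(count: int, base: int = 5) -> int:
    """Rounding: report the count rounded to the nearest multiple of `base`."""
    return base * round(count / base)


def bin_value(value: float, width: float = 10.0) -> Tuple[float, float]:
    """Binning: report only the [low, high) interval that contains the value."""
    low = math.floor(value / width) * width
    return (low, low + width)


def laplace_perturb(count: float, epsilon: float, sensitivity: float = 1.0,
                    rng: Optional[random.Random] = None) -> float:
    """Perturbation: add Laplace noise with scale sensitivity / epsilon."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return count + noise


if __name__ == "__main__":
    print(round_count(13))             # released instead of the exact count
    print(bin_value(47.0))             # released instead of the exact value
    print(laplace_perturb(100, 1.0))   # different on every run
```

A smaller `epsilon` gives a larger noise scale, so the released count is further from the true value on average. None of these generic steps accounts for the sequential structure of a survival curve.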
© The Author(s) 2019. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com
Survival analysis methods and privacy
Methods for survival analysis can be divided into 3 main categories:
parametric, semiparametric, and nonparametric models. Parametric
models rely on known probability distributions (eg, the Weibull distribution) to learn a statistical model. These models are less frequently
used than semi- or nonparametric methods, as their parametric
assumptions rarely hold in practice. Even though the released curves
exhibit a natural “smoothing,” studies have shown that the parameters of the model may reveal sensitive information.15 Semiparametric
methods are extremely popular for multivariate analyses and can be
used to identify important risk factors for the event of interest. As an
example, the Cox proportional hazards model16 only assumes a proportional relationship between the baseline hazard and the hazard attributed to a specific group (ie, it does not assume that survival
follows a known distribution, as is the case with parametric models).
Nonparametric models are frequently used to describe the survival
probability over time, without requiring assumptions on the underlying data distribution. Among those models, the Kaplan-Meier (KM)
product-limit estimators are frequent in the biomedical literature. As
an example, a search for PubMed articles using the term Kaplan-Meier retrieves more than 8000 articles each year, from 2013 to
2018. A search for actuarial returns about 500 articles per year. In
this article, we focus on the KM estimator and present results for the
actuarial model in the Supplementary Appendix. The KM method
generates a survival curve in which each event can be seen by a corresponding drop in the probability of survival. For example, Foldvary
et al4 used the KM method to analyze seizure outcomes for patients
who underwent temporal lobectomy for epilepsy. In contrast, in the
actuarial method,17,18 the survival probability is computed over prespecified periods of time (eg, 1 week, 1 month). For example, Balsam
et al19 used actuarial curves to describe the long-term survival for
valve surgery in an elderly population.
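To make the two nonparametric estimators concrete, the following Python sketch computes a KM product-limit curve and a simple actuarial (life-table) curve from follow-up times and event indicators. The function names and the toy data are hypothetical, the actuarial version assumes the common convention that censored subjects count as at risk for half of their interval, and neither function includes the privacy protections studied in this article.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time.

    times  -- follow-up time of each subject
    events -- 1 if the event was observed at that time, 0 if censored
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        observed = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if observed > 0:
            # Product-limit step: the curve drops at every observed event time.
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
    return curve


def actuarial(times: List[float], events: List[int],
              width: float, n_intervals: int) -> List[Tuple[float, float]]:
    """Return (interval end, survival probability) for fixed-width intervals."""
    survival = 1.0
    table = []
    for k in range(n_intervals):
        start, end = k * width, (k + 1) * width
        entering = sum(1 for u in times if u >= start)
        observed = sum(1 for u, e in zip(times, events)
                       if start <= u < end and e == 1)
        censored = sum(1 for u, e in zip(times, events)
                       if start <= u < end and e == 0)
        # Assumed convention: censored subjects are at risk for half the interval.
        effective = entering - censored / 2.0
        if effective > 0:
            survival *= 1.0 - observed / effective
        table.append((end, survival))
    return table


if __name__ == "__main__":
    toy_times = [1, 2, 2, 3, 4, 5]
    toy_events = [1, 1, 0, 1, 0, 1]
    print(kaplan_meier(toy_times, toy_events))
    print(actuarial(toy_times, toy_events, width=2, n_intervals=3))
```

In the KM output every observed event produces its own step, which is the property that can expose individual event times. The actuarial output changes only at the prespecified interval boundaries.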
It is surprising that relatively little attention has been given so far
to the protection of individual privacy in survival analysis. Survival
analyses generate aggregated results that are unlikely to directly reveal identifying information (eg, name, SSN).20 However, a knowledgeable adversary, who observes survival analysis results over time,
may be able to determine whether a targeted individual participated
in the study and even if the individual belongs to a particular subgroup in the study, thus learning sensitive phenotypes. Several previous privacy studies have shown that sharing aggregated results may
lead to this privacy risk.15,21,22 For example, small values of counts
(eg, .05) and the differences between groups continue to be statistically significant.
DISCUSSION
We presented a differentially private framework that can be used to release survival curves while protecting patient privacy. We demonstrated that our method significantly reduces the risk of a privacy breach when compared with its nonprivate counterpart, while retaining the utility of the survival curves. We discuss several future research directions.
Figure 4. Inference error for the Kaplan-Meier (KM) survival curves for N = 1000, 10000, and 100000 sampled patients obtained with the nonprivate (KM) and private (differentially private KM [DP-KM]) methods. Inference error for the KM method and the differentially private solution (DP-KM) vs the privacy parameter (ε), with (A) N = 1000, (B) N = 10000, and (C) N = 100000.
Figure 5. Mean absolute error (MAE) of the differentially private Kaplan-Meier (DP-KM) curve vs the privacy parameter (ε), for N = 1000, 10000, and 100000.
Figure 6. Survival curves for breast cancer patients in the Surveillance Epidemiology and End Results dataset for different groups. We sampled 2500 patients for each group (ie, black, white, and others) who have been diagnosed since 2005. The curves obtained with the (A) nonprivate KM method and (B) differentially private curve (DP-KM).
Distributed survival analysis
Current research initiatives often rely on collaborative efforts, such as the clinical data research network pSCANNER60 and equivalent multicenter consortia. While our proposed methods are designed for a centralized setting (ie, trusted aggregator), they could be adapted to the distributed setting. Inspired by previous work,61 we can consider a protocol in which each institution perturbs the local stream of time to events, while a central unit (not necessarily trusted) aggregates and partitions the received streams.
Relaxing privacy
Achieving high utility under differential privacy is very challenging in applications that require continual data releases. Recent works have proposed extensions of the differential privacy model, in which privacy is relaxed over time.48,49 Extending our privacy solutions to satisfy those privacy relaxations would help improve the utility of the released survival curves.
Solutions for other survival models
In this work, we presented a preliminary study on privacy-protecting survival analyses based on the KM method (the actuarial method is shown in the Supplementary Appendix). However, there are many other types of survival models, including those based on
Cox proportional hazards,16 accelerated failure time,62 recurrent time-to-event data,63 and competing risk64 methods. Building on our results, we plan to develop new privacy methods for enabling other popular privacy-protecting survival analyses in the future.
Table 2. Kolmogorov-Smirnov test results for the Kaplan-Meier method
        White       Black                 Other
White   0.0 (1.0)   0.37 (1.38 × 10^-8)   0.21 (4.32 × 10^-6)
Black   –           0.0 (1.0)             0.48 (2.39 × 10^-14)
Other   –           –                     0.0 (1.0)
Values are the Kolmogorov-Smirnov statistic (P value).
Table 3. Kolmogorov-Smirnov test results for the DP-KM method
          DPWhite               DPBlack                DPOther
DPWhite   0.0 (1.0)             0.36 (2.95 × 10^-8)a   0.23 (1.08 × 10^-4)a
DPBlack   –                     0.0 (1.0)              0.45 (1.15 × 10^-12)a
DPOther   –                     –                      0.0 (1.0)
White     0.10 (.52)a           0.34 (1.29 × 10^-9)    0.21 (4.34 × 10^-3)
Black     0.38 (5.28 × 10^-9)   0.14 (.16)a            0.48 (6.45 × 10^-14)
Other     0.28 (5.47 × 10^-5)   0.49 (8.70 × 10^-15)   0.13 (.21)a
Values are the Kolmogorov-Smirnov statistic (P value). The test results were obtained on the curves produced by the differentially private Kaplan-Meier method.
DP: differentially private.
a Differentially private curves are not statistically different from the original ones (P > .05), and they preserve the separation between groups (P < .05).
CONCLUSION
Publication of survival curves is frequent in the biomedical literature and is becoming more frequent on websites. In this work, we studied the privacy risk in conducting survival analyses and proposed a differentially private framework for the KM product-limit estimator. The differentially private curves generated by our framework prevent an adversary from inferring the time to event for a particular target individual without a significant error (eg, 250 time units) while retaining the usefulness of the original nonprivate curves.
FUNDING
This work was supported by the National Heart, Lung, and Blood Institute grant R01HL136835, National Institute of General Medical Sciences grant R01GM118609, and National Human Genome Research Institute grant K99HG010493.
AUTHOR CONTRIBUTIONS
LB developed the methods, contributed the majority of the writing, and conducted the experiments. XJ provided helpful comments on both methods and presentation. LO-M provided the motivation for this work, detailed edits, and critical suggestions.
SUPPLEMENTARY MATERIAL
Supplementary material is available at Journal of the American Medical Informatics Association online.
CONFLICT OF INTEREST STATEMENT
None declared.
REFERENCES
1. Ohno-Machado L. Modeling medical prognosis: survival analysis techniques. J Biomed Inform 2001; 34 (6): 428–39.
2. Cortese G, Scheike TH, Martinussen T. Flexible survival regression modelling. Stat Methods Med Res 2010; 19 (1): 5–28.
3. Schwartzbaum JA, Hulka BS, Fowler JW, Kaufman DG, Hoberman D. The influence of exogenous estrogen use on survival after diagnosis of endometrial cancer. Am J Epidemiol 1987; 126 (5): 851–60.
4. Foldvary N, Nashold B, Mascha E. Seizure outcome after temporal lobectomy for temporal lobe epilepsy: a Kaplan-Meier survival analysis. Neurology 2000; 54 (3): 630.
5. Galon J, Costes A, Sanchez-Cabo F, et al.
Type, density, and location of immune cells within human colorectal tumors predict clinical outcome. Science 2006; 313 (5795): 1960–4.
6. Le Voyer TE, Sigurdson ER, Hanlon AL, et al. Colon cancer survival is associated with increasing number of lymph nodes analyzed: a secondary survey of intergroup trial INT-0089. J Clin Oncol 2003; 21 (15): 2912–9.
7. Lee ET, Go OT. Survival analysis in public health research. Annu Rev Public Health 1997; 18 (1): 105–34.
8. Wagner M, Redaelli C, Lietz M, Seiler CA, Friess H, Büchler MW. Curative resection is the single most important factor determining outcome in patients with pancreatic adenocarcinoma. Br J Surg 2004; 91 (5): 586–94.
9. Strober M, Freeman R, Morrell W. The long-term course of severe anorexia nervosa in adolescents: survival analysis of recovery, relapse, and outcome predictors over 10–15 years in a prospective study. Int J Eat Disord 1997; 22 (4): 339–60.
10. Erbes R, Schaberg T, Loddenkemper R. Lung function tests in patients with idiopathic pulmonary fibrosis: are they helpful for predicting outcome? Chest 1997; 111 (1): 51–7.
11. Murphy SN, Chueh HC. A security architecture for query tools used to access large biomedical databases. Proc AMIA Symp 2002; 2002: 552–6.
12. Bacharach M. Matrix rounding problems. Manage Sci 1966; 12 (9): 732–42.
13. Lin Z, Hewett M, Altman RB. Using binning to maintain confidentiality of medical data. Proc AMIA Symp 2002; 2002: 454–8.
14. Dwork C. Differential privacy: a survey of results. In: Agrawal M, Du D, Duan Z, and Li A, eds. Theory and Applications of Models of Computation (Lecture Notes on Computation Series, volume 4978). New York, NY: Springer; 2008: 1–19.
15. Fredrikson M, Jha S, Ristenpart T. Model inversion attacks that exploit confidence information and basic countermeasures. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. New York, NY: ACM; 2015: 1322–33.
16. Cox DR. Regression models and life-tables. J R Stat Soc Ser B 1972; 34 (2): 187–220.
17. Cutler SJ, Ederer F. Maximum utilization of the life table method in analyzing survival. J Chronic Dis 1958; 8 (6): 699–712.
18. Berkson J, Gage RP. Calculation of survival rates for cancer. Proc Staff Meet Mayo Clinic 1950; 25 (11): 270–86.
19. Balsam LB, Grossi EA, Greenhouse DG, et al. Reoperative valve surgery in the elderly: predictors of risk and long-term survival. Ann Thorac Surg 2010; 90 (4): 1195–201.
20. O’Keefe CM, Sparks RS, McAullay D, Loong B. Confidentialising survival analysis output in a remote data access system. J Priv Confid 2012; 4 (1): 127–54.
21. Homer N, Szelinger S, Redman M, et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet 2008; 4 (8): e1000167.
22. Shokri R, Stronati M, Song C, Shmatikov V. Membership inference attacks against machine learning models. In: 2017 IEEE Symposium on Security and Privacy. Piscataway, NJ: IEEE; 2017: 3–18.
23. Klann JG, Joss M, Shirali R, et al. The Ad-Hoc uncertainty principle of patient privacy. AMIA Summits Transl Sci Proc 2018; 2017: 132–8.
24. Dwork C, McSherry F, Nissim K, Smith A. Calibrating noise to sensitivity in private data analysis. In: Halevi S, Rabin T, eds. TCC 2006: Theory of Cryptography Conference. New York, NY: Springer; 2006: 265–84.
25. Faraggi D, Simon R. A neural network model for survival data. Stat Med 1995; 14 (1): 73–82.
26. Katzman JL, Shaham U, Cloninger A, Bates J, Jiang T, Kluger Y. Deep survival: a deep Cox proportional hazards network. stat 2016; 1050: 2.
27. Luck M, Sylvain T, Cardinal H, Lodi A, Bengio Y. Deep learning for patient-specific kidney graft survival analysis. arXiv 2017 May 29 [E-pub ahead of print].
28. Lee C, Zame WR, Yoon J, van der Schaar M. DeepHit: a deep learning approach to survival analysis with competing risks. In: Thirty-Second AAAI Conference on Artificial Intelligence; 2018.
29. Lu C-L, Wang S, Ji Z, et al. WebDISCO: a Web service for DIStributed COx model learning without patient-level data sharing. J Am Med Informatics Assoc 2015; 22 (6): 1212–9.
30. Chaudhuri K, Monteleoni C. Privacy-preserving logistic regression. In: Koller D, Schuurmans D, eds. Advances in Neural Information Processing Systems 21 (NIPS 2008). San Diego, CA: Neural Information Processing Systems Foundation; 2008.
31. Chen T, Zhong S. Privacy-preserving models for comparing survival curves using the logrank test. Comput Methods Programs Biomed 2011; 104 (2): 249–53.
32. Yu S, Fung G, Rosales R, et al. Privacy-preserving Cox regression for survival analysis. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY: ACM; 2008: 1034–42.
33. Fung G, Yu S, Dehing-Oberije C, et al. Privacy-preserving predictive models for lung cancer survival analysis. Pract Priv-Preserving Data Min 2008; 40.
34. Pagano M, Gauvreau K. Principles of Biostatistics. New York, NY: Chapman and Hall/CRC; 2018.
35. Dwork C, Roth A. The algorithmic foundations of differential privacy. FnT Theor Comput Sci 2013; 9 (3–4): 211–407.
36. Xiao X, Wang G, Gehrke J. Differential privacy via wavelet transforms. IEEE Trans Knowl Data Eng 2011; 23 (8): 1200–14. doi: 10.1109/TKDE.2010.247.
37. Bonomi L, Xiong L. A two phase algorithm for mining sequential patterns with differential privacy. In: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management. New York, NY: ACM; 2013: 269–78.
38. Li N, Qardaji W, Su D, Cao J. Privbasis: frequent itemset mining with differential privacy. Proc VLDB Endow 2012; 5 (11): 1340–51.
39. Bhaskar R, Laxman S, Smith A, Thakurta A. Discovering frequent patterns in sensitive data. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining-KDD ’10. New York, NY: ACM Press; 2010: 503–12.
40. Barak B, Chaudhuri K, Dwork C, Kale S, McSherry F, Talwar K. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In: Proceedings of the 26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS). New York, NY: ACM; 2007: 273–82.
41. Cormode G, Procopiuc C, Srivastava D, Shen E, Yu T. Differentially private spatial decompositions. In: 2012 IEEE 28th International Conference on Data Engineering. Piscataway, NJ: IEEE; 2012: 20–31.
42. Li C, Hay M, Rastogi V, Miklau G, McGregor A. Optimizing linear counting queries under differential privacy. In: Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS). New York, NY: ACM; 2010: 123–34.
43. Li C, Miklau G. An adaptive mechanism for accurate query answering under differential privacy. Proc VLDB Endow 2012; 5 (6): 514–25.
44. Fan L, Bonomi L, Xiong L, Sunderam VS. Monitoring web browsing behavior with differential privacy. In: Chung C-W, Broder AZ, Shim K, Suel T, eds. 23rd International World Wide Web Conference, WWW ’14, Seoul, Republic Of Korea, April 7-11, 2014. New York, NY: ACM; 2014: 177–88.
45. Fan L, Xiong L. An adaptive approach to real-time aggregate monitoring with differential privacy. IEEE Trans Knowl Data Eng 2014; 26 (9): 2094–106.
46. Dwork C, Naor M, Pitassi T, Rothblum GN. Differential privacy under continual observation. In: Proceedings of the Forty-Second ACM Symposium on Theory of Computing. New York, NY: ACM; 2010: 715–24.
47. Chan T-H, Shi E, Song D. Private and continual release of statistics. ACM Trans Inf Syst Secur 2011; 14 (3): 1.
48. Kellaris G, Papadopoulos S, Xiao X, Papadias D. Differentially private event sequences over infinite streams. Proc VLDB Endow 2014; 7 (12): 1155–66.
49. Bolot J, Fawaz N, Muthukrishnan S, Nikolov A, Taft N. Private decayed predicate sums on streams. In: Proceedings of the 16th International Conference on Database Theory. New York, NY: ACM; 2013: 284–95.
50. Bonomi L, Xiong L. On differentially private longest increasing subsequence computation in data stream. Trans Data Priv 2016; 9 (1): 73–100.
51. Chaudhuri K, Monteleoni C, Sarwate A. Differentially private empirical risk minimization. J Mach Learn Res 2011; 12: 1069–109.
52. Ji Z, Jiang X, Wang S, Xiong L, Ohno-Machado L. Differentially private distributed logistic regression using private and public data. BMC Med Genomics 2014; 7 (Suppl 1): S14.
53. Abadi M, Chu A, Goodfellow I, et al. Deep learning with differential privacy. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. New York, NY: ACM; 2016: 308–18.
54. Nissim K, Steinke T, Wood A, et al. Differential privacy: A primer for a non-technical audience. In: 10th Annual Privacy Law Scholars Conference; June 1–2, 2017; Berkeley, California.
55. Dwork C, Naor M, Reingold O, Rothblum GN. Pure differential privacy for rectangle queries via private partitions. In: International Conference on the Theory and Application of Cryptology and Information Security. New York, NY: Springer; 2015: 735–51.
56. Hay M, Rastogi V, Miklau G, Suciu D. Boosting the accuracy of differentially private histograms through consistency. Proc VLDB Endow 2010; 3 (1–2): 1021–32.
57. Barlow RE, Brunk HD. The isotonic regression problem and its dual. J Am Stat Assoc 1972; 67 (337): 140–7.
58. Fleming TR, O’Fallon JR, O’Brien PC, Harrington DP. Modified Kolmogorov-Smirnov test procedures with application to arbitrarily right-censored data. Biometrics 1980; 36 (4): 607–25.
59. Noone AM, Howlader N, Krapcho M. SEER Cancer Statistics Review, 1975-2015. Bethesda, MD: National Cancer Institute.
60. Ohno-Machado L, Agha Z, Bell DS, et al. pSCANNER: patient-centered scalable national network for effectiveness research. J Am Med Inform Assoc 2014; 21 (4): 621–6. doi: 10.1136/amiajnl-2014-002751.
61. Chan T-H, Li M, Shi E, Xu W. Differentially private continual monitoring of heavy hitters from distributed streams. In: International Symposium on Privacy Enhancing Technologies Symposium. New York, NY: Springer; 2012: 140–59.
62. Wei LJ. The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Stat Med 1992; 11 (14–15): 1871–9.
63. Amorim L, Cai J. Modelling recurrent events: a tutorial for analysis in epidemiology. Int J Epidemiol 2015; 44 (1): 324–33.
64. Lau B, Cole SR, Gange SJ. Competing risk regression models for epidemiologic data. Am J Epidemiol 2009; 170 (2): 244–56.
