Introduction

Electronic Health Records (EHRs) have been routinely collected by healthcare providers across the US and are extensively used for research purposes. Similarly, claims data from insurance companies are often used in population-based clinical research. Normally, data are stored and managed within the institutions that collect and own them. Storing data locally is generally more feasible logistically, more cost-effective, and makes it easier for the data-owning entity to access, control, and manage the data. More importantly, local storage helps ensure data sovereignty and maintain data privacy and security in compliance with data protection regulations such as the Health Insurance Portability and Accountability Act (HIPAA)1,2. Strong privacy protection builds confidence among researchers, patients, and other stakeholders and encourages research collaborations in trustworthy medical AI3.

Through collaborative research, EHRs and claims data from institutions across diverse geographical locations can form a larger and potentially more representative sample of the US population, which could yield more reliable and generalizable research findings4. Leveraging distributed data in healthcare research can be particularly beneficial for marginalized or minority groups because it allows institutions with very small minority populations to borrow information from others5,6,7,8. Several large-scale distributed health data networks (DHDNs) have been established to facilitate collaborations across multiple institutions. For example, the Sentinel Initiative by the U.S. Food and Drug Administration (FDA) monitors the safety of FDA-regulated medical products. Sentinel draws data from more than a dozen partners, including academic medical centers, healthcare systems, and health insurance companies; these data partners collect data in routine operations and maintain control of their own data9,10. Another example is the Patient-Centered Scalable National Network for Effectiveness Research (pSCANNER), a national research infrastructure containing data from 13 sites that emphasizes comparative effectiveness research11. Similarly, data are stored, owned, and governed by each of the pSCANNER sites without a central data repository.

To enjoy the aforementioned benefits of distributed health data, conventional machine learning (ML) methods would require researchers to first “bring data to computation”, transmitting individual patient data from remote sites to a central data repository and performing centralized ML on the aggregated data. However, this is not always permitted, due to legal restrictions or data privacy and security concerns. In addition, operating data centers large enough for centralized storage and computation is financially and logistically challenging, and the consequences are serious if such a large data center experiences a system failure or data breach12. These restrictions and limitations have motivated a broad class of modern distributed ML algorithms that “bring computation to data”13. Distributed ML allows researchers to take advantage of distributed storage and computational infrastructure and resources, which reduces, if not eliminates, the need for large data centers for EHRs. It also minimizes the need to share sensitive protected health information, complying with legal requirements and improving the privacy and security of healthcare data14,15.

The missing data problem is prevalent in real-world EHRs and claims data, and therefore in DHDNs as well16,17. Failure to properly account for missing data leads to biased inference and prediction results17,18,19. Recent studies further show that missingness in health data tends to harm minority groups disproportionately, exacerbating health inequities and disparities19, because the missing information in a minority cohort impacts the accuracy of downstream studies for that population more severely than for a well-represented one. Also, the discrepancy in representation between different minority groups varies across factors (e.g., younger populations tend to be generally underrepresented, regardless of lineage)19.

Missing data in EHRs are classified as missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). A variable is considered missing at random if its missingness depends only on other observed variables in the dataset, and missing not at random if its missingness depends on the (unobserved) value of the variable itself. Otherwise, the data are considered MCAR. Complete case analysis, which excludes observations with missing values, is a valid approach if data are MCAR18. However, data that are MAR need to be properly imputed to recover unbiased analysis results18.
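
To make the distinction concrete, the toy sketch below (illustrative values and probabilities only, not taken from our experiments) masks a variable X1 under each of the three mechanisms: under MCAR the mask is independent of the data, under MAR it depends on the fully observed X2, and under MNAR it depends on the unobserved value of X1 itself.

```python
# Toy illustration of the three missingness mechanisms (values are illustrative).
import numpy as np

rng = np.random.default_rng(0)
x2 = rng.uniform(-3, 3, size=1000)                     # fully observed variable
x1 = rng.normal(0.2 - 0.5 * x2, 1.0)                   # variable to be masked

mcar_mask = rng.random(1000) < 0.3                     # MCAR: independent of the data
mar_mask = rng.random(1000) < 1 / (1 + np.exp(-x2))    # MAR: depends on observed x2
mnar_mask = rng.random(1000) < 1 / (1 + np.exp(-x1))   # MNAR: depends on x1 itself

x1_mar = np.where(mar_mask, np.nan, x1)                # dataset observed under MAR
```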

Multiple imputation (MI) is a popular imputation technique that, in general, works by replacing missing values with predicted values multiple times and combining the analysis results obtained from the imputed datasets. However, since DHDNs comprise data from multiple institutions, missing data problems can be more complex due to various heterogeneities between institutions. Compared to the large body of literature on distributed ML algorithms, distributed MI that can handle missing data in DHDNs has received much less attention. In principle, MI algorithms rely on various statistical ML models to impute missing observations; therefore, distributed MI algorithms should enjoy the same benefits as the model-based distributed analyses mentioned earlier. In addition, data sources with either small sample sizes or few observed values due to high missing rates can borrow information from other data sources. To our knowledge, several distributed MI algorithms designed for MAR data have been proposed and shown to outperform MI conducted independently at each site17. In addition, a distributed MI algorithm for MNAR data has been proposed that also demonstrates superiority over independent MI algorithms20. However, these approaches are not provably secure, as they reveal intermediate results that can leak private information. One such example is the Gramian matrix revealed by these approaches, which can be used to completely reconstruct private data whenever the number of individuals at some party is less than or equal to the number of training features.

Here, for the first time, we offer a provably secure imputation of missing data (Secure MICE) in distributed EHRs that reveals only the final analysis result. We enabled an otherwise non-secure, centralized multiple imputation with chained equations (MICE) algorithm to be executed in secure distributed contexts. Our solution utilizes secure multiparty computation (SMC)21 and multiparty homomorphic encryption (MHE)22 technologies and provides accuracy on par with the equivalent non-secure solutions, where the data are pooled into a single cohort. For completeness, we evaluated our solution on both MAR and MNAR data and compared it to two non-secure, centralized variants of the MICE algorithm, one with better performance and the other with state-of-the-art accuracy23. We used Sequre24,25, a framework for high-performance SMC, to implement our SMC-based algorithms and extended it with MHE protocols to enable the development of our MHE-based solutions. As a result, we obtained practical runtimes of only a few seconds for small-scale solutions and less than 15 s for a large-scale solution. Finally, we showcase a real-life example where our solution enables up to 10% more accurate prediction of death within 48 h of intensive care unit (ICU) admission (Box 1), compared to the non-distributed case where each site computes on its own data, by imputing and holistically utilizing a large-scale sample of an incomplete Medical Information Mart for Intensive Care (MIMIC) dataset26.

In general, we expect Secure MICE to help fully utilize private distributed datasets to impute missing data and enrich collective statistical studies, thus broadening data sharing and collaboration efforts between medical institutions and other private data proprietors with incomplete datasets.

Results

Experiments setup

We adopted the experiment setup from previous work17, which includes four simulations of data MAR and two real-data studies. While our approach is based on supervised learning and, as such, is best suited for imputing data MAR, we added two simulation studies on data MNAR for completeness. Each study follows the same pattern. First, an incomplete dataset with a varying number of individuals and variables, such as demographic and disease information, is encrypted and pooled together from multiple study participants. The missing data in the pooled dataset is then imputed multiple times to form several independent complete datasets, on top of which different regression models are trained as part of the final analysis. The trained models’ weights are then combined via Rubin’s rules to produce a final regression model that is used to assess performance. Each step of the study is done on encrypted data without revealing any meaningful information apart from the final analysis output. Most studies in this work used linear regression as the final analysis model. The only exception is the second real-data study, which used logistic regression for a binary outcome variable. The quality of the final linear regression is measured as the mean and standard deviation of the absolute difference between the predicted outcome and the ground truth, while the quality of the logistic regression is measured as a combination of accuracy and area under the curve (AUC). We also measured the bias \(\parallel {\mathbb{E}}{\mathbf{\Theta }}-\widetilde{{\mathbf{\Theta }}}{\parallel }_{2}\), standard deviation \(\sqrt{{\mathbb{E}}\parallel {\mathbf{\Theta }}-{\mathbb{E}}{\mathbf{\Theta }}{\parallel }_{2}^{2}}\), and mean-squared error \(\sqrt{{\mathbb{E}}\parallel {\mathbf{\Theta }}-\widetilde{{\mathbf{\Theta }}}{\parallel }_{2}^{2}}\) of the regression weights Θ with respect to their ground truth \(\widetilde{{\mathbf{\Theta }}}\), where possible. To assess the quality of imputation alone, we additionally measured the mean and standard deviation of the absolute difference between imputed datasets and their ground truth in the simulation studies with incomplete continuous variables, and accuracy and AUC for those with incomplete binary variables. This assessment is not possible in the real-data studies, however, because the ground truth of the missing data is unknown. Finally, we also measured the runtime and network overhead where applicable.
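
As a reference for how these weight-quality metrics are computed, the following sketch evaluates the bias, standard deviation, and mean-squared error defined above over estimated weights from repeated runs; the input arrays are hypothetical.

```python
# Sketch of the weight-quality metrics, given estimated weights from repeated
# benchmark runs (one row per run) and their ground truth; inputs are hypothetical.
import numpy as np

theta_hat = np.array([[0.98, 1.03], [1.05, 0.96], [0.99, 1.01]])
theta_true = np.array([1.0, 1.0])

mean_theta = theta_hat.mean(axis=0)                    # empirical E[Theta]
bias = np.linalg.norm(mean_theta - theta_true)         # ||E[Theta] - Theta_true||_2
std = np.sqrt(np.mean(np.linalg.norm(theta_hat - mean_theta, axis=1) ** 2))
mse = np.sqrt(np.mean(np.linalg.norm(theta_hat - theta_true, axis=1) ** 2))
print(bias, std, mse)
```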

Simulation studies

The first simulation (Table 1) is conducted on ten variables drawn from a normal distribution \({\mathcal{N}}(0,1)\), one of which is made incomplete uniformly at random with a missingness rate of 30%. The second simulation (Table 2) is the same, except that the incomplete variable is a binary variable drawn from a Bernoulli distribution \({\mathcal{B}}(1,0.5)\). The remaining simulation studies (Tables 3–6) have only two variables, X1 and X2, with the second variable drawn from a uniform distribution \({\mathcal{U}}(-3,3)\) and the first drawn either from a normal distribution \({\mathcal{N}}(0.2-0.5{X}_{2},1)\) in the third and fifth simulations, or from a Bernoulli distribution \({\mathcal{B}}(1/(1+{e}^{-0.2+0.5{X}_{2}}))\) in the fourth and sixth simulations, with missingness rates ranging from 50% to 60%. Each simulation is benchmarked for different numbers of individuals (500 and 5000 in our experiments). The outcome variable (i.e., the ground truth of the final regression analysis) in each simulation study is obtained as Y = Θ0 + ∑iXiΘi + ϵ, where Xi are the variables, Θi are the ground-truth linear regression weights (set to 1 in our experiments), and ϵ is drawn from \({\mathcal{N}}(0,({{\mathbf{\Theta }}}_{0}+{\sum }_{i}{X}_{i}{{\mathbf{\Theta }}}_{i})/100)\). The outcome variable is computed on the complete dataset, before removing the missing data. In each study, five multiple imputations are used (i.e., each dataset is imputed five times and five independent regression models are trained as part of the final analysis). The simulation and real-data studies are independently benchmarked 100 and 5 times, respectively.
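
The sketch below reproduces the data-generating process of the third scenario under our reading of the setup; the noise scale (taken as |signal|/100) and the uniformly random masking are simplifying assumptions rather than the exact original simulation code.

```python
# Sketch of Scenario 3: continuous X1, conditional on the complete variable X2,
# made incomplete with a 50% missingness rate.
import numpy as np

rng = np.random.default_rng(0)
m = 5000                                               # number of individuals
x2 = rng.uniform(-3, 3, size=m)
x1 = rng.normal(0.2 - 0.5 * x2, 1.0)

theta0 = theta1 = theta2 = 1.0                         # ground-truth weights
signal = theta0 + theta1 * x1 + theta2 * x2
y = signal + rng.normal(0, np.abs(signal) / 100)       # outcome computed before masking

x1_observed = np.where(rng.random(m) < 0.5, np.nan, x1)  # 50% of X1 removed
```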

Table 1 Scenario 1: Single continuous incomplete variable missing at random, 9 continuous complete variables, 100 random runs, 30% missing rate
Table 2 Scenario 2: Single binary incomplete variable missing at random, 9 continuous complete variables, 100 random runs, 30% missing rate
Table 3 Scenario 3: Single continuous incomplete variable missing at random, single continuous complete variable, 100 random runs, 50% missing rate
Table 4 Scenario 4: Single binary incomplete variable missing at random, single continuous complete variable, 100 random runs, 50% missing rate
Table 5 Scenario 5: Single continuous incomplete variable missing not at random, single continuous complete variable, 100 random runs, 55% missing rate
Table 6 Scenario 6: Single binary incomplete variable missing not at random, single continuous complete variable, 100 random runs, 60% missing rate

Real-data studies

We used Secure MICE to predict the arrival-to-computed-tomography time and death within 48 h using two large patient cohorts (Tables 7 and 8). For the first real-data study, data from the Georgia Coverdell Acute Stroke Registry (GCASR) were used, with 15 out of 203 variables selected (five continuous and ten binary, based on previous work17) and 68,287 patients. Each continuous variable and seven binary variables are incomplete, with missingness rates ranging between 0.035% and 53.84%. For the second real-data study, we used a sample of 94,459 patients from the MIMIC dataset26, for which we were able to curate 17 out of 164 continuous variables: basic demographics such as age, ICU type, ICU admission time, and length of stay; vitals such as heart rate, systolic/diastolic blood pressure, mean arterial pressure, respiratory rate, body temperature, and oxygen saturation; four Glasgow Coma Scale metrics for neurological assessment (total score and eye/verbal/motor responses); and laboratory measurements such as glucose and blood pH, with a missing rate of 57% on average (ranging from none to 84% per variable). This sample included 3445 patients who died within 48 h of ICU admission; to reduce prediction bias, we sampled the same number of patients who survived, for a total of 6890 patients in the training dataset.

Table 7 Real-data scenario (GCASR): 90,000 individuals, 5 continuous and 10 binary incomplete (missing) variables with a missing rate ranging from 0.035% to 53.84%
Table 8 Real-data scenario (MIMIC) for prediction of death within 48 h after ICU admission using baseline demographics, neurological assessments, vitals, and laboratory measurements within the first 2 h of admission: 6890 individuals, 17 continuous variables with a missing rate of up to 84%

Benchmarked solutions and implementation details

We implemented two secure solutions for MICE, one based on SMC and the other on MHE. Additionally, we implemented two non-secure solutions to compare against. The first is a raw Python implementation of MICE using off-the-shelf linear and logistic regression for imputation and final analysis, and the second is the off-the-shelf MICE algorithm from Python’s scikit-learn library23. There is no clear winner between the two non-secure solutions, but the latter is generally expected to have better accuracy, while the former has better performance. The Python-based solutions are tested in an offline, non-secure context on plain, non-encrypted data, while the secure solutions are tested in a secure distributed setup, on encrypted data, with two computing parties aided by a trusted dealer.
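
For reference, a minimal non-secure baseline in the spirit of the second solution could look like the sketch below, which uses scikit-learn’s IterativeImputer (its MICE-style imputer) on synthetic data; this is an illustration of the workflow, not our exact benchmark code.

```python
# Non-secure baseline sketch: impute five times with scikit-learn's
# IterativeImputer, fit a final regression on each imputed dataset, and
# average the coefficients (Rubin's rules for the point estimates).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=500)
X[rng.random(500) < 0.3, 0] = np.nan                   # 30% missingness in one column

coefs = []
for seed in range(5):                                  # five multiple imputations
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    X_imp = imputer.fit_transform(X)
    model = LinearRegression().fit(X_imp, y)
    coefs.append(np.r_[model.intercept_, model.coef_])

final_coefs = np.mean(coefs, axis=0)                   # pooled final model weights
```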

We implemented both secure solutions in Sequre24,25 in less than 550 lines of Pythonic code. Sequre’s compile-time optimizations for reducing network overhead, as well as low-level performance optimizations such as modulo-operator customization and exposing data-level parallelism, are mainly responsible for the practical runtimes. Also, Sequre’s configurable fixed-point arithmetic allowed us to reduce the truncation-error noise in SMC. In particular, we used 192-bit long integers, with 32 bits reserved for the fractional part, 64 bits for a whole fixed-point value, and 64 bits of padding for statistical security. To obtain similar accuracy in MHE, we adhered to common CKKS parameters with 128-bit security, enabling 8192 slots with a default scale of \({2}^{34}\), which provides a good balance between performance and accuracy27,28. Lastly, all experiments were run on a single 12-core Intel Core i7-8700 CPU at 3.20 GHz with 64 GB of RAM. To simulate a multiparty setup, UNIX sockets were used to connect multiple processes, each corresponding to a separate computing party. Nevertheless, Sequre allows easy deployment across arbitrary network architectures and, as such, will facilitate seamless integration of our solution across multiple institutions.
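
The toy sketch below illustrates the fixed-point idea with 32 fractional bits; the power-of-two modulus and the in-the-clear truncation are simplifications for exposition and are not Sequre’s actual ring, parameters, or truncation protocol.

```python
# Toy fixed-point encoding: reals are scaled by 2^FRAC_BITS and stored as
# residues modulo an (illustrative) 192-bit modulus.
FRAC_BITS = 32
MODULUS = 2 ** 192                                     # illustrative modulus only

def encode(x: float) -> int:
    return round(x * (1 << FRAC_BITS)) % MODULUS

def decode(v: int) -> float:
    if v > MODULUS // 2:                               # upper half encodes negatives
        v -= MODULUS
    return v / (1 << FRAC_BITS)

a, b = encode(3.25), encode(-1.5)
assert abs(decode((a + b) % MODULUS) - 1.75) < 1e-9    # addition needs no correction
prod = (a * b) % MODULUS                               # carries an extra 2^FRAC_BITS factor
assert abs(decode(prod) / (1 << FRAC_BITS) - (-4.875)) < 1e-9  # truncation removes it
```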

Evaluation

The imputation and final-study quality of the secure solutions are on par with or slightly better than the offline solutions in all simulation studies. We note that our goal was not to improve the existing MICE algorithms but to design, for the first time, their secure equivalents with on-par accuracy and performance. The imputation accuracy is slightly worse (<0.006) only in studies where a categorical variable is imputed (Tables 2 and 4) due to the approximation algorithms (Chebyshev approximation) employed in the secure variants of logistic regression. Similarly, the quality of the final study is only fractionally worse (0.001–0.063) in the secure solutions, an offset that can be further attributed to approximation errors that are unavoidable in the security schemes we employ24,29.
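
The sketch below shows the kind of Chebyshev approximation of the logistic sigmoid that the secure logistic regression relies on; the degree and interval are illustrative choices, not the exact parameters of our implementation.

```python
# Approximate the logistic sigmoid with a Chebyshev polynomial: evaluating the
# resulting polynomial needs only additions and multiplications, which are
# cheap under SMC/MHE, at the cost of a small approximation error.
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

approx = Chebyshev.interpolate(sigmoid, deg=9, domain=[-8, 8])

xs = np.linspace(-8, 8, 1000)
max_err = np.max(np.abs(approx(xs) - sigmoid(xs)))
print(f"max abs error on [-8, 8]: {max_err:.4f}")
```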

We simulated the distributed environment for the last real-data study (the prediction of death within 48 h on the MIMIC dataset) by splitting the data across three sites. We first conducted a separate, independent run at each site, without imputation or data sharing. Each site’s data consisted of about 100 patients, since only 305 patients in the MIMIC dataset had complete data (i.e., no missing variables). As such, the accuracy and AUC of predicting the risk of death of recently admitted patients were 0.70 and 0.80, respectively. We then conducted the same study using our secure solutions, where the data of all 6890 patients were imputed and utilized for training, and ultimately achieved an accuracy and AUC of 0.77 and 0.88, respectively (Table 8). In other words, our solution correctly classifies an additional 10% of high-risk patients for a given number of ICU admissions. Moreover, apart from the MHE variant, which is slower for this amount of data due to under-utilization of its packing mechanism, in which operations are executed over encrypted arrays in a SIMD-like manner29, the runtimes of the SMC solutions are generally small (12 s for GCASR and 285 s for the MIMIC dataset). This is an important practical result, since secure solutions are generally known to incur a large performance overhead24. The reason for the slowdown in the MIMIC-based experiment, even though it uses a smaller dataset than GCASR, is that its outcome variable is binary and, thus, the final analysis model, trained on a large imputed dataset, uses logistic regression, which requires expensive polynomial approximations of the logistic sigmoid. Nevertheless, computing the risk score for a single patient (i.e., a single inference) requires only 15 μs in the SMC variant and 1 ms in the MHE variant.

Discrepancy analysis

To further assess the quality of our imputation algorithms, we measured the number of discrepancies17 with respect to the off-the-shelf MICE algorithm from the scikit-learn library as the base algorithm. In short, a variable in the final linear regression study has a discrepancy between two MICE algorithms (target and base) if and only if its p-value is less than or equal to 0.05 in the base algorithm and either its p-value in the target algorithm is larger than 0.05 or its weights in the two algorithms have opposite signs. A smaller number of discrepancies is desirable, since it indicates similar imputation quality between the two algorithms. In our measurements, we observed one discrepancy in the offline Python implementation of MICE in the GCASR study, compared to no discrepancies in its secure counterpart. We also measured six discrepancies in the offline Python study on the MIMIC dataset, while our secure equivalent produced five. We measured no discrepancies in any other solution across all studies. Counting discrepancies is particularly useful when there is no ground truth against which to measure imputation quality, such as in the real-data studies.
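
The discrepancy criterion can be sketched directly from this definition; the per-variable weights and p-values below are hypothetical placeholders for the outputs of a base and a target MICE pipeline.

```python
# Count discrepancies: variables significant in the base analysis that either
# lose significance in the target analysis or flip the sign of their weight.
import numpy as np

def count_discrepancies(base_w, base_p, target_w, target_p, alpha=0.05):
    count = 0
    for bw, bp, tw, tp in zip(base_w, base_p, target_w, target_p):
        if bp <= alpha and (tp > alpha or np.sign(bw) != np.sign(tw)):
            count += 1
    return count

# Toy example: the second variable loses significance, the third flips sign.
base_w,   base_p   = [0.8, 0.5, -0.3], [0.01, 0.04, 0.02]
target_w, target_p = [0.7, 0.4,  0.2], [0.02, 0.20, 0.03]
print(count_discrepancies(base_w, base_p, target_w, target_p))  # -> 2
```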

Discussion

We enable provably secure statistical studies on private, incomplete distributed datasets while maintaining data privacy. Specifically, we used SMC and MHE to implement a secure distributed variant of the multiple imputation with chained equations (MICE) procedure that imputes missing data in a distributed setup and performs statistical analysis on top of it without revealing any meaningful information apart from the final outcome to the study participants. Our solution proved to have practical performance and accuracy on par with standard, non-secure, centralized implementations of MICE, which assume that data are pooled into a single cohort and which are often hindered in practice by privacy concerns. For example, predicting the risk of death of a recently admitted ICU patient in the MIMIC dataset26, distributed across multiple sites that cannot directly share their data due to privacy, is 10% more accurate with our solution because it allows the distributed dataset to be utilized securely as a pooled cohort, in contrast to each site making predictions on its own dataset without data sharing. Moreover, the high-level expressiveness of our solution allows for easy adoption and deployment of our protocols, even when the size and resources of the institution employing them are limited. This is because Sequre, the secure programming framework we utilized, is written in a high-level, Pythonic syntax, oblivious of SMC- or MHE-specific concerns, and is automatically optimized for performance, which allowed our experiments to run on standard office hardware.

Our solution follows an honest-but-curious trust model, where study participants are expected to faithfully follow the execution protocol without altering either the algorithm or the data, but are allowed to arbitrarily inspect the data they possess or receive throughout the computation. Also, as the foundational MICE algorithm is apt for imputing MAR data only, our secure methods are limited to the same missingness type. To increase their versatility, provide different security guarantees, and potentially achieve even better runtime and accuracy, we plan to extend our solution with more accurate secure algorithms for imputing MNAR data, as well as support for malicious-secure protocols and trusted execution environments such as Intel’s SGX30. Also, as our solution currently supports only regression models in the final analysis, we plan to add support for other machine learning models, in particular deep learning-based models.

Methods

Missing data imputation

Multiple imputation (MI) addresses the uncertainty of single imputation by probabilistically imputing the data multiple times before conducting a study. The study is then conducted independently on each imputed dataset, and the results are combined via Rubin’s rules, usually in the form of an aggregate statistic of the underlying studies’ coefficients (Fig. 1)31. Whenever more than one variable in the initial dataset is incomplete, the data are imputed iteratively within a single imputation, one variable at a time, while re-using the completed data from the previously imputed variables. This procedure is called multiple imputation with chained equations (MICE).

Fig. 1: Multiple imputation.

The missing data is independently imputed multiple times to address the uncertainty of imputation. Then, a set of independent studies is done on top of imputed datasets, and their parameters are combined using Rubin’s rules to produce a final study. For example, if the final study involves doing a linear regression on top of a dataset, then multiple linear regression models will be independently trained and their coefficients combined, usually through some aggregation, into a final linear regression model.
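
For intuition, a plaintext (non-secure) version of a single chained-equations pass could look like the sketch below; the secure variants described in the following sections perform the same steps on encrypted data, and the noise scale here is an illustrative choice.

```python
# Plaintext sketch of one MICE pass: each incomplete column is imputed by
# regressing it on the remaining columns over the currently complete rows.
import numpy as np
from sklearn.linear_model import LinearRegression

def mice_pass(D, rng):
    """D: 2-D array with np.nan marking missing entries (modified in place)."""
    mask = ~np.isnan(D)                                # observed-entry mask
    for j in range(D.shape[1]):
        missing = ~mask[:, j]
        if not missing.any():
            continue
        # Missing predictors are zeroed, mirroring the D*M masking used later.
        others = np.delete(np.nan_to_num(D), j, axis=1)
        complete = mask.all(axis=1)                    # rows usable for training
        model = LinearRegression().fit(others[complete], D[complete, j])
        noise = rng.normal(0, 0.1, size=missing.sum()) # small stochastic error term
        D[missing, j] = model.predict(others[missing]) + noise
        mask[:, j] = True                              # column now treated as complete
    return D

# usage: imputed = mice_pass(data_with_nans.copy(), np.random.default_rng(0))
```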

Privacy enhancing technologies

Privacy-enhancing technologies protect data privacy throughout the computation. Prominent examples include differential privacy (DP)32, SMC21, homomorphic encryption (HE)33, and MHE22. Here, we focus on SMC and MHE, technologies that enable computation on distributed data privately held by multiple stakeholders without disclosing any meaningful information to each other. Specifically, SMC enables computation on private distributed datasets by secret-sharing34 and pooling all private data partitions into a single encrypted tensor (usually in the form of a matrix) and by employing a set of specialized routines that operate on such shared data. As SMC comes in many variants, we settle for additive secret sharing with honest-but-curious stakeholders (meaning that the parties execute the provided code correctly but might try to infer information about the other parties’ data), aided by a trusted dealer34.
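
A minimal illustration of additive secret sharing itself (not Sequre’s implementation) is sketched below: a value is split into random shares that sum to the secret modulo a ring size, so no individual share reveals anything, and addition of shared values requires no communication.

```python
# Toy additive secret sharing over Z_q.
import secrets

Q = 2 ** 64                                            # illustrative ring size

def share(secret: int, n_parties: int) -> list[int]:
    shares = [secrets.randbelow(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares: list[int]) -> int:
    return sum(shares) % Q

s = share(42, 3)
assert reconstruct(s) == 42
# Addition of two shared values is local: each party adds its own shares.
t = share(100, 3)
assert reconstruct([(a + b) % Q for a, b in zip(s, t)]) == 142
```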

MHE, on the other hand, combines SMC with homomorphic encryption, another fundamental cryptographic primitive. Homomorphic encryption (HE) is a form of encryption that allows direct computation over encrypted data without decryption. In this work, we rely on the Cheon-Kim-Kim-Song (CKKS) scheme29, which sacrifices perfect correctness for improved performance and encodes vectors of continuous values. The scheme supports vector additions, multiplications, and rotations, and any operation is performed simultaneously on all vector values, akin to “single instruction, multiple data” (SIMD) instructions. To maintain the ciphertext size and scale (values are scaled by a constant before encryption to ensure a high level of precision), ciphertexts have to be rescaled after any multiplication and relinearized after multiplication with another ciphertext. After a certain number of multiplications, the ciphertext needs to be refreshed through a bootstrapping procedure to ensure correct decryption. While this operation is prohibitively expensive in the standard CKKS scheme, in MHE it can be substituted with an interactive protocol in which ciphertexts are transformed into secret shares and re-encrypted. Using a similar approach, a ciphertext can be converted into additive shares35, which can then be used for SMC operations. While HE enables efficient polynomial operations on large encrypted vectors, non-polynomial operations such as comparison, square root, and division can be evaluated more efficiently in the secret-sharing variant of SMC.

Secure MICE algorithm

We consider a typical distributed use case where the incomplete training data is horizontally divided between the parties (i.e., each party contributes a different number of individuals but the same set of training features). We note, however, that our solution is also applicable to other distribution types, such as vertical or even additive, where the sum of private data partitions forms the complete dataset. We enabled two variants of secure distributed MICE, one implemented using secure multiparty computation (SMC-MICE; Algorithm 1; Fig. 2) and the other using multiparty homomorphic encryption (MHE-MICE; Algorithm 5; Fig. 3). The former is suitable for smaller data scales (approximately fewer than 300,000 individuals) and a small number of computing parties, while the latter scales better as the data size or the number of parties grows. Both schemes enable computation on encrypted, distributed data without revealing any meaningful information to the study participants or data owners.

Fig. 2: Multiple imputation via SMC.

The input data is first secret-shared and then imputed and analyzed in an SMC context. Each independent study produces secret-shared coefficients that are averaged together without decryption. The result is a final, secure linear regression model that allows inference on encrypted data without revealing any meaningful information.

Fig. 3: Multiple imputation via MHE.

The input data is distributed between the parties and kept in non-encrypted form, and is only encrypted when needed during the imputation and final analysis. The procedure also benefits from independent, parallel computation on local data partitions. This scheme, however, is suitable only for large-scale datasets: the underlying cryptographic scheme incurs a considerable baseline overhead but scales well with the data size and the number of computing parties.

Our solution, in both schemes, conceptually works as follows. The training data is first encrypted and pooled together from multiple data owners and then imputed multiple times, using linear regression with an error term (drawn from \({\mathcal{N}}(0,0.01)\)) for continuous variables or logistic regression for categorical variables. On each imputed dataset, an independent linear or logistic regression is trained as part of the final study, and the arithmetic average of the resulting models’ coefficients serves as Rubin’s rules to produce the final regression model. We utilized mini-batched gradient descent with a pre-defined step size and number of epochs for both linear and logistic regression, minimizing the mean-squared error and categorical cross-entropy loss, respectively. Additionally, for linear regression we used a closed-form solution when the number of features is small (fewer than 4 in our implementation). Each step, in both the SMC and MHE variants of MICE, is done on encrypted data without revealing any meaningful information to the parties.

In SMC-MICE, the computing parties first secret-share the incomplete training data ([X]) and the training labels ([y]). Additionally, each party provides the missingness mask for its data partition (i.e., a zero-one matrix where 0 denotes a missing entry). The missingness masks are pooled into a single matrix (M) that remains public throughout the execution. The incomplete training data is then imputed multiple times using an SMC implementation of the two aforementioned regression models (i.e., linear for continuous and logistic for categorical variables). Each imputed, secret-shared dataset is used to train an independent final analysis model, which, in our case, is again an SMC variant of linear or logistic regression. The secret shares of the models’ weights are then pooled and averaged to form the final regression model. The details of the SMC-MICE algorithm are provided in Algorithm 1 and Algorithm 2. The former gives a general overview, while the latter details a single imputation procedure in which, for each incomplete variable, a separate SMC regression model is used to infer the missing data. The SMC implementation of linear regression uses Beaver triplets36 to enable secure multiplication and computes the secret shares of the weights in an otherwise classical manner (see Algorithm 3 for details), using only simple operations such as addition and subtraction (together with multiplication) that are efficient in our security scheme. Moreover, each additive operation is computed independently at each party without network overhead. For logistic regression, we implemented SMC variants of Chebyshev interpolation to support the sigmoid and logarithm functions that are otherwise hard to compute in SMC. Lastly, we also employed existing optimization techniques for caching the Beaver triplets24 to reduce network consumption.
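
For intuition, the textbook two-party Beaver-triple multiplication of two secret-shared scalars is sketched below (with the dealer’s triple generated in the clear for brevity); this is a conceptual toy, not our Sequre implementation.

```python
# Beaver-triple multiplication of additively shared scalars x and y.
import secrets

Q = 2 ** 64

def share(v, n=2):
    s = [secrets.randbelow(Q) for _ in range(n - 1)]
    return s + [(v - sum(s)) % Q]

def open_shares(shares):
    return sum(shares) % Q

# Trusted dealer samples a triple (a, b, c) with c = a * b and shares it.
a, b = secrets.randbelow(Q), secrets.randbelow(Q)
c = (a * b) % Q
a_sh, b_sh, c_sh = share(a), share(b), share(c)

x, y = 7, 13
x_sh, y_sh = share(x), share(y)

# Parties open the masked values e = x - a and f = y - b (these reveal nothing).
e = open_shares([(x_i - a_i) % Q for x_i, a_i in zip(x_sh, a_sh)])
f = open_shares([(y_i - b_i) % Q for y_i, b_i in zip(y_sh, b_sh)])

# Each party derives its share of x*y; the public e*f term is added by one party.
z_sh = [(c_i + e * b_i + f * a_i) % Q for c_i, a_i, b_i in zip(c_sh, a_sh, b_sh)]
z_sh[0] = (z_sh[0] + e * f) % Q
assert open_shares(z_sh) == (x * y) % Q
```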

Algorithm 1

Regression analysis via multiple imputation using SMC

INPUT:

\([{\bf{X}}]\in {{\mathbb{Z}}}_{p}^{m\times n}\): secret shared incomplete training data

\([{\bf{y}}]\in {{\mathbb{Z}}}_{p}^{m}\): secret shared training labels

\({\bf{M}}\in {\{0,1\}}^{m\times n}\): public missing data mask

\({{\mathcal{M}}}_{im}\): SMC imputation model

\({{\mathcal{M}}}_{f}\): SMC final analysis model

k: public number of multiple imputations

OUTPUT:

\([{\bf{c}}]\in {{\mathbb{Z}}}_{p}^{n+1}\): secret shared final analysis model coefficients

1: procedure SMC_MICE_ANALYSIS \(([{\bf{X}}],[{\bf{y}}],{\bf{M}},{{\mathcal{M}}}_{im},{{\mathcal{M}}}_{f},k)\)

2: \([{\bf{C}}]\leftarrow [{\bf{0}}]\in {{\mathbb{Z}}}_{p}^{k\times (n+1)}\) Secret shared zeros

3: for \(j=\overline{0,\ldots {\mathtt{k}}}\) do

4: \({\mathtt{smc}}\_{\mathtt{mice}}\_{\mathtt{impute}}({{\mathcal{M}}}_{im},[{\bf{X}}],{\bf{M}})\)

5: \({\mathtt{smc}}\_{\mathtt{fit}}({{\mathcal{M}}}_{f},[{\bf{X}}],[{\bf{y}}])\)

6: \({[{\bf{C}}]}_{j}\leftarrow {\mathtt{get}}\_{\mathtt{coeffs}}({{\mathcal{M}}}_{f})\)

7: end for

8: [c] ← smc_rubin([C])

9: return [c]

10: end procedure

Algorithm 2

Imputation algorithm via chained equations using SMC

INPUT:

\({{\mathcal{M}}}_{im}\): imputation model

\([{\bf{D}}]\in {{\mathbb{Z}}}_{p}^{m\times n}\): secret shared incomplete training data

\({\bf{M}}\in {\{0,1\}}^{m\times n}\): missing data mask

1: procedure SMC_MICE_IMPUTE\(({{\mathcal{M}}}_{im},[{\bf{D}}],{\bf{M}})\)

2: \(n\leftarrow {\mathtt{len}}([{\bf{D}}])\)

3: for \(j=\overline{0,\ldots n}\) do

4: \([{\bf{C}}]\leftarrow {[{\bf{D}}]}_{{\bf{M}}=1}\) Filter only complete data

5: \([{\bf{X}}]\leftarrow {[{\bf{C}}]}_{:,k\ne j}\)

6: \([{\bf{y}}]\leftarrow {[{\bf{C}}]}_{:,j}\)

7: \({\mathtt{smc}}\_{\mathtt{fit}}({{\mathcal{M}}}_{im},[{\bf{X}}],[{\bf{y}}])\)

8: \(\epsilon \leftarrow {\mathcal{N}}(0,0.01)\) Draw error from normal distribution

9: \([\widehat{{\bf{y}}}]\leftarrow {\mathtt{smc}}\_{\mathtt{predict}}({{\mathcal{M}}}_{im},{([{\bf{D}}]\cdot {\bf{M}})}_{:,k\ne j},\epsilon )\)

10: \({[{\bf{D}}]}_{:,j}\leftarrow [\widehat{{\bf{y}}}]\)

11: \({{\bf{M}}}_{:,j}\leftarrow {\bf{1}}\in {{\mathbb{R}}}^{m\times 1}\)

12: end for

13: end procedure

Algorithm 3

Linear regression via SMC (using batched instead of mini-batched gradient descent to simplify)

INPUT:

\([{\bf{X}}]\in {{\mathbb{Z}}}_{p}^{m\times n}\): secret shared training data

\([{\bf{y}}]\in {{\mathbb{Z}}}_{p}^{m\times 1}\): secret shared training labels

\({\mathcal{M}}\): linear regression model that stores initial, secret shared weights \(([{{\bf{w}}}_{{\mathcal{M}}}]\in {{\mathbb{Z}}}_{p}^{(n+1)\times 1})\), number of training epochs \(({e}_{{\mathcal{M}}}\in {\mathbb{N}})\), and step size \(({\eta }_{{\mathcal{M}}}\in {\mathbb{R}})\)

1: procedure SMC_FIT\(({\mathcal{M}},[{\bf{X}}],[{\bf{y}}])\)

2: \([\widetilde{{\bf{X}}}]\leftarrow ({\bf{X}}\parallel {\bf{1}})\) Append bias column

3: \([{\bf{C}}]\leftarrow {[\widetilde{{\bf{X}}}]}^{\top }\times [\widetilde{{\bf{X}}}]\)

4: \([{\bf{R}}]\leftarrow {[\widetilde{{\bf{X}}}]}^{\top }\times [{\bf{y}}]\)

5: if \({\mathtt{len}}({[\widetilde{{\bf{X}}}]}^{\top }) < 4\) then Closed-form solution

6: \([{{\bf{w}}}_{{\mathcal{M}}}]\leftarrow {[{\bf{C}}]}^{-1}\times [{\bf{R}}]\)

7: else Batched gradient descent

8: for \(j=\overline{0,\ldots {e}_{{\mathcal{M}}}}\) do

9: \([{{\bf{w}}}_{{\mathcal{M}}}]\leftarrow [{{\bf{w}}}_{{\mathcal{M}}}]+([{\bf{R}}]-[{\bf{C}}]\times [{{\bf{w}}}_{{\mathcal{M}}}])\cdot {\eta }_{{\mathcal{M}}}\)

10: end for

11: end if

12: end procedure

Algorithm 4

Logistic regression via SMC (using batched instead of mini-batched gradient descent to simplify)

INPUT:

\([{\bf{X}}]\in {{\mathbb{Z}}}_{p}^{m\times n}\): secret shared training data

\([{\bf{y}}]\in {{\mathbb{Z}}}_{p}^{m\times 1}\): secret shared training labels

\({\mathcal{M}}\): logistic regression model that stores initial weights \(({{\bf{w}}}_{{\mathcal{M}}}\in {{\mathbb{Z}}}_{p}^{(n+1)\times 1})\), number of training epochs \(({e}_{{\mathcal{M}}}\in {\mathbb{N}})\), and step size \(({\eta }_{{\mathcal{M}}}\in {\mathbb{R}})\)

1: procedure SMC_FIT \(({\mathcal{M}},[{\bf{X}}],[{\bf{y}}])\)

2: \([\widetilde{{\bf{X}}}]\leftarrow ({\bf{X}}\parallel {\bf{1}})\) Append bias column

3: for \(j=\overline{0,\ldots {e}_{{\mathcal{M}}}}\) do

4: \([{\bf{A}}]\leftarrow {\sigma }_{cheby}([\widetilde{{\bf{X}}}]\times [{{\bf{w}}}_{{\mathcal{M}}}],(0,1))\)

5: \([{{\bf{w}}}_{{\mathcal{M}}}]\leftarrow [{{\bf{w}}}_{{\mathcal{M}}}]-{[\widetilde{{\bf{X}}}]}^{\top }\times ([{\bf{A}}]-[{\bf{y}}])\cdot {\eta }_{{\mathcal{M}}}\)

6: end for

7: end procedure

MHE-MICE is conceptually the same as its SMC counterpart (see Algorithm 5 and Algorithm 6). The main difference lies in the input data format and the implementation of elementary matrix algebra operations. Namely, the input data to MHE-MICE is initially kept local and non-encrypted at each party, and only encrypted and shared when needed throughout the computation. This, for example, allows the pre-processing steps in the imputation algorithm (Algorithm 6) to be done independently at each party on local, non-encrypted data. Generally, any element-wise operation, such as addition, subtraction, and multiplication, is computed in the same manner, independently at each party, as long as the partition sizes of the operands are aligned between the parties. Moreover, computing the invariants in linear regression (i.e., the Gramian matrix \({\widetilde{X}}^{\top }\times \widetilde{X}\) and \({\widetilde{X}}^{\top }\times y\)) is also done independently at each party, since the product of a vertically and a horizontally partitioned matrix is an additively partitioned matrix, with each partition being a product of the corresponding non-encrypted local shares. Some operations, however, require one of the operands to be aggregated (i.e., encrypted and shared among the parties) beforehand. For example, matrix multiplication of two additively partitioned matrices requires at least one operand to be aggregated beforehand; the result is then obtained by multiplying each additive share with the aggregated counterpart independently at each party. The aggregation strategy (i.e., deciding whether to aggregate the first or the second operand) directly impacts the partitioning of the result and the performance of all downstream operations. For example, aggregating the weights \([{{\bf{w}}}_{{\mathcal{M}}}]\) instead of the training data \(\widetilde{X}\) in the logistic regression in Algorithm 8 would result in a completely different algorithm downstream. In our implementation, it is better to aggregate \(\widetilde{X}\) first to avoid aggregating \([{{\bf{w}}}_{{\mathcal{M}}}]\) multiple times within the loop body. Moreover, multiplying a vertically partitioned matrix by a horizontally partitioned one, and multiplying two additively partitioned matrices, are the only two matrix multiplication instances encountered in our implementations of linear and logistic regression.
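
The plaintext sketch below illustrates why the Gramian invariants require no interaction for horizontally partitioned data: the global \({\widetilde{X}}^{\top }\times \widetilde{X}\) is the sum of the per-party Gramians, so each party's local product is already an additive share of the result (encryption and aggregation are omitted here for clarity).

```python
# For horizontally partitioned X, the global Gram matrix X^T X equals the sum
# of the local Gram matrices X_i^T X_i, i.e., an additively partitioned matrix.
import numpy as np

rng = np.random.default_rng(0)
partitions = [rng.normal(size=(m_i, 4)) for m_i in (100, 250, 60)]  # three parties

local_grams = [Xi.T @ Xi for Xi in partitions]         # computed locally, no sharing
global_gram = sum(local_grams)                         # aggregation of additive shares

X_pooled = np.vstack(partitions)
assert np.allclose(global_gram, X_pooled.T @ X_pooled)
```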

Algorithm 5

Regression analysis via multiple imputation using MHE

INPUT:

\(X\in {{\mathbb{R}}}^{{m}_{i}\times n}\): incomplete training data partition held locally at ith party

\(y\in {{\mathbb{R}}}^{{m}_{i}}\): training labels partition held locally at ith party

\(M\in {\{0,1\}}^{{m}_{i}\times n}\): missing data mask held locally at ith party

\({{\mathcal{M}}}_{im}\): MHE imputation model

\({{\mathcal{M}}}_{f}\): MHE final analysis model

k: public number of multiple imputations

N: number of CKKS slots

\({\mathcal{C}}\): CKKS ciphertext space (i.e., \({({{\mathbb{Z}}}_{p}[X]/({X}^{N}+1))}^{2}\))

OUTPUT:

\({\bf{c}}\in {{\mathcal{C}}}^{\lceil n/N\rceil }\): aggregated final analysis model coefficients

1: procedure MHE_MICE_ANALYSIS \((X,y,M,{{\mathcal{M}}}_{im},{{\mathcal{M}}}_{f},k)\)

2: \({\bf{C}}\leftarrow {\bf{0}}\in {{\mathcal{C}}}^{k\times \lceil (n+1)/N\rceil }\) CKKS encrypted zeros

3: for \(j=\overline{0,\ldots {\mathtt{k}}}\) do

4: \({\mathtt{mhe}}\_{\mathtt{mice}}\_{\mathtt{impute}}({{\mathcal{M}}}_{im},X,M)\)

5: \({\mathtt{mhe}}\_{\mathtt{fit}}({{\mathcal{M}}}_{f},X,y)\)

6: \({{\bf{C}}}_{j}\leftarrow {\mathtt{get}}\_{\mathtt{coeffs}}({{\mathcal{M}}}_{f})\)

7: end for

8: \({\bf{c}}\leftarrow {\mathtt{mhe}}\_{\mathtt{rubin}}({\bf{C}})\)

9: return c

10: end procedure

Algorithm 6

Imputation algorithm via chained equations using MHE

INPUT:

\({{\mathcal{M}}}_{im}\): imputation model

\(D\in {{\mathbb{R}}}^{{m}_{i}\times n}\): incomplete training data partition held locally at ith party

\(M\in {\{0,1\}}^{{m}_{i}\times n}\): missing data mask held locally at ith party

\({\mathcal{C}}\): CKKS ciphertext space (i.e., \({({{\mathbb{Z}}}_{p}[X]/({X}^{N}+1))}^{2}\))

1: procedure MHE_MICE_IMPUTE\(({{\mathcal{M}}}_{im},D,M)\)

2: \(n\leftarrow {\mathtt{len}}(D)\)

3: for \(j=\overline{0,\ldots n}\) do

4: \(C\leftarrow {D}_{M=1}\) Filter only complete data at each party

5: \(X\leftarrow {C}_{:,k\ne j}\)

6: \(y\leftarrow {C}_{:,j}\)

7: \({\mathtt{mhe}}\_{\mathtt{fit}}({{\mathcal{M}}}_{im},X,y)\)

8: \(\epsilon \leftarrow {\mathcal{N}}(0,0.01)\) Draw error from normal distribution

9: \(\widehat{y}\leftarrow {\mathtt{mhe}}\_{\mathtt{predict}}({{\mathcal{M}}}_{im},{(D\cdot M)}_{:,k\ne j},\epsilon )\) Local partition of imputed column

10: \({D}_{:,j}\leftarrow \widehat{y}\)

11: \({M}_{:,j}\leftarrow {\bf{1}}\in {{\mathbb{R}}}^{{m}_{i}\times 1}\)

12: end for

13: end procedure

Algorithm 7

Linear regression via MHE (using batched instead of mini-batched gradient descent to simplify)

INPUT:

N: number of CKKS slots

\({\mathcal{C}}\): CKKS ciphertext space (i.e., \({({{\mathbb{Z}}}_{p}[X]/({X}^{N}+1))}^{2}\))

\(X\in {{\mathbb{R}}}^{{m}_{i}\times n}\): training data partition held locally at ith party

\(y\in {{\mathbb{R}}}^{{m}_{i}\times 1}\): training labels partition held locally at ith party

\({\mathcal{M}}\): linear regression model that stores initial, aggregated weights \(([{{\bf{w}}}_{{\mathcal{M}}}]\in {{\mathcal{C}}}^{\lceil (n+1)/N\rceil \times 1})\), number of training epochs \(({e}_{{\mathcal{M}}}\in {\mathbb{N}})\), and step size \(({\eta }_{{\mathcal{M}}}\in {\mathbb{R}})\)

1: procedure MHE_FIT \(({\mathcal{M}},X,y)\)

2: \(\widetilde{X}\leftarrow (X\parallel {\bf{1}})\) Append bias column locally at each party

3: \([C]\leftarrow {\widetilde{X}}^{\top }\times \widetilde{X}\) Additively shared local partitions of \({\widetilde{X}}^{\top }\times \widetilde{X}\)

4: \([R]\leftarrow {\widetilde{X}}^{\top }\times y\) Additively shared local partitions of \({\widetilde{X}}^{\top }\times y\)

5: if \({\mathtt{len}}({\widetilde{X}}^{\top }) < 4\) then Closed-form solution

6: Raggregate([R])

7: \([{{\bf{w}}}_{{\mathcal{M}}}]\leftarrow {[C]}^{-1}\times {\bf{R}}\)

8: else Batched gradient descent

9: for \(j=\overline{0,\ldots {e}_{{\mathcal{M}}}}\) do

10: \({{\bf{w}}}_{{\mathcal{M}}}\leftarrow {\mathtt{aggregate}}([{{\bf{w}}}_{{\mathcal{M}}}])\)

11: \([{{\bf{w}}}_{{\mathcal{M}}}]\leftarrow [{{\bf{w}}}_{{\mathcal{M}}}]+([R]-[C]\times {{\bf{w}}}_{{\mathcal{M}}})\cdot {\eta }_{{\mathcal{M}}}\)

12: end for

13: end if

14: end procedure

Algorithm 8

Logistic regression via MHE (using batched instead of mini-batched gradient descent to simplify)

INPUT:

N: number of CKKS slots

\({\mathcal{C}}\): CKKS ciphertext space (i.e., \({({{\mathbb{Z}}}_{p}[X]/({X}^{N}+1))}^{2}\))

\(X\in {{\mathbb{R}}}^{{m}_{i}\times n}\): training data partition held locally at ith party

\(y\in {{\mathbb{R}}}^{{m}_{i}\times 1}\): training labels partition held locally at ith party

\({\mathcal{M}}\): logistic regression model that stores initial, aggregated weights \(([{{\bf{w}}}_{{\mathcal{M}}}]\in {{\mathcal{C}}}^{\lceil (n+1)/N\rceil \times 1})\), number of training epochs \(({e}_{{\mathcal{M}}}\in {\mathbb{N}})\), and step size \(({\eta }_{{\mathcal{M}}}\in {\mathbb{R}})\)

1: procedure MHE_FIT \(({\mathcal{M}},X,y)\)

2: \(\widetilde{X}\leftarrow (X\parallel {\bf{1}})\) Append bias column locally at each party

3: \(\widetilde{{\bf{X}}}\leftarrow {\mathtt{aggregate}}(\widetilde{X})\)

4: \({\widetilde{{\bf{X}}}}^{\top }\leftarrow {\mathtt{aggregate}}({\widetilde{X}}^{\top })\)

5: for \(j=\overline{0,\ldots {e}_{{\mathcal{M}}}}\) do

6: \([{\bf{P}}]\leftarrow \widetilde{{\bf{X}}}\times [{{\bf{w}}}_{{\mathcal{M}}}]\)

7: [A] ← σcheby([P], (0, 1))

8: \([{{\bf{w}}}_{{\mathcal{M}}}]\leftarrow [{{\bf{w}}}_{{\mathcal{M}}}]-{\widetilde{{\bf{X}}}}^{\top }\times ([{\bf{A}}]-y)\cdot {\eta }_{{\mathcal{M}}}\)

9: end for

10: end procedure

We implemented both SMC- and MHE-MICE in Sequre24,25, a Codon-based37, Pythonic domain-specific language for high-performance SMC, in less than 550 lines of high-level Pythonic code. To enable compilation to MHE, we extended Sequre with support for specialized distributed data types and compiler optimization passes that orchestrate multiparty HE computing and automatically handle workload distribution, data aggregation, and other intrinsic aspects of HE, such as ciphertext maintenance and encoding29. Specifically, our compile-time optimization passes reduce the multiplication depth of arithmetic expressions, prioritize computing on non-encrypted over the more expensive encrypted data, and find an optimal aggregation and encoding strategy for the distributed data types. The distributed data types enable arithmetic on private data collectively stored at multiple computing parties, where the data is kept non-encrypted at each party and only partially encrypted when needed throughout the computation. Finally, to enable the essential homomorphic encryption operations (encryption, addition, multiplication, and rotation) and distributed HE operations such as collective bootstrapping, decryption, and switching to secret sharing, we re-implemented Lattigo’s38 distributed CKKS scheme in Codon.