Generating synthetic blood transfusion data for haemoglobin deferral prediction

20 Jun 2023

This session took place on June 20, 2023, during the 33rd Regional ISBT Congress, held in Gothenburg, Sweden, from June 17-21.

The session "Blood services wider contribution to public health?" included the following presentations:

1. Mike Busch: Important role for biorepositories in global surveillance and epidemiological studies
2. Khoa Manh Dinh: The effect of COVID-19 interventions on virus nasal carriage among Danish blood donors
3. Bertram Kjerulff: Influence of sex, age, BMI, and smoking on 47 circulating inflammatory and vascular stress biomarkers in 9,876 healthy individuals – Results from the Danish Blood Donor Study
4. Mars Stone: Development of a Nationwide Repeat Blood Donor Cohort to Monitor SARS-CoV-2 Serosurveillance and Population Immunity
5. Mart Janssen: Generating synthetic blood transfusion data for haemoglobin deferral prediction

MODERATORS: Antoine Lewin, Ole Birger Pedersen

The presentation was followed by a question-and-answer session, which is also included in the recording.

Abstract

Generating synthetic blood transfusion data for haemoglobin deferral prediction

S Kroes¹, M van Leeuwen², R Groenwold³, M Janssen¹

¹Donor Medicine Research, Sanquin Blood Supply Foundation, Amsterdam, ²Leiden Institute of Advanced Computer Science, Leiden University, ³Clinical Epidemiology, Leiden University Medical Hospital, Leiden, Netherlands

Background: Synthetic data generation is becoming an increasingly popular approach to make privacy-sensitive data available for analysis. Recently, we proposed an approach for synthetic data generation (Kroes, Journal of the American Medical Informatics Association, 2022) by means of a mixed sum-product network (MSPN), which demonstrated both high utility and privacy in simulations, but the method has not yet been applied to real-world personal data.

Aims: To test whether the MSPN approach can generate an anonymised dataset from personal blood donor data that is capable of reproducing the analysis results obtained from the original dataset.

Methods: Data from the Dutch national blood bank consisting of 250,729 donation records were used to predict donor haemoglobin levels by means of support vector machine (SVM) models. These analyses were replicated with synthetic data generated with the MSPN approach. Privacy was evaluated by quantifying to what extent sensitive information can be extracted by using background information (i.e., attribute disclosure), whereas the quality of the analyses was evaluated by comparing precision and recall of the SVM models and the importance ranking of various predictor variables.
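The utility comparison described above can be illustrated with a minimal sketch. It is not the study's code: the SVM training step is omitted, and the label lists, the function names `precision_recall` and `agreement`, and the coding of 1 as "deferral" are all assumptions made for illustration. It shows only how precision, recall, and the share of identical predictions between two models might be computed.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def agreement(pred_a, pred_b):
    """Fraction of records on which two models give the same prediction."""
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction lists must have equal length")
    return sum(a == b for a, b in zip(pred_a, pred_b)) / len(pred_a)


# Hypothetical toy data: 1 = deferral, 0 = no deferral.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
pred_original = [1, 0, 1, 0, 0, 1, 1, 0]   # model trained on original data
pred_synthetic = [1, 0, 1, 0, 0, 0, 1, 0]  # model trained on synthetic data

prec_orig, rec_orig = precision_recall(y_true, pred_original)
prec_syn, rec_syn = precision_recall(y_true, pred_synthetic)
share_same = agreement(pred_original, pred_synthetic)

print(f"original:  precision={prec_orig:.3f} recall={rec_orig:.3f}")
print(f"synthetic: precision={prec_syn:.3f} recall={rec_syn:.3f}")
print(f"identical predictions: {share_same:.1%}")
```

In this sketch both models are scored against the same held-out labels, so differences in precision and recall reflect only the training data each model saw.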

Results: Predictions from the SVM models trained on synthetic data matched the predictions of the original SVM models in 96% of cases. Precision was equal for both male and female donors; recall was 0.003 higher for male donors and 0.009 lower for female donors. The importance of the variables for Hb predictions, quantified and visualised with Shapley additive explanation values, was very similar. Opportunities for attribute disclosure were removed for all but two variables: only the binary variables “Deferral Status” and “Sex” could still be inferred.
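To make the notion of attribute disclosure concrete, the sketch below uses a deliberately simplified attacker model. It is an assumption for illustration and not the metric used in the study: the attacker knows some background variables of a real donor, looks up synthetic records with the same values, and guesses the sensitive variable by majority vote. The function name `disclosure_rate`, the column names, and the toy records are all hypothetical.

```python
from collections import Counter, defaultdict


def disclosure_rate(synthetic, original, background, sensitive):
    """Share of original records whose sensitive value is guessed correctly.

    The guess is the most common sensitive value among synthetic records
    that match the original record on all background variables. Records
    without any match count as not disclosed.
    """
    groups = defaultdict(Counter)
    for row in synthetic:
        key = tuple(row[b] for b in background)
        groups[key][row[sensitive]] += 1

    correct = 0
    for row in original:
        key = tuple(row[b] for b in background)
        if key in groups:
            guess = groups[key].most_common(1)[0][0]
            if guess == row[sensitive]:
                correct += 1
    return correct / len(original)


# Hypothetical toy records; column names are illustrative only.
synthetic = [
    {"sex": "F", "age_band": "30-39", "deferred": 0},
    {"sex": "F", "age_band": "30-39", "deferred": 0},
    {"sex": "F", "age_band": "30-39", "deferred": 1},
    {"sex": "M", "age_band": "40-49", "deferred": 0},
]
original = [
    {"sex": "F", "age_band": "30-39", "deferred": 0},
    {"sex": "F", "age_band": "30-39", "deferred": 1},
    {"sex": "M", "age_band": "40-49", "deferred": 0},
    {"sex": "M", "age_band": "50-59", "deferred": 1},
]

rate = disclosure_rate(synthetic, original, ["sex", "age_band"], "deferred")
print(f"correctly inferred: {rate:.0%}")
```

For a binary variable, a rate like this is only informative relative to a baseline such as always guessing the most frequent value, which may be one reason binary variables are harder to protect against inference.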

Summary/Conclusions: The similarities in predictions and predictive reasoning between the SVMs based on original and synthetic data indicate that the synthetic data generated by the MSPNs could be used instead of the original data without compromising predictive performance. This indicates the potential of this method for data sharing and explorative data exchange in practice. Future research should be targeted at further reducing the risk of attribute disclosure.