Quantum Machine Learning (QML) techniques have recently attracted considerable interest. However, reported applications usually employ synthetic or well-known datasets. One of these techniques, the Variational Quantum Classifier (VQC), follows a hybrid approach that combines quantum and classical devices, and its development seems promising. Although VQC has been widely studied, implementations for “real-world” datasets remain challenging on Noisy Intermediate Scale Quantum (NISQ) devices. In this paper we propose a preprocessing pipeline based on Stokes parameters for data mapping. This pipeline improves the prediction rates obtained with VQC techniques, making it more feasible to solve classification problems on NISQ devices. By including feature selection techniques and geometrical transformations, enhanced quantum state preparation is achieved. In addition, representing the data through the Stokes parameters makes it possible to visualize the data on the Poincaré sphere. Our results show that the proposed techniques improve the classification score for the incidence of acute comorbid diseases in Type 2 Diabetes Mellitus patients. We used the VQC implementation available in IBM’s Qiskit framework and obtained accuracies of 70% and 72% with two and three qubits, respectively.

Several efforts have been made in recent years to advance quantum software capable of exploiting the power of the available Noisy Intermediate Scale Quantum (NISQ) devices. These devices are being developed on a variety of hardware platforms and technologies, with a number of qubits ranging from fifty to a few hundred [

Quantum Machine Learning (QML) is one of the most encouraging applications, being actively studied by several research groups [

Several approaches to encode classical data into quantum states have been presented. These describe advantages including the experimental overhead reduction in terms of resources and the introduction of non-linearities in the data [

Multiple examples of the use of quantum computing techniques in machine learning applications on well-known datasets have been presented. Datasets such as MNIST, Wine, Cancer and Iris are commonly used to test these approaches [

In this paper we present a real case study of Type 2 Diabetes Mellitus (T2DM). This disease is the fourth cause of mortality and, with its rising prevalence, is a major public health problem [

This paper is organized as follows. First, we present the proposed data preprocessing pipeline and describe each of its stages: feature scaling and selection, ellipsoidal coordinate mapping, and data representation through Stokes parameters. Then we describe the quantum classifier employed and the experiments performed to classify the incidence of acute comorbidities in Type 2 Diabetes Mellitus (T2DM) patients. Finally, we present the results obtained, followed by a discussion and conclusions.

The current limitations of NISQ devices impose restrictions on Quantum Machine Learning techniques [

This step ensures the same scaling for all numerical inputs of the model, improving the accuracy and speed of the optimization methods used during training. In general, this is a required step in the data preprocessing pipeline for most classical Machine Learning techniques [
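As an illustration of the scaling step, the following minimal sketch rescales one numerical feature linearly to the interval (−1, 1). It assumes plain min-max scaling; the function name `scale_to_interval` and the sample values are hypothetical and are not taken from the study.

```python
def scale_to_interval(values, low=-1.0, high=1.0):
    """Linearly rescale numbers so that min(values) -> low and max(values) -> high."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Constant feature: map every value to the midpoint of the interval.
        return [(low + high) / 2.0 for _ in values]
    span = v_max - v_min
    return [low + (high - low) * (v - v_min) / span for v in values]


# Hypothetical LDL-cholesterol readings, for illustration only.
ldl = [70.0, 100.0, 130.0, 160.0]
scaled = scale_to_interval(ldl)
```

Applying the same function to every numerical column puts all inputs on a common range before feature selection and encoding.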

Feature selection techniques are based on the idea of identifying and removing less relevant or redundant features, providing faster and more cost-effective predictors [

In this sense, we based our methodology on variable ranking, calculating the mean of the scores obtained from the feature importance of four different classical classification methods. Our main goal was to find the subset of features from our dataset that gives the best classification performance with the minimum number of features, considering both the encoding transformation used to define quantum states and the limitations of current quantum devices in terms of quantum volume. Therefore, we selected the top three features according to the calculated score; this dimensionality constraint is imposed by the ellipsoidal coordinate mapping. Our chosen classification methods were Gradient Boosting [
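The ranking described above can be sketched as follows. This is a minimal illustration, assuming that each method yields one importance score per feature on a comparable scale; the function name `rank_features_by_mean_score` and all scores below are made up for the example and are not results from the study.

```python
def rank_features_by_mean_score(scores_by_method):
    """Average each feature's importance across methods and sort in descending order."""
    n_methods = len(scores_by_method)
    features = next(iter(scores_by_method.values())).keys()
    mean_scores = {
        feature: sum(scores[feature] for scores in scores_by_method.values()) / n_methods
        for feature in features
    }
    return sorted(mean_scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical importance scores from four feature selectors (illustrative values only).
scores_by_method = {
    "gradient_boosting": {"ldl": 0.35, "gender": 0.30, "adg": 0.25, "age": 0.10},
    "random_forest": {"ldl": 0.30, "gender": 0.28, "adg": 0.32, "age": 0.10},
    "k_best": {"ldl": 0.40, "gender": 0.20, "adg": 0.30, "age": 0.10},
    "tree_based": {"ldl": 0.25, "gender": 0.35, "adg": 0.30, "age": 0.10},
}

ranking = rank_features_by_mean_score(scores_by_method)
top_three = [name for name, _ in ranking[:3]]  # keep only the three best-ranked features
```

Truncating the ranking to three entries mirrors the dimensionality constraint imposed by the ellipsoidal coordinate mapping.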

The selected features were transformed into a coordinate space where they can be easily represented using the Poincaré sphere. We use an iterative method based on [

where

The constant

and

We executed the method specifying a dispersion error of 10^{−5} for each data point.

A convenient geometrical representation of quantum states is obtained by using the Bloch sphere, also known as the Poincaré sphere, which has been used to describe polarization states by means of the Stokes parameters. We define the data in terms of ellipsoids; these definitions are mathematically analogous to the Stokes parameters used to describe polarization, although they have no physical relation to them. The ellipse parameters are represented by:

where
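For readers unfamiliar with the Stokes parameters, the sketch below computes the normalized parameters from the two angles of a polarization ellipse, using the standard textbook relations for unit intensity and full polarization. It is an illustration only and not necessarily the exact mapping used in this work; the function name `stokes_from_ellipse` and the symbols `psi` (orientation angle) and `chi` (ellipticity angle) are assumptions introduced for this example.

```python
import math


def stokes_from_ellipse(psi, chi):
    """Return normalized Stokes parameters (S1, S2, S3) for angles given in radians.

    psi is the orientation angle and chi the ellipticity angle of the ellipse.
    The result is a point on the unit Poincaré sphere.
    """
    s1 = math.cos(2 * chi) * math.cos(2 * psi)
    s2 = math.cos(2 * chi) * math.sin(2 * psi)
    s3 = math.sin(2 * chi)
    return s1, s2, s3


# Example: an arbitrary ellipse maps to a point at unit distance from the origin.
s1, s2, s3 = stokes_from_ellipse(psi=0.3, chi=0.2)
radius = math.sqrt(s1 ** 2 + s2 ** 2 + s3 ** 2)
```

Because the three components always lie on the unit sphere, they can be plotted directly as points on the Poincaré sphere.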

The Variational Quantum Classifier (VQC) is a quantum method for supervised learning that makes it possible to solve classification problems on current NISQ devices. Based on a method proposed by Havlíček et al. [

One of the key components of this method is the definition of the feature map, which maps data into a potentially vastly higher-dimensional Hilbert space of a quantum system [

Havlíček et al. [

Diabetes Type 2 is a rising public health problem [

Several T2DM-related complications have been studied through different classical Machine Learning, Deep Learning and Data Mining techniques [

A dataset containing clinical information from patients diagnosed with Type 2 Diabetes Mellitus has been used. For each subject a successive 12-month time period was defined; a patient is considered diagnosed with T2DM during this period if the disease onset date was prior to the established cut-off point. Following these criteria, the total study population was 149,015 patients, filtered from a larger database of Electronic Health Records (EHR) from Osakidetza (Basque Health Service) in Bilbao, Spain. The dataset includes clinical variables such as LDL cholesterol, Body Mass Index and glycated hemoglobin (A1C). Demographic variables including age, gender and socioeconomic status were also considered.

The study protocol was approved by the Clinical Research Ethics Committee of Euskadi (PI2014074), Spain. Informed consent was not obtained because patient health records were made anonymous and de-identified prior to analysis.

In particular, our concern is the prediction of acute conditions: we studied the incidence of acute myocardial infarction, major amputation or avoidable hospitalizations. Following the methodology discussed in Section 2.2 for feature selection, we used gender, LDL cholesterol and Johns Hopkins’ Aggregated Diagnosis Groups (ADG). These features were among the highest scored when gradient boosting, random forest, k-best and tree-based techniques were applied as feature selectors.

Furthermore, the included features are relevant for the care of diabetes patients: there is a strong correlation between diabetes and cholesterol [

Then we randomly selected a balanced set of 250 samples from the data. These were split into two subsets, 200 for training and 50 for testing. After preprocessing, the data points can be represented on the Poincaré sphere, as depicted in
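The sampling and splitting step can be sketched as below. This is a minimal illustration, assuming two classes, an equal number of samples drawn per class, and a simple shuffle before the 200/50 split; whether the original split was stratified is not stated. The function name `balanced_sample_and_split`, the `label_of` callback and the toy records are hypothetical.

```python
import random


def balanced_sample_and_split(records, label_of, n_total=250, n_train=200, seed=0):
    """Draw the same number of records per class, shuffle, and split into train/test."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    by_class = {}
    for record in records:
        by_class.setdefault(label_of(record), []).append(record)
    per_class = n_total // len(by_class)
    sample = []
    for items in by_class.values():
        # rng.sample raises ValueError if a class has fewer than per_class records.
        sample.extend(rng.sample(items, per_class))
    rng.shuffle(sample)
    return sample[:n_train], sample[n_train:]


# Toy records as (patient_id, label) pairs; the labels are synthetic.
records = [(i, i % 2) for i in range(1000)]
train, test = balanced_sample_and_split(records, label_of=lambda r: r[1])
```

With two classes this draws 125 records per class, giving 200 training and 50 test samples.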

Data classification was performed by using the implemented version of VQC in IBM’s framework Qiskit version 0.11.1 and executed in the provided simulator Aer version 0.2.3 [

Data dimension | Metric | Preprocessing: Normalized (−1,1) and zero standard deviation | Preprocessing: Ellipsoidal transform | Preprocessing: Poincaré sphere
---|---|---|---|---
3 | Accuracy | 0.680 | 0.580 | 0.720
3 | Precision | 0.758 | 0.586 | 0.758
3 | Recall | 0.709 | 0.653 | 0.758
3 | F1-Score | 0.733 | 0.618 | 0.758
2 | Accuracy | 0.500 | 0.620 | 0.700
2 | Precision | 0.413 | 0.689 | 0.793
2 | Recall | 0.600 | 0.666 | 0.718
2 | F1-Score | 0.489 | 0.677 | 0.754
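The four metrics in the table follow the usual definitions for binary classification. The sketch below computes them from confusion-matrix counts; the function name `classification_metrics` and the counts are hypothetical, chosen for a 50-sample test set, and are not taken from the experiments. It assumes that none of the denominators is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # fraction of predicted positives that are correct
    recall = tp / (tp + fn)     # fraction of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for a 50-sample test set (illustrative only).
metrics = classification_metrics(tp=20, fp=5, fn=5, tn=20)
```

Reporting precision and recall alongside accuracy helps show whether a classifier favors one class even when the test set is roughly balanced.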

To show the advantage of using the proposed pipeline and its elements, we performed three experiments using VQC on the T2DM dataset. In the first experiment we normalized the data using zero standard deviation before passing it to the model. In the second experiment, in addition to the normalization, we transformed the data to ellipsoidal coordinates. Finally, in the third experiment we calculated the Stokes parameters in addition to all the previous steps, which makes it possible to visualize the data points on a Poincaré sphere. By including all the data preprocessing elements, we obtained a pipeline that enhances data preparation for the application of VQC. The results of these experiments are presented in

Research on Quantum Machine Learning applications is advancing the use of current quantum computers, given the wide range of applications and the industry interest in machine learning techniques to solve practical problems. We consider that this work contributes to the use of new techniques for exploiting NISQ devices in “real-world” applications of QML.

A milestone to pursue is achieving quantum advantage in commercial applications. Machine learning is an area of computer science where statistics, data processing and analytics converge, and it is relevant across many fields given the importance of data and the breadth of its applications. In particular, Quantum Machine Learning is being actively investigated by several research groups, as exploiting the advantages of quantum computing could improve and expand the range of real-world machine learning applications.

In this paper we propose a pipeline to transform and preprocess data, making it feasible to classify the data using Quantum Machine Learning techniques. By using this pipeline, we enhanced the quantum state preparation for the VQC algorithm. Our results show that, with the proposed techniques, a Variational Quantum Classifier obtains similar results with two and three qubits when classifying the incidence of acute diseases in diabetes patients, with 70% and 72% accuracy, respectively. We are currently studying and developing unsupervised and supervised machine learning techniques suitable for NISQ devices, given the current limitations on coherence times and the number of qubits available on current devices. In particular, we are conducting further research on applying the proposed preprocessing pipeline to improve the suitability of data for other QML techniques, such as the Quantum Support Vector Machine. We are also evaluating the execution advantages of applying the proposed technique in different environments.

This work was supported by Osakidetza, which provided the database.