Let’s take a look at applying PCA to a dataset that has far more than just a few data series.
The Wisconsin breast cancer dataset from Scikit-learn contains thirty columns of data. The data apparently concerns the digitally-detected shape of the nuclei of cells in breast tumour biopsies. We are also told whether the diagnosis was “malignant” or “benign”.
Our goal is therefore to predict the diagnosis based on this mechanically-assembled information.
First let’s take a look at the data.
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import load_breast_cancer
    import numpy as np

    cancer = load_breast_cancer(as_frame=True)
    X = cancer['data']
    y = np.choose(cancer['target'], cancer['target_names'])

    print(X.columns)
    print(cancer['target_names'])
    Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
           'mean smoothness', 'mean compactness', 'mean concavity',
           'mean concave points', 'mean symmetry', 'mean fractal dimension',
           'radius error', 'texture error', 'perimeter error', 'area error',
           'smoothness error', 'compactness error', 'concavity error',
           'concave points error', 'symmetry error', 'fractal dimension error',
           'worst radius', 'worst texture', 'worst perimeter', 'worst area',
           'worst smoothness', 'worst compactness', 'worst concavity',
           'worst concave points', 'worst symmetry', 'worst fractal dimension'],
          dtype='object')
    ['malignant' 'benign']
You can see the data looks a lot like the iris data in terms of how we load it. It just contains different data series and target names. Instead of four measurements on iris petals and sepals, we have thirty measurements on tumour cell nuclei. Instead of three iris species, we have two possible diagnoses.
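Before modelling, it can help to check the size of the dataset, how the two diagnoses are balanced, and how different the scales of the columns are. The following is a minimal sketch of such a check; the choice of `mean area` and `mean smoothness` as example columns, and the variable names `class_counts` and `scales`, are just for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer(as_frame=True)
X = cancer['data']
y = np.choose(cancer['target'], cancer['target_names'])

# Number of rows (biopsies) and columns (measurements).
print(X.shape)

# How many cases there are of each diagnosis.
labels, counts = np.unique(y, return_counts=True)
class_counts = {str(label): int(count) for label, count in zip(labels, counts)}
print(class_counts)

# Summary statistics for two columns chosen for illustration,
# to show how much the scales of the data series can differ.
scales = X[['mean area', 'mean smoothness']].describe().loc[['mean', 'std', 'min', 'max']]
print(scales)
```

The two example columns differ in scale by several orders of magnitude, which is why the normalisation step matters before applying PCA.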
To try to predict the diagnosis on the basis of the cell nucleus shape data, we’ll take the following steps.
- Load the cancer data and map the integers to diagnosis names.
- Perform a train-test split, so we can use some data for training and some data for checking whether the predictive ability of the model is good or not.
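- Normalise the data so that each series has zero mean and unit variance. This is very important, because PCA is driven by variance, and we may have data series that exist on wildly disparate scales; left unscaled, the largest-scale series would dominate the components. We fit the scaler on the training data only and then apply it to the test data.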
- Apply principal component analysis to the data. Note that if we supply a floating-point number between 0 and 1 to the constructor of the Scikit-learn PCA model, it will create a sufficient number of components to explain that fraction of the variance. So 0.95 means “create a great enough number of components to explain 95% of the variance”.
- Fit a logistic regression model to the principal components of the training data.
- Use the model to make predictions on the test data segment.
- Display a confusion matrix with the results.
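As a side check on the note about passing a fraction to the PCA constructor, the sketch below compares the number of components chosen by `PCA(0.95)` with the number we would pick by hand from the cumulative explained variance. It is only an illustration under some assumptions: it fits on the whole scaled dataset rather than a training split, it sets `svd_solver='full'` explicitly, and it assumes the cumulative variance must strictly exceed the requested fraction, which is my reading of the Scikit-learn documentation. The names `full_pca`, `auto_pca` and `n_needed` are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer(as_frame=True)['data']
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with every component kept, then accumulate the
# fraction of variance explained by each successive component.
full_pca = PCA(svd_solver='full').fit(X_scaled)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share exceeds 0.95.
n_needed = int(np.searchsorted(cumulative, 0.95, side='right')) + 1

# Let Scikit-learn choose the number of components itself.
auto_pca = PCA(0.95, svd_solver='full').fit(X_scaled)

print("Chosen by hand:", n_needed)
print("Chosen by PCA(0.95):", auto_pca.n_components_)
```

If the two numbers agree, the fractional argument is behaving as described in the list above.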
Here’s the code.
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.decomposition import PCA
    from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
    from sklearn.preprocessing import StandardScaler
    import matplotlib.pyplot as plt
    import numpy as np

    # Load the data and map the integer targets to diagnosis names.
    cancer = load_breast_cancer(as_frame=True)
    X = cancer['data']
    y = np.choose(cancer['target'], cancer['target_names'])

    # Hold back 30% of the data for testing.
    X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, train_size=0.7)

    # Fit the scaler on the training data only, then apply it to both segments.
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Keep enough principal components to explain 95% of the variance.
    pca = PCA(0.95)
    components = pca.fit_transform(X_train)
    print("Number of components:", pca.n_components_)

    # Fit the model to the principal components of the training data.
    model = LogisticRegression()
    model.fit(components, y_train)

    # Project the test data onto the same components and predict.
    test_components = pca.transform(X_test)
    y_predicted = model.predict(test_components)

    # Display the confusion matrix.
    labels = cancer['target_names']
    cm = confusion_matrix(y_test, y_predicted, labels=labels)
    cm_display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    cm_display.plot()
    plt.show()
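You can see our predictions have been quite successful. In this run we have five failed predictions (two benign cases that were predicted to be malignant and three malignant cases that were predicted to be benign) and 166 correct predictions. Because the train-test split is shuffled randomly, the exact counts will vary a little from run to run.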
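The counts quoted above correspond to an accuracy of 166 out of 171, or about 97%. To get a repeatable figure, one option is to fix the random seed of the split and compute the accuracy directly. The sketch below does this, and also chains the scaling, PCA and logistic regression steps into a single Scikit-learn pipeline. The use of `make_pipeline` and the seed value `random_state=0` are my own choices for illustration, not part of the code above, so the resulting numbers will not necessarily match the ones quoted.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer(as_frame=True)
X = cancer['data']
y = np.choose(cancer['target'], cancer['target_names'])

# A fixed seed (an arbitrary choice) makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, shuffle=True, train_size=0.7, random_state=0)

# Scaling, PCA and the classifier are fitted on the training data only;
# predict() then applies the same transformations to the test data.
pipeline = make_pipeline(StandardScaler(), PCA(0.95), LogisticRegression())
pipeline.fit(X_train, y_train)
y_predicted = pipeline.predict(X_test)

labels = cancer['target_names']
cm = confusion_matrix(y_test, y_predicted, labels=labels)
accuracy = accuracy_score(y_test, y_predicted)

print(cm)
print("Accuracy:", accuracy)
```

The accuracy is simply the sum of the diagonal of the confusion matrix divided by the total number of test cases.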