Principal Component Analysis (PCA) in Python

I have an array (26424 x 144) and I want to perform PCA on it in Python. However, I cannot find one place on the web that explains how to accomplish this task (some sites just do PCA their own way; I cannot find a generalized way of doing it). Any kind of help would be appreciated.

+58
python numpy scikit-learn machine-learning pca
Nov 05 '12 at 0:10
9 answers

You can find the PCA function in the matplotlib module:

    import numpy as np
    from matplotlib.mlab import PCA

    data = np.array(np.random.randint(10, size=(10, 3)))
    results = PCA(data)

results will store the various parameters of the PCA as attributes. It comes from the mlab part of matplotlib, which is a compatibility layer with MATLAB syntax.

EDIT: on the nextgenetics blog I found a great demonstration of how to perform and display a PCA with the matplotlib mlab module; have fun and check out that blog.

+49
Nov 05

I posted my answer even though another answer has already been accepted; the accepted answer relies on a deprecated function. Moreover, that deprecated function is based on singular value decomposition (SVD), which (although perfectly valid) is the more memory- and processor-intensive of the two general techniques for calculating PCA. This is particularly relevant here because of the size of the data array in the OP. Using covariance-based PCA, the array used in the computation flow is just 144 x 144, rather than 26424 x 144 (the dimensions of the original data array).
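To make the size argument concrete, here is a minimal sketch (the random array is just a stand-in with the OP's shape) showing what each approach actually has to factorize:

    import numpy as np

    data = np.random.rand(26424, 144)           # stand-in for the OP's array

    # covariance-based PCA: the eigendecomposition only ever sees a 144 x 144 matrix
    R = np.cov(data, rowvar=False)
    print(R.shape)                              # (144, 144)

    # SVD-based PCA factorizes the full mean-centered 26424 x 144 matrix instead
    U, s, Vt = np.linalg.svd(data - data.mean(axis=0), full_matrices=False)
    print(U.shape, s.shape, Vt.shape)           # (26424, 144) (144,) (144, 144)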

Here is a simple working PCA implementation using SciPy's linalg module. Because this implementation first calculates the covariance matrix and then performs all subsequent computations on that array, it uses far less memory than an SVD-based approach.

(The linalg module in NumPy can also be used with no change to the code below other than the import statement, which would be from numpy import linalg as LA.)

Two key steps in this PCA implementation:

  • calculating the covariance matrix, and

  • taking the eigenvectors and eigenvalues of that covariance matrix

In the function below, the dims_rescaled_data parameter refers to the desired number of dimensions in the rescaled data matrix; this parameter has a default value of just two dimensions, but the code below is not limited to two, and it can be any value smaller than the number of columns of the original data array.




    def PCA(data, dims_rescaled_data=2):
        """
        returns: data transformed in 2 dims/columns + regenerated original data
        pass in: data as 2D NumPy array
        """
        import numpy as NP
        from scipy import linalg as LA
        m, n = data.shape
        # mean center the data
        data -= data.mean(axis=0)
        # calculate the covariance matrix
        R = NP.cov(data, rowvar=False)
        # calculate eigenvectors & eigenvalues of the covariance matrix
        # use 'eigh' rather than 'eig' since R is symmetric,
        # the performance gain is substantial
        evals, evecs = LA.eigh(R)
        # sort eigenvalues in decreasing order
        idx = NP.argsort(evals)[::-1]
        evecs = evecs[:, idx]
        # sort eigenvectors according to same index
        evals = evals[idx]
        # select the first n eigenvectors (n is desired dimension
        # of rescaled data array, or dims_rescaled_data)
        evecs = evecs[:, :dims_rescaled_data]
        # carry out the transformation on the data using eigenvectors
        # and return the re-scaled data, eigenvalues, and eigenvectors
        return NP.dot(evecs.T, data.T).T, evals, evecs

    def test_PCA(data, dims_rescaled_data=2):
        '''
        test by attempting to recover original data array from
        the eigenvectors of its covariance matrix & comparing that
        'recovered' array with the original data
        '''
        _, _, eigenvectors = PCA(data, dims_rescaled_data=2)
        data_recovered = NP.dot(eigenvectors, m).T
        data_recovered += data_recovered.mean(axis=0)
        assert NP.allclose(data, data_recovered)

    def plot_pca(data):
        from matplotlib import pyplot as MPL
        clr1 = '#2026B2'
        fig = MPL.figure()
        ax1 = fig.add_subplot(111)
        data_resc, data_orig = PCA(data)
        ax1.plot(data_resc[:, 0], data_resc[:, 1], '.', mfc=clr1, mec=clr1)
        MPL.show()

    >>> # iris, probably the most widely used reference data set in ML
    >>> df = "~/iris.csv"
    >>> data = NP.loadtxt(df, delimiter=',')
    >>> # remove class labels
    >>> data = data[:, :-1]
    >>> plot_pca(data)

The plot below is a visual representation of this PCA function applied to the iris data. As you can see, the 2D transformation cleanly separates class I from class II and class III (but not class II from class III, which in fact requires another dimension).

(figure: 2D PCA projection of the iris data)

+81
Nov 05

Here is another PCA in Python, using numpy. It is the same idea as @doug's answer, but that one did not run for me.

    from numpy import array, dot, mean, std, empty, argsort
    from numpy.linalg import eigh, solve
    from numpy.random import randn
    from matplotlib.pyplot import subplots, show

    def cov(data):
        """
        Covariance matrix
        note: specifically for mean-centered data
        note: numpy `cov` uses N-1 as normalization
        """
        return dot(data.T, data) / data.shape[0]
        # N = data.shape[1]
        # C = empty((N, N))
        # for j in range(N):
        #     C[j, j] = mean(data[:, j] * data[:, j])
        #     for k in range(j + 1, N):
        #         C[j, k] = C[k, j] = mean(data[:, j] * data[:, k])
        # return C

    def pca(data, pc_count=None):
        """
        Principal component analysis using eigenvalues
        note: this mean-centers and auto-scales the data (in-place)
        """
        data -= mean(data, 0)
        data /= std(data, 0)
        C = cov(data)
        E, V = eigh(C)
        key = argsort(E)[::-1][:pc_count]
        E, V = E[key], V[:, key]
        U = dot(data, V)  # used to be dot(V.T, data.T).T
        return U, E, V

    """ test data """
    data = array([randn(8) for k in range(150)])
    data[:50, 2:4] += 5
    data[50:, 2:5] += 5

    """ visualize """
    trans = pca(data, 3)[0]
    fig, (ax1, ax2) = subplots(1, 2)
    ax1.scatter(data[:50, 0], data[:50, 1], c='r')
    ax1.scatter(data[50:, 0], data[50:, 1], c='b')
    ax2.scatter(trans[:50, 0], trans[:50, 1], c='r')
    ax2.scatter(trans[50:, 0], trans[50:, 1], c='b')
    show()

sklearn gives the same result with much less code:

    from sklearn.decomposition import PCA

    def pca2(data, pc_count=None):
        return PCA(n_components=pc_count).fit_transform(data)

As I understand it, using eigenvalues (the first method) is better for high-dimensional data with fewer samples, whereas using singular value decomposition is better when you have more samples than dimensions.
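To illustrate that trade-off, here is a minimal sketch of the SVD route; the pca_svd name and the random test data are only illustrative:

    import numpy as np

    def pca_svd(data, pc_count=2):
        """PCA via SVD of the mean-centered data; no covariance matrix is formed."""
        centered = data - data.mean(axis=0)
        U, s, Vt = np.linalg.svd(centered, full_matrices=False)
        scores = U[:, :pc_count] * s[:pc_count]       # equals centered projected onto the PCs
        variances = s ** 2 / (len(data) - 1)          # eigenvalues of the covariance matrix
        return scores, variances[:pc_count], Vt[:pc_count].T

    scores, variances, components = pca_svd(np.random.randn(150, 8), pc_count=3)
    print(scores.shape, variances.shape, components.shape)   # (150, 3) (3,) (8, 3)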

+18
Jan 13 '15 at 23:16

This is a job for numpy.

And here is a tutorial demonstrating how principal component analysis can be performed using numpy's built-in functions such as mean, cov, double, cumsum, dot, linalg, array and rank.

http://glowingpython.blogspot.sg/2011/07/principal-component-analysis-with-numpy.html
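As a rough sketch of what that tutorial builds, using only those numpy primitives (the princomp name and the random input are purely illustrative):

    import numpy as np

    def princomp(A):
        """Return (loadings, scores, eigenvalues) for the columns of A."""
        M = A - A.mean(axis=0)                       # mean-center each column
        evals, coeff = np.linalg.eigh(np.cov(M, rowvar=False))
        idx = np.argsort(evals)[::-1]                # order by decreasing variance
        evals, coeff = evals[idx], coeff[:, idx]
        score = np.dot(M, coeff)                     # project the data onto the PCs
        return coeff, score, evals

    coeff, score, evals = princomp(np.random.rand(100, 5))
    print(score.shape, evals)                        # (100, 5), eigenvalues largest first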

Note that the scikit-learn library also has a related implementation, with more code examples, here - https://github.com/scikit-learn/scikit-learn/blob/babe4a5d0637ca172d47e1dfdd2f6f3c3ecb28db/scikits/learn/utils/extmath.py#L105

+12
Nov 05

Here are some scikit-learn options. StandardScaler is used in both methods, because PCA is sensitive to the scale of the features.

Method 1: Have scikit-learn choose the minimum number of principal components such that at least x% of the variance is retained (90% in the example below).

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    iris = load_iris()

    # mean-centers and auto-scales the data
    standardizedData = StandardScaler().fit_transform(iris.data)

    pca = PCA(.90)
    principalComponents = pca.fit_transform(X=standardizedData)

    # To get how many principal components was chosen
    print(pca.n_components_)

Method 2: Choose the number of principal components yourself (here, 2 is chosen).

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    iris = load_iris()
    standardizedData = StandardScaler().fit_transform(iris.data)

    pca = PCA(n_components=2)
    principalComponents = pca.fit_transform(X=standardizedData)

    # to get how much variance was retained
    print(pca.explained_variance_ratio_.sum())
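To see what the PCA(.90) call in Method 1 is doing under the hood, a sketch along these lines inspects the cumulative explained-variance ratio directly (the 90% threshold is just the example value from above):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = StandardScaler().fit_transform(load_iris().data)
    pca = PCA().fit(X)                              # keep all components

    cumulative = np.cumsum(pca.explained_variance_ratio_)
    print(cumulative)                               # roughly [0.73, 0.96, 0.99, 1.00]
    n_components = int(np.argmax(cumulative >= 0.90)) + 1
    print(n_components)                             # smallest count retaining >= 90% variance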

Source: https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60

+5
Jan 02 '18 at 22:29

UPDATE: matplotlib.mlab.PCA has been deprecated since release 2.2 (2018-03-06).

The matplotlib.mlab.PCA library (used in this answer) is now deprecated. So, for all the people arriving here via Google, here is a complete working example, tested with Python 2.7.

Use the following code with caution, as it relies on the now-deprecated library!

    from matplotlib.mlab import PCA
    import numpy

    data = numpy.array([[3, 2, 5], [-2, 1, 6], [-1, 0, 4], [4, 3, 4], [10, -5, -6]])
    pca = PCA(data)

Now in "pca.Y" is the original data matrix in terms of the base vectors of the main components. More information about the PCA can be found here .

    >>> pca.Y
    array([[ 0.67629162, -0.49384752,  0.14489202],
           [ 1.26314784,  0.60164795,  0.02858026],
           [ 0.64937611,  0.69057287, -0.06833576],
           [ 0.60697227, -0.90088738, -0.11194732],
           [-3.19578784,  0.10251408,  0.00681079]])

You can use matplotlib.pyplot to plot this data and convince yourself that the PCA yields "good" results. The names list is only used to annotate our five vectors.

    import matplotlib.pyplot

    names = ["A", "B", "C", "D", "E"]

    matplotlib.pyplot.scatter(pca.Y[:, 0], pca.Y[:, 1])
    for label, x, y in zip(names, pca.Y[:, 0], pca.Y[:, 1]):
        matplotlib.pyplot.annotate(
            label,
            xy=(x, y),
            xytext=(-2, 2),
            textcoords='offset points',
            ha='right',
            va='bottom'
        )
    matplotlib.pyplot.show()

Looking at our original vectors, we see that data[0] ("A") and data[3] ("D") are quite similar, as are data[1] ("B") and data[2] ("C"). This is reflected in the 2D plot of the PCA-transformed data.

PCA result plot
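Since matplotlib.mlab.PCA is deprecated (see the update above), roughly the same projection can be reproduced with scikit-learn; this is only a sketch, and the sign of each column may differ between the two implementations:

    import numpy
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    data = numpy.array([[3, 2, 5], [-2, 1, 6], [-1, 0, 4], [4, 3, 4], [10, -5, -6]],
                       dtype=float)

    # mlab.PCA standardized the data by default, so standardize here as well
    Y = PCA().fit_transform(StandardScaler().fit_transform(data))
    print(Y)   # comparable to pca.Y above, up to the sign of each column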

+4
Jan 27 '17 at 15:21

In addition to all the other answers, here is some code to plot a biplot using sklearn and matplotlib.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets
    from sklearn.decomposition import PCA
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # In general a good idea is to scale the data
    scaler = StandardScaler()
    scaler.fit(X)
    X = scaler.transform(X)

    pca = PCA()
    x_new = pca.fit_transform(X)

    def myplot(score, coeff, labels=None):
        xs = score[:, 0]
        ys = score[:, 1]
        n = coeff.shape[0]
        scalex = 1.0 / (xs.max() - xs.min())
        scaley = 1.0 / (ys.max() - ys.min())
        plt.scatter(xs * scalex, ys * scaley, c=y)
        for i in range(n):
            plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
            if labels is None:
                plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, "Var" + str(i + 1),
                         color='g', ha='center', va='center')
            else:
                plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, labels[i],
                         color='g', ha='center', va='center')
        plt.xlim(-1, 1)
        plt.ylim(-1, 1)
        plt.xlabel("PC{}".format(1))
        plt.ylabel("PC{}".format(2))
        plt.grid()

    # Call the function. Use only the 2 PCs.
    myplot(x_new[:, 0:2], np.transpose(pca.components_[0:2, :]))
    plt.show()

(figure: PCA biplot of the iris data)

+3
May 28 '18 at 19:34

I wrote a small script to compare the various PCA implementations presented as answers here:

    import numpy as np
    from scipy.linalg import svd

    shape = (26424, 144)
    repeat = 20
    pca_components = 2

    data = np.array(np.random.randint(255, size=shape)).astype('float64')

    # data normalization
    # data.dot(data.T)
    # (U, s, Va) = svd(data, full_matrices=False)
    # data = data / s[0]

    from fbpca import diffsnorm
    from timeit import default_timer as timer

    from scipy.linalg import svd
    start = timer()
    for i in range(repeat):
        (U, s, Va) = svd(data, full_matrices=False)
    time = timer() - start
    err = diffsnorm(data, U, s, Va)
    print('svd time: %.3fms, error: %E' % (time * 1000 / repeat, err))

    from matplotlib.mlab import PCA
    start = timer()
    _pca = PCA(data)
    for i in range(repeat):
        U = _pca.project(data)
    time = timer() - start
    err = diffsnorm(data, U, _pca.fracs, _pca.Wt)
    print('matplotlib PCA time: %.3fms, error: %E' % (time * 1000 / repeat, err))

    from fbpca import pca
    start = timer()
    for i in range(repeat):
        (U, s, Va) = pca(data, pca_components, True)
    time = timer() - start
    err = diffsnorm(data, U, s, Va)
    print('facebook pca time: %.3fms, error: %E' % (time * 1000 / repeat, err))

    from sklearn.decomposition import PCA
    start = timer()
    _pca = PCA(n_components=pca_components)
    _pca.fit(data)
    for i in range(repeat):
        U = _pca.transform(data)
    time = timer() - start
    err = diffsnorm(data, U, _pca.explained_variance_, _pca.components_)
    print('sklearn PCA time: %.3fms, error: %E' % (time * 1000 / repeat, err))

    start = timer()
    for i in range(repeat):
        (U, s, Va) = pca_mark(data, pca_components)
    time = timer() - start
    err = diffsnorm(data, U, s, Va.T)
    print('pca by Mark time: %.3fms, error: %E' % (time * 1000 / repeat, err))

    start = timer()
    for i in range(repeat):
        (U, s, Va) = pca_doug(data, pca_components)
    time = timer() - start
    err = diffsnorm(data, U, s[:pca_components], Va.T)
    print('pca by doug time: %.3fms, error: %E' % (time * 1000 / repeat, err))

pca_mark is the pca function from Mark's answer.

pca_doug is the PCA function from doug's answer.

Here is an example output (but the result depends heavily on the size of the data and on pca_components, so I would recommend running your own test with your own data; also, facebook's pca is optimized for normalized data, so it will be faster and more accurate in that case):

    svd time: 3212.228ms, error: 1.907320E-10
    matplotlib PCA time: 879.210ms, error: 2.478853E+05
    facebook pca time: 485.483ms, error: 1.260335E+04
    sklearn PCA time: 169.832ms, error: 7.469847E+07
    pca by Mark time: 293.758ms, error: 1.713129E+02
    pca by doug time: 300.326ms, error: 1.707492E+02

EDIT:

The diffsnorm function from fbpca estimates the spectral-norm error of the Schur decomposition, i.e. the spectral norm of data - U diag(s) Va.
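As a small illustrative check (a random matrix and fbpca's pca with raw=True, so the input is not re-centered; the numbers themselves are meaningless), the estimate can be compared against the exact spectral norm:

    import numpy as np
    from fbpca import pca, diffsnorm

    A = np.random.rand(500, 40)
    U, s, Va = pca(A, k=5, raw=True)                 # rank-5 approximation, no centering

    est = diffsnorm(A, U, s, Va)                     # power-iteration estimate of the error
    exact = np.linalg.norm(A - U @ np.diag(s) @ Va, 2)
    print(est, exact)                                # the two values should be close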

+2
Apr 03 '18 at 12:12

For def plot_pca(data): in doug's answer to work, you need to replace the lines

    data_resc, data_orig = PCA(data)
    ax1.plot(data_resc[:, 0], data_resc[:, 1], '.', mfc=clr1, mec=clr1)

with lines

    newData, data_resc, data_orig = PCA(data)
    ax1.plot(newData[:, 0], newData[:, 1], '.', mfc=clr1, mec=clr1)
0
Nov 17 '16 at 18:02


