Latent semantic analysis in Python: discrepancy

Question

I am trying to follow the Wikipedia article on latent semantic indexing in Python, using the following code:

    documentTermMatrix = array([[ 0., 1., 0., 1., 1., 0., 1.],
                                [ 0., 1., 1., 0., 0., 0., 0.],
                                [ 0., 0., 0., 0., 0., 1., 1.],
                                [ 0., 0., 0., 1., 0., 0., 0.],
                                [ 0., 1., 1., 0., 0., 0., 0.],
                                [ 1., 0., 0., 1., 0., 0., 0.],
                                [ 0., 0., 0., 0., 1., 1., 0.],
                                [ 0., 0., 1., 1., 0., 0., 0.],
                                [ 1., 0., 0., 1., 0., 0., 0.]])
    u,s,vt = linalg.svd(documentTermMatrix, full_matrices=False)

    sigma = diag(s)
    ## remove extra dimensions...
    numberOfDimensions = 4
    for i in range(4, len(sigma) -1):
        sigma[i][i] = 0
    queryVector = array([[ 0.], # same as first column in documentTermMatrix
                         [ 0.],
                         [ 0.],
                         [ 0.],
                         [ 0.],
                         [ 1.],
                         [ 0.],
                         [ 0.],
                         [ 1.]])

According to the math, this should work:

    dtMatrixToQueryAgainst = dot(u, dot(s,vt))
    queryVector = dot(inv(s), dot(transpose(u), queryVector))
    similarityToFirst = cosineDistance(queryVector, dtMatrixToQueryAgainst[:,0] # gives 'matrices are not aligned' error. should be 1 because they're the same

What does work, with math that looks incorrect (from here):

    dtMatrixToQueryAgainst = dot(s, vt)
    queryVector = dot(transpose(u), queryVector)
    similarityToFirst = cosineDistance(queryVector, dtMatrixToQueryAgainsst[:,0]) # gives 1, which is correct

Why does the second route work but the first does not, when everything I can find about the math of LSA shows the first as correct? I feel like I'm missing something obvious...

+6

python numpy scipy nlp latent-semantic-indexing

Jmjmh Apr 25 '12 at 23:24

1 answer

Drew frank · Accepted Answer · 2012-05-01T01:11:38+0000

There are several inconsistencies in your code that will raise errors before you reach your point of confusion. That makes it hard to tell exactly what you tried and why you are confused (clearly you did not run the code as pasted, or it would have thrown an exception earlier).

That said, if I follow your intent correctly, your first approach is nearly correct. Consider the following code:

    from numpy import array, diag, dot, linalg
    from numpy.linalg import inv

    documentTermMatrix = array([[ 0., 1., 0., 1., 1., 0., 1.],
                                [ 0., 1., 1., 0., 0., 0., 0.],
                                [ 0., 0., 0., 0., 0., 1., 1.],
                                [ 0., 0., 0., 1., 0., 0., 0.],
                                [ 0., 1., 1., 0., 0., 0., 0.],
                                [ 1., 0., 0., 1., 0., 0., 0.],
                                [ 0., 0., 0., 0., 1., 1., 0.],
                                [ 0., 0., 1., 1., 0., 0., 0.],
                                [ 1., 0., 0., 1., 0., 0., 0.]])

    numDimensions = 4
    u, s, vt = linalg.svd(documentTermMatrix, full_matrices=False)

    # Keep only the first numDimensions singular values and vectors.
    u = u[:, :numDimensions]
    sigma = diag(s)[:numDimensions, :numDimensions]
    vt = vt[:numDimensions, :]
    lowRankDocumentTermMatrix = dot(u, dot(sigma, vt))

    # Use the first document (column 0) as the query.
    queryVector = documentTermMatrix[:, 0]
    # Map the query into the low-dimensional subspace.
    lowDimensionalQuery = dot(inv(sigma), dot(u.T, queryVector))
    print(lowDimensionalQuery)
    print(vt[:, 0])

You should see that lowDimensionalQuery and vt[:,0] are equal. Think of vt as a representation of the documents in a low-dimensional subspace. First we map our query into that subspace to get lowDimensionalQuery, and then we compare it with the corresponding column of vt. Your mistake was trying to compare the transformed query with the document vector from lowRankDocumentTermMatrix, which lives in the original space. Because the transformed query has fewer elements than the "reconstructed" document, Python complained.
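As a rough sketch of the comparison described above, the snippet below projects the query, computes its cosine similarity against every column of the truncated vt, and prints the two shapes that caused the alignment error. The names `cosine_similarity`, `u_k`, `s_k`, and `vt_k` are illustrative assumptions and not from the original post (the asker's `cosineDistance` was never shown). It also checks the rank-4 version of the asker's second route, which suggests why that route returned 1: both sides are scaled by the same singular values.

```python
import numpy as np

# Same matrix as in the question: rows are terms, columns are documents.
A = np.array([[0., 1., 0., 1., 1., 0., 1.],
              [0., 1., 1., 0., 0., 0., 0.],
              [0., 0., 0., 0., 0., 1., 1.],
              [0., 0., 0., 1., 0., 0., 0.],
              [0., 1., 1., 0., 0., 0., 0.],
              [1., 0., 0., 1., 0., 0., 0.],
              [0., 0., 0., 0., 1., 1., 0.],
              [0., 0., 1., 1., 0., 0., 0.],
              [1., 0., 0., 1., 0., 0., 0.]])

k = 4  # number of latent dimensions kept
u, s, vt = np.linalg.svd(A, full_matrices=False)
u_k, s_k, vt_k = u[:, :k], s[:k], vt[:k, :]


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


query = A[:, 0]  # reuse document 0 as the query

# Dividing by s_k is the same as multiplying by inv(diag(s_k)).
low_dim_query = (u_k.T @ query) / s_k

# Compare the projected query with each document's low-dimensional vector.
similarities = [cosine_similarity(low_dim_query, vt_k[:, j])
                for j in range(vt_k.shape[1])]
print(similarities[0])  # expected to be 1 up to floating-point rounding

# The failing comparison mixed a k-element query with a 9-element column.
low_rank_A = u_k @ np.diag(s_k) @ vt_k
print(low_dim_query.shape, low_rank_A[:, 0].shape)  # (4,) (9,)

# Rank-k version of the second route: both sides carry a factor of diag(s_k),
# so they still point in the same direction.
scaled_query = u_k.T @ query
scaled_docs = np.diag(s_k) @ vt_k
print(cosine_similarity(scaled_query, scaled_docs[:, 0]))
```

Either pairing works as long as the query and the document vectors live in the same space with the same scaling; which scaling ranks other documents better is a separate modeling choice that this sketch does not address.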
