I am trying to implement the worked example from the Wikipedia article on latent semantic indexing in Python, using the following code:
from numpy import array, diag, dot, transpose, linalg
from numpy.linalg import inv

# 9 terms x 7 documents
documentTermMatrix = array([[0., 1., 0., 1., 1., 0., 1.],
                            [0., 1., 1., 0., 0., 0., 0.],
                            [0., 0., 0., 0., 0., 1., 1.],
                            [0., 0., 0., 1., 0., 0., 0.],
                            [0., 1., 1., 0., 0., 0., 0.],
                            [1., 0., 0., 1., 0., 0., 0.],
                            [0., 0., 0., 0., 1., 1., 0.],
                            [0., 0., 1., 1., 0., 0., 0.],
                            [1., 0., 0., 1., 0., 0., 0.]])

u, s, vt = linalg.svd(documentTermMatrix, full_matrices=False)
sigma = diag(s)

# remove extra dimensions: keep only the first numberOfDimensions singular values
numberOfDimensions = 4
for i in range(numberOfDimensions, len(s)):
    sigma[i][i] = 0

queryVector = array([[0.],  # same as the first column in documentTermMatrix
                     [0.],
                     [0.],
                     [0.],
                     [0.],
                     [1.],
                     [0.],
                     [0.],
                     [1.]])
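(cosineDistance isn't shown above; in my code it's just the usual cosine similarity between two vectors, something along these lines, give or take the exact definition:)

from numpy import dot
from numpy.linalg import norm

def cosineDistance(a, b):
    # cosine of the angle between the two vectors, flattened to 1-D;
    # larger values mean the vectors point in more similar directions
    a = a.ravel()
    b = b.ravel()
    return dot(a, b) / (norm(a) * norm(b))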
As far as I understand the math, this is what should work:
dtMatrixToQueryAgainst = dot(u, dot(sigma, vt))
queryVector = dot(inv(sigma), dot(transpose(u), queryVector))
similarityToFirst = cosineDistance(queryVector, dtMatrixToQueryAgainst[:, 0])
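For reference, the fold-in formulas I'm basing this on are (roughly) the standard LSA ones:

A_k = U \Sigma_k V^T, \qquad \hat{q} = \Sigma_k^{-1} U^T q

i.e. reconstruct the rank-k matrix and map the query into concept space, then compare with cosine similarity.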
What does work, even though the math doesn't look right to me (from here):
dtMatrixToQueryAgainst = dot(sigma, vt)
queryVector = dot(transpose(u), queryVector)
similarityToFirst = cosineDistance(queryVector, dtMatrixToQueryAgainst[:, 0])
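To see the behaviour across all documents with this second version, a quick loop over the columns works (using the names above and the cosineDistance sketch from earlier):

# compare the folded-in query against each document column
for docIndex in range(dtMatrixToQueryAgainst.shape[1]):
    print(docIndex, cosineDistance(queryVector, dtMatrixToQueryAgainst[:, docIndex]))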
Why does the second route work, but the first does not, when everything I can find about the LSA math says the first one is correct? I feel like I'm missing something obvious...