I have a 3 million x 9 million sparse matrix with several billion non-zero entries. R and Python do not allow sparse matrices with more than MAXINT non-zero entries, so I found myself using Julia.
While scaling this data by the standard deviations is trivial, centering it naively is of course a no-go, as it would create a dense matrix on the order of 200 terabytes.
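To make that concrete, here is a small sketch (toy dimensions of my own, written against the current-Julia SparseArrays API rather than the 0.6-era API of the snippet quoted below) of why scaling stays cheap while naive centering does not:

```julia
using SparseArrays, Statistics, LinearAlgebra

A = sprand(1000, 2000, 0.001)       # stand-in for the real 3M x 9M matrix

# Scaling columns preserves sparsity: right-multiply by a diagonal matrix.
s = collect(vec(std(A, dims=1)))    # per-column standard deviations
s[iszero.(s)] .= 1                  # guard against all-zero columns
As = A * Diagonal(1 ./ s)           # still sparse, same nonzero pattern

# Naive centering does not: subtracting the column means turns almost every
# stored zero into a nonzero, so the result is effectively a dense matrix.
Ac = A .- mean(A, dims=1)
```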
The relevant Julia code for svds can be found at https://github.com/JuliaLang/julia/blob/343b7f56fcc84b20cd1a9566fd548130bb883505/base/linalg/arnoldi.jl#L398
From my reading, the key element of this code is the AtA_or_AAt structure and several functions around it, in particular A_mul_B!. Copied below for your convenience.
struct AtA_or_AAt{T,S} <: AbstractArray{T, 2}
    A::S
    buffer::Vector{T}
end

function AtA_or_AAt(A::AbstractMatrix{T}) where T
    Tnew = typeof(zero(T)/sqrt(one(T)))
    Anew = convert(AbstractMatrix{Tnew}, A)
    AtA_or_AAt{Tnew,typeof(Anew)}(Anew, Vector{Tnew}(max(size(A)...)))
end

function A_mul_B!(y::StridedVector{T}, A::AtA_or_AAt{T}, x::StridedVector{T}) where T
    if size(A.A, 1) >= size(A.A, 2)
        A_mul_B!(A.buffer, A.A, x)
        return Ac_mul_B!(y, A.A, A.buffer)
    else
        Ac_mul_B!(A.buffer, A.A, x)
        return A_mul_B!(y, A.A, A.buffer)
    end
end

size(A::AtA_or_AAt) = ntuple(i -> min(size(A.A)...), Val(2))
ishermitian(s::AtA_or_AAt) = true
This is passed to the eigs function, where some kind of magic happens, and then the output is processed into the appropriate components for SVD.
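As a sanity check of the identity eigs is exploiting (this small dense verification is mine, not from the linked file): the eigenvalues of A'A, or AA' when that is the smaller product, are the squared singular values of A, so eigs on the AtA_or_AAt operator plus a square root recovers the singular values.

```julia
using SparseArrays, LinearAlgebra

A = sprand(200, 80, 0.05)
# Eigenvalues of A'A, computed densely here only because the example is tiny:
lams = sort(eigvals(Symmetric(Matrix(A' * A))), rev=true)
# Their square roots match the singular values of A; eigs performs the same
# computation matrix-free through the AtA_or_AAt wrapper.
svals = svdvals(Matrix(A))
# sqrt.(lams) ≈ svals, up to floating-point noise near zero
```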
I think the best way to handle centering on the fly is something like a subclass of AtA_or_AAt, say AtA_or_AAt_centered, which more or less mimics its behavior but also stores the column means and overrides A_mul_B! accordingly.
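The arithmetic such an override would have to implement (my sketch and naming, not existing Julia code): with mu the vector of column means, the centered matrix is B = A - ones(m)*mu', and both products can be applied without ever forming B, since B*x = A*x - (mu⋅x)*ones(m) and B'*y = A'*y - sum(y)*mu.

```julia
using SparseArrays, Statistics, LinearAlgebra

# Hypothetical centered A'A operator; only the multiply logic is shown,
# not the AbstractArray plumbing that eigs would additionally need.
struct CenteredAtA{T,S<:AbstractMatrix{T}}
    A::S
    mu::Vector{T}                      # column means, stored once
end
CenteredAtA(A) = CenteredAtA(A, collect(vec(mean(A, dims=1))))

# Computes (A - 1*mu')' * (A - 1*mu') * x using only sparse products.
function atax(C::CenteredAtA, x::AbstractVector)
    t = C.A * x .- dot(C.mu, x)        # t = B*x: sparse mat-vec plus a shift
    return C.A' * t .- sum(t) .* C.mu  # B'*t: adjoint mat-vec plus a shift
end
```

In the real wrapper this logic would live in A_mul_B! (mul! on current Julia), switching between A'A and AA' by the smaller dimension exactly as the quoted code does.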
However, I am not fluent enough in Julia to implement this, and I am also unsure how the output of eigs gets turned back into the pieces of the SVD (U, s, V, i.e. the left singular vectors, singular values, and right singular vectors).
edit: I am new to Julia, so I do not know whether a "Centered Sparse Matrix" type like this already exists somewhere in the ecosystem. If it does, a pointer would save me a lot of work. If not, any advice on how to implement one correctly and efficiently would be appreciated.