I have a bipartite graph, and I'm looking for the most efficient iterative way to split it into connected components. My recursive version began to overflow the stack with large data sets. I am ready to send from any language / pseudo-code, but for completeness I will code in C #.
My existing code is specialized for my data types. One septum is proteins, the other is spectra. Map and Set are C ++ stdlib workstations.
void recursivelyAssignProteinToCluster (long proteinId, long clusterId, Set<long> spectrumSet, Map<long, Set<long>> spectrumSetByProteinId, Map<long, Set<long>> proteinSetBySpectrumId, Map<long, long> clusterByProteinId) { // try to assign the protein to the current cluster var insertResult = clusterByProteinId.Insert(proteinId, clusterId); if (!insertResult.WasInserted) return; // recursively add all "cousin" proteins to the current cluster foreach (long spectrumId in spectrumSet) foreach (var cousinProteinId in proteinSetBySpectrumId[spectrumId]) { if (proteinId != cousinProteinId) { Set<long> cousinSpectrumSet = spectrumSetByProteinId[cousinProteinId]; recursivelyAssignProteinToCluster(cousinProteinId, clusterId, cousinSpectrumSet, spectrumSetByProteinId, proteinSetBySpectrumId, clusterByProteinId); } } } Map<long, long> calculateProteinClusters (NHibernate.ISession session) { var spectrumSetByProteinId = new Map<long, Set<long>>(); var proteinSetBySpectrumId = new Map<long, Set<long>>(); var query = session.CreateQuery("SELECT pi.Protein.id, psm.Spectrum.id " + GetFilteredQueryString(FromProtein, ProteinToPeptideSpectrumMatch)); foreach (var queryRow in query.List<object[]>()) { long proteinId = (long) queryRow[0]; long spectrumId = (long) queryRow[1]; spectrumSetByProteinId[proteinId].Add(spectrumId); proteinSetBySpectrumId[spectrumId].Add(proteinId); } var clusterByProteinId = new Map<long, long>(); int clusterId = 0; foreach (var pair in spectrumSetByProteinId) { long proteinId = pair.Key; // for each protein without a cluster assignment, make a new cluster if (!clusterByProteinId.Contains(proteinId)) { ++clusterId; recursivelyAssignProteinToCluster(proteinId, clusterId, pair.Value, spectrumSetByProteinId, proteinSetBySpectrumId, clusterByProteinId); } } return clusterByProteinId; }
As ShinTakezou said, I reorganized to put the stack in a heap, and it works great. I used the DepthFirstSearch method from the digEmAll example.
var clusterByProteinId = new Map<long, long>(); int clusterId = 0; var clusterStack = new Stack<KeyValuePair<long, Set<long>>>(); foreach (var pair in spectrumSetByProteinId) { long proteinId = pair.Key; if (clusterByProteinId.Contains(proteinId)) continue;