How can I write clustering results from mclust to a file?

I am using the mclust library for R ( http://www.stat.washington.edu/mclust ) to do some experimental EM-based GMM clustering. The package is wonderful and seems to usually find very good clusters for my data.

The problem is that I don't know R at all, and although I managed to muddle through the clustering process based on the help pages and the extensive readme, I can't for the life of me figure out how to write the actual cluster results out to a file. I use the following absurdly simple script to perform the clustering:

    myData <- read.csv("data.csv", sep=",", header=FALSE)
    attach(myData)
    myBIC <- mclustBIC(myData)
    mySummary <- summary( myBIC, data=myData )

At this point I have the clustering results and a summary. The data in data.csv is just a list of multidimensional points, one per line, so each row looks like "x, y, z" (in the three-dimensional case).

If I use 2D points (for example, only the x and y values), I can use the built-in plot function to get a very nice graph that shows the source points and color-codes each point by the cluster it was assigned to. So I know that all the information is somewhere in "myBIC", but the documentation and help do not seem to say anything about how to print this data!

I want to write a new file based on the results, which, as far as I can tell, are encoded in myBIC. Something like:

    CLUST x, y, z
    1 1.2, 3.4, 5.2
    1 1.2, 3.3, 5.2
    2 5.5, 1.3, 1.3
    3 7.1, 1.2, -1.0
    3 7.2, 1.2, -1.1

and then, hopefully, also print the parameters/centroids of the individual Gaussians/clusters that the clustering process found.

Surely this is absurdly easy to do, and I just don't know enough R to figure it out...

EDIT: It seems I have gotten a little further. Running the following prints a somewhat cryptic matrix:

    > mySummary$classification
    [1] 1 1 2 1 3
    [6] 1 1 1 3 1
    [12] 1 2 1 3 1
    [18] 1 3

which, on reflection, I realized is actually a list of the samples and their classifications. I don't think it can be written out directly with the write command, but a little more experimentation in the R console made me realize that I can do this:

    > newData <- mySummary$classification
    > write.csv( newData, file="class.csv" )

and that the result really looks pretty good!

    $ head class.csv
    "","x"
    "1",1
    "2",2
    "3",2

where the first column clearly corresponds to the index for the input, and the second column describes the assigned class identifier.

The mySummary$parameters object appears to be nested, with a bunch of sub-objects corresponding to the individual Gaussians and their parameters, etc. The "write" function fails when I try to write the whole object out, and writing each sub-object by name is a bit tedious. This leads me to a new question: how can I iterate over a nested object in R and print its elements serially to a file descriptor?

I have this object mySummary$parameters. It consists of several sub-elements, such as "mySummary$parameters$variance$sigma", etc. I would just like to iterate over everything and print it all to a file, the same way the CLI displays it automatically...

1 answer

To calculate the clustering parameters themselves (mean, variance, which cluster each point belongs to), you need to use Mclust. To write them out, you can use (for example) write.csv.

By default, Mclust calculates the parameters for the optimal model as determined by BIC, so if that is what you want, you can do:

 myMclust <- Mclust(myData) 

Then myMclust$BIC will contain the results for all the other models tried (i.e. myMclust$BIC is more or less the same as mclustBIC(myData)).

See the Value: section of ?Mclust for the other information myMclust contains. For example, myMclust$parameters$mean is the mean of each cluster, myMclust$parameters$variance is the variance of each cluster, ...

In particular, myMclust$classification contains the cluster each point belongs to, as computed for the optimal model.

So, to get the desired result, you can do:

    library(mclust)

    # create some data for example purposes -- you have your read.csv(...) instead.
    myData <- data.frame(x=runif(100), y=runif(100), z=runif(100))
    # get parameters for the optimal model
    myMclust <- Mclust(myData)
    # if you wanted to do your summary like before:
    mySummary <- summary(myMclust$BIC, data=myData)
    # add a column CLUST to myData with the cluster.
    myData$CLUST <- myMclust$classification
    # now to write it out:
    write.csv(myData[,c("CLUST","x","y","z")], # reorder columns to put CLUST first
              file="out.csv",                  # output filename
              row.names=FALSE,                 # don't save the row numbers
              quote=FALSE)                     # don't surround column names in ""

A note on write.csv: if you do not pass row.names=FALSE, you will get an extra column in the csv containing the row number. Also, quote=FALSE writes the column headings as CLUST,x,y,z, whereas otherwise they would be "CLUST","x","y","z". The choice is yours.
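To see the effect of those two arguments without touching any files, here is a small sketch. The two-row data frame is made up purely for illustration; write.csv prints to the console when no file is given, so capture.output can grab the lines.

```r
# Made-up two-row data frame, only for illustrating the header/row format.
df <- data.frame(CLUST = c(1, 2), x = c(1.2, 5.5))

# With no file argument, write.csv writes to the console; capture those lines.
withDefaults <- capture.output(write.csv(df))
withOptions  <- capture.output(write.csv(df, row.names = FALSE, quote = FALSE))

print(withDefaults)  # quoted header plus a leading row-name column
print(withOptions)   # plain header, no row-name column
```

The first version has an empty quoted heading for the row-name column; the second has just the CLUST and x columns.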

Now suppose you wanted to do the same thing, but using the parameters from a different model that was not the optimal one. By default, Mclust only calculates parameters for the optimal model. To calculate the parameters for a specific model (for example, "EEI"), you would do:

 myMclust <- Mclust(myData,modelNames="EEI") 

and then proceed as before.
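As for dumping a nested object such as myMclust$parameters to a file, one possible approach is sketched below. So that it runs without mclust installed, it uses a stand-in list named params with made-up numbers; the element names and the output file names are assumptions for illustration, not the exact structure mclust returns. With the real object you would substitute myMclust$parameters for params.

```r
# Stand-in for myMclust$parameters: a hypothetical nested list with made-up
# values, used only so this sketch is self-contained.
params <- list(
  pro  = c(0.5, 0.5),
  mean = matrix(c(1.2, 3.4, 5.5, 1.3), nrow = 2,
                dimnames = list(c("x", "y"), c("1", "2"))),
  variance = list(modelName = "EII", sigmasq = 0.04)
)

# Option 1: save exactly what the console would print for the object.
paramFile <- file.path(tempdir(), "parameters.txt")
writeLines(capture.output(print(params)), paramFile)

# Option 2: flatten the nested list into one named vector and write
# one name/value pair per row (values become character if types are mixed).
flat <- unlist(params)
flatFile <- file.path(tempdir(), "parameters_flat.csv")
write.csv(data.frame(name = names(flat), value = unname(flat)),
          file = flatFile, row.names = FALSE)
```

Option 1 is the closest to "print it the way the CLI does"; option 2 is easier to read back into another program. Individual matrices, such as the means, can also be passed to write.csv directly.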


Source: https://habr.com/ru/post/906267/

