matlab - Evaluating K-means accuracy -

I created 3-dimensional random data sets with 4 defined patterns/classes in MATLAB. I then applied the k-means algorithm to the data to see how well k-means can classify my samples based on the 4 patterns/classes I created.

I need help with the following:

What function/code can I use to evaluate how well the k-means algorithm has identified the classes of my samples correctly, assuming I set k=4 as illustrated in the image below?

[image]

How can I automatically identify the number of classes (k), assuming the classes in the data are unknown?

My aim is to evaluate k-means' accuracy and how changes to the data (by pre-processing) affect the algorithm's ability to identify the classes. Examples with MATLAB code would be helpful!

One basic metric for measuring how "good" your clustering is in comparison to known class labels is called purity. It is an external (supervised) evaluation measure: it assumes you already have a reference labeling of the instances, for example one based on real-world data.

The mathematical definition of purity is as follows:

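$$\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} \left| \omega_k \cap c_j \right|$$

where $\Omega = \{\omega_1, \ldots, \omega_K\}$ is the set of clusters, $C = \{c_1, \ldots, c_J\}$ is the set of classes, and $N$ is the total number of instances.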

In words, what this means is, quoting a professor at Stanford University here:

"To compute purity, each cluster is assigned to the class which is most frequent in the cluster, and then the accuracy of this assignment is measured by counting the number of correctly assigned documents and dividing by N."

A simple example: suppose you had a naive clustering, produced via kmeans with k=2, that looked like this:

cluster1   label
1          a
5          b
7          b
3          b
2          b

cluster2   label
4          a
6          a
8          a
9          b

In cluster1 there are 4 instances of label b and 1 instance of label a; cluster2 has 3 instances of label a and 1 instance of label b. The overall purity is built from the majority-class count of each cluster (here k=2), so look at each cluster in turn. The purity of cluster1 is the largest number of instances sharing a single label, divided by the total number of instances in cluster1.

Therefore the purity of cluster1 is:

4/5 = 0.80

The 4 comes from the fact that the most frequent label (b) occurs 4 times, and there are 5 instances in total in the cluster.

It follows that the purity of cluster2 is:

3/4 = 0.75
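The arithmetic of this worked example can be reproduced with a few lines of MATLAB. This is only a minimal sketch: the variable names (`clusterIdx`, `labels`, `counts`) are illustrative, the labels are coded as a = 1 and b = 2 so they can be used as indices, and the overall purity is computed as in the quoted definition (majority counts added up and divided by N).

```matlab
% Worked example: one entry per instance, in the order
% 1 5 7 3 2 (cluster1) followed by 4 6 8 9 (cluster2).
clusterIdx = [1 1 1 1 1 2 2 2 2]';   % cluster assigned to each instance
labels     = [1 2 2 2 2 1 1 1 2]';   % known class, coded a = 1, b = 2

% Contingency table: rows are clusters, columns are classes.
counts = accumarray([clusterIdx labels], 1);

majority      = max(counts, [], 2);           % majority-class count per cluster
clusterPurity = majority ./ sum(counts, 2);   % purity of each cluster
overallPurity = sum(majority) / numel(labels);

fprintf('per-cluster purity: %s\n', mat2str(clusterPurity', 3));
fprintf('overall purity:     %.3f\n', overallPurity);
```

For this data the per-cluster purities are 0.80 and 0.75, and the overall purity is 7/9, roughly 0.78. With real data, `clusterIdx` would be the index vector returned by kmeans and `labels` the known class of each sample; both have to be positive integers for `accumarray`, which the third output of `unique` can provide if the labels are stored in another form.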

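Note that the overall purity is not the plain sum of these two values (0.80 + 0.75 = 1.55). Following the definition quoted above, the majority counts are added up and divided by the total number of instances: (4 + 3)/9 ≈ 0.78, which is the per-cluster purities weighted by cluster size. So what does this tell us? A cluster is considered "pure" if it has a purity of 1, since that indicates that all instances in the cluster have the same label. The best overall purity for a data set is therefore 1, which is reached when every cluster has an individual purity of 1; the closer the score is to 1, the better kmeans has recovered your original labeling.

However, you need to be aware that purity is not always the best or most telling metric. For example, if you had 10 points and chose k=10, every cluster would contain a single point and have a purity of 1, so the overall purity would also be a perfect 1 even though the clustering tells you nothing. In such cases it is better to use different external metrics such as precision, recall, and F-measure. I suggest looking into them if you can. And again, to reiterate, this is only useful when you have prior knowledge of the labeling, which I believe is the case in your question.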
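Precision, recall and F-measure can be defined for a clustering in more than one way. The sketch below assumes the pair-counting definition, which the answer does not spell out: a pair of instances is a true positive when it shares both cluster and class, a false positive when it shares only the cluster, and a false negative when it shares only the class. The variable names are illustrative, and the data is the k=2 example from above with labels coded a = 1, b = 2.

```matlab
clusterIdx = [1 1 1 1 1 2 2 2 2]';   % cluster assigned to each instance
labels     = [1 2 2 2 2 1 1 1 2]';   % known class, coded a = 1, b = 2

n = numel(labels);
[p, q] = find(triu(true(n), 1));     % every unordered pair of instances (p < q)

sameCluster = clusterIdx(p) == clusterIdx(q);
sameClass   = labels(p) == labels(q);

tp = sum( sameCluster &  sameClass);   % together, and rightly so
fp = sum( sameCluster & ~sameClass);   % together, but different classes
fn = sum(~sameCluster &  sameClass);   % same class, but split apart

precision = tp / (tp + fp);
recall    = tp / (tp + fn);
f1        = 2 * precision * recall / (precision + recall);

fprintf('precision %.4f, recall %.4f, F1 %.4f\n', precision, recall, f1);
```

Unlike purity, these pair-based scores are penalised when a class is split across many small clusters, so the degenerate "one point per cluster" solution no longer looks perfect: it has no true positives, which gives a recall of 0.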

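To answer your second question: choosing k, the number of clusters, is the most difficult part of kmeans without any prior knowledge of the data. There are techniques to mitigate the problems presented by choosing the initial number of clusters and the initial centroids. Probably the most common is an algorithm called kmeans++, which addresses the choice of initial centroids rather than the value of k itself. I suggest looking into that for further info.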
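As a rough illustration of how this might look in MATLAB, the sketch below assumes the Statistics and Machine Learning Toolbox is installed. It lets `evalclusters` compare several candidate values of k and then runs `kmeans` with k-means++ seeding. The silhouette criterion and the synthetic four-blob data are choices made for this example only; the answer does not prescribe either.

```matlab
rng(1);   % fix the random seed so the run is repeatable

% Synthetic 3-D data: four Gaussian blobs (illustrative, not the asker's data).
centers = [0 0 0; 6 6 0; 0 6 6; 6 0 6];
X = [];
for c = 1:4
    X = [X; repmat(centers(c, :), 50, 1) + randn(50, 3)]; %#ok<AGROW>
end

% Score k = 2..8 with the silhouette criterion and take the best-scoring k.
eva   = evalclusters(X, 'kmeans', 'silhouette', 'KList', 2:8);
kBest = eva.OptimalK;

% Cluster with that k, using k-means++ seeding and a few restarts.
idx = kmeans(X, kBest, 'Start', 'plus', 'Replicates', 5);

fprintf('suggested number of clusters: %d\n', kBest);
```

A criterion like this only suggests a k; it is worth checking the suggestion against the purity or pair-based scores whenever reference labels are available.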
