MATLAB - Evaluating K-means accuracy
I created 3-dimensional random data sets with 4 defined patterns/classes in MATLAB. I then applied the k-means algorithm to the data to see how well k-means can classify the samples based on the 4 patterns/classes I created.
I need the following:
- What function/code can I use to evaluate how well the k-means algorithm has identified the classes of the samples? Assume I set k = 4, as illustrated in the image below.
- How can I automatically identify the number of classes (k), assuming the classes in the data are unknown?
My aim is to evaluate k-means' accuracy and how changes to the data (by pre-processing) affect the algorithm's ability to identify the classes. Examples with MATLAB code would be helpful!
One basic metric to measure how "good" a clustering is in comparison to known class labels is called purity. This is a supervised setting, in which you have an external metric: a labeling of the instances based on real-world data.
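The mathematical definition of purity is as follows:

$$\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} \lvert \omega_k \cap c_j \rvert$$

where $\Omega = \{\omega_1, \dots, \omega_K\}$ is the set of clusters, $C = \{c_1, \dots, c_J\}$ is the set of classes, and $N$ is the total number of instances.

In words, what this means is, quoting a professor at Stanford University here: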
"To compute purity, each cluster is assigned to the class which is most frequent in the cluster, and then the accuracy of this assignment is measured by counting the number of correctly assigned documents and dividing by N."
A simple example: suppose you had a naive clustering produced via k-means with k = 2 that looked like this:
cluster1   label
1          a
5          b
7          b
3          b
2          b

cluster2   label
4          a
6          a
8          a
9          b
In cluster1 there are 4 instances of label b and 1 instance of label a, and cluster2 has 3 instances of label a and 1 instance of label b. Now we are looking for the total purity, which is built from the purity of each cluster (in our case k = 2). The purity of cluster1 is the maximum number of instances with respect to the given labels, divided by the total number of instances in cluster1.
Therefore the purity of cluster1 is:
4/5 = 0.80
The 4 comes from the fact that the label that occurs most (b) occurs 4 times, and there are 5 total instances in the cluster.
It follows that the purity of cluster2 is:
3/4 = 0.75
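The two per-cluster figures above can be reproduced with a few lines of MATLAB. This is only a minimal sketch: the numeric label codes (1 for a, 2 for b) and the variable names are assumptions made for illustration, and the overall purity follows the quoted "divide by N" description.

```matlab
% Toy example from the text: 9 points, listed cluster by cluster.
clusterIdx = [1 1 1 1 1 2 2 2 2]';   % cluster assignment of each point
trueLabel  = [1 2 2 2 2 1 1 1 2]';   % assumed coding: 1 = 'a', 2 = 'b'

% Contingency table: rows are clusters, columns are true classes.
counts = accumarray([clusterIdx trueLabel], 1);

% Per-cluster purity: share of the most frequent class in each cluster.
clusterPurity = max(counts, [], 2) ./ sum(counts, 2);   % [0.80; 0.75]

% Overall purity: majority counts summed over clusters, divided by N.
overallPurity = sum(max(counts, [], 2)) / numel(clusterIdx);   % 7/9

fprintf('per-cluster purity: %.2f %.2f\n', clusterPurity);
fprintf('overall purity: %.3f\n', overallPurity);
```

For real data, `clusterIdx` would be the first output of `kmeans` and `trueLabel` the known class of each sample, both coded as positive integers.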
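Now for the total purity. Note that it is not the sum of the per-cluster purities (0.80 + 0.75 = 1.55): under the definition quoted above, it is the number of majority-label instances summed over all clusters, divided by N, which here gives (4 + 3)/9 ≈ 0.78.

So what does this tell us? A cluster is considered "pure" if it has a purity of 1, since that indicates that all of the instances in that cluster have the same label. A high overall purity means your original label classification was pretty good and your k-means did a pretty good job. The "best" purity score for an entire data set is 1, which requires every cluster to have an individual purity score of 1.

However, you need to be aware that purity is not the best or most telling metric. For example, if you had 10 points and chose k = 10, every cluster would have a purity of 1, and therefore the overall purity would be a perfect 1 as well, even though the clustering is useless. In such instances it is better to use different external metrics such as precision, recall, and F-measure. I suggest looking into them if you can. And again, to reiterate, this is only useful in a supervised setting where you have pre-knowledge of a labeling system, which I believe is the case in your question.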
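One common way to get precision, recall, and F-measure for a clustering is to count pairs of points: a pair is a true positive if the two points share both a cluster and a class. The sketch below applies that pair-counting variant to the same toy example. The pair-counting formulation and the label coding (1 for a, 2 for b) are assumptions on my part, not something stated in the question.

```matlab
% Same toy example: 9 points, 2 clusters, labels coded 1 = 'a', 2 = 'b'.
clusterIdx = [1 1 1 1 1 2 2 2 2]';
trueLabel  = [1 2 2 2 2 1 1 1 2]';
n = numel(clusterIdx);

% n-by-n logical matrices: does each pair share a cluster / a class?
sameCluster = bsxfun(@eq, clusterIdx, clusterIdx');
sameClass   = bsxfun(@eq, trueLabel,  trueLabel');

% Strict upper triangle, so each unordered pair is counted once.
pairMask = triu(true(n), 1);

TP = nnz( sameCluster &  sameClass & pairMask);   % same cluster, same class
FP = nnz( sameCluster & ~sameClass & pairMask);   % same cluster, different class
FN = nnz(~sameCluster &  sameClass & pairMask);   % different cluster, same class

precision = TP / (TP + FP);
recall    = TP / (TP + FN);
F1        = 2 * precision * recall / (precision + recall);

fprintf('precision %.4f, recall %.4f, F1 %.4f\n', precision, recall, F1);
```

Unlike purity, these numbers penalize over-splitting: if every point sits in its own cluster, no pair shares a cluster, so TP is 0 and recall drops to 0.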
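To answer your second question: choosing k, the number of clusters, is the most difficult part of k-means without any prior knowledge of the data. There are techniques to mitigate the problems presented by choosing the initial number of clusters and the initial centroids. A common algorithm is k-means++, which addresses how the initial centroids are chosen (it does not pick k for you). I suggest looking into it for further info.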
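As a rough sketch of how this could look in MATLAB, the snippet below runs `kmeans` with k-means++ seeding and then, as one possible way to estimate k that I am swapping in here, scores a range of candidate values with the silhouette criterion via `evalclusters`. It assumes the Statistics and Machine Learning Toolbox is available; the synthetic data, blob centers, and candidate range 2:8 are made up for illustration.

```matlab
rng(1);   % fix the random seed so the run is repeatable

% Synthetic 3-D data: 4 Gaussian blobs (centers are illustrative only).
centers   = [0 0 0; 6 6 0; 0 6 6; 6 0 6];
nPerClass = 50;
trueLabel = repelem((1:4)', nPerClass);            % known class of each sample
X = centers(trueLabel, :) + randn(numel(trueLabel), 3);

% k-means with k-means++ seeding and several restarts.
idx = kmeans(X, 4, 'Start', 'plus', 'Replicates', 5);

% Estimate k when it is unknown: silhouette criterion over candidate values.
eva   = evalclusters(X, 'kmeans', 'silhouette', 'KList', 2:8);
kBest = eva.OptimalK;

fprintf('silhouette suggests k = %d\n', kBest);
```

The silhouette criterion is only one heuristic, and on overlapping classes it may suggest a different k than the number of true classes, so it is worth comparing its suggestion against the known labels where you have them.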