python - How to group sets by similarity in contained elements -
i using python 2.7. have routes composed of arrays of nodes connect each other. nodes identified string key, ease use numbers:
sample_route = [1,2,3,4,7] #obviously over-simplified; real things 20-40 elements long
i create set
made of tuple pairs of point connections using zip, end like:
set([(1,2),(2,3),(3,4),(4,7)])
i need way filter out routes similar (like 1 or 2 added nodes) , use minimal route out of near-duplicates. plan right to:
start first (probably optimal) route. iterate through rest of routes, , use following formula calculate similarity sequence in step 1:
matching = len(value1.difference(value2)) + len(value2.difference(value1)) #value1, value2 = 2 compared sets
the lower number, more similar. what way group routes based on similarity other routes? different lengths. have never taken statistics course.
example:
sets = [ set([(1,2),(2,3),(3,4),(4,5),(5,10)]), set([(1,2),(2,3),(3,4),(4,6),(6,5),(5,10)]), set([(1,7),(7,3),(3,8),(8,7),(7,6),(6,5),(5,10)]), set([(1,2),(2,3),(3,4),(4,6),(6,5),(5,10)]), set([(1,9),(9,4),(4,5),(5,10)]), set([(1,9),(9,4),(4,6),(6,5),(5,10)]) ]
in example, groupings may [[1,2,4],[3],[5,6]]
, 1, 2, , 4 similar, 5 , 6 similar, , 3 near of others. 1 2 have score of 2, , 3 6 have score of 8, examples. sort of data using (though these easy read simplifications).
there time benefit in this. if can remove redundant routes, trim off considerable amount of time.
i recommend looking networkx package. allows create directed graphs such describing. measuring similarity of 2 routes, recommend jaccard similarity index. here code showing example illustrated.
first, import few libraries: graphs, plotting, , numeric python. build directed graph adding nodes numbered 1 8. build connections node node build path. networkx package has built-in capability find paths in graph 1 node another: nx.all_simple_paths(g, start_node, end_node)
.
once have of paths, can compute matrix, j
, of jaccard similarity between paths. how want cluster paths similarity you.
import networkx nx import matplotlib.pyplot plt import numpy np g = nx.digraph() g.add_nodes_from(range(1,8)) g.add_edges_from([(1,2),(2,3),(3,4),(4,7)]) #path 1,2,3,4,7 g.add_edges_from([(4,5),(5,7)]) #path 1,2,3,4,5,7 g.add_edges_from([(4,6),(6,7)]) #path 1,2,3,4,6,7 paths_iter = nx.all_simple_paths(g,1,7) paths = [p p in paths] np.random.seed(100000) nx.draw_spring(g, with_labels=true) plt.show() def jaccard(v1, v2): return (len(np.intersect1d(v1,v2))+0.0)/len(np.union1d(v1,v2)) j = np.zeros([len(paths),len(paths)]) in range(j.shape[0]): j in range(i, j.shape[1]): j[i,j] = j[j,i] = jaccard(paths[i],paths[j]) print j > [[ 1. 0.71428571 0.83333333] > [ 0.71428571 1. 0.83333333] > [ 0.83333333 0.83333333 1. ]]
edit since have metric compare similarity of paths each other, k-means clustering cluster paths together.
from scipy.cluster.vq import kmeans2
i don't have enough of code or data anymore point.
Comments
Post a Comment