python - How to group sets by similarity in contained elements -


i using python 2.7. have routes composed of arrays of nodes connect each other. nodes identified string key, ease use numbers:

sample_route = [1,2,3,4,7]  #obviously over-simplified; real things 20-40 elements long 

i create set made of tuple pairs of point connections using zip, end like:

set([(1,2),(2,3),(3,4),(4,7)]) 

i need way filter out routes similar (like 1 or 2 added nodes) , use minimal route out of near-duplicates. plan right to:

start first (probably optimal) route. iterate through rest of routes, , use following formula calculate similarity sequence in step 1:

matching = len(value1.difference(value2)) + len(value2.difference(value1)) #value1, value2 = 2 compared sets 

the lower number, more similar. what way group routes based on similarity other routes? different lengths. have never taken statistics course.

example:

sets = [   set([(1,2),(2,3),(3,4),(4,5),(5,10)]),   set([(1,2),(2,3),(3,4),(4,6),(6,5),(5,10)]),   set([(1,7),(7,3),(3,8),(8,7),(7,6),(6,5),(5,10)]),   set([(1,2),(2,3),(3,4),(4,6),(6,5),(5,10)]),   set([(1,9),(9,4),(4,5),(5,10)]),   set([(1,9),(9,4),(4,6),(6,5),(5,10)]) ] 

in example, groupings may [[1,2,4],[3],[5,6]], 1, 2, , 4 similar, 5 , 6 similar, , 3 near of others. 1 2 have score of 2, , 3 6 have score of 8, examples. sort of data using (though these easy read simplifications).

there time benefit in this. if can remove redundant routes, trim off considerable amount of time.

i recommend looking networkx package. allows create directed graphs such describing. measuring similarity of 2 routes, recommend jaccard similarity index. here code showing example illustrated.

first, import few libraries: graphs, plotting, , numeric python. build directed graph adding nodes numbered 1 8. build connections node node build path. networkx package has built-in capability find paths in graph 1 node another: nx.all_simple_paths(g, start_node, end_node).

once have of paths, can compute matrix, j, of jaccard similarity between paths. how want cluster paths similarity you.

import networkx nx import matplotlib.pyplot plt import numpy np  g = nx.digraph() g.add_nodes_from(range(1,8)) g.add_edges_from([(1,2),(2,3),(3,4),(4,7)]) #path 1,2,3,4,7 g.add_edges_from([(4,5),(5,7)])             #path 1,2,3,4,5,7 g.add_edges_from([(4,6),(6,7)])             #path 1,2,3,4,6,7 paths_iter = nx.all_simple_paths(g,1,7) paths = [p p in paths]  np.random.seed(100000) nx.draw_spring(g, with_labels=true) plt.show()  def jaccard(v1, v2):     return (len(np.intersect1d(v1,v2))+0.0)/len(np.union1d(v1,v2))  j = np.zeros([len(paths),len(paths)]) in range(j.shape[0]):     j in range(i, j.shape[1]):         j[i,j] = j[j,i] = jaccard(paths[i],paths[j])  print j  > [[ 1.          0.71428571  0.83333333] >  [ 0.71428571  1.          0.83333333] >  [ 0.83333333  0.83333333  1.        ]] 

graph paths

edit since have metric compare similarity of paths each other, k-means clustering cluster paths together.

from scipy.cluster.vq import kmeans2

i don't have enough of code or data anymore point.


Comments

Popular posts from this blog

facebook - android ACTION_SEND to share with specific application only -

python - Creating a new virtualenv gives a permissions error -

javascript - cocos2d-js draw circle not instantly -