r - Tips for reducing the size of an ID field


I have a dataset with 30M rows. Some of the columns are ID fields containing large integers, e.g.

library(data.table)
dt <- data.table(someid = c(8762438732197823423,
                            1236487432893428732,
                            290234987238237842))

I suspect that reducing the size of these IDs would speed up joins and other processing on the data. What's a good method for doing this? For example, mapping the unique IDs to {1, 2, 3, ...} or {a, b, c, ...}. I'm also not sure which datatype (integer or character?) is best for storing these ID fields.

The maximum size of an integer in R is 2^31 - 1 = 2147483647. Larger integers can be stored in doubles, but as @Roland points out above, there is a limit of precision: the significand has 53 bits, so the largest integer that can be stored exactly is 2^53. Try 2^53 - (2^53 - 1) and you'll get 1. Try (2^53 + 1) - 2^53 and you'll get 0.
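A quick sketch of what that looks like at the console (R doubles are 64-bit IEEE, so the 53-bit significand limit applies on any standard platform):

# Largest value an R integer can hold: 2^31 - 1
.Machine$integer.max       # 2147483647

# Doubles represent integers exactly only up to 2^53
2^53 - (2^53 - 1)          # 1, both operands are exactly representable
(2^53 + 1) - 2^53          # 0, because 2^53 + 1 rounds back down to 2^53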

For the original question, there's not much in it if you're talking about using data.table joins:

library("data.table") library("microbenchmark") f <- d <- <- data.table(x = runif(100, 0, 2^53 - 1), y = rnorm(100)) g <- e <- b <- data.table(x = sample(a$x, 40), z = rnorm(40)) d$x <- sample(1:100) e$x <- d$x[match(b$x, a$x)] f$x <- as.character(f$x) g$x <- as.character(g$x) setkey(a, x) setkey(b, x) setkey(d, x) setkey(e, x) setkey(f, x) setkey(g, x) microbenchmark(a[b], d[e], f[g], times = 1000l)  ##unit: microseconds ## expr     min       lq     mean   median       uq      max neval ## a[b] 569.079 623.0495 696.8870 649.4150 699.8465 4160.852  1000 ## d[e] 570.141 621.2795 719.7463 647.6455 708.5170 5305.024  1000 ## f[g] 549.968 598.4520 665.4203 622.8720 674.1880 3667.156  1000 

If you wanted to do it anyway, you could create a new data.frame or data.table with columns containing the old and new IDs, and use the match function, as seen in the code above, to change the IDs over.
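A minimal sketch of that remapping, reusing the dt from the question (the idmap, oldid and newid names are just for illustration):

library(data.table)

dt <- data.table(someid = c(8762438732197823423,
                            1236487432893428732,
                            290234987238237842,
                            8762438732197823423))

# lookup table: one row per unique old ID, each paired with a compact integer
idmap <- data.table(oldid = unique(dt$someid))
idmap[, newid := seq_len(.N)]

# swap the big IDs for the small ones via match
dt[, someid := idmap$newid[match(someid, idmap$oldid)]]

Keep the idmap around if you ever need to translate the compact IDs back to the originals.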

