r - Tips for reducing the size of an ID field
I have a dataset with 30 million rows. Some of the columns include ID fields that are large integers, e.g.

library(data.table)
dt <- data.table(someid = c(8762438732197823423, 1236487432893428732, 290234987238237842))

I suspect that reducing the size of these IDs would speed up joins and other processing on the data. What's a good method for doing this? For example, mapping the unique IDs to {1, 2, 3, ...} or {a, b, c, ...}. I'm also not sure which datatype (integer or character?) is best for storing these ID fields.
The maximum size of an integer in R is 2^31 - 1 = 2147483647. Integers larger than that can be stored in doubles but, as @Roland points out above, there is a limit of precision: the significand of a double is 53 bits, so the largest integer that can be stored exactly is 2^53. Try 2^53 - (2^53 - 1) and you'll get 1; try (2^53 + 1) - 2^53 and you'll get 0.
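As a quick sanity check, here is a minimal sketch you can paste into any R session (the comments give the expected results):

# Largest value R's native integer type can hold
.Machine$integer.max              # 2147483647, i.e. 2^31 - 1
# Doubles represent integers exactly only up to 2^53
2^53 - (2^53 - 1)                 # 1: both operands are exact
(2^53 + 1) - 2^53                 # 0: 2^53 + 1 rounds back to 2^53
# The example IDs (~8.8e18) are far beyond 2^53 (~9.0e15), so they already
# lose precision if stored as doubles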
For the original question, there's not much in it if you're talking about using data.table joins:
library("data.table") library("microbenchmark") f <- d <- <- data.table(x = runif(100, 0, 2^53 - 1), y = rnorm(100)) g <- e <- b <- data.table(x = sample(a$x, 40), z = rnorm(40)) d$x <- sample(1:100) e$x <- d$x[match(b$x, a$x)] f$x <- as.character(f$x) g$x <- as.character(g$x) setkey(a, x) setkey(b, x) setkey(d, x) setkey(e, x) setkey(f, x) setkey(g, x) microbenchmark(a[b], d[e], f[g], times = 1000l) ##unit: microseconds ## expr min lq mean median uq max neval ## a[b] 569.079 623.0495 696.8870 649.4150 699.8465 4160.852 1000 ## d[e] 570.141 621.2795 719.7463 647.6455 708.5170 5305.024 1000 ## f[g] 549.968 598.4520 665.4203 622.8720 674.1880 3667.156 1000
If you wanted to anyway, you could create a new data.frame or data.table with columns containing the old and new IDs, and use the match function seen in the code above to change the IDs over, as sketched below.
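A minimal sketch of that remapping (the table name idmap and column name newid are just illustrative; the IDs are kept as character strings to sidestep the double-precision issue above):

library(data.table)

dt <- data.table(someid = c("8762438732197823423", "1236487432893428732",
                            "290234987238237842", "8762438732197823423"))

# Lookup table pairing each distinct old ID with a compact integer surrogate
idmap <- data.table(someid = unique(dt$someid))
idmap[, newid := seq_len(.N)]

# Swap the old IDs for the new ones via match()
dt[, newid := idmap$newid[match(someid, idmap$someid)]]

Keeping idmap around means you can always join the original IDs back on later; a one-line alternative is dt[, newid := as.integer(factor(someid))], at the cost of losing that explicit mapping table.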