as.h2o() in R to upload files to the h2o environment takes a long time
I am using h2o to carry out some modelling, and having tuned the model, I would now like to use it to carry out a lot of predictions - approximately 6 billion predictions/rows, where each prediction row needs 80 columns of data.

I have broken the input dataset down into 500 chunks of 12 million rows, each with the relevant 80 columns of data.

However, uploading a data.table of 12 million rows by 80 columns to h2o takes quite a long time, and doing this 500 times is taking a prohibitively long time... I think it is because the object is parsed first before being uploaded.

The prediction part is relatively quick in comparison...

Are there any suggestions to speed this part up? Would changing the number of cores help?

Below is a reproducible example of the issue...
# load libraries
library(h2o)
library(data.table)

# start h2o using all cores...
localH2O = h2o.init(nthreads = -1, max_mem_size = "16g")

# create a test input dataset
temp <- CJ(v1 = seq(20), v2 = seq(7), v3 = seq(24), v4 = seq(60), v5 = seq(60))
temp <- do.call(cbind, lapply(seq(16), function(y) { temp }))
colnames(temp) <- paste0('v', seq(80))

# this part takes a long time!!
system.time(tmp.obj <- as.h2o(localH2O, temp, key = 'test_input'))
#|======================================================================| 100%
#   user  system elapsed
#357.355   6.751 391.048
Since you are running h2o locally, you will want to save the data to a file and use:

h2o.importFile(localH2O, file_path, key = 'test_input')
This will have each thread read parts of the file in parallel. If you run h2o on a separate server, then you would need to copy the data to a location that the server can read from (most people don't set up their servers to read files from their laptops).
as.h2o() serially uploads the file to h2o. With h2o.importFile(), the h2o server finds the file and reads it in parallel.
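Putting that together, the write-then-import workflow might look like the sketch below. This assumes the h2o v2-style API used in the question; the file path is a placeholder, and data.table::fwrite is assumed to be available as a fast CSV writer (base write.csv works too, just more slowly).

```r
# Sketch: write the chunk to disk, then let h2o parse it in parallel.
# 'file_path' is a placeholder and must be readable by the h2o server process.
library(h2o)
library(data.table)

file_path <- "/tmp/test_input.csv"
fwrite(temp, file_path)  # fast multi-threaded CSV write (data.table >= 1.9.8)

# h2o reads and parses the file using all of its threads
tmp.obj <- h2o.importFile(localH2O, file_path, key = 'test_input')
```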
It looks like you are using version 2 of h2o. The same commands work in h2o v3, but some of the parameter names have changed a little. The new parameter names are here: http://cran.r-project.org/web/packages/h2o/h2o.pdf
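For reference, under h2o v3 the client handle argument is dropped and key becomes destination_frame, so the import above would be written roughly as follows (the path is a placeholder):

```r
# h2o v3 equivalent: no client handle, and 'key' renamed to 'destination_frame'
library(h2o)
h2o.init(nthreads = -1, max_mem_size = "16g")
test_input <- h2o.importFile(path = "/tmp/test_input.csv",
                             destination_frame = "test_input")
```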