concurrency - Thrust execution policy issues kernel to default stream


I am designing a short tutorial exhibiting various aspects and capabilities of the Thrust template library.

Unfortunately, there seems to be a problem in the code I have written in order to show how to use copy/compute concurrency using CUDA streams.

My code can be found here, in the asynchronouslaunch directory: https://github.com/gnthibault/cuda_thrust_introduction/tree/master/asynchronouslaunch

Here is an abstract of the code that generates the problem:

//STL
#include <cstdlib>
#include <algorithm>
#include <iostream>
#include <vector>
#include <functional>

//Thrust
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/scan.h>

//Cuda
#include <cuda_runtime.h>

//Local
#include "asynchronouslaunch.cu.h"

int main( int argc, char* argv[] )
{
    const size_t fullSize = 1024*1024*64;
    const size_t halfSize = fullSize/2;

    //Declare one host std::vector and initialize it with random values
    std::vector<float> hostVector( fullSize );
    std::generate(hostVector.begin(), hostVector.end(), normalRandomFunctor<float>(0.f,1.f) );

    //And two device vectors of half size
    thrust::device_vector<float> deviceVector0( halfSize );
    thrust::device_vector<float> deviceVector1( halfSize );

    //Declare and initialize two cuda streams
    cudaStream_t stream0, stream1;
    cudaStreamCreate( &stream0 );
    cudaStreamCreate( &stream1 );

    //Now, perform an alternate copy/compute scheme
    for( int i = 0; i < 10; i++ )
    {
        //Wait for the end of the copy back to host before starting the copy to device
        cudaStreamSynchronize(stream0);
        //Warning: thrust::copy does not handle asynchronous behaviour for host/device copy, you must use cudaMemcpyAsync
        cudaMemcpyAsync(thrust::raw_pointer_cast(deviceVector0.data()), thrust::raw_pointer_cast(hostVector.data()), halfSize*sizeof(float), cudaMemcpyHostToDevice, stream0);
        cudaStreamSynchronize(stream1);
        //The second copy is likely to occur sequentially after the first one
        cudaMemcpyAsync(thrust::raw_pointer_cast(deviceVector1.data()), thrust::raw_pointer_cast(hostVector.data())+halfSize, halfSize*sizeof(float), cudaMemcpyHostToDevice, stream1);

        //Compute on device, here an element-wise transform, for histogram equalization for instance
        thrust::transform( thrust::cuda::par.on(stream0), deviceVector0.begin(), deviceVector0.end(), deviceVector0.begin(), computeFunctor<float>() );
        thrust::transform( thrust::cuda::par.on(stream1), deviceVector1.begin(), deviceVector1.end(), deviceVector1.begin(), computeFunctor<float>() );

        //Copy back to host
        cudaMemcpyAsync(thrust::raw_pointer_cast(hostVector.data()), thrust::raw_pointer_cast(deviceVector0.data()), halfSize*sizeof(float), cudaMemcpyDeviceToHost, stream0);
        cudaMemcpyAsync(thrust::raw_pointer_cast(hostVector.data())+halfSize, thrust::raw_pointer_cast(deviceVector1.data()), halfSize*sizeof(float), cudaMemcpyDeviceToHost, stream1);
    }

    //Full synchronize before exit
    cudaDeviceSynchronize();

    cudaStreamDestroy( stream0 );
    cudaStreamDestroy( stream1 );

    return EXIT_SUCCESS;
}
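The header asynchronouslaunch.cu.h referenced above comes from the linked repository and is not reproduced in this question. Purely as a hypothetical sketch (a guess based on how the functors are used here and on the computeFunctor shown in the edited solution below, not the actual repository code), the two functors it must provide could look like this:

#include <random>
#include <cmath>

//Host-side generator of normally distributed values, used with std::generate above
template<typename T>
struct normalRandomFunctor
{
    T mMean, mStdDev;
    normalRandomFunctor(T mean, T stddev) : mMean(mean), mStdDev(stddev) {}
    T operator()()
    {
        static std::default_random_engine engine;
        std::normal_distribution<T> distribution(mMean, mStdDev);
        return distribution(engine);
    }
};

//Deliberately expensive element-wise operation, so the generated kernels are visible in the profiler
template<typename T>
struct computeFunctor
{
    __host__ __device__
    T operator()(T in)
    {
        T a = cos(in);
        for (int i = 0; i < 350; i++)
        {
            a += cos(in);
        }
        return a;
    }
};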

Here are the results of one instance of the program, observed through the NVIDIA Visual Profiler:

(Profiler screenshot: kernels issued to the default stream)

As you can see, the cudaMemcpyAsync calls (in brown) are both issued to streams 13 and 14, but the kernels generated by thrust::transform are issued to the default stream (in blue in the capture).

By the way, I am using CUDA Toolkit version 7.0.28, with a GTX 680 and gcc 4.8.2.

I would be grateful if someone could tell me what is wrong with my code.

Thank you in advance.

EDIT: here is the code that I consider to be a solution:

//STL
#include <cstdlib>
#include <algorithm>
#include <iostream>
#include <functional>
#include <vector>

//Thrust
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/execution_policy.h>

//Cuda
#include <cuda_runtime.h>

//Local definitions

template<typename T>
struct computeFunctor
{
    __host__ __device__
    computeFunctor() {}

    __host__ __device__
    T operator()( T in )
    {
        //Naive functor that generates expensive but useless instructions
        T a = cos(in);
        for(int i = 0; i < 350; i++ )
        {
            a += cos(in);
        }
        return a;
    }
};

int main( int argc, char* argv[] )
{
    const size_t fullSize = 1024*1024*2;
    const size_t nbOfStrip = 4;
    const size_t stripSize = fullSize/nbOfStrip;

    //Allocate host pinned memory in order to use the asynchronous API, and initialize it
    float* hostVector;
    cudaMallocHost(&hostVector, fullSize*sizeof(float));
    std::fill(hostVector, hostVector+fullSize, 1.0f );

    //And one device vector of the same size
    thrust::device_vector<float> deviceVector( fullSize );

    //Declare and initialize nbOfStrip cuda streams
    std::vector<cudaStream_t> vStream(nbOfStrip);
    for( auto it = vStream.begin(); it != vStream.end(); it++ )
    {
        cudaStreamCreate( &(*it) );
    }

    //Now, perform an alternate copy/compute scheme in a loop, using a copyToDevice/compute/copyToHost-per-stream scheme:
    for( int i = 0; i < 5; i++ )
    {
        for( int j=0; j!=nbOfStrip; j++)
        {
            size_t offset = stripSize*j;
            size_t nextOffset = stripSize*(j+1);
            cudaStreamSynchronize(vStream.at(j));
            cudaMemcpyAsync(thrust::raw_pointer_cast(deviceVector.data())+offset, hostVector+offset, stripSize*sizeof(float), cudaMemcpyHostToDevice, vStream.at(j));
            thrust::transform( thrust::cuda::par.on(vStream.at(j)), deviceVector.begin()+offset, deviceVector.begin()+nextOffset, deviceVector.begin()+offset, computeFunctor<float>() );
            cudaMemcpyAsync(hostVector+offset, thrust::raw_pointer_cast(deviceVector.data())+offset, stripSize*sizeof(float), cudaMemcpyDeviceToHost, vStream.at(j));
        }
    }
    //On devices that do not possess the multiple-copy-engine capability, this solution serializes all commands even if they have been issued to different streams.
    //Why? Because from the point of view of the copy engine, which is a single resource in this case, there is a time dependency between HtoD(n) and DtoH(n), which is ok, but there
    //is also a false dependency between DtoH(n) and HtoD(n+1), which precludes any copy/compute overlap.

    //Full synchronize before testing the second solution
    cudaDeviceSynchronize();

    //Now, perform an alternate copy/compute scheme in a loop, using a copyToDevice-for-each-stream / compute-for-each-stream / copyToHost-for-each-stream scheme:
    for( int i = 0; i < 5; i++ )
    {
        for( int j=0; j!=nbOfStrip; j++)
        {
            cudaStreamSynchronize(vStream.at(j));
        }
        for( int j=0; j!=nbOfStrip; j++)
        {
            size_t offset = stripSize*j;
            cudaMemcpyAsync(thrust::raw_pointer_cast(deviceVector.data())+offset, hostVector+offset, stripSize*sizeof(float), cudaMemcpyHostToDevice, vStream.at(j));
        }
        for( int j=0; j!=nbOfStrip; j++)
        {
            size_t offset = stripSize*j;
            size_t nextOffset = stripSize*(j+1);
            thrust::transform( thrust::cuda::par.on(vStream.at(j)), deviceVector.begin()+offset, deviceVector.begin()+nextOffset, deviceVector.begin()+offset, computeFunctor<float>() );
        }
        for( int j=0; j!=nbOfStrip; j++)
        {
            size_t offset = stripSize*j;
            cudaMemcpyAsync(hostVector+offset, thrust::raw_pointer_cast(deviceVector.data())+offset, stripSize*sizeof(float), cudaMemcpyDeviceToHost, vStream.at(j));
        }
    }
    //On devices that do not possess multiple copy-engine queues, this solution should yield better results; on the others, it should show identical results.

    //Full synchronize before exit
    cudaDeviceSynchronize();

    for( auto it = vStream.begin(); it != vStream.end(); it++ )
    {
        cudaStreamDestroy( *it );
    }
    cudaFreeHost( hostVector );

    return EXIT_SUCCESS;
}

Compiled using: nvcc ./test.cu -o ./test.exe -std=c++11
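Incidentally, whether a given GPU has one or two copy engines (which the comments in the code above refer to, and which determines whether host-to-device and device-to-host transfers can overlap with each other) can be queried through cudaDeviceProp::asyncEngineCount. A minimal sketch of such a check, added here only for reference:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    //asyncEngineCount == 1: a single copy engine, HtoD and DtoH transfers serialize against each other
    //asyncEngineCount == 2: transfers in both directions can overlap
    printf("%s: asyncEngineCount = %d\n", prop.name, prop.asyncEngineCount);
    return 0;
}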

There are two things I would point out. Both of these are (now) referenced in this related question/answer, which you may wish to refer to.

  1. The failure of thrust to issue the underlying kernels to non-default streams in this case seems to be related to this issue. It can be rectified (as covered in the comments to the question) by updating to the latest thrust version. Future CUDA versions (beyond 7) will likely include a fixed thrust as well. This is the central issue being discussed in this question. (A small sketch for checking the installed Thrust version follows this list.)

  2. The question seems to suggest that one of the goals is overlap of copy and compute:

    in order to show how to use copy/compute concurrency using cuda streams

    But this won't be achievable, I don't think, with the code as crafted, even if item 1 above is fixed. Overlap of copy with compute operations requires the proper use of cuda streams on the copy operation (cudaMemcpyAsync) as well as a pinned host allocation. The code proposed in the question lacks any use of a pinned host allocation (std::vector does not use a pinned allocator by default, AFAIK), so I would not expect the cudaMemcpyAsync operation to overlap with kernel activity, even if it were otherwise possible. To rectify this, a pinned allocator should be used, and one such example is given here. (A pinned-allocation sketch also follows this list.)
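The following two sketches are added here only as illustrations; they are not part of the linked material. First, regarding item 1, a minimal way to check which Thrust version a given toolkit ships, using the macros from thrust/version.h:

#include <thrust/version.h>
#include <cstdio>

int main()
{
    //THRUST_VERSION encodes major*100000 + minor*100 + subminor
    printf("Thrust %d.%d.%d (THRUST_VERSION = %d)\n",
           THRUST_MAJOR_VERSION, THRUST_MINOR_VERSION,
           THRUST_SUBMINOR_VERSION, THRUST_VERSION);
    return 0;
}

Second, regarding item 2, a sketch of two ways to obtain pinned (page-locked) host memory so that cudaMemcpyAsync can actually behave asynchronously (illustrative only; not the example behind the "here" link above):

#include <vector>
#include <cuda_runtime.h>

int main()
{
    const size_t fullSize = 1024*1024*64;

    //Option 1: allocate pinned host memory directly
    float* pinned = nullptr;
    cudaMallocHost(&pinned, fullSize*sizeof(float));
    //... use 'pinned' with cudaMemcpyAsync on non-default streams ...
    cudaFreeHost(pinned);

    //Option 2: keep the std::vector and pin its existing storage
    std::vector<float> hostVector(fullSize);
    cudaHostRegister(hostVector.data(), fullSize*sizeof(float), cudaHostRegisterDefault);
    //... use hostVector.data() with cudaMemcpyAsync on non-default streams ...
    cudaHostUnregister(hostVector.data());

    return 0;
}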

For completeness, the question was otherwise lacking an MCVE, which is expected for questions of this type. That makes it more difficult for others to attempt to test the issue, and is explicitly a close reason on SO. Yes, you provided a link to an external github repo, but that behavior is frowned on. The MCVE requirement explicitly states that the necessary pieces should be included in the question itself (not as an external reference). Since the only lacking piece, AFAICT, is "asynchronouslaunch.cu.h", it seems it would have been relatively straightforward to include that one additional piece in the question. The problem with external links is that when they break in the future, the question becomes less useful to future readers. (And, forcing others to navigate an external github repo looking for specific files is not conducive to getting help, in my opinion.)

