1.
Assume that a kernel is launched with 1000 thread blocks each of which has 512 threads. If a variable is declared as a shared memory variable, how many versions of the variable will be created through the lifetime of the execution of the kernel?
Correct Answer
B. 1000
Explanation
In this scenario, since the variable is declared as a shared memory variable, there will be one version of the variable created for each thread block. Since there are 1000 thread blocks, there will be 1000 versions of the variable created throughout the execution of the kernel.
2.
For the tiled single-precision matrix multiplication kernel, assume that the tile size is 32X32 and the system has a DRAM burst size of 128 bytes. How many DRAM bursts will be delivered to the processor as a result of loading one A-matrix tile by a thread block?
Correct Answer
B. 32
Explanation
When loading one A-matrix tile by a thread block, each element in the tile is a single-precision value, which requires 4 bytes of memory. The tile size is 32x32, so there are a total of 32*32 = 1024 elements in the tile. Multiplying the number of elements by the size of each element gives 1024 * 4 = 4096 bytes. Since the DRAM burst size is 128 bytes, dividing the total size of the tile (4096 bytes) by the burst size (128 bytes) gives 4096 / 128 = 32 bursts. Therefore, 32 DRAM bursts will be delivered to the processor as a result of loading one A-matrix tile by a thread block.
3.
We want to use each thread to calculate two (adjacent) output elements of a vector addition. Assume that variable i should be the index for the first element to be processed by a thread. What would be the expression for mapping the thread/block indices to data index of the first element?
Correct Answer
C. i = (blockIdx.x*blockDim.x + threadIdx.x)*2
Explanation
The expression i = (blockIdx.x*blockDim.x + threadIdx.x)*2 is the correct answer. The term blockIdx.x*blockDim.x + threadIdx.x computes the global thread index; multiplying it by 2 gives the index of the first of the two adjacent elements that the thread processes. Thread 0 handles elements 0 and 1, thread 1 handles elements 2 and 3, and so on, so every element is covered exactly once.
4.
We are to process a 600X800 (800 pixels in the x or horizontal direction, 600 pixels in the y or vertical direction) picture with the PictureKernel(). That is, m's value is 600 and n's value is 800.

__global__ void PictureKernel(float* d_Pin, float* d_Pout, int n, int m) {
  // Calculate the row # of the d_Pin and d_Pout element to process
  int Row = blockIdx.y*blockDim.y + threadIdx.y;
  // Calculate the column # of the d_Pin and d_Pout element to process
  int Col = blockIdx.x*blockDim.x + threadIdx.x;
  // Each thread computes one element of d_Pout if in range
  if ((Row < m) && (Col < n)) {
    d_Pout[Row*n+Col] = 2*d_Pin[Row*n+Col];
  }
}

Assume that we decided to use 16X16 blocks. That is, each block is organized as a 2D 16X16 array of threads. How many warps will be generated during the execution of the kernel?
Correct Answer
C. 38*8*50
Explanation
Each block is organized as a 2D 16x16 array of threads, so a block contains 16*16 = 256 threads, which form 256/32 = 8 warps. To cover the 800 pixels in the x direction, the grid needs 800/16 = 50 blocks; to cover the 600 pixels in the y direction, it needs ceil(600/16) = 38 blocks (37 full blocks plus one partially used block, since 600 is not a multiple of 16). The total number of warps generated is therefore 38 * 50 * 8 = 15,200.
5.
If a CUDA device’s SM (streaming multiprocessor) can take up to 1,536 threads and up to 8 thread blocks, which of the following block configurations would result in the greatest number of threads in each SM?
Correct Answer
C. 512 threads per block
Explanation
A CUDA device's SM can take up to 1,536 threads and up to 8 thread blocks, and both limits must be respected at the same time. For each block configuration, the number of resident blocks is whichever limit is reached first:
If we choose 64 threads per block, the thread limit alone would allow 1,536 / 64 = 24 blocks, but the SM caps this at 8 blocks, resulting in 8 * 64 = 512 threads.
If we choose 128 threads per block, the thread limit would allow 1,536 / 128 = 12 blocks, again capped at 8 blocks, resulting in 8 * 128 = 1,024 threads.
If we choose 512 threads per block, we would have 1,536 / 512 = 3 blocks per SM, which is within the 8-block limit, resulting in 3 * 512 = 1,536 threads.
If we choose 1,024 threads per block, only 1 block fits, since 2 blocks would require 2,048 > 1,536 threads, resulting in 1 * 1,024 = 1,024 threads.
Therefore, the block configuration that results in the most threads in each SM is 512 threads per block, which fully occupies the SM with 1,536 threads.
6.
Assume the following simple matrix multiplication kernel:

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
  int Row = blockIdx.y*blockDim.y + threadIdx.y;
  int Col = blockIdx.x*blockDim.x + threadIdx.x;
  if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k) {
      Pvalue += M[Row*Width+k] * N[k*Width+Col];
    }
    P[Row*Width+Col] = Pvalue;
  }
}

Which of the following is true?
Correct Answer
C. M[Row*Width+k] is not coalesced but N[k*Width+Col] and P[Row*Width+Col] both are
Explanation
Within a warp, threads with consecutive threadIdx.x values have consecutive Col values, while Row and k are the same for all of them in a given loop iteration. For N[k*Width+Col] and P[Row*Width+Col], the index increases by 1 as Col increases by 1, so consecutive threads access consecutive memory locations and the accesses are coalesced. For M[Row*Width+k], the index does not depend on threadIdx.x at all, and threads that differ in threadIdx.y access elements a full row (Width elements) apart, so these accesses do not form a coalesced pattern. Therefore M[Row*Width+k] is not coalesced, but N[k*Width+Col] and P[Row*Width+Col] both are.
7.
For the simple reduction kernel, if the block size is 1,024 and the warp size is 32, how many warps in a block will have divergence during the 5th iteration?
Correct Answer
D. 32
Explanation
In the simple reduction kernel, the stride doubles each iteration (1, 2, 4, 8, 16, ...), and a thread executes the addition only when its index satisfies t % (2*stride) == 0. During the 5th iteration the stride is 16, so only threads whose index is a multiple of 32 are active, exactly one thread in each 32-thread warp. Every warp therefore contains both active and inactive threads and must follow both branch paths, so all 1,024 / 32 = 32 warps in the block have divergence.
8.
For the following basic reduction kernel code fragment, if the block size is 1024 and warp size is 32, how many warps in a block will have divergence during the iteration where stride is equal to 1?

unsigned int t = threadIdx.x;
unsigned int start = 2*blockIdx.x*blockDim.x;
partialSum[t] = input[start + t];
partialSum[blockDim.x+t] = input[start + blockDim.x+t];
for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
  __syncthreads();
  if (t % stride == 0) {
    partialSum[2*t] += partialSum[2*t+stride];
  }
}
Correct Answer
A. 0
Explanation
During the iteration where the stride is equal to 1, no warp in the block has divergence. The condition t % stride == 0 selects only threads whose index is divisible by the stride, and every integer is divisible by 1, so all 1,024 threads satisfy the condition and take the same path through the if statement. Since every warp executes the branch uniformly, the answer is 0.
9.
SM implements zero overhead scheduling because –
Correct Answer
C. Both are correct
Explanation
Both statements are correct. In an SM (streaming multiprocessor), zero-overhead scheduling means the hardware can switch among the warps resident on the SM without any context-switch cost: a warp whose next instruction has its operands ready for consumption is eligible for execution, so the scheduler never stalls waiting on any one warp. It is also true that when a warp is selected, all of its threads execute the same instruction in lockstep. Therefore, both statements are correct.
10.
__device__ __constant__ int mask = 10 will have memory, scope and lifetime defined as
Correct Answer
D. Constant, grid and application
Explanation
The "__constant__" qualifier places the variable "mask" in the device's constant memory, not in global memory; kernels can read the variable but cannot modify it during execution. Its scope is the grid, meaning it is accessible to all threads within a grid. Its lifetime is the application, since the variable persists across kernel launches for the duration of the program. Therefore, the correct answer is constant (memory), grid (scope) and application (lifetime).