Name:

**Enrolment No:**

## **UPES**

## **End Semester Examination, May 2025**

Course: GPU Programming

Program: B. Tech (CSE), Graphics & Gaming

Semester: VI

Time: 03 hrs.

Course Code: CSGG3018

Max. Marks: 100

**Instructions:** Please attempt according to the provided time and given weightage.

## SECTION A (5Qx4M=20 Marks)

| S. No. | Question | Marks | CO |
|--------|----------|-------|-----|
| Q 1 | Define the term GPGPU. List two tools used for GPGPU development. | 3+1 | CO1 |
| Q 2 | Discuss the key difference between task parallelism and data parallelism, providing relevant examples to illustrate each concept. | 4 | CO1 |
| Q 3 | Describe the function of the Thread Execution Manager in GPU architecture and list its primary responsibilities (name each responsibility only, no explanation required). | 2+2 | CO2 |
| Q 4 | Differentiate deadlocks from race conditions, providing effective prevention/detection methods. | 4 | CO1 |
| Q 5 | Differentiate between concurrency and parallelism by explaining at least two key differences. | 4 | CO1 |

## SECTION B (4Qx10M=40 Marks)

| S. No. | Question | Marks | CO |
|--------|----------|-------|-----|
| Q 6 | List two factors that prevent a CUDA kernel from achieving a million-times speedup even when the compute-to-global memory access (CGMA) ratio is one. Suggest mitigations for each of these issues. | 10 | CO3 |
| Q 7 | <ul><li>(i) Explain how tiled matrix multiplication improves performance in GPU computing.</li><li>(ii) Using 4×4 matrices A and B, illustrate the two-phase computation steps involved in tiled matrix multiplication, showing:<ul><li>(a) how a 2×2 tile of A and B is loaded into shared memory (Phase 1)</li><li>(b) the step-by-step computation of one output tile in C (Phase 2) using the loaded tiles.</li></ul></li></ul> | 2+8 | CO3 |
| Q 8 | (i) Given below is a CUDA kernel:<pre>\_\_global\_\_ void kernel(int *a) {<br>    int i = threadIdx.x + blockIdx.x * blockDim.x;<br>    if (i % 3 == 0) {<br>        a[i] = blockIdx.x * 10 + threadIdx.x;<br>    }<br>}</pre>The launch configuration is kernel&lt;&lt;&lt;2, 6&gt;&gt;&gt;(a); If the initial state of a = [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, | 5+2.5+2.5 | CO2, CO3 |
| | The launch configuration is kernel&lt;&lt;&lt;1, 6&gt;&gt;&gt;(a); What is the output array a? | | |
| | (iii) For the CUDA kernel below:<pre>\_\_global\_\_ void kernel(int *a) {<br>    int i = threadIdx.x + blockIdx.x * blockDim.x;<br>    a[i] = i + 1;<br>}</pre>The launch configuration is kernel&lt;&lt;&lt;2, 4&gt;&gt;&gt;(a); What is the output array a? | | |
| Q 9 | (i) For each of the CUDA memory types <b>registers, shared, global, local</b>, and <b>constant</b>, compare two defining traits, such as scope, performance, lifetime, or access constraints.<br>OR | 10 | 600 |
| | (ii) Given two matrices, A and B, each of dimension m × m:<ul><li>(a) Write a simple CUDA kernel to compute the product C = A × B. Assume matrices are stored in row-major order.</li><li>(b) Specify the kernel launch configuration when the matrix dimension is 100 × 100. Justify your choices.</li></ul> | 7+3 | CO2, CO3 |

## SECTION C (2Qx20M=40 Marks)

| S. No. | Question | Marks | CO |
|--------|----------|-------|-----|
| Q 10 | (i) Discuss at least <b>two</b> key advantages of <b>OpenACC</b>.<br>(ii) Give two code examples:<ul><li>1. A loop without data dependencies.</li><li>2. A loop with data dependencies.</li></ul>Describe in detail how OpenACC handles each case when #pragma acc parallel loop is applied. Clearly state how the compiler reacts in each case.<br>OR | 5+15 | CO4 |
| | (iii) Compare OpenACC's <b>kernels</b> construct ("<b>#pragma acc kernels</b>") with the <b>parallel</b> construct paired with the <b>loop</b> directive ("<b>#pragma acc parallel loop</b>"), providing suitable code examples to illustrate differences in parallelization approach and compiler behavior. | 20 | |
| Q 11 | <ul><li>(i) Explain the concepts of global dimensions, local work-groups, and work-items in OpenCL. Compare these compute model concepts to their CUDA programming equivalents.</li><li>(ii) Illustrate OpenCL's memory model with a labeled diagram showing: processing elements, compute units, register/global memory, and their access relationships.</li></ul> | 12+8 | CO3 |