Implementing Bottom-Up Iterative Merge Sort in CUDA: A Performance Analysis

In this project for Google Summer of Code (GSoC) '21 under the JuliaLang organization, Ashwani Rathee implemented an efficient parallel merge sort algorithm in CUDA. He explored several aspects of the problem: dividing work between the CPU and the GPU, tuning performance with different values of THREAD_PER_BLOCK, and measuring the effect of having multiple threads solve subproblems concurrently while waiting for others to finish. The study concluded by outlining future work, such as implementing further parallel merge sort variants, stress-testing arrays of up to 10^18 elements, optimizing with shared memory, and combining GPU and CPU approaches at specific levels of the merge, possibly in conjunction with existing libraries such as thrust::sort().
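To make the bottom-up iterative structure concrete, here is a minimal CPU sketch of the algorithm (not the project's CUDA code). The key property it illustrates is that within each pass every merge touches a disjoint pair of runs, which is what lets the CUDA version assign merges to independent threads or blocks of THREAD_PER_BLOCK threads:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Bottom-up (iterative) merge sort: the run width doubles each pass
// (1, 2, 4, ...), so after ceil(log2(n)) passes the array is sorted.
// Every merge inside one pass works on a disjoint slice of the array,
// so a GPU implementation can launch all of a pass's merges in parallel.
void bottomUpMergeSort(std::vector<int>& a) {
    const std::size_t n = a.size();
    std::vector<int> buf(n);
    for (std::size_t width = 1; width < n; width *= 2) {
        for (std::size_t lo = 0; lo < n; lo += 2 * width) {
            const std::size_t mid = std::min(lo + width, n);
            const std::size_t hi  = std::min(lo + 2 * width, n);
            // Merge the sorted runs a[lo, mid) and a[mid, hi) into buf.
            std::merge(a.begin() + lo, a.begin() + mid,
                       a.begin() + mid, a.begin() + hi,
                       buf.begin() + lo);
        }
        std::swap(a, buf);  // sorted runs of length 2*width now live in a
    }
}
```

In the GPU setting, each iteration of the outer loop corresponds to one kernel launch, with an implicit synchronization between passes so that all merges of width `w` finish before the `2w` pass begins.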

Ashwani also discussed the learning outcomes of the project, which included gaining an understanding of merge sort variants that had previously eluded him. He learned basic CUDA concepts while working on medium-difficulty tasks involving data transfer between host and device and performance optimization through parallelization.

The results compared the CPU and GPU approaches for array sizes ranging from 10^1 to 10^4 and showed a clear overhead from sending and receiving data across the CPU-GPU boundary. The study suggests further exploration of behavior on larger arrays, as well as optimizations such as using shared memory, combining different algorithms at specific levels of the merge, and making better use of the available parallelism on each device.
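Why transfer overhead dominates at the 10^1 to 10^4 sizes studied can be seen with a back-of-the-envelope cost model. All constants below are illustrative assumptions (a rough PCIe bandwidth and latency, and an optimistic per-element sort throughput), not measurements from the article:

```cpp
#include <cstddef>

// Illustrative cost model (assumed figures, not the article's data):
// PCIe transfer at ~12 GB/s with ~10 us fixed latency per copy, vs. an
// optimistic GPU sort throughput of ~1 ns per element. Returns true when
// the round-trip copy of an n-element int array takes longer than
// sorting it on the device, i.e. when transfer overhead dominates.
bool transferDominates(std::size_t n) {
    const double bytes        = static_cast<double>(n) * sizeof(int);
    const double bandwidth    = 12e9;   // bytes/second, assumed
    const double latency      = 10e-6;  // seconds per copy, assumed
    const double transferTime = 2.0 * (bytes / bandwidth + latency);  // H2D + D2H
    const double sortTime     = static_cast<double>(n) * 1e-9;        // assumed
    return transferTime > sortTime;
}
```

Under these assumptions every array in the 10^1 to 10^4 range is transfer-bound, which is consistent with the overhead the study observed and motivates both the larger-array experiments and the hybrid CPU/GPU ideas listed as future work.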

In conclusion, this project provided valuable hands-on experience with parallel computing in CUDA while implementing an efficient merge sort. It also opened doors to future research on optimizing host-device data transfer, using shared memory effectively, and combining different approaches at specific levels for better overall performance.

Complete Article after the Jump: Here!