I had the following problem:
Each CUDA thread may write to any cell of a 3D array in global memory, so some form of synchronization is required.
The easiest solution is to use the atomic operations that CUDA provides. Unfortunately, their performance isn't great, so I tried to improve on it by implementing a locking mechanism of my own: a simple lock.
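To make the atomic baseline concrete, here is a minimal sketch of threads scattering increments into a flattened 3D array with `atomicAdd`. The grid size, the mapping from thread to cell, and all names (`scatterAtomic`, `flatIndex`, `runScatterAtomic`) are assumptions made up for illustration, not my actual kernel.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical grid dimensions, chosen only for this sketch.
constexpr int NX = 8, NY = 8, NZ = 8;
constexpr int CELLS = NX * NY * NZ;

// Row-major flattening of a 3D coordinate into a 1D offset.
__device__ int flatIndex(int x, int y, int z) { return (z * NY + y) * NX + x; }

__global__ void scatterAtomic(int* grid, int numThreads) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= numThreads) return;
    // Made-up target cell; in the real problem it would be data dependent.
    int x = tid % NX, y = (tid / 5) % NY, z = (tid / 11) % NZ;
    // Many threads may hit the same cell, so the update must be atomic.
    atomicAdd(&grid[flatIndex(x, y, z)], 1);
}

// Launches the kernel and returns the sum over all cells.
long runScatterAtomic(int numThreads) {
    int* dGrid = nullptr;
    cudaMalloc(&dGrid, CELLS * sizeof(int));
    cudaMemset(dGrid, 0, CELLS * sizeof(int));
    scatterAtomic<<<(numThreads + 255) / 256, 256>>>(dGrid, numThreads);
    std::vector<int> host(CELLS);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(host.data(), dGrid, CELLS * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dGrid);
    long total = 0;
    for (int v : host) total += v;
    return total;
}

int main() {
    const int numThreads = 1 << 16;
    long total = runScatterAtomic(numThreads);
    printf("increments recorded: %ld of %d\n", total, numThreads);
    return total == numThreads ? 0 : 1;  // no update may be lost
}
```

Integer counters are used here so that the result can be checked exactly: every thread adds 1, so the sum over all cells should equal the number of threads.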
The result: this approach is absolutely useless. I experienced a performance loss of nearly 1000x.
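For illustration, a hand-rolled lock of this kind might look roughly like the sketch below. It is not my original code: it assumes a single global spin lock built from `atomicCAS` and `atomicExch`, and the names (`scatterLocked`, `runScatterLocked`) and the thread-to-cell mapping are made up. The acquire sits inside a `while (!done)` loop rather than a bare spin, a common pattern intended to avoid deadlock when threads of the same warp compete for the lock.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical grid dimensions, chosen only for this sketch.
constexpr int NX = 8, NY = 8, NZ = 8;
constexpr int CELLS = NX * NY * NZ;

__global__ void scatterLocked(int* grid, int* lock, int numThreads) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= numThreads) return;
    int cell = (tid * 31) % CELLS;   // made-up target cell
    volatile int* g = grid;          // keep the update from being cached in registers
    bool done = false;
    while (!done) {
        if (atomicCAS(lock, 0, 1) == 0) {  // acquire: 0 = free, 1 = held
            __threadfence();               // see writes made by the previous holder
            g[cell] = g[cell] + 1;         // critical section
            __threadfence();               // publish the write before releasing
            atomicExch(lock, 0);           // release
            done = true;
        }
    }
}

// Launches the kernel and returns the sum over all cells.
long runScatterLocked(int numThreads) {
    int *dGrid = nullptr, *dLock = nullptr;
    cudaMalloc(&dGrid, CELLS * sizeof(int));
    cudaMalloc(&dLock, sizeof(int));
    cudaMemset(dGrid, 0, CELLS * sizeof(int));
    cudaMemset(dLock, 0, sizeof(int));
    scatterLocked<<<(numThreads + 255) / 256, 256>>>(dGrid, dLock, numThreads);
    std::vector<int> host(CELLS);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(host.data(), dGrid, CELLS * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dGrid);
    cudaFree(dLock);
    long total = 0;
    for (int v : host) total += v;
    return total;
}

int main() {
    const int numThreads = 4096;  // kept small: every thread contends for one lock
    long total = runScatterLocked(numThreads);
    printf("increments recorded: %ld of %d\n", total, numThreads);
    return total == numThreads ? 0 : 1;
}
```

The sketch also hints at why a lock cannot win here: each update costs at least one atomic to acquire and one to release, plus retries under contention, whereas a plain `atomicAdd` needs a single atomic operation. On top of that, a single lock serializes writes even when they target different cells.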