Using CUDA C to develop a simple data parallel program

2.1 Data parallelism

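Take an image computation as the example: every pixel can be processed independently of the others, so the work can be spread across many threads.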
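As a rough illustration of what data parallelism means for an image, the sketch below converts RGB pixels to grayscale on the host. The function name rgbToGray, the interleaved RGB layout, and the luminance weights 0.21 / 0.72 / 0.07 are assumptions made for illustration and are not taken from these notes. The point is only that each loop iteration touches its own pixel, so the iterations could run in any order or all at once.

```cuda
// Host-side sketch: every pixel is computed independently of the others.
// Assumed layout: rgb holds 3 bytes per pixel (R, G, B); gray holds 1 byte per pixel.
void rgbToGray(const unsigned char* rgb, unsigned char* gray, int numPixels) {
    for (int i = 0; i < numPixels; ++i) {
        float r = rgb[3 * i];
        float g = rgb[3 * i + 1];
        float b = rgb[3 * i + 2];
        // Assumed luminance weights; iteration i reads and writes only pixel i.
        gray[i] = (unsigned char)(0.21f * r + 0.72f * g + 0.07f * b);
    }
}
```

Because no iteration depends on another, this loop is a natural candidate for one-thread-per-pixel execution on the device.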

2.2 CUDA C program structure

CUDA C code consists of two parts: host code, which runs on the CPU, and device code, which runs on the GPU.

Execution flow of a CUDA program: Host ⇒ Device ⇒ wait for the device to finish ⇒ Host
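The flow above can be sketched as a tiny program. This is a minimal sketch rather than code from the notes: the names fillKernel and runOnDevice are hypothetical, the single-block launch limits n to 1024 threads, and the <<<...>>> launch syntax is used here before the notes introduce it.

```cuda
#include <cuda_runtime.h>

// Device code: runs on the GPU. Each launched thread writes one element.
__global__ void fillKernel(int* data_d, int value) {
    data_d[threadIdx.x] = value;
}

// Host code: runs on the CPU and drives Host => Device => wait => Host.
// Returns 1 if all n elements came back equal to value, otherwise 0.
// Assumes 1 <= n <= 1024 (one block of n threads).
int runOnDevice(int n, int value) {
    int result_h[1024];
    int* data_d = NULL;
    if (n < 1 || n > 1024) return 0;
    if (cudaMalloc((void**)&data_d, n * sizeof(int)) != cudaSuccess) return 0;

    fillKernel<<<1, n>>>(data_d, value);        // Host hands work to the device
    cudaError_t err = cudaDeviceSynchronize();  // Host waits for the device to finish

    // Back on the host: fetch the result and inspect it.
    if (err == cudaSuccess)
        err = cudaMemcpy(result_h, data_d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(data_d);
    if (err != cudaSuccess) return 0;

    for (int i = 0; i < n; ++i)
        if (result_h[i] != value) return 0;
    return 1;
}
```

Calling runOnDevice(8, 7) from main would launch eight device threads and return to host code only after they have finished.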

2.3 A vector addition kernel

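Device-side variables are conventionally named with the suffix _d.

Host-side variables are conventionally named with the suffix _h.

The sequential version looks like this: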

// Compute vector sum C_h = A_h + B_h
void vecAdd(float* A_h, float* B_h, float* C_h, int n) {
	for (int i = 0; i < n; ++i) {
		C_h[i] = A_h[i] + B_h[i];
	}
} 
int main() { 
	// Memory allocation for arrays A, B, and C
	// I/O to read A and B, N elements each
	... 
	vecAdd(A, B, C, N); 
}

Restructured for parallel execution, it becomes:

void vecAdd(float* A_h, float* B_h, float* C_h, int n) {
	int size = n * sizeof(float);
	float *A_d, *B_d, *C_d;

	// Part 1: Allocate device memory for A, B, and C
	// Copy A and B to device memory
	...

	// Part 2: Call the kernel to launch a grid of threads
	// that perform the actual vector addition
	...

	// Part 3: Copy C from the device memory
	// Free device vectors
}

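In other words: allocate device memory (_d) ⇒ copy host memory (_h) to device memory ⇒ launch the kernel ⇒ copy the result back from device memory ⇒ free memory (the _d buffers, and the _h buffers once the program is done with them)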

2.4 Device global memory and data transfer

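Ordinary system memory is called main memory (host memory), while the GPU's equivalent of main memory is called global memory, which is what is commonly referred to as video memory (VRAM). The different names keep the two memories distinct.

Transferring data from host to device therefore means copying it from host main memory to device global memory.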

cudaMalloc(param_1, size):

　　Plays much the same role as malloc(), but allocates in device global memory.

　　The first parameter is the address of a pointer variable.

　　The second parameter is the size of the allocation in bytes (type size_t).

Note:

The address of the pointer variable should be cast to (void **).

This parameter type allows cudaMalloc to write the address of the allocated memory into the provided pointer variable regardless of its type.
cudaMalloc() differs from malloc(): malloc() takes a single parameter and returns a pointer to the allocated object, whereas cudaMalloc() writes the pointer through its first parameter and uses its return value (a cudaError_t) to report errors.

cudaFree(param_1): the counterpart of free() for device global memory. Pass the pointer value itself (not its address) as the argument.

Example of using cudaMalloc and cudaFree

float *A_d;
size_t size = n * sizeof(float);
cudaMalloc((void **) &A_d, size);
...
cudaFree(A_d);

cudaMemcpy():

　　The first parameter is a pointer to the destination location for the data object to be copied.

　　The second parameter points to the source location.

　　The third parameter specifies the number of bytes to be copied.

　　The fourth parameter indicates the types of memory involved in the copy: host to host (cudaMemcpyHostToHost), host to device (cudaMemcpyHostToDevice), device to host (cudaMemcpyDeviceToHost), or device to device (cudaMemcpyDeviceToDevice).
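The four parameters can be seen together in a small round trip. This is a minimal sketch under stated assumptions: the helper name roundTrip and the buffer name buf_d are hypothetical, and the function simply copies data to the device and straight back without running any kernel.

```cuda
#include <cuda_runtime.h>

// Copies n floats host -> device -> host.
// Returns 1 if the data comes back unchanged, otherwise 0.
int roundTrip(const float* src_h, float* dst_h, int n) {
    size_t size = n * sizeof(float);
    float* buf_d = NULL;
    if (cudaMalloc((void**)&buf_d, size) != cudaSuccess) return 0;

    // Argument order: destination, source, byte count, direction.
    cudaError_t err = cudaMemcpy(buf_d, src_h, size, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(dst_h, buf_d, size, cudaMemcpyDeviceToHost);

    cudaFree(buf_d);
    if (err != cudaSuccess) return 0;

    for (int i = 0; i < n; ++i)
        if (dst_h[i] != src_h[i]) return 0;
    return 1;
}
```

Note that the destination comes first in both calls, so swapping direction means swapping the first two arguments as well as the fourth.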

Complete version of the vecAdd code

void vecAdd(float* A_h, float* B_h, float* C_h, int n) { 
    int size = n * sizeof(float); 
    float *A_d, *B_d, *C_d; 

    cudaMalloc((void **) &A_d, size); 
    cudaMalloc((void **) &B_d, size); 
    cudaMalloc((void **) &C_d, size); 

    cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice); 
    cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice); 

    // Kernel invocation code – to be shown later 
    ... 

    cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost); 

    cudaFree(A_d); 
    cudaFree(B_d); 
    cudaFree(C_d); 
}
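For orientation, here is one way the elided kernel invocation might look once a kernel is available. Everything kernel-related in this sketch is an assumption rather than the notes' own code: the names vecAddKernel and vecAddSketch, the block size of 256 threads, and the one-thread-per-element indexing. Error checks are omitted to keep the sketch short.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: thread i computes C[i] = A[i] + B[i].
__global__ void vecAddKernel(const float* A, const float* B, float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard: the last block may contain threads past the end
        C[i] = A[i] + B[i];
    }
}

void vecAddSketch(const float* A_h, const float* B_h, float* C_h, int n) {
    size_t size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // Part 1: allocate device memory and copy the inputs over.
    cudaMalloc((void**)&A_d, size);
    cudaMalloc((void**)&B_d, size);
    cudaMalloc((void**)&C_d, size);
    cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);

    // Part 2: launch enough blocks to cover all n elements.
    int threadsPerBlock = 256;  // assumed block size
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAddKernel<<<numBlocks, threadsPerBlock>>>(A_d, B_d, C_d, n);

    // Part 3: copy the result back (this blocking copy runs after the
    // kernel has finished), then free the device vectors.
    cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);
    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);
}
```

The rounding-up division for numBlocks is why the kernel needs the i < n guard: when n is not a multiple of the block size, some threads in the last block have no element to process.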

Error Checking

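You mostly have to write the checks yourself. For example, the code below reports an error if the requested allocation exceeds the available device memory:

cudaError_t err = cudaMalloc((void **) &A_d, size);
if (err != cudaSuccess) {
    printf("%s in %s at line %d\n", cudaGetErrorString(err),
           __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}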
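Writing this check after every call gets repetitive, so it is commonly wrapped in a macro. The sketch below is one possible form: CUDA_CHECK and allocDeviceFloats are hypothetical names, not part of the CUDA API, and the macro assumes that exiting the program is an acceptable response to any error.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of the CUDA API): evaluates a CUDA
// runtime call and exits with a message if it did not return cudaSuccess.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s in %s at line %d\n",                  \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Allocates n floats in device global memory, or exits on failure.
float* allocDeviceFloats(size_t n) {
    float* p_d = NULL;
    CUDA_CHECK(cudaMalloc((void**)&p_d, n * sizeof(float)));
    return p_d;
}
```

Because __FILE__ and __LINE__ expand where the macro is used, the message points at the failing call site rather than at the helper.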