Vector Sum: Hands-on

Class X
http://fisica.cab.cnea.gov.ar/gpgpu/index.php/en/icnpg/clases

Some practical questions

(1) What happens if the vectors to be added are "very" large?

(2) How do I know which card my job ran on?

(3) What arguments can a kernel receive, and what is allowed inside a kernel?

(4) How do I measure the time spent on CPU ↔ GPU transfers?

What happens if the vectors to be added are very large?

● Problem 1: memory → error handling

● Problem 2: indexing → change the kernels

Problem 1

Device 0: "GeForce GT 620M"
  CUDA Driver Version / Runtime Version: 5.0 / 5.0
  CUDA Capability Major/Minor version number: 2.1
  Total amount of global memory: 1024 MBytes
  ( 2) Multiprocessors x ( 48) CUDA Cores/MP: 96 CUDA Cores

    // SUMA-Vectores
    #define N 200000000
    int main()
    {
        ...
        /* allocate device memory */
        float *d_A, *d_B;
        cudaMalloc((void**)&d_A, sizeof(float) * N);
        cudaMalloc((void**)&d_B, sizeof(float) * N);
        ...
    }

WHAT IS THE PROBLEM?

● Hint: 1 float = 4 bytes. So what is the problem?

Host (CPU) → MemTotal: 3932884 kB.
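The hint can be turned into arithmetic: each vector of N = 200000000 floats needs 800,000,000 bytes, so d_A and d_B together need 1,600,000,000 bytes, more than the 1024 MBytes of global memory reported for the GT 620M. The following is a minimal sketch that makes the same comparison at run time with `cudaMemGetInfo`; the helper `bytes_needed` is a hypothetical name introduced only for this illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 200000000  // same N as in the slide

// Hypothetical helper: bytes needed to store `count` vectors of `n` floats.
static size_t bytes_needed(size_t n, size_t count)
{
    return n * sizeof(float) * count;
}

int main()
{
    size_t free_bytes = 0, total_bytes = 0;

    // Ask the runtime how much device memory is currently free.
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    size_t need = bytes_needed(N, 2);  // d_A and d_B
    printf("needed: %zu bytes, free: %zu bytes, total: %zu bytes\n",
           need, free_bytes, total_bytes);

    if (need > free_bytes)
        printf("d_A and d_B do not fit in device memory\n");

    return EXIT_SUCCESS;
}
```

The same reasoning applies on the host: the copies h_A and h_B have to fit in the MemTotal reported by /proc/meminfo.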

Problem 1

SUMA-Vectores, main.cu:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <cuda.h>
    #include "vector_io.h"
    #include "vector_ops.h"

    #ifndef N
    #define N 1000000000
    #endif

    #ifndef VECES
    #define VECES 10
    #endif

● Experiment with N
● HANDLE_ERROR()
● Device properties
● /proc/meminfo

Problem 1 → error handling

    #define N 200000000
    int main()
    {
        ...
        /* allocate device memory */
        float *d_A, *d_B;
        cudaError_t error;
        error = cudaMalloc((void**)&d_A, sizeof(float) * N);
        if (error != cudaSuccess) {
            printf("cudaMalloc d_A error %d, line(%d)\n", error, __LINE__);
            exit(EXIT_FAILURE);
        }
        error = cudaMalloc((void**)&d_B, sizeof(float) * N);
        ...
    }

SEE: CUDA Runtime API

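Problem 1 → error handling

    ...
    #define N 200000000

    /* prototype, so that main() can call the function defined below */
    void checkCUDAError(const char *msg);

    int main()
    {
        ...
        /* allocate device memory */
        float *d_A, *d_B;
        cudaMalloc((void**)&d_A, sizeof(float) * N);
        cudaMalloc((void**)&d_B, sizeof(float) * N);
        checkCUDAError("allocating d_A and d_B");
        ...
    }

    /* abort with a message if the last CUDA runtime call failed */
    void checkCUDAError(const char *msg)
    {
        cudaError_t err = cudaGetLastError();
        if (cudaSuccess != err) {
            fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
            exit(EXIT_FAILURE);
        }
    }

CUDA Runtime API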

Problem 1 → error handling

    ...
    #include "curso.h"
    #define N 200000000
    int main()
    {
        ...
        /* allocate device memory */
        float *d_A, *d_B;
        cudaError_t error;
        HANDLE_ERROR(cudaMalloc((void**)&d_A, sizeof(float) * N));
        HANDLE_ERROR(cudaMalloc((void**)&d_B, sizeof(float) * N));
        ...
        HANDLE_ERROR(cudaMemcpy(d_A, h_A, sizeof(float) * N, cudaMemcpyHostToDevice));
        HANDLE_ERROR(cudaMemcpy(d_B, h_B, sizeof(float) * N, cudaMemcpyHostToDevice));
        ...
    }

CUDA Runtime API

HANDLE_ERROR (from the book CUDA by Example) → MACRO: the preprocessor replaces it with a fragment of code.
http://gcc.gnu.org/onlinedocs/cpp/index.html#Top
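The slides do not show the contents of curso.h, so the following is only a sketch of how a macro like HANDLE_ERROR is commonly written, in the style of CUDA by Example. The helper name `HandleError` and the message format are assumptions. The macro wraps a runtime call and passes the file and line of the call site to a checking function.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Assumed helper: print the error text with its location and abort.
static void HandleError(cudaError_t err, const char *file, int line)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s in %s at line %d\n",
                cudaGetErrorString(err), file, line);
        exit(EXIT_FAILURE);
    }
}

// The preprocessor expands HANDLE_ERROR(x) at the call site, so
// __FILE__ and __LINE__ refer to the line where the macro is used.
#define HANDLE_ERROR(call) (HandleError((call), __FILE__, __LINE__))

int main()
{
    float *d_A = NULL;

    // Small allocation, only to exercise the macro.
    HANDLE_ERROR(cudaMalloc((void**)&d_A, sizeof(float) * 1000));
    HANDLE_ERROR(cudaFree(d_A));

    printf("no CUDA errors\n");
    return EXIT_SUCCESS;
}
```

Because the check is a macro and not a plain function call, the reported line number is that of the failing call and not of the helper.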

Problem 1 → error handling

    ...
    #include <helper_cuda.h>
    #define N 200000000
    int main()
    {
        ...
        /* allocate device memory */
        float *d_A, *d_B;
        checkCudaErrors(cudaMalloc((void**)&d_A, sizeof(float) * N));
        checkCudaErrors(cudaMalloc((void**)&d_B, sizeof(float) * N));
        ...
        checkCudaErrors(cudaMemcpy(d_A, h_A, sizeof(float) * N, cudaMemcpyHostToDevice));
        checkCudaErrors(cudaMemcpy(d_B, h_B, sizeof(float) * N, cudaMemcpyHostToDevice));
        ...
    }

CUDA Runtime API

Problem 2

    #define dim 40000000
    /* Vector sum. The result is left in the first argument */
    int vector_ops_suma_par(float *v1, float *v2)
    {
        dim3 nThreads(512);
        //dim3 nBlocks((dim / nThreads.x) + (dim % nThreads.x ? 1 : 0)); // alternative
        dim3 nBlocks((dim + nThreads.x - 1) / nThreads.x);

        kernel_suma<<<nBlocks, nThreads>>>(v1, v2, dim);
        …
    }

Device 0: "GeForce GT 620M"
  Total amount of global memory: 1024 MBytes (1073479680 bytes)
  Maximum number of threads per block: 1024
  Maximum sizes of each dimension of a block: 1024 x 1024 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535

WHAT IS THE PROBLEM?

● Hint: 1 float = 4 bytes
● Hint: nBlocks = ? This problem would show up before Problem 1!
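Working out the second hint: with dim = 40000000 and 512 threads per block, the launch needs (40000000 + 511) / 512 = 78125 blocks, which exceeds the 65535 limit of the grid's x dimension on this device, even though the two vectors (160,000,000 bytes each) still fit in memory. A minimal sketch that checks this against the device limits follows; `blocks_for` is a hypothetical helper name.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: blocks needed so that blocks * threads >= n.
static unsigned int blocks_for(unsigned int n, unsigned int threads)
{
    return (n + threads - 1) / threads;  // ceiling division
}

int main()
{
    const unsigned int dim = 40000000, threads = 512;
    unsigned int blocks = blocks_for(dim, threads);

    int dev = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&dev) != cudaSuccess ||
        cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
        fprintf(stderr, "could not query the device\n");
        return EXIT_FAILURE;
    }

    printf("blocks needed: %u, max grid size in x: %d\n",
           blocks, prop.maxGridSize[0]);

    if (blocks > (unsigned int)prop.maxGridSize[0])
        printf("the launch configuration is invalid on this device\n");

    return EXIT_SUCCESS;
}
```

On newer devices the limit in x is much larger, so the same launch may succeed there; querying `maxGridSize` avoids hard-coding 65535.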

Problem 2 → gridDim

    #define dim 40000000
    ...
    /* Vector sum. The result is left in the first argument */
    int vector_ops_suma_par(float *v1, float *v2)
    {
        dim3 nThreads(512);
        dim3 nBlocks(512);
        kernel_suma<<<nBlocks, nThreads>>>(v1, v2, dim);
        checkCUDAError("kernel_suma launch");
        …
    }

    /* add each element of the vector */
    __global__ void kernel_suma(float *v1, float *v2, int dim)
    {
        int id = threadIdx.x + (blockIdx.x * blockDim.x);

        while (id < dim) {
            v1[id] = v1[id] + v2[id];
            id += blockDim.x * gridDim.x;
        }
    }

Problem 2 → gridDim

Let S = blockDim.x * gridDim.x be the total number of threads in the grid, which is the stride of the loop.

Thread 0 computes:
    v1[0] = v1[0] + v2[0];
    v1[S] = v1[S] + v2[S];                      (if S < dim)
    v1[2*S] = v1[2*S] + v2[2*S];                (if 2*S < dim)

Thread id computes:
    v1[id] = v1[id] + v2[id];
    v1[id+S] = v1[id+S] + v2[id+S];             (if id+S < dim)
    v1[id+2*S] = v1[id+2*S] + v2[id+2*S];       (if id+2*S < dim)

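Problem 2 → gridDim

    /* add each element of the vector */
    __global__ void kernel_suma(float *v1, float *v2, int dim)
    {
        int id = threadIdx.x + (blockIdx.x * blockDim.x);

        while (id < dim) {
            v1[id] = v1[id] + v2[id];
            id += blockDim.x * gridDim.x;
        }
    }

[Figure: the DATA array, of length dim, is covered in successive steps by the GRID (blocks of threads), which spans gridDim.x*blockDim.x elements at a time.]

This serializes part of the work inside each thread...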
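To see the strided loop work end to end, here is a self-contained sketch that launches the kernel with a fixed grid of 512 x 512 threads on a vector four times larger and checks the result on the host. The `CHECK` macro, the `run` function and the test values are assumptions made for this illustration; they are not part of SUMA-Vectores.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Assumed helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t e_ = (call);                                       \
        if (e_ != cudaSuccess) {                                       \
            fprintf(stderr, "%s (line %d)\n",                          \
                    cudaGetErrorString(e_), __LINE__);                 \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

/* Each thread handles elements id, id+S, id+2*S, ... with S = grid size. */
__global__ void kernel_suma(float *v1, const float *v2, int dim)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    for (; id < dim; id += stride)
        v1[id] = v1[id] + v2[id];
}

// Adds two vectors of length dim on the GPU; returns the number of wrong elements.
static int run(int dim)
{
    size_t bytes = sizeof(float) * dim;
    float *h_A = (float *)malloc(bytes);
    float *h_B = (float *)malloc(bytes);
    if (!h_A || !h_B) { fprintf(stderr, "host malloc failed\n"); exit(EXIT_FAILURE); }
    for (int i = 0; i < dim; i++) { h_A[i] = (float)i; h_B[i] = 2.0f * i; }

    float *d_A, *d_B;
    CHECK(cudaMalloc((void **)&d_A, bytes));
    CHECK(cudaMalloc((void **)&d_B, bytes));
    CHECK(cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice));

    kernel_suma<<<512, 512>>>(d_A, d_B, dim);  // fixed grid, independent of dim
    CHECK(cudaGetLastError());                 // catches invalid launch configurations
    CHECK(cudaMemcpy(h_A, d_A, bytes, cudaMemcpyDeviceToHost));

    int errors = 0;
    for (int i = 0; i < dim; i++)
        if (h_A[i] != 3.0f * i) errors++;

    CHECK(cudaFree(d_A));
    CHECK(cudaFree(d_B));
    free(h_A);
    free(h_B);
    return errors;
}

int main()
{
    int errors = run(1 << 20);  // 1048576 elements, 262144 threads
    printf("wrong elements: %d\n", errors);
    return errors == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

With this layout the launch configuration no longer depends on dim, so the grid-size limit of Problem 2 is never reached; the price is that each thread performs several additions in sequence.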

How do I know which card my job ran on?

    int main()
    {
        cudaDeviceProp deviceProp;
        int dev;
        cudaGetDevice(&dev);
        cudaGetDeviceProperties(&deviceProp, dev);
        printf("\nDevice %d: \"%s\"\n", dev, deviceProp.name);
        ....
    }

CUDA Runtime API

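What kinds of arguments can a kernel receive?

    /* add each element of the vector */
    __global__ void kernel_suma(float *v1, float *v2, int dim)
    {
        int id = threadIdx.x + (blockIdx.x * blockDim.x);

        if (id < dim) {
            v1[id] = v1[id] + v2[id];
        }
    }

    ...
    kernel_suma<<<nBlocks, nThreads>>>(v1, v2, dim);
    ...

● v1, v2: pointers to memory allocated on the device (GPU).
● dim: a host variable, passed by value; it is copied to the device's constant memory.
● v1[id]: a dereference of a device pointer; doing this in a host function would be incorrect.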


    #define dim 10000000
    ...

    /* add each element of the vector */
    __global__ void kernel_suma(float *v1, float *v2)
    {
        int id = threadIdx.x + (blockIdx.x * blockDim.x);

        if (id < dim) {
            v1[id] = v1[id] + v2[id];
        }
    }

    ...
    kernel_suma<<<nBlocks, nThreads>>>(v1, v2);
    ...

Where does dim come from now? It is a MACRO: the preprocessor replaces it with a fragment of code (here, the literal 10000000) before compilation.
http://gcc.gnu.org/onlinedocs/cpp/index.html#Top


    struct punto {
        float a, b;
    };
    ...

    /* adds dim vectors in the plane ... */
    __global__ void kernel_suma(punto *w1, punto *w2, int dim)
    {
        int id = threadIdx.x + (blockIdx.x * blockDim.x);

        if (id < dim) {
            w1[id].a = w1[id].a + w2[id].a;
            w1[id].b = w1[id].b + w2[id].b;
        }
    }

    punto *w1, *w2;
    cudaMalloc((void**)&w1, sizeof(punto) * N);
    cudaMalloc((void**)&w2, sizeof(punto) * N);
    ...

The limit on the total size of the kernel arguments is 4 KB.


    struct Parametros {
        int dim;
        float numero;
    };
    ...

    /* adds dim vectors in the plane ... */
    __global__ void kernel_suma(punto *w1, punto *w2, Parametros par)
    {
        int id = threadIdx.x + (blockIdx.x * blockDim.x);

        if (id < par.dim) {
            w1[id].a = w1[id].a + w2[id].a;
            w1[id].b = w1[id].b + w2[id].b;
        }
    }

    Parametros params;
    params.dim = N;
    params.numero = 83.2f;
    kernel_suma<<<nBlocks, nThreads>>>(w1, w2, params);

The limit on the total size of the kernel arguments is 4 KB.

What is "allowed" inside a kernel?

CUDA C PROGRAMMING GUIDE

Source files compiled with nvcc can contain a mix of HOST and DEVICE code.

● HOST: supports all of standard C++.
● DEVICE: supports part of it (see E.1. Code Samples), with some restrictions (E.2. Restrictions).

BROWSE THE CUDA SAMPLES

    cat /usr/local/cuda-5.5/samples/*/*/*.cu | grep -A 5 "__global__" | less
    cat /usr/local/cuda-5.5/samples/*/*/*.h | grep -A 5 "__global__" | less

http://stackoverflow.com/questions/8302506/parameters-to-cuda-kernels

What is "allowed" inside a kernel?

Check the forums. If Mark Harris says so ...

http://stackoverflow.com/questions/9309195/copying-a-struct-containing-pointers-to-cuda-device/9323898#9323898



How much time is spent transferring from CPU to GPU?

Run the binary under nvprof in the job script:

    #! /bin/bash
    #
    #$ -cwd
    #$ -j y
    #$ -S /bin/bash
    #
    # request the gpu.q queue
    #$ -q gpu.q
    #
    # request one card
    #$ -l gpu=1
    #
    # run the binary
    /usr/local/cuda-5.5/bin/nvprof ./main

nvprof: http://docs.nvidia.com/cuda/profiler-users-guide/index.html

Experiment with SUMA-Vectores:
● Change NVECES
● Change N
● Make the computation more intensive
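The slides measure transfers with nvprof. As a complementary sketch, not taken from the slides, a single host-to-device copy can also be timed from inside the program with CUDA events. The `CHECK` macro and the function `time_h2d_ms` are hypothetical names used only here.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Assumed helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t e_ = (call);                                       \
        if (e_ != cudaSuccess) {                                       \
            fprintf(stderr, "%s (line %d)\n",                          \
                    cudaGetErrorString(e_), __LINE__);                 \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Returns the elapsed time, in milliseconds, of one host-to-device copy.
static float time_h2d_ms(size_t bytes)
{
    char *h = (char *)malloc(bytes);
    if (!h) { fprintf(stderr, "host malloc failed\n"); exit(EXIT_FAILURE); }
    memset(h, 0, bytes);

    char *d = NULL;
    CHECK(cudaMalloc((void **)&d, bytes));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaEventRecord(start, 0));
    CHECK(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));  // wait until the stop event has completed

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d));
    free(h);
    return ms;
}

int main()
{
    size_t bytes = sizeof(float) * 10000000;  // 40,000,000 bytes
    float ms = time_h2d_ms(bytes);
    printf("host to device, %zu bytes: %f ms\n", bytes, ms);
    return EXIT_SUCCESS;
}
```

Timing several sizes this way and comparing with the nvprof report is one way to carry out the "Change N" experiment.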



How do I mount my cluster home directory locally?

● Install "sshfs". For example, on Ubuntu:

    sudo aptitude update
    sudo aptitude install sshfs
    sudo adduser yourusername fuse

● Usage:

    mkdir ~/Desktop/sftp
    sshfs [email protected]:/cluster/dir/to/mount ~/Desktop/sftp

● Unmount without logging out:

    fusermount -u ~/Desktop/sftp

Advantages:
● Edit code on the cluster while running your preferred editor locally.
● Plot data stored on the cluster while running your preferred plotting tool locally.
● Move files as if they were in a local folder.
● Graphical alternative: "connect to server" in Nautilus, or similar.
