In this post, I'll be sharing my work on optimizing matrix/vector operations with loop unrolling and SSE instructions at the hardware level. I completed this task as part of an assignment for the CS 4342 Advanced Computer Architecture module. The homework was taken from Marcus Holm's <marcus.holm@it.uu.se> course material.

The code I wrote for this project is available here: https://github.com/TharinduRusira/matrix-sse


**High Performance Computing and Programming — Lab 4: SIMD and Vectorization**

First, the performance of the loop-unrolled version of the matrix-vector multiplication was measured against its serial implementation. The code for the unrolled (by a factor of 4) matrix-vector multiplication is below.

```cpp
void matvec_unrolled(size_t n, float** mat_a, float* vec_b, float* v) // v holds the result
{
    size_t i, j;
    clock_t start = clock();
    for (i = 0; i < n; i++) {
        v[i] = 0.0f;
        for (j = 0; j < n; j += 4) { // assumes n is a multiple of 4
            v[i] += mat_a[i][j+0] * vec_b[j+0]
                  + mat_a[i][j+1] * vec_b[j+1]
                  + mat_a[i][j+2] * vec_b[j+2]
                  + mat_a[i][j+3] * vec_b[j+3];
        }
    }
    cout << "time taken = " << clock() - start << endl;
}
```

The following sample code is a vectorized version of the matrix-vector multiplication. This version was later extended to solve the matrix-matrix multiplication problem. The obtained results are listed below.

```cpp
#include <smmintrin.h> // SSE4.1 intrinsics (_mm_dp_ps)

void mat_vec_mul_sse(size_t N, size_t M, float* avec, float** amat)
{
    float* vec = avec;   // rand_vec_gen(M_col);
    float** mat = amat;  // rand_mat_gen(N_row, M_col);

    cout << "vector size = " << M << "x1\nmatrix size = " << N << "x" << M << "\n" << endl;

    __m128 vec_b, vec_mat;
    float c[N]; // store the final result (variable-length array, a GNU extension)

    for (size_t i = 0; i < N; i++) // for each row of the matrix
    {
        __m128 vec_c = _mm_setzero_ps(); // accumulate the sum of products
        for (size_t j = 0; j < M; j += 4) // for each 4-float block in a row (assumes M % 4 == 0)
        {
            vec_b = _mm_loadu_ps(vec + j);
            vec_mat = _mm_loadu_ps(mat[i] + j);
            // _mm_dp_ps with mask 0xFF broadcasts the 4-element dot product
            // to every lane, so the running sum lives in all lanes of vec_c
            vec_c = _mm_add_ps(_mm_dp_ps(vec_mat, vec_b, 0xFF), vec_c);
        }
        c[i] = _mm_cvtss_f32(vec_c); // extract the accumulated sum from lane 0
    }
}
```

In this homework, we compare the performance of the serial matrix-vector and matrix-matrix multiplications against their vectorized counterparts. The obtained results are given below.

Before that, here is the CPU information for the old notebook I used to run the experiments.

*-cpu

description: CPU

product: Intel(R) Core(TM)2 Duo CPU T6670 @ 2.20GHz

vendor: Intel Corp.

physical id: 0

bus info: cpu@0

version: 6.7.10

serial: 0001-067A-0000-0000-0000-0000

slot: Intel(R) Genuine processor

size: 1200MHz

capacity: 1200MHz

width: 64 bits

clock: 200MHz

capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr **sse sse2** ss ht tm pbe nx x86-64 constant_tsc arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 **ssse3** cx16 xtpr pdcm **sse4_1** xsave lahf_lm ida dtherm tpr_shadow vnmi flexpriority cpufreq
Results for Matrix-Vector multiplication

Execution time is in clock ticks, as reported by `clock()`. (Note: the listed speedups of the SSE version are measured relative to the unrolled version, y/z.)

| Matrix | Vector | Serial mat-vec mul (x) | Unrolled mat-vec mul (y) | Speedup (x/y) | SSE version (z) | Speedup (y/z) |
|---|---|---|---|---|---|---|
| [100,100] | [100,1] | 87.33 | 51.33 | 1.70 | 40.00 | 1.28 |
| [200,200] | [200,1] | 348.67 | 209.00 | 1.67 | 155.67 | 1.34 |
| [400,400] | [400,1] | 1349.67 | 816.33 | 1.65 | 660.33 | 1.24 |
| [800,800] | [800,1] | 5447.67 | 3205.00 | 1.70 | 2532.33 | 1.27 |
| [1600,1600] | [1600,1] | 21769.0 | 12892.5 | 1.69 | 9882.5 | 1.31 |

Results for Matrix-Matrix multiplication

Execution time is in clock ticks, as reported by `clock()`.

| Matrix A | Matrix B | Serial mat-mat mul | SSE version | Speedup |
|---|---|---|---|---|
| [100,100] | [100,100] | 11,092.00 | 3,919.00 | 2.83 |
| [200,200] | [200,200] | 87,910.33 | 29,970.67 | 2.93 |
| [400,400] | [400,400] | 737,988.00 | 236,188.00 | 3.13 |
| [800,800] | [800,800] | 8,961,429.00 | 2,009,663.50 | 4.46 |