## Saturday, April 11, 2015

### Optimizing vector computations using Streaming SIMD Extensions(SSE) and loop unrolling

In this post, I'll be sharing my work related to optimizing matrix/vector operation with the support of loop unrolling and SSE instructions at the hardware level. I completed this task as a part of an assignment for CS 4342 Advance Computer Architecture module. This homework was taken from Marcus Holm's <marcus.holm@it.uu.se> High Performance Computing and Programming Lab 4 — SIMD and Vectorization.

Code I wrote for this project is available here https://github.com/TharinduRusira/matrix-sse

First, performance of the loop unrolled version of the matrix-vector multiplication was measured against its serial implementation. Code for unrolled(by a factor of 4) Matrix-Vector multiplication is below.

void matvec_unrolled(size_t n,float** mat_a, float* vec_b)
{
int i,j;
clock_t start = clock();
for(i=0;i<n;i++){
v[i]=0.0;
for(j=0;j<n;j+=4){
v[i] += mat_a[i][j+0]*vec_b[j+0]
+ mat_a[i][j+1]*vec_b[j+1]
+ mat_a[i][j+2]*vec_b[j+2]
+ mat_a[i][j+3]*vec_b[j+3];
}
}
cout << "time taken ="<< clock() - start << endl;

}

The following sample code is a vectorized version of a matrix-vector multiplication. This version has later been extended to solve matrix-matrix multiplication problem. Obtained results are listed below.

void mat_vec_mul_sse(size_t N, size_t M, float* avec, float** amat)
{

float* vec = avec;//rand_vec_gen(M_col);
float** mat = amat; //rand_mat_gen(N_row,M_col);
cout << "vector size =" <<M<<"x1\n Matrix size="<<N<<"x"<<M<<"\n"<<endl;

__m128 vec_b,vec_mat;

float c[N]; // store the final result

for(int i=0;i<N;i++) // for each row of the matrix
{
__m128 vec_c = _mm_setzero_ps(); // accumilate the sum of products

for(int j=0;j<M;j+=4) // for each 4 number block in a row
{

}
c[i] = _mm_cvtss_f32(vec_c);
}

}

In this homework, we compare performance of serial versions of matrix-vector and matrix-matrix multiplications compared to their parallelized counterparts. The obtained results are given below.

Before that, here are the CPU information of the old notebook I used to run the experiments.

*-cpu
description: CPU
product: Intel(R) Core(TM)2 Duo CPU     T6670  @ 2.20GHz
vendor: Intel Corp.
physical id: 0
bus info: cpu@0
version: 6.7.10
serial: 0001-067A-0000-0000-0000-0000
slot: Intel(R) Genuine processor
size: 1200MHz
capacity: 1200MHz
width: 64 bits
clock: 200MHz
capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx x86-64 constant_tsc arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm ida dtherm tpr_shadow vnmi flexpriority cpufreq

Results for Matrix-Vector multiplication

 Dimensions Execution Time(clock cycles) Matrix Vector Serial mat-vec mul (x) serial mat-vec mul with loop unrolling (y) Speedup (x/y) SSE version (z) Speedup (x/z) [100,100] [100,1] 87.33 51.33 1.70 40 1.28 [200,200] [200,1] 348.67 209.00 1.67 155.67 1.34 [400,400] [400,1] 1349.67 816.33 1.65 660.33 1.24 [800,800] [800,1] 5447.67 3205.00 1.70 2532.33 1.27 [1600,1600] [1600,1] 21769.0 12892.5 1.69 9882.5 1.31

Results for Matrix-Matrix multiplication

 Dimensions Execution Time(clock cycles) Matrix A Matrix B serial mat-mat mul SSE version Speedup [100,100] [100,100] 11,092.00 3,919.0 2.83 [200,200] [200,200] 87,910.33 29,970.67 2.93 [400,400] [400,400] 737,988.0 236,188.0 3.13 [800,800] [800,800] 8,961,429.0 2,009,663.50 4.46

### Optimizing Matrix Multiplication problem using OpenMP and Improved Cache Utilization

It's been a while since my last post, it was a hectic final semester but I should admit that it's one of those semesters I enjoyed the most. This one project I liked was parallelizing the famous matrix multiplication problem. I worked with my lab-buddy Chalitha for this assignment and it was fun.

Experiments were run on a machine with the following specifications.

Architecture:       x86_64
CPU op-mode(s):     32-bit, 64-bit
Byte Order:         Little Endian
CPU(s):             4
On-line CPU(s) list:   0-3
Core(s) per socket: 2
Socket(s):          1
NUMA node(s):       1
Vendor ID:          GenuineIntel
CPU family:         6
Model:              37
Stepping:           5
CPU MHz:            1199.000
BogoMIPS:           5321.42
Virtualization:     VT-x
L1d cache:          32K
L1i cache:          32K
L2 cache:           256K
L3 cache:           3072K
NUMA node0 CPU(s): 0-3
Model Name    : Intel(R) Core(TM) i5 CPU    M 480  @ 2.67GHz

Code is available in GitHub at https://github.com/TharinduRusira/ParallelMatrix
Initially the parallel loops in the problem were  distributed among the different cores of the machine using #pragma omp parallel for directives available in the standard OpenMP library.

for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
{
C[i][j]=0;
for(k=0;k<N;k++)
{
C[i][j]+= A[i][k]*B[k][j];
}
}
}

The parallelized code was further optimized by taking the transpose of B before performing A*B. In the native solution for A*B, B is accessed in the column-major order which is not desirable for good  cache performance.

for(i=0;i<N;i++) {
for(j=0;j<N;j++){
B_t[i][j] = B[j][i];
}
}
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
{
C[i][j]=0;
for(k=0;k<N;k++)
{
C[i][j]+= A[i][k]*B_t[j][k];
} } }

The three versions of the code, serial, parallel, and optimized were run multiple iterations and average runtimes were collected with a 95% statistical confidence. Following figures show the speedups achieved by each version and later we show the relative speedups between serial-parallel versions and serial-optimized versions. The maximum average speedup reached 4x in the later iterations of the optimized version of the code.
 Figure 1. Run time comparison of the three versions of matrix multiplication

 Figure 2. Performance comparison between serial and parallel versions

 Figure 3. Performance comparison between serial and optimized parallel versions

## Wednesday, December 31, 2014

### I write Biology too...

It's been an exciting final semester in the university and to make things more colorful, I decided to take a Bioinformatics class. I have never had a methodical Biology education, though I had spent quite a lot of hours in my junior-high science class looking at the cells in an onion peel or doodling the cellular structure of the Chlamydomonas.
It's always better late than never; I have time to learn a little more biology now. The first assignment of the Bioinformatics class was to write a concise essay on cells, cell reproduction and basic genetics.
I append below what I wrote for the class. Even though the content is nothing more than what's taught in an ordinary high school Biology class, I must say that I enjoyed learning how life actually works and apparently our life is pretty interesting (and complicated).

## Saturday, November 22, 2014

### Building a Scanner and Parser for your programming language

As the CA component of Compiler Design module, I was given this project to implement a lexical analyzer and the parser(semantic analyzer) for C- (a language specification from the famous text book, Compiler Construction: Principles and Practice)

flex(lex) was used to implement the lexical analyzer while the parser was built using bison(yacc). Flex uses the header file generated by bison in order to identify tokens defined in the bison script.

Bison script defines different token types that are valid in our language specification.

## Saturday, July 5, 2014

### Saving and Indexing Shp resources in Arches

In this post, I am going to explain my work with internals of Arches data processing and how it models heritage data in the back-end. I described how the shapefile reader works in my previous post and in this, I will be explaining how shapefile records are collaborating with PostGIS and Elasticsearch.