Saturday, April 11, 2015

Optimizing vector computations using Streaming SIMD Extensions (SSE) and loop unrolling

In this post, I'll be sharing my work on optimizing matrix/vector operations using loop unrolling and hardware SSE instructions. I completed this task as part of an assignment for the CS 4342 Advanced Computer Architecture module. The homework was taken from Marcus Holm's <marcus.holm@it.uu.se> High Performance Computing and Programming Lab 4 — SIMD and Vectorization.

The code I wrote for this project is available at https://github.com/TharinduRusira/matrix-sse

First, the performance of the loop-unrolled version of matrix-vector multiplication was measured against its serial implementation (a sketch of that serial baseline follows the unrolled code below). The code for matrix-vector multiplication unrolled by a factor of 4 is shown below.

// Matrix-vector multiplication with the inner loop unrolled by a factor of 4.
// Assumes n is a multiple of 4. Requires <ctime> for clock() and <iostream> for cout.
void matvec_unrolled(size_t n, float** mat_a, float* vec_b, float* v /* result vector */)
{
    size_t i, j;
    clock_t start = clock();
    for (i = 0; i < n; i++) {
        v[i] = 0.0f;
        for (j = 0; j < n; j += 4) {
            v[i] += mat_a[i][j+0] * vec_b[j+0]
                  + mat_a[i][j+1] * vec_b[j+1]
                  + mat_a[i][j+2] * vec_b[j+2]
                  + mat_a[i][j+3] * vec_b[j+3];
        }
    }
    cout << "time taken = " << clock() - start << endl;
}
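
For reference, here is a minimal sketch of the serial baseline it was measured against (a sketch written for this post; the exact code is in the repository linked above).

// Serial matrix-vector multiplication used as the baseline (sketch).
// mat_a is an n x n matrix, vec_b an n-element vector, and v receives the result.
void matvec_serial(size_t n, float** mat_a, float* vec_b, float* v)
{
    for (size_t i = 0; i < n; i++) {
        v[i] = 0.0f;
        for (size_t j = 0; j < n; j++) {
            v[i] += mat_a[i][j] * vec_b[j];
        }
    }
}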

The following sample code is a vectorized version of matrix-vector multiplication using SSE intrinsics. This version was later extended to the matrix-matrix multiplication problem; a sketch of that extension follows the code below, and the obtained results are listed further down.


// SSE version of matrix-vector multiplication.
// Requires <smmintrin.h> (SSE4.1, for _mm_dp_ps) and assumes M is a multiple of 4.
void mat_vec_mul_sse(size_t N, size_t M, float* avec, float** amat)
{
    float* vec = avec;
    float** mat = amat;
    cout << "vector size = " << M << "x1\nMatrix size = " << N << "x" << M << "\n" << endl;

    __m128 vec_b, vec_mat;

    float c[N]; // stores the final result (relies on the compiler's VLA extension)

    for (size_t i = 0; i < N; i++) // for each row of the matrix
    {
        __m128 vec_c = _mm_setzero_ps(); // accumulates the sum of products

        for (size_t j = 0; j < M; j += 4) // for each 4-element block in the row
        {
            vec_b = _mm_loadu_ps(vec + j);
            vec_mat = _mm_loadu_ps(mat[i] + j);
            // _mm_dp_ps with mask 0xFF broadcasts the 4-element dot product to all lanes
            vec_c = _mm_add_ps(_mm_dp_ps(vec_mat, vec_b, 0xFF), vec_c);
        }
        c[i] = _mm_cvtss_f32(vec_c); // extract the accumulated dot product for row i
    }
}
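
The matrix-matrix extension itself lives in the repository. A minimal sketch of how the same idea extends to C = A * B for N x N matrices is shown below; it assumes B has already been transposed into a hypothetical B_t (so that row j of B_t is column j of B and both operands are read contiguously), that N is a multiple of 4, and that SSE4.1 is available for _mm_dp_ps.

// Sketch only, not the exact repository code: C = A * B for N x N matrices.
void mat_mat_mul_sse(size_t N, float** A, float** B_t, float** C)
{
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            __m128 acc = _mm_setzero_ps();           // accumulates the dot product
            for (size_t k = 0; k < N; k += 4) {
                __m128 a = _mm_loadu_ps(A[i] + k);   // 4 elements of row i of A
                __m128 b = _mm_loadu_ps(B_t[j] + k); // 4 elements of column j of B
                acc = _mm_add_ps(_mm_dp_ps(a, b, 0xFF), acc);
            }
            C[i][j] = _mm_cvtss_f32(acc);            // element (i, j) of the product
        }
    }
}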



In this homework, we compare the performance of the serial matrix-vector and matrix-matrix multiplications against their vectorized counterparts. The obtained results are given below.

Before that, here is the CPU information of the old notebook I used to run the experiments.

*-cpu
          description: CPU
          product: Intel(R) Core(TM)2 Duo CPU     T6670  @ 2.20GHz
          vendor: Intel Corp.
          physical id: 0
          bus info: cpu@0
          version: 6.7.10
          serial: 0001-067A-0000-0000-0000-0000
          slot: Intel(R) Genuine processor
          size: 1200MHz
          capacity: 1200MHz
          width: 64 bits
          clock: 200MHz
          capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx x86-64 constant_tsc arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm ida dtherm tpr_shadow vnmi flexpriority cpufreq

Results for Matrix-Vector multiplication


Dimensions and execution times (clock cycles); x = serial mat-vec mul, y = serial mat-vec mul with loop unrolling, z = SSE version:

Matrix        Vector     Serial (x)   Unrolled (y)   Speedup (x/y)   SSE (z)    Speedup (y/z)
[100,100]     [100,1]    87.33        51.33          1.70            40         1.28
[200,200]     [200,1]    348.67       209.00         1.67            155.67     1.34
[400,400]     [400,1]    1349.67      816.33         1.65            660.33     1.24
[800,800]     [800,1]    5447.67      3205.00        1.70            2532.33    1.27
[1600,1600]   [1600,1]   21769.0      12892.5        1.69            9882.5     1.31


Results for Matrix-Matrix multiplication



Dimensions and execution times (clock cycles):

Matrix A      Matrix B      Serial mat-mat mul   SSE version     Speedup
[100,100]     [100,100]     11,092.00            3,919.0         2.83
[200,200]     [200,200]     87,910.33            29,970.67       2.93
[400,400]     [400,400]     737,988.0            236,188.0       3.13
[800,800]     [800,800]     8,961,429.0          2,009,663.50    4.46


Optimizing the Matrix Multiplication Problem using OpenMP and Improved Cache Utilization

It's been a while since my last post; it was a hectic final semester, but I should admit it's one of those semesters I enjoyed the most. One project I particularly liked was parallelizing the famous matrix multiplication problem. I worked with my lab buddy Chalitha on this assignment, and it was fun.

Experiments were run on a machine with the following specifications.

Architecture:       x86_64
CPU op-mode(s):     32-bit, 64-bit
Byte Order:         Little Endian
CPU(s):             4
On-line CPU(s) list:   0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s):          1
NUMA node(s):       1
Vendor ID:          GenuineIntel
CPU family:         6
Model:              37
Stepping:           5
CPU MHz:            1199.000
BogoMIPS:           5321.42
Virtualization:     VT-x
L1d cache:          32K
L1i cache:          32K
L2 cache:           256K
L3 cache:           3072K
NUMA node0 CPU(s): 0-3
Model Name    : Intel(R) Core(TM) i5 CPU    M 480  @ 2.67GHz

The code is available on GitHub at https://github.com/TharinduRusira/ParallelMatrix
Initially, the loops in the problem were distributed among the cores of the machine using the #pragma omp parallel for directive from the standard OpenMP library.


#pragma omp parallel for num_threads(NUM_THREADS) default(none) shared(A,B,C,N) private(i,j,k)
for (i = 0; i < N; i++)
{
    for (j = 0; j < N; j++)
    {
        C[i][j] = 0;
        for (k = 0; k < N; k++)
        {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}



The parallelized code was further optimized by taking the transpose of B before performing A*B. In the naive solution for A*B, B is accessed down its columns, which is undesirable for cache performance because C stores two-dimensional arrays in row-major order.


#pragma omp parallel for num_threads(NUM_THREADS) shared(B,B_t) private(i,j)
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        B_t[i][j] = B[j][i];
    }
}

#pragma omp parallel for num_threads(NUM_THREADS) default(none) shared(A,B_t,C,N) private(i,j,k)
for (i = 0; i < N; i++)
{
    for (j = 0; j < N; j++)
    {
        C[i][j] = 0;
        for (k = 0; k < N; k++)
        {
            C[i][j] += A[i][k] * B_t[j][k];
        }
    }
}


The three versions of the code (serial, parallel, and optimized) were run for multiple iterations, and average runtimes were collected at a 95% confidence level. The following figures show the runtimes of each version and the relative speedups between the serial and parallel versions and between the serial and optimized versions. The maximum average speedup reached 4x in the later iterations of the optimized version of the code.
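
As an illustration of how such averages can be collected, here is a minimal timing sketch using OpenMP's wall-clock timer (an assumed harness, not necessarily the one used for these measurements; multiply() stands in for whichever version is being measured).

#include <omp.h>

// Runs the given multiplication routine 'iterations' times and returns the
// average wall-clock time in seconds.
double average_runtime(void (*multiply)(), int iterations)
{
    double total = 0.0;
    for (int it = 0; it < iterations; it++) {
        double start = omp_get_wtime();
        multiply();
        total += omp_get_wtime() - start;
    }
    return total / iterations;
}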
Figure 1. Run time comparison of the three versions of matrix multiplication

Figure 2. Performance comparison between serial and parallel versions

Figure 3. Performance comparison between serial and optimized parallel versions

Wednesday, December 31, 2014

I write Biology too...

It's been an exciting final semester at the university, and to make things more colorful, I decided to take a Bioinformatics class. I have never had a formal Biology education, though I spent quite a few hours in my junior-high science class looking at the cells in an onion peel or doodling the cellular structure of Chlamydomonas.
Better late than never; I have time to learn a little more biology now. The first assignment of the Bioinformatics class was to write a concise essay on cells, cell reproduction, and basic genetics.
I append below what I wrote for the class. Even though the content is nothing more than what's taught in an ordinary high school Biology class, I must say that I enjoyed learning how life actually works, and apparently our life is pretty interesting (and complicated).







Saturday, November 22, 2014

Building a Scanner and Parser for your programming language

As the CA component of the Compiler Design module, I was given this project to implement a lexical analyzer and a parser (syntax analyzer) for C- (a language specification from the well-known textbook Compiler Construction: Principles and Practice).

flex (lex) was used to implement the lexical analyzer, while the parser was built using bison (yacc). Flex uses the header file generated by Bison to identify the tokens defined in the Bison script.

The Bison script defines the token types that are valid in our language specification. A small sketch of this arrangement is shown below.
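
As an illustration (a hypothetical fragment, not the actual assignment code), running "bison -d parser.y" produces a parser.tab.h header containing the token definitions, which the Flex scanner includes so that both tools agree on the token codes:

/* scanner.l -- minimal sketch; parser.tab.h comes from "bison -d parser.y" */
%option noyywrap
%{
#include "parser.tab.h"   /* token codes (IF, ELSE, NUM, ID, ...) declared by Bison */
%}
%%
"if"        { return IF; }
"else"      { return ELSE; }
[0-9]+      { return NUM; }
[a-zA-Z]+   { return ID; }
[ \t\n]+    { /* skip whitespace */ }
%%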


Saturday, July 5, 2014

Saving and Indexing Shp resources in Arches

In this post, I am going to explain my work on the internals of Arches data processing and how Arches models heritage data in the back-end. I described how the shapefile reader works in my previous post; here, I will explain how shapefile records interact with PostGIS and Elasticsearch.






Sunday, June 15, 2014

Shapefile reader for Arches

In my previous post I introduced how shapefile support can be provided for Arches from a very abstract, front-end perspective. By the time of writing this post, I have started working on the data processing methods, so this post will describe how things happen in the back-end. There are certain things that still need to be glued together in order to complete the workflow, and those will come in a later post.

Shapefile is a file format defined by ESRI for managing geographical data [1][2].
Currently, Arches does not support the shapefile format. With this improvement, Arches will be able to read a user's legacy data from a shapefile and store it inside Arches as resource instances.

Django's GDAL (Geospatial Data Abstraction Library) interface was used in the development [3][4].
The current code is available in the data_import_mapping development branch of the Bitbucket repository trusira/arches [5].
