Sunday, June 7, 2015

Parallel computation of Pi with OpenMP (series approximation)

In my previous post, from earlier today, Pi was calculated using a randomized method and the serial solution was compared against its parallel implementation. In this post, I adopt a series-based approximation and run the same tests.

This[1] series approximation is based on BBP-type[2] formulas.
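As a minimal sketch of how such a series can be summed in parallel, the code below assumes the classic BBP formula, pi = sum over k >= 0 of 16^(-k) * (4/(8k+1) - 2/(8k+4) - 1/(8k+5) - 1/(8k+6)). The series referenced in [1] may well be a different BBP-type formula, and the function name bbp_pi is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Sum the first `terms` terms of the classic BBP series for pi.
// Assumption: this is the textbook BBP formula, not necessarily the
// exact series used in the post.
double bbp_pi(int terms) {
    assert(terms > 0);
    double sum = 0.0;
    // Each term depends only on k, so the iterations are independent and
    // the partial sums can be combined with a reduction. Without OpenMP
    // enabled (e.g. no -fopenmp), the pragma is ignored and the loop
    // simply runs serially.
    #pragma omp parallel for reduction(+:sum)
    for (int k = 0; k < terms; ++k) {
        const double scale = std::pow(16.0, -k);
        sum += scale * (4.0 / (8.0 * k + 1.0)
                      - 2.0 / (8.0 * k + 4.0)
                      - 1.0 / (8.0 * k + 5.0)
                      - 1.0 / (8.0 * k + 6.0));
    }
    return sum;
}
```

Because every term is scaled by 16^(-k), the series converges quickly: around a dozen terms already reach the limits of double precision, so a timing comparison would need far more work per term (or higher precision arithmetic) before the parallel version can pay off.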


Saturday, June 6, 2015

Pi just got my parallel treatment

Pi, the mysterious number, often plays its own famous role in the world of computing, especially when it comes to HPC and raw computing power. There are plenty of good numerical approximations to pi. Among them, I picked one based on random numbers (Monte Carlo) and implemented a parallel version of it using OpenMP.

At first, things went awry: the parallel version performed far worse than the serial version. I had a hard time figuring out what went wrong, because my simple OpenMP parallel loop looked perfectly fine. I later learned that the real problem was hidden behind the rand() call I used to generate random numbers. It looked like rand() and its seed initialization prevented the loop from exploiting its parallelism. So I moved to std::default_random_engine and std::uniform_real_distribution<double>, which come with C++11. With the new random number generator, the results were pretty convincing.
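A minimal sketch of the idea, not the original code: the work is split into chunks, and each chunk owns its own std::default_random_engine, so no generator state is shared between threads (a likely reason rand(), with its hidden global state, scaled so poorly). The function name, the chunking scheme and the seed + chunk-index seeding are all assumptions made for illustration.

```cpp
#include <cassert>
#include <random>

// Estimate pi by sampling points in the unit square and counting how many
// fall inside the quarter circle of radius 1.
double monte_carlo_pi(long samples_per_chunk, int chunks, unsigned seed) {
    assert(samples_per_chunk > 0 && chunks > 0);
    long hits = 0;
    // Chunks are independent, so they can run on different threads; the
    // integer hit counts are combined with a reduction.
    #pragma omp parallel for reduction(+:hits)
    for (int c = 0; c < chunks; ++c) {
        // One engine per chunk: nothing is shared between threads.
        // Seeding with seed + c is a simplification; it keeps runs
        // reproducible but does not guarantee independent streams.
        std::default_random_engine engine(seed + c);
        std::uniform_real_distribution<double> dist(0.0, 1.0);
        for (long i = 0; i < samples_per_chunk; ++i) {
            const double x = dist(engine);
            const double y = dist(engine);
            if (x * x + y * y <= 1.0) ++hits;
        }
    }
    return 4.0 * hits / (static_cast<double>(samples_per_chunk) * chunks);
}
```

A call such as monte_carlo_pi(250000, 8, 12345u) draws two million points. Since the seeds depend only on the chunk index, the estimate is the same regardless of how many threads OpenMP uses.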

Saturday, April 11, 2015

Optimizing vector computations using Streaming SIMD Extensions (SSE) and loop unrolling

In this post, I'll be sharing my work on optimizing matrix/vector operations with loop unrolling and SSE instructions at the hardware level. I completed this task as part of an assignment for the CS 4342 Advance Computer Architecture module. The homework was taken from Marcus Holm's High Performance Computing and Programming Lab 4: SIMD and Vectorization.

The code I wrote for this project is available here.

First, the performance of the loop-unrolled version of the matrix-vector multiplication was measured against its serial implementation. The code for the unrolled (by a factor of 4) matrix-vector multiplication is below.
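The original listing is not reproduced in this excerpt, so what follows is only a minimal sketch of what a 4-way unrolled matrix-vector product can look like. The function name, the row-major std::vector<float> layout and the use of four separate accumulators are assumptions, not the project's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y = A * x for a row-major rows x cols matrix, inner loop unrolled by 4.
std::vector<float> matvec_unrolled4(const std::vector<float>& a,
                                    const std::vector<float>& x,
                                    std::size_t rows, std::size_t cols) {
    assert(a.size() == rows * cols && x.size() == cols);
    std::vector<float> y(rows, 0.0f);
    for (std::size_t i = 0; i < rows; ++i) {
        const float* row = a.data() + i * cols;
        // Four independent accumulators shorten the dependency chain
        // between consecutive additions.
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        std::size_t j = 0;
        for (; j + 4 <= cols; j += 4) {
            s0 += row[j]     * x[j];
            s1 += row[j + 1] * x[j + 1];
            s2 += row[j + 2] * x[j + 2];
            s3 += row[j + 3] * x[j + 3];
        }
        float sum = (s0 + s1) + (s2 + s3);
        // Remainder loop for column counts that are not a multiple of 4.
        for (; j < cols; ++j) sum += row[j] * x[j];
        y[i] = sum;
    }
    return y;
}
```

For a 2 x 5 matrix holding 1 through 10 and x = {1, 2, 3, 4, 5}, this yields {55, 130}. Note that splitting the sum across accumulators changes the order of the floating-point additions, so results can differ in the last bits from a plain serial loop.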

Optimizing Matrix Multiplication problem using OpenMP and Improved Cache Utilization

It's been a while since my last post. It was a hectic final semester, but I should admit that it's one of the semesters I enjoyed the most. One project I liked was parallelizing the famous matrix multiplication problem. I worked with my lab-buddy Chalitha on this assignment and it was fun.
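This excerpt does not show how the project combined OpenMP with better cache utilization, so here is just one common way to do it, as a sketch under that assumption: parallelize over the rows of the result and use the i-k-j loop order, so that the innermost loop walks both B and C row by row. The function name and the flat row-major layout are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C = A * B for n x n row-major matrices stored in flat vectors.
std::vector<double> matmul_ikj(const std::vector<double>& a,
                               const std::vector<double>& b, int n) {
    assert(n > 0);
    assert(a.size() == static_cast<std::size_t>(n) * n && b.size() == a.size());
    std::vector<double> c(a.size(), 0.0);
    // Each thread writes to distinct rows of C, so no synchronization is
    // needed inside the loop nest.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        const std::size_t row = static_cast<std::size_t>(i) * n;
        for (int k = 0; k < n; ++k) {
            const double aik = a[row + k];
            const std::size_t brow = static_cast<std::size_t>(k) * n;
            // Innermost loop is unit-stride over both B and C, which is
            // friendlier to the cache than the textbook i-j-k order.
            for (int j = 0; j < n; ++j) {
                c[row + j] += aik * b[brow + j];
            }
        }
    }
    return c;
}
```

Transposing B up front or tiling the loops are other routes to the same goal. Which one wins depends on the matrix size relative to the cache sizes of the machine.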

Experiments were run on a machine with the following specifications.

Architecture:       x86_64
CPU op-mode(s):     32-bit, 64-bit
Byte Order:         Little Endian
CPU(s):             4
On-line CPU(s) list:   0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s):          1
NUMA node(s):       1
Vendor ID:          GenuineIntel
CPU family:         6
Model:              37
Stepping:           5
CPU MHz:            1199.000
BogoMIPS:           5321.42
Virtualization:     VT-x
L1d cache:          32K
L1i cache:          32K
L2 cache:           256K
L3 cache:           3072K
NUMA node0 CPU(s): 0-3
Model Name    : Intel(R) Core(TM) i5 CPU    M 480  @ 2.67GHz

Wednesday, December 31, 2014

I write Biology too...

It's been an exciting final semester at the university and, to make things more colorful, I decided to take a Bioinformatics class. I have never had a methodical Biology education, though I spent quite a lot of hours in my junior-high science class looking at the cells in an onion peel or doodling the cellular structure of the Chlamydomonas.
It's always better late than never; I have time to learn a little more biology now. The first assignment of the Bioinformatics class was to write a concise essay on cells, cell reproduction and basic genetics.
I append below what I wrote for the class. Even though the content is nothing more than what's taught in an ordinary high school Biology class, I must say that I enjoyed learning how life actually works and apparently our life is pretty interesting (and complicated).

Saturday, November 22, 2014

Building a Scanner and Parser for your programming language

As the CA component of the Compiler Design module, I was given this project to implement a lexical analyzer and a parser (syntax analyzer) for C- (a language specification from the famous textbook, Compiler Construction: Principles and Practice).

flex (lex) was used to implement the lexical analyzer, while the parser was built using bison (yacc). Flex uses the header file generated by bison in order to identify the tokens defined in the bison script.

The bison script defines the different token types that are valid in our language specification.
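To make that division of labour concrete without reproducing the flex and bison sources, here is a hand-written C++ stand-in. The enum plays the role of the token codes in the bison-generated header (bison typically numbers named tokens from 258 upwards), and next_token plays the role of the flex-generated scanner that returns those codes to the parser. All names, and the small token set, are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Stand-in for the token codes a bison-generated header would provide.
enum Token {
    TOK_END = 0, TOK_ID = 258, TOK_NUM, TOK_INT, TOK_IF, TOK_ELSE,
    TOK_RETURN, TOK_VOID, TOK_WHILE, TOK_OTHER
};

// Stand-in for the scanner: consumes one token starting at `pos`,
// advances `pos` past it, and returns its code.
Token next_token(const std::string& src, std::size_t& pos) {
    assert(pos <= src.size());
    auto at = [&](std::size_t i) { return static_cast<unsigned char>(src[i]); };
    while (pos < src.size() && std::isspace(at(pos))) ++pos;
    if (pos >= src.size()) return TOK_END;
    if (std::isdigit(at(pos))) {
        while (pos < src.size() && std::isdigit(at(pos))) ++pos;
        return TOK_NUM;
    }
    if (std::isalpha(at(pos))) {
        const std::size_t start = pos;
        while (pos < src.size() && std::isalpha(at(pos))) ++pos;
        const std::string word = src.substr(start, pos - start);
        // Keywords share the identifier pattern and are told apart here.
        if (word == "int") return TOK_INT;
        if (word == "if") return TOK_IF;
        if (word == "else") return TOK_ELSE;
        if (word == "return") return TOK_RETURN;
        if (word == "void") return TOK_VOID;
        if (word == "while") return TOK_WHILE;
        return TOK_ID;
    }
    ++pos;  // Any other single character (operators, punctuation).
    return TOK_OTHER;
}
```

In the real setup none of this is written by hand: the token names are declared once in the bison script, and the flex rules include the generated header and return those names directly.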
