Saturday, July 18, 2015

Parallel vector dot product with pthreads

In this post, I report my results for a parallel implementation of the vector dot product, i.e., SUM(A[i]*B[i]). The multiplication step is embarrassingly parallel, but the partial results must be aggregated at the end to obtain the final result. This aggregation goes through a global variable protected by a mutex.

Full source code can be found here[1]
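As a rough illustration of the approach described above (not the code from the linked source), here is a minimal sketch in which each thread multiplies and sums its own slice, then adds its partial result to a shared accumulator under a mutex. The names DotShared, DotTask, dot_worker and parallel_dot are made up for this example, and the block-wise split of the index range is an assumption.

```cpp
#include <pthread.h>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative names only; none of these come from the original source code.
struct DotShared {
    const double* a;
    const double* b;
    double sum;            // shared accumulator for the final result
    pthread_mutex_t lock;  // protects sum
};

struct DotTask {
    DotShared* shared;
    std::size_t begin;  // first index handled by this thread
    std::size_t end;    // one past the last index handled by this thread
};

static void* dot_worker(void* arg) {
    DotTask* task = static_cast<DotTask*>(arg);
    double local = 0.0;
    // Embarrassingly parallel part: no shared writes inside the loop.
    for (std::size_t i = task->begin; i < task->end; ++i)
        local += task->shared->a[i] * task->shared->b[i];
    // Aggregation: one locked update per thread, not one per element.
    pthread_mutex_lock(&task->shared->lock);
    task->shared->sum += local;
    pthread_mutex_unlock(&task->shared->lock);
    return nullptr;
}

double parallel_dot(const std::vector<double>& a, const std::vector<double>& b,
                    std::size_t num_threads) {
    assert(a.size() == b.size());
    assert(num_threads > 0);
    const std::size_t n = a.size();

    DotShared shared;
    shared.a = a.data();
    shared.b = b.data();
    shared.sum = 0.0;
    pthread_mutex_init(&shared.lock, nullptr);

    std::vector<pthread_t> threads(num_threads);
    std::vector<DotTask> tasks(num_threads);
    std::vector<char> started(num_threads, 0);
    const std::size_t chunk = (n + num_threads - 1) / num_threads;

    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = (t * chunk < n) ? t * chunk : n;
        const std::size_t end = (begin + chunk < n) ? begin + chunk : n;
        tasks[t].shared = &shared;
        tasks[t].begin = begin;
        tasks[t].end = end;
        if (pthread_create(&threads[t], nullptr, dot_worker, &tasks[t]) == 0)
            started[t] = 1;
        else
            dot_worker(&tasks[t]);  // fall back to running the slice inline
    }
    for (std::size_t t = 0; t < num_threads; ++t)
        if (started[t]) pthread_join(threads[t], nullptr);

    pthread_mutex_destroy(&shared.lock);
    return shared.sum;
}
```

Accumulating into a thread-local variable first keeps the mutex out of the inner loop, so the lock is taken only once per thread.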

Monte Carlo Pi and pthreads

In my last two posts[1][2], I looked into the parallel version of the Pi calculation problem with the support of OpenMP. Here, we will revisit the same problem with the pthreads library, which is a little more low-level but gives the user greater control over the granularity of parallelization.

The full source code, including the pthread-based parallel portion of the implementation, can be found on GitHub[3].
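For readers who want a feel for the structure, below is a minimal sketch of a pthread-based Monte Carlo estimate. It is not the code from the GitHub repository: the names PiTask, pi_worker and monte_carlo_pi_pthreads, the per-thread seeding scheme (base seed plus thread index), and the choice to give each thread its own result slot are all assumptions made for this example.

```cpp
#include <pthread.h>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Illustrative names only; not taken from the original repository.
struct PiTask {
    unsigned seed;        // per-thread seed so threads do not share generator state
    std::size_t samples;  // number of random points this thread draws
    std::size_t hits;     // points that fell inside the quarter circle
};

static void* pi_worker(void* arg) {
    PiTask* task = static_cast<PiTask*>(arg);
    std::default_random_engine engine(task->seed);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::size_t hits = 0;
    for (std::size_t i = 0; i < task->samples; ++i) {
        const double x = unit(engine);
        const double y = unit(engine);
        if (x * x + y * y <= 1.0) ++hits;
    }
    task->hits = hits;  // each thread writes only its own slot, so no lock is needed
    return nullptr;
}

double monte_carlo_pi_pthreads(std::size_t total_samples, std::size_t num_threads,
                               unsigned base_seed) {
    assert(total_samples > 0);
    assert(num_threads > 0);

    std::vector<pthread_t> threads(num_threads);
    std::vector<PiTask> tasks(num_threads);
    std::vector<char> started(num_threads, 0);
    const std::size_t base = total_samples / num_threads;
    const std::size_t extra = total_samples % num_threads;

    for (std::size_t t = 0; t < num_threads; ++t) {
        tasks[t].seed = base_seed + static_cast<unsigned>(t);
        tasks[t].samples = base + (t < extra ? 1 : 0);
        tasks[t].hits = 0;
        if (pthread_create(&threads[t], nullptr, pi_worker, &tasks[t]) == 0)
            started[t] = 1;
        else
            pi_worker(&tasks[t]);  // fall back to running this share inline
    }

    std::size_t hits = 0;
    for (std::size_t t = 0; t < num_threads; ++t) {
        if (started[t]) pthread_join(threads[t], nullptr);
        hits += tasks[t].hits;
    }
    // Area of the quarter circle over the area of the unit square is pi/4.
    return 4.0 * static_cast<double>(hits) / static_cast<double>(total_samples);
}
```

The explicit split of samples across threads is the kind of control over granularity that pthreads leaves to the user; OpenMP would make that decision through its loop schedule.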

Sunday, June 7, 2015

Parallel computation of Pi with OpenMP (series approximation)


In my previous post, earlier today, Pi was calculated using a randomized method, and the serial solution was compared against its parallel implementation. In this post, I adopt a series-based approximation and run the same tests.

This[1] series approximation is based on BBP-type[2] formulas.

$$\pi = 4 \sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1}$$
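A truncated version of this sum maps naturally onto an OpenMP reduction, since every term depends only on its index k. The sketch below is one way to write it, not necessarily how the original implementation does it; the function name leibniz_pi and the terms parameter are chosen for this example. If the code is compiled without OpenMP support, the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>

// Sums the first `terms` terms of 4 * sum_{k>=0} (-1)^k / (2k + 1).
// The name and signature are illustrative, not from the original post.
double leibniz_pi(long terms) {
    assert(terms > 0);
    double sum = 0.0;
    // Each thread accumulates a private partial sum; OpenMP combines them at the end.
    #pragma omp parallel for reduction(+ : sum)
    for (long k = 0; k < terms; ++k) {
        const double sign = (k % 2 == 0) ? 1.0 : -1.0;
        sum += sign / (2.0 * static_cast<double>(k) + 1.0);
    }
    return 4.0 * sum;
}
```

The series converges slowly (the error after N terms is roughly 1/N), so a call such as leibniz_pi(1000000) gives only about six correct digits. That slow convergence is what makes it a convenient workload for timing comparisons.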


Saturday, June 6, 2015

Pi just got my parallel treatment


Pi, the mysterious number, often plays its own famous role in the world of computing, especially when it comes to HPC and raw computing power. There are plenty of good numerical approximations of pi. Among them, I picked a solution based on random numbers (Monte Carlo) and tried to implement a parallel version of it using OpenMP.

At first, things went awry: the parallel version performed far worse than the serial version. I had a hard time figuring out what went wrong because my simple OpenMP parallel loop looked perfectly fine. I later learned that the real problem was hidden behind the rand() call I used to generate random numbers. It looked like the rand() function and its seed state, which is shared by all threads, prevented the loop from exploiting its parallelism. So I moved to std::default_random_engine and std::uniform_real_distribution<double>, which come with C++11. With the new random number generator, the results were pretty convincing.
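One way to keep the generator out of the threads' way is to give every thread its own engine, so that no random-number state is shared. The sketch below shows that idea under a few assumptions: the function name monte_carlo_pi_omp and the seeding scheme (base seed plus thread number) are invented for this example, and the original code may be organized differently. When compiled without OpenMP, the pragmas are ignored and the function runs on a single thread.

```cpp
#include <cassert>
#include <random>
#ifdef _OPENMP
#include <omp.h>
#endif

// Illustrative sketch; name, signature and seeding scheme are assumptions.
double monte_carlo_pi_omp(long samples, unsigned base_seed) {
    assert(samples > 0);
    long hits = 0;
    #pragma omp parallel reduction(+ : hits)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        // Engine and distribution are declared inside the parallel region,
        // so each thread owns a private copy.
        std::default_random_engine engine(base_seed + static_cast<unsigned>(tid));
        std::uniform_real_distribution<double> unit(0.0, 1.0);

        #pragma omp for
        for (long i = 0; i < samples; ++i) {
            const double x = unit(engine);
            const double y = unit(engine);
            if (x * x + y * y <= 1.0) ++hits;
        }
    }
    return 4.0 * static_cast<double>(hits) / static_cast<double>(samples);
}
```

Seeding neighbouring threads with consecutive integers is simple but statistically crude; it is good enough for a timing experiment, not for serious simulation work.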


Saturday, April 11, 2015

Optimizing vector computations using Streaming SIMD Extensions (SSE) and loop unrolling

In this post, I'll be sharing my work on optimizing matrix/vector operations with loop unrolling and SSE instructions at the hardware level. I completed this task as part of an assignment for the CS 4342 Advanced Computer Architecture module. The homework was taken from Marcus Holm's <marcus.holm@it.uu.se> High Performance Computing and Programming Lab 4: SIMD and Vectorization.

The code I wrote for this project is available at https://github.com/TharinduRusira/matrix-sse

First, the performance of the loop-unrolled version of the matrix-vector multiplication was measured against its serial implementation. The inner loop was unrolled by a factor of 4.
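To make the unrolling concrete, here is a minimal sketch of a matrix-vector product whose inner loop is unrolled by a factor of 4. It is written for this page and is not the code from the repository: the function name matvec_unrolled4, the row-major std::vector<float> layout, and the use of four separate accumulators are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes y = A * x for a row-major rows x cols matrix A.
// Illustrative sketch; name and data layout are assumptions.
std::vector<float> matvec_unrolled4(const std::vector<float>& A,
                                    const std::vector<float>& x,
                                    std::size_t rows, std::size_t cols) {
    assert(A.size() == rows * cols);
    assert(x.size() == cols);
    std::vector<float> y(rows, 0.0f);
    for (std::size_t i = 0; i < rows; ++i) {
        const float* row = A.data() + i * cols;
        // Four independent accumulators shorten the dependency chain between iterations.
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        std::size_t j = 0;
        for (; j + 4 <= cols; j += 4) {
            s0 += row[j] * x[j];
            s1 += row[j + 1] * x[j + 1];
            s2 += row[j + 2] * x[j + 2];
            s3 += row[j + 3] * x[j + 3];
        }
        float sum = (s0 + s1) + (s2 + s3);
        // Remainder loop for column counts that are not a multiple of 4.
        for (; j < cols; ++j) sum += row[j] * x[j];
        y[i] = sum;
    }
    return y;
}
```

Because floating-point addition is not associative, the unrolled version can differ from a plain loop in the last few bits, which is worth keeping in mind when comparing results against the serial implementation.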

Optimizing Matrix Multiplication problem using OpenMP and Improved Cache Utilization

It's been a while since my last post. It was a hectic final semester, but I should admit that it's one of the semesters I enjoyed the most. One project I particularly liked was parallelizing the famous matrix multiplication problem. I worked with my lab-buddy Chalitha on this assignment, and it was fun.

Experiments were run on a machine with the following specifications.

Architecture:       x86_64
CPU op-mode(s):     32-bit, 64-bit
Byte Order:         Little Endian
CPU(s):             4
On-line CPU(s) list:   0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s):          1
NUMA node(s):       1
Vendor ID:          GenuineIntel
CPU family:         6
Model:              37
Stepping:           5
CPU MHz:            1199.000
BogoMIPS:           5321.42
Virtualization:     VT-x
L1d cache:          32K
L1i cache:          32K
L2 cache:           256K
L3 cache:           3072K
NUMA node0 CPU(s): 0-3
Model Name    : Intel(R) Core(TM) i5 CPU    M 480  @ 2.67GHz
