Saturday, January 30, 2016

Python/C++ Linking error

Python.h exposes the C API for embedding Python code in C/C++ programs. If gcc complains about undefined references,

undefined reference to `Py_Initialize'
undefined reference to `Py_Finalize'

this is a linking error: the Python library was not passed to the linker. Additionally, make sure the directory containing Python.h is on gcc's include path.

gcc *.c -I$PATH_TO_PYTHON_H -lpython2.7
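
To sanity-check the include path and linker flags, a minimal embedding program can be compiled the same way. This is a sketch of my own, not from the original post; the file contents and printed string are arbitrary.

#include <Python.h>

int main() {
    Py_Initialize();                                            // start the embedded interpreter
    PyRun_SimpleString("print 'hello from embedded Python'");   // run a line of Python 2 code
    Py_Finalize();                                              // shut the interpreter down
    return 0;
}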

Saturday, July 18, 2015

Parallel vector dot product with p-threads

In this post, I report my results for a parallel implementation of the vector dot product problem, i.e., SUM(A[i]*B[i]). The multiplication step is embarrassingly parallel; the aggregation step at the end (to obtain the final scalar result) is handled through a global accumulator protected by a mutex.

Full source code can be found here[1].
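
Since the code itself lives in the linked repository, here is a minimal sketch of the mutex-protected accumulation described above. The thread count, array size, and names are my own, not necessarily those of the original code.

#include <pthread.h>
#include <cstdio>

#define N 1000000
#define NUM_THREADS 4

static double A[N], B[N];
static double global_sum = 0.0;
static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

void *partial_dot(void *arg) {
    long tid = (long)arg;
    long chunk = N / NUM_THREADS;
    long start = tid * chunk;
    long end = (tid == NUM_THREADS - 1) ? N : start + chunk;

    double local = 0.0;                    // embarrassingly parallel part
    for (long i = start; i < end; i++)
        local += A[i] * B[i];

    pthread_mutex_lock(&sum_lock);         // aggregation, protected by the mutex
    global_sum += local;
    pthread_mutex_unlock(&sum_lock);
    return NULL;
}

int main() {
    for (long i = 0; i < N; i++) { A[i] = 1.0; B[i] = 2.0; }

    pthread_t threads[NUM_THREADS];
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, partial_dot, (void *)t);
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    printf("dot product = %f\n", global_sum);   // expect 2 * N with this test data
    return 0;
}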

Monte Carlo Pi and P-threads

In my last two posts[1][2], I looked into parallel versions of the Pi calculation problem with the support of OpenMP. Here, we revisit the same problem with the pthreads library, which is a little more low-level but gives the user greater control over the granularity of parallelization.

The pthread-based parallel portion of the implementation is sketched below, and the full source code can be found on GitHub[3].
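
Below is a reconstruction of that parallel portion with my own names and constants (the exact code is in the repository[3]): each thread samples points in the unit square with its own generator and folds its hit count into a mutex-protected global counter.

#include <pthread.h>
#include <random>
#include <cstdio>

#define NUM_THREADS 4
#define POINTS_PER_THREAD 10000000

static long global_hits = 0;
static pthread_mutex_t hits_lock = PTHREAD_MUTEX_INITIALIZER;

void *count_hits(void *arg) {
    long tid = (long)arg;
    std::default_random_engine eng(tid + 1);              // per-thread seed
    std::uniform_real_distribution<double> dist(0.0, 1.0);

    long local_hits = 0;
    for (long i = 0; i < POINTS_PER_THREAD; i++) {
        double x = dist(eng), y = dist(eng);
        if (x * x + y * y <= 1.0)                         // inside the quarter circle
            local_hits++;
    }

    pthread_mutex_lock(&hits_lock);                       // aggregation step
    global_hits += local_hits;
    pthread_mutex_unlock(&hits_lock);
    return NULL;
}

int main() {
    pthread_t threads[NUM_THREADS];
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, count_hits, (void *)t);
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    double pi = 4.0 * global_hits / ((double)NUM_THREADS * POINTS_PER_THREAD);
    printf("pi ~= %f\n", pi);
    return 0;
}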

Sunday, June 7, 2015

Parallel computation of Pi with OpenMP (series approximation)


In my previous post, earlier today, Pi was calculated using a randomized method and the serial solution was compared against its parallel implementation. In this post, I adopt a series-based approximation and run the same tests.

This[1] series approximation is based on BBP-type[2] formulas.

\pi = 4 \sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1}
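
The OpenMP version reduces to a single parallel loop with a sum reduction over the series terms. A minimal sketch of that approach follows; the term count and variable names are my own.

#include <omp.h>
#include <cstdio>

int main() {
    const long N = 100000000;   // number of series terms
    double sum = 0.0;

    #pragma omp parallel for reduction(+ : sum)
    for (long k = 0; k < N; k++) {
        // k-th term of the alternating series
        sum += (k % 2 == 0 ? 1.0 : -1.0) / (2.0 * k + 1.0);
    }

    printf("pi ~= %.12f\n", 4.0 * sum);   // partial sums combined by the reduction
    return 0;
}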


Saturday, June 6, 2015

Pi just got my parallel treatment


Pi, the mysterious number, plays its own famous role in the world of computing, especially in HPC, where it is often used to showcase raw computing power. There are plenty of good numerical approximations to pi. Among them, I picked a solution based on random numbers (the Monte Carlo method) and implemented a parallel version of the problem using OpenMP.

At first, things went awry: the parallel version performed far worse than the serial version. I had a hard time figuring out what went wrong, because my simple OpenMP parallel loop looked perfectly fine. The real problem turned out to be hidden behind the rand() calls I used to generate random numbers. rand() keeps shared internal state for its seed, so concurrent calls from multiple threads contend on that state and effectively serialize, defeating the parallel loop. So I moved to std::default_random_engine and std::uniform_real_distribution<double>, which come with C++11 and allow one independent generator per thread. With the new random number generator, the results were pretty convincing.
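
A minimal sketch of the fixed version, with one generator per thread (the names and sample count are my own, not necessarily those of the original code):

#include <omp.h>
#include <random>
#include <cstdio>

int main() {
    const long N = 100000000;   // total random points
    long hits = 0;

    #pragma omp parallel reduction(+ : hits)
    {
        // one independent engine per thread, seeded by thread id,
        // so threads never contend on shared RNG state
        std::default_random_engine eng(omp_get_thread_num() + 1);
        std::uniform_real_distribution<double> dist(0.0, 1.0);

        #pragma omp for
        for (long i = 0; i < N; i++) {
            double x = dist(eng), y = dist(eng);
            if (x * x + y * y <= 1.0)   // point falls inside the quarter circle
                hits++;
        }
    }

    printf("pi ~= %f\n", 4.0 * hits / (double)N);
    return 0;
}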


Saturday, April 11, 2015

Optimizing vector computations using Streaming SIMD Extensions (SSE) and loop unrolling

In this post, I'll be sharing my work on optimizing matrix/vector operations with the support of loop unrolling and SSE instructions at the hardware level. I completed this task as part of an assignment for the CS 4342 Advanced Computer Architecture module. The homework was taken from Marcus Holm's <marcus.holm@it.uu.se> High Performance Computing and Programming Lab 4: SIMD and Vectorization.

The code I wrote for this project is available at https://github.com/TharinduRusira/matrix-sse

First, the performance of the loop-unrolled version of matrix-vector multiplication was measured against its serial implementation. The multiplication unrolled by a factor of 4 is sketched below.
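
The embedded snippet did not survive, so here is a minimal reconstruction of a matrix-vector product unrolled by a factor of 4. The dimension, names, and test data are my own; the exact code is in the repository above.

#include <cstdio>

#define N 1024   // assume N is a multiple of 4 for simplicity

// y = M * x, with the inner loop unrolled by a factor of 4;
// four independent accumulators break the dependency chain
void matvec_unrolled(const double M[N][N], const double x[N], double y[N]) {
    for (int i = 0; i < N; i++) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (int j = 0; j < N; j += 4) {
            s0 += M[i][j]     * x[j];
            s1 += M[i][j + 1] * x[j + 1];
            s2 += M[i][j + 2] * x[j + 2];
            s3 += M[i][j + 3] * x[j + 3];
        }
        y[i] = s0 + s1 + s2 + s3;   // combine the four partial sums
    }
}

static double M[N][N], x[N], y[N];

int main() {
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        for (int j = 0; j < N; j++) M[i][j] = 1.0;
    }
    matvec_unrolled(M, x, y);
    printf("y[0] = %f\n", y[0]);   // expect N with this test data
    return 0;
}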
