Optimization: 100x performance boost (even without DevOps!) :-)


(First of all, apologies for the mention of ‘DevOps’ above: this post has, as far as I can tell – since I really don’t know what DevOps is, and haven’t found any place to download it… – nothing whatsoever to do with DevOps. But since everybody and his mother are talking about it, I thought it would be fun to see if mentioning it would impact the hit count…! 🙂)

So, no DevOps to be found here.

Instead, we are taking a peek into something less fluffy, more concrete, and of significant importance to you if you are in the business of building anything with software: optimization and algorithm analysis, that is, the raw performance boost obtained by utilizing your limited cubic inches optimally…

This post is a follow-up to my previous post on the topic; details about the program I’ve now optimized, with a performance boost of 100x, can be found here.

A short repetition of the application under benchmarking, for those of you too lazy to click the link above:

The application is a simulation of Thomas Schelling’s ‘Segregation Model’, taking place on a matrix of cells, each cell either occupied by an agent or empty. Each agent is happy, and remains on the cell where it currently resides, iff the majority of its neighbor cells are either empty or populated with agents of the same kind as itself. If instead the agent finds itself surrounded by more than a given threshold of agents of the opposite type, the agent will want to move to a new slot, where the number of opposing agents is under the threshold.
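The rule above can be sketched in a few lines of Python. This is a sketch only: the function name, the `EMPTY` encoding, and the default threshold are my own assumptions, not necessarily what the original program uses.

```python
# Sketch of the agent-satisfaction rule (all names are my own, not
# necessarily those of the original program). EMPTY cells count as
# neutral; an agent is happy iff the fraction of *opposing* occupied
# neighbors stays at or below a given threshold.

EMPTY = 0

def is_happy(grid, row, col, threshold=0.5):
    """Return True if the agent at (row, col) tolerates its neighborhood."""
    agent = grid[row][col]
    assert agent != EMPTY, "is_happy() is only defined for occupied cells"
    opposing = 0
    occupied = 0
    rows, cols = len(grid), len(grid[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # skip the agent's own cell
            r, c = row + dr, col + dc
            if 0 <= r < rows and 0 <= c < cols:
                neighbor = grid[r][c]
                if neighbor != EMPTY:
                    occupied += 1
                    if neighbor != agent:
                        opposing += 1
    # An agent with no occupied neighbors is trivially happy.
    return occupied == 0 or opposing / occupied <= threshold
```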

From a performance point of view, this application is clearly dominated by quadratic growth of execution time with respect to the size of the matrix: an increase by a factor of 10 of the side of the matrix will result in 10 x 10, that is, 100 times more cells to process.

My first implementation, where I deliberately did not pay any attention to performance and just implemented ‘the obvious’ solution for every function of the program, had acceptable performance, in the order of a few minutes, for all matrix sides up to somewhere around 100. But as soon as the matrix side grew beyond about 100, execution times increased dramatically. Already at a side of 200, execution took almost 30 min, and a side of 1000 took 230 h (!) – that’s right, almost 10 days of continuous execution!

As can be seen in the image below, the execution time with my initial, ‘naive’ algorithms grew not merely quadratically, which would be bad enough, but with the fourth power of the matrix side.


Since the problem itself is fundamentally quadratic – the number of cells grows quadratically with the matrix side, whether we like it or not – there isn’t much hope of doing better than overall quadratic performance, but growth by power 4 is clearly unacceptable.

Unfortunately, this situation, with extreme performance degradation as soon as the datasets start to grow, is far from uncommon in the world of software. I’m sure you’ve experienced many different applications, often labelled ‘industrial strength’ or something similar, that perform beautifully in demo settings, on trivially sized datasets, but that stall and become painful to use as soon as they are subjected to real-world-sized datasets.

The fundamental problem is that not enough attention was given, in the early design phases, to the performance of the various algorithms and data structures chosen for the application. And since these performance problems are difficult to reproduce while testing, few if any software development organizations take the time to design and implement datasets large enough for meaningful performance testing.

So, back to my simulation. After my testing revealed that performance sucks as soon as the dataset (the number of cells) grows beyond a certain number, I started analysing my program.

It didn’t take very long for me to discover the most likely culprit, that is, the function responsible for making the performance not merely quadratic but the square of that quadratic: it was the function findNewSlot(), which fundamentally is responsible for finding a new, better slot for an agent unhappy with its current position.

My initial, naive implementation of findNewSlot() was basically a loop over the entire matrix, looking for an acceptable empty slot. Even though I broke out of the loop as soon as such a free slot was found, the performance of that function was quadratic in the matrix side. Since the problem itself has a quadratic signature, combining that signature with a quadratic function gives the observed overall power 4 execution.
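For illustration, the naive search might have looked roughly like this. This is a hypothetical reconstruction – the actual code isn’t shown in the post – and the `acceptable()` predicate is a stand-in for the neighborhood test described earlier:

```python
# Hypothetical reconstruction of the naive findNewSlot(): scan the whole
# matrix for the first empty cell whose neighborhood is acceptable.
# Worst case this visits every cell, i.e. O(side^2) work per unhappy
# agent, which is what drives the overall power-4 behavior.

EMPTY = 0

def find_new_slot_naive(grid, agent, acceptable):
    """acceptable(grid, r, c, agent) decides whether the empty cell suits agent."""
    for r in range(len(grid)):
        for c in range(len(grid[0])):
            if grid[r][c] == EMPTY and acceptable(grid, r, c, agent):
                return (r, c)  # break out as soon as a usable free slot is found
    return None  # no acceptable slot anywhere: the agent stays put
```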

The graph below demonstrates this:


The blue plot shows execution time results from a number of runs with my original, naive implementation of findNewSlot(), each run with increasing matrix side. On top of that blue plot, I’ve fitted a power law (power 4) function, which fits the plot almost perfectly.

The yellow line shows a similar plot of CPU-time, after I changed findNewSlot() from performing a linear search over the entire matrix to instead using a queue of free cells. That is, instead of iterating over all the cells in the matrix, looking for a free slot with better characteristics than the source slot, the function now just grabbed the first free slot from the queue of free slots and analyzed it. If the analysis determined that the slot was not good enough, the newly found slot was pushed back to the end of the queue, and a new slot was popped from the front, until either a better slot was found or the queue had been exhausted.
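In code, the queue-based version might be sketched like this – my own reconstruction, with `collections.deque` standing in for whatever queue structure the original used:

```python
from collections import deque

# Sketch of the single-free-queue variant: pop candidate slots off the
# front, recycle rejects to the back, and stop after at most one full
# rotation so the same slot is never examined twice in one call.

def find_new_slot_queue(grid, agent, free_slots, acceptable):
    """free_slots is a deque of (row, col) empty cells, maintained by the caller."""
    for _ in range(len(free_slots)):      # at most one full pass over the queue
        r, c = free_slots.popleft()
        if acceptable(grid, r, c, agent):
            return (r, c)
        free_slots.append((r, c))         # refused: push back to the end
    return None                           # queue exhausted, the agent stays put
```

Note that each refused slot still costs a neighborhood analysis, which is why this change alone only bought a constant factor, not a better growth rate.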

This change improved performance by a factor of 2, as can be seen from the yellow plot.

However, the overall growth was still power 4, as can be seen from the cyan line, which plots a (different) power 4 function.

(For those of you who know your math, you’ll also realize that the yellow plot still has power 4 behavior, from the fact that it has the same slope as the red & blue plots above it: on a log-log graph, the slope of a plot is given by the power.)
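To make the slope argument concrete: for a power law t = k·nᵖ we have log t = log k + p·log n, so the exponent p is exactly the slope on a log-log plot, and any two measurements are enough to estimate it. Plugging in the timings quoted earlier (side 200 → ~30 min, side 1000 → ~230 h):

```python
import math

# On a log-log plot a power law t = k * n**p is a straight line with
# slope p, since log t = log k + p * log n. Two measured points give
# an estimate of the exponent:

def estimated_power(n1, t1, n2, t2):
    """Slope between two (size, time) points on a log-log plot."""
    return math.log(t2 / t1) / math.log(n2 / n1)

# Timings from the naive runs: side 200 -> ~30 min, side 1000 -> ~230 h
# (13 800 min). The estimate comes out close to 4, i.e. power-4 growth.
p = estimated_power(200, 30, 1000, 230 * 60)
```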

So, still unhappy with the overall performance – it would still take me 5 days of continuous execution to run the simulation on a million cell population – I decided to have yet another look at findNewSlot().

After some analysis, I realized that a lot of time was most likely spent on analysing and refusing empty slots: the slot popped from the free-slot queue did not provide a better environment for the agent looking for a new slot than the one it was currently sitting on, and thus was refused, pushed back to the end of the free queue, and a new cell was popped from the front.

To minimize these refusals, I implemented two free_lists, one for the green cells and one for the black, basically by pushing each refused cell to the ‘opposite’ queue. That is, if a green agent moved from or refused a cell, that cell was now pushed onto the free list of the opposite type of agent, the idea being that since the slot was refused, the likelihood of it being acceptable for the opposite type of agent is high.

This simple change almost completely eliminated extensive searches in the free lists, because now, for each cell that was processed, a freed-up slot would go to the queue where it was much more likely to be accepted in the next round, since that slot had a much higher likelihood of being located in an area dominated by neighbors of a specific, acceptable type.
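Put as code, the two-queue idea might look roughly like this. Again a sketch of my own: the GREEN/BLACK encoding, the dict of queues, and all names are assumptions, not the original implementation:

```python
from collections import deque

# Sketch of the two-free-list refinement: one queue per agent color.
# A slot refused by a green agent goes onto the black queue (and vice
# versa), so each queue tends to accumulate slots whose neighborhoods
# already favor that color, and most pops succeed on the first try.

GREEN, BLACK = 1, 2

def find_new_slot_two_queues(grid, agent, free_by_type, acceptable):
    """free_by_type maps agent type -> deque of candidate (row, col) slots."""
    other = BLACK if agent == GREEN else GREEN
    queue = free_by_type[agent]
    for _ in range(len(queue)):           # at most one pass over own queue
        r, c = queue.popleft()
        if acceptable(grid, r, c, agent):
            return (r, c)
        free_by_type[other].append((r, c))  # refused: likely good for the other color
    return None
```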

By this change, the overall performance of the simulation was improved by a factor of 100, which changed the overall execution from power 4 to quadratic. You can observe this by the fact that the slope of the pink plot is half that of the two previous plots, and by the quadratic function (brown) that I’ve fitted next to it.

A performance boost of a factor of 100 is enormous by any standard. Apply a factor of 100, either way, to anything you are familiar with, such as your salary or your daily commuting time, and you’ll get a solid feel for how big a change by a factor of 100 is.

Yet, it only required some careful analysis (no profiler was used to identify the bottlenecks!) to figure out where the execution bottleneck resided, and then some creative ideas on how to remove the quadratic behavior of findNewSlot().

Now, with that new design, the 1,000,000-cell simulation that previously took 10 days of execution runs in less than 10 min!

Bottom line: don’t get fooled by the fact that your new app or program performs beautifully in the lab: you need real world sized data to really know whether your app or program will be able to cope with professional workloads. 

