## Single simulation run consuming 230 h (!) of cpu-time – root cause analysis of performance

I think I’ve broken some kind of record of CPU-time consumption for a single simulation run… for a fairly trivial application…

As my frequent readers know, I’ve been playing lately with a simulation of Shelling’s segration model (see previous posts for details).

This post will demonstrate how to do performance analysis without using a profiler.

Fundamentally, the simulation occurs on a grid of cells, with a given size for the side. For instance, if the side of the grid is 10, then the grid will be a 10 x 10 matrix, with 100 cells.

Typically, the simulation halts, finds a final state where the state of the board can not be improved further, in less than 10 iterations.

For a 100 x 100 matrix, finding the end state takes somewhere around 90 CPU seconds, the vast majority being user time, a tiny portion system time. No I/O occuring.

The simulation is done on a 4 core Intel based laptop, as follows:

What happens if I increase the side of the grid by a factor of 10, that is, making the side 1000 units long…? Does the CPU-time increase linearly…?

Well, no. 230h does not follow from scaling linearly by a factor of 10, from 90 sec.

Linear scaling was hardly to be expected anyway, since by increasing the side of the matrix by factor 10, the number of cells within the grid increases by a factor of 100, that is, from 10.000 to 1.000.000. In other words, the number of cells to process grows quadratically with the side.

Unfortunately, 100 x 90 sec (2.5h)  is still a far cry from 230h’s… There’s still a factor of almost 100 to explain…

Let’s see if we can figure out where this factor 100 comes from… It can’t be due to increased memory consumption, since the memory requirements of the simulation are modest, even for the 1M cell simulation, nor can it be due to I/O, since there’s no I/O… it must be in the algorithms….

The number of iterations needed to find a halting state is more or less independent of the number of cells, so we can approximate that we always run 10 iterations, regardless of the number of cells. That is, the number of iterations is independent of the number of cells on the grid.

So, it it not the case that the 230 h of processing come from additional iterations.

However, each iteration will process 100 x more cells if the side of the matrix is increased by a factor of 10, so if an iteration for 1000 cells takes t time, then we can expect an iteration for 1M cells to take  100 times t, that is, 100t.

As shown above, with a matrix with side 100, we use 90 CPU seconds. Applying the reasoning above, our extended 1M matrix with side 1000 should thus take 90 x 100 sec, or about 2.5h.

Still, a factor ~100 off.

Let’s look at the main loop of the simulation, in pseudo-code:

for i in iterations:

for every cell in matrix:

neighbors()

countFoes()

if foes > FOEMAX:

findNewPlace()

#end loop

Basically, in each iteration, we do 2 function calls, followed by an if-statement, and depending on the outcome of that if-statement, yet another function call.

We all know that the function call overhead of Python is substantial, but the number of calls grows quadratically with the side of the matrix, thus we would expect our 1M cell simulation to finish in about 2.5 h instead of 230h.

Let’s then look into each of the functions:

findNeighbors() basically creates an alias name for a slice of the matrix, and returns that name:

```def neighbors(r,c): #returns a neighborhood matrix with self in middle
m = cells[r:r+3,c:c+3]
return m```

Nothing particularly time consuming there, nothing that depends on the number of cells.

countFoes(neighbors) takes the neighbors identified above, and counts the number of ‘foes’ .  It calls np.flatten() on neighbors, and then does a list comprehension creating a list of the real neighbors, basically excluding ‘self’ from the 3 x 3 matrix.  Then it calls np.count() on that list.

Thus, a couple of calls, but nothing remarkable in countFoes(). Again, nothing that depends on the number of cells.

So, let’s look into findNewPlace()

The main structure of findNewPlace() is to loop over the cells of the matrix, looking for the first free slot. When a free slot is found, there is a move of the ‘crowded’ agent to the found free slot. That operation is basically just a couple of assignments.

However, since the number of iterations within findNewPlace() depends on the number of cells, we should expect that this function will consume 100x time, when the side of the matrix is increased by 10.

Thus, here we have it, the reason our 1M simulation takes 230h, instead of the 2.5h we would have expected: in our function findNewPlace() we again have a quadratic dependence on the number of cells: an increase of the side of the matrix by factor 10, results in 100 times more iterations in findNewPlace().

Should we be interested in optimizing the performance of this simulation for large matrixes, we should focus on changing the implementation of findNewPlace().  This is where the bottleneck of our program resides.

Below some images of the 1M cell simulation: