SSE2 EMA and SMA

One of the things I like doing is optimizing algorithms as much as possible. I took two very common investing algorithms, the simple moving average (SMA) and the exponential moving average (EMA), and implemented them in assembly using the SSE2 instruction set.

Benchmarks:

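For n data points and a period of p, this SMA takes O(n*p) time to run (O(n^2) in the worst case, when the period is as long as the data), because it has to sum the entire period every round. EMA takes O(n) time, because each round reuses the previous EMA value and so stays linear.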
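To make the difference in cost concrete, here is a minimal scalar sketch in C of the two approaches described above. It is not the assembly implementation from this post. The function names, the choice of alpha = 2 / (period + 1), and seeding the EMA with the first data point are assumptions made for illustration.

```c
#include <stddef.h>

/* Naive SMA: re-sums the whole window for every output value, so the
   work is roughly n * period additions. Only out[period - 1] onward is
   written; earlier slots are left untouched. */
void sma_naive(const float *in, float *out, size_t n, size_t period)
{
    if (period == 0 || n < period)
        return;
    for (size_t i = period - 1; i < n; ++i) {
        float sum = 0.0f;
        for (size_t j = i + 1 - period; j <= i; ++j)
            sum += in[j];
        out[i] = sum / (float)period;
    }
}

/* EMA: each value is built from the previous EMA value, so the work is
   one multiply-add step per point regardless of the period.
   Assumptions: alpha = 2 / (period + 1), seeded with the first input. */
void ema(const float *in, float *out, size_t n, size_t period)
{
    if (n == 0)
        return;
    const float alpha = 2.0f / ((float)period + 1.0f);
    out[0] = in[0];
    for (size_t i = 1; i < n; ++i)
        out[i] = alpha * in[i] + (1.0f - alpha) * out[i - 1];
}
```

The inner loop in sma_naive is the extra 'loop' of summing that grows with the period. The EMA loop has no such inner loop, which is why its cost per point stays constant.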

SMA:

This graph shows the number of SMAs calculated per second per thread, with the SMA period used for the calculation on the x axis. The result is exactly what we expected: throughput falls as the period grows, since each increase in period requires one more 'loop' of summing.

The graph includes a measurement for every period. Raw Data

EMA:

There is no point in graphing EMA: because each round reuses the previous EMA value, the cost per round does not depend on the period, so the graph would just be a flat line. For EMA I measured 318,374,731 operations per second, where one operation represents one 'round' of EMA. So if you had four different sets of data with about 79.5 million points each (318,374,731 / 4), it would take about one second to calculate the EMA values for every data point in all four sets on one core, with the four sets processed at the same time.

A bit less technical:

You may be wondering why it needs to be four sets of data at 79.5 million points per second each, rather than one set of data at 318 million points per second. The reason is that a single EMA cannot be split evenly across the four floating point values that SSE2 processes at a time. Each EMA value depends on the previous one, which means that unless you have four sets of data, SSE2 provides no performance gain. But this really doesn't matter anyway. I find it much more beneficial to process multiple sets of data at a slower rate than one set at a fast rate. Besides, who watches only one stock?
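The four-sets-at-once idea can be sketched with compiler intrinsics in C. This is a rough illustration under assumptions, not the assembly from this post: the function name ema4_sse, the interleaved layout (sample t of series k stored at index 4*t + k), and seeding each lane with its first sample are all hypothetical choices. The packed-float intrinsics used here are available through the SSE2 header.

```c
#include <emmintrin.h> /* SSE2 header; also provides the packed-float intrinsics */
#include <stddef.h>

/* Computes four independent EMAs, one per vector lane.
   Assumed layout: in[4*t + k] is sample t of series k, and n is the
   number of samples per series. alpha is the smoothing factor. */
void ema4_sse(const float *in, float *out, size_t n, float alpha)
{
    if (n == 0)
        return;
    const __m128 a = _mm_set1_ps(alpha);
    const __m128 b = _mm_set1_ps(1.0f - alpha);

    /* Seed each lane with the first sample of its own series. */
    __m128 prev = _mm_loadu_ps(in);
    _mm_storeu_ps(out, prev);

    for (size_t t = 1; t < n; ++t) {
        __m128 x = _mm_loadu_ps(in + 4 * t);
        /* The dependency on prev runs through time within each lane,
           never across lanes, so four series advance in one step. */
        prev = _mm_add_ps(_mm_mul_ps(a, x), _mm_mul_ps(b, prev));
        _mm_storeu_ps(out + 4 * t, prev);
    }
}
```

Each loop iteration still has to wait for the previous one, exactly as in the scalar case. The gain comes only from the four lanes carrying four unrelated series, which is why a single series sees no benefit.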

Benchmarking Reference:

All of the numbers discussed above were benchmarked on my Intel i7 970 processor clocked at 3.20GHz, with RAM clocked at 1333MHz. Your numbers should be roughly on par if you have a comparable 3.2GHz system with 1333MHz RAM; otherwise they will vary with the power of your machine relative to mine.

Notes:

Note that if you have a Sandy Bridge or more recent processor, you can use the AVX instruction set to process eight sets of data at a time rather than four. In principle, this could double all of the performance figures mentioned above if I made an AVX implementation (I don't have a Sandy Bridge, so this is untested).

Source Code:

Source Code