SSE2 EMA Crossover Backtest

I've always been interested in using computers to perform technical analysis and backtesting. One of my ideas was to optimize a backtest as far as possible, since even a tiny performance increase can save days or weeks of processing time.

Implementation:

To start with a simple backtester, I wrote a program in x86 assembly that uses the SSE2 instruction set to run four backtests at a time per thread. The algorithm I decided to implement buys when the EMA crosses above the current price and sells when the EMA crosses below it. The EMA period used is 25.
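
As a rough illustration of the rule described above, here is a minimal scalar sketch in C. It is not the assembly implementation itself. The smoothing factor 2/(period+1), seeding the EMA with the first price, and the one-unit long-only position are all assumptions, since the post does not specify them. The names BacktestResult and ema_crossover_backtest are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define EMA_PERIOD 25

typedef struct {
    int   buys;    /* number of entries */
    int   sells;   /* number of exits */
    float profit;  /* realized profit from closed one-unit trades */
} BacktestResult;

/* Scalar EMA crossover backtest over one price series.
 * Assumed rule: buy when the EMA crosses above the price,
 * sell when the EMA crosses back below it. */
BacktestResult ema_crossover_backtest(const float *price, size_t n)
{
    BacktestResult r = {0, 0, 0.0f};
    assert(price != NULL);
    if (n == 0)
        return r;

    const float alpha = 2.0f / (EMA_PERIOD + 1);  /* assumed smoothing factor */
    float ema = price[0];                         /* assumed seed value */
    int ema_was_above = 0;
    int holding = 0;
    float entry = 0.0f;

    for (size_t i = 1; i < n; i++) {
        /* Each EMA value depends on the previous one. */
        ema += alpha * (price[i] - ema);
        int ema_is_above = ema > price[i];

        if (ema_is_above && !ema_was_above && !holding) {
            holding = 1;                    /* EMA crossed above price: buy */
            entry = price[i];
            r.buys++;
        } else if (!ema_is_above && ema_was_above && holding) {
            holding = 0;                    /* EMA crossed below price: sell */
            r.profit += price[i] - entry;
            r.sells++;
        }
        ema_was_above = ema_is_above;
    }
    /* A position still open at the end is not counted in profit. */
    return r;
}
```

Calling ema_crossover_backtest(prices, n) on one array of closing prices returns the trade counts and realized profit for that series.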

Benchmarks:

My test sample was 10,000 backtests on 8,048 points of data. The test resulted in a backtesting speed of 81,750,267 data points per second per thread. This number is the raw total of points per second per thread across all four data sets; no single set of data advances that fast. Each thread actually processes four different sets of data in parallel, each at 20,437,566 points per second.

A bit less technical:

You may be wondering why it needs to be four sets of data at 20 million points per second each, rather than one set of data at 81 million points per second. The reason is that there is no way to split the work for a single data set evenly across the four floating-point values that SSE2 processes at a time. Each EMA value depends on the previous one, which means that unless you have four independent sets of data, SSE2 provides no performance gain. But this really doesn't matter anyway. I find it much more beneficial to process multiple sets of data at a slower rate than one set at a fast rate. Besides, who watches only one stock?
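
To make the dependency concrete, here is a portable C sketch of the four-lane layout. It is not SSE2 code: each iteration of the inner loop stands in for one lane of a 128-bit register, which a packed instruction would update all at once. The interleaved memory layout and the name ema_four_series are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* four single-precision floats fit in one 128-bit register */

/* Advance four independent EMAs in lockstep.
 * Input is interleaved: interleaved[i * LANES + lane] is point i of series
 * `lane`, so the four values needed at step i sit next to each other. */
void ema_four_series(const float *interleaved, size_t n_points,
                     float alpha, float ema_out[LANES])
{
    assert(interleaved != NULL && n_points > 0);

    float ema[LANES];
    for (int lane = 0; lane < LANES; lane++)
        ema[lane] = interleaved[lane];  /* assumed seed: first point of each series */

    /* The outer loop is sequential: step i needs the EMA from step i - 1. */
    for (size_t i = 1; i < n_points; i++) {
        const float *p = interleaved + i * LANES;
        /* The inner loop is parallel: the lanes never read each other. */
        for (int lane = 0; lane < LANES; lane++)
            ema[lane] += alpha * (p[lane] - ema[lane]);
    }

    for (int lane = 0; lane < LANES; lane++)
        ema_out[lane] = ema[lane];
}
```

Only the inner loop can be packed into one register, which is why the speedup needs four separate data sets rather than four consecutive points of one set.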

Putting it into perspective:

My machine has 6 cores with hyperthreading, giving me 12 logical cores. At four sets of data per thread, I can calculate 48 backtests simultaneously, each at 20.4 million data points per second (assuming every logical core sustains the single-thread rate). Given 252 trading days per year and 6.5-hour trading sessions, backtesting on 5-second bars for 20 years comes to 23,587,200 bars. At our backtesting rate of 20.4 million points per second, it would take about 1.16 seconds to backtest this algorithm on 48 different stock symbols using 5-second bars for the past 20 years. Not too bad, eh?
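
The arithmetic behind these figures can be checked with a small C sketch. The 6.5-hour trading session is an assumption inferred from the bar count (23,587,200 bars over 252 × 20 days is 4,680 five-second bars per day), and the function names are hypothetical.

```c
#include <assert.h>

/* Number of bars in a backtest of the given length.
 * session_hours is an assumed length of one trading day. */
long bars_in_backtest(int years, int trading_days_per_year,
                      double session_hours, int bar_seconds)
{
    assert(bar_seconds > 0);
    long bars_per_day = (long)(session_hours * 3600.0) / bar_seconds;
    return (long)years * trading_days_per_year * bars_per_day;
}

/* Wall-clock seconds to process `bars` points on one data set,
 * given the per-data-set rate in points per second. */
double backtest_seconds(long bars, double points_per_second)
{
    return (double)bars / points_per_second;
}
```

With 20 years, 252 days, 6.5 hours, and 5-second bars, bars_in_backtest gives 23,587,200. Dividing by the measured 20,437,566 points per second gives roughly 1.15 seconds, in line with the 1.16 seconds obtained from the rounded 20.4 million rate.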

Benchmarking Reference:

All of the numbers discussed above were benchmarked on my Intel Core i7-970 processor clocked at 3.20GHz, with RAM clocked at 1333MHz. They should be roughly on par if you have a 3.2GHz system with 1333MHz RAM; otherwise, expect them to vary with the power of your machine relative to mine.

Notes:

Note that if you have a Sandy Bridge or more recent processor, you can use the AVX instruction set to process eight sets of data at a time rather than four. In principle, an AVX implementation could roughly double all of the performance figures mentioned above, but I haven't written one (I don't have a Sandy Bridge).

Source Code:

Source Code