Tick Database, Now Want to Run SQL

Discussion in 'Data Sets and Feeds' started by bscully27, Jun 28, 2012.

  1. DevBrian

    After removing a bool check on the producer side (whether the background thread should keep processing), throughput is above 10 million items per second.

    Constructor:
    _actionBlock = new ActionBlock<long>(l => { _handled++; }, new ExecutionDataflowBlockOptions() { SingleProducerConstrained = true });

    On background producer thread:
    while (true)
    {
        if (_actionBlock.InputCount <= 1000000)
            _actionBlock.Post(1);
    }

    If I add the method call back in on the consumer side, speed drops to the 5 million items per second rate.

    Academically, it's interesting to see that difference. But if I were to use this code to feed tick data to a strategy, I would eventually need to call a private method. Implementing a strategy inside a lambda expression isn't an option for us.
     
    #61     Jul 26, 2012
  2. You do not need to deal with lambda expressions. You can pass a reference to an Action<T> directly into the constructor of the ActionBlock; the Action<T> works exactly like a private method. You can define the Action<T> in the same place as your private methods and hand it to some sort of config method in the class where you set up the ActionBlocks. Simple as that.

    By the way, I would look for ways to get around BoundedCapacity, because it does slow you down. There are ways to slow down your producer if the consumer is the bottleneck and cannot be sped up. Or you can implement your own custom blocking mechanism: instead of, in the worst case, blocking on almost every iteration because the queue fills up again right after each added element, the custom mechanism would notice when the queue is, for instance, half empty and signal the source to fill it up again. But that is more boilerplate code that TPL Dataflow otherwise takes care of.
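    A minimal sketch of that wiring, assuming a hypothetical Strategy class with an OnTick method (the names are illustrative, not from the posts above):

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Sketch: the consumer logic lives in an ordinary private method,
// which is passed to the ActionBlock constructor as an Action<long>.
public class Strategy
{
    private long _handled;
    private readonly ActionBlock<long> _block;

    public Strategy()
    {
        // No lambda needed: the method group OnTick converts to Action<long>.
        _block = new ActionBlock<long>(OnTick,
            new ExecutionDataflowBlockOptions { SingleProducerConstrained = true });
    }

    private void OnTick(long tick) => _handled++;

    public bool Feed(long tick) => _block.Post(tick);

    public long HandledCount => _handled;

    // Drain the block so HandledCount is final.
    public Task CompleteAsync()
    {
        _block.Complete();
        return _block.Completion;
    }
}
```

    Requires the System.Threading.Tasks.Dataflow package; the compiler converts the method group to an Action<long>, so there is no lambda anywhere in the hot path.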

    I recommend you now run the thing with your actual algorithms, because most likely your bottleneck is no longer the passing of messages.

     
    #62     Jul 26, 2012
  3. DevBrian

    DevBrian

    True, my bottleneck has always been the strategies themselves, and none of our strategies can process more than perhaps 500k messages a second. Is this your experience as well?

    I ask because it makes one question the usefulness of any data stream optimized to read faster than your fastest strategy.
     
    #63     Jul 26, 2012
  4. Same issues here, but I manage to process ticks as part of a full strategy a lot faster than 500k/second. You want to look at basic optimization techniques, but first of all I would run the code through a profiler, as another poster suggested.

     
    #64     Jul 27, 2012
  5. And then learn proper programming ;)

    * Get rid of all floating point (double, float); use integer arithmetic.

    * Having done that, NEVER EVER do a division; always use multiply/shift, which is many times faster.

    The result is a well-defined result and granularity on all operations (contrary to floats, where you would have to round) and a performance difference of up to a factor of 50. PLUS, on many processors the FPU is a much more limited resource than the integer unit ;)
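    A small sketch of the multiply/shift trick for one concrete case (the constants are illustrative; each divisor needs its own precomputed reciprocal and valid input range):

```csharp
public static class DivideFree
{
    // Replace x / 10 by a multiply with a scaled reciprocal and a right shift.
    // ceil(2^19 / 10) = 52429; with shift 19 this is exact for 0 <= x < 262144.
    private const long RecipTen = 52429;

    public static long DivBy10(long x) => (x * RecipTen) >> 19;
}
```

    For example, DivBy10(12340) yields 1234. Compilers already do this automatically for division by a compile-time constant; the manual form matters when the divisor is only known at startup, such as a tick size.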
     
    #65     Jul 27, 2012
  6. Getting rid of all double and float variable types may not apply to all trading strategies. You could go lower and lower level and arrive at coding assembler, but that would get into a very deep discussion about the pros and cons. Not everything has to run on integer values; the speedup sometimes warrants it, yes, but often it is negligible.

    It's hard to judge what bogs down code without looking at it; that's why I recommended a profiler. I would not go as far as you in postulating to get rid of all non-integer types. I run on floats and doubles and do more than fine. I guess your advice comes in handy for really tweaking things at the end, but I am afraid we are dealing with much more subtle, trivial programming inefficiencies here. Just polling the queue count on every pass of the while loop is, by itself, a no-go.

     
    #66     Jul 27, 2012
  7. Actually, it applies wonderfully to trading.

    Start with bars: open, high, low, close are TICKS, not arbitrary numbers. So store and aggregate them in ticks.

    Then go from there, say another 6 digits (a hard factor), and use integers from then on. That is a fixed factor from the original bar to the expanded-resolution one. But instead of 6 digits (like 1.000.000) use a power-of-2 number, so going down to the original tick resolution is a simple shift.

    You may not be able to avoid ALL divisions, but most of them.

    Obviously that does not help if some overly smart developers decided to use floats all over their framework.

    It is EXTREMELY expensive to deal with any price comparison after any mathematical operation in floats, because not only do you have to do the operation, you also have to round the result to the proper resolution afterwards, which means you are basically wasting a LOT of cycles.

    Even something like (A + B) * TickSize in float is NOT guaranteed to be equal to A * TickSize + B * TickSize - the two floats may be CLOSE, but not IDENTICAL.

    Do that in ints instead, and the comparison is exact; then I compare:

    ((A * TickSize) >> X) + ((B * TickSize) >> X).

    Funny thing is that >> X is a VERY cheap operation these days thanks to barrel shifters - most likely 1 cycle.
    The alternative is:

    Math.Round(A*TickSize, x) + Math.Round(B*TickSize,x)

    And you do NOT want to know the cost of the Math.Round operation. It is EXPENSIVE.

    That is NOT relevant when you do real-time processing, but when you do extensive backtests plus optimization it may cut processing times by a factor of 10 and more, depending on how heavy the operation is ;)

    When you use modern Opterons you get the additional benefit that the two cores of a Bulldozer module... have ONE SHARED FPU - but two separate integer units ;)
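    A sketch of the failure mode, with an illustrative tick size of 0.1 (which has no exact binary representation): comparing a computed price against a quoted double misfires, while the same comparison in integer ticks is exact.

```csharp
public static class TickMath
{
    public const double TickSize = 0.1; // illustrative; not exactly representable in binary

    // Price arithmetic in doubles: 1 tick + 2 ticks computed as doubles gives
    // 0.30000000000000004, which does not equal the quoted price 0.3.
    public static bool EqualAsDoubles(long a, long b, double quotedPrice)
        => a * TickSize + b * TickSize == quotedPrice;

    // The same comparison in integer ticks is exact; convert back to a display
    // price with a single multiply only at the very end.
    public static bool EqualAsTicks(long a, long b, long quotedTicks)
        => a + b == quotedTicks;
}
```

    EqualAsDoubles(1, 2, 0.3) is false, EqualAsTicks(1, 2, 3) is true - which is the whole point of staying in ticks until display time.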
     
    #67     Jul 27, 2012
  8. With all due respect, I think you did not read my post carefully. I never claimed it would not apply to trading; quite the contrary, I stated clearly that it has its applications and can speed things up. I claimed that the previous poster is apparently dealing with much more trivial issues in his code, issues that could hand him at least an order-of-magnitude speedup if he implemented more efficient code. One aspect is what you pointed him to, but it can be very complicated and time-consuming to implement.

    Please consider that most financial and math/quant libraries DO USE double or float variable types; you would have to rewrite ALL such libraries, or write your own from scratch, if you wanted to go all-integer. In most cases that is neither feasible nor advisable. We do not need to argue about your points: yes, they are correct, and I think most everyone will agree with your suggestions. But you need to balance the benefit against the time it takes to implement and against the specific libraries the project is using. Blanket statements are rarely the solution. It really depends on what computations the code within the strategy in question is performing.

    There are very important differences even between dividing a float by 2 and multiplying it by 0.5, in terms of precision and computational cost. Yes, all of that matters, but my claim is: first tackle the issues that can be adjusted and changed in a matter of minutes, rather than delving right into issues that require hours, if not days, to carefully plan, think through, and then implement.

     
    #68     Jul 27, 2012
  9. I agree. And there is ONE thing only that solves it efficiently - a profiler. Plus possibly some simple mathematical optimizations. For example, I know one library where a moving average is calculated by adding up all the elements in the window, then dividing by the number of elements.

    Every time.

    Instead of adding everything up each time, remember the sum for the next run, then subtract the element falling out and add the one coming in.

    For a longer moving average that is significant (50 additions versus 1 addition and 1 subtraction).
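    A sketch of that incremental form (class and member names are illustrative): one addition and one subtraction per update, with the single remaining division done once per result.

```csharp
// O(1)-per-update simple moving average: keep a running sum over a ring buffer,
// subtract the element falling out, add the one coming in.
public class RollingMean
{
    private readonly double[] _window;
    private int _next;
    private int _count;
    private double _sum;

    public RollingMean(int length) { _window = new double[length]; }

    public double Add(double x)
    {
        _sum += x - _window[_next];          // 1 addition, 1 subtraction
        _window[_next] = x;
        _next = (_next + 1) % _window.Length;
        if (_count < _window.Length) _count++;
        return _sum / _count;                // the one remaining division
    }
}
```

    One caveat: over very long runs the running sum can drift in floats; kept in integer ticks, as advocated above, it stays exact.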

    But a profiler will show you where the problem is straight away. The one from Visual Studio (mind you, not in the lower tiers) is VERY good at that and ALSO properly helps with debugging issues in TPL Dataflow.

    Now we just need Visual Studio 2012 released soonish... its profilers are a LOT better than the ones from 2010.
     
    #69     Jul 27, 2012
  10. I run the VS11 beta and the profiler is awesome, especially the one profiling concurrent operations. I love it.

    P.S.: By the way, if the compiler and processor do their job, the whole float operation is pipelined. So if you have a loop, for example, the entire loop will only take x cycles longer to run on the FPU, rather than each loop iteration paying the cost. The difference between running on ints versus floats then becomes negligible. Even running a decent loop of 1000 elements on floats versus ints, assuming about a 5-cycle overhead on each iteration, will only cost you an extra 10 or so microseconds, depending on how your CPU is clocked. And a 5-cycle overhead on the whole loop is so small that you would not even be able to measure it on a Windows machine.

     
    #70     Jul 27, 2012