The debate between Sparohok and myself concerned optimization vs. expressiveness for backtesting, not real-time execution of a system. I assumed your post was also talking about backtesting. I agree completely with you regarding the need to handle real-time market bandwidth/spikes when executing a system. However, I would consider this a core design necessity, on par with code correctness, or else merely a real-time hardware/bandwidth requirement. It is not what I would consider a "design limit", which you previously equated with cost, performance, and scalability in the established context of backtesting systems.

Not wholly incompetent, yes. That's odd. There are plenty of dead-end designs created by even highly competent programmers. Read on...

This may be true for consumer or business software applications where compatibility and familiarity must be prioritized. It is NOT true for trading system development, where there are no rules. If you impose rules or "design limits" as you previously described in terms of cost, performance, and scalability, you will only diminish your chances of success.

Yes, very few techniques are secret.

You are wrong on the first point. Many programmers and architects really are incompetent. You do realize that it is often not the technically superior solution that prevails? Notice how applications are still not being written for Linux as aggressively as for Windows, even though Linux has some compelling technical and cost advantages? Notice how insecure MS Windows is without any third-party firewalls and anti-virus? Notice how Intel is finally giving in to their own engineers saying that it is wrong to keep increasing clock speed and pipeline depth, just about abandoning the Pentium 4, while AMD is beating them with a better design? History may regard the entire Pentium 4 architecture as a mistake. Tell me again how there aren't incompetent programmers/engineers out there, and I'll point out more counterexamples.
Hi again Prophet, Another day of this? I guess so.

If by optimization you mean designing good, efficient algorithms, then by all means we are in agreement. But that is not the generally accepted definition in the computer science community. The design process should be focused on getting the lowest computational complexity. Optimization, on the other hand, is the process of improving performance without changing computational complexity. In other words, good design will change an O(n^2) problem into an O(n log n) problem. Optimization will take an O(n log n) design and make it two, three, four times as fast... but not any more scalable. I seldom waste time with optimization until I run into an unacceptable performance problem. Good design, on the other hand, needs to be built in from day one.

A profiler will never help you design a good algorithm. It is exclusively an optimization tool. Although I guess it may help you find out when you are using a very bad algorithm, and as such it can be a useful learning tool.

As for flat files versus databases, consider the problem of filtering tick data for bad ticks. With a CSV file, you read the whole thing into memory, run your filter on it, and write it out again. What happens if the power goes out while you are writing the file? You are screwed; your data is gone. What happens if you want to change just one tick? You still have to rewrite the whole file, or write some pretty hairy code to patch one value in an ASCII file. What happens if the file doesn't fit in physical memory? What if you want to access a single tick? What if you want to delete a tick? An O(1) problem just became an O(n) problem. What if you want to access the file from more than one program at once? More than one computer? Yes, all these issues can be addressed by writing more code, but someone has already written all that code and more: it is called a database. More important, they've debugged it for you. Why reinvent the wheel?

If your methodology is statistically invalid and you are overfitting the data, more market data and more analysis will appear to increase your returns when all it is really increasing is the implementation shortfall. The sheer quantity of data hides the logical flaws in the design. It is much easier to see why something is working or not working when applied to one stock than to 1000. Given that the vast majority of backtested strategies do not work in practice, I think this is a real problem.

Martin
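To make the database point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The ticks table layout, the symbol, and the values are illustrative assumptions, not anything described in this thread; the point is simply that correcting a single bad tick becomes an indexed, transactional UPDATE rather than a rewrite of the whole file.

```python
import sqlite3

# Illustrative schema: one row per tick, keyed by symbol and timestamp.
conn = sqlite3.connect("ticks.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS ticks (
        symbol TEXT,
        ts     INTEGER,   -- epoch milliseconds (assumed convention)
        price  REAL,
        size   INTEGER,
        PRIMARY KEY (symbol, ts)
    )
""")

# Correcting one bad tick is a single indexed UPDATE, not a rewrite of the
# whole file. The transaction commits on success and rolls back on error,
# so a crash mid-write does not destroy the data.
with conn:
    conn.execute(
        "UPDATE ticks SET price = ? WHERE symbol = ? AND ts = ?",
        (1234.25, "ES", 1111111111000),   # illustrative values
    )

# Random access to one tick uses the primary-key index instead of scanning
# an ASCII file end to end.
row = conn.execute(
    "SELECT price, size FROM ticks WHERE symbol = ? AND ts = ?",
    ("ES", 1111111111000),
).fetchone()
```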
Well, the hallmark of a system headed for the dumpster is starting out without knowing what you intend to build, how long it will be in service, and the points with respect to system load or problem space where it becomes useless or invalid. Trading systems require just as much discipline as any other software system if you expect to build anything of long-term value. Building high-quality software that is extensible and high performance is just not that difficult, nor is it very expensive if you know what you are doing. I guess we will agree to disagree. Have a great weekend...
Just to clarify this point, I certainly agree with Prophet that once you have a working, tested, valid strategy the natural thing to do is scale it up, get more data, widen to more markets, etc. I couldn't agree more and I am not arguing for complacency or resting on laurels. But all this leads back to my original thesis: correctness and good design first, optimization later (and only if necessary). Martin
Hi Martin, Yeah, it is getting old, and taking too much time. Maybe we should call a truce? Agree to disagree?

Perhaps I am using the term "optimization" a little broadly. The choice and design of an algorithm is usually considered optimization of the overall program. Some of the most successful low-level optimizations are actually algorithmic changes such as loop-order rearrangement. Why split hairs here? Optimization usually means reducing running time. You are saying that simple weakly-algorithmic optimizations like caching of computational intermediates are a waste of time because they don't fall in the category of algorithmic complexity. That goes against common sense.

I never said a profiler would produce a good algorithm. Why don't you try replying to the things I actually said?

Don't be silly. Never heard of redundant power? Never heard of computational checkpointing or caching of computational intermediates to disk?

Concerning my systems... There is one file per day, per market... minimal memory requirements. Data is corrected for gaps when it's downloaded from the server, each night, ONCE per file, and only if there was a data outage, which is rare (5 times in the last year). When the data is demanded by the backtester, the CSV is loaded, filtered for obvious last/bid/ask out-of-line ticks, then written out as a binary, fixed-record-length file, ONCE. Fortunately, my Globex/Interactive Brokers data is very clean. Subsequent requests access the binary file. There is very little overhead in reading a single tick, not that I do that: just seek to the record in the appropriate file. Normally I load one day, one market at a time to process. Very few algorithms do random access to single ticks! Most look at blocks at a time. You know that, right? Just share the files out among multiple computers. I only have one computer writing to a given set of data, partitioned by market. This isn't credit card transaction processing, and does not need the complexity and management issues of a 100 GB transactional database. It's fast and efficient market data analysis.

Back to the competence versus amount of computation debate. Two separate issues, which can interact together in many different ways. Why do you keep confusing these issues, and the discussion? I'm no market God. Sure my systems have flaws and imperfections. All systems have flaws. Sure there are incredibly broken methodologies out there, but we're not talking about that incompetent extreme. We're talking about real systems that struggle to find statistically valid patterns in the noise. All things being equal, most statistical measures are more valid over larger data sets. Why should trading systems be different? I'm sure not going to reduce the amount of data I process in the hope that I'm too incompetent to handle it, or that the extra data is tricking me into thinking I'm making more money than I am!

I now live off my trading system income. My systems trade 6E, NQ and ER2 full time, fully mechanically, 1 or 2 contracts per market, 23 hours/day, 5 days/week. I currently operate 51 real-time trading systems organized in voting groups, each group managing a fixed position size per market. I am working to bring ES and YM systems online, as well as some European index futures and other FX futures markets, maybe stocks or options someday too. The point is I see and feel the effect of my actions and choices every single trading day. I can say with great confidence that computational power and lots of data are tremendous advantages.
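For readers who want to picture the flat-file pipeline described above, here is a rough Python sketch of the CSV-to-binary conversion and the per-record seek. The column layout, record format, and bad-tick filter rule are illustrative assumptions, not Prophet's actual code.

```python
import csv
import struct

# Fixed-length record: timestamp (epoch seconds), last, bid, ask, volume.
# The field layout is an assumption for this example.
RECORD = struct.Struct("<ddddi")
RECORD_SIZE = RECORD.size

def csv_to_binary(csv_path, bin_path):
    """Convert one day's tick CSV (assumed columns: ts, last, bid, ask, volume)
    into a fixed-record binary file. This conversion happens ONCE per file."""
    with open(csv_path, newline="") as src, open(bin_path, "wb") as dst:
        for ts, last, bid, ask, vol in csv.reader(src):
            last, bid, ask = float(last), float(bid), float(ask)
            # Placeholder filter: drop ticks whose last price is wildly
            # out of line with the quoted bid/ask.
            if bid > 0 and ask > 0 and not (0.5 * bid <= last <= 1.5 * ask):
                continue
            dst.write(RECORD.pack(float(ts), last, bid, ask, int(vol)))

def read_tick(bin_path, n):
    """Random access to the n-th tick: a single seek, no full-file scan."""
    with open(bin_path, "rb") as f:
        f.seek(n * RECORD_SIZE)
        return RECORD.unpack(f.read(RECORD_SIZE))
```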
You're only half agreeing with me here. We all agree on correctness and good design, including efficient design from the start. But what if the strategy requires a certain minimum amount of data to test as valid? According to you, you'd reject the strategy if it didn't validate with the amount of data you deem appropriate, or perhaps some insanely small amount of optimization like the 10 minutes per day you talked about previously. The reality is you may often find partially valid or promising but untradable strategies... not worth the time or capital risk to deploy in real time. The logical next step is to analyze the results and/or add data and computation to try to make them more profitable, given one's time, CPU or capital restrictions. I have tested billions of system variants, all semi-profitable (insufficient Sharpe), all non-correlated with each other, and all requiring diversification (more computation) or more data to validate better. I have a few hundred GB of daily returns for these variants, and have had to develop some interesting tools to sort through them. I'll send you some on DVD-R if you want. The point is that it is easy to come up with partially valid systems, and only a fool would disregard the methodology because adding data or computation was bad or a burden.
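As an illustration of the kind of sorting tool mentioned above, the sketch below ranks strategy variants by an annualized Sharpe ratio computed from daily returns and keeps only mutually low-correlated ones. The function names, thresholds, and the 252-trading-day annualization are assumptions for the example, not the actual tools described in the post.

```python
import numpy as np

def annualized_sharpe(daily_returns):
    """Annualized Sharpe ratio from daily returns (252 trading days assumed)."""
    r = np.asarray(daily_returns, dtype=float)
    return np.sqrt(252) * r.mean() / r.std(ddof=1)

def select_uncorrelated(returns_by_variant, min_sharpe=0.5, max_corr=0.3):
    """returns_by_variant: dict of variant name -> 1-D array of daily returns.
    Greedily keep the highest-Sharpe variants whose returns are only weakly
    correlated with everything already kept. Thresholds are illustrative."""
    ranked = sorted(returns_by_variant.items(),
                    key=lambda kv: annualized_sharpe(kv[1]), reverse=True)
    kept = []
    for name, r in ranked:
        if annualized_sharpe(r) < min_sharpe:
            break  # everything after this is below the cutoff
        if all(abs(np.corrcoef(r, returns_by_variant[k])[0, 1]) <= max_corr
               for k in kept):
            kept.append(name)
    return kept
```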
Prophet, I'm not critiquing your system or your software, no need to get defensive. I am making generic recommendations for people writing backtesting and automated trading software, and my recommendation is to use a database, to focus on correctness and design rather than optimization, and to start with a manageable dataset. If the characteristics of your system allow you to use flat files, that's fine; I'm not telling you to switch. I think for "typical" quant work a database has all the advantages I listed, but as we've said before, every application is different. You're obviously a smart guy; I'm sure if you end up needing a database to solve some future problem, you'll use the best tool for the job. Certainly, in my case, if I need to switch to flat files or BLOBs for performance, I'm just going to have to bite the bullet and do it.

Also, I don't think designing bad strategies has anything to do with competence. I'm certainly not accusing anyone of incompetence. Almost every idea I have had about the market has been a bad idea. As we all know, it's not easy to overcome trading costs and be consistently profitable in all market conditions. My goal is to have the right tools and the right methodology to accurately test the ideas so I can only trade the good ones, and so far that's worked out quite well for me.

Work with me here, Prophet! I'm trying to find the common ground so we can bring this to an end; I don't think at this point it's a particularly fruitful discussion.

Martin
Couldn't agree more. I'm not saying never optimize. I'm saying optimize only when necessary - when the actual performance does not meet requirements. Martin
I'm sorry I come off as defensive. I'm just trying to have an interesting debate here, and hoping the truth will come out. I won't argue these points anymore. Let our previous posts stand for others to judge.

You are missing or ignoring the distinction between (bad) strategies that simply don't test profitable, and improper, incompetent testing. The designer is incompetent when they ignore (or aren't educated on) the basic rules of statistical validity, such as using insufficient data for the time frame being tested, or assuming one can always execute at the last price without allowing for or knowing the spread. These are some of the major flaws in beginning system testing, in addition to the major programming/design problems we've already mentioned. This is the result of incompetence, or the failure to self-educate.

Is overfitting of data related to incompetence? Maybe, maybe not. Again there is this distinction, so it could go either way. However, if one brings in more data to validate the fit, and it distorts the results even more, as you claim will happen, then that is a clear case of the designer's incompetence, or inability or lack of motivation to adhere to principles of statistical validity. The extra data did not cause the problem. It usually can help. Such designers will have a difficult time regardless of the amount of data or computation. See the flaw in your reasoning?

I'm trying to do the same! I'm not trying to be negative or mean. If I sound curt at times, I apologize. I am only trying to get you to think about what I am saying a little more deeply, partly because there are inevitable flaws in how we word things and in how we misunderstand each other.
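To illustrate the execute-at-the-last-price flaw mentioned above, here is a toy fill model that crosses the spread (plus optional slippage) instead of assuming a fill at the last trade price. The function and parameter names are hypothetical, for illustration only.

```python
def simulated_fill(side, last, bid, ask, slippage=0.0):
    """Return a pessimistic fill price for a market order in a backtest."""
    if side == "buy":
        return ask + slippage   # pay the offer, plus assumed slippage
    if side == "sell":
        return bid - slippage   # hit the bid, minus assumed slippage
    raise ValueError("side must be 'buy' or 'sell'")

# Assuming a fill at `last` instead overstates the edge by roughly half the
# spread per trade, which compounds into a large implementation shortfall.
```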
You're right, incompetence or lack of knowledge or lack of dedication can lead to bad strategies. I was just trying to point out the converse: bad strategies do not mean an incompetent trader. Even the best quant trader will have many bad ideas, which is why we need to test our ideas extensively and creatively, while constantly being cautious of overfitting. Martin