I think all math does is shed light on what is often hidden by the sheer complexity of the [inter]relations and the (in|de)duction needed to find them. Then, coded, the computer does most of the heavy lifting. I agree though, I have never seen anything like human intuition. Let me give you an example. Someone recently asked me how I came up with my model. I tell you, all I did was intuit my way to it, and it came almost naturally to me. Granted, I have been doing this for a very long time, so it is not like I just drank some OJ and poof out the top of my head it came. I estimate that it would take a computer probably something like 500 trillion simulations to find it. And then, it might not understand how to trade it. What is strange is, I don't even know why I was drawn to this model. It is as if it found me instead of me finding it. I tell you it is very strange. No question in my mind, the brain is not running some serial algorithm called creativity. Intuition and creativity are somehow exploring truly colossal search spaces way beyond even the most powerful computers, all almost effortlessly. Still, I wonder how Watson would do. To me, human intuition is close to being voodoo and is the one thing that makes me a closet mystic.
Get used to hanging out on sites like these: http://quant.stackexchange.com/questions/tagged/cointegration https://www.quantopian.com/posts/grid-searching-for-profitable-cointegrating-portfolios
All very reasonable tries. In the end, it has to stand up to testing on actual bid/ask quotes. The hardest part of getting started is getting a database. Start collecting tick data now on as many instruments as possible. It won't be wasted effort. Many of those time series will be useful at one time or another. If there is one exercise I would call step 0 in all of this, it is learning to align data correctly. So if, say, you are collecting B/A millisecond-or-better data on GE and SPY (to start with - ideally you want all the data in existence being stored in realtime), you are going to get many times more SPY ticks than GE ticks. Learn the different mechanisms for aligning data. They all have their pluses and minuses. Start with, say, resampling GE to have the same number of data points as SPY. Try fill-with-last, fill-with-next, linear interpolation, etc. Then plot the two series so that the x-axis is time and the y-axis is price. This is the "Hello World" of data science. Use whatever language you want for research, but I recommend Python. With pandas, scipy, numpy and scikits, you are going to get very far. N.B., you are going to need a disk (persistence) structure that won't fail. Already this is getting expensive. However, in a pinch just use a small RAID array. Pay close attention to the strategies CERN uses to store data.
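To make the "fill with last" alignment concrete, here is a minimal pandas sketch (the timestamps and prices are made up for illustration): the sparser GE series is reindexed onto SPY's denser timestamps, carrying the last known GE quote forward.

```python
# Minimal alignment sketch: forward-fill a sparse series (GE) onto the
# timestamps of a denser one (SPY). All data here is invented.
import pandas as pd

spy_idx = pd.to_datetime([
    "2024-01-02 09:30:00.100",
    "2024-01-02 09:30:00.150",
    "2024-01-02 09:30:00.220",
    "2024-01-02 09:30:00.310",
])
spy = pd.Series([470.10, 470.12, 470.11, 470.15], index=spy_idx, name="SPY")

ge_idx = pd.to_datetime([
    "2024-01-02 09:30:00.120",
    "2024-01-02 09:30:00.300",
])
ge = pd.Series([30.01, 30.03], index=ge_idx, name="GE")

# Forward-fill GE onto SPY's timestamps. The first SPY tick predates any
# GE quote, so it has no value to carry forward and stays NaN.
ge_on_spy = ge.reindex(spy.index, method="ffill")
print(ge_on_spy.tolist())  # [nan, 30.01, 30.01, 30.03]
```

Swapping `method="ffill"` for `"bfill"` gives fill-with-next, and `ge.reindex(spy.index.union(ge.index)).interpolate(method="time")` is one way to get the linear-interpolation variant.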
Someone asked me what I use. Here is a short (non-exhaustive) list of the software I use daily:

Real-time
=======
C++
C#
Python
Clojure
Scala
F#
Erlang

Research
=======
Python
Haskell
R
Matlab (if I had the compiler I would probably run some models in realtime)
Mathematica

Everything sits on top of Apache everything:
===========================
Spark
Mesos/Marathon
Docker
Hadoop
Hive
Storm
Cassandra
HDF5

I hold everything together with:
==============
Git
GlusterFS

OS
===
Linux
Windows
Sounds impressive, but why would one need two OSes or more than one programming language? I used to use C++ and now I use Python, but it baffles me why I would use two at the same time. Sometimes I might be forced to -- e.g., the Bloomberg or IB APIs -- but Bloomberg has Python now too.
Can we step back for a second and start with a few assumptions about a noob's resources and capacity:
1) He/she will not be trading beyond a 15-30 min time frame.
2) He will execute by paying the spread.
3) He will not have the ability to automate equities trading except by using a broker's pair algo, an off-the-shelf equities spreader, or by hand. That means no custom execution or legging in at good prices.
And one last thing: what about starting with futures spreads, which have readily available data through the TT API?
As to the two OSes: F# and C# run best on Windows, and most of my GUI is in Windows. I also use Mono on Linux, but it is still a little clunky, though getting better every day. That explains C#, F# and Windows. As to why use more than one language: for example, some underlying Apache technologies are best used through the JVM. That explains Clojure and Scala. Erlang is there mostly because there is some legacy stuff that I am too lazy to port. C++ is there because one of my models tries to get to zero latency, which also explains Linux (I compile and handcraft all my kernels myself and use either low-latency or realtime kernels; plus I often have to build the InfiniBand drivers myself, and 100% of them are in C++). To say nothing of the fact that my FIX engine is in C++. There is no other language in existence that can do this for me, save assembler. Python because I try to write all of my models so that they don't know whether they are running in sim mode or in realtime, and I write and research 90% of my models in Python and Matlab. For the slow stuff, Python works fine in realtime. There are several ways for these language objects to communicate. In the past I used ZeroMQ for messaging. Today I am thinking more about microservices in Docker (which in turn uses ZMQ underneath) that are totally language- and OS-agnostic. This works fine for anything that is not arbitrage or HFT.
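The "models don't know if they are in sim mode or realtime" idea boils down to feeding the model through one interface that both a replay source and a live source implement. A minimal sketch of one way to do that in Python (every class and method name here is hypothetical, not the poster's actual code):

```python
# Sketch: a model consumes ticks through a single feed interface, so a
# recorded-data replay and a live socket feed are interchangeable.
from dataclasses import dataclass
from typing import Iterator

@dataclass
class Tick:
    symbol: str
    bid: float
    ask: float

class SimFeed:
    """Replays recorded ticks. A hypothetical LiveFeed would expose the
    same ticks() generator, but pull from a realtime connection."""
    def __init__(self, recorded):
        self._recorded = recorded

    def ticks(self) -> Iterator[Tick]:
        yield from self._recorded

class MidpointModel:
    """Toy model: tracks the last bid/ask midpoint. It never learns
    whether its feed is simulated or live."""
    def __init__(self, feed):
        self.feed = feed
        self.mid = None

    def run(self):
        for t in self.feed.ticks():
            self.mid = (t.bid + t.ask) / 2.0

feed = SimFeed([Tick("GE", 30.00, 30.02), Tick("GE", 30.04, 30.06)])
model = MidpointModel(feed)
model.run()
print(model.mid)
```

The same separation is what makes the messaging layer (ZeroMQ, or services in Docker) easy to swap in later: only the feed implementation changes, never the model.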
That is all probably true. I am just saying this is how I do it. Every time I try to use some canned program, within ten minutes I get annoyed that I can't do a or b. There is always friction. So I build things the right way for me, from scratch. Over time, things get easier since I already have a code base. Sure, TT is fine. I just claim people should be saving every conceivable piece of data they can get their realtime hands on. Why limit yourself to futures?