Shared Memory with Python

Discussion in 'Programming' started by clearinghouse, Dec 13, 2011.

  1. So currently I have these massive csv/txt files with lots of tick data. Every time I run a Python script, it has to load and parse these massive files, which takes a few minutes just to bring up all the data.

    I would like to do something like load all of these python objects into shared memory and then when I load an analysis script, just grab all of the data from shared memory and process the market data accordingly, so I can tweak/change the script and try over and over without taking the loading-time hit again.

    All I found was this, but this thing is from 2003:

    Should I just bite the bullet and rewrite all of this stuff in C++, or is there a better way to set up a sort of memory-block "tick" server?
  2. Mr_You


    Sorry I don't have an answer to your shared memory question, but I'm curious if you have thought about putting the data into an SQL database (SQLite or PostgreSQL)?

    Also, JDB just came to mind, though it unfortunately may not be easily interfaced with Python.
  3. Yeah, I thought about it. This sort of stuff comes to me in dreams at night, but when it comes down to it, for a one-man trading operation, it takes a lot of time to do this properly. I'm working 16-17 hour days as it is, with delusions of trading grandeur. ;-)

    As much as I love complexity, I have a "keep it simple, stupid" thing on my wall so I don't burn time doing the wrong things.
  4. rwk


    I don't use Python, so I cannot speak about that specifically. But most modern operating systems cache data, so if you access the data a second time, it's already in memory. I can confirm that it works that way in Windows. That assumes that you have LOTS of real memory.

    One thing you can do that will speed up loading the data is convert it to fixed-length records with the numeric fields converted to integers. Using Pascal, I can read such a file with a single command, then iterate through the data using a "for" loop. I/O takes less than a second per day of tick data. I think SQL is gross overkill for time-series data.
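    The poster reads fixed-length records in Pascal; in Python, numpy structured arrays give the same one-call binary read. A minimal sketch, with a made-up record layout (timestamp plus integer-scaled price and size; all field names are hypothetical):

```python
import numpy as np

# Hypothetical fixed-length record layout, all-integer as suggested above.
tick_dtype = np.dtype([
    ("ts_ms", "<i8"),   # timestamp, milliseconds since epoch
    ("price", "<i4"),   # price scaled to integer ticks
    ("size",  "<i4"),   # trade size
])

def save_ticks(path, ticks):
    # One call writes the whole day as flat binary records.
    np.asarray(ticks, dtype=tick_dtype).tofile(path)

def load_ticks(path):
    # One call reads it back; no per-line text parsing at all.
    return np.fromfile(path, dtype=tick_dtype)
```

    Reading a file like this is essentially a bulk memory copy, which is why it is so much faster than parsing CSV line by line.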
  5. byteme


    Look at Python's reload function:

    In this case you will want to reload just the module containing your "analysis script". As long as there are references to the tick data objects, it should stay in memory.
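    A runnable sketch of that workflow: in Python 2 (current when this thread was written) `reload` was a builtin; in Python 3 it lives in `importlib`. The module name here is made up, and a throwaway module file stands in for your analysis script:

```python
import importlib
import pathlib
import sys
import tempfile

sys.dont_write_bytecode = True   # always recompile from source on reload

# Create a stand-in "analysis script" on disk (hypothetical module name).
workdir = tempfile.mkdtemp()
sys.path.insert(0, workdir)
mod_path = pathlib.Path(workdir) / "analysis_demo.py"
mod_path.write_text("def run(ticks):\n    return min(ticks)\n")

import analysis_demo

ticks = [5, 3, 9]                  # stands in for expensively loaded tick data
first = analysis_demo.run(ticks)   # runs the first version of the script

# Simulate editing the script between runs, then reload it.
# The tick data stays in memory; only the module is re-executed.
mod_path.write_text("def run(ticks):\n    return max(ticks)\n")
importlib.reload(analysis_demo)
second = analysis_demo.run(ticks)
</imports>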

    No, you don't need to rewrite in C++; I can't see how you came to that conclusion. If you architected your application in C++ the same way you architected it in Python, you would have the same problem, i.e. the language is not the problem you need to solve.
  6. byteme


    FYI: As an aside, I suspect PyTables might be a good fit for your use case, though I can only guess what kind of analysis you are trying to perform.
  7. burn8


    Put it into MongoDB

  8. rosy2


    You can use mmap. But I agree with using hdf5 (libraries pytables and pandas) or using mongodb, redis, etc. in some kind of clustered way. You can also look into celery, which allows distributed tasks to be easily kicked off.
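    A minimal sketch of the mmap suggestion: memory-map a binary tick file and unpack a single record without reading the whole file. The OS pages in only what is touched, and multiple processes mapping the same file share one copy of it in memory. The record layout and file name are invented for the example:

```python
import mmap
import struct

# Hypothetical fixed-width record: int64 timestamp + int32 scaled price.
RECORD = struct.Struct("<qi")            # 12 bytes per record

def write_ticks(path, ticks):
    with open(path, "wb") as f:
        for ts, price in ticks:
            f.write(RECORD.pack(ts, price))

def read_tick(path, index):
    # The kernel pages in only the touched region; several processes
    # mapping the same file share the pages instead of each parsing it.
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return RECORD.unpack_from(mm, index * RECORD.size)
        finally:
            mm.close()
```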
  9. To explain that C++ comment:

    Well, in C++, if I put the data in a flat binary layout I could just cast a pointer to shared memory and re-use it. But in Python, I really have no idea how the interpreter works or how I could use data in shared memory.
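    For what it's worth, modern Python (3.8+, well after this thread) did grow a direct analogue of "cast a pointer to shared memory": `multiprocessing.shared_memory` can back a NumPy array with a named shared-memory block, so a second process attaches by name instead of re-parsing files. A sketch, with the block name made up:

```python
import numpy as np
from multiprocessing import shared_memory

def publish(name, data):
    # "Server" side: copy the tick array into a named shared block once.
    data = np.ascontiguousarray(data)
    shm = shared_memory.SharedMemory(create=True, size=data.nbytes, name=name)
    np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)[:] = data
    return shm   # keep this alive for as long as the block should exist

def attach(name, shape, dtype):
    # "Client" side: map the same block by name; no copy, no parsing.
    shm = shared_memory.SharedMemory(name=name)
    return shm, np.ndarray(shape, dtype=dtype, buffer=shm.buf)
```

    The client still has to know the shape and dtype out of band, much like the C++ cast has to know the struct layout.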

    But otherwise, yes, this reload feature looks interesting. Thank you.
  10. IMHO you are going the wrong way about this, and it's difficult to evaluate the issue without some performance numbers. Maybe you are hitting the limits of your hardware regardless of the software; we don't know. What is the actual file size, the record size, and how much memory do you have?

    Again, IMHO, you are trying to deal with this in a way which is not only more difficult but produces an inherently more unstable solution. While you certainly can load something into memory and get a handle on the memory address (in C++ and in Python as well), once your code terminates you have to jump through hoops to provide the pointer back to your module when it gets reloaded. There are easier ways to improve your speed.

    - If you insist on your solution, Python multiprocessing module handles such stuff and you have working examples of how to spawn the processes and communicate between them using queues or shared memory.
    - That would be only part of it, because you want one piece of code resident in memory while you keep changing and reloading another module (if I understand it correctly). For that you could use reload (as already suggested): the main script idles after running your logic, you edit your changes, then you let the main script go, and it reloads your module and runs the logic again, still with the problem of passing the pointer to the loaded data into the reloaded module... man, that's as weird as it gets.
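    The multiprocessing route from the first bullet can be sketched in a few lines: one process holds the loaded ticks and hands them to a consumer over a queue. This uses the POSIX "fork" start method to keep the example short, and the data is a stand-in:

```python
import multiprocessing as mp

def _serve(q, ticks):
    # A real tick server would loop here, answering requests;
    # this just hands the loaded data to one consumer.
    q.put(ticks)

def demo():
    ctx = mp.get_context("fork")        # POSIX-only start method
    q = ctx.Queue()
    ticks = [(1, 100.5), (2, 100.75)]   # stand-in for the loaded data
    p = ctx.Process(target=_serve, args=(q, ticks))
    p.start()
    received = q.get()                  # consumer gets the data, no re-parse
    p.join()
    return received
```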

    I would suggest you investigate some alternatives and probably re-architect some of your processing.
    - Most important (as already mentioned): if you have enough memory, everything gets cached, so you really have to have a very bad piece of code to see so much lag. I constantly have 1-2GB of data in processing and do not see much of a problem.
    - One rather simple way is to use SQLite (part of the standard library) and load the database into memory (see the docs). But again, if you have enough memory, your data will be cached regardless of how you read it.
    - I don't get why you have to reload it so often. If you are still developing and testing your logic, it's unnecessary to load everything; use just a sample until you have it right. If you are running back-testing for many cases and different parameters, then structure your script so you first load the data and then loop through the cases, varying the parameters. Again, I might be missing something here, but you did not elaborate much on the logic.
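    The in-memory SQLite idea above is only a few lines with the standard library. Table and column names are invented for the example:

```python
import sqlite3

def build_db(rows):
    # ":memory:" keeps the whole database in RAM, no file I/O at query time.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE ticks (ts INTEGER, price REAL, size INTEGER)")
    con.executemany("INSERT INTO ticks VALUES (?, ?, ?)", rows)
    return con

con = build_db([(1, 100.5, 10), (2, 100.75, 5), (3, 100.25, 7)])
top = con.execute("SELECT MAX(price) FROM ticks").fetchone()[0]
```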

    I don't know if you have (maybe) some inefficient code in your load and parse routine, but here is some fancy stuff that usually beats the crap out of the conventional loops:
    - Split your files by columns (learn to vectorize) and save them as arrays. You can use Python's array module, or you can use numpy, which has built-in functions to load/save stuff to files. That's probably the fastest loading and smallest space requirement you can get, because it's handled in binary format.
    - You could do a similar thing by packing it manually using the struct module, but the above gives you more compact code.
    - I do not use them, but, as already mentioned, hdf5/pytables is good stuff, particularly if you are dealing with data much larger than your physical memory.
    - I would shy away from any DBs other than in-memory SQLite, particularly object DBs, but that's just me.
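    The column-wise layout from the first bullet looks like this with numpy's built-in binary save/load; file names are invented:

```python
import numpy as np
import os
import tempfile

outdir = tempfile.mkdtemp()   # stand-in for wherever you keep your data

# Each field becomes its own array, i.e. the data is "split by columns".
ts = np.array([1, 2, 3], dtype=np.int64)
price = np.array([100.5, 100.75, 100.25])

np.save(os.path.join(outdir, "ts.npy"), ts)        # compact binary write
np.save(os.path.join(outdir, "price.npy"), price)

ts_back = np.load(os.path.join(outdir, "ts.npy"))        # bulk binary read,
price_back = np.load(os.path.join(outdir, "price.npy"))  # no text parsing
```

    Operating on whole columns like this also sets you up for vectorized analysis, which is where numpy beats Python loops.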

    Hope this helps...
    #10     Dec 14, 2011