Python - Read and split lines from text file into indexes.

Discussion in 'App Development' started by OTM-Options, Apr 28, 2015.

  1. 2rosy

    My first post reads in a CSV and puts it in a list of lists. I didn't entirely read his post. To get columns:

    import pandas as pd

    df = pd.read_csv(file, header=None)  # headerless CSV: columns are named 0..n-1
    df[0]  # first column
    df[1]  # second column

    savoir faire is everywhere
     
    #61     May 12, 2015
  2. Fair, so how long does it take you to process a file with 10 million rows and 10 columns? I am curious. Also, do you know whether, if memory serves me well, Pandas hogs a huge chunk of memory because it reads the whole file in at once, or does it process it line by line?

     
    #62     May 12, 2015
  3. Quiet1

    I think C# is great and am quite excited that MS seems to be backing the Mono guys.
    I also think Python is great and widely misunderstood, but never mind.

    The code below reads 1M rows x 10 columns in about 1.7s on my PC, fully parsed into a pandas DataFrame ready for anything pandas can do.

    You can also read in chunks using the read_csv function, but I have not done so here (a sketch follows the code below). I have not looked at parallelism for this.

    So ~1.7s on a 32GB machine with an SSD, i7-3770 @ 3.5GHz, 64-bit Ubuntu 14.04.


    import numpy as np
    import pandas as pd
    import datetime as dt


    def main():
        # create array of 1mx10 random numbers in a pandas dataframe
        df = pd.DataFrame(np.random.rand(1000000, 10))

        df.to_csv('/tmp/1mill.csv')
        snap = dt.datetime.now()
        df2 = pd.read_csv('/tmp/1mill.csv', index_col=0)
        print("Time taken: {0}".format(dt.datetime.now() - snap))
        print(df2.head(5))


    if __name__ == '__main__':
        main()
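
    For reference, a minimal sketch of that chunked variant mentioned above, assuming the same file as in the timing test (the chunk size and per-chunk work are illustrative):

    import pandas as pd

    # only `chunksize` rows are held in memory at a time
    total = 0
    for chunk in pd.read_csv('/tmp/1mill.csv', index_col=0, chunksize=100000):
        total += len(chunk)  # replace with real per-chunk work
    print("Rows processed: {0}".format(total))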
     
    #63     May 13, 2015
    volpunter likes this.
  4. Quiet1

    10 million lines is roughly 10 times slower, so ~18s.
     
    #64     May 13, 2015
  5. * You are not reading from a plain CSV file; you prepare a pandas-optimized file with a header and row enumeration first, which is not what the OP asked for and started with. He stated clearly that he deals with a CSV file that contains x rows and 10 columns, with no header and no line enumeration:

    0,1,2,3,4,5,6,7,8,9 \n
    ...
    ..
    .

    * Not sure how you achieved that performance, but I have a machine with almost identical specs and it runs in 3.28 seconds.

    * As suspected, Pandas reads the whole file in and consumes insane amounts of memory (a quick check is sketched below). Not much of an issue with plenty of memory, but still...
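
    A rough way one might verify the memory claim, as a sketch (assumes Linux, where ru_maxrss is reported in kilobytes, and the test file from the earlier post):

    import resource

    import pandas as pd

    df = pd.read_csv('/tmp/1mill.csv', index_col=0)
    # in-process size of the parsed DataFrame
    print("DataFrame bytes: {0}".format(df.memory_usage(deep=True).sum()))
    # peak resident memory of the whole process
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print("Peak RSS: {0:.0f} MB".format(peak_kb / 1024.0))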



     
    Last edited: May 13, 2015
    #65     May 13, 2015
  6. Quiet1

    Yes, I'm sure using the data he posted will make a bit of a difference. If I get a chance later, I'll mock up a better test.

    Interesting difference between our machines. I am running on a brand-new Samsung 850 Pro SSD, though, so that *might* make a difference (Linux doesn't support the Rapid read functionality that model has). The same machine has a Windows partition with a much older SSD, so it might be worth testing on that - will revert on it.

    In the end it's just worth remembering that Python is widely used for humongous data-analysis projects, and in academia in general (never mind on web servers, or that it's what Dropbox runs on [I think; at least Guido works at Dropbox, so you'd think they use Python a lot]). So there is plenty of support for speed-hungry use cases, either via libraries (the numba JIT compiler), variants (Cython), replacement runtimes (PyPy), compilers (Nuitka), or just libraries like pandas that use C extensions; a small taste of the numba option is sketched below.

    So it might not be super-quick by default, but it CAN be no slouch.
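
    For illustration (not from the original post), a minimal numba sketch; the function name and workload are made up:

    from numba import jit
    import numpy as np

    @jit(nopython=True)
    def col_sums(a):
        # explicit Python loops compile to machine code under numba's nopython mode
        out = np.zeros(a.shape[1])
        for i in range(a.shape[0]):
            for j in range(a.shape[1]):
                out[j] += a[i, j]
        return out

    print(col_sums(np.random.rand(1000000, 10)))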
     
    #66     May 13, 2015
    volpunter likes this.
  7. Interesting; in any case, the Python reads are a lot faster than I thought. Respect.

    * It would be interesting if you could run the same import from an unstructured, plain CSV file and see how it performs then. I am pretty sure that the performance differences are to be found there.
    * Also whether reading the file from one SSD vs. another makes a difference.
    * It seems that Python is the language of choice for a service like Dropbox (I found that out only after you pointed to it), but keep in mind that the language does not lend itself to long-running processes and is very memory-inefficient. Dropbox requires neither. An algorithmic trading framework critically hinges on both; just to give additional arguments why Python often may not be the right choice.

    In any case, respect: someone who provides facts and numbers, not just talk.


     
    Last edited: May 13, 2015
    #67     May 13, 2015
    Quiet1 likes this.
  8. Quiet1

    The Dropbox client is Python btw, afaik, so while it's not algo-trading complexity, it still has to run for a while, behave, and not fall over (in millions of environments, too).

    There are a lot of other big players using it btw, not just Dropbox. People say the popularity of Python since the mid-2000s is due, for example, to Google using it heavily themselves (Guido worked there for some years).
     
    #68     May 13, 2015
  9. That is not accurate: the Python portion runs only when files are synchronized and/or web requests are serviced, both of which are short-running. A request is issued and received by the server and a response is generated; similarly, the syncing is a short process that needs to be kicked off. Also, the client-side app is not based on Python.

    I am not against Python in principle; I just take issue with some on this site who use Python to do stuff the language was simply not designed for, such as algorithmic trading architectures, who write horrifying code, who think Linux and/or Python is the all-round weapon of choice, but who then go all nuclear when criticized. Maybe I am a stickler for details in certain areas: for example, if you lost your trading desk several hundred thousand or a million in a single day because you got some "meaningless" digits wrong, then you deserve to be fired. In the same way, if someone cannot even read client (or, for that matter, OP) requirements, then hmm... what else to say...

    Who is Guido btw?

     
    Last edited: May 13, 2015
    #69     May 13, 2015
  10. Quiet1

    I revised the test to use the sample data the OP posted. It's now actually faster.

    The code is in the attached text file. It generates the test data, saves it to a file (all outside pandas), reads it into a pandas DataFrame, then strips out the row below the headers. It's a bit faster still if you just supply the headers and tell pandas to ignore the first 3 rows.
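
    The attached code isn't reproduced in the thread; as a minimal sketch in its spirit, using the plain headerless format described earlier (the path and sizes are illustrative, and the actual attachment may differ):

    import datetime as dt
    import random

    import pandas as pd

    PATH = '/tmp/plain.csv'  # illustrative path
    ROWS, COLS = 1000000, 10

    # generate a plain, headerless CSV outside pandas
    with open(PATH, 'w') as f:
        for _ in range(ROWS):
            f.write(','.join(str(random.random()) for _ in range(COLS)) + '\n')

    snap = dt.datetime.now()
    df = pd.read_csv(PATH, header=None)  # no header row, no index column
    print("Time taken: {0}".format(dt.datetime.now() - snap))
    print(df.head(5))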

    So, revised timings:
    1 million rows: ~1.15s (edit: originally said 1.27s, but that included saving the CSV out)
    10 million rows: ~10.75s (edit: originally said 14.5s, but that included saving the CSV out)

    Surprising result to me. Will try the Windows/old SSD at some point soon-ish.

    Q1
     
    Last edited: May 13, 2015
    #70     May 13, 2015
    volpunter likes this.