Python - Read and split lines from text file into indexes.

Discussion in 'App Development' started by OTM-Options, Apr 28, 2015.

  1. 2rosy

    My first post reads in a CSV and puts it in a list of lists. I didn't entirely read his post. To get columns:

    import pandas as pd

    df = pd.read_csv(file, header=None)  # headerless CSV: columns are named 0..n-1
    df[0]  # first column
    df[1]  # second column

    savoir faire is everywhere
     
    #61     May 12, 2015
  2. Fair, so how long does it take you to process a file with 10 million rows and 10 columns? I am curious. Also, do you know whether, if memory serves me well, Pandas hogs a huge chunk of memory because it reads the whole file in at once, or does it process it line by line?

     
    #62     May 12, 2015
  3. Quiet1

    I think C# is great and am quite excited that MS seems to be backing the Mono guys.
    I also think Python is great and widely misunderstood, but never mind.

    The code below reads 1M rows x 10 columns in about 1.7s on my PC, fully parsed into a pandas DataFrame ready for anything pandas can do.

    You can also read in chunks using the read_csv function, but I have not done so here (a sketch follows the code below). I have not looked at parallelism for this.

    So ~1.7s on a 32GB machine with an SSD, i7-3770 @ 3.5GHz, 64-bit Ubuntu 14.04.


    import numpy as np
    import pandas as pd
    import datetime as dt


    def main():
        # create array of 1mx10 random numbers in a pandas dataframe
        df = pd.DataFrame(np.random.rand(1000000, 10))

        df.to_csv('/tmp/1mill.csv')
        snap = dt.datetime.now()
        df2 = pd.read_csv('/tmp/1mill.csv', index_col=0)
        print("Time taken: {0}".format(dt.datetime.now() - snap))
        print(df2.head(5))


    if __name__ == '__main__':
        main()
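
    For reference, a minimal sketch of that chunked variant mentioned above, assuming the same file as in the timing test (the chunk size and per-chunk work are illustrative):

    import pandas as pd

    # only `chunksize` rows are held in memory at a time
    total = 0
    for chunk in pd.read_csv('/tmp/1mill.csv', index_col=0, chunksize=100000):
        total += len(chunk)  # replace with real per-chunk work
    print("Rows processed: {0}".format(total))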
     
    #63     May 13, 2015
    volpunter likes this.
  4. Quiet1

    10 million lines is roughly 10 times slower, so ~18s.
     
    #64     May 13, 2015
  5. * You are not reading from a plain CSV file; you prepare a pandas-optimized file with a header and row enumeration first, which is not what the OP asked for and started with. He stated clearly that he deals with a CSV file that contains x rows and 10 columns, with no header and no line enumeration:

    0,1,2,3,4,5,6,7,8,9 \n
    ...
    ..
    .

    * Not sure how you achieved that performance, but I have a machine with almost identical specs and it runs in 3.28 seconds.

    * As suspected, Pandas reads the whole file in and consumes insane amounts of memory (a quick check is sketched below). Not much of an issue with plenty of memory, but still...
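
    A rough way one might verify the memory claim, as a sketch (assumes Linux, where ru_maxrss is reported in kilobytes, and the test file from the earlier post):

    import resource

    import pandas as pd

    df = pd.read_csv('/tmp/1mill.csv', index_col=0)
    # in-process size of the parsed DataFrame
    print("DataFrame bytes: {0}".format(df.memory_usage(deep=True).sum()))
    # peak resident memory of the whole process
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print("Peak RSS: {0:.0f} MB".format(peak_kb / 1024.0))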



     
    Last edited: May 13, 2015
    #65     May 13, 2015
  6. Quiet1

    Yes, I'm sure using the data he posted will make a bit of a difference. If I get a chance later, I'll mock up a better test.

    Interesting difference between our machines. I am running on a brand-new Samsung 850 Pro SSD, though, so that *might* make a difference (Linux doesn't support the Rapid read functionality that model has). The same machine has a Windows partition with a much older SSD, so it might be worth testing on that - will revert on it.

    In the end it's just worth remembering that Python is widely used for humongous data-analysis projects, and in academia in general (never mind on web servers, or that it's what Dropbox runs on [I think; at least Guido works at Dropbox, so you'd think they use Python a lot]). So there is plenty of support for speed-hungry use cases, either via libraries (the numba JIT compiler), variants (Cython), replacement runtimes (PyPy), compilers (Nuitka), or just libraries like pandas that use C extensions; a small taste of the numba option is sketched below.

    So it might not be super-quick by default, but it CAN be no slouch.
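
    For illustration (not from the original post), a minimal numba sketch; the function name and workload are made up:

    from numba import jit
    import numpy as np

    @jit(nopython=True)
    def col_sums(a):
        # explicit Python loops compile to machine code under numba's nopython mode
        out = np.zeros(a.shape[1])
        for i in range(a.shape[0]):
            for j in range(a.shape[1]):
                out[j] += a[i, j]
        return out

    print(col_sums(np.random.rand(1000000, 10)))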
     
    #66     May 13, 2015
    volpunter likes this.
  7. Interesting; in any case, the Python reads are a lot faster than I thought. Respect.

    * It would be interesting if you could run the same import from an unstructured, plain CSV file and see how it performs then. I am pretty sure that the performance differences are to be found there.
    * Also whether reading the file from one SSD vs. another makes a difference.
    * It seems that Python is the language of choice for a service like Dropbox (I found that out only after you pointed to it), but keep in mind that the language does not lend itself to long-running processes and is very memory-inefficient. Dropbox requires neither. An algorithmic trading framework critically hinges on both; just to give additional arguments why Python often may not be the right choice.

    In any case, respect: someone who provides facts and numbers, not just talk.


     
    Last edited: May 13, 2015
    #67     May 13, 2015
    Quiet1 likes this.
  8. Quiet1

    The Dropbox client is Python btw, afaik, so while it's not algo-trading complexity, it still has to run for a while, behave, and not fall over (in millions of environments, too).

    There are a lot of other big players using it btw, not just Dropbox. People say the popularity of Python since the mid-2000s is due, for example, to Google using it heavily themselves (Guido worked there for some years).
     
    #68     May 13, 2015
  9. That is not accurate: the Python portion runs only when files are synchronized and/or web requests are serviced, both of which are short-running. A request is issued and received by the server and a response is generated; similarly, the syncing is a short process that needs to be kicked off. Also, the client-side app is not based on Python.

    I am not against Python in principle; I just take issue with some on this site who use Python to do stuff the language was simply not designed for, such as algorithmic trading architectures, who write horrifying code, who think Linux and/or Python is the all-round weapon of choice, but who then go all nuclear when criticized. Maybe I am a stickler for details in certain areas: for example, if you lost your trading desk several hundred thousand or a million in a single day because you got some "meaningless" digits wrong, then you deserve to be fired. In the same way, if someone cannot even read client (or, for that matter, OP) requirements, then hmm... what else to say...

    Who is Guido btw?

     
    Last edited: May 13, 2015
    #69     May 13, 2015
  10. Quiet1

    I revised the test to use the sample data the OP posted. It's now actually faster.

    The code is in the attached text file. It generates the test data, saves it to a file (all outside pandas), reads it into a pandas DataFrame, then strips out the row below the headers. It's a bit faster still if you just supply the headers and tell pandas to ignore the first 3 rows.
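
    The attached code isn't reproduced in the thread; as a minimal sketch in its spirit, using the plain headerless format described earlier (the path and sizes are illustrative, and the actual attachment may differ):

    import datetime as dt
    import random

    import pandas as pd

    PATH = '/tmp/plain.csv'  # illustrative path
    ROWS, COLS = 1000000, 10

    # generate a plain, headerless CSV outside pandas
    with open(PATH, 'w') as f:
        for _ in range(ROWS):
            f.write(','.join(str(random.random()) for _ in range(COLS)) + '\n')

    snap = dt.datetime.now()
    df = pd.read_csv(PATH, header=None)  # no header row, no index column
    print("Time taken: {0}".format(dt.datetime.now() - snap))
    print(df.head(5))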

    So, revised timings:
    1 million rows: ~1.15s (edit: originally said 1.27s, but that included saving the CSV out)
    10 million rows: ~10.75s (edit: originally said 14.5s, but that included saving the CSV out)

    Surprising result to me. Will try the Windows/old SSD at some point soon-ish.

    Q1
     
    Last edited: May 13, 2015
    #70     May 13, 2015
    volpunter likes this.