Python - Read and split lines from text file into indexes.

eusdaiki · May 7, 2015

volpunter said:
And here is the code in C#:

var contents = File.ReadAllText(filename).Split('\n');
var csv = from line in contents select line.Split(',').ToArray();

(you can actually move that all into one line but it looks ugly)

P.S.: This version is most likely a lot faster than Python's pandas. I know the OP asked for a Python solution, but I could not help jumping in after seeing 3 full pages of discussion on how to read in a text file.

Yeah, the C# code looks pretty clean.

Python doesn't have a decent method for handling CSV files "out of the box", probably because most people end up using libraries to get the job done... I've noticed that the Python community doesn't have such a strong drive to integrate functionality from the libraries into the core language as you find in other communities (like C++, which keeps absorbing parts of Boost with every revision).

You may be right about the speed, especially if the files are large and Python starts hogging memory... it may hold its own against C# on small files, but once it starts hogging memory it can slow down to a crawl.
I had a project a few months ago where I had to parse files with 10-15 million lines of untidy quotes data, and it got to the point where Python was taking far longer than my patience allowed to get through 1 file (I killed it at around 30 minutes) while sucking up all my memory and a big chunk of swap too...
I moved the project to C, and by the time it was done it was going through the files in about 1 minute using 2-3% memory per core.
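
For comparison, memory use on large files depends a lot on whether the whole file is read into a list or processed one row at a time. The following is only a minimal Python 3 sketch of the row-at-a-time approach using the standard library's csv module; the function name, the idea of tallying one column, and the file layout are assumptions for illustration, not the project described above.

```python
import csv

def count_rows_by_field(path, field_index):
    """Stream a CSV file row by row and tally the values found in one column.

    Hypothetical helper: only one row is held in memory at a time,
    so memory use does not grow with the number of lines in the file.
    """
    counts = {}
    # newline="" is what the csv module documentation recommends for csv.reader
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) <= field_index:
                continue  # skip blank or short rows instead of raising IndexError
            key = row[field_index]
            counts[key] = counts.get(key, 0) + 1
    return counts
```

Whether this would have been fast enough for 10-15 million lines is not something the sketch can show; it only keeps memory flat.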

volpunter · May 7, 2015

I see your points and agree with all of them.

I want to see the advantages of Python but just simply do not. I know C++, C#, R, ... and I would not know at all what to do with Python.

globalarbtrader · May 7, 2015

I don't understand why everyone is so down on libraries, and having to load them to do anything.

There's no free lunch. You can either:

a) Have a bloated core which does everything, but you then have to load XXX MB of crud just to write hello world
b) Have a lightweight core and then have to write hundreds of lines of your own code to reinvent the wheel a few million times.
c) Have a lightweight core and then load libraries as you need them. Libraries which are tested and optimised by people who know what they're doing. And which are set up so you can write nice readable code when you call them.

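Also, if you're only using certain features of the library you don't need to pull the whole thing into your namespace. I can just do this:
Code:
from pandas import read_csv
Which binds just that one function name. (Python still imports the pandas package and its dependencies behind the scenes, so this keeps the code tidy rather than saving load time.)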

volpunter · May 7, 2015

I consider reading text-based files absolutely core. Any language should be able to read a delimited text file out of the box, without the need for a library. The problem with Python is that it keeps developers stupid. It's basically like R. Most R users would not have a clue how to interpret a simple statistical output, yet one can apply complicated algorithms with a few lines of code. I mean, is this thread not the perfect example? Or look at the code you wrote in your blog. It's terrible code, badly designed and thought out from the start, with apparently little to no thought process behind it. Such an approach almost always leads to disaster down the road, either via bugs or extremely inefficient code. Python is already very slow to start with. How one even considers building a systematic trading architecture based on Python is beyond me. But if that is what floats your boat then all the more power to you. I generally tend to choose the right tools for a task. Python apparently leaves a lot to be desired if it cannot even properly handle importing a large text file.

2rosy · May 7, 2015

volpunter said:
var contents = File.ReadAllText(filename).Split('\n');
var csv = from line in contents
          select line.Split(',').ToArray();

You can do that out of the box with Python. Almost the exact same syntax too. But all the other stuff the OP wanted can be done in fewer than 5 lines with the pandas lib.
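
A minimal sketch of what that out-of-the-box equivalent could look like in Python 3, with no imports at all. The function name read_rows is made up for illustration; the two statements inside mirror the two C# lines quoted above.

```python
def read_rows(filename):
    """Return the file as a list of rows, each row a list of comma-separated fields."""
    with open(filename) as handle:
        # Same idea as File.ReadAllText(filename).Split('\n')
        contents = handle.read().split("\n")
    # Same idea as: from line in contents select line.Split(',')
    return [line.split(",") for line in contents]
```

Like the C# version, this splits naively on commas (no handling of quoted fields), and a trailing newline at the end of the file produces one extra row holding a single empty string.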

volpunter · May 7, 2015

Can I remind you that not one single user in this 4-page thread has solved the OP's problem? Everyone tried to upsell him. It is hilarious. Here is the task again, I quote:

"I have a text file with hundreds of lines and 10 columns of data separated by commas. I want to split the lines at the commas into 10 indexes and access each index individually. The code below only works on the first index - items[0] - and will print the first column and all the rows. If I change it to items[1] it will crash."

My point has never been terse code (you should go with Q if you like terse code). My point has been that Python does not solve any new problems. It is slower than every compiled language and hardly offers anything that makes my life easier.
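
For reference, one possible reading of the task quoted above, sketched in plain Python 3 without nested-list gymnastics for the caller or any data frames: each column ends up as its own list, reachable by index. The OP's original code is not shown here, so the cause of the crash is a guess; a common one is a blank or short line, where items[1] raises IndexError. The name read_columns and the expected_fields parameter are hypothetical.

```python
def read_columns(path, expected_fields=10):
    """Split each line at the commas and return one list per column."""
    rows = []
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # blank lines would otherwise give a one-item list
            items = line.split(",")
            if len(items) != expected_fields:
                raise ValueError(
                    "line %d has %d fields, expected %d"
                    % (line_number, len(items), expected_fields)
                )
            rows.append(items)
    # zip(*rows) transposes rows into columns: columns[0] is the first column, etc.
    return [list(column) for column in zip(*rows)]
```

With this, columns = read_columns(filename) gives columns[0] through columns[9], and a malformed line is reported by number instead of crashing on an index.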

i960 · May 7, 2015

volpunter, funny that you're sitting there telling "us" (meaning the people that told the guy to stop writing such ridiculous square-wheel code) that none of us helped him and only tried to upsell him and instead you're here turning this into a Python sucks, C# rules thread. How did you not upsell him yourself?

It's pretty simple, I said on like page 1, don't write your own CSV parser, use a library and deal with the fields as native arrays. At that point, I'd consider the issue solved and done. Do you want me to hold his dick for him too?

volpunter · May 7, 2015

I did not say that nobody helped him, and yes, his code looks horrible. I said nobody solved his problem. Can you provide a solution that exactly solves the OP's problem? No nested lists, no data frames, just a solution to the question as asked? I am not saying his stated goal is an optimal way to represent imported text-based data, but nonetheless, was it not you who said to first solve the problem, then optimize it? I have honestly not seen a single solution that solves the problem so far. And no, you have not provided a solution that parses the text file into columnar indexes as requested. We can argue about semantics, but as a matter of fact nobody has provided such a solution so far.

I did not say C# rules, but C# solves the problem much more elegantly. If you are keeping up to date with the latest language developments, you may have come across the news that Microsoft is working on a new language that may be based on C#, which will rival the speed of C++ and which will handle deterministic destruction rather than being caught up with GC latencies. Even die-hard C++ language developers concur that this would spell the death of C++, as systems programming will fully embrace such a new language. Let's not forget that C# features modern programming constructs such as delegates and lambdas in much better ways than, for example, C++ or other languages. Picking up on such constructs when building a new language will place it at the top of the stack from day 1. In that respect I feel fortunate to be well versed in C# programming, as it should be a breeze to pick up the new language when it is first released. Oh, did I forget to mention that such a new language will most likely be open sourced from the beginning?

Here is the link to the blog of Joe Duffy, head of compiler and language development at MS: http://joeduffyblog.com/2013/12/27/csharp-for-systems-programming/

OTM-Options · May 8, 2015

I have updated the Python script with some changes.

Reading from the input file in reverse has now been integrated into the main body of the script.
Some of the variables have been renamed to clarify their function.
The footer output has been rearranged.
I have added a timestamp function which is called at the start and finish of the script to time how long it takes to execute.
Some of the lines in the script have been shortened.
The script is now 123 lines long.

This is a screenshot of the old CSV file (top) and new CSV file (bottom) loaded into Gnumeric showing the type of changes I wanted to make. The new CSV file has a much better left to right reading flow compared to the old CSV file. I truncated the bottom image to show both the header and footer. The only manipulation in the spreadsheet is adding bold text and currency formatting.

EDIT: Gnumeric automatically changed the date format in the old CSV file from 2015-04-27 to 2015-April-27. I didn't notice that until after I made the screenshot.

100% Completed Python Script

Code:

#!/usr/bin/python
# Python version 2.7.6

import datetime
import time

def timer(label):
  ts = time.time()
  st = datetime.datetime.fromtimestamp(ts).strftime('%H:%M:%S')
  timer_out = open("timer.txt",'a')
  timer_out.write(label + ": " + st + "\n")
  timer_out.close()

timer("Start")

csv_in = "TransactionHistory_22523594.csv"
csv_out = "transaction_history.csv"
header = "Row,Date,Buy/Sell,QTY,Security,Price,Debit,"\
  "Credit,Commission,Total Amount,Currency" + '\n'

footer_commission = 0
footer_debit = 0
footer_credit = 0
footer_total_amount = 0
counter = 1
split = ","
join = ","

def write_file(row,write_to):
  f_out = open(write_to,'a')
  f_out.write(row + "\n")
  f_out.close()

write_file(header,csv_out)

for line in reversed(list(open(csv_in))):  # process the input file last line first
  if len(line.strip()) != 0 :
    line = line.strip()
    column = line.split(split)
    if column[2] == "Buy" or column[2] == "Sell" or column[2] == "Expired":
      row_counter = '{0:03d}'.format(counter)
      transaction_date = column[0]
      buy_sell = column[2]
      qty = column[5]
      security = column[4]
      price = column[6]
      if price == "":
        price = "0"
      total_amount = column[8]
      currency = column[9]
      transaction_date = (datetime.datetime.strptime\
        (transaction_date, "%Y-%m-%d").strftime("%a %b %d"))
      qty = abs(int(qty))
      price = float(price)
      total_amount = float(total_amount)
      amount = qty * price * 100
      amount = int(amount)

      abs_total_amount = abs(float(total_amount))
      if column[8] >= "0" and column[8] <= "1":  # string comparison on the raw field
        commission = 0
      else:
        commission = abs(abs_total_amount - amount)
      if total_amount <= 1:
        debit = amount
        credit = 0
      else:
        debit = 0
        credit = amount
      if debit > abs_total_amount:
        credit = debit
        commission = debit
        debit = 0

      footer_debit = (footer_debit + debit)
      footer_credit = (footer_credit + credit)
      footer_commission = (footer_commission + commission)
      footer_total_amount = (footer_total_amount + total_amount)

      qty = str(abs(qty))
      price = str(price)
      debit = str(debit)

      if debit == "0":
        debit = ""
      credit = str(credit)
      if credit == "0":
        credit = ""
      commission = str(commission)
      if commission == "0":
        commission = ""
      total_amount = str(total_amount)
      counter = counter + 1

      row = (row_counter + join + transaction_date + join + buy_sell + join \
        + qty + join + security + join + price + join + debit + join + credit + join \
        + commission + join + total_amount + join + currency)

      print row
      write_file(row,csv_out)

pl = (footer_debit + footer_commission)
pl_percent = ((footer_credit - (pl)) / pl * 100)
pl_debit = str(pl)
footer_debit = str(footer_debit)
footer_credit = str(footer_credit)
footer_commission = str(footer_commission)
footer_total_amount = str(footer_total_amount)
pl_percent = str(pl_percent)

join2x = (join + join)
join5x = (join + join + join + join + join)
footer = (join5x + "Subtotal" + join + footer_debit + join + footer_credit + join \
  + footer_commission + join + footer_total_amount + join + currency + "\n \n" \
  + join5x + join2x + "Total Debit" + join2x + pl_debit + join + currency + "\n" \
  + join5x + join2x + "Total Credit" + join2x + footer_credit + join + currency + "\n" \
  + join5x + join2x + "P/L" + join + pl_percent + " %" + join + footer_total_amount \
  + join + currency)

write_file(footer,csv_out)
timer("Finish")

raw_input("\n\nPress the Enter key to exit.")  # raw_input: Python 2 input() fails on a bare Enter

OTM-Options · May 8, 2015

The 1,000,000 Line Test

I created a simple Python script that would loop 20,000 times through the 59 line CSV file and output a 1,180,000 line file to test the efficiency of the CSV to CSV script.
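
The generator script itself is not posted, so the following is only a guess at its shape, written as a minimal Python 3 sketch: read the small file once, then write its lines out repeatedly. The function name and parameters are hypothetical.

```python
def repeat_file(source_path, target_path, repetitions=20000):
    """Write the lines of source_path to target_path `repetitions` times.

    Returns the number of lines written (59 lines x 20,000 passes = 1,180,000).
    """
    with open(source_path) as source:
        lines = source.readlines()
    # Make sure the last line ends with a newline so repeated copies do not merge
    lines = [line if line.endswith("\n") else line + "\n" for line in lines]
    with open(target_path, "w") as target:
        for _ in range(repetitions):
            target.writelines(lines)
    return len(lines) * repetitions
```

Because the same 59 lines repeat, a file built this way tests throughput but not the variety of real data.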

Python CSV to CSV Script

INPUT: 1,180,000 lines of raw data from a 105 MB CSV file.

OUTPUT: 1,000,000 lines of manipulated data to a 75 MB CSV file.

OUTPUT: Print to Linux terminal to show progress.

TIME: 2 minutes 53 seconds.

Computer Specs

Acer Aspire AM5641-E5651A desktop computer

PCLinuxOS 2014.12 with Mate desktop.

Intel(R) Core(TM)2 Duo CPU E7200 @ 2.53GHz

3 GB DDR2 Memory

640 GB SATA Hard disk

The output CSV file was then loaded into Gnumeric spreadsheet for some bold text and currency formatting. No math functions or entries were made in the spreadsheet.

Screenshot of the truncated 1,000,000 line CSV file in Gnumeric.