
I have been searching for a way to read a very large CSV file.
It's over 100 GB, and I need to know how to deal with processing it in chunks
and make the concatenation faster.

    %%time
    import time
    import pandas as pd

    filename = "../code/csv/file.csv"
    lines_number = sum(1 for line in open(filename))  # count lines by scanning the file once
    lines_in_chunk = 100  # I don't know what size is better
    counter = 0
    completed = 0
    reader = pd.read_csv(filename, chunksize=lines_in_chunk)

CPU times: user 36.3 s, sys: 30.3 s, total: 1min 6s
Wall time: 1min 7s

This part doesn't take long, but the problem is the concat:

    %%time
    df = pd.concat(reader, ignore_index=True)

This part takes too long and also uses too much memory.
Is there a way to make this concat process faster and more efficient?

slowmonk
  • I don't understand: why read the file in chunks if it's going to be concatenated back into a single piece of data? – Erwan Jul 25 '19 at 11:05
  • Koalas and Vaex are the way to go for huge data, unless you want to try Sparkling Water from H2O. – Syenix Dec 25 '19 at 23:51

1 Answer


The file is too big to handle in the standard way. You can process it chunk by chunk:

    for chunk in reader:
        chunk['col1'] = chunk['col1'] ** 2  # and so on, transform each chunk in place
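
If you only need a transformed or aggregated result rather than the full 100 GB in memory, keep just small per-chunk results and concatenate those. A minimal sketch, assuming hypothetical columns col1 and col2 and a sum aggregation (adapt to your own columns):

    import pandas as pd

    reader = pd.read_csv(filename, chunksize=1_000_000)

    partials = []
    for chunk in reader:
        chunk['col1'] = chunk['col1'] ** 2                    # per-chunk transformation
        partials.append(chunk.groupby('col2')['col1'].sum())  # small per-chunk aggregate

    # concatenating the small partial results is cheap; merge groups that span chunks
    result = pd.concat(partials).groupby(level=0).sum()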

Or dump your CSV file into a database.
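
For example, with pandas' to_sql and the standard-library sqlite3 module. This is only a sketch; the database file name file.db and the table name data are placeholders:

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("file.db")  # placeholder database file

    # append each chunk to a table instead of holding everything in memory
    for chunk in pd.read_csv(filename, chunksize=1_000_000):
        chunk.to_sql("data", conn, if_exists="append", index=False)

    # later, pull back only the rows you actually need with SQL
    sample = pd.read_sql_query("SELECT * FROM data LIMIT 10", conn)
    conn.close()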

To get the number of rows:

    num = 0
    reader = pd.read_csv(filename, chunksize=lines_in_chunk)  # re-create the iterator; it was consumed above
    for chunk in reader:
        num += 1
    num_of_rows = num * lines_in_chunk  # approximate: the last chunk may be shorter
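
Since the last chunk is usually shorter than lines_in_chunk, summing the chunk lengths gives the exact count in one pass (a sketch reusing filename and lines_in_chunk from the question):

    reader = pd.read_csv(filename, chunksize=lines_in_chunk)
    num_of_rows = sum(len(chunk) for chunk in reader)  # exact row count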

    # workaround: shell out to wc from Python
    import subprocess
    # note that wc -l counts every line, so this includes the header row
    num_of_rows = int(subprocess.check_output(["wc", "-l", "file.csv"]).split()[0])
fuwiak