
I have been searching for a way to read a very large CSV file.
It's over 100 GB, and I need to know how to deal with processing it in chunks
and make the concatenation faster.

    %%time
    import time
    import pandas as pd

    filename = "../code/csv/file.csv"
    lines_number = sum(1 for line in open(filename))  # count lines by scanning the file once
    lines_in_chunk = 100  # I don't know what size is better
    counter = 0
    completed = 0
    reader = pd.read_csv(filename, chunksize=lines_in_chunk)

CPU times: user 36.3 s, sys: 30.3 s, total: 1min 6s
Wall time: 1min 7s

This part doesn't take long, but the problem is the concat:

    %%time
    df = pd.concat(reader, ignore_index=True)

This part takes too long and also uses too much memory.
Is there a way to make this concat process faster and more efficient?

slowmonk
  • I don't understand: why read the file in chunks if it's going to be concatenated back into a single piece of data? – Erwan Jul 25 '19 at 11:05
  • Koalas and Vaex are the way to go for huge data, unless you want to try Sparkling Water from H2O. – Syenix Dec 25 '19 at 23:51

1 Answer


The file is too big to handle in the standard way. You can process it chunk by chunk:

    for chunk in reader:
        chunk['col1'] = chunk['col1'] ** 2  # and so on, transform each chunk in place
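
If you only need a transformed or aggregated result rather than the full 100 GB in memory, keep just small per-chunk results and concatenate those. A minimal sketch, assuming hypothetical columns col1 and col2 and a sum aggregation (adapt to your own columns):

    import pandas as pd

    reader = pd.read_csv(filename, chunksize=1_000_000)

    partials = []
    for chunk in reader:
        chunk['col1'] = chunk['col1'] ** 2                    # per-chunk transformation
        partials.append(chunk.groupby('col2')['col1'].sum())  # small per-chunk aggregate

    # concatenating the small partial results is cheap; merge groups that span chunks
    result = pd.concat(partials).groupby(level=0).sum()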

Or dump your CSV file into a database.
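
For example, with pandas' to_sql and the standard-library sqlite3 module. This is only a sketch; the database file name file.db and the table name data are placeholders:

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("file.db")  # placeholder database file

    # append each chunk to a table instead of holding everything in memory
    for chunk in pd.read_csv(filename, chunksize=1_000_000):
        chunk.to_sql("data", conn, if_exists="append", index=False)

    # later, pull back only the rows you actually need with SQL
    sample = pd.read_sql_query("SELECT * FROM data LIMIT 10", conn)
    conn.close()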

To get the number of rows:

    num = 0
    reader = pd.read_csv(filename, chunksize=lines_in_chunk)  # re-create the iterator; it was consumed above
    for chunk in reader:
        num += 1
    num_of_rows = num * lines_in_chunk  # approximate: the last chunk may be shorter
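
Since the last chunk is usually shorter than lines_in_chunk, summing the chunk lengths gives the exact count in one pass (a sketch reusing filename and lines_in_chunk from the question):

    reader = pd.read_csv(filename, chunksize=lines_in_chunk)
    num_of_rows = sum(len(chunk) for chunk in reader)  # exact row count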

    # workaround: shell out to wc from Python
    import subprocess
    # note that wc -l counts every line, so this includes the header row
    num_of_rows = int(subprocess.check_output(["wc", "-l", "file.csv"]).split()[0])
fuwiak