Help improving my "read_excel" execution time in python. My code reads slowly

Question

My first question here so please bare with me.

I'm trying to feed my neural network with training data read in from an excel file. It works perfectly fine when i have less than 50 rows in the sheet. But when i try with the real excel file containing almost 4.000 rows it suddenly takes forever. Although 4.000 is a lot i'm pretty sure my way of doing it is still very inefficient.

as you can see in the code below i'm using the read_excel over and over again in the loop. I feel like there should be a way to only read the whole column 1 time and then work with it from there.

My goal is to read in 5 rows as the 1st input starting from row 0. then reading 5 rows in again but starting from row 1 and 5 rows again starting from row 3 So it's like a window of 5 rows that is read and then moving the window by 1. The output should allways be the 1 row after the window.

**Example:** if row 1-20 contains numbers 1-20 then:
input1 = [1,2,3,4,5] and output1 = 6
input2 = [2,3,4,5,6] and output2 = 7
...
input15 = [15,16,17,18,19] and output15 = 20

notice how inputs are lists and outputs are just numbers. So when i append those to the final input & output lists i end up with inputs being a list of lists and out being list of outputs

My code

from pandas import read_excel

# initialize final input & output lists. The contents of the temporary input & output lists
# are gonna be appended to these final lists
training_input = []
training_output = []

# excel relevant info
my_sheet = 'Junaid'
file_name = '../Documents/Junaid1.xlsx'

# initialize counters
loop_count = 0
row_counter = 0

for x in range(25):

    # load the excel file containing inputs & outputs
    # using parameters skiprows, nrows (number of rows) and index col
    df = read_excel(file_name, sheet_name = my_sheet, skiprows=row_counter, nrows=6, index_col=0)

    # initialize temporary input & output lists
    input_temp = []
    output_temp = []

    for y in df.index:
        # append the first 5 rows of the 6 to input list
        if loop_count < 5:
            input_temp.append(df.index[loop_count])
            loop_count += 1
        else:
            # append the 6th data to output list
            training_output.append(df.index[loop_count])

    training_input.append(input_temp)
    row_counter += 1
    loop_count = 0

score 1 · Accepted Answer · answered May 01 '20 at 10:44

Well yes it would be slow because you are opening and closing the file for every iteration of the for loop. A general rule in programming is that if the file is not constantly changing, then only open and read it a single time. Also, there are large sections of your code that can be shaved off if you simply use list comprehension

Here, I have rewritten your code to only open the file and read it once, then it creates the two lists using list comprehension and slicing.

from pandas import read_excel

# excel relevant info
my_sheet = 'Junaid'
file_name = '../Documents/Junaid1.xlsx'

df = read_excel(file_name, sheet_name = my_sheet, index_col=0, header=None)

training_input = [df.index[i:i+5].tolist() for i in range(len(df)-5)]
training_output = [df.index[i].tolist() for i in range(5, len(df))]

Also, there seems to be a bug in your code since the excel file you described in your question does not have a header (i.e. the very first row contains data), thus your code skips the very first row of values. To fix that you should pass the parameter "header=None" to the pandas function to tell it that there is no header index. You can read more about that here.

Did, this work for you? If so, please mark my answer as accepted. — A Kareem, May 01 '20 at 11:26
Sorry, i don't know what went wrong. Yes, i did work, thanks a lot. Execution time is under 1 second even with 5k rows :) Kareem, can i ask u 1 more question. What if i have 2 columns of data. so that 1 input element shouldn't just be "5 rows of column A" but "5 rows of column A + 5 rows from column B" ? — Casper, May 01 '20 at 13:12
Then you can do something like this `training_input = [df.iloc[i:i+5, 0].tolist() + df.iloc[i:i+5, 1].tolist() for i in range(len(df)-5)]` And make sure to remove the `index_col=0` argument from the pandas call to get all columns — A Kareem, May 01 '20 at 13:33

Help improving my "read_excel" execution time in python. My code reads slowly

My code

1 Answers1