1

My first question here so please bare with me.

I'm trying to feed my neural network with training data read in from an excel file. It works perfectly fine when i have less than 50 rows in the sheet. But when i try with the real excel file containing almost 4.000 rows it suddenly takes forever. Although 4.000 is a lot i'm pretty sure my way of doing it is still very inefficient.

as you can see in the code below i'm using the read_excel over and over again in the loop. I feel like there should be a way to only read the whole column 1 time and then work with it from there.

My goal is to read in 5 rows as the 1st input starting from row 0. then reading 5 rows in again but starting from row 1 and 5 rows again starting from row 3 So it's like a window of 5 rows that is read and then moving the window by 1. The output should allways be the 1 row after the window.

**Example:** if row 1-20 contains numbers 1-20 then:
input1 = [1,2,3,4,5] and output1 = 6
input2 = [2,3,4,5,6] and output2 = 7
...
input15 = [15,16,17,18,19] and output15 = 20
notice how inputs are lists and outputs are just numbers. So when i append those to the final input & output lists i end up with inputs being a list of lists and out being list of outputs

My code

from pandas import read_excel

# initialize final input & output lists. The contents of the temporary input & output lists
# are gonna be appended to these final lists
training_input = []
training_output = []

# excel relevant info
my_sheet = 'Junaid'
file_name = '../Documents/Junaid1.xlsx'

# initialize counters
loop_count = 0
row_counter = 0

for x in range(25):

    # load the excel file containing inputs & outputs
    # using parameters skiprows, nrows (number of rows) and index col
    df = read_excel(file_name, sheet_name = my_sheet, skiprows=row_counter, nrows=6, index_col=0)

    # initialize temporary input & output lists
    input_temp = []
    output_temp = []

    for y in df.index:
        # append the first 5 rows of the 6 to input list
        if loop_count < 5:
            input_temp.append(df.index[loop_count])
            loop_count += 1
        else:
            # append the 6th data to output list
            training_output.append(df.index[loop_count])

    training_input.append(input_temp)
    row_counter += 1
    loop_count = 0
Casper
  • 13
  • 3

1 Answers1

1

Well yes it would be slow because you are opening and closing the file for every iteration of the for loop. A general rule in programming is that if the file is not constantly changing, then only open and read it a single time. Also, there are large sections of your code that can be shaved off if you simply use list comprehension

Here, I have rewritten your code to only open the file and read it once, then it creates the two lists using list comprehension and slicing.

from pandas import read_excel

# excel relevant info
my_sheet = 'Junaid'
file_name = '../Documents/Junaid1.xlsx'

df = read_excel(file_name, sheet_name = my_sheet, index_col=0, header=None)

training_input = [df.index[i:i+5].tolist() for i in range(len(df)-5)]
training_output = [df.index[i].tolist() for i in range(5, len(df))]

Also, there seems to be a bug in your code since the excel file you described in your question does not have a header (i.e. the very first row contains data), thus your code skips the very first row of values. To fix that you should pass the parameter "header=None" to the pandas function to tell it that there is no header index. You can read more about that here.

A Kareem
  • 823
  • 5
  • 10
  • Hi A Kareem, thanks for your reply. – Casper May 01 '20 at 11:25
  • Did, this work for you? If so, please mark my answer as accepted. – A Kareem May 01 '20 at 11:26
  • Sorry, i don't know what went wrong. Yes, i did work, thanks a lot. Execution time is under 1 second even with 5k rows :) Kareem, can i ask u 1 more question. What if i have 2 columns of data. so that 1 input element shouldn't just be "5 rows of column A" but "5 rows of column A + 5 rows from column B" ? – Casper May 01 '20 at 13:12
  • Then you can do something like this `training_input = [df.iloc[i:i+5, 0].tolist() + df.iloc[i:i+5, 1].tolist() for i in range(len(df)-5)]` And make sure to remove the `index_col=0` argument from the pandas call to get all columns – A Kareem May 01 '20 at 13:33