22

I have x_data and labels separately. How can I combine and load them in the model using torch.utils.data.DataLoader?

I created a dataset with 20k training samples, and the labels are stored separately. Let's say I want to load the dataset into the model, reshuffle it every epoch, and use a batch size of my choice. DataLoader does that. How can I combine the samples and labels and pass them to DataLoader so that I can train a model in PyTorch?

Amarnath
  • See discussion on StackOverflow here: https://stackoverflow.com/questions/41924453/pytorch-how-to-use-dataloaders-for-custom-datasets – Johannes Mar 13 '19 at 17:00
  • I found this example using TensorDataset to be helpful: https://stackoverflow.com/questions/55588201/pytorch-transforms-on-tensordataset If `x_data` and `labels` are both Pytorch tensors, you can combine them into a `TensorDataset` then create a dataloader from that TensorDataset. – littleO Jun 11 '20 at 07:54

2 Answers

16

Assuming both x_data and labels are lists or NumPy arrays:

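import torch

# Pair each sample with its label; a list of [x, y] pairs works as a map-style dataset
train_data = []
for i in range(len(x_data)):
    train_data.append([x_data[i], labels[i]])

# shuffle=True reshuffles the pairs at the start of every epoch
trainloader = torch.utils.data.DataLoader(train_data, shuffle=True, batch_size=100)
i1, l1 = next(iter(trainloader))  # one batch of inputs and labels
print(i1.shape)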
ASHu2
  • I end up getting errors saying that train_data is an np.object rather than a tensor. – Gunner Stone Nov 28 '20 at 19:47
  • 2
    For me, it worked. You can even use a shorter version: `trainloader = torch.utils.data.DataLoader([ [x_data[i], labels[i]] for i in range(len(labels))], shuffle=True, batch_size=100)`. Thank you @ASHu2 – Leo Jun 15 '21 at 08:40
7

I think the standard way is to create a Dataset class object from the arrays and pass the Dataset object to the DataLoader.

One solution is to inherit from the Dataset class and define a custom class that implements __len__() and __getitem__(), where you pass X and y to __init__(self, X, y).
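A minimal sketch of such a custom class could look like the following. The class name `ArrayDataset`, the random stand-in data, and the choice of float32 features with int64 labels are all assumptions made for illustration, not part of the question. The indexing method that DataLoader calls on a map-style dataset is `__getitem__`.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class ArrayDataset(Dataset):
    """Hypothetical name: pairs each row of X with the matching label in y."""

    def __init__(self, X, y):
        if len(X) != len(y):
            raise ValueError("X and y must have the same number of samples")
        # float32 features and int64 labels are assumptions that suit a
        # typical classification loss such as CrossEntropyLoss
        self.X = torch.as_tensor(X, dtype=torch.float32)
        self.y = torch.as_tensor(y, dtype=torch.long)

    def __len__(self):
        # Number of samples; DataLoader uses this to build its index sampler
        return len(self.X)

    def __getitem__(self, idx):
        # Return one (sample, label) pair; DataLoader stacks these into batches
        return self.X[idx], self.y[idx]


# Stand-in data (assumed shapes): 20 samples, 4 features, 3 classes
x_data = np.random.rand(20, 4)
labels = np.random.randint(0, 3, size=20)

dataset = ArrayDataset(x_data, labels)
loader = DataLoader(dataset, batch_size=5, shuffle=True)

xb, yb = next(iter(loader))
print(xb.shape, yb.shape)
```

Writing the class by hand mostly pays off when `__getitem__` has to do more than index a row, for example loading a file from disk or applying a transform.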

For your simple case with two arrays, where __getitem__() needs to do nothing beyond returning the values in row i, you can also convert the arrays into Tensor objects and pass them to TensorDataset.

Run the following code for a self-contained example.

# Create a dataset like the one you describe
from sklearn.datasets import make_classification
X,y = make_classification()

# Load necessary Pytorch packages
from torch.utils.data import DataLoader, TensorDataset
from torch import Tensor

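# Create dataset from several tensors with matching first dimension
# Samples will be drawn from the first dimension (rows)
# Note: Tensor(y) stores the labels as float32; call .long() on it
# if the loss function expects integer class indices
dataset = TensorDataset(Tensor(X), Tensor(y))

# Create a data loader from the dataset
# Type of sampling and batch size are specified at this step;
# shuffle=True reshuffles the samples at the start of every epoch
loader = DataLoader(dataset, batch_size=3, shuffle=True)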

# Quick test
next(iter(loader))
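
As a rough sketch of how such a loader might then be consumed during training, the loop below iterates over it for two epochs. The random stand-in tensors and their shapes are assumptions, and the model update itself is left as a placeholder comment.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data (assumed shapes): 10 samples, 4 features, binary labels
X = torch.randn(10, 4)
y = torch.randint(0, 2, (10,))

dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=3, shuffle=True)

batch_sizes = []
for epoch in range(2):
    seen = 0
    # Each pass over the loader visits every sample once, in a new random order
    for xb, yb in loader:
        # A forward pass, loss computation, and optimizer step would go here
        batch_sizes.append(xb.shape[0])
        seen += xb.shape[0]
    print(f"epoch {epoch}: {seen} samples in {len(loader)} batches")
```

Because 10 is not a multiple of 3, the last batch of each epoch holds a single sample; passing `drop_last=True` to DataLoader skips that incomplete batch instead.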
Johannes