Issue
I'm creating an image classification model with Inception V3 and have two classes. I've split my dataset and labels into two numpy arrays. The data is split with trainX and testX as the images and trainY and testY as the corresponding labels.
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow import keras

# Normalize images to [0, 1] and cast labels to integers
data = np.array(data, dtype="float") / 255.0
labels = np.array(labels, dtype="uint8")

(trainX, testX, trainY, testY) = train_test_split(
    data, labels,
    test_size=0.2,
    random_state=42)

# Augmentation for the training set only
train_datagen = keras.preprocessing.image.ImageDataGenerator(
    zoom_range=0.1,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')

# No augmentation for the validation set
val_datagen = keras.preprocessing.image.ImageDataGenerator()

train_generator = train_datagen.flow(
    trainX,
    trainY,
    batch_size=batch_size,
    shuffle=True)

validation_generator = val_datagen.flow(
    testX,
    testY,
    batch_size=batch_size)
When I shuffle train_generator with ImageDataGenerator, will the images still match the corresponding labels? Also should the validation dataset be shuffled as well?
Solution
Yes, the images will still match the corresponding labels, so you can safely set shuffle to True. Under the hood it works as follows. Calling .flow() on the ImageDataGenerator returns a NumpyArrayIterator object, which implements the following logic for shuffling the indices:
def _set_index_array(self):
    self.index_array = np.arange(self.n)
    if self.shuffle:  # if shuffle == True, permute the indices
        self.index_array = np.random.permutation(self.n)
self.index_array is then used to yield both the images (x) and the labels (y) (code truncated for readability):
def _get_batches_of_transformed_samples(self, index_array):
    batch_x = np.zeros(tuple([len(index_array)] + list(self.x.shape)[1:]),
                       dtype=self.dtype)
    # use index_array to get the x's
    for i, j in enumerate(index_array):
        x = self.x[j]
        ...  # data augmentation is done here
        batch_x[i] = x
    ...
    # use the same index_array to fetch the labels
    output += (self.y[index_array],)
    return output
Check out the source code yourself, it might be easier to understand than you think.
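To convince yourself that this indexing scheme keeps images and labels paired, here is a minimal NumPy sketch of the same logic (simplified stand-ins for the iterator's internals, not the real Keras class). Each fake "image" holds its own index as its value, and its label is that same index, so a mismatch would be immediately visible:

```python
import numpy as np

rng = np.random.RandomState(42)
x = np.arange(10)  # stand-in "images": image i has value i
y = np.arange(10)  # matching labels: label i for image i

# What _set_index_array does when shuffle == True
index_array = rng.permutation(len(x))

# What _get_batches_of_transformed_samples does: the SAME shuffled
# indices gather both the images and the labels
batch_x = x[index_array]
batch_y = y[index_array]

assert (batch_x == batch_y).all()  # every image still has its own label
```

Because one index_array drives both lookups, shuffling only changes the order in which (image, label) pairs are yielded, never the pairing itself.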
Shuffling the validation data makes no practical difference. The main point of shuffling is to introduce extra stochasticity into the training process; validation batches are only used to compute metrics, so their order does not affect the results.
Answered By - sdcbr