Issue
I'm creating an image classification model with Inception V3 and have two classes. I've split my dataset and labels into two numpy arrays. The data is split with trainX and testX as the images and trainY and testY as the corresponding labels.
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow import keras

# Normalize images to [0, 1] and cast labels to integers
data = np.array(data, dtype="float") / 255.0
labels = np.array(labels, dtype="uint8")

(trainX, testX, trainY, testY) = train_test_split(
    data, labels,
    test_size=0.2,
    random_state=42)

# Augmentation for the training set only
train_datagen = keras.preprocessing.image.ImageDataGenerator(
    zoom_range=0.1,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')

# No augmentation for the validation set
val_datagen = keras.preprocessing.image.ImageDataGenerator()

train_generator = train_datagen.flow(
    trainX,
    trainY,
    batch_size=batch_size,
    shuffle=True)

validation_generator = val_datagen.flow(
    testX,
    testY,
    batch_size=batch_size)
When I shuffle train_generator with ImageDataGenerator, will the images still match the corresponding labels? Also should the validation dataset be shuffled as well?
Solution
Yes, the images will still match the corresponding labels, so you can safely set shuffle to True. Under the hood it works as follows. Calling .flow() on the ImageDataGenerator returns a NumpyArrayIterator object, which implements the following logic for shuffling the indices:
def _set_index_array(self):
    self.index_array = np.arange(self.n)
    if self.shuffle:  # if shuffle == True, permute the indices
        self.index_array = np.random.permutation(self.n)
self.index_array is then used to yield both the images (x) and the labels (y) (code truncated for readability):
def _get_batches_of_transformed_samples(self, index_array):
    batch_x = np.zeros(tuple([len(index_array)] + list(self.x.shape)[1:]),
                       dtype=self.dtype)
    # use index_array to get the x's
    for i, j in enumerate(index_array):
        x = self.x[j]
        ...  # data augmentation is done here
        batch_x[i] = x
    ...
    # use the same index_array to fetch the labels
    output += (self.y[index_array],)
    return output
Check out the source code yourself, it might be easier to understand than you think.
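To convince yourself that this indexing scheme keeps images and labels paired, here is a minimal NumPy sketch of the same logic (simplified stand-ins for the iterator's internals, not the real Keras class). Each fake "image" holds its own index as its value, and its label is that same index, so a mismatch would be immediately visible:

```python
import numpy as np

rng = np.random.RandomState(42)
x = np.arange(10)  # stand-in "images": image i has value i
y = np.arange(10)  # matching labels: label i for image i

# What _set_index_array does when shuffle == True
index_array = rng.permutation(len(x))

# What _get_batches_of_transformed_samples does: the SAME shuffled
# indices gather both the images and the labels
batch_x = x[index_array]
batch_y = y[index_array]

assert (batch_x == batch_y).all()  # every image still has its own label
```

Because one index_array drives both lookups, shuffling only changes the order in which (image, label) pairs are yielded, never the pairing itself.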
Shuffling the validation data makes no practical difference. The main point of shuffling is to introduce extra stochasticity into the training process; validation batches are only used to compute metrics, so their order does not affect the results.
Answered By - sdcbr