Notes on pylearn2 datasets
These are just rough notes made while trying to figure out how to load our dataset into pylearn2 properly. The requirements are that we want to be able to load images only when minibatches require them, doing the preprocessing at run time. Also, the dataset object must have the same interface that pylearn2 is expecting.
Looking at the code directly, starting from the train.py script in pylearn2.scripts, it looks like the magic is done when the yaml is loaded. Then, the script simply calls the loaded model's main_loop method.
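In outline, that seems to boil down to something like this sketch (not the actual script; "model.yaml" is just a placeholder filename):

```python
# A minimal sketch of what the train.py flow seems to boil down to
# ("model.yaml" is just a placeholder filename):
from pylearn2.config import yaml_parse

with open("model.yaml") as f:
    train_obj = yaml_parse.load(f.read())  # instantiates Train, dataset, model, algorithm
train_obj.main_loop()                      # runs the training loop
```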
Looking at the notebook tutorial on MNIST with convnets - specifically the example yaml provided - it looks like the Train class is the important part: that contains the main_loop method. The dataset is instantiated at initialisation. In the case of MNIST, this involves (after checking a control.get_load_data() function that always returns true, at the moment) using serial functions to load the data. It uses an internal caching function; unclear yet what that does. Then it looks like this basically just loads the images into a large 3d numpy array.
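A rough sketch of what that loading step appears to amount to (the pickle paths here are placeholders; the real code resolves paths and handles caching itself):

```python
# Rough sketch of the MNIST loading step (placeholder paths, not the real code):
from pylearn2.datasets import control
from pylearn2.utils import serial

if control.get_load_data():                             # currently always returns True
    topo_view = serial.load("mnist_train_images.pkl")   # placeholder path -> (N, rows, cols) array
    y = serial.load("mnist_train_labels.pkl")           # placeholder path -> labels
```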
The MNIST dataset just inherits from dense_design_matrix.DenseDesignMatrix. Documentation on this is in the pylearn2 docs. All that remains in the object after MNIST initialisation is:
- `self.shuffle_rng` - suppose this is a random number generator for shuffling.
- `self.X` - guess this is the matrix being loaded.
- `self.y` - labels that were loaded.
And it calls the initialisation method of its parent class DenseDesignMatrix, so it also ends up with whatever that creates. This happens on line 141:
```python
super(MNIST, self).__init__(topo_view=dimshuffle(topo_view), y=y,
                            axes=axes, y_labels=y_labels)
```
Stored variables are:
- `self.X`
- `self.y`
- `self.view_converter` - unclear what this is at the moment.
- `self.X_labels` - optional labels for X values.
- `self.y_labels` - optional labels for y values.
- `self.X_topo_space` - stores a "default" topological space that will be used only when `self.iterator` is called without a `data_specs`, and with "topo=True", which is deprecated.
- `self.data_specs` - unclear what this is.
- `self.compress` - stored option.
- `self.design_loc` - stored option.
- `self.rng` - another random number generator.
- `self._iter_mode` - default option for iterator.
- `self._iter_topo` - default option for iterator.
- `self._iter_targets` - default option for iterator.
- `self._iter_data_specs` - default option for iterator.
- `self.preprocessor` - stored preprocessor, if handed as a keyword arg.
So looking at this, it looks like it would be possible to make a class inheriting DenseDesignMatrix that would load all of our images as a large numpy array, applying preprocessing functions at load time (see the sketch below). Unfortunately, if we have a good number of preprocessing functions we'll quickly swamp the RAM. So we need a way of loading the data that, at the very least, applies the preprocessing only when the minibatch is required.
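A sketch of that first idea, assuming hypothetical helpers load_our_images, preprocess_all and load_our_labels for our data; everything is loaded and preprocessed up front, which is exactly what would swamp the RAM:

```python
# Sketch only: a dataset inheriting DenseDesignMatrix that preprocesses at load time.
import numpy as np
from pylearn2.datasets.dense_design_matrix import DenseDesignMatrix

class OurDataset(DenseDesignMatrix):
    def __init__(self, which_set='train', axes=('b', 0, 1, 'c')):
        topo_view = load_our_images(which_set)   # hypothetical: (N, rows, cols, channels) array
        topo_view = preprocess_all(topo_view)    # hypothetical: all preprocessing at load time
        y = load_our_labels(which_set)           # hypothetical: (N, 1) array of class indices
        super(OurDataset, self).__init__(topo_view=topo_view, y=y, axes=axes,
                                         y_labels=int(np.max(y)) + 1)
```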
Guess what we've got to look at is the code that is used to get a minibatch out of these dataset objects.
Looks like it is recommended for code performing learning to use the Dataset.iterator method (it says this in the get_batch_design docs).
That method should return an iterator for the dataset: an iterator object implementing the standard Python iterator protocol (i.e. it has an `__iter__` method that returns the object itself, and a `next()` method that returns results until it raises StopIteration).
But, on each iteration what is this iterator supposed to actually return?
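For reference, this is the sort of call learning code seems to make, as a sketch only: it assumes an existing DenseDesignMatrix-style instance named dataset with MNIST-sized features and targets, and in practice the data_specs would be supplied by the model/cost rather than written by hand:

```python
# Sketch only: pulling minibatches out of a dataset via its iterator method.
from pylearn2.space import CompositeSpace, VectorSpace

data_specs = (CompositeSpace([VectorSpace(dim=784), VectorSpace(dim=10)]),
              ('features', 'targets'))
it = dataset.iterator(mode='sequential', batch_size=100, data_specs=data_specs)
for X_batch, y_batch in it:
    pass  # each iteration returns one minibatch, formatted to the requested spaces
```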
The DenseDesignMatrix class instantiates something called FiniteDatasetIterator in pylearn2.utils.iteration. Looking at that.
FiniteDatasetIterator doesn't inherit from anything else in pylearn2, but that's probably because it's just a wrapper for subset iterators. Referred to the documentation on subset iterators. Looks like that could be inherited, and our dataset class could use it to shoehorn in our preprocessing functions (roughly as in the sketch below). Unfortunately, I don't know how it might work with the test data, as we don't want the test data treated in the same way, unless we augment the test data and then aggregate predictions that correspond to the same image.
Figuring out how to do that might also be complicated.
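For the training side, the shoehorning idea might look something like this sketch, where apply_preprocessing stands in for our (hypothetical) run-time preprocessing and augmentation functions:

```python
# Sketch only: subclass FiniteDatasetIterator so preprocessing happens when a
# minibatch is requested rather than at load time.
from pylearn2.utils.iteration import FiniteDatasetIterator

class PreprocessingIterator(FiniteDatasetIterator):
    def next(self):
        # get the raw minibatch from the wrapped subset iterator
        batch = super(PreprocessingIterator, self).next()
        # preprocess only this minibatch, so RAM usage stays bounded
        return apply_preprocessing(batch)
```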