Notes on pylearn2 datasets
These are just rough notes made while trying to figure out how to load our dataset into pylearn2 properly. The requirements are that we want to be able to load images only when minibatches require them, doing the preprocessing at run time. Also, the dataset object must have the same interface that pylearn2 is expecting.
Looking at the code directly, starting from the train.py script in pylearn2.scripts, it looks like the magic is done when the yaml is loaded. Then, the script simply calls the loaded model's main_loop method.
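In outline, that seems to boil down to something like this sketch (not the actual script; "model.yaml" is just a placeholder filename):

```python
# A minimal sketch of what the train.py flow seems to boil down to
# ("model.yaml" is just a placeholder filename):
from pylearn2.config import yaml_parse

with open("model.yaml") as f:
    train_obj = yaml_parse.load(f.read())  # instantiates Train, dataset, model, algorithm
train_obj.main_loop()                      # runs the training loop
```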
Looking at the notebook tutorial on MNIST with convnets - specifically the example yaml provided - it looks like the Train class is the important part: that contains the main_loop method. The dataset is instantiated at initialisation. In the case of MNIST, this involves (after checking a control.get_load_data() function that always returns true, at the moment) using serial functions to load the data. It uses an internal caching function; unclear yet what that does. Then it looks like this basically just loads the images into a large 3d numpy array.
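A rough sketch of what that loading step appears to amount to (the pickle paths here are placeholders; the real code resolves paths and handles caching itself):

```python
# Rough sketch of the MNIST loading step (placeholder paths, not the real code):
from pylearn2.datasets import control
from pylearn2.utils import serial

if control.get_load_data():                             # currently always returns True
    topo_view = serial.load("mnist_train_images.pkl")   # placeholder path -> (N, rows, cols) array
    y = serial.load("mnist_train_labels.pkl")           # placeholder path -> labels
```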
The MNIST dataset just inherits from dense_design_matrix.DenseDesignMatrix. Documentation on this is in the pylearn2 docs. All that remains in the object after MNIST initialisation is:
- `self.shuffle_rng` - suppose this is a random number generator for shuffling.
- `self.X` - guess this is the matrix being loaded.
- `self.y` - labels that were loaded.
And it calls the initialisation method of its parent class DenseDesignMatrix, so it also ends up with whatever that creates. This happens on line 141:
```python
super(MNIST, self).__init__(topo_view=dimshuffle(topo_view), y=y,
                            axes=axes, y_labels=y_labels)
```
Stored variables are:
- `self.X`
- `self.y`
- `self.view_converter` - unclear what this is at the moment.
- `self.X_labels` - optional labels for X values.
- `self.y_labels` - optional labels for y values.
- `self.X_topo_space` - stores a "default" topological space that will be used only when `self.iterator` is called without a `data_specs`, and with "topo=True", which is deprecated.
- `self.data_specs` - unclear what this is.
- `self.compress` - stored option.
- `self.design_loc` - stored option.
- `self.rng` - another random number generator.
- `self._iter_mode` - default option for iterator.
- `self._iter_topo` - default option for iterator.
- `self._iter_targets` - default option for iterator.
- `self._iter_data_specs` - default option for iterator.
- `self.preprocessor` - stored preprocessor, if handed as a keyword arg.
So looking at this, it looks like it would be possible to make a class inheriting DenseDesignMatrix that would load all of our images as a large numpy array, applying preprocessing functions at load time (see the sketch below). Unfortunately, if we have a good number of preprocessing functions we'll quickly swamp the RAM. So we need a way of loading the data that, at the very least, applies the preprocessing only when the minibatch is required.
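A sketch of that first idea, assuming hypothetical helpers load_our_images, preprocess_all and load_our_labels for our data; everything is loaded and preprocessed up front, which is exactly what would swamp the RAM:

```python
# Sketch only: a dataset inheriting DenseDesignMatrix that preprocesses at load time.
import numpy as np
from pylearn2.datasets.dense_design_matrix import DenseDesignMatrix

class OurDataset(DenseDesignMatrix):
    def __init__(self, which_set='train', axes=('b', 0, 1, 'c')):
        topo_view = load_our_images(which_set)   # hypothetical: (N, rows, cols, channels) array
        topo_view = preprocess_all(topo_view)    # hypothetical: all preprocessing at load time
        y = load_our_labels(which_set)           # hypothetical: (N, 1) array of class indices
        super(OurDataset, self).__init__(topo_view=topo_view, y=y, axes=axes,
                                         y_labels=int(np.max(y)) + 1)
```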
Guess what we've got to look at is the code that is used to get a minibatch out of these dataset objects.
Looks like it is recommended for code performing learning to use the Dataset.iterator method (it says this in the get_batch_design docs).
That method should return an iterator for the dataset: an iterator object implementing the standard Python iterator protocol (i.e. it has an `__iter__` method that returns the object itself, and a `next()` method that returns results until it raises StopIteration).
But, on each iteration what is this iterator supposed to actually return?
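For reference, this is the sort of call learning code seems to make, as a sketch only: it assumes an existing DenseDesignMatrix-style instance named dataset with MNIST-sized features and targets, and in practice the data_specs would be supplied by the model/cost rather than written by hand:

```python
# Sketch only: pulling minibatches out of a dataset via its iterator method.
from pylearn2.space import CompositeSpace, VectorSpace

data_specs = (CompositeSpace([VectorSpace(dim=784), VectorSpace(dim=10)]),
              ('features', 'targets'))
it = dataset.iterator(mode='sequential', batch_size=100, data_specs=data_specs)
for X_batch, y_batch in it:
    pass  # each iteration returns one minibatch, formatted to the requested spaces
```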
The DenseDesignMatrix class instantiates something called FiniteDatasetIterator in pylearn2.utils.iteration. Looking at that.
FiniteDatasetIterator doesn't inherit from anything else in pylearn2, but that's probably because it's just a wrapper for subset iterators. Referred to the documentation on subset iterators. Looks like that could be inherited, and our dataset class could use it to shoehorn in our preprocessing functions (roughly as in the sketch below). Unfortunately, I don't know how it might work with the test data, as we don't want the test data treated in the same way, unless we augment the test data and then aggregate predictions that correspond to the same image.
Figuring out how to do that might also be complicated.
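For the training side, the shoehorning idea might look something like this sketch, where apply_preprocessing stands in for our (hypothetical) run-time preprocessing and augmentation functions:

```python
# Sketch only: subclass FiniteDatasetIterator so preprocessing happens when a
# minibatch is requested rather than at load time.
from pylearn2.utils.iteration import FiniteDatasetIterator

class PreprocessingIterator(FiniteDatasetIterator):
    def next(self):
        # get the raw minibatch from the wrapped subset iterator
        batch = super(PreprocessingIterator, self).next()
        # preprocess only this minibatch, so RAM usage stays bounded
        return apply_preprocessing(batch)
```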