Monday, December 31, 2012

Generating Mona Lisa Pixel By Pixel

I've been experimenting with MNIST for a while now, and have recently concluded that it is not a very good problem for highlighting the strengths of deep neural networks. I was able to get near state-of-the-art results (for neural nets, not RBMs) using just a single (albeit large) hidden layer of 6000 rectified linear neurons.

Inspired by a blog post I saw some time ago about evolving an image of the Mona Lisa, I had an idea for a task that might better highlight the benefits of deep neural networks.

The idea is to use supervised regression to learn a mapping from pixel locations to RGB colors, with the goal of generating an image one pixel at a time. This means that if the dimensions of the target image are X-by-Y, the network must be run X·Y times to produce the full image. If the image is 100-by-100, there will be 10,000 training examples.

All of the locations and RGB values are first normalized to fall into the range [0, 1].

So, as a concrete example, if the pixel in the very middle of an image is beige, then the training example associated with that pixel could look like:

(0.5, 0.5) → (0.96, 0.96, 0.86)

Again, the two values on the left-hand side represent the pixel's location, and the three values on the right-hand side represent red, green, and blue, respectively.
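Building this dataset is just a matter of enumerating pixel coordinates and colors. Here's a minimal sketch of that preprocessing step in NumPy; the function name and the choice of passing in an already-loaded `(H, W, 3)` uint8 array (e.g. from Pillow) are my own, not from anything described above:

```python
import numpy as np

def make_dataset(img):
    """Turn an (H, W, 3) uint8 image array into pixel-regression pairs.

    Returns inputs of shape (H*W, 2) -- (x, y) locations normalized
    to [0, 1] -- and targets of shape (H*W, 3) -- RGB colors likewise
    normalized to [0, 1].
    """
    h, w, _ = img.shape
    # Grid of pixel coordinates in row-major order.
    ys, xs = np.mgrid[0:h, 0:w]
    inputs = np.stack([xs / (w - 1), ys / (h - 1)], axis=-1).reshape(-1, 2)
    targets = (img / 255.0).reshape(-1, 3)
    return inputs, targets
```

For the beige center-pixel example above, a 245/245/220 pixel in the middle of the image maps to inputs (0.5, 0.5) and targets of roughly (0.96, 0.96, 0.86).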

I've begun doing some preliminary testing of this idea. I'm training neural networks to generate the Mona Lisa, pixel by pixel.

The first picture (above) is the result of training a neural network with a similar architecture to the one I used successfully on the MNIST task. It only uses a single hidden layer.

Now, what happens if we instead use three hidden layers?

The result of using multiple hidden layers is quite perceptibly better.
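For concreteness, the kind of network being compared here can be sketched as a fully connected net with rectified linear hidden layers and a sigmoid output to keep the RGB predictions in [0, 1]. The layer widths below are placeholders (the post doesn't give the exact sizes of the three hidden layers), and this shows only the forward pass, not training:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random parameters for a fully connected net.

    sizes lists the layer widths, e.g. [2, 300, 300, 300, 3] for a
    2-input, 3-output net with three hidden layers.
    """
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """ReLU hidden layers; sigmoid output keeps colors in [0, 1]."""
    for w, b in params[:-1]:
        x = np.maximum(0.0, x @ w + b)
    w, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))
```

Rendering the image then just means calling `forward` on every normalized pixel location and reshaping the outputs back to X-by-Y-by-3.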

One drawback is that there is no straightforward way to measure anything like generalization (there are no test or validation sets). I suppose it might be possible to hide certain small parts of the picture during training to see how well the network can fill in the gaps, but this idea seems fraught with difficulties.
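The hide-a-patch idea could be sketched as holding out a rectangular block of pixels as a makeshift test set. This helper and its name are hypothetical, not something described above; it assumes the pixel pairs are in row-major order, as a simple enumeration would produce:

```python
import numpy as np

def split_holdout_patch(inputs, targets, h, w, top, left, size):
    """Hold out a size-by-size pixel patch as a makeshift test set.

    Assumes inputs/targets are in row-major pixel order for an
    h-by-w image. Returns (train_pair, test_pair).
    """
    idx = np.arange(h * w).reshape(h, w)
    patch = idx[top:top + size, left:left + size].ravel()
    mask = np.ones(h * w, dtype=bool)
    mask[patch] = False  # False marks held-out pixels
    return (inputs[mask], targets[mask]), (inputs[~mask], targets[~mask])
```

One of the difficulties is exactly what this makes visible: the held-out pixels sit immediately next to training pixels, so "filling in the gap" may measure interpolation more than generalization.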

However, I'm not overly concerned. Just learning to generate the picture appears to be a very difficult learning challenge, one that already seems to clearly highlight the limitations of shallow neural architectures. So although this task lacks some of the characteristics of tasks normally found in machine learning, I think there is still much to be gained from further exploration.

EDIT: Just added source code for this on GitHub:


William Payne said...

I wonder how well this approach would perform at super-resolution tasks?

Thomas said...

Interesting. I'm familiar with the idea, but haven't looked into how it is done. Do you have any good papers or other online resources to recommend?

Lerc said...

I tried pretty much the same thing, except with a far simpler network. While not accurate, the results from just 50 neurons were quite spectacular. I liked it enough that I'm currently painting the result on canvas.

Thomas said...

@Lerc: Do you have an image of it online? I'd love to see it!
