Image-to-Image Translation: Machine Learning Magic that Converts Winter Photos Into Summer
A magician performs his trick with just a wave of a magic wand; our engineers work their magic with just one click! Curious how the same winter landscape would look in summer? Then keep on reading.

Our R&D department carried out comprehensive research to investigate what current Machine Learning generators can do, and what quality they can deliver, for the task of translating a winter image into a summer one, by analogy with automatic language translation. We investigated different generative models and application methods, prepared a solid dataset, and adapted the most suitable model to our requirements.

Turning a photo taken in winter into a summery image could be applied in landscape design and related fields, giving a clear vision of the same scene in different seasons. The same technology could also enable various other color transformations of images.
The secret to our magic is the cGAN neural network
Many approaches and processing models were researched, analyzed, and tested: neural networks such as cGAN and CVAE, and pixel-to-pixel translation. cGAN turned out to be the most suitable for our image-to-image translation task, for the following reasons:
- Sharp & realistic image synthesis
- Semi-supervised learning
- Excellent test of our ability to work with high-dimensional, complicated data
- Bleeding edge ML technology
A conditional GAN (cGAN) is a variant of GANs – Generative Adversarial Networks. A GAN is a type of generative neural network that consists of two networks: a discriminator (D) and a generator (G). The key concept behind them is easy to grasp if you imagine a team of counterfeiters as G and the police as D. The counterfeiters (G) constantly try to produce fake currency (images), while the police (D) try to tell whether it is real or fake. As time passes, both get better at their jobs. A conditional GAN follows the same principle, but adds an extra piece of information – a condition. So now, instead of asking the counterfeiters to produce just any fake currency, we ask them to produce, say, a fake $100 bill. Similarly, instead of asking the police only whether a bill is real or fake, we also ask them to judge whether it is a $100 bill. This lets us train G to generate even more realistic currency (images, in our case).

Figure 1. cGAN structure
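The counterfeiter-and-police game above is usually written as a minimax objective. A standard formulation of the conditional GAN loss (as in the pix2pix paper), where x is the condition (the winter photo), y the real target (the summer photo), and z a noise vector, is:

```latex
\min_G \max_D \; \mathcal{L}_{cGAN}(G,D)
  = \mathbb{E}_{x,y}\big[\log D(x,y)\big]
  + \mathbb{E}_{x,z}\big[\log\big(1 - D\big(x, G(x,z)\big)\big)\big]
```

G tries to minimize this objective while D tries to maximize it – exactly the arms race described above.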
While working on the image-to-image translation of winter photos, our image processing engineers faced numerous challenges, such as:
- dataset aggregation and preparation;
- long iterations;
- the cost of experiments;
- the complexity of cGAN, with many knobs to configure.
They continued their research on image-to-image translation with a large, comprehensive collection of images. At first, our R&D team worked with the Transient Attributes Dataset. However, it did not contain enough images to reach the desired results and accuracy, so an additional image dataset was needed. For this purpose, we selected the “Nordland Line” train video footage. Details and samples of both datasets are provided below.
Dataset №1. Annotated photos from 101 webcams with outdoor scenes.
- Annotated by people during a crowdsourcing campaign (downloaded).
- Filtered: picked the images annotated as “most summer” and “most winter” to form pairs (pandas).
3000 image pairs of 2 seasons / 640×480, scaled to 256×256.
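The filtering step can be sketched as follows. The real pipeline used pandas; this dependency-free Python sketch uses hypothetical field names (`webcam`, `image`, `summer`, `winter`) to show the idea of picking the most-summer and most-winter shot per webcam:

```python
# Simplified sketch of the dataset №1 filtering step. The annotation
# record layout here is hypothetical; the actual campaign data differs.

def pick_season_pairs(annotations):
    """For each webcam, pair its most-summer image with its most-winter one.

    `annotations` is a list of dicts like:
        {"webcam": "cam01", "image": "a.jpg", "summer": 0.9, "winter": 0.1}
    Returns a list of (summer_image, winter_image) tuples, one per webcam.
    """
    by_cam = {}
    for rec in annotations:
        by_cam.setdefault(rec["webcam"], []).append(rec)

    pairs = []
    for cam, recs in sorted(by_cam.items()):
        most_summer = max(recs, key=lambda r: r["summer"])
        most_winter = max(recs, key=lambda r: r["winter"])
        pairs.append((most_summer["image"], most_winter["image"]))
    return pairs
```

With pandas, the same selection is a `groupby("webcam")` followed by `idxmax` on the two score columns.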
Dataset №2. Four 10-hour Full HD videos of footage recorded from the train during the “Nordland Line” trip in Norway in all four seasons.
- Time-synced (downloaded)
- Cut into frames (FFmpeg)
- Auto-aligned best nearby frames (Python + Hugin)
9000 image pairs of 2 seasons / 1000×1000, scaled to 256×256.
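The “best nearby frames” matching can be illustrated with a toy sketch. The actual pipeline aligned frames with Python + Hugin; here a frame is just a flat list of pixel intensities, and similarity is the sum of absolute differences (an assumption for illustration, not the production metric):

```python
# Illustrative sketch of matching a winter frame to the best nearby
# summer frame from a time-synced video (dataset №2).

def sad(frame_a, frame_b):
    """Sum of absolute differences between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b))

def best_nearby_frame(reference, candidates, center, radius=2):
    """Return the index of the candidate frame within `radius` of `center`
    that most resembles `reference` (lowest SAD)."""
    lo = max(0, center - radius)
    hi = min(len(candidates), center + radius + 1)
    return min(range(lo, hi), key=lambda i: sad(reference, candidates[i]))
```

Because the two trips are only roughly time-synced, searching a small window around the expected position compensates for drift between the recordings.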
Image-to-image translation results
Our engineers have adapted the model to satisfy their requirements and are sharing their best results with you.
Training set: 11,500 image pairs; testing set: ~600 image pairs. Here are the best setup details:
- Random jitter: resize to 286×286, then random-crop to 256×256; random horizontal mirroring
- Conditional D model
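The jitter step of this setup can be sketched in a few lines. Images here are nested lists (rows of pixels) so the example stays dependency-free; real training code would operate on torch or numpy tensors:

```python
# Minimal sketch of pix2pix-style jitter: random 256x256 crop out of a
# 286x286 image, plus random horizontal mirroring.
import random

def random_jitter(image, crop=256, flip_prob=0.5, rng=random):
    """Randomly crop `image` to crop x crop and maybe mirror it horizontally.

    `image` is assumed to already be upscaled (e.g. to 286x286).
    """
    h, w = len(image), len(image[0])
    top = rng.randrange(h - crop + 1)
    left = rng.randrange(w - crop + 1)
    cropped = [row[left:left + crop] for row in image[top:top + crop]]
    if rng.random() < flip_prob:
        cropped = [row[::-1] for row in cropped]  # horizontal mirror
    return cropped
```

For paired translation the same crop offsets and flip decision must be applied to both the input (winter) and target (summer) image, or the pair falls out of alignment.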
We would also like to share some of the variations we tried during the image-to-image translation experiments. The most typical were the following:
- Dataset manipulations and mixing
- Jittering, random mirroring, number of epochs
- Unconditional vs. conditional D model
Typical effects, as captioned in the result images (the images themselves are not reproduced here):
- Jitter 900×900 -> 256×256: the same detailed patterns everywhere
- Too much of dataset №1 / too much of dataset №2: mixed illumination, artifacts
- ImageGAN: a bit distorted
- PixelGAN: very blurry
- No L1 term: loss of low-level detail
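The “no L1 term” failure matches the published pix2pix formulation, which adds a weighted L1 reconstruction term to the cGAN loss precisely to retain low-level structure:

```latex
\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\big[\,\lVert y - G(x,z) \rVert_1\,\big],
\qquad
G^{*} = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G,D) + \lambda\,\mathcal{L}_{L1}(G)
```

In the original pix2pix paper λ = 100; dropping the L1 term leaves only the adversarial signal, hence the loss of low-level detail seen above.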
- AWS p2.xlarge (NVIDIA GK210 GPU, 12 GB GPU memory, CUDA)
- pix2pix – Lua/Torch implementation of cGAN.
- Hugin – photo-aligning tool
- Python, pandas, NumPy.
To get an objective assessment of the quality of the output pictures, we need a few independent experts to compare two images, one a real photo and the other a generated picture, and indicate which one is the real photo. Would you like to take part in the experiment? Get in touch with us!
Tools & Technologies: deep learning, convolutional neural networks, generative adversarial networks, computer vision, Lua, Torch, OpenCV, Linux, Git.