Projects @British Machine Vision Conference 2017

INTRODUCTION:

A Sondrel Engineering Consultant and alumnus of Imperial College, Samuel Kong, attended this year's British Machine Vision Conference, held at his old university.

In his first blog, Samuel gave a summary explanation of Image Convolution, Artificial Neural Networks (ANN) and Convolutional Neural Networks (CNN). In this post we will look at some specific projects presented over the course of the conference, showing current innovation and development in these fields:

PROBABILISTIC AND DEEP MODELS FOR 3D RECONSTRUCTION

Presented by Andreas Geiger of the Max Planck Institute.

This project presented refined solutions to two different problems. The first was building 3D models of landmarks from multiple photos taken from different, known angles and distances. The second was how to represent 3D models in the form of voxels.

For the first problem of modelling 3D landmarks, previous work used deterministic techniques to reconstruct 3D objects or landmarks. The problem with such techniques is that they struggle to reconstruct surfaces with little or no texture, such as a grass lawn or a reflective surface. The project presented uses a novel probabilistic technique that explicitly models uncertainty and makes use of non-local priors.

The second problem being tackled concerned how a 3D model is represented as a data structure. Typically, a 3D model is constructed out of voxels, the 3D equivalent of pixels in a 2D image, laid out in a 3D grid.

The problem highlighted by the presentation was that high voxel resolution puts an intensive strain on memory. For a square 2D image, the resource requirement scales with the square of the image width, whereas for a cubic 3D voxel model it scales with the cube of the width: a 256 x 256 image holds around 65 thousand pixels, but a 256 x 256 x 256 voxel grid holds almost 17 million voxels. In addition, the presentation noted that the voxel space is sparsely populated: most regions are empty and contain no information. In order to cut down on the memory used on these empty regions of the voxel space, a new method of structuring voxels was presented.

OctNet is a project that attempts to efficiently partition a 3D voxel space so that neighbouring voxels containing no information, or the same information, can be grouped together to form a single larger voxel [10]. Using this technique, 3D models can be high resolution yet efficient in memory usage.
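
To give a feel for the idea (a minimal sketch in Python, not the OctNet implementation; the grid size and helper names are invented for illustration), a dense occupancy grid can be recursively split into eight octants, and any octant whose voxels all share the same value is stored as a single node rather than being subdivided further:

import numpy as np

def build_octree(grid):
    # Uniform block: store a single value instead of size**3 voxels.
    if np.all(grid == grid.flat[0]):
        return grid.flat[0]
    # Otherwise split the cube into eight half-sized children.
    half = grid.shape[0] // 2
    return [build_octree(grid[x:x + half, y:y + half, z:z + half])
            for x in (0, half) for y in (0, half) for z in (0, half)]

def count_leaves(node):
    # Number of stored values, a rough proxy for memory use.
    if isinstance(node, list):
        return sum(count_leaves(child) for child in node)
    return 1

# A mostly-empty 64^3 grid containing one small occupied cube.
grid = np.zeros((64, 64, 64), dtype=np.uint8)
grid[10:14, 20:24, 30:34] = 1
print(grid.size, "dense voxels vs", count_leaves(build_octree(grid)), "octree leaves")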

For more information, refer to http://www.cvlibs.net/projects.php.

DEEP FACE HALLUCINATION FOR UNVIEWED SKETCHES

Presented by Conghui Hu of Queen Mary University of London.

The background of this project was taking a criminal facial composite (a sketch) and synthesising from it an inferred, life-like facial image. The project proposes an improved methodology for synthesising facial images using a combination of different neural networks [11].

The solution presented uses a set of CNNs together with a Generative Adversarial Network (GAN) to generate accurate facial images. In a GAN, two networks are trained side by side: a generator that synthesises facial images and a discriminator that tries to tell the synthesised images apart from real photographs. By training the two networks against each other, the generator learns to produce crisper and more accurate images.
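
As a rough illustration of the adversarial training idea (a minimal PyTorch sketch with placeholder layer sizes, not the architecture from the paper), one training step first updates the discriminator to separate real faces from generated ones, then updates the generator so that its outputs are scored as real:

import torch
import torch.nn as nn

# Placeholder networks: the real project uses CNNs over face images.
generator = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 64 * 64))
discriminator = nn.Sequential(nn.Linear(64 * 64, 256), nn.ReLU(), nn.Linear(256, 1))

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_faces):
    batch = real_faces.size(0)
    noise = torch.randn(batch, 100)

    # 1) Update the discriminator: real faces should score 1, fakes 0.
    fake_faces = generator(noise).detach()
    d_loss = bce(discriminator(real_faces), torch.ones(batch, 1)) + \
             bce(discriminator(fake_faces), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Update the generator: its fakes should be scored as real.
    fake_faces = generator(noise)
    g_loss = bce(discriminator(fake_faces), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Example call with a dummy batch of flattened 64x64 "faces".
print(train_step(torch.rand(8, 64 * 64)))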

CHARACTER IDENTIFICATION IN TV SERIES WITHOUT A SCRIPT

Presented by Arsha Nagrani of Oxford University.

The aim of this project was to train a computer to recognise a character in a TV series and to track him or her through the scenes in which they are visible. The example used in the project was the TV series Sherlock, which stars the actor Benedict Cumberbatch. The presenter summarised the challenges she tackled in training a neural network to recognise the characters using only a training set of photos of the actors. She explained that photos of actors are not representative of how they look in the show, as their appearance is heavily modified and the lighting and scenery are completely different from those of a photoshoot. To get around this, the project proposes a three-stage architecture.

The presenter explained that the intended future application of the project is in video playback devices, where users could skip to a specific point in a video based on where certain characters appear or are grouped together. However, she also acknowledged that the technology is still in its early days and that no hardware yet exists to make the process real-time. The current platform is a desktop PC equipped with high-end Intel i7 processors and very high-end Nvidia Titan X graphics cards.

For more information, refer to the white paper titled: From Benedict Cumberbatch to Sherlock Holmes: Character Identification in TV series without a Script [12].

YOU SAID THAT?

Presented by Joon Son Chung of University of Oxford.

This project was a demonstration of how machine vision can learn a person's face and generate a video of that face talking. The examples presented at the conference were a synthesised face of Jeremy Corbyn repeating a Forrest Gump quote, and a modified news recording in which the interviewee is dubbed to be singing. The news example can be found on YouTube: https://www.youtube.com/watch?v=lXhkxjSJ6p8

The model used by this project to create the synthesised and dubbed videos is called Speech2Vid [14].
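
At a high level, Speech2Vid-style models take a still image of the target identity together with a short window of audio features and decode a frame of that identity mouthing the audio; sliding the window along the clip produces the video. The outline below is an assumed, simplified sketch in PyTorch (the feature shapes, layer sizes and class name are invented for illustration and are not taken from the paper):

import torch
import torch.nn as nn

class TalkingFaceNet(nn.Module):
    # Toy encoder-decoder: (still face, audio window) -> one output frame.

    def __init__(self):
        super().__init__()
        # Identity encoder: embeds a 64x64 grayscale face image.
        self.face_enc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.ReLU())
        # Audio encoder: embeds a short window of MFCC features (12 x 35, assumed).
        self.audio_enc = nn.Sequential(nn.Flatten(), nn.Linear(12 * 35, 256), nn.ReLU())
        # Decoder: fuses both embeddings and emits one 64x64 frame.
        self.decoder = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(),
                                     nn.Linear(1024, 64 * 64), nn.Sigmoid())

    def forward(self, face, audio):
        z = torch.cat([self.face_enc(face), self.audio_enc(audio)], dim=1)
        return self.decoder(z).view(-1, 1, 64, 64)

# One frame per audio window; sliding the window over the clip yields a video.
model = TalkingFaceNet()
frame = model(torch.rand(1, 1, 64, 64), torch.rand(1, 1, 12, 35))
print(frame.shape)  # torch.Size([1, 1, 64, 64])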

One of the intended applications for this project is smart video dubbing. Dubbing in TV shows and films is very common, especially for films screened in countries where the spoken language is different from that of the country of origin. This project could improve the quality of dubbed videos by re-synthesising the mouth movements to match the dubbed audio more accurately.

For more information, refer to the white paper titled: You Said That? [14].

LIP READING USING DEEP LEARNING METHODS

Presented by Joon Son Chung of University of Oxford.

The goal of the project presented was to teach a machine to read lips. Lip reading is a very challenging problem that requires a lot of skill for humans to perform, as it requires inferring words based solely on lip movement; the challenge is made more difficult by the existence of homophemes (different words that sound different but produce the same lip movements, such as bark, park and mark). The presentation demonstrated a novel architecture called the Watch, Listen, Attend and Spell (WLAS) network, which learned to predict the characters in sentences being spoken from a video of a talking face, with or without audio [15].
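
Conceptually, the network is a sequence-to-sequence model: encoders "watch" the video frames (and can "listen" to audio), while a decoder "attends" over the encoded sequence and "spells" the sentence one character at a time. The following is an assumed, minimal PyTorch sketch of that attend-and-spell loop (the dimensions, vocabulary size and greedy decoding are invented for illustration and do not reproduce the paper's architecture):

import torch
import torch.nn as nn

VOCAB = 40          # characters: letters, digits, space, punctuation, end token
FEAT, HID = 512, 256

# "Watch": per-frame visual features (here just a placeholder linear encoder).
watch_enc = nn.Sequential(nn.Linear(FEAT, HID), nn.ReLU())

# "Attend and Spell": an LSTM decoder with additive attention over the frames.
embed = nn.Embedding(VOCAB, HID)
decoder = nn.LSTMCell(HID + HID, HID)
attn_score = nn.Linear(HID + HID, 1)
to_chars = nn.Linear(HID, VOCAB)

def spell(frame_feats, max_len=30):
    # Greedy character decoding from a (T, FEAT) sequence of lip frames.
    enc = watch_enc(frame_feats)                       # (T, HID)
    h = torch.zeros(1, HID)
    c = torch.zeros(1, HID)
    prev_char = torch.zeros(1, dtype=torch.long)       # start token (index 0)
    out = []
    for _ in range(max_len):
        # Attention: score each encoded frame against the decoder state.
        scores = attn_score(torch.cat([enc, h.expand(enc.size(0), -1)], dim=1))
        context = (torch.softmax(scores, dim=0) * enc).sum(0, keepdim=True)
        h, c = decoder(torch.cat([embed(prev_char), context], dim=1), (h, c))
        prev_char = to_chars(h).argmax(dim=1)
        out.append(prev_char.item())
    return out

print(spell(torch.rand(75, FEAT)))   # e.g. 75 frames = 3 seconds at 25 fps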

The WLAS network was trained with almost 5,000 hours' worth of training data, which took 10 days to complete on an Nvidia Titan X with 12 GB of memory [15]. However, the results are quite spectacular, achieving a lower word error rate than that of a professional lip reader.

One of the applications mentioned for this project is creating smart subtitles for TV shows where there isn't a script, such as the news. The current system for live TV subtitles leaves the subtitles out of sync with the video, and mistakes due to human error are not uncommon. This technology could also allow video and audio to be dynamically re-synchronised whenever they drift out of sync.

For more information, refer to the white paper titled: Lip Reading Sentences in the Wild [15].

 

There will be one further blog in this short series, looking at three more presentations from the event.

References:

[8] Max Planck Institute, “Volumetric Reconstruction,” [Online]. Available: https://ps.is.tue.mpg.de/research_projects/volumetric-reconstruction. [Accessed 14 September 2017].

[9] bilderzucht-blog, “3D Pixel / Voxel,” 24 May 2010. [Online]. Available: http://www.bilderzucht.de/blog/3d-pixel-voxel/. [Accessed 14 September 2017].

[10] G. Riegler, A. O. Ulusoy and A. Geiger, “OctNet: Learning Deep 3D Representations at High Resolutions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, 2017.

[11] C. Hu, D. Li, Y.-Z. Song and T. M. Hospedales, “Now You See Me: Deep Face Hallucination for Unviewed Sketches,” in British Machine Vision Conference 2017, London, 2017.

[12] A. Nagrani and A. Zisserman, “From Benedict Cumberbatch to Sherlock Holmes: Character Identification in TV series without a Script,” in British Machine Vision Conference 2017, London, 2017.

[13] J. S. Chung, “YouTube,” 8 May 2017. [Online]. [Accessed 16 September 2017].

[14] J. S. Chung, A. Jamaludin and A. Zisserman, “You Said That?,” in British Machine Vision Conference 2017, London, 2017.

[15] J. S. Chung, A. Senior, O. Vinyals and A. Zisserman, “Lip Reading Sentences in the Wild,” in IEEE Conference on Computer Vision and Pattern Recognition, Hawaii, 2017.

[16] J. S. Chung, “YouTube,” 17 November 2016. [Online]. [Accessed 17 September 2017].