
Samuel Kong, a Sondrel Engineering Consultant and alumnus of Imperial College, attended this year's British Machine Vision Conference, held at his old university.

In his first blog, Samuel gave a summary explanation of Image Convolution, Artificial Neural Networks (ANN) and Convolutional Neural Networks (CNN). In this post we will look at some specific projects presented over the course of the conference, showing current innovation and development in these fields:

  • Probabilistic and Deep Models for 3D Reconstruction
  • Deep Face Hallucination for Unviewed Sketches
  • Character Identification in TV series without a Script
  • You Said That?
  • Lip Reading in Profile


Probabilistic and Deep Models for 3D Reconstruction

Presented by Andreas Geiger of the Max Planck Institute.

This project presented refined solutions to two different problems. The first was building 3D models of landmarks from multiple photos taken at known angles and distances. The second was representing 3D models in the form of voxels.

For the first problem, modelling 3D landmarks, previous work used deterministic techniques to reconstruct 3D objects or landmarks. Such techniques struggle with surfaces that have little or no texture, such as a grass lawn, and with reflective surfaces. The project presented uses a novel probabilistic technique that exposes uncertainty and makes use of non-local priors. Figure 10 and Figure 11 below are screenshots from the demonstration video. Figure 10 shows the overall problem being tackled, where multiple photos of a landmark (about 220 in their example) are analysed to build up a 3D volumetric model of the landmark itself. Figure 11 shows how their method combines non-local priors with probabilistic inference to construct a more accurate model and to resolve surfaces that lack texture or are reflective.



The second problem concerned the data structure used to represent 3D models. Typically, a 3D model is constructed out of voxels, the 3D equivalent of the pixels in a 2D image, laid out in a 3D grid. Figure 12 below shows a voxel representation of a car at different resolutions.


The problem highlighted by the presentation was that increasing voxel resolution puts a heavy strain on memory. For a square 2D image, the resource requirement scales with the square of the image width; for a cubic 3D voxel model, it scales with the cube of the width. In addition, the presentation noted that the voxel space is sparsely populated: most regions are empty and contain no information. To cut down on the memory spent on these empty regions, a new method of structuring voxels was presented.
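
To put rough numbers on that scaling, the short Python sketch below compares a dense square image with a dense cubic voxel grid of the same width. The function names and the one-byte-per-cell figure are assumptions made for illustration; they do not come from the presentation.

```python
def dense_grid_cells(width: int, dims: int) -> int:
    """Number of cells in a dense square (dims=2) or cubic (dims=3) grid."""
    return width ** dims


def dense_grid_bytes(width: int, dims: int, bytes_per_cell: int = 1) -> int:
    """Memory for a dense grid, assuming a fixed number of bytes per cell."""
    return dense_grid_cells(width, dims) * bytes_per_cell


if __name__ == "__main__":
    # Doubling the width multiplies the 2D cost by 4 and the 3D cost by 8.
    for width in (64, 128, 256, 512):
        image = dense_grid_bytes(width, dims=2)
        voxels = dense_grid_bytes(width, dims=3)
        print(f"width {width:4d}: 2D image {image:>10,d} B, "
              f"3D voxel grid {voxels:>14,d} B")
```

At a width of 256, the image needs 65,536 cells while the voxel grid needs 16,777,216, and every further doubling of the width widens that gap by another factor of two.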

OctNet is a project that partitions a 3D voxel space efficiently, so that neighbouring voxels containing no information, or the same information, are grouped together to form a single larger voxel. With this technique, 3D models can be high resolution yet efficient in memory usage. Figure 13 shows a comparison between a dense voxel space representation and an OctNet representation.
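
The grouping idea can be sketched with a plain recursive octree. This is a simplified illustration only, not OctNet's actual data structure: the set-of-coordinates representation, the names `count_occupied` and `count_leaves`, and the example shape are all assumptions. A cube becomes a single cell when everything inside it is uniform (all empty or all occupied); otherwise it is split into eight octants.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]


def count_occupied(occupied: Set[Voxel], origin: Voxel, size: int) -> int:
    """Count occupied voxels inside the cube starting at origin with side size."""
    x0, y0, z0 = origin
    return sum(
        1
        for (x, y, z) in occupied
        if x0 <= x < x0 + size and y0 <= y < y0 + size and z0 <= z < z0 + size
    )


def count_leaves(occupied: Set[Voxel], origin: Voxel, size: int) -> int:
    """Number of cells needed when uniform cubes are merged into one cell.

    size must be a power of two so that cubes split evenly into octants.
    """
    inside = count_occupied(occupied, origin, size)
    if inside == 0 or inside == size ** 3:
        return 1  # uniform cube (all empty or all full): store a single cell
    half = size // 2
    x0, y0, z0 = origin
    total = 0
    for dx in (0, half):
        for dy in (0, half):
            for dz in (0, half):
                total += count_leaves(occupied, (x0 + dx, y0 + dy, z0 + dz), half)
    return total


if __name__ == "__main__":
    # A solid 2x2x2 block in one corner of an otherwise empty 8x8x8 grid.
    block = {(x, y, z) for x in range(2) for y in range(2) for z in range(2)}
    print("dense cells: ", 8 ** 3)                            # 512
    print("octree cells:", count_leaves(block, (0, 0, 0), 8))  # 15
```

In this toy case the dense grid stores 512 cells while the merged representation stores 15, because the seven empty octants at each level collapse into one cell apiece.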


For more information, refer to



Deep Face Hallucination for Unviewed Sketches

Presented by Conghui Hu of Queen Mary University of London.

This project takes a criminal facial composite sketch and synthesises an inferred, life-like facial image from it. It proposes an improved method of synthesising facial images using a set of different neural networks. Figure 14 below shows the results of different synthesis algorithms, including the one proposed in the presentation. Column ‘a’ contains the input sketches and column ‘e’ the ground-truth photos.


The solution presented uses a set of CNNs and a Generative Adversarial Network (GAN) model to generate accurate facial images. In a GAN, two networks are trained against each other: a generator attempts to synthesise a facial image, while a discriminator attempts to tell synthesised images apart from real photos. As the discriminator gets better at spotting fakes, the generator is pushed to produce crisper and more accurate images. Figure 15 below shows the architecture presented during the presentation.
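
As a rough illustration of adversarial training, the sketch below computes the two loss values of the standard GAN formulation, in which a generator produces images and a discriminator scores how likely an image is to be real. It assumes the discriminator outputs a probability strictly between 0 and 1; the function names are hypothetical, and the losses actually used in the presented work may differ.

```python
import math


def discriminator_loss(d_real: float, d_fake: float) -> float:
    """Cross-entropy loss for the discriminator.

    d_real: discriminator's probability that a real photo is real.
    d_fake: discriminator's probability that a generated image is real.
    The loss is low when d_real is near 1 and d_fake is near 0.
    """
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake: float) -> float:
    """Non-saturating generator loss.

    The generator is rewarded when the discriminator scores its output
    as real, so the loss falls as d_fake approaches 1.
    """
    return -math.log(d_fake)


if __name__ == "__main__":
    # A confident, correct discriminator: low loss for it, high for the generator.
    print(discriminator_loss(d_real=0.9, d_fake=0.1), generator_loss(d_fake=0.1))
    # A fooled discriminator: its loss rises while the generator's loss falls.
    print(discriminator_loss(d_real=0.6, d_fake=0.8), generator_loss(d_fake=0.8))
```

The two losses pull in opposite directions on `d_fake`, which is what drives the generator towards more realistic output.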


For more information, refer to



Character Identification in TV series without a Script

Presented by Arsha Nagrani of the University of Oxford.

The aim of this project was to train a computer to recognise the characters in a TV series and to track them through the scenes in which they are visible. The example used was the TV series Sherlock, which stars the actor Benedict Cumberbatch. The presenter summarised the challenges she tackled in training a neural network to recognise characters using only a training set of photos of the actors. She stated that photos of actors are not representative of how they look in the show, because their appearance is heavily modified and the lighting and scenery are completely different to those of a photoshoot. To get around this, the project proposes a three-stage architecture. Figure 16 shows the three stages of learning to identify the characters within the TV show. Figure 17 shows some of the results the project produced: the algorithm is capable of detecting, identifying and tracking the faces of characters in different environments and lighting, and even when they are partially obstructed or blurred in the background.



The presenter said that the intended future application for the project is in video playback devices, where users could skip to a specific point in a video based on where certain characters appear or are grouped together. She also acknowledged that the technology is still in its early days and that no hardware yet exists that can run the process in real time. The current platform is a desktop PC equipped with high-end Intel i7 processors and very high-end Nvidia Titan X graphics cards.

For more information, refer to the paper titled: From Benedict Cumberbatch to Sherlock Holmes: Character Identification in TV series without a Script [12].



You Said That?

Presented by Joon Son Chung of the University of Oxford.

This project was a demonstration of how machine vision can learn the faces of people and generate a video of a talking face. The examples presented at the conference were a synthesised face of Jeremy Corbyn repeating a Forrest Gump quote, and a modified version of a news recording in which the interviewee is dubbed to appear to be singing. The news recording example can be found on YouTube. Figure 18 below is a snippet from the YouTube video showing 12 different actors and politicians and, next to them, the synthesised images of them speaking.


The model used by this project to create the synthetic and dubbed images is called Speech2Vid [14]. Figure 19 below shows the model and its architecture. The architecture uses many CNN layers to generate the end result, and these layers require training from a database of photos. The training process itself can take up to 37 hours [14].


One of the applications intended for this project is smart video dubbing. Dubbing is very common in TV shows and films, especially for films screened in countries where the spoken language differs from that of the country of origin. This project can improve the quality of dubbed videos by matching the mouth movements more accurately to the dubbed audio.

For more information, refer to the paper titled: You Said That? [14].



Lip Reading in Profile

Presented by Joon Son Chung of the University of Oxford.

The goal of the project presented was to teach a machine to read lips. Lip reading is a very challenging problem that takes humans a lot of skill, as it requires the inference of words based solely on lip movement. The challenge is made more difficult by the existence of homophemes: words which sound different but require the same lip movement, such as bark, park and mark. The presentation demonstrated a novel architecture called the Watch, Listen, Attend and Spell (WLAS) network, which learned to predict the characters in sentences being spoken from a video of a talking face, with or without audio [15]. Figure 20 below shows the WLAS architecture used in the proposed solution.
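
To see why words such as bark, park and mark are hard to separate by sight alone, the toy sketch below collapses consonants that look alike on the lips into one visual class and groups words that end up with the same visual key. Working from spelling rather than phonemes is a deliberate simplification, and the mapping table and function names are assumptions for illustration; none of this is part of the WLAS network.

```python
from typing import Dict, List

# Assumed, simplified mapping: 'b', 'p' and 'm' are all formed by closing
# the lips, so they are treated here as one visual class, written 'P'.
VISEME_OF: Dict[str, str] = {"b": "P", "p": "P", "m": "P"}


def viseme_key(word: str) -> str:
    """Replace each letter with its visual class, keeping other letters as-is."""
    return "".join(VISEME_OF.get(ch, ch) for ch in word.lower())


def group_homophemes(words: List[str]) -> List[List[str]]:
    """Group words that share a visual key; keep only groups of two or more."""
    groups: Dict[str, List[str]] = {}
    for word in words:
        groups.setdefault(viseme_key(word), []).append(word)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    print(group_homophemes(["bark", "park", "mark", "dark"]))
    # [['bark', 'park', 'mark']]
```

A lip reader, human or machine, has to resolve such groups from the surrounding context of the sentence.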


The WLAS network was trained on almost 5000 hours’ worth of training data, which took 10 days on an Nvidia Titan X with 12GB of memory [15]. The results are quite spectacular, however, achieving a lower word error rate than that of a professional lip reader. Figure 21 below is a snippet from the video demonstration used in the conference.
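
Word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn the predicted sentence into the reference sentence, divided by the number of words in the reference. The sketch below is a minimal version of that calculation using a word-level edit distance; the function name and example sentences are illustrative and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One wrong word out of four gives a word error rate of 25%.
    print(word_error_rate("the dog will bark", "the dog will park"))  # 0.25
```

A lower value is better, and a value above 1.0 is possible when the hypothesis contains many extra words.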


One of the applications mentioned for this project is creating smart subtitles for TV shows that have no script, such as the news. Current subtitling for live TV leaves the subtitles out of sync with the video, and mistakes due to human error are not uncommon. The technology could also allow video and audio to be dynamically resynchronised whenever they go out of sync.

For more information, refer to the paper titled: Lip Reading Sentences in the Wild [15].


There will be one further blog in this short series, looking at 3 more presentations from the event:

  • Exploring the Structure of a Real-Time, Arbitrary Neural Artistic Stylisation Network
  • PixColour: Pixel Recursive Colourisation
  • HoloLens: Computer Vision Meets Mixed Reality


[8]      Max Planck Institute, “Volumetric Reconstruction,” [Online]. Available: [Accessed 14 September 2017].

[9]      bilderzucht-blog, “3D Pixel / Voxel,” 24 May 2010. [Online]. Available: [Accessed 14 September 2017].

[10]    G. Riegler, A. O. Ulusoy and A. Geiger, “OctNet: Learning Deep 3D Representations at High Resolutions,” in CVPR 2017, Honolulu, 2017.

[11]    C. Hu, D. Li, Y.-Z. Song and T. M. Hospedales, “Now You See Me: Deep Face Hallucination for Unviewed Sketches,” in British Machine Vision Conference 2017, London, 2017.

[12]    A. Nagrani and A. Zisserman, “From Benedict Cumberbatch to Sherlock Holmes: Character Identification in TV series without a Script,” in British Machine Vision Conference 2017, London, 2017.

[13]    J. S. Chung, “YouTube,” 8 May 2017. [Online]. [Accessed 16 September 2017].

[14]    J. S. Chung, A. Jamaludin and A. Zisserman, “You Said That?,” in British Machine Vision Conference 2017, London, 2017.

[15]    J. S. Chung, A. Senior, O. Vinyals and A. Zisserman, “Lip Reading Sentences in the Wild,” in IEEE Conference on Computer Vision and Pattern Recognition, Hawaii, 2017.

[16]    J. S. Chung, “YouTube,” 17 November 2016. [Online]. [Accessed 17 September 2017].