Sondrel Hardware Engineering Manager
The Advanced Video and Signal-Based Surveillance (AVSS) Conference is an IEEE conference now in its 14th year. It aims to gather the advances in science and technology related to this field, and involves industry partners to provide their view of current market trends.
The conference was aimed primarily at advances in research from an algorithm and software (SW) implementation point of view, therefore relatively distant from the underlying hardware (HW) implementations.
The current trend in the implementation of computer vision products appears to follow a similar pattern to IoT devices: key players such as NVIDIA provide development platforms which include HW solutions, while product developers appear reluctant to invest in HW-hardened solutions because of the fluidity of the algorithms, and so prefer a "standard" platform with different SW implementations which can be changed as necessary.
My personal perception is that this is now changing: optimised HW becomes more critical as the computational load increases exponentially, and the algorithms are now encompassing multiple Convolutional Neural Networks (CNNs) in parallel and/or in series – in products distributed across a wide area, potentially with limited access to power supplies.
The 2017 edition was held in Lecce (Italy) between the 29th of August and the 1st of September. It consisted of a day of hosted workshops and three days of talks delivered by the authors of the accepted papers. The quality of the papers was high, as generally acknowledged by the attendees.
The keynote speakers were well-known in the field and provided a wide-ranging view of the state-of-the-art, and directions for the field.
Here is a summary of the keynote speeches, which are a good distillation of the overall conference themes.
SILVIO SAVARESE, STANFORD UNIVERSITY (HTTP://CVGL.STANFORD.EDU/SILVIO/)
The first keynote speech was on “visual intelligence in perspective”. The main point of this talk was that, while current computer vision solutions are impressive and sometimes capable of exceeding human skill at tasks such as recognising objects, they lack the context needed to interpret a scene: the key example was that of a security robot which had recently run over a toddler who had fallen on the floor and was reaching out for his toy (the toddler was unharmed). Less critical examples include interpreting the role of a woman standing in a café, equally distant from a queue of people and from some people at a table: is she simply standing there, waiting for her turn in the queue, or taking customers’ orders?
Prof. Savarese’s work focuses on providing this context by gathering information from different sensing modalities and processing them in parallel to segment the recognised objects into “parts” which have relationships with each other and are “actionable” (e.g. selecting a handle to pick an object up). Thus, the training part of the process (which is as critical as the underlying model itself) needs to consider not only the parts, but also their features and the relationships between the parts in the context of the objects. Current solutions are generally incapable of providing this information because they lack the contextual overview, and so fail at scene interpretation and predictive reaction.
Social acceptability, and the range of meanings it takes across different cultures, was also touched on here. It indicates the need for large, culturally diverse data sets to train the AI networks and ensure correct behaviour of agents in the real world; in other words, the navigation task in itself is limited if the social and cultural contexts are ignored. The technologies that enable this context-aware analysis require a large amount of computation: while current platforms can reach near-real-time computation, sacrifices have to be made, and the overall results of real-time analysis appear poorer than those of batch processing.
One particularly interesting outcome of this research in contextual enrichment of sensory input is “Jackrabbot”, a “polite” ambling robot capable of navigating crowded environments (it was featured by many news outlets, including the BBC last year: http://www.bbc.com/news/av/technology-36836243/jackrabbot-why-this-robot-is-watching-how-you-move). The AI unit was trained on drone footage which captured a wide variety of human interactions, so that the robot could mimic them and behave in a socially acceptable manner, incorporating the learned context of the interactions into its computations.
STAN SCLAROFF, BOSTON UNIVERSITY (HTTP://WWW.CS.BU.EDU/~SCLAROFF/)
Prof. Sclaroff’s work also focuses on contextual information and saliency. This talk was more technical, focusing on Recurrent Neural Network (RNN) models and their application to space-time localisation and to human behaviour and interaction.
Two aspects were important beyond the technical material in Prof. Sclaroff’s talk:
- The field is really in its infancy when it comes to context-aware, semantic-driven perception – this aspect is very important and will be picked up further down in this report. While the advances are very significant, the level at which these can be really deployed in the field is limited.
- As an aside, consider a discussion with a delegate from Disney Research, who pointed out that currently they would not consider deploying such techniques in the field, because they are concerned about the behaviour of the system in corner-cases, and the negative impact this might have on visitors at their theme parks (“if the system is not 100%, we would not deploy it. You could not tell a child that he cannot get on a ride because the AI system does not recognise him/her because it’s only 99% accurate”). Their involvement is very topical and highly specialised, but nevertheless this delegate’s view echoed Prof. Sclaroff’s that it is too early to deploy highly complex systems of this kind in the wild. This is, of course, already happening, and in a way the failures are helping shape further developments, but current successful deployments are limited to cases where human interaction is heavily controlled.
- Quality of result really depends on the quality of the input. There might be a business case for simply gathering up a large amount of salient footage for different circumstances, from multiple angles and with a range of different image/video characteristics (and critically with accurate labelling), but this would be heavily compromised by valid concerns on privacy, which in different countries are enforced by law at different levels (again, the cultural aspect returns with a different expression).
Related to the above point, the current trend for implementation is to stick to known platforms, to allow a seamless switch between multiple models and/or different implementations without having to change the HW. This is an important point, with implications for Sondrel’s involvement in the field (please refer to subsequent sections for a more detailed analysis).
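The Disney delegate’s “99% accurate” remark can be put into rough numbers. The sketch below is only an illustration: the daily visitor count is a hypothetical figure, not one quoted at the conference, and recognition errors are assumed to be independent between visitors.

```python
def expected_failures(visitors: int, accuracy: float) -> float:
    """Expected number of visitors the system fails to recognise."""
    return visitors * (1.0 - accuracy)


def prob_any_failure(visitors: int, accuracy: float) -> float:
    """Probability that at least one visitor is not recognised.

    Assumes recognition errors are independent between visitors.
    """
    return 1.0 - accuracy ** visitors


if __name__ == "__main__":
    visitors_per_day = 50_000  # hypothetical figure, for illustration only
    accuracy = 0.99            # the "99% accurate" case from the quote
    failures = expected_failures(visitors_per_day, accuracy)
    p_any = prob_any_failure(visitors_per_day, accuracy)
    print(f"expected failures per day: {failures:.0f}")
    print(f"P(at least one failure):   {p_any:.6f}")
```

Under these assumptions a 1% error rate still leaves hundreds of visitors per day unrecognised, which is the kind of corner-case exposure the delegate was concerned about.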
ROBERTO CIPOLLA, CAMBRIDGE UNIVERSITY (HTTPS://MI.ENG.CAM.AC.UK/~CIPOLLA/)
Prof. Cipolla’s talk focused on the uncertainty in deep learning models, and the role of Bayesian models as a framework to understand such uncertainty, “aiding interoperability and safety” of AI systems. It also pointed out how knowledge of geometric information can help design networks which can be trained with unlabelled data, for example for human body pose and shape recovery.
Prof. Cipolla is very close to industrial implementations: he showed a MobilEye collision detection system, based on some of his published work, and Metail (https://metail.com/), which enables users to virtually try on clothes by modifying the depicted models to match their body shape. Originally the body shape could be extracted from a picture uploaded by the user; this feature has now been discontinued because of users’ privacy concerns.
The models for Bayesian networks (he proposed his SegNet – a convolutional encoder-decoder which comprised cascaded CNNs) and others are available from Prof. Cipolla’s website (many of the models are widely available on the Internet for free download).
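One widely used way to obtain the kind of uncertainty estimate discussed in this talk is Monte Carlo dropout: dropout is kept active at inference time, the same input is passed through the network several times, and the spread of the outputs is read as a measure of model uncertainty. The sketch below shows the idea on a toy linear model using only the Python standard library; the weights, input and function names are made up for illustration, and it is not the SegNet code.

```python
import random
import statistics


def stochastic_forward(x, weights, drop_prob, rng):
    """One forward pass of a toy linear model with dropout left ON at inference."""
    keep = 1.0 - drop_prob
    total = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() < keep:
            # Inverted-dropout scaling keeps the expected output unchanged.
            total += xi * wi / keep
    return total


def mc_dropout_predict(x, weights, drop_prob=0.5, passes=200, seed=0):
    """Return (mean, standard deviation) of the output over several stochastic passes."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    rng = random.Random(seed)
    samples = [stochastic_forward(x, weights, drop_prob, rng) for _ in range(passes)]
    return statistics.mean(samples), statistics.pstdev(samples)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]        # hypothetical input features
    w = [0.5, -0.25, 0.1]      # hypothetical trained weights
    mean, spread = mc_dropout_predict(x, w)
    print(f"prediction: {mean:.3f}, uncertainty (std): {spread:.3f}")
```

A large spread relative to the mean would be a signal to treat the prediction with caution, for example by deferring the decision to a human.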
The industrial presentations were offered by the following companies (with a brief description of domain and focus):
- NVIDIA
Providing HW and SW platforms for deep learning AI
- BOSCH Security Systems/Intelligent Video Analysis (https://uk.boschsecurity.com/en/)
Smart city applications, video analytics in real-time performed by cameras (developed by BOSCH) with metadata sent off for further computation
- MARCH Networks (https://www.marchnetworks.com/)
Video surveillance and business intelligence systems, financial, retail, corporate, leveraging computer vision for merchandizing
- LEONARDO Security and Information Systems (http://www.leonardocompany.com/en/product-services/sicurezza-infrastrutture-critiche-security-critical-systems)
Crime incident management, audio/speech recognition, biometric-on-the-move for border control
- OSRAM
Advanced surveillance to smart lighting, beyond people detection – attention, gaze estimation
- Italian Association of experts in Critical Infrastructures
Critical infrastructure monitoring
- GE Video Analytics (https://www.ge.com/)
Smart city, healthcare (hospital management), military applications (programmes with DARPA – e.g. threat detection), transport, security
The talks offered a high-level overview of the companies' offerings and strategies, which are summarised below. Amongst them, the NVIDIA presentation was particularly noteworthy considering that most of the technical talks referred to NVIDIA's platforms used for development.
The presentations raised several useful points:
- NVIDIA's presentation highlighted the company's offerings for AI, in particular "Metropolis" (https://www.nvidia.com/en-us/deep-learning-ai/industries/ai-cities/). The NVIDIA website provides a rich set of material to learn more (for instance https://www.nvidia.com/en-us/deep-learning-ai/ and https://developer.nvidia.com/deep-learning-frameworks). NVIDIA provides edge and data centre/cloud platforms for a variety of applications.
- Most of the presenters highlighted the explosion of data processing due not only to increased sensory input availability, but also to the increased availability of the devices themselves, such as cameras in a city environment. It can be envisioned that thousands of cameras will be available across a city, and the processing of the sensory inputs from these devices requires a smart approach to the distribution of this information. BOSCH cameras provide built-in video analytics, with metadata extracted in real time and sent to the server; their efforts are focusing on applying smart, adaptive encoding and quality adjustments to reduce the bandwidth of the data being sent across the links.
- The idea that "the real world is messy" (NVIDIA and BOSCH made this point quite strongly), and that traditional video analytics is not trustworthy enough because of both lack of context and limited capabilities, for instance when analysing footage covering large crowds, in different lighting conditions etc. Outdoor conditions in general are a problem for analytics (BOSCH): for example, analysing footage during a snow storm is challenging because the weather effects on the footage confuse the AI systems. No current system is robust against these effects.
- Current deployments (NVIDIA) include behavioural checks in a city environment, for instance pedestrian usage of zebra crossings vs jaywalking, usage of safety belts in cars etc.
- LEONARDO pointed out a less-discussed sensory input: audio feeds. The insight that one operator can monitor multiple video feeds but only one audio feed was very interesting. LEONARDO is working on AI systems which analyse combinations of video inputs with audio, adding a new dimension to contextual identification.
- In general, "smart cities" came up repeatedly as a focus of NVIDIA, BOSCH, MARCH Networks and OSRAM. The surveillance is intended to go beyond traditional applications such as people counting and identification of suspects from footage. Rather, the challenge is two-fold: on the one hand, the surveillance is shifting to behavioural analysis (gaze tracking, posture and activities, unexpected or "non-standard" behaviour by individuals); on the other, the analysis attempts to optimise resources (e.g. energy and road utilization). Both can be considered in terms of policing by city government bodies, and also in terms of business management: human behaviour and interactions can be analysed to identify customer engagement inside and outside shops (for instance, reactions after visiting a shop).
- Going beyond the smart city applications, the Italian Association of experts in Critical Infrastructures considered the analysis of environments and infrastructures themselves to identify threats to the integrity of, for instance, supplies to the population (such as electricity and gas).
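The bandwidth argument behind in-camera analytics, raised in the points above, can be illustrated with simple arithmetic. Every figure in the sketch below (camera count, video bitrate, metadata size, frame rate) is a hypothetical placeholder chosen for illustration, not a number reported by BOSCH or any other presenter.

```python
def video_bps(bitrate_mbit_s: float) -> int:
    """Bit rate of one encoded video stream, in bits per second."""
    return int(bitrate_mbit_s * 1_000_000)


def metadata_bps(objects_per_frame: int, bytes_per_object: int, fps: int) -> int:
    """Bit rate of per-frame object metadata sent instead of the video."""
    return objects_per_frame * bytes_per_object * 8 * fps


def city_total_bps(cameras: int, per_camera_bps: int) -> int:
    """Aggregate bit rate arriving at the server from all cameras."""
    return cameras * per_camera_bps


if __name__ == "__main__":
    cameras = 1_000                      # hypothetical city deployment
    video = video_bps(4.0)               # hypothetical 4 Mbit/s stream per camera
    meta = metadata_bps(10, 50, 25)      # hypothetical: 10 objects, 50 bytes each, 25 fps
    print(f"video only:    {city_total_bps(cameras, video) / 1e6:.0f} Mbit/s")
    print(f"metadata only: {city_total_bps(cameras, meta) / 1e6:.0f} Mbit/s")
    print(f"reduction:     {video // meta}x")
```

With these placeholder values the aggregate load drops from 4000 Mbit/s to 100 Mbit/s; the real saving depends entirely on scene content and on how much video still has to be streamed alongside the metadata.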
The fundamental point that came up repeatedly is how far we are from a stable, standardised methodology for intelligent video surveillance.
- Research is ongoing on the type of neural network model which offers the best results. Different models vary in performance depending on the circumstances – for instance, a model which is powerful in recognising people in a crowd shows poor performance when used to analyse behaviours. In addition, some models are better/worse in different lighting conditions, image resolution etc. Therefore, there is a strong argument to separate the HW from the SW, to enable the modification of the underlying model as described previously.
- Thus, deployments (considering the various limitations of the models) focus on SW solutions running on GPUs, and any HW is developed using off-the-shelf components. This seems to be primarily to ensure models can be modified/upgraded late in the product lifecycle. Some companies are reluctant to use fully-fledged solutions in the wild considering the uncertainty in outcome – particularly because of the unexpected failures in different real-world conditions (rain, snow, blinding and low light are some examples). Of course, there are outliers – as we are aware, some companies such as BOSCH do have dedicated HW for cameras and are deploying these systems to an extent in the real world (although it is not known at this stage whether they rely on dedicated SoCs/ASICs or if they use development platforms). However, this was not the general trend observed at the conference.
- Audio surveillance does not have the same level of attention as video surveillance.
- The context/saliency aspect is probably the most important research effort at this stage. Recall that this aims at enriching the results with context information, to ensure that any course of action taken in response to the results is appropriate considering human expectations. This involves a dramatic increase in computation needs, because it typically relies on cascaded or parallel neural networks which need to work in real time.
- City-wide surveillance was mentioned repeatedly, while automotive was mentioned only once (in Prof. Cipolla's talk, where he described the MobilEye solution which, in the context of this conference, is relatively light-weight in terms of computational complexity). The challenges of smart, adaptive bandwidth for camera-to-datacentre communication, robust all-weather performance, and reduced power consumption while maintaining high computing performance are a strong focus for the industry.
- Another common trend is the identification of deviations from normal behaviour: the behaviour of an individual is analysed in various circumstances in order to highlight behaviour which hints at a possible threat. Context-aware analysis, which incorporates an understanding of cultural differences, social interaction and the general variety of human behaviour, is critical here, and presents a significant challenge for AI systems: considering the underlying uncertainty in perception and interpretation, the risk of false positives and over-reaction is very high and can lead to time-consuming (and potentially socially unacceptable) action by police forces. One example of this was shown in a different presentation by a research team: their system attempted to label "bystanders" in the context of a poster presentation, as opposed to "engaged" individuals who are participating in the conversation with the author and/or are actively reading the poster. Shy people were incorrectly labelled as "bystanders" because their body language did not conform to the "normal" behaviour as perceived by the system (which of course was trained using footage selected by the researchers). It is easy to imagine an anxious person being incorrectly labelled as a threat because they appear to dither in front of a queue at border control, for instance. This can be mitigated by deploying a human-in-the-loop approach (as pursued by Disney Research in their paper) where the system flags abnormal behaviour (threats or simply someone who is not enjoying a visit to a theme park) and then a human user is responsible for choosing the correct course of action. This hybrid solution is acceptable in most applications, but fails in the context of massive surveillance networks.
- Cloud processing is the next step. This moves the system beyond the traditional approach of maintaining local servers for analysis, shifting the training and inference activities into the datacentre, which is remote with respect to the “edge” (the sensory devices). NVIDIA provides platforms for these applications, and MARCH Networks is starting to explore them.
- Privacy concerns may become a serious threat to the deployment of such solutions: potentially, raw data showing individuals' behaviours and personal habits would be uploaded onto third-party datacentres for analysis. This is raising difficult ethical questions about rights to access this material, and the general public might have an adverse reaction to such a solution – also considering that the traditional approach (which is significantly less intrusive than the proposed solutions) already causes a negative response from a significant portion of the population in various countries around the world. Consider the changes made to Metail: customers were unhappy about having to upload their images (especially because they would have needed to be rather personal, such as wearing limited clothing to identify the body shape), so the system had to be modified to allow the customers to shape an online model to their body sizes (in a way limiting significantly the impact of the system). Similar concerns might be raised in the future for city surveillance systems, where, in contrast to an AI product in the home, the individuals being monitored did not explicitly agree to their data being processed by a third party.
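The human-in-the-loop approach mentioned above, and one plausible reason it struggles at city scale, can be sketched as follows. The score threshold, the data structure and the operator capacity are all hypothetical; the sketch only assumes that the system produces an anomaly score per observed person and never acts on its own.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    track_id: int
    anomaly_score: float  # hypothetical score in [0, 1]; higher means less "normal"


def flag_for_review(observations, threshold=0.8):
    """Return the observations a human operator should look at.

    The system only flags; the operator chooses the course of action.
    """
    return [obs for obs in observations if obs.anomaly_score >= threshold]


def operators_needed(flags_per_hour: int, reviews_per_operator_hour: int) -> int:
    """Operators required to keep up with the flagged cases (ceiling division)."""
    return -(-flags_per_hour // reviews_per_operator_hour)


if __name__ == "__main__":
    seen = [Observation(1, 0.15), Observation(2, 0.92), Observation(3, 0.81)]
    for obs in flag_for_review(seen):
        print(f"track {obs.track_id}: send to operator (score {obs.anomaly_score:.2f})")

    # Hypothetical workloads: a single venue versus a city-wide camera network.
    print("venue operators:", operators_needed(30, 60))
    print("city operators: ", operators_needed(6_000, 60))
```

The review step scales linearly with the number of flags, so the same design that is manageable for one venue needs a large operator pool once thousands of cameras feed it.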
The video analytics field is a rich environment for research and development. The current trends towards context-awareness and semantic feature identification indicate a significant, exponential increase in the processing needs of nodes and servers in the near future, which will require dedicated chip design solutions to address the increases in power consumption, the bandwidth bottlenecks and the integration challenges.