ACCV 2016 Demonstrations

Room 201
Taipei International Convention Center (TICC)
13:30-15:30 on November 21 to 23

D1. Stereo Depth based Human Detection and Tracking for Queue Analysis
Csaba Beleznai (AIT Austrian Institute of Technology GmbH)
D2. Component-Based Distributed Framework for Coherent and Real-Time Video Dehazing
Meihua Wang, Jiaming Mai, Yun Liang (South China Agricultural University), Tom Z. J. Fu, Zhenjie Zhang (Advanced Digital Sciences Center) and Ruichu Cai (Guangdong University of Technology)
D3. Based on 3D Trajectory Projection Object Grouping and Classification for Video Synopsis
Jing-Ming Guo and Yu-Da Lin (National Taiwan University of Science and Technology)
D4. Recognition from Hand Cameras: A Revisit with Deep Learning
Cheng-Sheng Chan, Ting-An Chien, Tz-Ying Wu and Min Sun (National Tsing Hua University)
D5. Visual-Inertial Ego-Positioning for Flying Cameras
Meng-Hsun Chou, Hsin-Ruey Tsai, Qiao Liang, Tian-Yi Shen, Kuan-Wen Chen and Yi-Ping Hung (National Taiwan University)
D6. Deep Learning of Facial Attributes
Yan-Xiang Chen, Cheng-Hua Hsieh, Hung-Cheng Shie and Gee-Sern Jison Hsu (National Taiwan University of Science and Technology)
D7. LRP — A General Tool to Explain Predictions of Deep Neural Networks
Wojciech Samek and Alexander Binder (Fraunhofer Heinrich Hertz Institute)

D1. Stereo Depth based Human Detection and Tracking for Queue Analysis
Contributors: Csaba Beleznai
Affiliation: AIT Austrian Institute of Technology GmbH, Austria

Abstract:
In this live demonstration we present a passive stereo depth based detection and tracking framework capable of detecting and tracking humans at interactive frame rates within an observed area of up to 12 x 8 meters. The main novelty of the proposed system is twofold: (i) Delineating individual humans in a crowded occupancy map, obtained by projecting depth data onto an estimated ground plane, is a non-trivial task due to occlusions. We use a human shape prior (appearing as variable elliptic forms in the occupancy map) in the form of a learned multi-resolution binary shape tree. This shape representation allows compact clusters to be delineated in a noisy probabilistic occupancy map in a fast, coarse-to-fine manner, resulting in high detection rates and temporally stable clustering results. (ii) We detect and track individuals forming queues to estimate the waiting time for the last individual in the queue. We present a coherent motion detection scheme capable of capturing the typical stop-and-go movement propagation along the length of the queue in order to determine its spatial extent and the velocity of the forward motion. The demonstrator runs at 6-7 fps, including stereo matching and all additional computations, on a multi-core CPU.


Figure 1: Left: Stereo disparity image showing the automatically estimated ground plane (as a re-projected grid) and detected humans. Right: Detection and tracking results for a meander-style queue.
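
The occupancy-map step described above can be pictured with a short sketch: reconstructed 3D points are projected onto the estimated ground plane and accumulated into a 2D histogram, in which people appear as compact elliptical blobs. This is a hedged illustration only; function names, thresholds and the coordinate handling below are our assumptions, not the demonstrator's implementation.

import numpy as np

def occupancy_map(points_3d, plane_normal, plane_point, cell_size=0.1,
                  area=(12.0, 8.0)):
    """points_3d: (N, 3) points reconstructed from stereo, in camera coordinates."""
    n = plane_normal / np.linalg.norm(plane_normal)
    # Height of each point above the ground plane.
    heights = (points_3d - plane_point) @ n
    # Keep points within a plausible human height band (assumed thresholds).
    keep = (heights > 0.2) & (heights < 2.2)
    pts = points_3d[keep]
    # Project the retained points onto the ground plane.
    proj = pts - np.outer(heights[keep], n)
    rel = proj - plane_point
    # Two in-plane axes (assumes the ground normal is not aligned with the z-axis).
    ax1 = np.cross(n, [0.0, 0.0, 1.0]); ax1 /= np.linalg.norm(ax1)
    ax2 = np.cross(n, ax1)
    gx, gy = rel @ ax1, rel @ ax2
    # Accumulate an occupancy grid over the observed 12 x 8 m area.
    bins = (int(area[0] / cell_size), int(area[1] / cell_size))
    grid, _, _ = np.histogram2d(gx, gy, bins=bins,
                                range=[[0.0, area[0]], [0.0, area[1]]])
    return grid  # candidate humans appear as elliptical clusters in this map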

D2. Component-Based Distributed Framework for Coherent and Real-Time Video Dehazing
Contributors: Meihua Wang (1), Jiaming Mai (1), Yun Liang (1), Tom Z. J. Fu (2,3), Zhenjie Zhang (3), Ruichu Cai (2)
Affiliations: (1) South China Agricultural University, China
(2) Guangdong University of Technology, China
(3) Advanced Digital Sciences Center, Singapore

Abstract:
Traditional dehazing techniques, a well-studied topic in image processing, are now widely used to eliminate haze effects from individual images. However, even state-of-the-art dehazing algorithms may not provide sufficient support to video analytics, a crucial pre-processing step for video-based decision making systems (e.g., robot navigation), because of their poor result coherence and low processing efficiency. This work presents a new framework, designed specifically for video dehazing, that outputs coherent results in real time using two novel techniques. First, we decompose dehazing algorithms into three generic components: the transmission map estimator, the atmospheric light estimator and the haze-free image generator. These can be processed simultaneously by multiple threads in the distributed system, so that processing efficiency is optimized by automatic CPU resource allocation based on the workloads. Second, a cross-frame normalization scheme is proposed to enhance coherence among consecutive frames by sharing the atmospheric light parameters across consecutive frames on the distributed computation platform. The combination of these techniques enables our framework to generate highly consistent and accurate dehazing results in real time, using only 3 PCs connected by Ethernet.
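
As a rough illustration of this decomposition, the sketch below wires the three components into a per-frame pipeline and smooths the atmospheric light across consecutive frames, which is the essence of the cross-frame normalization. The component bodies are simple placeholders and the smoothing factor is an assumption; in the actual framework each component runs in its own thread on the distributed platform.

import numpy as np

def estimate_transmission(frame):
    # Placeholder transmission-map estimator (dark-channel style cue).
    return 1.0 - 0.95 * frame.min(axis=2) / 255.0

def estimate_atmospheric_light(frame):
    # Placeholder atmospheric-light estimator: mean of the brightest pixels.
    flat = frame.reshape(-1, 3).astype(np.float64)
    return flat[flat.sum(axis=1).argsort()[-100:]].mean(axis=0)

def recover(frame, t, A, t_min=0.1):
    # Haze-free image generator: invert the atmospheric scattering model.
    t = np.clip(t, t_min, 1.0)[..., None]
    return np.clip((frame - A) / t + A, 0, 255).astype(np.uint8)

def dehaze_video(frames, alpha=0.9):
    A_shared = None
    for frame in frames:
        t = estimate_transmission(frame)
        A = estimate_atmospheric_light(frame)
        # Cross-frame normalization: share/smooth A over consecutive frames.
        A_shared = A if A_shared is None else alpha * A_shared + (1 - alpha) * A
        yield recover(frame.astype(np.float64), t, A_shared)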


D3. Based on 3D Trajectory Projection Object Grouping and Classification for Video Synopsis
Contributors: Jing-Ming Guo and Yu-Da Lin
Affiliation: National Taiwan University of Science and Technology, Taiwan

Abstract:
Surveillance systems are now everywhere, in schools, airports, streets and elsewhere, and the huge amount of data they create day by day makes browsing for a specific event a big challenge. Video synopsis was proposed to reduce human effort and present the dynamic behavior clearly within a short time period: it produces a condensed video that removes spatial and temporal redundancies while completely preserving all the activities of the original video. Traditional video synopsis methods are very time consuming and leave blank regions in the resulting video. To solve these problems, we propose a trajectory-based video synopsis consisting of three parts: 1) making object tubes, where trajectory-based object classification keeps the tubes continuous and smooth to avoid the blanking effect in the synopsis video; 2) rearranging object tubes, where the Minimum Overlap (MO) algorithm decides the temporal position of each object tube in the synopsis video; and 3) flexible object tubes, where a Global Temporal Shifting (GTS) process makes the tubes flexible along the temporal domain. As a result, the proposed method efficiently generates a smooth synopsis video, without the blanking effect, that is even shorter than the original video.
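
The rearrangement step can be sketched as a greedy search: each object tube (a list of per-frame bounding boxes) is shifted in time to the position that minimizes its overlap with the tubes already placed. This is only a hedged simplification of the MO idea, with our own data layout and cost function, not the authors' algorithm.

def box_overlap(a, b):
    # Overlap area of two boxes given as (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return max(0, x2 - x1) * max(0, y2 - y1)

def tube_overlap(tube, placed, shift):
    # Total overlap of `tube` (shifted by `shift` frames) with the placed tubes.
    cost = 0
    for t, box in enumerate(tube):
        for other in placed:
            ot = t + shift - other["shift"]
            if 0 <= ot < len(other["tube"]):
                cost += box_overlap(box, other["tube"][ot])
    return cost

def place_tubes(tubes, max_shift):
    # Greedily assign each tube the temporal shift with the least overlap.
    placed = []
    for tube in tubes:
        best = min(range(max_shift + 1),
                   key=lambda s: tube_overlap(tube, placed, s))
        placed.append({"tube": tube, "shift": best})
    return placed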

D4. Recognition from Hand Cameras: A Revisit with Deep Learning
Contributors: Cheng-Sheng Chan, Ting-An Chien, Tz-Ying Wu, Min Sun
Affiliation: National Tsing Hua University, Taiwan

Abstract:
The HandCam system (Fig. 1) has two unique properties compared to egocentric systems: it avoids the need to detect hands, and it observes the activities of the hands more consistently. By taking advantage of these properties, we collected a HandCam-Object Dataset. In our experiments, the accuracy of HandCam is about 10% higher than that of the egocentric camera.

Figure 1: The complete system, which includes one hand-view camera and the TK1 development board.


We recognize the object interacting with the user's hand, and the model runs on a development board. The whole system is portable, and we provide a real-time demo. Our concept video is available at http://goo.gl/uAC1TG.

As a brief introduction to Fig. 2: we predict the user's hand state from the hand-view camera, whose image is shown in the left column, while the two images in the right column show the user's state from other viewpoints. Please watch the whole video for more information.

Figure 2

With a new dataset covering at least 50 daily-object categories, we can go well beyond the concept video. All frames were collected from users' daily-life activities, and the dataset is designed to contain at least 5 instances and 5 minutes of footage on average for each object category. We have collected 60% of the dataset so far, which should make for a very interesting live demo.
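
The real-time demo described above boils down to a simple capture-classify-display loop on the development board. The sketch below is a hedged outline with a placeholder classifier; the actual network, labels and deployment details are the authors' own.

import cv2

def run_demo(classify, camera_id=0):
    # classify: callable taking a BGR frame and returning (label, score).
    cap = cv2.VideoCapture(camera_id)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        label, score = classify(frame)   # placeholder CNN forward pass
        cv2.putText(frame, "%s: %.2f" % (label, score), (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
        cv2.imshow("HandCam demo", frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cap.release()
    cv2.destroyAllWindows()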

D5. Visual-Inertial Ego-Positioning for Flying Cameras
Contributors: Meng-Hsun Chou, Hsin-Ruey Tsai, Qiao Liang, Tian-Yi Shen, Kuan-Wen Chen, Yi-Ping Hung
Affiliation: National Taiwan University, Taiwan

Abstract:
In this work, a low-cost monocular camera and an inertial measurement unit (IMU) are combined for the ego-positioning of a flying camera. The state-of-the-art monocular visual positioning approaches include Simultaneous Localization and Mapping (SLAM) and Model-Based Localization (MBL). We show experimental results of self-positioning using three representative methods, ORB-SLAM, LSD-SLAM and MBL, and evaluate their performance in different scenarios. Based on these results, we analyze the pros and cons of each method. We also improve the performance of visual positioning with an inertial sensor in a loosely-coupled framework. The experimental results demonstrate the benefits of visual-inertial sensor fusion.


Fig. 1: Overview of the complete LSD-SLAM algorithm.
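
To make the loosely-coupled fusion concrete, a minimal sketch is given below: the IMU propagates the state between camera frames, and each visual pose estimate (from SLAM or MBL) pulls the state back toward it. The state layout and the blending gain are our assumptions; the demonstrated system may use a different filter.

import numpy as np

class LooselyCoupledFusion:
    def __init__(self, gain=0.2):
        self.p = np.zeros(3)   # position
        self.v = np.zeros(3)   # velocity
        self.gain = gain       # strength of the visual correction

    def propagate(self, accel_world, dt):
        # IMU step: integrate gravity-compensated acceleration over dt seconds.
        self.p = self.p + self.v * dt + 0.5 * accel_world * dt * dt
        self.v = self.v + accel_world * dt

    def correct(self, p_visual):
        # Visual step: blend in the position estimated by SLAM/MBL.
        self.p = (1.0 - self.gain) * self.p + self.gain * np.asarray(p_visual)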

D6. Deep Learning of Facial Attributes
Contributors: Yan-Xiang Chen, Cheng-Hua Hsieh, Hung-Cheng Shie, Gee-Sern Jison Hsu
Affiliation: National Taiwan University of Science and Technology, Taiwan

Abstract:
Built on a jointly trained deep convolutional neural network (CNN), this system can identify more than 40 facial attributes, including gender, expression, age, hair color and style, with or without makeup, with or without eyeglasses, similarity to celebrity faces and many others. We locate faces and local facial features with the Regressive Tree Structured Model (Hsu et al., ICCV 2015), and decompose each face into facial segments. Each facial segment encloses a specific region well suited to identifying certain facial attributes. For example, the mouth region is used to identify whether the subject wears lipstick or smiles; the hair region is used to identify the hair color and style. The system can also identify a person's face if he/she is registered. Registration requires only a frontal face, and the system can identify faces at up to 60 degrees of rotation under arbitrary illumination conditions. The figures below show two samples.

[Two sample result figures]
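
The segment-to-attribute routing described above can be sketched as a small lookup: each facial segment is cropped from the aligned face and sent to the attribute predictors that depend on it. The segment names, attribute lists and callables below are illustrative assumptions, not the system's actual decomposition.

def predict_attributes(face_image, locate_segments, segment_nets,
                       segment_attributes=None):
    """locate_segments: face image -> {segment_name: cropped region}
       segment_nets:    {segment_name: callable returning {attribute: score}}"""
    if segment_attributes is None:
        segment_attributes = {           # illustrative mapping only
            "mouth": ["smiling", "wearing_lipstick"],
            "hair":  ["black_hair", "blond_hair", "wavy_hair"],
            "eyes":  ["eyeglasses", "narrow_eyes"],
            "whole_face": ["gender", "age_group", "heavy_makeup"],
        }
    crops, results = locate_segments(face_image), {}
    for segment, attributes in segment_attributes.items():
        scores = segment_nets[segment](crops[segment])
        for attr in attributes:
            results[attr] = scores[attr]
    return results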

D7. LRP — A General Tool to Explain Predictions of Deep Neural Networks
Contributors: Wojciech Samek, Alexander Binder
Affiliation: Fraunhofer Heinrich Hertz Institute, Germany

Abstract:
Understanding and interpreting the predictions of complex machine learning algorithms such as deep neural networks is of high value in many applications, as it allows one to verify the reasoning of the system and provides additional information to the human expert. Although deep nets solve a plethora of tasks very successfully, they in most cases have the disadvantage of acting as a black box, providing no information about what made them arrive at a particular decision.
We demonstrate a recently developed technique, Layer-wise Relevance Propagation (LRP), which explains the predictions of deep neural networks by visualizing the relevance of each individual input dimension (e.g. image pixel) with respect to the classification decision as a heatmap. This unsupervised decomposition process is in principle applicable to any kind of model: high (positive) relevance values identify properties of the input that speak for the presence of the prediction target, while low (or even negative) scores indicate no or negative contribution.
We demonstrate LRP in a live demo on image, text and video data with various types of neural network models. Our demonstration complements the Workshop on Interpretation and Visualization of Deep Neural Nets, which we organize at ACCV 2016.
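
For readers who want to see the mechanics, the sketch below implements the widely used epsilon-rule of LRP for a single fully-connected layer: the relevance arriving at the layer's outputs is redistributed to its inputs in proportion to each input's contribution a_i * w_ij to the pre-activations. Variable names are ours and a full explanation chains this step backwards through every layer; this is not code from the authors' toolbox.

import numpy as np

def lrp_epsilon(a, W, b, R_out, eps=1e-6):
    # a: (n_in,) input activations of the layer
    # W: (n_in, n_out) weights, b: (n_out,) biases
    # R_out: (n_out,) relevance assigned to the layer outputs
    z = a @ W + b                               # pre-activations
    z = z + eps * np.where(z >= 0, 1.0, -1.0)   # stabilizer against small z
    s = R_out / z                               # relevance per unit of activation
    return a * (W @ s)                          # relevance redistributed to inputs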
