Overview and details of the sessions of this conference.
On the Exploration of Convolutional Fusion Networks for Visual Recognition
Yu Liu, Yanming Guo, Michael S. Lew
Leiden University, The Netherlands
We explore multi-scale convolutional neural networks (CNNs) for visual recognition. Despite recent advances in multi-scale deep representations, existing approaches are limited by expensive parameters and weak fusion modules. Hence, in this paper, we propose an efficient multi-scale fusion architecture, called convolutional fusion networks (CFN). By using efficient 1×1 convolutions and global average pooling, CFN generates side branches from multi-scale intermediate layers while adding few parameters. In addition, we present a locally-connected fusion module, which learns adaptive weights for the side branches and forms a fused feature. Extensive experiments on CIFAR and ImageNet demonstrate considerable gains of CFN over the plain CNN. Furthermore, we transfer the pre-trained ImageNet model to three new tasks: scene recognition, fine-grained recognition and image retrieval. CFN obtains consistent improvements on these transfer tasks.
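The side-branch and fusion ideas in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; shapes, the `side_branch`/`fuse` names, and the per-dimension weight layout are illustrative assumptions.

```python
import numpy as np

def side_branch(feature_map, w):
    """Side branch: 1x1 convolution followed by global average pooling.

    feature_map: (H, W, C_in) activations from an intermediate layer.
    w: (C_in, C_out) weights of the 1x1 convolution (hypothetical shapes).
    """
    projected = feature_map @ w          # a 1x1 conv is a per-pixel linear map
    return projected.mean(axis=(0, 1))   # global average pooling -> (C_out,)

def fuse(branches, alpha):
    """Locally-connected fusion: adaptive weights alpha, one per
    branch and per feature dimension, learned during training."""
    stacked = np.stack(branches)          # (num_branches, C_out)
    return (alpha * stacked).sum(axis=0)  # weighted sum over branches
```

Because each branch is reduced to a single vector by global average pooling, the extra parameter cost is only the 1×1 projection and the fusion weights.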
11:15am - 11:40am
Spatio-temporal VLAD Encoding for Human Action Recognition in Videos
Ionut Cosmin Duta1, Bogdan Ionescu2, Kiyoharu Aizawa3, Nicu Sebe1
1University of Trento, Italy; 2University Politehnica of Bucharest, Romania; 3University of Tokyo, Japan
Encoding is one of the key factors for building an effective video representation. In recent work, super vector-based encoding approaches have been highlighted as among the most powerful representation generators. The Vector of Locally Aggregated Descriptors (VLAD) is one of the most widely used super vector methods, with outstanding results in many tasks. However, one limitation of VLAD encoding is the lack of spatial information captured from the data. This is critical, especially when dealing with video information. In this work, we propose an extension of VLAD, Spatio-temporal VLAD (ST-VLAD), an encoding method which incorporates spatio-temporal information within the encoding process. This is carried out by proposing a video division and extracting specific information over the feature group of each video split. To test ST-VLAD we address the problem of human action recognition in videos. Experimental validation is carried out using both hand-crafted and deep features. Our pipeline for action recognition with the proposed encoding method obtains state-of-the-art performance on three challenging datasets: HMDB51 (67.6%), UCF50 (97.8%) and UCF101 (91.5%).
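Standard VLAD sums descriptor-to-centroid residuals; the spatio-temporal extension described above can be sketched by encoding each temporal split separately and concatenating. The following NumPy sketch shows this idea under simplifying assumptions (uniform temporal splits keyed on frame index; the actual ST-VLAD division and per-group statistics may differ).

```python
import numpy as np

def vlad(descriptors, centers):
    """Plain VLAD: sum residuals of each descriptor to its nearest
    codebook center, then L2-normalise the flattened vector."""
    K, D = centers.shape
    d2 = ((descriptors[:, None, :] - centers[None]) ** 2).sum(-1)
    assign = np.argmin(d2, axis=1)            # hard assignment to centers
    v = np.zeros((K, D))
    for k in range(K):
        if np.any(assign == k):
            v[k] = (descriptors[assign == k] - centers[k]).sum(axis=0)
    v = v.flatten()
    return v / (np.linalg.norm(v) + 1e-12)

def st_vlad_sketch(descriptors, frame_idx, centers, splits):
    """Encode each temporal split of the video separately and
    concatenate, so the representation keeps coarse temporal layout."""
    edges = np.linspace(frame_idx.min(), frame_idx.max() + 1, splits + 1)
    parts = []
    for i in range(splits):
        mask = (frame_idx >= edges[i]) & (frame_idx < edges[i + 1])
        sub = descriptors[mask] if mask.any() else np.zeros((1, centers.shape[1]))
        parts.append(vlad(sub, centers))
    return np.concatenate(parts)
```

The encoded dimensionality grows linearly with the number of splits (splits × K × D).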
11:40am - 12:05pm
A Framework of Privacy-Preserving Image Recognition for Image-Based Information Services
Nowadays, mobile devices such as smartphones are widely used all over the world, and the performance of image recognition has increased dramatically thanks to deep learning. Against this background, we believe the following information service could be realized in the near future: users take a photo and send it to a server, which recognizes the location in the photo and returns information about the recognized location. However, this kind of client-server image recognition can cause a privacy issue, because recognition results are sometimes privacy-sensitive. To tackle this issue, we propose a novel framework for privacy-preserving image recognition in which the server cannot uniquely identify the recognition result but the users can. The proposed framework works as follows: first, users extract a visual feature from their photo and transform it so that the server cannot uniquely identify the recognition result. Then, users send the transformed feature to the server, which returns a candidate set of recognition results. Finally, the users compare the candidates with the original visual feature to obtain the final result. Our experimental results demonstrate the effectiveness of the proposed framework.
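The three-step protocol in the abstract (obfuscate, fetch candidates, resolve locally) can be sketched as below. This is only an illustration of the candidate-set idea: the additive-noise transform, the database, and all function names are hypothetical, not the paper's actual feature transformation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical server-side database of location features (illustrative).
db = rng.normal(size=(100, 16))

def client_transform(feature, noise_scale=0.5):
    """Step 1 (client): obfuscate the query feature so the server
    cannot pin down a single match. (Additive noise is a stand-in
    for the paper's transformation.)"""
    return feature + rng.normal(scale=noise_scale, size=feature.shape)

def server_candidates(query, k=10):
    """Step 2 (server): return a candidate set of nearest entries,
    never a unique recognition result."""
    d = np.linalg.norm(db - query, axis=1)
    return np.argsort(d)[:k]

def client_resolve(feature, candidates):
    """Step 3 (client): pick the final match locally, using the
    original un-transformed feature."""
    d = np.linalg.norm(db[candidates] - feature, axis=1)
    return candidates[np.argmin(d)]
```

The privacy/utility trade-off lives in the transform strength and the candidate-set size: more obfuscation or a larger set reveals less to the server but costs bandwidth and client-side work.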
12:05pm - 12:30pm
Cross-modal Recipe Retrieval: How to Cook This Dish?
Jingjing Chen, Lei Pang, Chong-wah Ngo
City University of Hong Kong, Hong Kong S.A.R. (China)
In social media, users like to share food pictures. One intelligent feature, potentially attractive to amateur chefs, is the recommendation of a recipe along with the food. This feature, unfortunately, is still technically challenging. First, current food recognition technology scales only to a few hundred categories, far short of what is needed to recognize the tens of thousands of food categories in practice. Second, even a single food category can have recipe variants that differ in ingredient composition. Finding the best-match recipe requires knowledge of ingredients, which is a fine-grained recognition problem. In this paper, we consider the problem from the viewpoint of cross-modality analysis. Given a large number of image and recipe pairs acquired from the Internet, a joint space is learnt to locally capture the ingredient correspondence between images and recipes. As learning happens at the region level for images and the ingredient level for recipes, the model can generalize recognition to unseen food categories. Furthermore, the embedded multi-modal ingredient feature sheds light on the retrieval of best-match recipes. On an in-house dataset, our model doubles the retrieval performance of DeViSE, a popular cross-modality model that does not consider region information during learning.
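Retrieval in a learnt joint space reduces, at inference time, to projecting both modalities and ranking by similarity. The sketch below shows that retrieval step only; the projection matrices stand in for whatever the model learns, and the function names are illustrative, not the paper's API.

```python
import numpy as np

def embed(x, w):
    """Project features into the shared space and L2-normalise.
    w is a placeholder for a learned projection."""
    z = x @ w
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-12)

def retrieve_recipes(image_feat, recipe_feats, w_img, w_txt, k=5):
    """Rank recipes by cosine similarity to the query image in the
    joint space and return the indices of the top-k candidates."""
    q = embed(image_feat, w_img)      # image side of the joint space
    r = embed(recipe_feats, w_txt)    # recipe side of the joint space
    return np.argsort(-(r @ q))[:k]
```

Because the space is shared, the same machinery supports the reverse direction (recipe-to-image) with the roles of the two projections swapped.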