Overview and details of the sessions of this conference.
Describing Geographical Characteristics with Social Images
Huangjie ZHENG, Jiangchao YAO, Ya ZHANG
Shanghai Jiao Tong University, People's Republic of China
Images play an important role in providing a comprehensive understanding of our physical world. When thinking of a tourist city, one can immediately imagine pictures of its famous attractions. With the boom of social images, we explore the possibility of describing the geographical characteristics of different regions. We propose a Geographical Latent Attribute Model (GLAM) to mine regional characteristics from social images, which is expected to provide a comprehensive view of the regions. The model assumes that a geographical region consists of different "attributes" (e.g., infrastructure, attractions, events, and activities) and that "attributes" are interpreted by different image "clusters". Both "attributes" and image "clusters" are modeled as latent variables. Experimental analysis of a collection of 2.5M Flickr photos covering Chinese provinces and cities shows that the proposed model is promising for describing regional characteristics. Moreover, we demonstrate the usefulness of the proposed model for place recommendation.
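The latent structure the abstract describes (a region as a mixture over "attributes", each attribute interpreted by image "clusters") can be pictured as a toy two-level mixture. The sketch below is an illustrative assumption, not the paper's actual GLAM inference: the dimensions, the Dirichlet draws, and the variable names are all made up for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: 2 regions, 3 latent "attributes", 4 image clusters.
n_regions, n_attrs, n_clusters = 2, 3, 4

# P(attribute | region): each region is a mixture over latent attributes.
theta = rng.dirichlet(np.ones(n_attrs), size=n_regions)

# P(cluster | attribute): each attribute is interpreted by image clusters.
phi = rng.dirichlet(np.ones(n_clusters), size=n_attrs)

# Marginal P(cluster | region) = sum_a P(a | region) * P(cluster | a):
# which kinds of images characterize each region.
cluster_given_region = theta @ phi
```

In a full model, `theta` and `phi` would be latent variables estimated from the geo-tagged photo collection rather than sampled from priors.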
9:20am - 9:40am
Fully convolutional network with superpixel parsing for fashion Web image segmentation
Lixuan YANG1,2, Helena RODRIGUEZ2, Michel CRUCIANU1, Marin FERECATU1
1Conservatoire National des Arts et Metiers, Paris, France; 2Shopedia SAS
In this paper we introduce a new method for extracting deformable clothing items from still images by extending the output of a Fully Convolutional Network (FCN) to infer context from local units (superpixels). To achieve this we optimize, over the space of all possible pixel labelings, an energy function that combines the large-scale structure of the image with the local low-level visual descriptions of superpixels. To assess our method we compare it to the unmodified FCN used as a baseline, as well as to the well-known Paper Doll and Co-parsing methods for fashion images.
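The idea of minimizing an energy that combines FCN evidence with superpixel-level smoothness can be illustrated with a toy version. Everything below is an assumption for illustration: the unary/pairwise split, the Potts-style disagreement penalty, and the greedy ICM-style optimizer are stand-ins, not the authors' actual formulation or solver.

```python
import numpy as np

rng = np.random.default_rng(1)
n_superpixels, n_labels = 6, 3

# Hypothetical inputs: per-superpixel FCN label costs (unary term) and an
# upper-triangular adjacency matrix of feature-similarity weights (pairwise term).
unary = rng.random((n_superpixels, n_labels))
adjacency = np.triu(rng.random((n_superpixels, n_superpixels)), k=1)

def energy(labeling, lam=0.5):
    """Cost of a labeling: sum of unary costs plus a Potts-style penalty
    that charges lam * similarity whenever two superpixels disagree."""
    u = unary[np.arange(n_superpixels), labeling].sum()
    disagree = labeling[:, None] != labeling[None, :]
    return u + lam * (adjacency * disagree).sum()

# Greedy coordinate-descent (ICM-style) minimization: start from the FCN's
# per-superpixel argmin, then re-label one superpixel at a time.
labeling = unary.argmin(axis=1)
for _ in range(5):
    for i in range(n_superpixels):
        costs = [energy(np.concatenate([labeling[:i], [l], labeling[i + 1:]]))
                 for l in range(n_labels)]
        labeling[i] = int(np.argmin(costs))
```

ICM never increases the energy, so the final labeling is at least as good (under this toy energy) as the FCN-only initialization.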
9:40am - 10:00am
Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neural Networks
Nikiforos Pittaras1, Foteini Markatopoulou1,2, Vasileios Mezaris1, Ioannis Patras2
1Centre for Research and Technology Hellas, Information Technologies Institute (CERTH-ITI); 2Queen Mary University of London
In this study we compare three fine-tuning strategies in order to investigate the best way to transfer the parameters of popular deep convolutional neural networks (DCNNs), trained for a visual annotation task on one dataset, to a new, considerably different dataset. We focus on the concept-based image/video annotation problem and use ImageNet as the source dataset, while the TRECVID SIN 2013 and PASCAL VOC-2012 classification datasets are used as the target datasets. A large set of experiments examines the effectiveness of the three fine-tuning strategies on each of three different pre-trained DCNNs and each target dataset. The reported results give rise to guidelines for effectively fine-tuning a DCNN for concept-based visual annotation.
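What such transfer strategies typically differ in is which parameters receive gradient updates. The sketch below marks the trainable layers under three hypothetical strategies (re-learn only the classifier, fine-tune everything, or freeze the network and train an added extension layer); the strategy names and AlexNet-style layer names are placeholders, not the paper's exact definitions.

```python
# Hypothetical layer list of a pre-trained DCNN (AlexNet-style naming).
layers = ["conv1", "conv2", "conv3", "conv4", "conv5", "fc6", "fc7", "fc8"]

def trainable(strategy):
    """Return which layers receive gradient updates under each strategy."""
    if strategy == "retrain_last":   # replace and re-learn only the classifier layer
        return ["fc8"]
    if strategy == "finetune_all":   # update every pre-trained layer on target data
        return list(layers)
    if strategy == "extend":         # freeze the pre-trained net, train an added layer
        return ["fc_ext"]
    raise ValueError(f"unknown strategy: {strategy}")
```

The trade-off this exposes: more trainable parameters can adapt better to a considerably different target dataset, but risk overfitting when target data is scarce.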
10:00am - 10:20am
What Convnets Make for Image Captioning?
Yu Liu, Yanming Guo, Michael S. Lew
Leiden University, The Netherlands
Nowadays, a general pipeline for the image captioning problem takes advantage of image representations based on convolutional neural networks (CNNs) and sequence modeling based on recurrent neural networks (RNNs). Captioning performance closely depends on the discriminative capacity of the CNNs. Hence, in this work we focus on investigating the effects of different Convnets (CNN models) on image captioning. We train three Convnets on different classification tasks: single-label, multi-label and multi-attribute. Since the three Convnets focus on different visual content in an image, we further propose aggregating their features to generate a richer visual representation. The visual features derived from the Convnets are then fed to a Long Short-Term Memory (LSTM) network that models the sequence of language words. At test time, we use an efficient multi-scale augmentation approach based on fully convolutional networks (FCNs). Extensive experiments on the MS COCO 2014 dataset provide significant insights into the effects of the Convnets.
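The aggregation step the abstract describes (combining the three Convnets' features into a richer representation before the LSTM) can be sketched as a simple concatenate-and-project step. The feature dimensions, the linear projection, and all variable names below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical feature vectors from the three Convnets trained on
# single-label, multi-label, and multi-attribute classification.
f_single = rng.random(8)
f_multi = rng.random(8)
f_attr = rng.random(8)

# Aggregate into a richer visual representation by concatenation, then
# project to the LSTM input size with an (untrained) linear map.
visual = np.concatenate([f_single, f_multi, f_attr])   # shape (24,)
W = rng.standard_normal((16, visual.size)) * 0.1
x0 = W @ visual                                        # first input to the LSTM
```

In the full pipeline, `x0` would initialize or condition the LSTM that generates the caption word by word; the projection `W` would be learned jointly with the language model.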