M-SBIR: An Improved Sketch-based Image Retrieval Method using Visual Word Mapping
1State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China; 2Hong Kong Polytechnic University; 3School of Electronics and Information, Xi'an Jiaotong University, Xi'an 710049, China.
Sketch-based image retrieval (SBIR) systems, which interactively search photo collections using free-hand sketches depicting shapes, have attracted much attention recently. In most existing SBIR techniques, the color images stored in a database are first transformed into corresponding sketches. Then, features of the sketches are extracted to generate the sketch visual words for later retrieval. However, transforming color images to sketches normally incurs a loss of information, thus decreasing the final performance of SBIR methods. To address this problem, we propose a new method called M-SBIR. In M-SBIR, besides sketch visual words, we also generate a set of visual words from the original color images. Then, we leverage the mapping between the two sets to identify and remove sketch visual words that cannot describe the original color images well. We demonstrate the performance of M-SBIR on a public data set. We show that, depending on the number of different visual words adopted, our method achieves a 9.8% to 13.6% performance improvement over the classic SBIR techniques. In addition, we show that for a database containing multiple color images of the same objects, the performance of M-SBIR can be further improved via simple techniques such as co-segmentation.
Discovering User Interests from Social Images
1Shanghai Jiao Tong University, China, People's Republic of; 2University of Technology Sydney, Australia
The last decades have witnessed the boom of social networks. As a result, discovering user interests from social media has gained increasing attention. While the accumulation of social media presents great opportunities for a better understanding of users, the challenge lies in how to build a uniform model for the heterogeneous content. In this article, we propose a hybrid mixture model for user interest discovery which exploits both the textual and visual content associated with social images. By modeling the features of each content source independently at the latent variable level and unifying them as latent interests, the proposed model allows the semantic interpretation of user interests from both the visual and textual perspectives. Qualitative and quantitative experiments on a Flickr dataset with 2.54 million images demonstrate its promise for user interest analysis compared with existing methods.
Frame-independent and Parallel Method for 3D Audio Real-time Rendering on Mobile Devices
1State Key Laboratory of Software Engineering, Wuhan University, China; 2National Engineering Research Center for Multimedia Software, Computer School of Wuhan University, China; 3Hubei Provincial Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China; 4School of Physics and Electronic Science, Guizhou Normal University, Guiyang, China
As 3D audio is a fundamental medium of virtual reality (VR), real-time 3D audio rendering is essential for implementing VR, especially on mobile devices. However, the computational load of real-time 3D audio rendering is too high for the limited computational power of mobile devices. To solve this problem, we propose a frame-independent and parallel method of framing convolution that parallelizes 3D audio rendering with head-related transfer functions (HRTFs). To remove the dependency of overlap-add convolution on adjacent frames, each frame's convolution result is added to the final results of the two adjacent frames. We found that our method reduces the calculation time of 3D audio rendering significantly. The rendering times were 0.74, 0.5 and 0.36 times the play duration of si03.wav (27 s long) on the Snapdragon 801, Kirin 935 and Helio X10 Turbo, respectively.
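The idea of convolving frames independently and merging the overlapping tails afterwards can be illustrated with a minimal sketch. It is plain overlap-add written so that the per-frame step has no dependency on neighbouring frames; it is not the authors' implementation. The function names, the frame length and the short filter standing in for an HRTF are all assumptions made for illustration.

```python
def convolve(x, h):
    """Direct linear convolution of two sequences (reference implementation)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def framed_convolve(x, h, frame_len):
    """Convolve x with h frame by frame (hypothetical helper).

    Each frame is convolved on its own, so this step could be handed to
    separate threads or cores. The partial results are merged afterwards
    by adding each one into the output at its frame offset, which is
    where the tails overlapping the following frame are accumulated.
    """
    out = [0.0] * (len(x) + len(h) - 1)
    starts = list(range(0, len(x), frame_len))
    # Independent step: one partial result per frame (parallelizable).
    partials = [convolve(x[s:s + frame_len], h) for s in starts]
    # Merge step: add every partial result at its frame offset.
    for s, part in zip(starts, partials):
        for k, v in enumerate(part):
            out[s + k] += v
    return out
```

In this sketch the merge step is the only place where frames interact, which is what makes the per-frame convolutions safe to run in parallel.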
Color-Introduced Frame-to-Model Registration for 3D Reconstruction
Fujitsu Research & Development Center Co., Ltd., China, People's Republic of
3D reconstruction has become an active research topic with the popularity of consumer-grade RGB-D cameras, and registration for model alignment is one of the most important steps. Most typical systems adopt depth-based geometry matching, while the captured color images are discarded entirely. Some recent methods further introduce photometric cues for better results, but only frame-to-frame matching is used. In this paper, a novel registration approach is proposed. Based on both geometric and photometric consistency, depth and color information are combined in a unified optimization framework. With the available depth maps and color images, a global model with colored surface vertices is maintained, and the incoming RGB-D frames are aligned by frame-to-model matching for more effective camera pose estimation. Both quantitative and qualitative experimental results demonstrate that our proposal achieves better reconstruction performance.
Scale-Relation Feature for Moving Cast Shadow Detection
Fujian Agriculture and Forestry University, China, People's Republic of
Moving cast shadow detection is a problem in visual surveillance applications that has been studied for years. However, an efficient model that can handle moving cast shadows in various situations remains a challenge. Unlike prior methods, we use a data-driven method without strong parametric assumptions or complex models to address the problem of moving cast shadows. In this paper, we propose a novel feature-extracting framework called Scale-Relation Feature Extracting (SRFE). By leveraging the scale space, SRFE decomposes each image with various properties into various scales and further considers the relationship between adjacent scales of the two shadow properties to extract the scale-relation features. To find the criteria for discriminating moving cast shadows, we use the random forest algorithm as the ensemble decision scheme. Experimental results show that the proposed method achieves state-of-the-art performance on a widely used dataset.
Improving the discriminative power of Bag of Visual Words Model
1XLIM UMR CNRS 7252, University of Poitiers, France; 2L3I, University of La Rochelle, France
With the exponential growth of image databases, the Content-Based Image Retrieval research field has started a race to propose ever more effective and efficient tools to manage this massive amount of data. In this paper, we focus on improving the discriminative power of the well-known bag of visual words model. To do so, we present $n$-BoVW, an approach that combines the effectiveness of the visual phrase model with the efficiency of the visual words model, the latter being kept through a binary-based compression algorithm. Experimental results on various datasets show the potential of our proposals.
Recognizing Emotions Based on Human Actions in Videos
Tsinghua University, China, People's Republic of
Systems for automatic analysis of videos are in high demand as videos are expanding rapidly on the Internet, and understanding the emotions carried by videos (e.g. “anger”, “happiness”) is becoming a hot topic. While existing affective computing models mainly focus on facial expression recognition, few attempts have been made to explore the relationship between emotion and human action. In this paper, we propose a comprehensive emotion classification framework based on spatio-temporal volumes built from human actions. For each action unit obtained, we use Dense-SIFT as the descriptor and K-means to form histograms. Finally, the histograms are sent to the mRVM to recognize the human emotion. The experimental results show that our method performs well on the FABO dataset, with an average precision of 70.8%.
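The histogram-forming step can be sketched as follows, assuming the K-means cluster centers have already been learned and that each local descriptor is a plain list of numbers. The function name and the toy two-dimensional descriptors are hypothetical; real Dense-SIFT descriptors have 128 dimensions, and the abstract does not say how the histograms are normalized.

```python
def bovw_histogram(descriptors, centers):
    """Quantize descriptors against cluster centers and return a
    normalized histogram of visual-word counts (illustrative helper)."""
    hist = [0] * len(centers)
    for d in descriptors:
        # Index of the nearest center by squared Euclidean distance.
        nearest = min(
            range(len(centers)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(d, centers[k])),
        )
        hist[nearest] += 1
    total = sum(hist)
    # L1-normalize so that histograms from units of different size compare.
    return [h / total for h in hist] if total else [0.0] * len(centers)
```

The resulting fixed-length vector is the kind of input a classifier such as the mRVM would consume.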
A Novel Affective Visualization System for Videos based on Acoustic and Visual Features
1State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China; 2School of Electronics and Information, Xi'an Jiaotong University, Xi'an 710049, China
With the fast development of social media in recent years, affective video content analysis has become a hot research topic, and the relevant techniques have been adopted by quite a few popular applications. In this paper, we first propose a novel set of audiovisual movie features to improve the accuracy of affective video content analysis, including seven audio features, eight visual features and two movie grammar features. Then, we propose an iterative method with low time complexity to select a set of more significant features for analyzing a specific emotion. Next, we adopt the BP (Back Propagation) network and the circumplex model to map the low-level audiovisual features onto high-level emotions. To validate our approach, a novel video player with affective visualization is designed and implemented, which makes emotion visible and accessible to the audience. Finally, we build a video dataset of 2000 video clips with manual affective annotations and conduct experiments to evaluate our proposed features, algorithms and models. The experimental results reveal that our approach outperforms state-of-the-art methods.
Rocchio-based Relevance Feedback in Video Event Retrieval
1University of Twente, Netherlands, The; 2Netherlands Organization for Applied Scientific Research (TNO)
This paper investigates methods for user and pseudo relevance feedback in video event retrieval. Existing feedback methods achieve strong performance but adjust the ranking based on few individual examples. We propose a relevance feedback algorithm (ARF) derived from the Rocchio method, which is a theoretically founded algorithm in textual retrieval. ARF updates the weights in the ranking function based on the centroids of the relevant and non-relevant examples. Additionally, relevance feedback algorithms are often only evaluated by one feedback mode (user feedback or pseudo feedback). Hence, a minor contribution of this paper is to study feedback algorithms based on a series of modes. Our experiments use TRECVID Multimedia Event Detection collections. We show that ARF performs significantly better in terms of Mean Average Precision, robustness, subjective user evaluation, and run time compared to the state-of-the-art.
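For readers unfamiliar with the Rocchio method, the classic centroid-based update can be sketched as below. This is the textbook form from text retrieval, not the ARF algorithm itself, whose details the abstract does not give. The function names and the values of alpha, beta and gamma are illustrative assumptions.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def rocchio_update(weights, relevant, non_relevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Classic Rocchio update (illustrative parameter values).

    Moves the weight vector toward the centroid of the relevant examples
    and away from the centroid of the non-relevant examples.
    """
    zero = [0.0] * len(weights)
    pos = centroid(relevant) if relevant else zero
    neg = centroid(non_relevant) if non_relevant else zero
    return [alpha * w + beta * p - gamma * n
            for w, p, n in zip(weights, pos, neg)]
```

Because the update uses centroids, a single outlying example shifts the weights less than it would in a method that adjusts the ranking per example.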
A Unified Framework for Monocular Video-Based Facial Motion Tracking and Expression Recognition
University of Science and Technology of China, China, People's Republic of
This paper proposes a facial motion tracking and expression recognition framework based on monocular video data. Using a 3D deformable facial model, the online statistical model (OSM) and cylinder head model (CHM) are combined to track 3D facial motion within particle filtering. For facial expression recognition, two algorithms are developed: one fast and efficient, the other robust and precise. With the first, facial animation and facial expression are retrieved sequentially. After facial animation is obtained, facial expression is recognized using static facial expression knowledge learned from anatomical analysis. With the second, facial animation and facial expression are retrieved simultaneously to increase reliability and robustness with noisy input data. Facial expression is recognized by fusing static and dynamic facial expression knowledge, the latter of which is learned from a video database. Experiments show that tracking by OSM + CHM is more accurate than tracking by OSM alone, and that the facial expression recognition score of the robust and precise algorithm is higher than those of other state-of-the-art facial expression recognition methods.
A Scalable Video Conferencing System Using Cached Facial Expressions
National Tsing Hua University, Taiwan, Republic of China
We propose a scalable video conferencing system that streams High-Definition videos (when bandwidth is sufficient) and ultra-low-bitrate (< 0.25 kbps) cached facial expressions (when bandwidth is scarce). Our solution consists of optimized approaches to: (i) choose representative facial expressions from training video frames and (ii) match an incoming Webcam frame against the pre-transmitted facial expressions. To the best of our knowledge, such an approach has never been studied in the literature. We evaluate the implemented video conferencing system using Webcam videos captured from 9 subjects. Compared to the state-of-the-art scalable codec, our solution: (i) reduces the bitrate by about 130 times when the bandwidth is scarce, (ii) achieves the same coding efficiency when the bandwidth is sufficient, (iii) allows exercising the tradeoff between initialization overhead and coding efficiency, (iv) performs better when the resolution is higher, and (v) runs reasonably fast before extensive code optimization.
Exploiting multimodality in video hyperlinking to improve target diversity
1CNRS, IRISA and INRIA, France; 2INSA, IRISA and INRIA, France; 3Université de Rennes 1, France; 4INRIA, IRISA and INRIA, France
Video hyperlinking is the process of creating links within a collection of videos. Starting from a given set of video segments, called anchors, a set of related segments, called targets, must be provided. In past years, a number of content-based approaches have been proposed, with good results obtained by searching for target segments that are very similar to the anchor in terms of content and information. Unfortunately, relevance has been obtained at the expense of diversity. In this paper, we study multimodal approaches and their ability to provide a set of diverse yet relevant targets. We compare two recently introduced cross-modal approaches, namely deep auto-encoders and bimodal LDA, and experimentally show that both provide significantly more diverse targets than a state-of-the-art baseline. Bimodal auto-encoders offer the best trade-off between relevance and diversity, with bimodal LDA exhibiting slightly more diverse targets at a lower precision.
A Novel Two-step Integer-pixel Motion Estimation Algorithm for HEVC Encoding on a GPU
1Peking University, China, People's Republic of; 2Advanced Micro Devices Co., Ltd., China, People's Republic of
Integer-pixel Motion Estimation (IME) is one of the fundamental and time-consuming modules in encoding. In this paper, a novel two-step IME algorithm is proposed for High Efficiency Video Coding (HEVC) on a Graphics Processing Unit (GPU). First, the whole search region is roughly investigated with a predefined search pattern, which is analyzed in detail to effectively reduce the complexity. Then, the search result is further refined only in the zones around the best candidates of the first step. By dividing IME into two steps, the proposed algorithm combines the advantage of one-step algorithms in synchronization and the advantage of multiple-step algorithms in complexity. According to the experimental results, the proposed algorithm achieves up to a 3.64 times speedup compared with previous representative algorithms, while the search accuracy is maintained. Since the IME algorithm is independent of other modules, it is a good choice for different GPU-based encoding applications.
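The coarse-then-refine structure can be illustrated with a small sequential sketch. It assumes a uniform sparse grid as the first-step pattern, refinement around a single best candidate, and the sum of absolute differences (SAD) as the matching cost. None of these choices is confirmed by the abstract, the function names are hypothetical, and the code does not model GPU execution.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n x n block of cur at (bx, by) and the block of
    ref displaced by the candidate motion vector (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def two_step_search(cur, ref, bx, by, n, search_range, step):
    """Return ((dx, dy), cost) for the block at (bx, by).

    Assumes the zero-motion block lies inside ref, so that at least one
    candidate is always valid.
    """
    h, w = len(ref), len(ref[0])

    def valid(dx, dy):
        return (0 <= bx + dx and bx + dx + n <= w and
                0 <= by + dy and by + dy + n <= h)

    def best_of(candidates):
        scored = [(sad(cur, ref, bx, by, dx, dy, n), dx, dy)
                  for dx, dy in candidates if valid(dx, dy)]
        return min(scored)  # ties resolved by smallest (dx, dy)

    grid = range(-search_range, search_range + 1, step)
    # Step 1: rough pass over the whole search region on a sparse grid.
    _, cx, cy = best_of([(dx, dy) for dy in grid for dx in grid])
    # Step 2: dense refinement only in the zone around the best candidate.
    near = range(-step + 1, step)
    fine = [(cx + dx, cy + dy) for dy in near for dx in near
            if abs(cx + dx) <= search_range and abs(cy + dy) <= search_range]
    cost, mx, my = best_of(fine)
    return (mx, my), cost
```

All candidates within a step are scored independently of one another, which is the property that lets each step map onto parallel hardware with a single synchronization point between the two steps.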
An Evaluation of Video Browsing on Tablets with the ThumbBrowser
Klagenfurt University, Austria
We present an extension and evaluation of a novel interaction concept for video browsing on tablets. It can be argued that the best user experience for watching video on tablets can be achieved when the device is held in landscape orientation. Most mobile video players ignore this fact and make the interaction unnecessarily hard when the tablet is held with both hands. Naturally, in this hand posture only the thumbs are available for interaction. Our ThumbBrowser interface takes this into account and combines it in its latest iteration with content analysis information as well as two different interaction methods. The interface was already introduced in a basic form in earlier work. In this paper we report on extensions that we applied and show first evaluation results in comparison to standard video players. We are able to show that our video browser is superior in terms of search accuracy and user satisfaction.
Illumination-Preserving Embroidery Simulation for Non-photorealistic Rendering
East China Normal University, China, People's Republic of
We present an illumination-preserving embroidery simulation method for Non-photorealistic Rendering (NPR). Our method turns an image into the embroidery style with its illumination preserved by intrinsic decomposition. This illumination-preserving feature distinguishes our method from previous work and eliminates its problem of inconsistent illumination. In our method, a two-dimensional stitch model is developed with some of the most commonly used stitch patterns, and the input image is intrinsically decomposed into a reflectance image and its corresponding shading image. The Chan-Vese active contour is adopted to segment the input image into regions, from which parameters are derived for stitch patterns. Appropriate stitch patterns are applied back onto the base material region by region and rendered with the intrinsic shading of the input image. Experimental results show that our method is capable of performing fine embroidery simulations while preserving the illumination of the input image.
Spatial Verification via Compact Words for Mobile Instance Search
University of Electronic Science and Technology of China, China, People's Republic of
Instance search is a retrieval task proposed by TRECVID, which searches for video segments or images relevant to a specific instance (object, person, or location). Selecting more representative visual words is a significant challenge for instance search, since spatial relations between features are leveraged in many state-of-the-art methods. However, with the popularity of mobile devices, it is now feasible to use multiple similar photos from mobile devices as a query to extract representative visual words for instance search. This paper proposes a novel approach for mobile instance search, based on spatial analysis with a few representative visual words extracted from multi-photos. We develop a scheme that applies three criteria to rank visual words: BM25 with exponential IDF (EBM25), significance in multi-photos, and separability. Then, a spatial verification method based on position relations is applied to a few visual words to obtain the weight of each photo selected. In consideration of the limited bandwidth and instability of wireless channels, our approach transmits only a few visual words from the mobile client to the server, and the number of visual words varies with bandwidth. We evaluate our approach on the Oxford building dataset, and the experimental results demonstrate a notable improvement in average precision over several state-of-the-art methods, including spatial coding, query expansion and multiple photos.
Adaptive and optimal combination of local features for image retrieval
1University Paris-Est, IGN/SR, France; 2Nicéphore Cité, Chalon sur Saône, France
Given the large number of local feature detectors and descriptors in the literature of Content-Based Image Retrieval (CBIR), in this work we propose a solution to predict the optimal combination of features for improving image retrieval performance, based on the spatial complementarity of interest point detectors. We review several complementarity criteria of detectors and employ them in a regression-based prediction model, designed to select the suitable detector combination for a dataset. The proposal can improve retrieval performance even further by selecting the optimal combination for each image (and not only globally for the dataset), and it is also useful for the optimal fitting of some parameters. The proposal is evaluated on three datasets to validate its effectiveness and stability. The experimental results highlight the importance of the spatial complementarity of the features for improving retrieval, and show the advantage of using this model to optimally adapt the detector combination and some parameters.
Deep Convolutional Neural Network for Bidirectional Image Sentence Mapping
National University of Defense Technology, China, People's Republic of
With the rapid development of the Internet and the explosion of data volume, it is important to access cross-media big data, including text, image, audio and video, efficiently and accurately. However, content heterogeneity and the semantic gap make it challenging to retrieve such cross-media archives. Existing approaches try to learn the connection between multiple modalities by directly using hand-crafted low-level features, and the learned correlations are merely constructed with high-level feature representations without considering semantic information. To further exploit the intrinsic structures of multimodal data representations, it is essential to build an interpretable correlation between these heterogeneous representations. In this paper, a deep model is proposed to first learn the high-level feature representation shared by different modalities, such as texts and images, with a convolutional neural network (CNN). Moreover, the learned CNN features can reflect the salient objects as well as the details in the images and sentences. Experimental results demonstrate that the proposed approach outperforms current state-of-the-art baseline methods on the public Flickr8K dataset.
Online User Modeling for Interactive Streaming Image Classification
State Key Laboratory for Novel Software Technology, Nanjing University, P R China
In view of the explosive growth of personal images, this paper proposes an online user modeling method for the categorization of streaming images. In the proposed framework, user interaction is brought in after an automatic classification by the classification model, and several strategies are used for online user modeling. Firstly, to cover diverse personalized taxonomies, we describe images from multiple views. Secondly, to train the classification model gradually, we use an incremental variant of the nearest class mean classifier and update the centroids incrementally during the categorization. Finally, to learn the diverse interests of different users, we propose an online learning strategy to learn the weights of different feature views. Using the proposed method, users can categorize streaming images flexibly and freely without any pre-labeled images or pre-trained classifiers. As the classification goes on, the efficiency keeps increasing, which can ease the user's interaction burden significantly. The experimental results and a user study demonstrate the effectiveness of the proposed approach.
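The incremental nearest class mean idea can be sketched as below for a single feature view. The class and method names are hypothetical, and the per-view weight learning described in the abstract is deliberately left out, so this is an illustration of the running-mean update only, not the proposed method.

```python
class IncrementalNCM:
    """Nearest class mean classifier with running-mean centroid updates."""

    def __init__(self):
        self.means = {}   # label -> centroid (list of floats)
        self.counts = {}  # label -> number of samples seen

    def predict(self, x):
        """Return the label of the closest centroid, or None if no class
        has been seen yet (no pre-trained classifier is assumed)."""
        if not self.means:
            return None
        return min(
            self.means,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, self.means[c])),
        )

    def update(self, x, label):
        """Fold one user-confirmed sample into its class centroid."""
        if label not in self.means:
            self.means[label] = [float(a) for a in x]
            self.counts[label] = 1
            return
        self.counts[label] += 1
        n = self.counts[label]
        # Running mean: m_new = m_old + (x - m_old) / n
        self.means[label] = [m + (a - m) / n
                             for m, a in zip(self.means[label], x)]
```

A typical interactive loop under these assumptions would call `predict` on each incoming image, let the user accept or correct the label, and then call `update` with the confirmed label, so new categories can appear at any time.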
Unsupervised Multiple Object Cosegmentation via Ensemble MIML Learning
State Key Laboratory for Novel Software Technology, Nanjing University, P R China
Multiple foreground cosegmentation (MFC) has recently become a new research topic in computer vision. This paper proposes a framework for unsupervised multiple object cosegmentation, which is composed of three components: unsupervised label generation, saliency pseudo-annotation and cosegmentation based on MIML learning. Based on object detection, unsupervised label generation is performed with a two-stage object clustering method to obtain accurate, consistent labels between common objects without any user intervention. Then, the object labels are propagated to the object saliency obtained from a saliency detection method to complete the saliency pseudo-annotation. This turns an unsupervised MFC problem into a supervised MIML learning problem. Finally, an ensemble MIML framework is introduced to achieve image cosegmentation based on random feature selection. The experimental results on the ICoseg and FlickrMFC data sets demonstrate the effectiveness of the proposed approach.
Discovering Geographic Regions in the City Using Social Multimedia and Open Data
University of Amsterdam, Netherlands, The
In this paper we investigate the potential of social multimedia and open data for automatically identifying regions within the city. We conjecture that the regions may be characterized by specific patterns related to their visual appearance, the manner in which the social media users describe them, and the human mobility patterns. Therefore, we collect a dataset of Foursquare venues, their associated images and users, which we further enrich with a collection of city-specific Flickr images, annotations and users. Additionally, we collect a large number of neighbourhood statistics related to e.g., demographics, housing and services. We then represent visual content of the images using a large set of semantic concepts output by a convolutional neural network and extract latent Dirichlet topics from their annotations. User, text and visual information as well as the neighbourhood statistics are further aggregated at the level of postal code regions, which we use as the basis for detecting larger regions in the city. To identify those regions, we perform clustering based on individual modalities as well as their ensemble. The experimental analysis shows that the automatically detected regions are meaningful and have a potential for better understanding dynamics and complexity of a city.
Facial Expression Recognition by Fusing Gabor and Local Binary Pattern Features
1University of Science and Technology of China, China, People's Republic of; 2Nanjing University, China, People's Republic of
Obtaining effective and discriminative facial appearance descriptors is a challenging task for facial expression recognition (FER). In this paper, a new FER method is proposed which combines two of the most successful facial appearance descriptors, namely Gabor filters and Local Binary Patterns (LBP), considering that the former can represent facial shape and appearance over a broader range of scales and orientations while the latter can capture subtle appearance details. Firstly, feature vectors of the Gabor and LBP representations are generated from the preprocessed face images. Secondly, feature fusion is applied to combine these two vectors, followed by dimensionality reduction. Finally, a support vector machine is adopted to classify prototypical facial expressions from still images. Experimental results on the CK+ database demonstrate that the proposed method improves performance compared with using the Gabor or LBP descriptor alone, and outperforms several other methods.
Stochastic Decorrelation Constraint Regularized Auto-Encoder for Visual Recognition
Computer School of Wuhan University, China
Deep neural networks have achieved state-of-the-art performance in many applications such as image classification, object detection and semantic segmentation. However, optimization remains difficult when training networks with a huge number of parameters. In this work, we propose a novel regularizer called stochastic decorrelation constraint (SDC), imposed on the hidden layers of large networks, which can significantly improve the networks' generalization capacity. SDC reduces the co-adaptations of the hidden neurons in an explicit way, with a clear objective function. Meanwhile, we show that training the network with our regularizer has the effect of training an ensemble of exponentially many networks. We apply the proposed regularizer to the auto-encoder for visual recognition tasks. Compared to the auto-encoder without any regularizers, the SDC-constrained auto-encoder can extract features with less redundancy. Comparative experiments on the MNIST database and the FERET database demonstrate the superiority of our method. When the size of the training data is reduced, the optimization of the network becomes much more challenging, yet our method shows even larger advantages over the conventional methods.
The Perceptual Lossless Quantization of Spatial Parameter for 3D Audio Signals
1State Key Laboratory of Software Engineering, Wuhan University, China; 2National Engineering Research Center for Multimedia Software School of Computer, Wuhan University, Wuhan, China; 3Hubei Provincial Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China
With the development of multichannel audio systems, 3D audio systems have already come into our lives. But the increasing number of channels brings challenges to the storage and transmission of large amounts of data. Spatial Audio Coding (SAC), the mainstream of 3D audio coding technologies, is key to reproducing 3D multichannel audio signals with efficient compression. Just Noticeable Difference (JND) characteristics of the human auditory system can be utilized to reduce spatial perceptual redundancy in the spatial parameter quantization process of SAC. However, the current quantization methods of SAC do not fully incorporate the JND characteristics. In this paper, we propose a Perceptual Lossless Parameters Quantization (PLSPQ) method, in which the azimuthal and elevational quantization step sizes of spatial parameters are combined with JNDs. Both objective and subjective experiments were conducted to demonstrate the high efficiency of the PLSPQ method. Compared with the reference methods SLQP-L and SLQP-H, the quantization codebook size of PLSPQ is decreased by 16.99% and 27.79% respectively, while preserving similar listening quality.
A Comparative Study For Known Item Visual Search Using Position Color Feature Signatures
Charles University, Czech Republic
According to the results of the Video Browser Showdown competition, position-color feature signatures proved to be an effective model for visual known-item search tasks in BBC video collections. In this paper, we investigate details of the retrieval model based on feature signatures, given a state-of-the-art known item search tool - Signature-based Video Browser. We also evaluate a preliminary comparative study for three variants of the model. In the discussion, we analyze logs and provide clues for understanding the performance of the model variants.
Smart loudspeaker arrays for self-coordination and user tracking
School of Electrical Engineering, KAIST, Korea, Republic of (South Korea)
The Internet of Things paradigm aims at developing new services through the interconnection of sensing and actuating devices. In this work, we demonstrate what can be achieved through the interaction between multiple sound devices arbitrarily deployed in space but connected through a unified network. In particular, we introduce techniques to realize a smart sound array through simultaneous synchronization and layout coordination of multiple sound devices. As a promising application of the smart sound array, we show that acoustic tracking of a user-location is possible by analyzing scattering waves induced from the exchange of acoustic signals between multiple sound objects.
Video Search via Ranking Network With Very Few Query Exemplars
1Xi'an Jiaotong University, China, People's Republic of; 2Carnegie Mellon University
This paper addresses the challenge of video search with only a handful of query exemplars by proposing a triplet ranking network-based method. In the typical scenario for a video search system, a user begins the query process by using the metadata-based text-to-video search module to find an initial set of videos of interest in the video repository. As bridging the semantic gap between text and video is very challenging, usually only a handful of relevant videos appear in the initial retrieved results. The user can then use the video-to-video search module to train a new classifier to search for more relevant videos. However, since we found that statistically fewer than 5 videos are initially relevant, training a complex event classifier with a handful of examples is extremely challenging. Therefore, it is necessary to improve video retrieval methods so that they work with a handful of positive example videos. The proposed triplet ranking network is mainly designed for this situation and has the following properties: 1) The ranking network can learn an off-line similarity matching projection, which is event independent, from previous video search tasks or datasets, so that even with only one query video, we can search for its related videos. The method can then transfer previous knowledge to the specific video retrieval task as more and more related videos are retrieved, to further improve the retrieval performance; 2) It casts the video search task as a ranking problem and can exploit partial ordering information in the dataset; 3) Based on the above two merits, the method is suitable for the case where only a handful of positive examples are available. Experimental results show the effectiveness of our proposed method on video retrieval with only a handful of positive exemplars.
Using Object Detection, NLP, and Knowledge Bases to Understand the Message of Images
1University of Mannheim, Germany; 2University of Mannheim, Germany; 3University of Mannheim, Germany; 4CEPS - College of Engineering and Physical Sciences, Department for Computer Science, University of New Hampshire, Durham, New Hampshire, USA
With the increasing amount of multimodal content from social media posts and news articles, it becomes important not only to allow for conceptual labeling and multimodal (topic) modeling of images and their affiliated texts, but also to understand the core message, which we call the gist. Detecting the gist enables new downstream retrieval, tagging, and clustering tasks. The proposed method makes use of the Wikipedia and DBpedia knowledge bases, aiming to leverage associative and semantic connections between the image, its caption, and the expressed gist. Within a learning-to-rank approach, we show the usefulness of jointly leveraging image and caption signals (best MAP: 0.74). Furthermore, an automatic image tagging and caption generation API is compared to manually given image and caption signals. We show how difficult it is to find the correct gists, especially abstract, non-depictable ones, and how those benefit from different input signals.
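For readers unfamiliar with the MAP figure quoted in this abstract, the following is a minimal sketch of how mean average precision is commonly computed over ranked lists. It assumes binary relevance labels and that every relevant item appears somewhere in each ranked list; the function names are illustrative, and the authors' exact evaluation protocol is not given in the abstract.

```python
from typing import List


def average_precision(ranked_relevance: List[bool]) -> float:
    """Average of precision@k taken at each rank k that holds a relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings: List[List[bool]]) -> float:
    """Mean of the per-query average precision values."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two toy queries: relevant gists at ranks 1 and 3, and at rank 2.
score = mean_average_precision([[True, False, True], [False, True]])
```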
Exploring Large Movie Collections: Comparing Visual Berrypicking and Traditional Browsing
1Otto von Guericke University Magdeburg, Germany; 2Hasso Plattner Institute for Software Systems Engineering, Potsdam, Germany; 3University of Potsdam, Germany
We compare Visual Berrypicking, an interactive approach allowing users to explore large and highly faceted information spaces using similarity-based two-dimensional maps, with traditional browsing techniques. For large datasets, current projection methods used to generate map-like overviews suffer from increased computational costs and a loss of accuracy, resulting in inconsistent visualizations. We propose to interactively align inexpensive small maps, showing local neighborhoods only, which ideally creates the impression of panning a large map. For evaluation, we designed a web-based prototype for movie exploration and compared it to the web interface of The Movie Database (TMDb) in an online user study. Results suggest that users are able to effectively explore large movie collections by hopping from one neighborhood to the next. Additionally, due to the projection of movie similarities, interesting links between movies can be found more easily, and thus serendipitous discoveries are more likely than with traditional browsing.
Binaural Sound Source Distance Reproduction Based on Distance Variation Function and Artificial Reverberation
1State Key Laboratory of Software Engineering, Wuhan University, China; 2National Engineering Research Center for Multimedia Software, Computer School of Wuhan University, China; 3Hubei Provincial Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China; 4School of Physics and Electronic Science, Guizhou Normal University, China
In this paper, a method combining the distance variation function (DVF) and the image source method (ISM) is presented to generate binaural 3D audio with an accurate feeling of distance. The DVF is introduced to indicate the change in intensity and inter-aural difference when the distance between listener and source changes. Then an artificial reverberation simulated by ISM is added. The reverberation introduces the direct-to-reverberant energy ratio, which provides an absolute cue for distance perception. The distance perception test results indicate improved distance perception when sound sources are located within 50 cm. In addition, the variance of the perceived distance was much smaller than that obtained using the DVF only. This reduction in variance is evidence that the proposed method can generate 3D audio with a more accurate and steadier feeling of distance.
Compressing Visual Descriptors of Image Sequences
JOANNEUM RESEARCH, Austria
In recent years, there has been significant progress in developing more compact visual descriptors, typically by aggregating local descriptors. However, all these methods are descriptors for still images, and are typically applied independently to (key) frames when used in instance search tasks in video. Thus, they do not make use of the temporal redundancy of the video, which has negative impacts on the descriptor size and the matching complexity. We propose a compressed descriptor for image sequences, which encodes a segment of video using a single descriptor. The proposed approach is a framework that can be used with different local descriptors and description methods. We describe the extraction and matching process for the descriptor and provide evaluation results on a large video data set.
Effect of Junk Images on Inter-Concept Distance Measurement: Positive or Negative?
Osaka University, Japan
In this paper, we focus on the problem of inter-concept distance measurement (ICDM), which is the task of computing the distance between two concepts. ICDM is generally achieved by constructing a visual model of each concept and calculating the dissimilarity score between the two visual models. The process of visual concept modeling often suffers from the problem of junk images, i.e., images whose visual content is not related to the given text tags. It is therefore naively expected that junk images also have a negative effect on the performance of ICDM. On the other hand, junk images might be related to their text tags in a certain (non-visual) sense, because the text tags are given not by automated systems but by humans. Hence, the following question arises: Is the effect of junk images on the performance of ICDM positive or negative? In this paper, we aim to answer this non-trivial question experimentally using a unified framework for ICDM and junk image detection. Surprisingly, our experimental results indicate that junk images have a positive effect on the performance of ICDM.
A Virtual Reality Framework for Multimodal Imagery for Vessels in Polar Regions
1University of Delaware, United States of America; 2University of Alaska Fairbanks, United States of America; 3Alfred Wegener Institute, Germany
Maintaining total awareness when maneuvering an ice-breaking vessel is key to its safe operation. Camera systems are commonly used to augment the capabilities of those piloting the vessel, but rarely are these camera systems used beyond simple video feeds. To aid in visualization for decision making and operation, we present a scheme for combining multiple modalities of imagery into a cohesive Virtual Reality application which provides the user with an immersive, real-scale view of conditions around a research vessel operating in polar waters. The system incorporates imagery from a 360-degree long-wave infrared camera as well as an optical-band stereo camera system. The Virtual Reality application allows the operator multiple natural ways of interacting with and observing the data, and provides a framework for further inputs and derived observations.
Multimodal Video-to-Video Linking: Turning to the Crowd for Insight and Evaluation
1Radboud University, Nijmegen, Netherlands, The; 2Vienna University of Technology, Austria; 3University of Twente, Netherlands, The; 4Dublin City University, Ireland; 5EURECOM, France
Video-to-video linking systems allow users to explore and exploit the content of a large-scale multimedia collection interactively and without the need to formulate specific queries. This paper presents a short introduction to video-to-video linking (also called ‘video hyperlinking’) and discusses the latest version of the linking task at the TRECVid 2016 video retrieval benchmark. Specifically, in 2016, the emphasis is on multimodality as it is used by videomakers in order to communicate their message. Crowdsourcing makes three critical contributions. First, it allows us to verify the multimodal nature of the anchors (queries) used in the task. Second, crowdsourcing makes it possible to evaluate the performance of video-to-video linking systems at a large scale. Third, crowdsourcing gives us insight into how people understand the relevance relationship between two linked video segments. This insight is valuable since the relationship between video segments can manifest itself at different levels of abstraction.
Movie Recommendation via BLSTM
Tsinghua University, China, People's Republic of
Traditional recommender systems have achieved remarkable success. However, they only consider users’ long-term interests, ignoring situations in which new users do not have any profile or users delete their tracking information. To solve this problem, session-based recommendation based on Recurrent Neural Networks (RNNs) has been proposed, which makes recommendations by taking into account only the behaviour of users within a period of time. The model showed promising improvements over traditional recommendation approaches.
However, the RNN model considers only the previously watched movies and ignores the movies that follow. In this paper, we apply bidirectional long short-term memory (BLSTM) to movie recommender systems to deal with the above problems. Experiments on the MovieLens dataset demonstrate relative improvements over previously reported results on the Recall@N metric.
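As a reminder of what the Recall@N metric measures, the following is a minimal sketch using the general set-based definition: the fraction of relevant items that appear among the top N recommendations. The function and variable names are illustrative, and the abstract does not state which variant of Recall@N the authors used.

```python
from typing import Hashable, Iterable, Sequence


def recall_at_n(
    ranked_items: Sequence[Hashable],
    relevant_items: Iterable[Hashable],
    n: int,
) -> float:
    """Fraction of the relevant items found in the top-n of the ranked list."""
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    top_n = set(ranked_items[:n])
    return len(top_n & relevant) / len(relevant)


# Toy example: two held-out movies, one of which is ranked in the top 2.
recommended = ["movie_a", "movie_b", "movie_c", "movie_d"]
held_out = ["movie_b", "movie_d"]
recall_top2 = recall_at_n(recommended, held_out, n=2)
recall_top4 = recall_at_n(recommended, held_out, n=4)
```

In session-based settings the relevant set is often a single next item, in which case this reduces to checking whether that item is in the top N.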