Research Projects

Please see our PLUS Lab website for more details.

Few-shot Learning

Despite recent success of deep neural networks, it remains challenging to efficiently learn new visual concepts from limited training data. To address this problem, a prevailing strategy is to build a meta-learner that learns prior knowledge on learning from a small set of annotated data. We propose a novel meta-learning method for few-shot classification based on two simple attention mechanisms: one is a spatial attention to localize relevant object regions and the other is a task attention to select similar training data for label prediction. We implement our method via a dual-attention network and design a semantic-aware meta-learning loss to train the meta-learner network in an end-to-end manner.

A Dual Attention Network with Semantic Embedding for Few-shot Learning, [pdf]
Shipeng Yan, Songyang Zhang, Xuming He
AAAI Conference on Artificial Intelligence (AAAI),2019

One-shot Action Localization by Learning Sequence Matching Network, [pdf]
Hongtao Yang, Xuming He, Fatih Porikli
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

Connecting Vision and Language

Linguistic style is an essential part of written communication, which can affect both clarity and attractiveness. With recent advances in vision and language, we can start to tackle the problem of generating image captions that are both visually grounded and appropriately styled. We develop a model that learns to generate visually relevant styled captions from a large corpus of styled text without aligned images. One key component is a novel and concise semantic term representation generated using natural language processing techniques and frame semantics.

SemStyle: Learning to Generate Stylised Image Captions using Unaligned Text, [pdf]
Alexander Mathews, Lexing Xie, Xuming He
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

SentiCap: Generating Image Descriptions with Sentiments, [pdf] [arXiv]
Alexander Mathews, Lexing Xie, Xuming He
AAAI Conference on Artificial Intelligence (AAAI-16), 2016

Object Detection in Context

Exploring contextual relations is one of the key factors to improve object detection under challenging viewing condition and to scale up recognition to large numbers of object classes. We consider two effective approaches that incorporate contextual information: object codetection, which jointly detects object instances in a set of related images, and structural Hough voting, which models the context from 2.5D perspective for object localization under heavy occlusion.

Efficient Scene Layout Aware Object Detection for Traffic Surveillance, [pdf]
Tao Wang, Xuming He, Songzhi Su, Yin Guan
IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017.
Traffic Surveillance Workshop and Challenge Best Paper Award.

Learning to Co-Generate Object Proposals with a Deep Structured Network, [pdf]
Zeeshan Hayder, Xuming He, Mathieu Salzmann
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

Object and Scene Parsing

We address the problem of joint detection and segmentation of multiple object instances in an image, a key step towards scene understanding. Inspired by data-driven methods, we propose an exemplar-based approach to the task of instance segmentation, in which a set of reference image/shape masks is used to find multiple objects. We design a novel CRF framework that jointly models object appearance, shape deformation, and object occlusion.

Deep Free-Form Deformation Network for Object-Mask Registration, [pdf]
Haoyang Zhang, Xuming He
International Conference on Computer Vision (ICCV), 2017

Learning Dynamic Hierarchical Models for Anytime Scene Labeling, [pdf] [arXiv]
Buyu Liu, Xuming He
European Conference on Computer Vision (ECCV), 2016

Holistic Video Understanding

We address the problem of integrating object reasoning with supervoxel labeling in multiclass semantic video segmentation. To this end, we first propose an object-augmented CRF in spatio-temporal domain, which captures long-range dependency between supervoxels, and imposes consistency between object and supervoxel labels. We develop an efficient inference algorithm to jointly infer the supervoxel labels, object activations and their occlusion relations for a large number of object hypotheses.

3D Object Structure Recovery via Semi-supervised Learning on Videos, [pdf]
Qian He, Desen Zhou, Xuming He
British Machine Vision Conference (BMVC),2018

Multiclass Semantic Video Segmentation with Object-Level Active Inference, [pdf] [suppl zip]
Buyu Liu, Xuming He
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015

More details on previous projects can be found here.