Person Retrieval with Deep Learning
File version
Author(s)
Primary Supervisor
Gao, Yongsheng
Other Supervisors
Liew, Wee-Chung
Xiong, Shengwu
Shen, Chunhua
Editor(s)
Date
Size
File type(s)
Location
License
Abstract
Person retrieval aims to match person images across multiple non-overlapping camera views and underpins a wide range of important applications in intelligent video analysis. The task remains challenging because of dramatic changes in visual appearance caused by large intra-class variations in human pose and camera viewpoint, misaligned person detection, and occlusion. Learning discriminative features under these conditions is therefore the core issue in person retrieval. According to the input modality, person retrieval can be categorised into image-based retrieval and video-based retrieval. Despite decades of effort, person retrieval remains unsolved due to the following factors: 1) large intra-class variations (e.g., pose variation) of pedestrian images lead to dramatic changes in their appearance; 2) existing methods employ only heuristically defined coarse-grained region strips or pixel-level annotations borrowed directly from pretrained human parsing models, impeding the efficacy and practicality of region representation learning; and 3) useful temporal cues for boosting video person retrieval systems are absent. This thesis reports a series of technical solutions to the above challenges. To address the large intra-class variations among person images, we introduce an improved triplet loss so that global feature representations of the same identity are closely clustered for person retrieval. To learn discriminative region representations within fine-grained segments while avoiding expensive pixel-level annotations, we introduce a novel identity-guided human region segmentation method that predicts informative region segments, enabling discriminative region representation learning for person retrieval.
To extract useful temporal cues for video person retrieval, we build a two-stream architecture, named the appearance-gait network, to jointly learn appearance features and gait features from RGB video clips and silhouette video clips. To provide further potentially useful information for person retrieval, we introduce a lightweight and effective knowledge distillation method for facial landmark detection. We believe that the proposed person retrieval approaches can serve as benchmark methods and provide new perspectives on the person retrieval task.
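The abstract's first contribution builds on the triplet loss, which pulls same-identity embeddings together and pushes different-identity embeddings at least a margin apart. A minimal sketch of the standard formulation follows; the thesis's improved variant is not detailed in this record, and the toy feature vectors and margin value here are illustrative assumptions only.

```python
# Sketch of the standard triplet loss underlying the thesis's improved
# variant (the improvement itself is not specified in this abstract).
# Embeddings and the margin value are illustrative assumptions.

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge loss: same-identity distance should be smaller than
    different-identity distance by at least `margin`."""
    return max(0.0, euclidean(anchor, positive)
                    - euclidean(anchor, negative) + margin)

# Toy embeddings: anchor and positive share an identity, negative does not.
a = [0.1, 0.9]
p = [0.2, 0.8]
n = [0.9, 0.1]

well_separated = triplet_loss(a, p, n)   # margin satisfied -> zero loss
violated = triplet_loss(a, n, p)         # wrong ordering -> positive loss
```

In practice such a loss is minimised over mini-batches of person images so that retrieval can rank gallery images of the same identity closest to a query.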
Journal Title
Conference Title
Book Title
Edition
Volume
Issue
Thesis Type
Thesis (PhD Doctorate)
Degree Program
Doctor of Philosophy (PhD)
School
School of Eng & Built Env
Publisher link
Patent number
Funder(s)
Grant identifier(s)
Rights Statement
The author owns the copyright in this thesis, unless stated otherwise.
Item Access Status
Note
Access the data
Related item(s)
Subject
person retrieval
image-based retrieval
video-based retrieval
large intra-class variations
fine-grained segments
pixel-level annotations