Person Retrieval with Deep Learning

Primary Supervisor

Gao, Yongsheng

Other Supervisors

Liew, Wee-Chung

Xiong, Shengwu

Shen, Chunhua

Date
2022-01-07
Abstract

Person retrieval aims to match person images across multiple non-overlapping camera views, and it underpins a wide range of important applications in intelligent video analysis. The task remains challenging due to dramatic changes in visual appearance caused by large intra-class variations from human pose and camera viewpoint, misaligned person detection, and occlusion. Learning discriminative features under these conditions is therefore the core issue in person retrieval. According to the input modality, person retrieval can be categorised into image-based retrieval and video-based retrieval. Despite decades of effort, person retrieval remains unsolved due to the following factors: 1) the large intra-class variations (e.g., pose variation) of pedestrian images, which lead to dramatic changes in their appearance; 2) the reliance on heuristically coarse-grained region strips or on pixel-level annotations borrowed directly from pretrained human parsing models, which impedes the efficacy and practicality of region representation learning; and 3) the absence of useful temporal cues for boosting video person retrieval systems.

This thesis reports a series of technical solutions to the above challenges in person retrieval. To address the large intra-class variations among person images, we introduce an improved triplet loss so that global feature representations of the same identity are closely clustered for person retrieval. To learn discriminative region representations within fine-grained segments while avoiding expensive pixel-level annotations, we introduce a novel identity-guided human region segmentation method that predicts informative region segments, enabling discriminative region representation learning for person retrieval.
To extract useful temporal cues for video person retrieval, we build a two-stream architecture, named the appearance-gait network, that jointly learns appearance features from RGB video clips and gait features from silhouette video clips. To provide further potentially useful information for person retrieval, we introduce a lightweight and effective knowledge distillation method for facial landmark detection. We believe the proposed person retrieval approaches can serve as benchmark methods and provide new perspectives on the person retrieval task.
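The improved triplet loss mentioned in the abstract builds on the standard margin-based triplet formulation; the thesis's specific improvement is not detailed here, so the following is only a minimal sketch of the baseline idea it extends (all names are hypothetical, plain Python in place of a deep-learning framework):

```python
# Sketch of a margin-based triplet loss: same-identity embeddings are pulled
# together, different-identity embeddings are pushed apart by at least a margin.
# This is the standard formulation, not the thesis's improved variant.

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def triplet_loss(anchor, positive, negative, margin=0.3):
    """max(0, d(a, p) - d(a, n) + margin): the loss is zero once the
    negative is farther from the anchor than the positive by the margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

a = [1.0, 0.0]   # anchor embedding
p = [1.1, 0.0]   # same identity, slightly different pose
n = [0.0, 2.0]   # different identity
print(triplet_loss(a, p, n))  # → 0.0 (triplet already satisfies the margin)
```

In a real system the vectors would be network embeddings of person images, and the loss would be minimised over many mined triplets per batch.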
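The appearance-gait network described above is a two-stream design. As a rough illustration of the late-fusion idea only (not the thesis's actual architecture, whose streams would be deep networks), the toy sketch below concatenates an appearance descriptor from RGB frames with a gait descriptor from silhouette frames; every function here is a hypothetical stand-in:

```python
# Toy two-stream late fusion: appearance features from RGB clips and
# gait features from silhouette clips, concatenated into one descriptor.
# Both "streams" are simplistic stand-ins for real feature extractors.

def appearance_stream(rgb_clip):
    """Stand-in appearance feature: mean value per colour channel."""
    n = len(rgb_clip)
    return [sum(frame[c] for frame in rgb_clip) / n for c in range(3)]

def gait_stream(silhouette_clip):
    """Stand-in gait feature: foreground-pixel fraction per frame."""
    return [sum(frame) / len(frame) for frame in silhouette_clip]

def fuse(rgb_clip, silhouette_clip):
    """Late fusion: concatenate the two stream features."""
    return appearance_stream(rgb_clip) + gait_stream(silhouette_clip)

rgb = [(0.9, 0.1, 0.2), (0.8, 0.2, 0.2)]  # two frames of per-channel means
sil = [[0, 1, 1, 0], [1, 1, 0, 0]]        # two binary silhouette frames
print(fuse(rgb, sil))  # 3 appearance dims followed by 2 gait dims
```

The fused descriptor would then be compared across camera views with the same distance used for retrieval.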

Thesis Type

Thesis (PhD Doctorate)

Degree Program

Doctor of Philosophy (PhD)

School

School of Eng & Built Env

Rights Statement

The author owns the copyright in this thesis, unless stated otherwise.

Subject

person retrieval

image-based retrieval

video-based retrieval

large intra-class variations

fine-grained segments

pixel-level annotations
