CLE-ViT: Contrastive Learning Encoded Transformer for Ultra-Fine-Grained Visual Categorization

File version

Version of Record (VoR)

Author(s)
Yu, X
Wang, J
Gao, Y
Date
2023
Location

Macao, China

Abstract

Ultra-fine-grained visual classification (ultra-FGVC) aims to classify sub-grained categories of fine-grained objects. This inevitably requires learning discriminative representations from a limited training set. Exploring intrinsic features of the object itself, e.g., predicting the rotation of a given image, has shown great progress towards learning discriminative representations. Yet none of these works considers explicit supervision for learning mutual information at the instance level. To this end, this paper introduces CLE-ViT, a novel contrastive learning encoded transformer, to address this fundamental problem in ultra-FGVC. The core design is a self-supervised module that performs self-shuffling and masking and then distinguishes these altered images from other images. This drives the model to learn an optimized feature space that has a large inter-class distance while remaining tolerant to intra-class variations. By incorporating this self-supervised module, the network acquires more knowledge from the intrinsic structure of the input data, which improves its generalization ability without requiring extra manual annotations. CLE-ViT achieves strong performance on 7 publicly available datasets, demonstrating its effectiveness in the ultra-FGVC task. The code is available at https://github.com/Markin-Wang/CLEViT.
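
The self-supervised module described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the repository linked above for that). The function names `make_positive`, `embed`, and `triplet_loss`, the grid size, the mask fraction, and the choice of a margin-based triplet loss are all assumptions made for illustration; the abstract only states that an image is shuffled and masked and then distinguished from other images.

```python
import numpy as np


def make_positive(image, rng, mask_frac=0.25, grid=2):
    """Return a masked, patch-shuffled copy of `image` (H x W x C).

    Hypothetical helper: mask_frac and grid are illustrative values.
    """
    h, w = image.shape[:2]
    out = image.copy()

    # Masking: zero out one random rectangle.
    mh, mw = max(1, int(h * mask_frac)), max(1, int(w * mask_frac))
    top = int(rng.integers(0, h - mh + 1))
    left = int(rng.integers(0, w - mw + 1))
    out[top:top + mh, left:left + mw] = 0

    # Shuffling: cut the image into grid x grid patches and permute them.
    # Pixels beyond grid * (h // grid) or grid * (w // grid) stay in place.
    ph, pw = h // grid, w // grid
    patches = [
        out[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw].copy()
        for r in range(grid)
        for c in range(grid)
    ]
    order = rng.permutation(len(patches))
    for slot, src in enumerate(order):
        r, c = divmod(slot, grid)
        out[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw] = patches[src]
    return out


def embed(image):
    """Stand-in for a transformer backbone: per-channel mean (assumption)."""
    return image.reshape(-1, image.shape[-1]).mean(axis=0)


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin loss: pull the positive towards the anchor, push the negative away."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return float(max(0.0, d_pos - d_neg + margin))


rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))   # anchor image
other = rng.random((8, 8, 3))   # a different image acts as the negative
positive = make_positive(image, rng)
loss = triplet_loss(embed(image), embed(positive), embed(other))
```

In this sketch the altered copy of an image serves as its own positive sample and any other image serves as a negative, so no extra manual annotation is needed to form the training signal.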

Conference Title

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence

Rights Statement

© 2023 International Joint Conference on Artificial Intelligence. The attached file is reproduced here in accordance with the copyright policy of the publisher. Please refer to the Conference's website for access to the definitive, published version.

Subject

Artificial intelligence

Citation

Yu, X; Wang, J; Gao, Y, CLE-ViT: Contrastive Learning Encoded Transformer for Ultra-Fine-Grained Visual Categorization, Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023, pp. 4531-4539