Multi-Modal Classification Using Images and Text
Published in SMU Data Science Review, 2021
Recommended citation: Miller, Stuart J.; Howard, Justin; Adams, Paul; Schwan, Mel; and Slater, Robert (2020) "Multi-Modal Classification Using Images and Text," SMU Data Science Review: Vol. 3 : No. 3, Article 6. https://scholar.smu.edu/datasciencereview/vol3/iss3/6/
This paper proposes a method for the integration of natural language understanding in image classification to improve classification accuracy by making use of associated metadata. Traditionally, only image features have been used in the classification process; however, metadata accompanies images from many sources. This study implemented a multi-modal image classification model that combines convolutional methods with natural language understanding of descriptions, titles, and tags to improve image classification. The novelty of this approach was to learn from additional external features associated with the images using natural language understanding with transfer learning. It was found that the combination of ResNet-50 image feature extraction and Universal Sentence Encoder embeddings yielded a Top 5 error rate of 73.05% and Top 1 error rate of 54.65%, which is an improvement of 1.56% on benchmark results. This suggests external text features can be used to aid image classification when they are available.