Improving Multi-Modal Food Detection System with Transfer Learning

Shivani Gowda

Research output: Thesis › Master's Thesis

Abstract

Self-assessment of food intake is important for preventing and treating obesity, but current self-assessment methods are inaccurate and hard to use. In this thesis, we explore ways to improve machine learning (ML) food classification, the core technical problem of food intake self-assessment. We present a food detection system built on a state-of-the-art multi-modal architecture, the Vision-and-Language Transformer (ViLT). This architecture combines food appearance, via the image modality, with food description, via the textual modality, to improve the accuracy of food classification. To further enhance performance, we incorporate additional improvements such as curating a branded food item dataset. We apply transfer learning, an ML method that reuses a model pre-trained on a related high-resource task as the starting point for a low-resource task such as ours, which reduces the cost and time required compared to building a model from scratch. In addition, we compare our approach with that of Visual ChatGPT, a combination of vision foundation models and a large language model, and find that our approach to food intake assessment is both accurate and cost-effective.
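As a rough illustration of the transfer-learning setup described above, the sketch below attaches a new classification head to a pre-trained ViLT backbone, using the Hugging Face `transformers` implementation. The checkpoint name, class count, and example text are illustrative assumptions, not the thesis's actual configuration.

```python
# A minimal sketch of transfer learning with ViLT: reuse a pre-trained
# backbone and train only a small, task-specific head on the food dataset.
# Checkpoint, class count, and inputs below are assumptions for illustration.
import torch.nn as nn
from PIL import Image
from transformers import ViltProcessor, ViltModel


class ViltFoodClassifier(nn.Module):
    """Pre-trained ViLT backbone plus a new food-classification head."""

    def __init__(self, num_food_classes: int = 101):  # class count is an assumption
        super().__init__()
        # Backbone pre-trained on a high-resource image-text task.
        self.vilt = ViltModel.from_pretrained("dandelin/vilt-b32-mlm")
        # Randomly initialized head, fine-tuned on the low-resource food task.
        self.classifier = nn.Linear(self.vilt.config.hidden_size, num_food_classes)

    def forward(self, **inputs):
        outputs = self.vilt(**inputs)
        # pooler_output is a single vector fusing the image and text modalities.
        return self.classifier(outputs.pooler_output)


if __name__ == "__main__":
    processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")
    model = ViltFoodClassifier()

    # One (image, description) pair; the blank image and text are placeholders.
    image = Image.new("RGB", (384, 384))
    text = "grilled chicken sandwich with lettuce"
    inputs = processor(image, text, return_tensors="pt")

    logits = model(**inputs)
    print(logits.shape)  # torch.Size([1, 101])
```

In this setup only `classifier` starts from scratch; the backbone's pre-trained weights are either frozen or fine-tuned at a low learning rate, which is what makes the approach cheaper than training a multi-modal model end to end.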

Original language: English
Qualification: Master of Science, Computer Science
Awarding Institution
  • Loyola Marymount University
Supervisors/Advisors
  • Korpusik, Mandy, Advisor
  • Huang, Lei, Advisor
  • Lin, Junyuan, Advisor
State: Published - May 2, 2023
Externally published: Yes