Visual Question Answering
VQA, PyTorch, Deep Learning, Transformer
In this project, we used the VQA v1.0 Open-Ended dataset and took the 5216 most frequent answers, plus an "others" class, as the target classes for supervised training. We used ResNet-18 for image feature extraction and RoBERTa for text feature extraction; the two features are concatenated and fed to a fully connected layer to compute the multi-label loss. To improve performance, we introduced a Transformer-based VQA module inserted between the two feature extractors (before concatenation) and the fully connected layer. More specifically, it consists of 3 cross-encoder layers, each made up of a self-attention and a cross-attention sub-layer. The resulting model achieves an accuracy of around 67%.
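Below is a minimal PyTorch sketch of this architecture. It assumes torchvision's `resnet18` and Hugging Face's `RobertaModel` (`roberta-base`) as the feature extractors; the hidden size of 768, the residual/LayerNorm layout of `CrossEncoderLayer`, the mean pooling before concatenation, and the use of `BCEWithLogitsLoss` are illustrative assumptions, not the exact configuration used in the project.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18
from transformers import RobertaModel

NUM_CLASSES = 5216 + 1  # 5216 most frequent answers + "others"

class CrossEncoderLayer(nn.Module):
    """One cross-encoder block: self-attention over question tokens,
    then cross-attention from the question to image features."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, text_feats, image_feats):
        # self-attention on the question tokens
        attn_out, _ = self.self_attn(text_feats, text_feats, text_feats)
        text_feats = self.norm1(text_feats + attn_out)
        # cross-attention: text queries attend to image regions
        attn_out, _ = self.cross_attn(text_feats, image_feats, image_feats)
        return self.norm2(text_feats + attn_out)

class VQAModel(nn.Module):
    def __init__(self, dim=768, num_layers=3):
        super().__init__()
        cnn = resnet18(weights="IMAGENET1K_V1")
        self.cnn = nn.Sequential(*list(cnn.children())[:-2])  # keep the spatial feature map
        self.img_proj = nn.Linear(512, dim)                   # ResNet-18 channels -> shared dim
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        self.cross_layers = nn.ModuleList(
            [CrossEncoderLayer(dim) for _ in range(num_layers)]  # 3 cross-encoder layers
        )
        self.classifier = nn.Linear(dim * 2, NUM_CLASSES)     # fully connected layer after concatenation

    def forward(self, images, input_ids, attention_mask):
        # image features: (B, 512, 7, 7) -> (B, 49, dim)
        img = self.cnn(images).flatten(2).transpose(1, 2)
        img = self.img_proj(img)
        # text features: (B, T, dim)
        txt = self.text_encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state
        for layer in self.cross_layers:
            txt = layer(txt, img)
        # pool both modalities and concatenate before the classifier
        fused = torch.cat([txt.mean(dim=1), img.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# multi-label loss over soft answer targets
criterion = nn.BCEWithLogitsLoss()
```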
[Github]