This project builds a learning-based method for answering questions about images. We used the VQA v1.0 Open-Ended dataset and took the 5216 most frequent answers as target classes, plus an "others" class, for supervised training. We used ResNet-18 for image feature extraction and RoBERTa for text feature extraction; the two feature vectors are concatenated and fed to a fully connected layer, on whose output a multi-label loss is computed. To improve performance, we introduced a Transformer-based VQA module, inserted between the two feature extractors (before concatenation) and the fully connected layer. Specifically, it consists of three Cross-Encoder layers, each composed of a self-attention sub-layer and a cross-attention sub-layer. The resulting model reaches an accuracy of around 67%.
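
The following is a minimal sketch of the baseline pipeline described above, not the project's exact code: ResNet-18 image features and RoBERTa text features are concatenated and passed through a fully connected layer that scores the 5216 frequent answers plus the "others" class, trained with a multi-label (BCE-with-logits) loss. The class count and the use of ResNet-18/RoBERTa follow the text; all module names, hidden sizes, and the specific loss choice are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models
from transformers import RobertaModel, RobertaTokenizer

NUM_CLASSES = 5216 + 1  # 5216 frequent answers + "others"

class ConcatVQABaseline(nn.Module):
    def __init__(self):
        super().__init__()
        # ResNet-18 with its classification head removed -> 512-d image feature
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.image_encoder = nn.Sequential(*list(resnet.children())[:-1])
        # RoBERTa-base text encoder -> 768-d feature taken from the first token
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        # Fully connected layer on the concatenated (512 + 768)-d feature
        self.classifier = nn.Linear(512 + 768, NUM_CLASSES)

    def forward(self, images, input_ids, attention_mask):
        img_feat = self.image_encoder(images).flatten(1)           # (B, 512)
        txt_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]                                   # (B, 768)
        fused = torch.cat([img_feat, txt_feat], dim=1)              # (B, 1280)
        return self.classifier(fused)                               # answer logits

# Multi-label training step with soft answer targets in [0, 1], one per class
model = ConcatVQABaseline()
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
criterion = nn.BCEWithLogitsLoss()

enc = tokenizer(["what color is the cat?"], return_tensors="pt", padding=True)
images = torch.randn(1, 3, 224, 224)
targets = torch.zeros(1, NUM_CLASSES)
logits = model(images, enc["input_ids"], enc["attention_mask"])
loss = criterion(logits, targets)
```

Under the same assumptions, the sketch below shows one Cross-Encoder layer of the Transformer-based variant: each modality first attends to itself (self-attention) and then to the other modality (cross-attention). Stacking three such layers on the image-grid tokens and question tokens before pooling and concatenation matches the arrangement described above; the shared dimension, number of heads, and normalization details are assumptions, and the 512-d image features would need a projection to the shared dimension.

```python
class CrossEncoderLayer(nn.Module):
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.img_self = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_self = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img_cross = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_cross = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_txt = nn.LayerNorm(dim)

    def forward(self, img_tokens, txt_tokens):
        # Self-attention within each modality, with residual connections
        img = img_tokens + self.img_self(img_tokens, img_tokens, img_tokens)[0]
        txt = txt_tokens + self.txt_self(txt_tokens, txt_tokens, txt_tokens)[0]
        # Cross-attention: image tokens query the text, and vice versa
        img = self.norm_img(img + self.img_cross(img, txt, txt)[0])
        txt = self.norm_txt(txt + self.txt_cross(txt, img, img)[0])
        return img, txt

# Three stacked layers applied before the features are pooled,
# concatenated, and sent to the fully connected classifier.
cross_encoder = nn.ModuleList([CrossEncoderLayer() for _ in range(3)])
```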