We discuss the structure of the representations and optimization problems involved in Spatial AI and propose new synthetic datasets that include accurate ground-truth information about scene composition as well as individual object shapes and poses. We further propose evaluation metrics for all aspects of such joint geometric-semantic representations and apply them to a new semantic SLAM framework.