Visually Grounded Dialogue

Combining information from language and vision has recently received a lot of attention in AI. On the one hand, the computer vision community is exploiting NLP methods to deepen image understanding, for instance in image captioning or image generation from text descriptions. On the other hand, the NLP community has recognised the importance of the visual channel for grounding computational models of language in the visual world, for example in the construction of visually grounded semantic representations. Despite such progress, these models instantiate rather fragile connections between vision and language, and we are still far from truly grasping the linkage between the two modalities. One reason for this is that these systems are mainly devised to learn from very static environments, where a single agent is repeatedly exposed to annotated examples and learns by trying to reproduce the annotations or images through plain supervised learning.

In this talk, I will introduce a multimodal learning framework in which two agents have to cooperate via language in order to achieve a goal that is grounded in an external visual world. More specifically, I will discuss two tasks that upgrade multimodal learning to dialogue interactions (Visual Dialogue and GuessWhat?!), and present two of our current contributions to this line of work: a new module that acts as a dialogue manager within the multimodal dialogue system, and a new multimodal dialogue task specifically designed to capture a linguistic phenomenon called partner specificity.
