One Day Meeting: BMVA Symposium on Vision and Language
Wednesday 17 January 2024
Chairs: Michael Wray (University of Bristol), Davide Moltisanti (University of Bath), and Tengda Han (University of Oxford).
Invited Speakers
- Yuki M. Asano (University of Amsterdam)
- Frank Keller (University of Edinburgh) - Presentation slides.
- Hilde Kuehne (University of Bonn/MIT-IBM Watson AI Lab) - Presentation slides.
- Andrew Zisserman (University of Oxford) - Presentation slides.
Programme
Start | End | Title
---|---|---
09:00 | 09:15 | Registration/Poster Set-up
09:15 | 09:20 | Opening Remarks
09:20 | 10:00 | Invited Speaker - Hilde Kuehne
10:00 | 10:40 | Invited Speaker - Frank Keller
10:40 | 11:55 | Coffee Break + Posters
11:55 | 12:40 | Accepted Talks - Pt. 1
12:40 | 13:40 | Lunch + Posters
13:40 | 14:20 | Invited Speaker - Andrew Zisserman
14:20 | 15:00 | Invited Speaker - Yuki M. Asano
15:00 | 15:30 | Coffee Break + Posters
15:30 | 16:30 | Accepted Talks - Pt. 2
16:30 | 17:00 | Past, Present, and Future of Vision-Language
Talks
- Anil Batra - University of Edinburgh: Efficient Pre-training for Procedural Videos
- Yongshuo Zong - University of Edinburgh: Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations
- Oleg Sinavski - Wayve: Language and Video Generative AI in Autonomous Driving
- Christina Kassab - University of Oxford: Visual-Language Models for Scene Understanding and Localisation
- Junjie Shentu - Durham University: CXR-IRGen: An Integrated Vision and Language Model for the Generation of Clinically Accurate Chest X-Ray Image-Report Pairs
- Kevin Flanagan - University of Bristol: Learning Temporal Sentence Grounding From Narrated EgoVideos
- Walid Bousselham - University of Bonn: Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
- Asmar Nadeem, Mahrukh Awan - University of Surrey: Enhancing Audio Visual Question Answering with Contextual Multi-modal Alignment: Leveraging Audio-Visual Large Models for Generalization and Robustness
Posters
- Pramit Saha - University of Oxford: Do Unimodal Clients Benefit from Multimodal Clients in Incongruent Multimodal Federated Learning?
- Bingchen Zhao - University of Edinburgh: Vision Learners Meet Web Image-Text Pairs
- Michael Dorkenwald - University of Amsterdam: PIN: Positional Insert unlocks object localisation abilities in VLMs
- Pengwan Yang - University of Amsterdam: Aligning unpaired text and visual datasets using uni-modally pretrained models
- Nina Shvetsova - University of Bonn: Learning Video-Language Models with Limited Supervision
- Anil Batra - University of Edinburgh: Efficient Pre-training for Procedural Videos
- Victor Escorcia - Samsung AI Center Cambridge: Finding Moments in Video before CLIP
- Antoine Yang - Inria: VidChapters-7M: Video Chapters at Scale
- Thomas Hudson - Durham University: Uniting NLP and Vision with Video
- Lucas Ventura - ENPC: CoVR: Learning Composed Video Retrieval from Web Video Captions
- Olga Loginova - University of Trento, Italy: Scaling VQA Datasets with Automated Temporal Annotation for Enhanced Temporal Reasoning
- Thomas Winterbottom - Durham University: Multimodal Skin Cancer Detection
- Charles Raude - ENPC, Imagine lab (France): A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision
- Adriano Fragomeni - University of Bristol: ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval
- Yongshuo Zong - University of Edinburgh: Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations
- Ana-Maria Marcu - Wayve: Language and Video Generative AI in Autonomous Driving
- Christina Kassab - University of Oxford: Visual-Language Models for Scene Understanding and Localisation
- Junjie Shentu - Durham University: CXR-IRGen: An Integrated Vision and Language Model for the Generation of Clinically Accurate Chest X-Ray Image-Report Pairs
- Kevin Flanagan - University of Bristol: Learning Temporal Sentence Grounding From Narrated EgoVideos
- Walid Bousselham - University of Bonn: Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
- Asmar Nadeem, Mahrukh Awan - University of Surrey: Enhancing Audio Visual Question Answering with Contextual Multi-modal Alignment: Leveraging Audio-Visual Large Models for Generalization and Robustness
Meeting Location
The meeting will take place at:
British Computer Society (BCS), 25 Copthall Avenue, London EC2R 7BP
Registration
We keep the cost of attending these events as low as possible so that there are no barriers to attendance for anyone in the computer vision community. The registration costs are as follows:
- Presenters (Poster/Talk): £0
- BMVA Members: £20
- Non-BMVA Members: £40 (includes membership to the BMVA)

Paid options include lunch and refreshments for the day.
Call for Presentations
We invite presentations from both academia and industry, bringing together researchers interested in all aspects of vision-language models and their potential future applications.
Presentations can cover either published work or ongoing research. The aim is to spotlight recent work, receive feedback, and spur discussion. We expect presentations to cover one or more of the following areas (this list is non-exhaustive):
- What’s next for Vision-Language Models?
- Generative models (both text and vision generation)
- Retrieval and Search-Based Tasks
- Visual Question Answering
- Foundation Models
- Is Large-Scale Pre-training Required?
- Evaluating Generative Vision-Language Models
- Vision Language Benchmarks and Datasets
- Adversarial Attacks
- Security and Wellbeing
Presentations will take one of three forms: poster, talk, or demo. When applying, you are welcome to choose any combination of these options. You can apply to give a presentation using the form below. Note that this will NOT be a hybrid event; we therefore expect presenters to attend in person.
The deadline for submitting a presentation has passed.
Organisers
- Michael Wray (University of Bristol)
- Davide Moltisanti (University of Bath)
- Tengda Han (University of Oxford)
Banner image by Davide Moltisanti