PhD "Multimodal Multi-Hop Reasoning for Video Analysis" F/H

Orange

Lannion, France

Science/Research, English

Job description:

Publication date: Apr 14, 2026, 4:48 PM

Your role is to pursue a PhD thesis on "Multimodal multi-hop reasoning for video analysis".

Multimodal reasoning represents a major shift in AI, going beyond single-modality approaches to jointly process visual, linguistic and auditory information. The main challenge is to integrate these heterogeneous sources, which differ in structure and representation. Recently, so-called "omni" unified models have emerged that can account for multiple modalities simultaneously, but their use of each modality remains poorly understood.

Videos particularly illustrate this complexity: they combine visual, audio and sometimes textual content (subtitles) and constitute a demanding evaluation domain. Multi-hop video reasoning must link cues dispersed across different segments while ensuring temporal alignment, semantic coherence and robust intermodal fusion in the presence of asynchronous signals.

The thesis goal is to study the interaction between modalities in video analysis and to improve multi-hop reasoning across distinct segments. Determining when and how multiple modalities contribute to reasoning represents just part of the challenge. Current models fail to guarantee consistent use of the full modality set, with some multimodal configurations underperforming unimodal reasoning. These findings suggest dataset biases, "modality collapse" phenomena, and fundamental limitations in modality alignment and exploitation.

Research directions will be organized along two axes.

Axis 1: evaluation, robustness and interpretability. This will involve characterizing the conditions under which models truly exploit multiple modalities and when they fall back to a single one, using probing, systematic analyses, modality ablations and controlled data manipulations (synthetic data, counterfactual examples, physics-informed scenarios). Robustness protocols (adding noise to, suppressing or misaligning modalities) will help diagnose the causal role of each signal.

Axis 2: solutions and training of truly multimodal models. Based on the identified challenges, the thesis will aim to design and train architectures and learning procedures that promote collaboration between modalities (attention or routing mechanisms, intermodal coherence constraints, temporal grounding objectives). The ambition is to obtain truly multimodal, robust, efficient and interpretable multi-hop video reasoning models that outperform their unimodal counterparts on tasks explicitly designed to require integration of multiple modalities.

Hard and soft skills required for the position
* Proficiency in Deep Learning techniques (text, image, audio or video processing).
* Programming skills, particularly in Python, with experience in deep learning frameworks such as PyTorch or TensorFlow.
* Ability to analyze and interpret complex data, with strong analytical skills.
* Personal qualities: scientific rigor, autonomy, curiosity, initiative, ability to work in a team.
* Strong oral and written communication skills in English for presenting research findings and drafting publications and research reports.
* Ability to present results clearly and pedagogically to different audiences.

Required education (master's degree, engineering diploma, doctorate, scientific and technical field, etc.)

You hold a professional or research master's degree or have graduated from an engineering school in computer science or applied mathematics, preferably with a specialization in one or more fields of artificial intelligence.

Desired experience
* Prior experience in research projects or internships in video processing or multimodality.
* Experience with vision-language models (VLMs) and/or multimodal LLMs (MLLMs).
* Experience in Natural Language Processing (NLP).
* In-depth understanding of LLMs and reasoning models.
* Participation in scientific publications or presentations in the field is a plus.

The thesis will contribute to research at the intersection of computer vision, natural language processing, and speech processing. As part of this position, you will have the opportunity to develop your skills in artificial intelligence research, deepen your knowledge of vision-language models and multimodality, and strengthen your project management and scientific communication abilities (writing articles, giving presentations at seminars and conferences). You will work on an ambitious project that will allow you to expand your professional network and contribute to high-level publications.

You will benefit from multidisciplinary supervision by experts in machine learning, multimodality, natural language processing, and speech processing. Scientific publications will be produced throughout the thesis, depending on the results obtained.

You will have access to centralized computing resources (an HPC cluster with around a hundred GPUs) for your work on neural networks, and you may also use the Jean Zay supercomputer for your research. Additionally, you will have access to proprietary large language models (LLMs) via API, as well as the necessary licenses for agent-based coding tools (GitHub Copilot, OpenCode).

The proposed gross salary ranges between 37 KEUR and 40 KEUR and is paid over 12 months.

Orange Innovation brings together the research and innovation activities and expertise of the Group's entities and countries. We work every day to ensure that Orange is recognized as an innovative operator by its customers and we create value for the Group in each of our projects. With 720 researchers, thousands of marketers, developers, designers and data analysts, it is the expertise of our 6,000 employees that fuels this ambition every day. Orange Innovation anticipates technological breakthroughs and supports the Group's countries and entities in making the best technological choices to meet the needs of our consumer and business customers.

Within Orange Innovation, you will join a cutting-edge research team specializing in AI. The team conducts activities in the field of natural language processing (NLP), covering a wide range of topics such as agentic AI, deep research, language modeling, multimodality, semantic analysis, information extraction, document processing, knowledge management, human-machine dialogue, and more. You will be part of a research ecosystem working alongside Data Scientists and developers, supporting the practical application of the studied concepts. The team belongs to the Data & AI Department, whose mission is to consolidate key skills to support the company's transformation, develop use cases, enrich services, and improve workflows by leveraging data and its processing, notably through Artificial Intelligence.

At Orange, only your skills matter.
Regardless of your age, gender, background, origin, religion, sexual orientation, disability, neurodiversity, or appearance, we actively encourage diversity within our teams, as it is a collective strength and a driver of innovation.
Orange is a disability-inclusive employer: please feel free to let us know about any specific needs you may have.

Source:	Company website
Published:	15 Apr 2026 (verified 04 Jun 2026)
Employment type:	Job
Sector:	Telecommunications
Compensation:	40000 EUR
Languages:	English
