Multimodal embeddings for Telecom Images

Ericsson

Paris, France

Internship, IT/Technology, English

Job Description:

Multimodal Embeddings and Tokenization for Telecom-Oriented Images
Ericsson is building a new R&D team in Massy, France, covering many cutting-edge technologies such as AI/ML, cloud, and 5G Advanced/6G.
Within this new lab, the Standards & Technology unit is part of Development Unit Networks' global Standards & Technology organization.
At Development Unit Networks, Standards & Technology secures technology leadership in Radio Access Networks (RAN) by actively driving New Concepts, Standardization, Software and Hardware Research, Architecture, and Testbeds.
Technical focus includes 5G evolution, IoT, Digital Twins, Automation / Machine Learning, and Security.
We are looking for motivated interns to join this young and talented research team and contribute to our research activities on Generative AI and Multimodal Intelligence for Telecom Systems.
Context
Modern telecommunication research and engineering rely on a variety of technical images and diagrams, including network architectures, protocol sequence charts, RF signal plots, coverage maps, and KPI dashboards.
These visual representations are essential for understanding system design and performance, yet they remain poorly captured by existing vision-language models (VLMs), which are trained mostly on natural images.
Generic models such as CLIP, Qwen2-VL, or LLaVA are powerful but struggle to interpret telecom-specific visual symbols, legends, and numerical content (e.g., "256-QAM EVM plots" or "handover flow diagrams").
To enable intelligent retrieval, reasoning, and automation in the telecom domain, there is a need for domain-adapted multimodal embeddings and tokenizers that can understand these technical visual cues.
This internship explores the training and evaluation of a multimodal embedding or tokenizer model specialized for telecom-oriented images, aiming to bridge the gap between visual and textual representations in this technical domain.
Research Questions
· How can we design visual tokenizers or embeddings that effectively represent structured, symbolic telecom images (diagrams, plots, dashboards)?
· What pretraining or fine-tuning strategies (contrastive, masked, or alignment-based) best adapt general-purpose VLMs to telecom data?
· How can these embeddings be integrated with textual knowledge bases or LLMs to enable multimodal reasoning (e.g., "Explain Figure 5: RRC connection procedure")?
· How can we evaluate visual grounding and retrieval quality for telecom-specific multimodal datasets?
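To make the contrastive option in the questions above concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss in plain Python. It assumes that image i and caption i form the matching pair and that no embedding is the zero vector. The function names, the temperature of 0.07, and the toy vectors are illustrative assumptions, not the method prescribed for the internship.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Assumption: image_embs[i] and text_embs[i] are a matching pair,
    and every other pairing in the batch is a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # Cosine similarities scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: same idea over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return (i2t + t2i) / 2


# Toy example: two "images" and two "captions" in a 2-D embedding space.
aligned = symmetric_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = symmetric_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned < swapped)
```

In this toy case the loss is close to zero when each image embedding points the same way as its caption, and large when the captions are swapped. A real experiment would compute the same quantity on encoder outputs inside a deep learning framework.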
Objectives
In this internship we are looking for talented students to help us design, train, and evaluate a multimodal embedding model for telecom images.
· Collect and preprocess a dataset of telecom-oriented images (network diagrams, RF plots, dashboards) with accompanying captions or textual context.
· Explore domain-adaptive fine-tuning of existing models (CLIP, SigLIP-2, Qwen2-VL, or LLaVA) using telecom data.
· Train and evaluate a visual tokenizer or encoder capable of generating robust embeddings for structured technical images.
· Design evaluation benchmarks for telecom image understanding (e.g., retrieval, captioning, or visual grounding tasks).
· Integrate the resulting embeddings into a GraphRAG or Hi-RAG pipeline for cross-modal retrieval across images, clauses, and entities.
· Analyze performance against generic VLMs and report improvements in domain alignment and factual grounding.
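As one possible shape for the retrieval benchmark mentioned in the objectives, the sketch below computes Recall@K for image-to-text retrieval using cosine similarity. It assumes query i has exactly one correct match, gallery item i. The function names and the toy embeddings are hypothetical and carry no real data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_k(query_embs, gallery_embs, k):
    """Fraction of queries whose correct item appears in the top-k results.

    Assumption: the correct match for query i is gallery item i.
    """
    hits = 0
    for i, query in enumerate(query_embs):
        scores = [cosine(query, item) for item in gallery_embs]
        # Gallery indices sorted from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(query_embs)


# Toy embeddings standing in for image queries and caption gallery items.
queries = [[1, 0], [0, 1], [1, 1]]
gallery = [[0.9, 0.1], [0.8, 0.6], [0.2, 0.9]]

r1 = recall_at_k(queries, gallery, k=1)
r2 = recall_at_k(queries, gallery, k=2)
print(r1, r2)
```

Running the same function on embeddings from a generic VLM and from a domain-adapted one gives a simple side-by-side comparison, provided both are scored on the same image and caption pairs.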
To be successful in the role you need to have:
· Basic understanding of telecommunication systems and network architectures (4G/5G or similar).
· Interest in computer vision, multimodal learning, and representation learning.
· Familiarity with Transformer-based architectures (Vision Transformers, CLIP, or similar).
· Experience with Python and deep learning frameworks (PyTorch, TensorFlow, or similar).
· Experience with cloud-based AI platforms, preferably AWS and Amazon Bedrock (or similar cloud LLM/VLM services).
· Experience with version control and collaboration platforms (Git/GitLab, or similar).
· Curiosity about Generative AI, Large Multimodal Models, and their applications in telecom.
· Qualities of fast learning, critical thinking, autonomy, and teamwork.
· Willingness to work in an inclusive, research-oriented, and multicultural environment.
· Fluent English language skills in both writing and conversation.
· French language skills are a plus.

Source:	Company website
Posted on:	19 Dec 2025 (verified 04 Jan 2026)
Type of offer:	Internship
Industry:	Telecommunications
Languages:	English
