PhD in Information Engineering and Science

Coordinator: Prof. Mauro Barni



Multimodal Deep Learning: Principles And Architectures

Prof.	Franco Scarselli, University of Siena - Dipartimento di Ingegneria dell'Informazione e Scienze Matematiche
	Paolo Andreini, University of Siena - Dipartimento di Ingegneria dell'Informazione e Scienze Matematiche
Course Type	Type B
Calendar	February 24-28, h 14-18
Room	Aula 103
Program	Brief abstract

Multimodal models are a pivotal technology in deep learning, offering unprecedented capabilities in content creation and manipulation across multiple data modalities. These models can power applications in natural language processing, image synthesis, drug discovery, and more. Their ability to generate diverse, high-quality content and to integrate information across different modalities has deep implications for science and society.

This course aims to provide a basic understanding of multimodality in machine learning, covering foundational principles and architectures. First, the course will introduce complex networks, such as modern convolutional neural networks, generative adversarial networks, variational autoencoders, diffusion models, and large language models. Next, we will explore how multimodal models are built by combining these networks and exploiting novel learning frameworks, e.g. auto-regression and adversarial learning. Finally, we will show how multimodal models can be used in a large variety of tasks.

Syllabus

Building Blocks of Multimodal Models:
• Convolutional Neural Networks (CNNs): Evolution from AlexNet to ResNet and DenseNet.
• Generative Models: Variational Autoencoders (VAEs).
• Generative Adversarial Networks (GANs).
• Diffusion Models.
• Transformers: Architecture and innovations behind transformers.
• Overview of Large Language Models (LLMs).

Core Multimodal Architectures:
• Contrastive Language-Image Pre-Training (CLIP).
• Bootstrapping Language-Image Pre-Training (BLIP).
• Examples of Multimodal Architectures (Flamingo, Llama).

Multimodal Applications and Case Studies:
• Few-Shot and Zero-Shot Learning: Techniques that enable models to generalize to new tasks with minimal data, and their applications.
• Interactive Applications: Real-world applications, including chatbots and digital assistants, and their role in enhancing human-computer interaction through multimodal capabilities.
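
As an informal illustration of the zero-shot idea listed in the syllabus, the sketch below classifies an image by comparing its embedding with the embeddings of candidate text prompts, in the style popularized by CLIP. It is not course material: the function names, the toy three-dimensional embeddings, and the temperature value of 0.07 are all assumptions chosen for illustration, and a real system would obtain the embeddings from trained image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_emb, text_embs, temperature=0.07):
    """Pick the prompt whose embedding is most similar to the image embedding.

    image_emb: list of floats (hypothetical image-encoder output).
    text_embs: dict mapping prompt -> list of floats (hypothetical text-encoder output).
    temperature: assumed scaling factor applied to the similarities before softmax.
    """
    labels = list(text_embs)
    logits = [cosine(image_emb, text_embs[label]) / temperature for label in labels]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], dict(zip(labels, probs))

# Toy, hand-written embeddings (hypothetical, not produced by any real model).
text_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.2, 0.1]

label, probs = zero_shot_classify(image_emb, text_embs)
print(label)
```

No classifier is trained on the target labels here: adding a new class only requires a new text prompt, which is what makes the approach zero-shot.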


Dip. Ingegneria dell'Informazione e Scienze Matematiche - Via Roma, 56 53100 SIENA - Italy