Coordinator: Prof. Mauro Barni




Multimodal Deep Learning: Principles And Architectures

 

Prof.
Franco Scarselli
University of Siena - Dipartimento di Ingegneria dell'Informazione e Scienze Matematiche
Paolo Andreini
University of Siena - Dipartimento di Ingegneria dell'Informazione e Scienze Matematiche
Course Type
Type B
Calendar
February 24-28, 14:00-18:00
Room
Aula 103
Program
Brief abstract
Multimodal models are a pivotal technology in deep learning, offering unprecedented capabilities for creating and manipulating content across multiple data modalities. They power applications in natural language processing, image synthesis, drug discovery, and more. Their ability to generate diverse, high-quality content and to integrate information across modalities has deep implications for science and society. This course aims to provide a basic understanding of multimodality in machine learning, covering foundational principles and architectures. The course first introduces the complex networks that serve as building blocks: modern convolutional neural networks, generative adversarial networks, variational autoencoders, diffusion models, and large language models. It then explores how multimodal models are built by combining these networks and exploiting novel learning frameworks such as autoregression and adversarial learning. Finally, it shows how multimodal models can be used in a wide variety of tasks.

Syllabus

Building Blocks of Multimodal Models:
• Convolutional Neural Networks (CNNs): Evolution from AlexNet to ResNet and DenseNet.
• Generative Models: Variational Autoencoders (VAEs).
• Generative Adversarial Networks (GANs).
• Diffusion Models.
• Transformers: Architecture and innovations behind transformers.
• Overview of Large Language Models (LLMs).

Core Multimodal Architectures:
• Contrastive Language-Image Pre-Training (CLIP).
• Bootstrapping Language-Image Pre-Training (BLIP).
• Examples of Multimodal Architectures (Flamingo, Llama).

Multimodal Applications and Case Studies:

• Few-Shot and Zero-Shot Learning: Techniques that enable models to generalize to new tasks with minimal data, and their applications.
• Interactive Applications: Real-world applications, including chatbots and digital assistants, and their role in enhancing human-computer interaction through multimodal capabilities.





 



Dip. Ingegneria dell'Informazione e Scienze Matematiche - Via Roma, 56 53100 SIENA - Italy