Building AI Agents with Multimodal Models (BAAMM)

 

Course Overview

Learn how to build neural network agents that reason across multiple data types using advanced fusion techniques, OCR, and NVIDIA AI Blueprints for real-world applications like robotics and healthcare.

Course Content

We'll begin with a robotics use case to show how different data types shape an effective neural network architecture. The mathematical concepts from the robotics use case can then be applied to Large Language Models (LLMs) to modify these powerful models so that they accept non-language input data. We'll end with orchestration, where multiple models work together to answer user queries.

Prerequisites

  • A basic understanding of deep learning concepts.
  • Familiarity with a Deep Learning framework such as TensorFlow, PyTorch, or Keras. This course uses PyTorch.

Course Objectives

In this course, you will learn about:

  • Different data types and how to make them neural network ready
  • Model fusion, and the differences between early, late, and intermediate fusion
  • PDF extraction using OCR
  • The difference between modality orchestration and agent orchestration
  • Customization of NVIDIA AI Blueprints with Video Search and Summarization (VSS)

Outline: Building AI Agents with Multimodal Models (BAAMM)

1. Early and Late Fusion (1 hr)

  • Use camera and LiDAR data to predict object positions.
  • Prepare various data types so that they are neural network ready.

2. Intermediate Fusion (1 hr)

  • Explore the theory behind effective multimodal model architecture.
  • Train a contrastive pretraining model.
  • Create a vector database.

3. Cross-modal Projection (2 hr)

  • Convert a language model into a Vision Language Model (VLM).
  • Process PDFs with Optical Character Recognition (OCR) tools.

4. Model Orchestration (2 hr)

  • Analyze video using Cosmos Nemotron.
  • Use VSS to answer user queries about video content.
  • Orchestrate with NVIDIA AI Blueprints.

5. Assessment (1 hr)

  • Use projection to convert a pre-trained model so that it accepts a different data type as input.

Prices & Delivery methods

Online Training

Duration
8 hours

Price
  • US $ 500

Classroom Training

Duration
8 hours

Price
  • United States: US $ 500

Schedule

Currently there are no training dates scheduled for this course.