Building AI agents with LLM's

Chneu · Nov 18, 2024

Now that Multimodal LLMs (MLLMs) have been released, it's possible to do vision-based tasks using famous LLM models like GPT4, GPT 4o mini, Claude Sonnet, and Gemini 1.5 Vision Pro latest. However, most of these models are not cost-effective for experimenting with complete tasks as the token price has been incremental since the release of these LLM models. Example: GPT 4o can now operate a Robot arm for minimal tasks, Claude can do web navigation, and OS navigation in a Controlled Virtual environment.

Inference using Vision-only approach is very time-consuming. Further experiments on open-sourced models like Llama 3.2 vision or Llava would be very handy. However, each model behaves differently in different use cases. Testing on a Virtual environment is highly recommended - however, support for an open-source model opens so many doors for resource-poor developers and researchers.

Chneu · Nov 27, 2024

Creating a single AI agent to plan and execute tasks is very costly. Claude's Computer Use use case still uses heavy tokens to complete the tasks defined by a user. Building Multi-agents to complete complex tasks can be an alternative.

Prompting is essential. For example, the Audioscribe agent uses GPT-4o to transcribe audio for note-taking, while Gemini 1.5 pro cleans the transcripts for grammar mistakes and noise (noise includes filler words like "uh," "huh," etc.). Clearly defining what agents should do what is the key.

Search

Welcome!

Building AI agents with LLM's

Chneu

Administrator

Chneu

Administrator