
Building AI agents with LLMs

Chneu
Administrator · Staff member
Joined: Nov 2, 2017 · Messages: 14
Now that multimodal LLMs (MLLMs) have been released, it's possible to do vision-based tasks with well-known models like GPT-4, GPT-4o mini, Claude Sonnet, and the latest Gemini 1.5 Pro (vision-capable). However, most of these models are not cost-effective for experimenting with complete tasks, since per-token pricing makes the many calls a full task requires expensive. For example, GPT-4o can now operate a robot arm for simple tasks, and Claude can do web navigation and OS navigation in a controlled virtual environment.
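As a rough illustration, calling one of these hosted vision models typically looks like the sketch below. It assumes the official OpenAI Python SDK with OPENAI_API_KEY set in the environment; the model name and screenshot path are just placeholders:

```python
# Minimal sketch: asking a hosted vision-capable model to describe a screenshot.
# Assumes `pip install openai` and OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()

# Encode a local screenshot so it can be sent inline as a data URL.
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any vision-capable model works here
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the UI elements visible in this screenshot."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```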

Inference with a vision-only approach is also very time-consuming. Further experiments on open-source models like Llama 3.2 Vision or LLaVA would be very handy, though each model behaves differently across use cases. Testing in a virtual environment is highly recommended, and support for open-source models opens many doors for resource-constrained developers and researchers.
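For comparison, here is a minimal sketch of the same kind of vision query against a locally hosted open-source model. It assumes Ollama is installed, the llava model has been pulled with `ollama pull llava`, and the ollama Python package is available; the file path is a placeholder:

```python
# Minimal sketch: the same vision query against a local open-source model via Ollama.
# Assumes `pip install ollama` and a running Ollama instance with `ollama pull llava` done.
import ollama

response = ollama.chat(
    model="llava",
    messages=[
        {
            "role": "user",
            "content": "Describe the UI elements visible in this screenshot.",
            "images": ["screenshot.png"],  # local file path
        }
    ],
)
print(response["message"]["content"])
```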
 

Chneu
Administrator · Staff member
Joined: Nov 2, 2017 · Messages: 14
Creating a single AI agent that both plans and executes tasks is very costly; Claude's Computer Use, for example, still burns through a lot of tokens to complete the tasks a user defines. Building a multi-agent setup to handle complex tasks can be an alternative.
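A minimal sketch of that planner/executor split might look like the following. The call_llm helper is hypothetical, standing in for whichever chat API (hosted or local) you route to; the point is only that a planner produces short steps and a separate executor handles one step at a time, keeping each prompt small:

```python
# Minimal sketch of a planner/executor multi-agent loop.
# `call_llm` is a hypothetical wrapper around whatever chat API you use.
def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder: route to OpenAI, Gemini, a local model, etc."""
    raise NotImplementedError

def planner(goal: str) -> list[str]:
    # The planner only produces a short numbered plan, not the actual work.
    plan = call_llm(
        "You are a planner. Break the goal into short, numbered steps.",
        goal,
    )
    return [line.strip() for line in plan.splitlines() if line.strip()]

def executor(step: str, context: str) -> str:
    # The executor handles exactly one step, so each prompt stays small.
    return call_llm(
        "You are an executor. Carry out exactly one step and report the result.",
        f"Context so far:\n{context}\n\nStep:\n{step}",
    )

def run(goal: str) -> str:
    context = ""
    for step in planner(goal):
        context += f"\n{step} -> {executor(step, context)}"
    return context
```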

Prompting is essential. For example, the Audioscribe agent uses GPT-4o to transcribe audio for note-taking, while Gemini 1.5 Pro cleans the transcripts of grammar mistakes and noise (noise includes filler words like "uh," "huh," etc.). Clearly defining which agent does what is the key.
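To make that division of labor concrete, here is a minimal sketch of a transcribe-then-clean pipeline (not Audioscribe's actual code). It assumes the openai and google-generativeai packages with API keys in environment variables; whisper-1 is shown as the transcription model, so swap in whichever transcription endpoint you actually use:

```python
# Minimal sketch: one model transcribes, a second model cleans the transcript.
# Assumes `pip install openai google-generativeai`, OPENAI_API_KEY and GOOGLE_API_KEY set.
import os
from openai import OpenAI
import google.generativeai as genai

openai_client = OpenAI()
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

def transcribe(audio_path: str) -> str:
    # Transcription step (whisper-1 used here as a stand-in).
    with open(audio_path, "rb") as f:
        result = openai_client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text

def clean(transcript: str) -> str:
    # Cleanup step: fix grammar and strip filler words with Gemini 1.5 Pro.
    model = genai.GenerativeModel("gemini-1.5-pro")
    prompt = (
        "Fix grammar and remove filler words ('uh', 'huh', etc.) from this "
        "transcript without changing its meaning:\n\n" + transcript
    )
    return model.generate_content(prompt).text

if __name__ == "__main__":
    notes = clean(transcribe("meeting.m4a"))  # placeholder audio file
    print(notes)
```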
 