
Building AI agents with LLMs

Chneu

Now that multimodal LLMs (MLLMs) are broadly available, it's possible to do vision-based tasks with well-known models like GPT-4, GPT-4o mini, Claude Sonnet, and Gemini 1.5 Pro. For example, GPT-4o can operate a robot arm for simple tasks, and Claude can do web navigation and OS navigation in a controlled virtual environment. However, most of these models are not cost-effective for experimenting with complete tasks: an agent makes many calls per task, and the image and text tokens in each call add up quickly.
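
To make one agent step concrete, here is a minimal sketch of sending a screenshot plus an instruction to a hosted MLLM through OpenAI's Python client. The model name, prompt, and screenshot path are placeholders I'm assuming for illustration, and the same pattern applies to Claude or Gemini via their own SDKs; note that every call like this consumes image and text tokens, which is exactly where the cost piles up in a full agent loop.

```python
# Minimal sketch of one vision-agent step against a hosted MLLM API.
# Assumes the official `openai` Python package and an OPENAI_API_KEY env var;
# the model name and file path below are placeholders, not recommendations.
import base64
from openai import OpenAI

client = OpenAI()

def describe_action(screenshot_path: str, instruction: str) -> str:
    # Encode the screenshot as base64 so it can be sent inline as a data URL.
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap in whatever you have access to
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": instruction},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
        max_tokens=300,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(describe_action("screenshot.png",
                          "What UI element should I click next to open Settings?"))
```

A real agent would parse the reply into an action, execute it, capture a fresh screenshot, and repeat, so even a short task can run through dozens of image-bearing calls.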

Inference with a vision-only approach is also very time-consuming. Further experiments with open-source models like Llama 3.2 Vision or LLaVA would be very handy, though each model behaves differently across use cases, so results won't transfer one-to-one. Testing in a virtual environment is highly recommended, and support for open-source models opens many doors for developers and researchers with limited resources; a local version of the same step is sketched below.
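
For the open-source route, here is a hedged sketch of the same step run locally through Ollama's REST API, assuming a server on its default port with a vision model already pulled (e.g. `ollama pull llava`). The model tag and endpoint are the stock Ollama defaults; a Llama 3.2 Vision tag can be substituted if your Ollama version ships it.

```python
# Minimal sketch of the same vision step against a local open-source model.
# Assumes a running Ollama server (default http://localhost:11434) with a
# vision-capable model already pulled, e.g. `ollama pull llava`.
import base64
import requests

def describe_action_local(screenshot_path: str, instruction: str) -> str:
    # Ollama accepts images as base64 strings alongside the text prompt.
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llava",  # assumed model tag; any pulled vision model works
            "prompt": instruction,
            "images": [image_b64],
            "stream": False,  # return one JSON object instead of a token stream
        },
        timeout=300,  # local vision inference can be slow on modest hardware
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    print(describe_action_local("screenshot.png",
                                "What UI element should I click next to open Settings?"))
```

Running the model locally trades API cost for wall-clock time and hardware, which is why a virtual environment you can reset cheaply matters so much while iterating.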
 