Now that multimodal LLMs (MLLMs) are available, it's possible to do vision-based tasks with well-known models like GPT-4, GPT-4o mini, Claude Sonnet, and the latest Gemini 1.5 Pro. However, most of these models are not cost-effective for experimenting with end-to-end tasks, since token costs add up quickly. For example, GPT-4o can now operate a robot arm for simple tasks, and Claude can do web navigation and OS navigation in a controlled virtual environment.
Inference with a vision-only approach is also very time-consuming. Further experiments with open-source models like Llama 3.2 Vision or LLaVA would be very handy, though each model behaves differently across use cases. Testing in a virtual environment is highly recommended, and support for open-source models opens many doors for resource-constrained developers and researchers.
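For anyone who wants to try the open-source route locally, here is a minimal sketch of sending an image to a Llama 3.2 Vision model served by Ollama through its REST chat endpoint. It assumes Ollama is running on localhost:11434 and that the llama3.2-vision model has already been pulled; the image path and prompt are placeholders.

```python
import base64
from pathlib import Path

import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local Ollama endpoint (assumption)
MODEL = "llama3.2-vision"                       # assumes this model was pulled beforehand


def describe_image(image_path: str, prompt: str) -> str:
    """Send one image plus a text prompt to a local vision model and return its reply."""
    # Ollama expects images as base64-encoded strings inside the user message.
    image_b64 = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")

    payload = {
        "model": MODEL,
        "stream": False,  # request a single JSON response instead of a token stream
        "messages": [
            {
                "role": "user",
                "content": prompt,
                "images": [image_b64],
            }
        ],
    }
    resp = requests.post(OLLAMA_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["message"]["content"]


if __name__ == "__main__":
    # Placeholder screenshot and a simple navigation-style question.
    print(describe_image("screenshot.png", "Which button should I click to open Settings?"))
```

The same pattern should work with LLaVA by swapping the model name. Local inference avoids per-token costs entirely, but on CPU-only machines it can be slow, which is part of the time-consumption trade-off mentioned above.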