Vision as Model Context Protocol (MCP) service

We think the best way for VLMs to become great at vision is to outsource it to specialized models behind an MCP server


Author(s):

Dr. Paulina Varshavskaya
Chief Science Officer

Claude 3.7 Sonnet with no tools misidentifying the shop from the V*Bench/ChatGPTV4-hard dataset.


The same Claude 3.7 Sonnet correctly identifying the shop with the zoom_to_object tool from mcp-vision.


Vision Language Models (VLMs) are blind* but great at reasoning about text

*ish – they are definitely getting better.

Even the largest VLMs struggle with simple vision-based tasks. CLIP-based VLMs can’t tell if a dog is facing left or right. Impressive chatbots can’t count the number of times lines cross, or perform other low-level vision tasks, even when the images contain no visual distractions. They don’t know when an image has been rotated or flipped upside-down. To be clear, these tasks are exceedingly easy for a seeing person. And a simple MLP head on top of CNN features, fine-tuned to the right task, will outperform the GPT behemoth when it comes to understanding low-level spatial and geometric information.

The most recent models are getting better: GPT-4o-mini, for instance, can answer the question in the video with no trouble. Gemini 2.5 Pro, however, does not answer correctly:

Gemini 2.5 Pro without vision tools incorrectly answers "cafe".

But even the most sophisticated current VLMs are still only as good as the image resolution they can ingest. This is why GPT-4o scores only 66% on V*Bench overall while a dedicated image tree search algorithm like ZoomEye does much better. It is also why attempts to play Pokémon Red are thwarted in part by mistaking a bed for the stairs.

On the other hand, large models are ever more impressive and efficient at reasoning about text and the kinds of problems whose representation does not depend on visual attributes and relations. DeepSeek-R1 excels at high-school-level mathematical reasoning, surpassing most humans. And LLMs are becoming expert tool users when it comes to fetching answers to questions their training data doesn’t cover, such as up-to-date weather, schedules, or current events. We can now ask our favorite chatbot about these because it has been trained to recognize when to use a specialized service that will produce accurate information for just this type of question.

So what if, instead of a dedicated tree search algorithm, we used the reasoning abilities and higher-level visual understanding of state-of-the-art VLMs and allowed them to access standardized image processing and computer vision tools? That would combine the VLMs’ strengths (reasoning and tool use) with the visual-spatial capabilities of state-of-the-art CV specialist models.

Use specialized vision models as MCP servers

The word “standardized” in the preceding paragraph is important. We should make specialized vision models into MCP servers for VLMs to use like any other tool. To anticipate a certain objection: yes, MCP should make this easier by supporting image transport without piping images through the LLM token cruncher (more on that later).

The Model Context Protocol (MCP) is becoming the industry standard for AI tool use. It is now supported by most major models and many IDEs. MCP servers already provide integrations with data and file systems, productivity, development and automation tools, image generation, knowledge-base retrieval, and direct control of applications that play music or create 3D models. We should extend this approach to placing the best image analysis and computer vision models – OWL-ViT, YOLO, GroundingDINO, SAM, and of course Groundlight models – behind an MCP server. Then prompt the VLM to recognize when to call on the best available external vision model for semantic or panoptic segmentation, depth perception, specific object localization, specialized-domain image classification, and any tools supporting lower-level visual analysis.

Vision tools as MCP servers. Original diagram from https://modelcontextprotocol.io/introduction

Essentially, the MCP vision servers let the VLM act in a more agentic manner, grounding its reasoning in sensory inputs like images and videos. The upshot is that we avoid having to reproduce all of the vision capabilities within an integrated VLM, which can instead focus on reasoning about the visual information it gets from the tools as well as its own.
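To make this concrete, here is a minimal sketch of what a vision tool behind MCP could look like. This is not the mcp-vision code itself: the server name, tool name and model checkpoint are illustrative, and the sketch assumes the FastMCP helper from the official MCP Python SDK plus a HuggingFace zero-shot object detection pipeline.

# Minimal sketch of a vision tool behind MCP (illustrative, not the mcp-vision code).
from mcp.server.fastmcp import FastMCP
from transformers import pipeline

mcp = FastMCP("vision-tools")

# Open-vocabulary, zero-shot object detector; any checkpoint that supports this
# pipeline task would do.
detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")

@mcp.tool()
def detect_objects(image_url: str, candidate_labels: list[str]) -> list[dict]:
    """Detect the named objects in the image at image_url and return
    bounding boxes with confidence scores."""
    # The pipeline accepts a URL or local path and returns a list of
    # {"score", "label", "box": {"xmin", "ymin", "xmax", "ymax"}} dicts.
    return detector(image_url, candidate_labels=candidate_labels)

if __name__ == "__main__":
    mcp.run()  # stdio transport by default, which is what Claude Desktop uses

The VLM never loads the detector weights or does any bounding-box arithmetic itself; it just sees a tool it can decide to call and the structured detections that come back.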

We could of course just directly call the specialized models’ APIs. And tool use itself, even image-analysis tool use, is not a new idea: specialized object-detection and segmentation models were used in Visual Sketchpad, for example, to reason about math problems and object relations in images. But using MCP for this purpose takes advantage of the decoupled architecture enabled by a standard protocol, and it fits right into the fast-growing ecosystem of standardized tools.

How would it work?

We’re releasing the open-source, permissively licensed code for mcp-vision. The initial minimal version provides an MCP wrapper around HuggingFace pipelines, currently implemented for zero-shot object detection and easily extensible to other models. Check out the README for how to use it as a tool with your MCP client and how to add your favorite models and tools to extend it. To use it with Claude Desktop on a machine with NVIDIA GPUs, follow the build instructions and add this to claude_desktop_config.json:

"mcpServers": {
  "mcp-vision": {
    "command": "docker",
    "args": ["run", "-i", "--rm", "--runtime=nvidia", "--gpus", "all", "mcp-vision"],
    "env": {}
  }
}
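In this configuration, Claude Desktop launches the locally built mcp-vision Docker image itself: -i keeps stdin open for the stdio transport the client and server talk over, --rm cleans up the container when it exits, and --runtime=nvidia with --gpus all gives the detection models access to the local GPUs.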

The zoom_to_object tool showcased in the video above calls google/owlvit-large-patch14 via the HuggingFace pipeline interface for open-domain zero-shot object detection and returns an image crop around the best detected object, if any. Without the tool, Claude is unable to read the text on the advertising board. With the zoom-in tool using open-domain object detection, it gets the cropped board back at a higher resolution and has no trouble identifying that it is advertising a yoga studio. Both videos were created on the same Intel machine running Ubuntu 22.04 with 128GB RAM and an NVIDIA GeForce RTX 3090.
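For intuition, the logic behind such a zoom tool is roughly the following. This is a simplified sketch rather than the implementation in the repo; the padding value and error handling are illustrative.

# Simplified sketch of the zoom-and-crop idea behind a tool like zoom_to_object.
import io

import requests
from PIL import Image
from transformers import pipeline

detector = pipeline("zero-shot-object-detection", model="google/owlvit-large-patch14")

def zoom_to_object(image_url: str, label: str, pad: int = 20):
    """Return a PIL crop around the highest-scoring detection of `label`, or None."""
    detections = detector(image_url, candidate_labels=[label])
    if not detections:
        return None
    best = max(detections, key=lambda d: d["score"])
    box = best["box"]  # pixel coordinates: xmin, ymin, xmax, ymax
    image = Image.open(io.BytesIO(requests.get(image_url, timeout=30).content))
    return image.crop((
        max(box["xmin"] - pad, 0),
        max(box["ymin"] - pad, 0),
        min(box["xmax"] + pad, image.width),
        min(box["ymax"] + pad, image.height),
    ))

Because the crop is a small region of the original, the interesting pixels come back to the VLM at effectively much higher resolution than they had in the full frame.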

Current protocol limitations

At the time of writing, there is no satisfactory standard way to transport images between the client (a VLM such as Claude) and an MCP server. One way it can be done is by sending base64-encoded image bytes, which severely limits the size of the images passed around. The other is to send a file path or URL, which means the server needs to provide load/save functionality that is left entirely up to the server developer.
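Concretely, a tool today ends up accepting the input image as a plain URL or path string and handing the result back as base64-encoded image content. Here is a sketch of that return path; it assumes the zoom_to_object helper from the sketch above and the Image helper from the MCP Python SDK, which takes care of the base64 packaging.

# Sketch of the return path under today's constraints: input arrives as a URL
# string that the server loads itself, and the crop goes back as base64-encoded
# image content via the SDK's Image helper.
import io

from mcp.server.fastmcp import FastMCP, Image as MCPImage

mcp = FastMCP("mcp-vision-sketch")

@mcp.tool()
def zoom_to_object_tool(image_url: str, label: str) -> MCPImage:
    """Zoom into the named object and return the cropped region as an image."""
    crop = zoom_to_object(image_url, label)  # the PIL crop from the sketch above
    if crop is None:
        raise ValueError(f"No '{label}' found in the image.")
    buffer = io.BytesIO()
    crop.save(buffer, format="PNG")
    return MCPImage(data=buffer.getvalue(), format="png")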

In the videos above, we come up against this limitation when we ask Claude to identify the type of shop from the V*Bench image. You can see the difference in prompts between the two demo videos at the top of the post. We load the image directly into the chat interface when running Claude without tools, and we give it a URL when running Claude with tools. This works because under the hood the HF pipeline accepts URLs for input images. On the way back, when the zoom_to_object tool returns the cropped image, it does so as base64-encoded bytes. At the time of writing, images larger than 500KB cannot be sent to Claude Desktop. So zooming into a small region by cropping is a perfect showcase of tool use that is less limited by the current protocol and client constraints.

But in an ideal world, the protocol itself would enforce a standard way to pass URIs for any inputs going to and outputs coming from MCP servers. Perhaps images could be made into resources that the protocol shares between the client and the servers.
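For instance (purely speculative; the crops:// scheme and the in-memory store below are invented for illustration), a server could register each crop as a resource and return only its URI from the tool call, letting the client fetch the bytes through a standard resources/read request:

# Speculative sketch of the "images as resources" idea.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("mcp-vision-sketch")
_crops: dict[str, bytes] = {}  # crop_id -> PNG bytes, filled in by the zoom tool

@mcp.resource("crops://{crop_id}", mime_type="image/png")
def read_crop(crop_id: str) -> bytes:
    """Serve a previously computed crop by URI instead of as inline bytes."""
    return _crops[crop_id]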

A vision for tools, the tools of vision

All in all, when it comes to really seeing things, our position is that VLMs should lean into the collective abilities of all the state-of-the-art specialist tools out there, and that a standardized protocol like MCP is just the guiding light they need.

Check out https://github.com/groundlight/mcp-vision, and please use it and contribute to an ever-expanding set of vision tools.

Groundlight announces it will provide its service via MCP

Ultra-specialized models with 24/7 human backup behind the standard protocol interface for your VLMs. 

Watch this space for a forthcoming announcement, and meanwhile here's a preview of the Groundlight MCP server: https://github.com/groundlight/groundlight-mcp-server

Citation

If you want to cite ideas in this blog post, please use:

@online{vision_as_mcp,
  author = {Varshavskaya, Paulina},
  title = {Vision as MCP Service},
  year = {2025},
  organization = {Groundlight AI},
  url = {https://www.groundlight.ai/blog/vision-as-mcp-service},
  urldate = {2025-05-20}
}

References

https://arxiv.org/abs/2401.06209 – Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs. Tong et al., CVPR 2024

https://arxiv.org/abs/2312.03052 – Visual Program Distillation: Distilling Tools and Programmatic Reasoning into VLMs. Hu et al., CVPR 2024

https://arxiv.org/abs/2503.09837v1 – On the Limitations of Vision-Language Models in Understanding Image Transforms. Anis, Ali, Sarfraz, 2025

https://arxiv.org/abs/2205.00363 – Visual Spatial Reasoning. Liu, Emerson, Collier, 2023

https://arxiv.org/abs/2310.19785 – What’s “up” with vision language models? Investigating their struggle with spatial reasoning. Kamath, Hessel, Chang, 2023

https://arxiv.org/abs/2501.12948 – DeepSeek-R1. DeepSeek-AI, 2025

https://arxiv.org/abs/2406.09403 – Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models. Hu et al., NeurIPS 2024

https://arxiv.org/abs/2411.16044 – ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration. Shen et al., 2024

https://huggingface.co/datasets/craigwu/vstar_bench – V* Bench dataset

https://huggingface.co/blog/vlms-2025 – Vision Language Models (Better, Faster, Stronger). Noyan et al., 2025