TL;DR: No, GPT is not reliable enough for real-world visual analysis. Sure, it can analyze images in the sense of producing text related to the content of the image. However, for trustworthy, repeatable, and actionable answers, you are better off training a specialized model. Groundlight allows you to do just that: all of the machine learning engineering and operations work is hidden behind its simple service.
An Application of AI Image Analysis – Manufacturing Quality Control
Maybe you would like to automate a manufacturing QA step by asking a Vision-Language Model (VLM) such as the latest version of ChatGPT to look at an image of an assembled product and verify that it’s been properly completed. Or perhaps you would like to check a machine setup prior to starting a job. Or verify that the correct order is shipping to the right customer based on a picture of the box contents. Can ChatGPT act as a reliable business AI assistant in these scenarios?
To pick a specific example, let’s say we would like our AI assistant to tell us if the timing assembly is correctly aligned on a car engine. This is the kind of industrial quality-control task that Groundlight is great at. Mistakes here are very costly, so you want the AI assistant to be correct and to ask for help in cases where it can’t be sure.
An Example of ChatGPT Analyzing a Complex Image
Here’s an example of an image from our engine timing assembly. We can ask ChatGPT to describe what it sees in this image, and it will describe it pretty well.
However, we aren’t interested in having a conversation with ChatGPT. What we’d really like is a stamp of approval in case of perfect alignment, and an alert in the unlikely but potentially fatal case of error. Something that an engineer could directly wire to a quality control system.
So here I am going to try to cajole correct, trustworthy, repeatable and actionable answers out of ChatGPT on this task through a series of question rewordings known as “prompt engineering”. We will see that it’s difficult to get a satisfactory result with ChatGPT.
Let’s ask ChatGPT if our engine timing assembly is correctly aligned.
It gives the right answer (aligned) to this not very well phrased question about a very specialized image, which is great. So can we trust this answer? Is it looking at the right thing?
The downloaded cropped image is:
The highlighted fiducial alignment image is:
ChatGPT 4o is admittedly not very good at object localization and drawing bounding boxes around regions of interest. It was not primarily designed for that. But maybe it can explain a little bit more about what it’s looking at, to convince us that we can trust its judgment.
Oh, the explanation hallucinates a blue dot on the belt, even though it knows about the white lines. It’s combining the image provided with its general knowledge of timing assemblies, learned from vast amounts of training data, in ways that are not very straightforward.
If I wanted to put a camera at the end of a timing assembly line to verify correctness, I don’t know if I could trust ChatGPT with this decision.
If instead we were talking with a reasonably competent English-speaking person with no specialized knowledge of timing assemblies, we would expect them to have no trouble identifying the letter ‘T’ and pointing to the region around it that will help them determine the answer to the alignment question.
Let’s attempt some prompt-engineering and require GPT-4o to produce a single-word actionable answer, one that could be automatically linked to passing or failing the inspection.
Now ChatGPT is giving the incorrect answer with a high level of confidence. That’s a dramatic reversal from the answer to our original question, and illustrates how much the precise wording submitted to the VLM will matter.
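To make the "automatically linked" part concrete, here is a minimal sketch of how a one-word reply might be mapped to an inspection outcome. The function name and the PASS/FAIL/REVIEW labels are our own illustrative assumptions, not part of any ChatGPT API. Anything other than a bare "yes" or "no" is sent to manual review rather than guessed at.

```python
import re


def parse_inspection_reply(reply: str) -> str:
    """Map a free-text model reply to 'PASS', 'FAIL', or 'REVIEW'.

    Hypothetical helper: only a reply consisting of the single word
    "yes" or "no" (ignoring case and punctuation) is accepted.
    """
    words = re.findall(r"[a-z]+", reply.lower())
    if words == ["yes"]:
        return "PASS"
    if words == ["no"]:
        return "FAIL"
    # Hedged or multi-word replies are not safe to act on automatically.
    return "REVIEW"


if __name__ == "__main__":
    for reply in ["Yes.", "No", "Yes, it appears to be aligned."]:
        print(repr(reply), "->", parse_inspection_reply(reply))
```

Note that parsing like this only makes the reply machine-readable. It does nothing to make the underlying answer correct, which is the real problem here.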
The good news is that we can ask it to explain its reasoning. Why is it answering “No” now?
Clicking on the analysis above reveals what ChatGPT thinks is an appropriate way to measure misalignment in this image, and maybe the reason why it’s giving a wrong answer.
from PIL import Image
# Load the image
image = Image.open('/mnt/data/18-55-56.400614.jpeg')
# Convert the image to grayscale for better edge detection
gray_image = image.convert('L')
# Define the coordinates for the fiducial marks on the gear and the belt
gear_mark_coords = (270, 135) # Approximate coordinates for the 'T' mark on the gear
belt_mark_coords = (270, 155) # Approximate coordinates for the mark on the belt
# Calculate the misalignment distance
misalignment_distance = abs(gear_mark_coords[1] - belt_mark_coords[1])
misalignment_distance
GPT-4o’s approach to checking alignment is to take the vertical pixel distance between approximate, hard-coded coordinates for the letter “T” and for the white line on the belt. Note that the grayscale image it prepares is never actually used. This is, of course, a naive and incorrect way to establish alignment between two objects in a perspective image.
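To illustrate why a vertical pixel gap says little about alignment, here is a small sketch using a deliberately simplified pinhole camera that faces the gear head-on. All dimensions (radii, depths, focal length) are hypothetical and chosen only for illustration. A real, tilted view would need more than this, for example a homography, so treat the angular measure below as a contrast to the naive one rather than a production method.

```python
import math


def project(radius_mm, angle_deg, depth_mm, focal_px=800.0):
    """Pinhole projection of a point on a gear plane parallel to the image.

    Returns (u, v) in pixels relative to the projected gear center.
    """
    a = math.radians(angle_deg)
    x, y = radius_mm * math.cos(a), radius_mm * math.sin(a)
    return focal_px * x / depth_mm, focal_px * y / depth_mm


def naive_vertical_gap(p, q):
    """The measure from the generated code: vertical pixel distance."""
    return abs(p[1] - q[1])


def angular_offset_deg(p, q):
    """Angle between the two marks as seen from the gear center."""
    d = math.degrees(math.atan2(p[1], p[0]) - math.atan2(q[1], q[0]))
    return abs((d + 180.0) % 360.0 - 180.0)


# Hypothetical geometry: gear mark at radius 40 mm, belt mark at 50 mm.
GEAR_R, BELT_R = 40.0, 50.0

# Case 1: marks share the same angle (aligned), camera at two distances.
near = (project(GEAR_R, 90, 400), project(BELT_R, 90, 400))
far = (project(GEAR_R, 90, 800), project(BELT_R, 90, 800))

# Case 2: belt mark rotated so that it sits at the same image height.
skew_deg = math.degrees(math.asin(GEAR_R / BELT_R))  # about 53 degrees
skewed = (project(GEAR_R, 90, 400), project(BELT_R, skew_deg, 400))

if __name__ == "__main__":
    for name, (g, b) in [("near", near), ("far", far), ("skewed", skewed)]:
        print(name, round(naive_vertical_gap(g, b), 3),
              round(angular_offset_deg(g, b), 3))
```

In this toy setup the aligned marks show a vertical gap of about 20 pixels up close and about 10 pixels from farther away, while the marks that are rotated apart by roughly 37 degrees show a vertical gap of essentially zero. The naive measure therefore tracks camera distance and mark placement, not alignment.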
What we’re getting at with this example is the reliability of these zero-shot (“as-is”, no specialized training on our problem) models. In the absence of training data, off-the-shelf VLMs are a great starting point, provided we don’t trust their highly confident answers too much.
With enough prompt-engineering effort on any new task, ChatGPT 4o can probably be coaxed into giving the right answer. But don’t rely on ChatGPT (even the GPT-4o version) when specialized, repeatable, actionable information about images is required, such as at inspection points. You’re going to have to train a specialized model.
Limitations of GPT-4o Image Analysis
To recap, despite its impressive image analysis and text generation abilities, ChatGPT has some significant limitations when it comes to getting repeatable and reliable answers to specialized questions:
- It’s not as good at questions and images from specialized domains that don’t have troves of visual data available online
- On the other hand, it uses its entire knowledge base to try to answer your question, which means that it will sometimes hallucinate things that are not in the image
- It may take several iterations of prompt engineering to get it to pay attention and correctly respond to the right things in the image
- It will sometimes give you a confidently incorrect answer; its confidence in its answers is not well-calibrated
- It was not designed for object localization and so has trouble 1) visually anchoring in the correct region of the image and 2) communicating this anchoring in a trustworthy manner to the user
- The chat interface is great for a conversation, but not well-suited to integrating with your application without efforts of a dedicated dev team
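The calibration point in the list above can be made measurable. One common summary is the expected calibration error (ECE), which compares stated confidence with observed accuracy across confidence bins. The sketch below is a minimal illustration with made-up numbers, not a measurement of ChatGPT or of any Groundlight model.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy per bin.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    Returns 0.0 for perfectly calibrated answers; larger is worse.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Hypothetical data: a model that claims 95% confidence but is right
    # only half of the time is badly calibrated.
    overconfident = expected_calibration_error(
        [0.95, 0.95, 0.95, 0.95], [True, True, False, False]
    )
    print(round(overconfident, 2))
```

A well-calibrated model that says "75% confident" should be right about three times out of four, which is what lets a downstream system decide when an answer is safe to act on and when to ask for help.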
When you need repeated, actionable, trustworthy answers to a visual question, you would be better off training a specialized model that can be easily integrated into your application. There are alternative image analysis solutions just for such situations.
A Reliable AI Solution to Analyze Images – Groundlight AI
Groundlight AI takes a different approach to image analysis from chatbot-style VLMs such as ChatGPT. Instead of conversational agents, Groundlight AI detectors are specialized vision models trained behind the scenes to reliably answer your visual questions on your data. They output calibrated confidence estimates with every answer and are backed by real people in cases where the model is unsure.
Setting up a Groundlight detector to tell us whether our engine timing assembly is aligned involves creating a new detector with the same question we ended up asking ChatGPT, then sending our image in.
The first time the question is asked, the model will not be confident of its answer. Whenever that is the case, it answers UNSURE and escalates the question to Groundlight's Cloud Labelers (live human monitors who review the question in real time), who can provide the correct answer, as this screenshot shows.
Integrating detector answers into your application takes just a couple of lines of code; see our video tutorial and our SDK and integrations documentation for more.
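As a rough sketch of what that integration might look like in Python: the `route_answer` function below is our own illustration, and the detector name, query wording, YES/NO label strings, and 0.9 threshold are hypothetical. The SDK calls inside `check_alignment` reflect our understanding of the `groundlight` package and should be checked against the SDK documentation. They require the package to be installed and an API token to be configured, so they are kept inside a function that is not called here.

```python
def route_answer(label, confidence, threshold=0.9):
    """Turn a detector answer into an action for the QC line.

    Hypothetical policy: act only on confident YES/NO answers and
    hold the part for anything unsure or below the threshold.
    """
    text = str(getattr(label, "value", label)).upper()
    if confidence is None or confidence < threshold:
        return "HOLD"
    if text == "YES":
        return "PASS"
    if text == "NO":
        return "FAIL"
    return "HOLD"  # covers UNSURE and any unexpected label


def check_alignment(image_path):
    """Assumed SDK usage; needs `pip install groundlight` and an API token."""
    from groundlight import Groundlight  # imported lazily on purpose

    gl = Groundlight()
    detector = gl.get_or_create_detector(
        name="timing-assembly-aligned",  # hypothetical detector name
        query="Is the engine timing assembly correctly aligned?",  # hypothetical wording
    )
    image_query = gl.submit_image_query(detector=detector, image=image_path)
    return route_answer(image_query.result.label, image_query.result.confidence)


if __name__ == "__main__":
    print(route_answer("YES", 0.97))
    print(route_answer("UNSURE", 0.55))
```

Keeping the routing policy in its own small function means the pass/fail/hold rule can be tested without a camera, a network connection, or an account.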
You get answers that are correct at a human level, trustworthy, repeatable, and actionable right away. And those answers are immediately used in retraining and improving the specialized model that powers your detector.
Over time, as uncertain images are labeled by people and models are retrained, your detector learns exactly the decision function your business application needs. The more the model learns, the fewer images result in uncertain answers. The fraction of queries answered by the specialized model therefore increases and approaches 100%, which keeps the cost of human labeling low while maintaining high quality and well-calibrated confidence in the answers.