In the interest of performance, the typical approach to Computer Vision assumes a 1:1 relationship between applications and GPUs. Every time a camera feeds video into a CV model, that data supposedly has to be processed in real time: either shipped to the cloud for inference or handled by powerful (and costly) edge hardware. This is commonly where projects fall apart. The CV expert building the POC has little incentive to show the full deployment costs, because even a back-of-the-envelope estimate would have kept the project from being approved. GPU sticker shock would have stopped it before it began.
To compound the problem, a single CV application might perform several CV processes, each requiring a different model, and therefore ANOTHER GPU. Say you have a camera on a factory floor and want automated Quality Control. You'll need to find the object in the frame, so you build an object detection model. Then you want to know whether it contains defects, so you compare it against a known-good reference version of the product, which means a comparison model. If it fails that comparison, you'll want to know whether the machinery is running imprecisely and needs maintenance (which requires a dimension measurement model), or whether a flaw is being introduced and a floor supervisor needs to find the root cause (so you need a defect detection model). If any of your data might pass through the EU or California, you'll also need a model to detect faces and blur them for privacy compliance. And all of these run on every frame, streaming in 30 times a second.
So now I need five models on five GPUs...
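To make that arithmetic concrete, here is a minimal sketch of the naive "every model on every frame" pipeline. The model names and per-inference latencies below are hypothetical placeholders, not benchmarks from any real deployment; the point is only that GPU time per frame adds up faster than the frames arrive.

```python
# A minimal sketch of the naive "every model on every frame" pipeline.
# Model names and per-inference latencies are hypothetical placeholders,
# not measurements from any real deployment.

FPS = 30
MODEL_LATENCY_MS = {
    "object_detection": 25,
    "golden_sample_comparison": 15,
    "dimension_measurement": 20,
    "defect_detection": 25,
    "face_blur_privacy": 15,
}

per_frame_ms = sum(MODEL_LATENCY_MS.values())   # 100 ms of GPU time per frame
gpu_load = per_frame_ms / 1000 * FPS            # GPU-seconds needed per camera-second

print(f"GPU time per frame: {per_frame_ms} ms")
print(f"GPU-seconds required per second of one camera: {gpu_load:.1f}")
# Anything above 1.0 means a single GPU can't keep up in real time; with
# these assumed latencies, one camera already needs roughly three GPUs
# (or three GPUs' worth of cloud inference billing).
```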
The whole point of a business Filter (a CV application) is to turn cameras into spreadsheets. This is a form of what's called semantic compression. A single uncompressed 4K video frame is roughly 24 million bytes (3840 × 2160 pixels at 3 bytes per pixel), but the relevant information it holds might be summarized in a JSON file of only 150 bytes. Eliminating irrelevant visual data is the fundamental task of Computer Vision for business.
Since it's all fundamentally digital data, you can semantically compress everything relevant in that frame into a payload that is a vanishingly small fraction of the original: the 150-byte JSON hiding inside the 24-megabyte image.
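As a rough illustration of that ratio, the snippet below compares one uncompressed 4K frame against the handful of bytes a business actually needs from it. The field names and values are purely hypothetical, not a standard schema.

```python
import json

# A hypothetical "camera to spreadsheet" record: what a single 4K frame
# might boil down to after semantic compression. Fields are illustrative.
frame_summary = {
    "ts": "2024-05-01T08:30:12Z",
    "camera": "line3_qc",
    "part_id": 48211,
    "defect_found": False,
    "width_mm": 120.4,
    "faces_blurred": 0,
}

payload = json.dumps(frame_summary, separators=(",", ":")).encode()
raw_frame_bytes = 3840 * 2160 * 3   # one uncompressed 24-bit RGB 4K frame

print(f"{len(payload)} bytes of JSON vs {raw_frame_bytes:,} bytes of pixels")
print(f"roughly {raw_frame_bytes // len(payload):,}:1 semantic compression")
```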
To address these challenges, preprocessing raw video on less-expensive CPUs before it reaches the GPU can dramatically reduce computational load and the associated costs. By ensuring that only relevant frames (those containing predefined critical events) ever reach the models, businesses can cut GPU workload and, in some cases, reduce costs by up to 90%. Preprocessing of this kind lets organizations scale their CV applications, making deployments financially viable and operationally efficient.
If you want Vision at the Edge, you'll need to leverage a combination of utilities that compress raw video into structured, efficient inputs before it ever touches a model; one such utility is sketched below.
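As an example of what such a utility could look like, this sketch uses OpenCV background subtraction on the CPU to gate frames: only frames with enough scene change are forwarded to the expensive models. The motion threshold, camera source, and the run_gpu_inference() function are assumptions for illustration, not part of any specific product.

```python
import cv2

# CPU-side preprocessing sketch: cheap background subtraction decides
# whether a frame is "interesting" before anything touches a GPU.

MOTION_THRESHOLD = 0.02   # assumed: fraction of pixels that must change


def run_gpu_inference(frame):
    """Placeholder for the expensive model pipeline (detection, QC, etc.)."""
    pass


cap = cv2.VideoCapture(0)   # assumed camera index
bg = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

while True:
    ok, frame = cap.read()
    if not ok:
        break

    # Runs entirely on CPU: downscale, then estimate how much of the scene changed.
    small = cv2.resize(frame, (320, 180))
    mask = bg.apply(small)
    changed = cv2.countNonZero(mask) / mask.size

    # Only frames with enough activity are forwarded to the GPU models.
    if changed > MOTION_THRESHOLD:
        run_gpu_inference(frame)

cap.release()
```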
By offloading these preprocessing tasks to inexpensive CPU-based systems, businesses can dramatically reduce GPU usage, cutting cloud processing fees, edge hardware requirements, or both. Most Computer Vision projects fail not because the models don't work, but because the economics don't work. By shifting the heavy lifting away from GPUs and toward lightweight preprocessing, we make it possible to deploy CV applications at scale without blowing the budget.