Hard object detection using temporal features

Using data over time to improve model performance


Computer vision work tends to focus heavily on single-shot performance, which can carry a large cost in terms of model complexity, dataset collection and annotation time, and training compute.

It can also become a case of diminishing returns, where significant effort yields only a small boost in performance.

To get around this, I'll share some of my experience combining traditional CV techniques with deep learning to get the results needed.

Where single-shot falls apart

For the purposes of this article, I'll focus on detecting objects in complex environments, i.e. small objects, occlusions, and blurry features.

A complicated environment can be a good use case for a larger detection model, which can learn to extract the scene information required for acceptable detection, but as mentioned this comes with a few issues:

  • A larger model needs more data to train
  • More data means more collection and labeling time
  • Longer training times
  • Longer inference times (a big problem if you're running a realtime system or object tracking)
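To make the "traditional CV over time" idea concrete, here is a minimal sketch (not from the article, just an illustration) of one classic temporal technique: a running-average background model with frame differencing. The idea is that a small, blurry object that a single-shot detector struggles with can still stand out strongly against an accumulated background, giving cheap candidate regions to feed into (or fuse with) a detector. All names and thresholds here are my own assumptions.

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    """Exponential moving average background model; alpha controls
    how quickly the background adapts to scene changes."""
    return (1 - alpha) * background + alpha * frame.astype(np.float32)

def motion_mask(background, frame, threshold=40):
    """Pixels that differ strongly from the running background are
    motion candidates, even if their appearance is small or blurry."""
    diff = np.abs(frame.astype(np.float32) - background)
    return diff > threshold

# Synthetic example: a static grayscale scene with a small bright
# object sweeping across it over ten frames.
rng = np.random.default_rng(0)
scene = rng.integers(90, 110, size=(64, 64)).astype(np.float32)
background = scene.copy()

for step in range(10):
    frame = scene.copy()
    frame[30:34, step * 5 : step * 5 + 4] = 255.0  # the moving object
    mask = motion_mask(background, frame)           # detect before updating
    background = update_background(background, frame)

# After the loop, `mask` localises only the object's current position;
# old positions have been absorbed back into the background.
```

In practice you would run something like this alongside a detector, using the motion candidates to boost recall on the hard objects rather than relying on a single frame alone.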