Recent developments in deep learning have pushed the performance of image object detection (OD) to new heights through learning-based, data-driven approaches. Video OD, by contrast, remains less explored, mostly because annotating video data is far more expensive. Multi-Object Tracking (MOT) is closely related to video OD in spirit. However, most MOT datasets are class-specific, which limits a model’s ability to track objects of other classes. We propose TrIVD (Tracking and Image-Video Detection), the first framework that unifies image OD, video OD, and MOT within one end-to-end model. Experiments demonstrate that TrIVD achieves state-of-the-art performance across all image/video OD and MOT tasks.