Omid Poursaeed

Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Visual-LLMs (V-LLMs) have enabled exceptional performance in vision-language tasks. However, existing V-LLMs (e.g. BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness and fail at simple tasks like distinguishing a left vs right location. In this work, we explore how image-space coordinate based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions.

K. Ranasinghe, S. Shukla, Omid Poursaeed, M. Ryoo, T. Lin

CVPR, 2024.

PDF ArXiv

WaSt-3D: Wasserstein-2 Distance for Scene-to-Scene Stylization on 3D Gaussians

Existing approaches for 3D scene stylization demonstrate proficiency in transferring colors and textures but often struggle with replicating the geometry of the scenes. In our work, we leverage an explicit Gaussian Splatting (GS) representation and directly match the distributions of Gaussians between style and content scenes using the Earth Mover’s Distance (EMD). By employing the entropy-regularized Wasserstein-2 distance, we ensure that the transformation maintains spatial smoothness. Our method achieves high-resolution 3D stylization by faithfully transferring details from 3D style scenes onto the content scene. Furthermore, WaSt-3D consistently delivers results across diverse content and style scenes without requiring any training.

D. Kotovenko, O. Grebenkova, N. Sarafianos, A. Paliwal, P. Ma, Omid Poursaeed, S. Mohan, Y. Fan, Y. Li, R. Ranjan, B. Ommer

ECCV, 2024.

PDF Code ArXiv
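
As a rough illustration of the entropy-regularized matching mentioned in the WaSt-3D abstract above, the sketch below runs Sinkhorn iterations, a standard solver for entropy-regularized optimal transport, between two tiny point sets. Treating each Gaussian as just its 3D center, the uniform weights, the `eps` value, and all function names are assumptions made for this example; the abstract does not say which solver or features the paper uses.

```python
import math

def sinkhorn_plan(xs, ys, eps=0.5, iters=200):
    """Entropy-regularized transport plan between two uniformly weighted point sets."""
    n, m = len(xs), len(ys)
    # Squared Euclidean cost, the ground cost behind the Wasserstein-2 distance.
    cost = [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in ys] for x in xs]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns toward the target marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    return plan, cost

def transport_cost(plan, cost):
    """Total cost of moving mass according to the plan."""
    return sum(p * c for p_row, c_row in zip(plan, cost) for p, c in zip(p_row, c_row))

# Hypothetical Gaussian centers for a content scene and a style scene.
content = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
style = [(0.1, 0.0, 0.0), (1.0, 0.2, 0.0), (0.0, 1.0, 0.3)]
plan, cost = sinkhorn_plan(content, style)
print(transport_cost(plan, cost))
```

A larger `eps` spreads each point's mass over more targets, which gives a smoother but less exact matching; a smaller `eps` approaches the unregularized Earth Mover’s Distance.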

Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning

We introduce Patch Aligned Contrastive Learning (PACL), a modified compatibility function for CLIP’s contrastive loss, intending to train an alignment between the patch tokens of the vision encoder and the CLS token of the text encoder. Using pre-trained CLIP encoders with PACL, we are able to set the state-of-the-art on the task of open vocabulary zero-shot segmentation on 4 different segmentation benchmarks: Pascal VOC, Pascal Context, COCO Stuff and ADE20K. Furthermore, we show that PACL is also applicable to image-level predictions and when used with a CLIP backbone, provides a general improvement in zero-shot classification accuracy compared to CLIP.

J. Mukhoti, T.Y. Lin, Omid Poursaeed, R. Wang, A. Shah, P. Torr, S. Lim

CVPR (Highlight), 2023.

PDF ArXiv
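
To make the idea of aligning patch tokens with the text CLS token in the PACL abstract above more concrete, here is one plausible form of a patch-level compatibility score: cosine similarities between each patch embedding and the text embedding, pooled with softmax weights over patches. The pooling rule, the toy 2-D embeddings, and the function names are assumptions for illustration; the abstract itself only states that the compatibility function is modified to align patches with the text CLS token.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def patch_text_compatibility(patch_tokens, text_cls):
    """Return (pooled score, per-patch similarities) for one image-text pair."""
    sims = [cosine(p, text_cls) for p in patch_tokens]
    # Softmax over patches: patches that match the text get larger weights.
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    score = sum(w * s for w, s in zip(weights, sims))
    return score, sims

# Toy embeddings (hypothetical): three patches and one text CLS vector.
patches = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
text = (1.0, 0.0)
score, sims = patch_text_compatibility(patches, text)
print(score, sims)
```

At inference time, the per-patch similarities could be compared across several class prompts to assign each patch a label, which is what makes a score of this kind usable for zero-shot segmentation.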

Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles

Modern multi-stage vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, this added complexity actually makes these transformers slower than their vanilla ViT counterparts. In this paper, we argue that this additional bulk is unnecessary. By pretraining with a strong pretext task (MAE), we can strip out all the bells and whistles from a state-of-the-art multi-stage vision transformer without losing accuracy. In the process, we create Hiera, an extremely simple multi-stage vision transformer that is more accurate than previous models while being significantly faster both at inference and during training.

C. Ryali, Y. Hu, D. Bolya, C. Wei, H. Fan, B. Huang, V. Aggarwal, A. Chowdhury, Omid Poursaeed, J. Hoffman, J. Malik, Y. Li, C. Feichtenhofer

ICML (Oral), 2023.

Code ArXiv

A Unified Model for Tracking and Image-Video Object Detection

Recent developments in deep learning have pushed the performance of image object detection (OD) to new heights through learning-based, data-driven approaches. On the other hand, video OD remains less explored, mostly due to much more expensive data annotation needs. At the same time, Multi-Object Tracking (MOT) is similar in spirit to video OD. However, most MOT datasets are class-specific, which constrains a model’s flexibility to perform tracking on other objects. We propose TrIVD (Tracking and Image-Video Detection), the first framework that unifies image OD, video OD, and MOT within one end-to-end model. Experiments demonstrate that TrIVD achieves state-of-the-art performance across all image/video OD and MOT tasks.

P. Liu, R. Wang, P. Zhang, Omid Poursaeed, Y. Zhou, X. Cao, S. Roy, A. Shah, S. Lim

ArXiv, 2023.

PDF ArXiv

Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding

A common approach to processing long videos is to apply a short-form video model over uniformly sampled clips of fixed temporal length and aggregate the outputs. In this paper, we aim to provide a generic and adaptive sampling approach for long-form videos in lieu of the de facto uniform sampling. We formulate a task-agnostic, unsupervised, and scalable approach based on Kernel Temporal Segmentation (KTS) for sampling and tokenizing long videos. We evaluate our method on long-form video understanding tasks such as video classification and temporal action localization, showing consistent gains over existing approaches and achieving state-of-the-art performance on long-form video modeling.

M. Afham, S. Shukla, Omid Poursaeed, P. Zhang, A. Shah, S. Lim

ICCVW, 2023.

PDF ArXiv

Universal Pyramid Adversarial Training for Improved ViT Performance

Pyramid Adversarial training has been shown to be very effective for improving clean accuracy and robustness of vision transformers. However, due to the iterative nature of adversarial training, the technique is up to 7 times more expensive than standard training. To make the method more efficient, we propose Universal Pyramid Adversarial training, where we learn a single pyramid adversarial pattern shared across the whole dataset instead of the sample-wise patterns. We decrease the computational cost of Pyramid Adversarial training by up to 70 percent while retaining the majority of its benefits. In addition, to the best of our knowledge, we are also the first to find that universal adversarial training can be leveraged to improve clean model performance.

P. Chiang, Y. Zhou, Omid Poursaeed, S. Shukla, T. Goldstein, S. Lim

ArXiv, 2023.

PDF ArXiv

Robustness and Generalization via Generative Adversarial Training

Several defenses have been proposed to improve robustness of deep neural networks against input variations. However, current defenses can only withstand the specific attack used in training and often degrade performance of the model on clean images. In this paper, we present an approach to simultaneously improve the model’s generalization and robustness to unseen adversarial attacks. Instead of altering a single pre-defined aspect of images, we generate a spectrum of low-level, mid-level and high-level changes using generative models with a disentangled latent space. We show that adversarial training with our approach not only improves performance of the model on clean images but also makes it robust against unforeseen attacks.

Omid Poursaeed, Tianxing Jiang, Harry Yang, Serge Belongie, Ser-Nam Lim

ICCV, 2021.

PDF Slides ArXiv Supp

Coupling Explicit and Implicit Surface Representations for Generative 3D Modeling

We propose a novel neural architecture for representing 3D surfaces by harnessing complementary explicit and implicit shape representations. We make these two representations synergistic by introducing novel consistency losses. Our hybrid architecture outputs results which are superior to the output of the two equivalent single-representation networks, yielding smoother explicit surfaces with more accurate normals, and a more accurate implicit occupancy function. Additionally, our surface reconstruction step can directly leverage the explicit atlas-based representation. This process is computationally efficient, and can be directly used by differentiable rasterizers, enabling training our hybrid representation with image-based losses.

Omid Poursaeed, Matthew Fisher, Noam Aigerman, Vladimir Kim

ECCV, 2020.

PDF Slides Video ArXiv Supp

Interpolative AutoEncoders for Unsupervised Few-Shot Image Generation

We aim to build image generation models that generalize to new domains from few examples. To this end, we first investigate the generalization properties of classic image generators, and discover that autoencoders generalize extremely well to new domains, even when trained on highly constrained data. We leverage this insight to produce a robust, unsupervised few-shot image generation algorithm, and introduce a novel training procedure based on recovering an image from data augmentations. Our Augmentation-Interpolative AutoEncoders synthesize realistic images of novel objects from only a few reference images, and outperform both prior interpolative models and supervised few-shot image generators.

Davis Wertheimer, Omid Poursaeed, Bharath Hariharan

ICLR, 2020.

PDF Supp ArXiv

Self-supervised Learning of Point Clouds via Orientation Estimation

While deep neural networks have achieved impressive results on point cloud learning tasks, they require massive amounts of manually labeled data. In this paper we leverage 3D self-supervision for learning downstream tasks on point clouds with fewer labels. A point cloud can be rotated in infinitely many ways, which provides a rich label-free source for self-supervision. We consider the auxiliary task of predicting rotations that in turn leads to useful features for other tasks. Using experiments on ShapeNet and ModelNet, we demonstrate that our approach outperforms the state-of-the-art. Moreover, features learned by our model are complementary to other self-supervised methods and combining them leads to further performance improvement.

Omid Poursaeed, Tianxing Jiang, Han Qiao, Nayun Xu, Vladimir Kim

3DV, 2020.

PDF Code Slides Video ArXiv
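
The rotation-prediction pretext task described in the point cloud abstract above can be sketched as a label-free data generator: rotate a cloud by a known amount and use that amount as the training target. Restricting rotations to the z-axis, discretizing them into `num_bins` classes, and the helper names are simplifying assumptions for this example, not details taken from the paper.

```python
import math
import random

def rotate_z(points, angle):
    """Rotate a list of (x, y, z) points about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

def make_rotation_example(points, num_bins=4, rng=random):
    """Return (rotated cloud, label); the label is the index of the applied rotation."""
    label = rng.randrange(num_bins)
    angle = 2.0 * math.pi * label / num_bins
    return rotate_z(points, angle), label

# A tiny hypothetical point cloud; no manual annotation is needed for the label.
cloud = [(1.0, 0.0, 0.5), (0.0, 2.0, -1.0), (0.3, 0.4, 0.0)]
rotated, label = make_rotation_example(cloud, num_bins=4, rng=random.Random(0))
print(label, rotated)
```

A network trained to recover `label` from `rotated` has to pick up on shape cues, and its intermediate features can then be reused for downstream tasks that have few labels.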

Neural Puppet: Generative Layered Cartoon Characters

We propose a learning based method for generating new animations of a cartoon character given a few example images. We express pose changes as a deformation of a layered 2.5D template mesh, and devise a novel architecture that learns to predict mesh deformations matching the template to a target image. In addition to coarse poses, character appearance also varies due to shading, out-of-plane motions, and artistic effects. We capture these subtle changes by applying an image translation network to refine the mesh rendering. Our generative model can be used to synthesize in-between frames and to create data-driven deformation. Our template fitting procedure outperforms state-of-the-art generic techniques for detecting image correspondences.

Omid Poursaeed, Vladimir Kim, Eli Shechtman, Jun Saito, Serge Belongie

WACV, 2019.

PDF Poster Slides ArXiv Supp

Differential Privacy has Disparate Impact on Model Accuracy

Differential privacy (DP) is a popular mechanism for training machine learning models with bounded leakage about the presence of specific points in the training data. The cost of differential privacy is a reduction in the model’s accuracy. We demonstrate that in neural networks trained using differentially private stochastic gradient descent (DP-SGD), this cost is not borne equally: accuracy of DP models drops much more for the underrepresented classes and subgroups. We demonstrate this effect for a variety of tasks and models, including sentiment analysis of text and image classification. We then explain why DP training mechanisms such as gradient clipping and noise addition have a disproportionate effect on the underrepresented and more complex subgroups.

Eugene Bagdasaryan, Omid Poursaeed, Vitaly Shmatikov

NeurIPS, 2019.

PDF Code Poster
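
The two DP-SGD mechanisms named in the differential privacy abstract above, gradient clipping and noise addition, can be sketched for a single update as follows. This is a minimal pure-Python illustration of the general DP-SGD recipe, not the code used in the paper; the function name, the toy gradients, and the parameter values are assumptions.

```python
import math
import random

def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, and average."""
    dim = len(per_example_grads[0])
    clipped_sum = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down any gradient whose L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
        for k, g in enumerate(grad):
            clipped_sum[k] += g * scale
    # Noise standard deviation is proportional to the clipping bound.
    sigma = noise_multiplier * clip_norm
    noisy_sum = [s + rng.gauss(0.0, sigma) for s in clipped_sum]
    return [s / len(per_example_grads) for s in noisy_sum]

# Hypothetical gradients: the first has norm 5.0, the second has norm 0.5.
grads = [(3.0, 4.0), (0.3, 0.4)]
update = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=1.0, rng=random.Random(0))
print(update)
```

In this toy batch the first gradient is shrunk by a factor of five while the second passes through unchanged, so examples that produce large gradients contribute proportionally less to the update, and the added noise is the same size regardless of how many examples support a given direction.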

Deep Fundamental Matrix Estimation without Correspondences

Estimating fundamental matrices is a classic problem in computer vision. Traditional methods rely heavily on the correctness of estimated key-point correspondences, which can be noisy and unreliable. As a result, it is difficult for these methods to handle image pairs with large occlusion or significantly different camera poses. In this paper, we propose novel neural network architectures to estimate fundamental matrices in an end-to-end manner without relying on point correspondences. New modules and layers are introduced in order to preserve mathematical properties of the fundamental matrix as a homogeneous rank-2 matrix with seven degrees of freedom. We analyze performance of the proposed models on the KITTI dataset, and show that they achieve competitive performance with traditional methods without the need for extracting correspondences.

Omid Poursaeed, Guandao Yang, Aditya Prakash, Hanqing Jiang, Qiuren Fang, Bharath Hariharan, Serge Belongie

ECCV, 2018.

PDF Code Slides ArXiv Poster
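
The properties of the fundamental matrix mentioned in the abstract above can be checked numerically on a toy case. A 3x3 matrix has nine entries; being defined only up to scale removes one degree of freedom and the rank-2 constraint (zero determinant) removes another, leaving seven. The sketch below assumes two cameras with identity intrinsics related by a pure translation `t`, in which case the matrix reduces to the skew-symmetric cross-product matrix of `t`. The setup and helper names are assumptions for illustration and are unrelated to the network layers proposed in the paper.

```python
def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) applied to v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def det3(m):
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def epipolar_residual(F, x1, x2):
    """Value of x2^T F x1, which is zero for a true correspondence."""
    Fx1 = [sum(F[i][j] * x1[j] for j in range(3)) for i in range(3)]
    return sum(x2[i] * Fx1[i] for i in range(3))

# Hypothetical setup: second camera is the first one shifted by t.
t = (1.0, 2.0, 3.0)
F = skew(t)
X1 = (0.5, -0.2, 4.0)                      # 3D point in camera-1 coordinates
X2 = tuple(a + b for a, b in zip(X1, t))   # same point in camera-2 coordinates
x1 = tuple(c / X1[2] for c in X1)          # homogeneous image point, camera 1
x2 = tuple(c / X2[2] for c in X2)          # homogeneous image point, camera 2
print(det3(F), epipolar_residual(F, x1, x2))
```

Multiplying `F` by any nonzero constant leaves both checks unchanged, which is the sense in which the matrix is homogeneous.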

Generative Adversarial Perturbations

We propose novel generative models for creating adversarial examples, slightly perturbed images resembling natural images but maliciously crafted to fool trained models. Our approach can produce image-agnostic and image-dependent perturbations for both targeted and non-targeted attacks. We also demonstrate that similar architectures can achieve impressive results in fooling both classification and semantic segmentation models, obviating the need for hand-crafting attack methods for each task. We improve the state-of-the-art performance in universal perturbations by leveraging generative models in lieu of current iterative methods. Our attacks are considerably faster than iterative and optimization-based methods at inference time. Moreover, we are the first to present effective targeted universal perturbations.

Omid Poursaeed, Isay Katsman, Bicheng Gao, Serge Belongie

CVPR, 2018.

PDF Code Slides ArXiv Poster Supp

Stacked Generative Adversarial Networks

We propose a novel generative model which is trained to invert the hierarchical representations of a bottom-up discriminative network. Our model consists of a top-down stack of GANs, each learned to generate lower-level representations conditioned on higher-level representations. A representation discriminator is introduced at each feature hierarchy to encourage the representation manifold of the generator to align with that of the bottom-up discriminative network. Unlike the original GAN that uses a single noise vector to represent all the variations, our SGAN decomposes variations into multiple levels and gradually resolves uncertainties in the top-down generative process. Based on visual inspection, Inception scores and visual Turing test, we demonstrate that SGAN is able to generate images of much higher quality than GANs without stacking.

Xun Huang, Yixuan Li, Omid Poursaeed, John Hopcroft, Serge Belongie

CVPR, 2017.

PDF Code Poster Slides ArXiv

Vision Based Real Estate Price Estimation

Several online real estate database companies provide automatic estimation of market values for houses using a proprietary formula. Although these estimates are often close to the actual sale prices, in some cases they are highly inaccurate. One of the key factors that affects the value of a house is its interior and exterior appearance, which is not considered in calculating these estimates. In this paper, we evaluate the impact of visual characteristics of a house on its market value. Using deep convolutional neural networks on a large dataset of photos of home interiors and exteriors, we develop a novel framework for automated value assessment using these photos in addition to other home characteristics. By applying our proposed method for price estimation to a new dataset of real estate photos and metadata, we show that it outperforms Zillow’s estimates.

Omid Poursaeed, Tomas Matera, Serge Belongie

Machine Vision and Applications, 2017.

PDF Dataset ArXiv Supp

Omid Poursaeed

Research Scientist at Meta AI
