
Research

Generative Model with Multi-modal Data
Sound-Guided Semantic Image Manipulation [CVPR ’22]
Sound-Guided Semantic Video Generation [ECCV ’22]

A generative model learns the data distribution and can synthesize unseen images from it. It has been widely researched in the computer vision community because of its usefulness in many applications. A generative model can take an input image as a condition and modify its output accordingly. Similarly, one can generate novel artworks from existing artwork together with another modality, such as a text message or sound. Another application is colorization: the input is a grayscale image and the model outputs a colorized image.
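
The conditioning idea above can be sketched in a few lines. This is a toy illustration, not any of our published models: a hypothetical generator that concatenates a latent noise vector with a condition vector (e.g., an embedding of a grayscale image or a sound clip) and maps the result through a small MLP. All weights and sizes here are made up for illustration.

```python
import numpy as np

def conditional_generator(noise, condition, w1, w2):
    """Toy conditional generator: concatenate a noise vector with a
    condition vector and map it through a two-layer MLP to a flat
    'image' vector with values in (-1, 1)."""
    x = np.concatenate([noise, condition])  # conditioning by concatenation
    h = np.tanh(w1 @ x)                     # hidden layer
    return np.tanh(w2 @ h)                  # tanh mimics pixel-range outputs

rng = np.random.default_rng(0)
noise = rng.normal(size=16)       # latent code z
condition = rng.normal(size=8)    # condition c (e.g., a grayscale embedding)
w1 = rng.normal(size=(32, 24)) * 0.1
w2 = rng.normal(size=(64, 32)) * 0.1  # 64 outputs = an 8x8 'image'

img = conditional_generator(noise, condition, w1, w2)
```

Changing `condition` while keeping `noise` fixed changes the output, which is the essence of condition-guided generation and manipulation.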

Our related publication
[ECCV ’22] Sound-Guided Semantic Video Generation
[CVPR ’22] Sound-Guided Semantic Image Manipulation
[NeurIPS workshop ’21] Sound-Guided Semantic Image Manipulation
[NeurIPS workshop ’21] Audio-Guided Image Manipulation for Artistic Paintings

3D Computer Vision with Deep Learning

A Large-scale Annotated Mechanical Components Benchmark for Classification and Retrieval Tasks with Deep Neural Networks. [ECCV ’20]
3D reconstruction from multi-view images

Due to the recent growth of the VR and AR content industry, demand for 3D content is increasing. However, producing 3D content is costly, so supply has not kept up with demand. Although automating 3D content generation is one of the oldest topics in computer vision, classical methods accumulate errors as data passes through long reconstruction pipelines. Thanks to advances in neural networks, a 3D scene can now be represented implicitly and continuously by a neural network. In recent years, research has therefore been actively underway to improve the quantitative and qualitative performance of 3D reconstruction through such neural representations.
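
The "implicit and continuous" representation mentioned above can be sketched as a function from a 3D coordinate to a density and a color, in the spirit of NeRF-style models. This is a minimal, untrained toy with made-up weights; a real model would be optimized so that volume-rendering its outputs reproduces the multi-view input images.

```python
import numpy as np

def positional_encoding(p, n_freqs=4):
    """Map a 3D point to sin/cos features at multiple frequencies,
    which helps an MLP represent high-frequency scene detail."""
    feats = []
    for i in range(n_freqs):
        feats.append(np.sin((2.0 ** i) * np.pi * p))
        feats.append(np.cos((2.0 ** i) * np.pi * p))
    return np.concatenate(feats)  # shape: (3 * 2 * n_freqs,)

def scene_mlp(p, w1, w2):
    """Toy implicit scene function: 3D coordinate -> (density, rgb)."""
    h = np.maximum(0.0, w1 @ positional_encoding(p))  # ReLU hidden layer
    out = w2 @ h
    density = np.log1p(np.exp(out[0]))        # softplus keeps density >= 0
    rgb = 1.0 / (1.0 + np.exp(-out[1:4]))     # sigmoid keeps color in [0, 1]
    return density, rgb

rng = np.random.default_rng(0)
w1 = rng.normal(size=(32, 24)) * 0.1  # 24 = 3 coords * 2 * 4 frequencies
w2 = rng.normal(size=(4, 32)) * 0.1
density, rgb = scene_mlp(np.array([0.1, -0.2, 0.3]), w1, w2)
```

Because the scene is a continuous function rather than a mesh or voxel grid, it can be queried at any 3D point, avoiding the discretization errors of classical pipelines.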

Our related publication
[Computer-Aided Design ’20] Object Synthesis by Learning Part Geometry with Surface and Volumetric Representation
[ECCV ’20] A Large-scale Annotated Mechanical Components Benchmark for Classification and Retrieval Tasks with Deep Neural Networks

Multi-modal Representation Learning

First-Person View Hand Segmentation of Multi-Modal Hand Activity Video Dataset [BMVC ’20]
Sound-Guided Semantic Image Manipulation [CVPR ’22]

Video contains not only visual information but also sound. This multi-modal information helps in understanding content and is more descriptive than visual information alone. Fusing such multi-modal information is an open question and a first step toward general artificial intelligence. Additionally, many kinds of visual sensors provide complementary information, and fusing this heterogeneous information reduces the uncertainty of the estimation model. In our lab, we research the integration of multiple sensor data with deep learning models.
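
One common way to fuse two modalities, sketched below with made-up weights, is to project each modality into a shared embedding space and normalize, so that similarity can be compared across modalities (the idea behind CLIP-style joint spaces used in sound-guided manipulation). This is an illustrative toy, not our published architecture.

```python
import numpy as np

def fuse(visual_feat, audio_feat, w_v, w_a):
    """Toy joint embedding: project each modality into a shared space
    and L2-normalize, so a dot product gives cosine similarity."""
    v = w_v @ visual_feat
    a = w_a @ audio_feat
    v = v / np.linalg.norm(v)
    a = a / np.linalg.norm(a)
    return v, a

rng = np.random.default_rng(0)
visual = rng.normal(size=512)  # e.g., an image-encoder output (hypothetical size)
audio = rng.normal(size=128)   # e.g., a sound-encoder output (hypothetical size)
w_v = rng.normal(size=(64, 512)) * 0.05
w_a = rng.normal(size=(64, 128)) * 0.05

v, a = fuse(visual, audio, w_v, w_a)
similarity = float(v @ a)  # cosine similarity in the joint space
```

After training with a contrastive objective, matching image/sound pairs would score high in this space, which is what lets a sound clip guide an image edit.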

Our related publication
[ECCV ’22] Sound-Guided Semantic Video Generation
[CVPR ’22] Sound-Guided Semantic Image Manipulation
[NeurIPS workshop ’21] Sound-Guided Semantic Image Manipulation
[NeurIPS workshop ’21] Audio-Guided Image Manipulation for Artistic Paintings
[arXiv ’21] Egocentric View Hand Action Recognition by Leveraging Hand Surface and Hand Grasp Type
[BMVC ’20] First-Person View Hand Segmentation of Multi-Modal Hand Activity Video Dataset
[ICCV ’17] Learning hand articulations by hallucinating heat distribution

Large Scale Dataset Curation

A Large-scale Annotated Mechanical Components Benchmark for Classification and Retrieval Tasks with Deep Neural Networks [ECCV ’20]

Large-scale dataset curation and efficient annotation methods are crucial for deep learning, since optimizing deep neural networks requires large amounts of data. The performance of a model is positively correlated with the size and quality of its training dataset.

Our related publication
[ECCV ’20] A Large-scale Annotated Mechanical Components Benchmark for Classification and Retrieval Tasks with Deep Neural Networks
[BMVC ’20] First-Person View Hand Segmentation of Multi-Modal Hand Activity Video Dataset

Multi-view 3D Object Detection for Autonomous Driving

ORA3D: Overlap Region Aware Multi-view 3D Object Detection [arXiv ’22]

Object detection in 3D space plays a crucial role in various real-world applications, including autonomous driving systems. Existing 3D object detection methods based on point clouds from LiDAR sensors often yield reliable results, but equipping every vehicle with LiDAR sensors is expensive. Camera-based methods using monocular images are economical, but their performance is suboptimal due to insufficient depth cues. Recently, multi-view (surround-view) camera systems have become a balanced alternative, as they can resolve some of the weaknesses of monocular and stereo vision systems for the 3D object detection task, potentially replacing LiDAR sensors.
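
The geometric core of a surround-view setup can be sketched with a pinhole camera model: the same 3D object projects into the images of two adjacent cameras, and their shared field of view is the overlap region that multi-view detectors can exploit. The intrinsics and poses below are invented for illustration.

```python
import numpy as np

def project(point_3d, K, R, t):
    """Pinhole projection of a world point: x = K (R p + t),
    then divide by depth to get pixel coordinates."""
    p_cam = R @ point_3d + t
    uvw = K @ p_cam
    return uvw[:2] / uvw[2], p_cam[2]  # (u, v) pixels and camera-frame depth

K = np.array([[800.0,   0.0, 640.0],
              [  0.0, 800.0, 360.0],
              [  0.0,   0.0,   1.0]])  # shared intrinsics (assumed), 1280x720 image

# Two forward-facing cameras one meter apart (identity rotation, assumed).
R = np.eye(3)
t_left = np.array([0.5, 0.0, 0.0])
t_right = np.array([-0.5, 0.0, 0.0])

point = np.array([0.0, 0.0, 10.0])  # an object 10 m ahead
uv_l, depth_l = project(point, K, R, t_left)
uv_r, depth_r = project(point, K, R, t_right)
```

Because the point lands inside both images, a detector can match the two observations and triangulate depth, recovering the cue a single monocular camera lacks.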

Our related publication
[arXiv ’22] ORA3D: Overlap Region Aware Multi-view 3D Object Detection

Dynamic Vision Sensor

The dynamic vision sensor is a next-generation vision camera that mimics the human eye to capture motion. Unlike a conventional camera, it asynchronously reports per-pixel brightness changes with microsecond latency as event data. It therefore has low latency and low power consumption, and it is robust to motion blur, unlike conventional cameras. With these advantages, it has great potential in AR/VR applications and autonomous driving.
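
The event data described above is typically a stream of (x, y, timestamp, polarity) tuples rather than frames. One common preprocessing step, sketched below with synthetic events, is to accumulate a time window of events into a signed 2D frame so that standard CNNs can consume it; the specific events here are invented for illustration.

```python
import numpy as np

def events_to_frame(events, height, width):
    """Accumulate (x, y, timestamp_us, polarity) events into a signed
    2D frame: +1 per brightness-increase event, -1 per decrease."""
    frame = np.zeros((height, width), dtype=np.int32)
    for x, y, t_us, polarity in events:
        frame[y, x] += 1 if polarity > 0 else -1
    return frame

# Synthetic events: an edge moving right fires ON (+1) events at its
# leading edge and OFF (-1) events at its trailing edge.
events = [
    (10, 5, 1000, +1),
    (10, 6, 1002, +1),
    (9, 5, 1001, -1),
    (9, 6, 1003, -1),
]
frame = events_to_frame(events, height=32, width=32)
```

Static regions produce no events at all, which is why the sensor's data rate, latency, and power draw stay low compared with full-frame cameras.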

Machine Perception

ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation [arXiv ’16]

Machine perception has been widely researched in the computer vision community due to its importance in many real-life applications. The recent breakthrough brought by deep learning has significantly increased the use of visual machine perception. In particular, hand/body pose estimation, object detection, object recognition, and pixel-wise segmentation have opened the door to many real-life applications.

Our related publication
[arXiv ’16] ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation

Novel & Future View Synthesis

CT-GAN: Conditional Transformation Generative Adversarial Network for Image Attribute Modification [ECCV ’18]

In computer vision, view synthesis has been used to apply changes in lighting and viewpoint to single-view images of rigid and non-rigid objects. In real-life applications, synthetic views can help predict the locations of unobserved parts and improve both object grasping with manipulators and path planning for autonomous driving systems.

Our related publication
[Visual Computer ’19] Latent transformations neural network for object view synthesis
[ECCV ’18] CT-GAN: Conditional Transformation Generative Adversarial Network for Image Attribute Modification


Computer Vision Lab
Department of Artificial Intelligence, Korea University
603, Woojung Hall of Informatics, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul