Spotlight Session

Chair:

Christian Wallraven

Foundational AI

Prefer to Classify: Improving Text Classifiers via Auxiliary Preference Learning

Jaehyung Kim

(KAIST)
The development of largely human-annotated benchmarks has driven the success of deep neural networks in various NLP tasks. To enhance the effectiveness of existing benchmarks, collecting additional input-output pairs is often too costly and challenging, particularly considering their marginal impact on improving current model accuracy. Instead, additional or complementary annotations on the existing input texts in the benchmarks can be a more efficient way to spend the additional human cost. In this paper, we investigate task-specific preferences between pairs of input texts as a new alternative for such auxiliary data annotation. From 'pair-wise' comparisons with respect to the task, auxiliary preference learning enables the model to learn an additional informative training signal that cannot be captured with 'instance-wise' task labels. To this end, we propose a novel multi-task learning framework, called prefer-to-classify (P2C), which benefits from the cooperative effect of learning both the given classification task and the auxiliary preferences. Here, we provide three different ways to collect preference signals in practice: (a) implicitly extracting them from annotation records (free, but often unavailable), (b) collecting them explicitly from crowd workers (high cost), or (c) querying pre-trained large language models such as GPT-3 (low cost). Given existing classification NLP benchmarks, we demonstrate that the proposed auxiliary preference learning via P2C is effective in improving text classifiers.
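As a rough illustration of how a pairwise preference signal can be combined with instance-wise labels, the sketch below adds a Bradley-Terry style preference term to the usual cross-entropy loss. The Bradley-Terry form, the scalar preference scores, and the names `preference_loss`, `multitask_loss`, and `lam` are assumptions made for this example; the abstract does not state the exact objective used by P2C.

```python
import math

def cross_entropy(logits, label):
    """Instance-wise task loss: negative log-softmax of the true label."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

def preference_loss(score_a, score_b, a_preferred):
    """Pair-wise loss under an assumed Bradley-Terry model:
    P(a preferred over b) = sigmoid(score_a - score_b)."""
    diff = score_a - score_b if a_preferred else score_b - score_a
    # -log(sigmoid(diff)), written in a numerically stable form
    if diff > 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def multitask_loss(logits_a, label_a, logits_b, label_b,
                   score_a, score_b, a_preferred, lam=1.0):
    """Classification loss on both texts plus a weighted preference term.
    `lam` is a hypothetical weighting hyperparameter."""
    task = cross_entropy(logits_a, label_a) + cross_entropy(logits_b, label_b)
    return task + lam * preference_loss(score_a, score_b, a_preferred)
```

The preference term depends only on the difference between the two scores, so it carries information about how two inputs relate to each other that neither label provides on its own.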

Shepherding Slots to Objects: Towards Stable and Robust Object-Centric Learning

Jaehyun Kang

(Yonsei University)
Object-centric learning (OCL) aspires to a compositional understanding of scenes by using a collection of object representations. OCL for single-view images suffers from inconsistent learning of object representations. We introduce a novel OCL framework for single-view images which consists of two simple modules on top of Slot Attention: the Attention Refining Kernel and the Intermediate Point Predictor and Encoder. These modules respectively prevent slots from being distracted by background noise and indicate locations for slots to focus on, which facilitates learning. We also propose a weak semi-supervision approach for OCL.

Dual-path Adaptation from Image to Video Transformers

Jungin Park

(Yonsei University)
We efficiently transfer the strong representation power of vision foundation models to video understanding with only a few trainable parameters. We propose a novel DualPath adaptation separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block. For temporal dynamic modeling in particular, we incorporate consecutive frames into a grid-like frameset to precisely imitate vision transformers' capability that extrapolates relationships between tokens. We extensively investigate multiple baselines from a unified perspective in video understanding and compare them with DualPath.
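The grid-like frameset can be pictured as tiling consecutive frames into a single image, so that an image transformer sees temporal neighbors as spatial neighbors. The sketch below shows only that tiling step on plain nested lists. The function name `make_frameset`, the row-major temporal order, and the absence of any resizing are assumptions made for this example, not details given in the abstract.

```python
def make_frameset(frames, rows, cols):
    """Tile rows*cols equally sized frames into one grid image.

    Each frame is a list of pixel rows; frames are placed in
    row-major order, so time runs left to right, then top to bottom.
    """
    if len(frames) != rows * cols:
        raise ValueError("expected exactly rows * cols frames")
    grid = []
    for r in range(rows):
        row_frames = frames[r * cols:(r + 1) * cols]
        height = len(row_frames[0])
        for y in range(height):
            line = []
            for frame in row_frames:
                line.extend(frame[y])  # concatenate the y-th pixel row of each frame
            grid.append(line)
    return grid

# Four 2x2 frames, each filled with its time index, tiled into a 4x4 grid.
frames = [[[t, t], [t, t]] for t in range(4)]
frameset = make_frameset(frames, rows=2, cols=2)
```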

Compress and Accelerate Binary Neural Networks with Computation Rearrangement

Quang Hieu Vo

(Kyung Hee University)
Binary neural networks (BNNs) have been widely adopted to reduce the computational cost and memory storage on edge-computing devices by using one-bit representation for activations and weights. However, as neural networks become wider and deeper to improve accuracy and meet practical requirements, the computational burden remains a significant challenge even for binary versions. To address this issue, this paper proposes a novel method called Minimum Spanning Tree (MST) compression that learns to compress and accelerate BNNs. The proposed architecture leverages an observation from previous works that an output channel in a binary convolution can be computed using another output channel and XNOR operations with weights that differ from the weights of the reused channel. We first construct a fully connected graph with vertices corresponding to output channels, where the distance between two vertices is the number of different values between the weight sets used for these outputs. Then, the MST of the graph with the minimum depth is proposed to reorder output calculations, aiming to reduce computational cost and latency. Moreover, we propose a new learning algorithm to reduce the total MST distance during training. Experimental results on benchmark models demonstrate that our method achieves significant compression ratios with negligible accuracy drops, making it a promising approach for resource-constrained edge-computing devices.
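The graph construction and channel reuse described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: weights and inputs are encoded as +1/-1 (the arithmetic counterpart of XNOR-popcount), the tree is built with Prim's algorithm rooted at channel 0, and the minimum-depth criterion and the training algorithm from the abstract are not modeled. The names `hamming`, `prim_mst`, and `reuse` are hypothetical.

```python
def hamming(a, b):
    """Number of positions where two binary weight vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def prim_mst(weights):
    """Return MST edges (parent, child, distance) over output channels,
    rooted at channel 0, with Hamming distance as the edge weight."""
    n = len(weights)
    # best[v] = (distance to the current tree, closest tree vertex)
    best = {v: (hamming(weights[0], weights[v]), 0) for v in range(1, n)}
    edges = []
    while best:
        child = min(best, key=lambda v: best[v])
        dist, parent = best.pop(child)
        edges.append((parent, child, dist))
        for v in best:
            d = hamming(weights[child], weights[v])
            if d < best[v][0]:
                best[v] = (d, child)
    return edges

def dot(w, x):
    """Direct binary dot product in +1/-1 encoding."""
    return sum(wi * xi for wi, xi in zip(w, x))

def reuse(parent_out, w_parent, w_child, x):
    """Derive a child's output from its parent's output by correcting
    only the positions where the two weight vectors differ."""
    correction = sum(wc * xi for wp, wc, xi in zip(w_parent, w_child, x) if wp != wc)
    return parent_out + 2 * correction

weights = [[1, 1, 1, 1], [1, 1, 1, -1], [-1, -1, -1, -1], [1, 1, -1, -1]]
x = [1, -1, 1, 1]
edges = prim_mst(weights)

# Compute the root directly, then every other channel from its tree parent.
outputs = {0: dot(weights[0], x)}
for parent, child, _ in edges:
    outputs[child] = reuse(outputs[parent], weights[parent], weights[child], x)
```

In this toy case the tree has total distance 4, so after the root channel only four weight positions are touched, compared with twelve for computing the other three channels directly.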

Interaction AI with Reality

Stable and Consistent Prediction of 3D Characteristic Orientation via Invariant Residual Learning

Chunghyun Park

(POSTECH)
Learning to predict reliable characteristic orientations of 3D point clouds is an important yet challenging problem, as different point clouds of the same class may have largely varying appearances. In this work, we introduce a novel method to decouple the shape geometry and semantics of the input point cloud to achieve both stability and consistency. The proposed method integrates shape-geometry-based SO(3)-equivariant learning and shape-semantics-based SO(3)-invariant residual learning, where a final characteristic orientation is obtained by calibrating an SO(3)-equivariant orientation hypothesis using an SO(3)-invariant residual rotation. In experiments, the proposed method not only demonstrates superior stability and consistency but also exhibits state-of-the-art performance when applied to point cloud part segmentation, given randomly rotated inputs.
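One way to see why calibrating an equivariant hypothesis with an invariant residual keeps the final orientation equivariant is the short derivation below. The notation and the specific composition (right-multiplying the hypothesis by the residual) are assumptions made for illustration; the abstract does not specify how the calibration is composed.

```latex
% Assumed notation (not from the abstract): P is the 3 x N point cloud,
% R is any rotation in SO(3), H is the equivariant orientation hypothesis,
% \Delta is the invariant residual rotation, O is the final orientation.
\begin{align}
H(RP) &= R\,H(P) && \text{(SO(3)-equivariant hypothesis)} \\
\Delta(RP) &= \Delta(P) && \text{(SO(3)-invariant residual)} \\
O(P) &:= H(P)\,\Delta(P) && \text{(assumed form of the calibration)} \\
O(RP) &= H(RP)\,\Delta(RP) = R\,H(P)\,\Delta(P) = R\,O(P) \\
O(RP)^{\top}(RP) &= O(P)^{\top} R^{\top} R\,P = O(P)^{\top} P
\end{align}
```

The last line shows that, under these assumptions, aligning the input by the predicted orientation gives the same canonical pose for every rotated copy of the cloud, which is the property a downstream task such as part segmentation on randomly rotated inputs relies on.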

GVCCI: Lifelong Learning of Visual Grounding for Language-Guided Robotic Manipulation

Junghyun Kim

(Seoul National University)
Language-Guided Robotic Manipulation (LGRM) is a challenging task as it requires a robot to understand human instructions to manipulate everyday objects. Recent approaches in LGRM rely on pre-trained Visual Grounding (VG) models to detect objects without adapting to manipulation environments. This results in a performance drop due to a substantial domain gap between the pre-training and real-world data. In this paper, we propose Grounding Vision to Ceaselessly Created Instructions (GVCCI), a lifelong learning framework for LGRM, which continuously learns VG without human supervision.

Synthetic Tumor Manipulation: With Radiomics Features

Inye Na

(Sungkyunkwan University)
We introduce RadiomicsFill, a synthetic tumor generator conditioned on radiomics features, enabling detailed control and individual manipulation of tumor subregions. This conditioning leverages conventional high-dimensional features of the tumor and thus is biologically well-grounded. Our model combines GANs, radiomics-feature conditioning, and multi-task learning. RadiomicsFill's ability to erase existing tumors and generate an unlimited number of realistic synthetic tumors offers significant prospects for advancing medical imaging research and potential clinical applications.

AI for Scientific and Social Challenges

Brain-to-Speech: Neural Speech Synthesis based on Deep Generative Network

Seo-Hyun Lee

(Korea University)
Brain-To-Speech (BTS) provides non-verbal communication facilitated by current domain adaptation and speech synthesis technologies. Neural patterns are transformed into spoken language by directly associating the neural features with human language. The domain adaptation framework establishes a natural correspondence between the neural features and the speech ground truth. In addition, an automatic speech recognition decoder helps decompose the generated speech into phonemes, demonstrating the potential of brain signal-mediated communication.

The Protein Language Model Can Capture MSA Information

Jae-Won Lee

(Hanyang University)
Recently, transformer-based protein structure prediction methods such as AlphaFold and RoseTTAFold have demonstrated superior performance compared to traditional approaches in protein structure prediction. However, how transformers learn the information of proteins has not been extensively studied. This raises a fundamental question: how does the transformer learn patterns when training on protein sequences? In this paper, using protein domain data with structural similarities, we show that there are correlations between amino acids and that, if this information can be learned, a masked language model can achieve better predictions. On the model side, we propose a simplified BERT-based Protein Language Model to better understand the features of the transformer. Additionally, we empirically demonstrate that the key and query matrices in the transformer's architecture can capture essential axes in the embedding space, which explain the correlations in sequences.

Semi-Supervised Galaxy Morphological Classification with Deformable Attention Transformer

SeokUn Kang

(UNIST)
Galaxy morphological classification is an important but challenging task in astronomy. Most prior work studies coarse-level morphological classification and uses raster low-dynamic-range images, but we are interested in the high-dynamic-range images commonly produced in imaging surveys. To tackle this problem, we first build a high-dynamic-range dataset for fine-level multi-class classification that is challenging even for human eyes. We then propose to use a Deformable Attention Transformer for this difficult task with five-band images and masks. In the experimental results, our model achieves about 71.436% and 95.509% top-1 and top-2 test set accuracy, respectively, in supervised learning. In addition, since acquiring labels is expensive for imaging surveys but unlabeled data is abundant, we propose to use semi-supervised learning approaches and improve the top-1 accuracy to 71.906%. We also visualize attention maps and analyze the results with respect to different classes and mask sizes to understand the data and the behavior of the model. We confirm that our model shows confusion patterns in the confusion matrix similar to those of humans, and that its attention visualizations capture morphological characteristics.

Meta Energy Intelligence

Seungmin Oh

(Chonnam National University)
The energy industry faces various problems. Problems such as data shortages, data awareness, and data hallucinations are particularly critical to the industry, and deep learning technologies can address them effectively. We focus on three topics: energy management, understanding energy phenomena, and expanding energy characteristics. I will introduce the research each of us is conducting on these topics, the deep learning techniques that can address it, and what can be gained from them.