Multimodal Encoder/Decoder Transformer

Geo-Refined Point Transformer: Coordinate-Aware Excitation and Positional Upsampling for 3D Scene Segmentation

The proposed Coordinate-Aware Feature Excitation (CAFE) module and Position-Aware Upsampling (Pos-Up) module both adhere to ...

New Apple model combines vision understanding and image generation with impressive results

Manzano combines visual understanding and text-to-image generation, while significantly reducing performance or quality trade-offs.

GitHub

One Single Transformer to Unify Multimodal Understanding and Generation

Support image generation in a resolution of 512x512. Improve the multimodal understanding capabilities of purely discrete Show-o. Improve the performance on the GenEval benchmark. Explore the impact ...

IEEE

A Novel Hybrid Architecture With Fast Lightweight Encoder and Transformer Under Attention Fusion for the Enhancement of Sand Dust and Haze Image Restoration

Abstract: Outdoor weather conditions such as haze, fog, sand dust, and low light significantly degrade image quality, causing color distortions, low contrast, and poor visibility. In spite of the ...

Scientific Research Publishing

Kim, H.J., Lell, N. and Scherp, A. (2024) Text Role Classification in Scientific Charts Using Multimodal Transformers. In: Rapp, A., Di Caro, L., Meziane, F. and Sugumaran, V ...

ABSTRACT: This study proposes a multimodal AI model for classifying Vietnamese digital learning materials by integrating three key information sources: text content, image and graphic features, and ...

IEEE

U-MusT: A Unified Framework for Cross-modal Translation of Score Images, Symbolic Music, and Performance Audio

Abstract: Music exists in various modalities, such as score images, symbolic scores, MIDI, and audio. Translations between such modalities are established as core tasks of music information retrieval, ...

blockchain

SEMI: Sample-Efficient Modality Integration Boosts Multimodal LLMs with Minimal Labeled Data

According to DeepLearning.AI, researchers have introduced Sample-Efficient Modality Integration (SEMI), a framework that enables any pretrained encoder—covering images, audio, video, sensors, and ...

marktechpost

Meta AI Open-Sourced Perception Encoder Audiovisual (PE-AV): The Audiovisual Encoder Powering SAM Audio And Large Scale Multimodal Retrieval

Perception Encoder, PE, is the core vision stack in Meta’s Perception Models project. It is a family of encoders for images, video, and audio that reaches state of the art on many vision and audio ...
