Real-time visual exploration of very large image collections - Kai Uwe Barthel

When and where: 9:00am January 9, 2023 at Media City Bergen


In recent years, very powerful visual and visual-textual feature vectors have been developed that are suitable for searching images based on query images or text descriptions. A drawback is the high number of dimensions of these feature vectors, which leads to long search times, especially for very large image collections. Image archives keep growing, with the number of images often reaching into the tens of millions. As a result, it is impossible to get an overview of the entire content, and there is no way to browse the image collection in an explorative way. In this talk, I will show how to significantly improve the quality and efficiency of image retrieval systems by combining three key ideas: the generation of highly efficient general-purpose feature vectors, the organization of image collections as similarity graphs, and the improved generation of visually sorted grid layouts. This combination enables visual exploration of even the largest image collections in real time using a standard web browser.

Kai Uwe Barthel is a professor at the Institute for Media and Computing at HTW Berlin, where he heads the Visual Computing and Creative Computing research groups. His work focuses on developing technologies and applications that enable media consumers to find media content more easily. Prof. Barthel's research and teaching include the theory, design, and development of digital media systems for analyzing and understanding digital images and videos. Current research topics include automatic image tagging, content-based image retrieval, metric learning, image understanding and sorting, and the development of visual image navigation systems.

As a pioneer in the use of digital images and videos, he has done foundational work in image and video annotation, retrieval, and automatic user-centered presentation. As part of his doctoral thesis at the Laboratory for Communication Systems at the Technical University of Berlin, Prof. Barthel developed fractal image compression techniques that, at the time, outperformed the JPEG standard by a large margin. After leading a research project on 3D video coding at the Technical University of Berlin, he became head of the research and development department at LuraTech Inc. in Berlin in 1997. He led research and development teams in image compression and mixed raster technology, for which he received two patents in 1997 and 1999. He was also a member of the JPEG2000 standardization committee. Kai Barthel became professor of visual computing at HTW Berlin, where he teaches courses such as image analysis, machine learning, computer vision, and visual information retrieval. In 2009 he founded pixolution, a visual image search company. pixolution's visual search technology is used by many image agencies.

Prof. Barthel has numerous publications, lectures and workshops to his credit. He received several awards for user-centered approaches to fast image and video search. Demo systems and awards can be found at

Multimodal Augmented Homeostasis - Ramesh Jain

When and where: 9:00am January 10, 2023 at Media City Bergen
Chair: Alan Smeaton


Homeostasis is nature’s engineering behind the most complex autonomic system that exists: the human body. Homeostasis is a self-regulating process by which biological systems tend to maintain stability while adjusting to conditions that are optimal for survival. Disruption of homeostasis results in malfunctioning of the natural autonomic system, causing chronic diseases. Chronic diseases have been the leading cause of death and human suffering in the last 50 years. They have also resulted in the highest financial burden for individuals and countries. This can be corrected using external augmentation of the homeostasis loop. Recent progress in the artificial pancreas for Type 1 Diabetes is a compelling example of such augmentation. In this presentation we discuss emerging multimodal approaches for such augmentation in the context of chronic diseases. We show that multimodal sensing and fundamental technology developed for multimedia computing may offer powerful augmentation of natural homeostasis to assist in the management of chronic diseases.

Ramesh Jain

Ramesh Jain is an entrepreneur, researcher, and educator.

He is an Emeritus Donald Bren Professor in Information & Computer Sciences at the University of California, Irvine. His research interests have covered Control Systems, Computer Vision, Artificial Intelligence, and Multimedia Computing. His current research passion is addressing health issues using cybernetic principles, building on the progress in sensors, mobile, processing, artificial intelligence, computer vision, and storage technologies. He is the founding director of the Institute for Future Health at UCI. He is a Fellow of AAAS, ACM, IEEE, AAAI, IAPR, and SPIE.

Ramesh co-founded several companies, managed them in initial stages, and then turned them over to professional management. He enjoys new challenges and likes to use technology to solve them. He is participating in addressing the biggest technical challenge for us all: how to live long in good health.

Multi-perspective modelling of complex concepts - the case of olfactory history & heritage - Marieke van Erp

When and where: 9:00am January 12, 2023 at Media City Bergen


The success of AI technologies on standardised benchmark datasets invites us to move towards more difficult and more complex concepts and tasks. In this talk, I will argue that the humanities present the perfect playground for investigating the recognition and modelling of complex concepts, thanks to massive digitisation efforts that have made available large and varied datasets, in multiple modalities, in this domain. Specifically, I will highlight the complexities in modelling a concept such as smell, dealing with its representations in various media, and how the temporal dimension of historical and linguistic research forces us to deal with issues such as changing social norms and our colonial history. This will show the exciting possibilities of what we term Cultural AI.

Marieke van Erp

Dr. Marieke van Erp is a Language Technology and Semantic Web expert engaged in interdisciplinary research. She holds a PhD in computational linguistics from Tilburg University and has worked on many projects involving cultural heritage partners. Since 2017, she has led the Digital Humanities Research Lab at the Royal Netherlands Academy of Arts and Sciences Humanities Cluster. She is one of the founders and scientific directors of the Cultural AI Lab, a collaboration between 8 research and cultural heritage institutions in the Netherlands aimed at the study, design and development of socio-technological AI systems that are aware of the subtle and subjective complexity of human culture.


Date: Monday from 8:30 to 9:00 and 13:00 to 13:30. Thursday to Tuesday from 8:30-9:00

Social Events


Date: Monday from 9:00 to 13:00
Place: Media City Bergen


Date: Monday from 15:00 to 17:00
Place: MCB Media Lab


Date: Tuesday at 19:00
Place: Terminus Hotel


Date: Thursday at 14:30
Place: Media City Bergen

Culture tour

Date: Thursday from 15:00 to 17:00
Place: starting from Media City Bergen

Main Session


Oral 1 - Multimedia Analytics Application

Place: Redaksjonsrom, MCB - Time: 11:00 - 12:00, January 9, 2023


Chair: Alan Smeaton

Single Cross-domain Semantic Guidance Network for Multimodal Unsupervised Image Translation

Authors: Jiaying Lan, Lianglun Cheng, Guoheng Huang, Chi-Man Pun, Xiaochen Yuan, Shangyu Lai, Hongrui Liu, Wing-Kuen Ling


Multimodal image-to-image translation has received great attention due to its flexibility and practicality. Existing methods lack a general and effective style representation and cannot capture different levels of stylistic semantic information from cross-domain images. Besides, they ignore parallelism in cross-domain image generation, and each of their generators can handle only specific domains. To address these issues, we propose a novel Single Cross-domain Semantic Guidance Network (SCSG-Net) for coarse-to-fine semantically controllable multimodal image translation. Images from different domains are mapped to a unified visual semantic latent space by a dual sparse feature pyramid encoder, and then the generative module generates the result images by extracting semantic style representation from the input images in a self-supervised manner guided by adaptive discrimination. In particular, SCSG-Net meets the needs of users across different styles and diverse scenarios. Extensive experiments on different benchmark datasets show that our method outperforms other state-of-the-art methods both quantitatively and qualitatively.

Towards Captioning an Image Collection from a Combined Scene Graph Representation Approach

Authors: Itthisak Phueaksri, Marc A. Kastner, Yasutomo Kawanishi, Takahiro Komamizu, Ichiro IDE


Most content summarization models from the field of natural language processing summarize the textual contents of a collection of documents or paragraphs. In contrast, summarizing the visual contents of a collection of images has not been researched to this extent. In this paper, we present a framework for summarizing the visual contents of an image collection. The key idea is to collect the scene graphs for all images in the image collection, create a combined representation, and then generate a visually summarizing caption using a scene-graph captioning model. Note that this aims to summarize common contents across all images in a single caption rather than describing each image individually. After aggregating all the scene graphs of an image collection into a single scene graph, we normalize it by using an additional concept generalization component. This component selects the common concept in each sub-graph with ConceptNet based on word embedding techniques. Lastly, we refine the captioning results by replacing a specific noun phrase with a common concept from the concept generalization component to improve the captioning results. We constructed a dataset for this task based on the MS-COCO dataset using techniques from image classification and image-caption retrieval. Evaluation of the proposed method on this dataset showed promising performance.

Health-oriented Multimodal Food Question Answering

Authors: Jianghai Wang, Menghao Hu, Yaguang Song, Xiaoshan Yang


Health-oriented food analysis has become a research hotspot in recent years because it can help people keep away from unhealthy diets. Significant progress has been made in recipe retrieval, food recommendation, and nutrition and calorie estimation. However, existing works still cannot balance individual preference and health well. Multimodal food question answering (Q&A) has great potential in practical applications, but it is still not well studied. In this paper, we build a health-oriented multimodal food Q&A dataset (MFQA) with 9K question and answer pairs based on a multimodal food knowledge graph collected from a food-sharing website. In addition, we propose a knowledge-based multimodal food Q&A framework, which consists of three important parts: an encoder module, a retrieval module, and an answer module. The encoder module encodes text, vision, and the knowledge graph. The retrieval module filters out the most relevant knowledge from the knowledge graph. The answer module is responsible for analyzing the multimodal information in the query and relevant knowledge and predicting the correct answer. Extensive experimental results on the MFQA dataset demonstrate the effectiveness of our method. The code and dataset are available at

MM-Locate-News: Multimodal Focus Location Estimation in News

Authors: Golsa Tahmasebzadeh, Eric Müller-Budack, Sherzod Hakimov, Ralph Ewerth


The consumption of news has changed significantly as the Web has become the most influential medium for information. To analyze and contextualize the large amount of news published every day, the geographic focus of an article is an important aspect in order to enable content-based news retrieval. There are methods and datasets for geolocation estimation from text or photos, but they are typically considered as separate tasks. However, the photo might lack geographical cues and text can include multiple locations, making it challenging to recognize the focus location using a single modality. In this paper, a novel dataset called Multimodal Focus Location of News (MM-Locate-News) is introduced. We evaluate state-of-the-art methods on the new benchmark dataset and suggest novel models to predict the focus location of news using both textual and image content. The experimental results show that the multimodal model outperforms unimodal models.


Oral 2 - Real-time & Interactive Application

Place: Redaksjonsrom, MCB - Time: 11:00 - 12:00, Tuesday, January 10, 2023


Chair: Phoebe Chen

LiteHandNet: A Lightweight Hand Pose Estimation Network via Structural Feature Enhancement

Authors: Zhi-Yong Huang, Song-Lu Chen, Qi Liu, Chong-jian Zhang, Feng Chen, Xu-Cheng Yin


This paper presents a real-time lightweight network, LiteHandNet, for 2D hand pose estimation from monocular color images. In recent years, keypoint heatmap representation has been dominant in pose estimation due to its high accuracy. Nevertheless, keypoint heatmaps require high-resolution representation to extract accurate spatial features, which commonly means high computational costs, e.g., high delay and a tremendous number of model parameters. Therefore, the existing heatmap-based methods are not suitable for scenarios with limited computational resources and strict real-time requirements. We find that high-resolution representation can obtain clearer structural features of a hand, e.g., contours and key regions, which can provide high-quality spatial features to the keypoint heatmap, thus improving the robustness and accuracy of a model. To fully extract the structural features without introducing unnecessary computational costs, we propose a lightweight module, which consists of two parts: a multi-scale feature block (MSFB) and a spatial channel attention block (SCAB). MSFB can extract structural features from hands using multi-scale information, while SCAB can further screen out high-quality structural features and suppress low-quality features. Comprehensive experimental results verify that our model is state-of-the-art in terms of the tradeoff between accuracy, speed, and parameters.

DilatedSegNet: A Deep Dilated Segmentation Network for Polyp Segmentation

Authors: Nikhil Kumar Tomar, Debesh Jha, Ulas Bagci


Colorectal cancer (CRC) is the second leading cause of cancer-related death worldwide. Excision of polyps during colonoscopy helps reduce mortality and morbidity for CRC. Powered by deep learning, computer-aided diagnosis (CAD) systems can detect regions in the colon overlooked by physicians during colonoscopy. Insufficient accuracy and the lack of real-time speed are the essential obstacles to be overcome for successful clinical integration of such systems. While the literature is focused on improving accuracy, the speed parameter is often ignored. Toward this critical need, we intend to develop a novel real-time deep learning-based architecture, DilatedSegNet, to perform polyp segmentation on the fly. DilatedSegNet is an encoder-decoder network that uses pre-trained ResNet50 as the encoder from which we extract four levels of feature maps. Each of these feature maps is passed through a dilated convolution pooling (DCP) block. The outputs from the DCP blocks are concatenated and passed through a series of four decoder blocks that predict the segmentation mask. The proposed method achieves a real-time operation speed of 33.68 frames per second with an average dice similarity coefficient (DSC) of 0.90 and mIoU of 0.83. The results on the publicly available Kvasir-SEG and BKAI-IGH datasets suggest that DilatedSegNet can give real-time feedback while retaining a high DSC, indicating high potential for using such models in real clinical settings in the near future. The GitHub link of the source code will be made publicly available upon acceptance of the study.
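The two overlap metrics reported above, DSC and IoU, can be illustrated with a minimal sketch. It is not taken from the DilatedSegNet code: the function name `dice_and_iou` and the nested-list mask format are assumptions made for illustration only.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as nested lists of 0/1.

    Hypothetical helper for illustration; not part of the paper's code.
    """
    p = [v for row in pred for v in row]      # flatten predicted mask
    t = [v for row in target for v in row]    # flatten ground-truth mask
    if len(p) != len(t):
        raise ValueError("masks must have the same number of pixels")

    inter = sum(1 for a, b in zip(p, t) if a and b)  # pixels set in both
    p_sum = sum(1 for a in p if a)                   # predicted foreground
    t_sum = sum(1 for b in t if b)                   # true foreground
    union = p_sum + t_sum - inter

    if union == 0:
        # Both masks are empty: treat as a perfect match by convention.
        return 1.0, 1.0

    dice = 2.0 * inter / (p_sum + t_sum)
    iou = inter / union
    return dice, iou


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)
print(f"Dice={dice:.3f}, IoU={iou:.3f}")
```

Averaging the per-image IoU values over a test set gives one common form of mIoU; whether the paper averages per image or per class is not stated in the abstract.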

Music Instrument Classification Reprogrammed

Authors: Hsin-Hung Chen, Alexander Lerch


The performance of approaches to Music Instrument Classification, a popular task in Music Information Retrieval, is often impacted and limited by the lack of availability of annotated data for training. We propose to address this issue with “reprogramming,” a technique that utilizes pre-trained deep and complex neural networks originally targeting a different task by modifying and mapping both the input and output of the pre-trained model. We demonstrate that reprogramming can effectively leverage the power of the representation learned for a different task and that the resulting reprogrammed system can perform on par with or even outperform state-of-the-art systems at a fraction of the training parameters. Our results, therefore, indicate that reprogramming is a promising technique potentially applicable to other tasks impeded by data scarcity.

Cascading CNNs with S-DQN: A Parameter-parsimonious Strategy for 3D Hand Pose Estimation

Authors: Mingqi Chen, Shaodong Li, Feng Shuang, Kai Luo


This paper proposes a cascaded parameter-parsimonious 3D hand pose estimation strategy to improve real-time performance without sacrificing accuracy. The estimation process is first decomposed into feature extraction and feature exploitation. The feature extraction is seen as a dimension reduction process, where convolutional neural networks (CNNs) are used to ensure accuracy. Feature exploitation is considered as a policy optimization process, and a shallow reinforcement learning (RL)-based feature exploitation module is proposed to improve running speed. Ablation studies and experiments are carried out on the NYU and ICVL datasets to evaluate the performance of the strategy, and multiple baselines are used to evaluate generalization. The results show that the proposed strategy improves testing time by 8.1% and 14.6%. Note that the overall accuracy also reaches the state of the art, which further shows the effectiveness of the proposed strategy.

Oral 3 - Detection, Recognition and Identification

Place: Redaksjonsrom, MCB - Time: 13:00 - 14:00, Tuesday, January 10, 2023


Chair: Klaus Schoeffmann

MMM-GCN: Multi-Level Multi-Modal Graph Convolution Network for Video-Based Person Identification

Authors: Ziyan Liao, Dening Di, Jingsong Hao, Jiang Zhang, Shulei Zhu, Jun Yin


Video-based multi-modal person identification has attracted rising research interest recently to address the inadequacies of single-modal identification in unconstrained scenes. Most existing methods model the video-level and multi-modal-level information of the target video separately, and thus suffer from the separation of different levels and the insufficient information contained in a specific video. In this paper, we introduce extra neighbor-level information for the first time to enhance the informativeness of the target video. Then a Multi-Level (neighbor-level, multi-modal-level, and video-level) and Multi-Modal GCN model is proposed to capture correlation among different levels and achieve adaptive fusion in a unified model. Experiments on the iQIYI-VID-2019 dataset show that MMM-GCN significantly outperforms current state-of-the-art methods, proving its superiority and effectiveness. Besides, we point out that feature fusion is heavily polluted by noisy nodes, which leads to a suboptimal result. Further improvement could be explored on this basis to approach the performance upper bound of our paradigm.

Feature Enhancement and Reconstruction for Small Object Detection

Authors: Chong-Jian Zhang, Song-Lu Chen, Qi Liu, Zhi-Yong Huang, Feng Chen, Xu-Cheng Yin


Due to the small size and noise interference, small object detection is still a challenging task. Previous work cannot effectively reduce noise interference and extract representative features of the small object. Although the upsampling network can alleviate the loss of features by enlarging feature maps, it cannot enhance semantics and will introduce more noise. To solve the above problems, we propose CAU (Content-Aware Upsampling) to enhance feature representation and semantics of the small object. Moreover, we propose CSA (Content-Shuffle Attention) to reconstruct robust features and reduce noise interference using feature shuffling and attention. Extensive experiments verify that our proposed method can improve small object detection by 2.2% on the traffic sign dataset TT-100K and 0.8% on the object detection dataset MS COCO compared with the baseline model.

Toward More Accurate Heterogeneous Iris Recognition with Transformers and Capsules

Authors: Zhiyong Zhou, Yuanning Liu, Xiaodong Zhu, Shuai Liu, Shaoqiang Zhang, Zhen Liu


As diverse iris capture devices have been deployed, the performance of iris recognition under heterogeneous conditions, e.g., cross-spectral matching and cross-sensor matching, has drastically degraded. Nevertheless, the performance of existing manual descriptor-based methods and CNN-based methods is limited due to the enormous domain gap under heterogeneous acquisition. To tackle this problem, we propose a model with transformers and capsules to extract and match the domain-invariant feature effectively and efficiently. First, we represent the features from shallow convolution as vision tokens by spatial attention. Then we model the high-level semantic information and fine-grained discriminative features in the token-based space by a transformer encoder. Next, a Siamese transformer decoder exploits the relationship between pixel-space and token-based space to enhance the domain-invariant and discriminative properties of convolution features. Finally, a 3D capsule network not only efficiently considers part-whole relationships but also increases intra-class compactness and inter-class separability. Therefore, it improves the robustness of heterogeneous recognition. Experimental validation results on two common datasets show that our method significantly outperforms the state-of-the-art methods.

MCANet: Multiscale Cross-Modality Attention Network for multispectral pedestrian detection

Authors: Xiaotian Wang, Letian Zhao, Wei Wu, Xi Jin


Multispectral pedestrian detection is an important and challenging task that can provide complementary information from visible images and thermal images for high-precision and robust object detection results. To fully exploit the different modalities, we propose a Multiscale Cross-Modality Attention (MCA) module to efficiently extract and fuse features. In this module, the transformer architecture is used to extract features of the two modalities. Based on these features, we design a novel spatial attention mechanism that can adaptively enhance object details and suppress background. Finally, the features of each branch are fused using the channel attention mechanism and sent to the detector. To verify the effect of the MCA module, we propose the MCANet. The MCA modules are embedded at different depths of the two-stream network and interconnected to share multiscale information. Extensive experimental results demonstrate that MCANet achieves state-of-the-art detection accuracy on the challenging KAIST multispectral pedestrian dataset.

Oral 4 - Image Quality Assessment and Enhancement

Place: Redaksjonsrom, MCB - Time: 14:30 - 15:30, Tuesday, January 10, 2023


Chair: Björn Thór Jónsson

STN: Stochastic Triplet Neighboring Approach to Self-Supervised Denoising from Limited Noisy Images

Authors: Bowen Wan, Daming Shi, Yukun Liu


With the rapid development of artificial intelligence in recent years, deep learning has shown great potential in the field of image denoising. However, most of the work is based on supervised learning, and the lack of clean images in real applications limits neural network training. For this reason, self-supervised learning in the absence of clean images is getting more and more attention. Nevertheless, since both the source and target in self-supervised training are the limited noisy image itself, such denoising methods suffer from overfitting. To this end, a stochastic triplet neighboring approach, hereafter referred to as STN, is proposed in this paper. Given an input noisy image, the source fed to the STN is the downsized sub-image obtained via 4-neighbor sampling, whereas the target in STN training is a stochastic combination of the two neighboring sub-images. Such a mechanism effectively augments the training data, which relieves the overfitting problem. Extensive experimental results show that our proposed STN approach outperforms the state-of-the-art image denoising methods.

Fusion-Based Low-Light Image Enhancement

Authors: Haodian Wang, Yang Wang, Yang Cao, Zheng-Jun Zha


Recently, deep learning-based methods have made remarkable progress in low-light image enhancement. In addition to poor contrast, the images captured under insufficient light suffer from severe noise and saturation distortion. Most existing unsupervised learning-based methods adopt a two-stage processing method to enhance contrast and denoise sequentially. However, the noise will be amplified in the contrast enhancement process, thus increasing the difficulty of denoising. Besides, the saturation distortion caused by insufficient illumination is not considered well in existing unsupervised low-light enhancement methods. To address the above problems, we propose a novel parallel framework, which includes a saturation adaptive adjustment branch, a brightness adjustment branch, a noise suppression branch, and a fusion module for adjusting saturation, correcting brightness, denoising, and multi-branch fusion, respectively. Specifically, the saturation is corrected via global adjustment, the contrast is enhanced through curve mapping estimation, and BM3D is used for preliminary denoising. Further, the enhanced branches are fed to the fusion module for a trainable guided filter, which is optimized in an unsupervised training manner. Experiments on the LOL, MIT-Adobe 5k, and SICE datasets demonstrate that our method achieves better quantitative and qualitative results than the state-of-the-art algorithms.

Towards Interactive Facial Image Inpainting by Text or Exemplar Image

Authors: Ailin Li, Lei Zhao, Zhiwen Zuo, Zhizhong Wang, Wei Xing, Dongming Lu


Facial image inpainting aims to fill visually realistic and semantically new pixels for masked or missing pixels in a face image. Although current methods have made progress in achieving high visual quality, the controllable diversity of face inpainting remains an open issue. This paper proposes a new facial image inpainting interaction mode, which enables filling semantic contents based on the texts or exemplar images provided by users. We use the powerful image-text representation abilities of the recently introduced Contrastive Language-Image Pre-training (CLIP) models to achieve this interactive face inpainting. We present two approaches. Specifically, we first explore a simple and effective optimization-based text-guided facial inpainting method in which a CLIP model is used as a loss network to modify the latent code iteratively in response to the text prompt. Next, we describe a multi-modal inpainting mapper network to map the input conditions (e.g., text or image) into corresponding latent code changes, supporting the guidance of different text prompts and exemplars within one model. We also introduce an exemplar-semantic similarity loss, which maps the inpainted facial image and the exemplar image into CLIP's space to measure their similarity. This loss enables the generated image to include high-level semantic attributes from the exemplar image. Through extensive experiments, we demonstrate the effectiveness of our method in interactive facial inpainting based on the texts or exemplars.

Dual Feature Aggregation Network for No Reference Image Quality Assessment

Authors: Yihua Chen, Zhiyuan Chen, Mengzhu Yu, Zhenjun Tang


We propose an effective Dual-Feature Aggregation Network (DFAN) for no-reference image quality assessment (NR-IQA) that uses the convolutional neural network (CNN) ResNet50 and the Vision Transformer (ViT). The proposed DFAN-IQA method consists of three modules: the module of attention feature aggregation (MAFA), the module of semantic feature aggregation (MSFA), and the module of prediction (MP). The MAFA uses the pre-trained ViT to extract attention features of different levels and aggregates these attention features by a novel attention feature aggregation (AFA) block, which consists of Fully-Connected (FC) layers, Rectified Linear Activation Function (ReLU), and Layer Normalization. The MSFA exploits the pre-trained ResNet50 to extract semantic features of different levels and aggregates these semantic features by a novel semantic feature aggregation (SFA) block, which consists of convolution, Depth-Wise Convolution (DWConv), ReLU, and Batch Normalization (BN). The MP uses concatenation and FC layers to process the outputs of MAFA and MSFA to calculate the quality score. Extensive experiments on open IQA datasets are conducted to test the proposed DFAN-IQA method. IQA performance comparisons show that the proposed DFAN-IQA method on the whole outperforms some state-of-the-art NR-IQA methods. A cross-dataset evaluation validates the generalization ability of the proposed DFAN-IQA method.


Oral 5 - Human Action Understanding

Place: Redaksjonsrom, MCB - Time: 9:00 - 10:00, Wednesday, January 11, 2023


Chair: Stevan Rudinac

Overall-Distinctive GCN for Social Relation Recognition on Videos

Authors: Yibo Hu, Chenyu Cao, Fangtao Li, Chenghao Yan, Jinsheng Qi, Bin Wu


Recognizing social relationships between multiple characters from videos can enable intelligent systems to serve human society better. Previous studies mainly focus on the still image to classify the relationships while ignoring the important data source of the video. With the prosperity of multimedia, the methods of video-based social relationship recognition gradually emerge. However, those methods either only focus on the logical reasoning between multiple characters or only on the direct interaction in each character pair. To that end, inspired by the rules of interpersonal social communication, we propose Overall-Distinctive GCN (OD-GCN) to recognize the relationships of multiple characters in the videos. Specifically, we first construct an overall-level character heterogeneous graph with two types of edges and rich node representation features to capture the implicit relationship of all characters. Then, we design an attention module to find mentioned nodes for each character pair from feature sequences fused with temporal information. Further, we build distinctive-level graphs to focus on the interaction between two characters. Finally, we integrate multimodal global features to classify relationships. We conduct the experiments on the MovieGraphs dataset and validate the effectiveness of our proposed model.

Weakly-supervised Temporal Action Localization with Regional Similarity Consistency

Authors: Haoran Ren, Hao Ren, Hong Lu, Cheng Jin


The weakly-supervised temporal action localization task aims to train a model that can accurately locate each action instance in the video using only video-level class labels. The existing methods take into account the information of different modalities (primarily RGB and Flow), and present numerous multi-modal complementary methods. RGB features are obtained by calculating appearance information and are easily disrupted by the background. In contrast, Flow features are obtained by calculating motion information and are usually less disrupted by the background. Based on this phenomenon, we propose a Regional Similarity Consistency (RSC) constraint between these two modalities to suppress the disturbance of background in RGB features. Specifically, we calculate the regional similarity matrices of RGB and Flow features, and impose the consistency constraint through an $L_2$ loss. To verify the effectiveness of our method, we integrate the proposed RSC constraint into three recent methods. The comprehensive experimental results show that the proposed RSC constraint can boost the performance of these methods and achieve state-of-the-art results on the widely-used THUMOS14 and ActivityNet1.2 datasets.
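The idea of a consistency constraint between two similarity matrices can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define how "regional" similarity is computed, so the sketch assumes pairwise cosine similarity between temporal snippets, and the names `cosine_similarity_matrix` and `rsc_l2_loss` are hypothetical rather than taken from the paper.

```python
import math


def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between snippet feature vectors.

    `features` is a list of T vectors (lists of floats); returns a T x T matrix.
    Assumption: each snippet plays the role of one "region".
    """
    # Guard against zero vectors by substituting a norm of 1.0.
    norms = [math.sqrt(sum(x * x for x in f)) or 1.0 for f in features]
    return [
        [
            sum(a * b for a, b in zip(fi, fj)) / (ni * nj)
            for fj, nj in zip(features, norms)
        ]
        for fi, ni in zip(features, norms)
    ]


def rsc_l2_loss(rgb_features, flow_features):
    """Mean squared difference between the RGB and Flow similarity matrices."""
    if len(rgb_features) != len(flow_features):
        raise ValueError("both modalities need the same number of snippets")
    s_rgb = cosine_similarity_matrix(rgb_features)
    s_flow = cosine_similarity_matrix(flow_features)
    n = len(s_rgb)
    total = sum(
        (s_rgb[i][j] - s_flow[i][j]) ** 2 for i in range(n) for j in range(n)
    )
    return total / (n * n)


rgb = [[1.0, 0.0], [0.0, 1.0]]   # two snippets that look different in RGB
flow = [[1.0, 0.0], [1.0, 0.0]]  # but share the same motion pattern
print(rsc_l2_loss(rgb, flow))
```

In a training setup, a term like this would typically be added to the localization loss so that the RGB similarity structure is pulled toward the Flow one; the weighting between the terms is not given in the abstract.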

A Spatio-Temporal Identity Verification Method for Person-Action Instance Search in Movies

Authors: Yanrui Niu, Jingyao Yang, Chao Liang, Baojin Huang, Zhongyuan Wang


As one of the challenging problems in video search, Person-Action Instance Search (P-A INS) aims to retrieve shots with a specific person carrying out a specific action from massive video shots. Most existing methods conduct person INS and action INS separately to compute the initial person and action ranking scores, which are then directly fused to generate the final ranking list. However, direct aggregation of two individual INS scores ignores the spatial relationships between person and action, and thus cannot guarantee their identity consistency, causing the identity inconsistency problem (IIP). To address IIP, we propose a simple spatio-temporal identity verification method. Specifically, in the spatial dimension, we propose an identity consistency verification (ICV) step to revise the direct fusion score of person INS and action INS. Moreover, in the temporal dimension, we propose a double-temporal extension (DTE) operation, i.e., intra-shot DTE and inter-shot DTE. The former improves the coverage of ICV by adding detection results on keyframes, and the latter improves the accuracy of P-A INS by using the coherence of action among shots in movies. The proposed method is evaluated on the large-scale NIST TRECVID INS 2019-2021 tasks, and the experimental results show that it can effectively mitigate the IIP, and that its performance surpasses that of the champion team in the 2019 INS task and the second-place teams in both the 2020 and 2021 INS tasks.

Binary Neural Network for Video Action Recognition

Authors: Hongfeng Han, Zhiwu Lu, Ji-Rong Wen


In the typical video action classification scenario, it is critical to extract the temporal and spatial information in the videos with complex 3D convolutional neural networks, which significantly increases computation and memory costs. In this paper, based on the binary neural network, we propose a novel 1-bit 3D convolution block named 3D BitConv, capable of compressing 3D convolutional networks while maintaining high accuracy. Due to its high flexibility, the proposed 3D BitConv block can be directly embedded in the latest 3D convolutional backbones. Consequently, we binarize two representative 3D convolutional neural networks (C3D and ResNet3D) and evaluate their performance on action recognition tasks. Extensive experiments on two widely used action recognition datasets demonstrate that the two binary 3D networks achieve impressive performance at a lower cost. Furthermore, we conduct a comprehensive ablation study to test and verify the effectiveness of the components in the proposed 3D BitConv.
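
The abstract does not describe the internals of 3D BitConv. As a generic illustration of what 1-bit weights mean for a 3D convolution kernel, the sketch below uses sign binarization with a per-filter scaling factor in the style of XNOR-Net. This is a swapped-in, well-known technique and not necessarily the authors' design; the function name and tensor layout are assumptions.

```python
import numpy as np

def binarize_weights(weights):
    """Approximate a weight tensor of shape (out_channels, ...) by alpha * sign(w).

    alpha is the mean absolute value of each output filter (XNOR-Net style).
    """
    flat = weights.reshape(weights.shape[0], -1)
    alpha = np.abs(flat).mean(axis=1)              # one scale per output filter
    signs = np.where(weights >= 0, 1.0, -1.0)      # 1-bit values in {-1, +1}
    shape = (-1,) + (1,) * (weights.ndim - 1)      # broadcastable scale shape
    return signs, alpha.reshape(shape)
```

Multiplying `signs` by `alpha` gives the real-valued approximation of the original kernel, while only the signs and one scale per filter need to be stored.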


Oral 6 - Multimedia Content Generation

Place: Redaksjonsrom, MCB - Time: 11:00 - 12:00, Thursday 11, 2023


Chair: Liting Zhou

C-GZS: Controllable Person Image Synthesis based on Group-Supervised Zero-shot Learning

Authors: Jiyun Li, Yuan Gao, Chen Qian, Jiachen Lu, Zhongqin Chen


The objective of person image synthesis is to generate an image of a person that is perceptually indistinguishable from an actual one. However, the technical challenges that occur in pose transfer, background swapping, and so forth ordinarily lead to an uncontrollable and unpredictable result. This paper proposes a zero-shot synthesis method based on group-supervised learning. The underlying model is a twofold auto-encoder, which first decomposes the latent feature of a target image into a disentangled representation of swappable components and then extracts and recombines the factors therein to synthesize a new person image. Finally, we demonstrate the superiority of our work through both qualitative and quantitative experiments.

DiffMotion: Speech-Driven Gesture Synthesis using Denoising Diffusion Model

Authors: Fan Zhang, Naye Ji, Fuxing Gao, Yongping Li


Speech-driven gesture synthesis is a field of growing interest in virtual human creation. However, a critical challenge is the inherently intricate one-to-many mapping between speech and gestures. Previous studies have explored generative models and achieved significant progress. Nevertheless, most synthesized gestures are still far less natural than real ones. This paper presents DiffMotion, a novel speech-driven gesture synthesis architecture based on diffusion models. The model comprises an autoregressive temporal encoder and a denoising diffusion probabilistic module. The encoder extracts the temporal context of the speech input and historical gestures. The diffusion module learns a parameterized Markov chain to gradually convert a simple distribution into a complex distribution and generates the gestures according to the accompanying speech. Compared with baselines, objective and subjective evaluations confirm that our approach can produce natural and diverse gesticulation and demonstrate the benefits of diffusion-based models for speech-driven gesture synthesis.
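
For readers unfamiliar with diffusion models, the sketch below shows the standard forward (noising) step of a denoising diffusion probabilistic model, which the learned Markov chain is trained to reverse. It is a generic illustration, not the DiffMotion implementation; the function name, the `betas` noise schedule, and the treatment of a gesture frame as a plain vector are assumptions.

```python
import numpy as np

def forward_noise(x0, t, betas, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(a_bar_t) * x0, (1 - a_bar_t) * I).

    betas is the per-step noise schedule and a_bar_t is the cumulative
    product of (1 - beta) up to and including step t.
    """
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
    return x_t, noise
```

During training, a network would typically be asked to predict `noise` from `x_t`, the step index, and the conditioning signal (here, the speech context).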

TG-dance: TransGAN-based intelligent dance generation with music

Authors: Dongjin Huang, Yue Zhang, Zhenyan Li, Jinhua Liu


Intelligent choreography from music is currently a popular field of study. Many works use fragment splicing to generate new motions, which lacks motion diversity. When the input is only music, frame-by-frame generation methods lead to similar motions being generated for the same music. Some works mitigate this problem by adding motions as one of the inputs, but require a large number of frames. In this paper, a new transformer-based neural network, TG-dance, is proposed for predicting high-quality 3D dance motions that follow the musical rhythms. We propose a new idea of multi-level expansion of motion sequences and design a new motion encoder using a multi-level transformer-upsampling layer. The multi-head attention in the transformer allows better access to contextual information. The upsampling greatly reduces the number of input motion frames and is memory friendly. We use a generative adversarial network to effectively improve the quality of the generated motions. We conducted experiments on the publicly available large dataset AIST++. The experimental results show that the TG-dance network outperforms the latest models both quantitatively and qualitatively. Our model takes fewer frames of motion sequences and audio features as input to predict high-quality 3D dance motion sequences that follow the rhythm of the music.

Visual Question Generation under Multi-Granularity Cross-Modal Interaction

Authors: Zi Chai, Xiaojun Wan, Soyeon Caren Han, Josiah Poon


Visual question generation (VQG) aims to automatically ask human-like questions about input images, targeting given answers. A key issue of VQG is performing effective cross-modal interaction, i.e., dynamically focusing on answer-related regions during question generation. In this paper, we propose a novel framework based on multi-granularity cross-modal interaction for VQG containing both object-level and relation-level interaction. For object-level interaction, we leverage both semantic and visual features under a contrastive learning scenario. We further illustrate the importance of high-level relations (e.g., spatial, semantic) between regions and answers for generating deeper questions. Since such information was somewhat ignored by prior VQG studies, we propose relation-level interaction based on graph neural networks. We perform experiments on the VQA2.0 and Visual7w datasets under automatic and human evaluations, and our model outperforms all baseline models.

Oral 7 - Multimodal and Multidimensional Imaging Application

Place: Redaksjonsrom, MCB - Time: 13:00 - 14:00, Thursday 11, 2023


Chair: Mehdi Elahi

Optimizing Local Feature Representations with Anisotropic Edge Modeling for 3D Point Cloud Understanding

Authors: Haoyi Xiu, Xin Liu, Weimin Wang, Kyoung-Sook Kim, Takayuki Shinohara, Qiong Chang, Masashi Matsuoka


An edge between two points describes rich information about the underlying surface. However, recent works merely use edge information as an ad hoc feature, which may undermine its effectiveness. In this study, we propose the Anisotropic Edge Modeling (AEM) block by which edges are modeled adaptively. As a result, the local feature representation is optimized where edges (e.g., object boundaries defined by ground truth) are appropriately enhanced. By stacking AEM blocks, AEM-Nets are constructed to tackle various point cloud understanding tasks. Extensive experiments demonstrate that AEM-Nets compare favorably to recent strong networks. In particular, AEM-Nets achieve state-of-the-art performance in object classification on ScanObjectNN, object segmentation on ShapeNet Part, and scene segmentation on S3DIS. Moreover, it is verified that AEM-Net outperforms the strong transformer-based method with significantly fewer parameters and FLOPs, achieving efficient learning. Qualitatively, the intuitive visualization of learned features successfully validates the effect of the AEM block.

Floor Plan Vectorization with Multimodal Analysis

Authors: Tao Wen, Chao Liang, You-Ming Fu, Chun-Xia Xiao, Hai-Ming Xiang


Floor plan analysis and vectorization are of practical importance in the real estate and interior design fields. Existing methods focus on the visual modality, which is weak in identifying room types. On the other hand, standard floor plan images have textual annotations that provide semantic guidance on room layouts. Motivated by this fact, we propose a multimodal segmentation network that exploits textual information for the analysis of floor plan images. We embed the texts extracted by optical character recognition (OCR) and fuse them with visual features by a cross-attention mechanism. We also improve the efficiency of the existing vectorization algorithm, which is characterized by intensive optimization solving. We replace the gradient-descent method with fast principal component analysis (PCA) to convert doors and windows into rectangles. Moreover, we remove the unnecessary iterative optimization steps when extracting room contours. Both quantitative and qualitative experiments validate the effectiveness and efficiency of the proposed method.
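
One plausible way to turn a set of door or window pixels into a rectangle with PCA is sketched below: the principal axes give the orientation, and the extent of the points along those axes gives the side lengths. The abstract does not spell out the authors' procedure, so this is an assumption-laden illustration with hypothetical names, not their algorithm.

```python
import numpy as np

def pca_rectangle(points):
    """Fit an oriented rectangle to 2D points of shape (N, 2) using PCA.

    Returns (center, axes, half_extents): the rows of `axes` are the principal
    directions, and half_extents are half the side lengths along each of them.
    """
    mean = points.mean(axis=0)
    centered = points - mean
    # Eigenvectors of the covariance matrix are the principal directions.
    _, vecs = np.linalg.eigh(np.cov(centered.T))
    axes = vecs.T[::-1]                    # largest-variance axis first
    proj = centered @ axes.T               # point coordinates in the PCA frame
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    center = mean + ((lo + hi) / 2) @ axes
    return center, axes, (hi - lo) / 2
```

Because this is a closed-form computation on the covariance matrix, it needs no iterative optimization, which matches the efficiency motivation stated in the abstract.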

Safe Contrastive Clustering

Authors: Pengwei Tang, Huayi Tang, Wei Wang, Yong Liu


Contrastive clustering is an effective deep clustering approach that learns both instance-level consistency and cluster-level consistency in a contrastive learning fashion. However, the data augmentation strategies used by contrastive clustering constitute important prior knowledge, and inappropriate strategies may cause severe performance degradation. By converting the different data augmentation strategies into a multi-view problem, we propose a safe contrastive clustering method that is guaranteed to alleviate the reliance on prior knowledge of data augmentations. The proposed method can maximize the complementary information between these different views and minimize the noise caused by the inferior views. Such a method addresses safeness, meaning that contrastive clustering with multiple data augmentation strategies is no worse than with one of those strategies. Moreover, we provide a theoretical guarantee that the proposed method can achieve empirical safeness. Extensive experiments demonstrate that our method can reach safe contrastive clustering on popular benchmark datasets.

SRes-NeRF: Improved Neural Radiance Fields for Realism and Accuracy of Specular Reflections

Authors: Shufan Dai, Yangjie Cao, Pengsong Duan, Xianfu Chen


Neural Radiance Fields (NeRF) is a popular view synthesis technique that represents a scene using a multilayer perceptron (MLP) combined with classic volume rendering and uses positional encoding techniques to increase image resolution. Although it can effectively represent the appearance of a scene, it often fails to accurately capture and reproduce the specular details of surfaces and requires a lengthy training time, ranging from hours to days, for a single scene. We address these limitations by introducing a representation consisting of a density voxel grid and an enhanced MLP for complex view-dependent appearance and model acceleration. Modeling with explicit and discretized volume representations is not new, but we propose the Swish Residual MLP (SResMLP). Compared with the standard MLP+ReLU network, the introduction of a layer scale module allows the shallow information of the network to be transmitted to the deep layers more accurately, maintaining the consistency of features. We introduce affine layers to stabilize training and accelerate convergence, and use the Swish activation function instead of ReLU. Finally, an evaluation on four inward-facing benchmarks shows that our method surpasses NeRF’s quality, takes only about 18 minutes to train from scratch for a new scene, and accurately captures the specular details of the scene surface. It achieves excellent performance even without positional encoding.
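
To make the ingredients named in the abstract above concrete, the sketch below shows the Swish activation (with beta = 1) and a single residual layer whose branch is multiplied by a layer-scale factor. The exact layout of SResMLP is not given in the abstract, so the block structure, argument names, and the use of one linear layer per block are assumptions.

```python
import numpy as np

def swish(x):
    """Swish activation with beta = 1: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def residual_block(x, weight, bias, layer_scale):
    """Hypothetical block: y = x + layer_scale * swish(x @ weight + bias).

    The skip connection carries shallow features forward unchanged, and
    layer_scale controls how much the new branch contributes.
    """
    return x + layer_scale * swish(x @ weight + bias)
```

Unlike ReLU, Swish is smooth and non-zero for small negative inputs, which is one commonly cited reason for preferring it in coordinate-based networks.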

Special Sessions

VBS - Video Browser Showdown

Place: MCB Media Lab - Time: 11:00 - 17:00, Thursday 11, 2023


Chair: TBA

VISIONE at Video Browser Showdown 2023

Authors: Amato, Giuseppe; Bolettieri, Paolo; Carrara, Fabio; Falchi, Fabrizio; Gennaro, Claudio; Messina, Nicola; Vadicamo, Lucia; Vairo, Claudio


In this paper, we present the fourth release of VISIONE, a tool for fast and effective video search on a large-scale dataset. It includes several search functionalities like text search, object and color-based search, semantic and visual similarity search, and temporal search. VISIONE uses ad-hoc textual encoding for indexing and searching video content, and it exploits a full-text search engine as search backend.

Traceable Asynchronous Workflows in Video Retrieval with vitrivr-VR

Authors: Spiess, Florian; Heller, Silvan; Rossetto, Luca; Sauter, Loris; Weber, Philipp; Schuldt, Heiko


Virtual reality (VR) interfaces allow for entirely new modes of user interaction with systems and interfaces. Much like in physical workspaces, documents, tools, and interfaces can be used, put aside, and used again later. Such asynchronous workflows are a great advantage of virtual environments, as they enable users to perform multiple tasks in an interleaved manner, allowing partial results or insights from one task to affect the choices made in another. However, the usage patterns of such workflows are much more challenging to analyze. In this paper we present the version of vitrivr-VR participating in the Video Browser Showdown (VBS) 2023. We describe the current state of our system, with a focus on text input methods and logging of asynchronous workflows.

Video Search With CLIP And Interactive Text Query Reformulation

Authors: Lokoč, Jakub; Vopálková, Zuzana; Dokoupil, Patrik; Peška, Ladislav


Nowadays, deep-learning-based models like CLIP allow the simple design of cross-modal video search systems able to solve many tasks considered highly challenging several years ago. In this paper, we analyze a CLIP-based search approach that focuses on situations where users cannot find proper text queries to describe the video segments they are searching for. The approach relies on suggestions of classes for displayed intermediate result sets and thus helps users to recall missing words and ideas for describing video frames. This approach is supported by a preliminary study showing the potential of the method. Based on the results, we extend a respected known-item search system for the Video Browser Showdown, where more challenging visual known-item search tasks are planned.

Perfect Match in Video Retrieval

Authors: Lubos, Sebastian; Rubino, Massimiliano; Tautschnig, Christian; Tautschnig, Markus; Wen, Boda; Schoeffmann, Klaus; Felfernig, Alexander


This paper presents the first version of our video search application Perfect Match for the Video Browser Showdown 2023 competition. The system indexes videos from the large V3C video dataset and derives visual content descriptors automatically. Further, it provides an interactive video search UI, which implements approaches from the domain of Critiquing-based recommendation.

QIVISE: A Quantum-inspired Interactive Video Search Engine in VBS2023

Authors: Song, Weixi; He, Jiangshan; Li, Xinghan; Feng, Shiwei; Liang, Chao


In this paper, we present a quantum-inspired interactive video search engine (QIVISE), which will be tested in VBS2023. QIVISE aims at assisting the user in dealing with Known-Item Search and Ad-hoc Video Search tasks with high efficiency and accuracy. QIVISE is based on a text-image encoder to achieve multi-modal embedding and introduces multiple interaction possibilities, including a novel quantum-inspired interaction paradigm, label search, and multi-modal search, to refine the retrieval results via the user’s interaction and feedback.

Exploring Effective Interactive Text-based Video Search in vitrivr

Authors: Sauter, Loris; Gasser, Ralph; Heller, Silvan; Rossetto, Luca; Saladin, Colin; Spiess, Florian; Schuldt, Heiko


vitrivr is a general-purpose retrieval system that supports a wide range of query modalities. In this paper, we briefly introduce the system and describe the changes and adjustments made for the 2023 iteration of the Video Browser Showdown. These focus primarily on text-based retrieval schemes and corresponding user-feedback mechanisms.

V-FIRST 2.0: Video Event Retrieval with Flexible Textual-Visual Intermediary for VBS 2023

Authors: Hoang-Xuan, Nhat; Nguyen, E-Ro; Nguyen-Ho, Thang-Long; Pham, Minh-Khoi; Nguyen, Quang-Thuc; Trang-Trung, Hoang-Phuc; Ninh, Van-Tu; Le, Tu-Khiem; Gurrin, Cathal; Tran, Minh-Triet


In this paper, we present a new version of our interactive video retrieval system V-FIRST. Besides the existing features of querying by textual descriptions and visual examples, we include a novel referring expression segmentation module to highlight the objects in an image. This is the first step towards providing adequate explainability to retrieval results, ensuring that the system can be trusted and used in domain-specific and critical scenarios. Searching by sequence of events is also a new addition, as it proves to be pivotal in finding events from memory. Furthermore, we improved our Optical Character Recognition capability, especially in the case of scene text. Finally, the inclusion of relevance feedback allows the user to explicitly refine the search space. All combined, our system has greatly improved for user interaction, leveraging more explicit information and providing more tools for the user to work with.

VERGE in VBS 2023

Authors: Pantelidis, Nick; Andreadis, Stelios; Pegia, Maria; Moumtzidou, Anastasia; Galanopoulos, Damianos; Apostolidis, Konstantinos; Touska, Despoina; Gkountakos, Konstantinos; Gialampoukidis, Ilias; Vrochidis, Stefanos; Mezaris, Vasileios; Kompatsiaris, Ioannis


This paper describes VERGE, an interactive video retrieval system for browsing a collection of images from videos and searching for specific content. The system utilizes many retrieval techniques as well as fusion and reranking capabilities. A Web Application is also part of VERGE, where a user can create queries, view the top results and submit the appropriate data, all in a user-friendly way.

Vibro: Video Browsing with Semantic and Visual Image Embeddings

Authors: Schall, Konstantin; Hezel, Nico; Barthel, Kai Uwe; Jung, Klaus


Vibro represents a powerful tool for interactive video retrieval and browsing and is the winner of the Video Browser Showdown 2022. Following the saying "never change a winning system", we did not change any of the underlying concepts or add any new features. Instead, we focused on improving the three existing cornerstones of the software, which are text-to-image search, image-to-image retrieval, and browsing of results with 2D sorted maps. The changes to these three parts are summarized in this paper and, additionally, a high-level overview of the AVS mode of Vibro is presented.

VideoCLIP: An Interactive Video CLIP-based Retrieval System at VBS2023

Authors: Nguyen, Thao-Nhu; Puangthamawathanakun, Bunyarit; Gurrin, Cathal; Caputo, Annalina; Healy, Graham; Nguyen, Binh T.; Arpnikanondt, Chonlameth


In this paper, we present an interactive video retrieval system named VideoCLIP developed for the Video Browser Showdown 2023. To support users in solving retrieval tasks, the system enables search using a variety of modalities, such as rich text, dominant colour, OCR, and query-by-image. Moreover, a new search modality has been added to empower our core engine, which is inherited from the Contrastive Language–Image Pre-training (CLIP) model. Finally, the user interface is enhanced to display results in groups in order to reduce the effort for a user when locating potentially relevant targets.

Free-form Multi-Modal Multimedia Retrieval (4MR)

Authors: Arnold, Rahel; Sauter, Loris; Schuldt, Heiko


Due to the ever-increasing amount of multimedia data, efficient means for multimedia management and retrieval are required. Especially with the rise of deep-learning-based analytics methods, the semantic gap has shrunk considerably, but a human in the loop is still considered mandatory. One of the driving factors of video search is that humans tend to refine their queries after reviewing the results. Hence, the entire process is highly interactive. A natural approach to interactive video search is using textual descriptions of the content of the expected result, enabled by deep-learning-based joint visual-text co-embedding. In this paper, we present the Multi-Modal Multimedia Retrieval (4MR) system, a novel system inspired by vitrivr that empowers users with almost entirely free-form query formulation methods. The top-ranked teams of the last few iterations of the Video Browser Showdown have shown that CLIP provides an ideal feature extraction method. Therefore, while 4MR is capable of image and text retrieval as well, for VBS, video retrieval is based primarily on CLIP.
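
The core of text-to-video search with a joint visual-text embedding can be sketched as a nearest-neighbor lookup by cosine similarity. The example below assumes the query and the keyframes have already been embedded by the same model (for instance CLIP) and deliberately leaves the embedding step out; the function name and `top_k` parameter are illustrative and do not describe 4MR's actual code.

```python
import numpy as np

def rank_by_cosine(query_embedding, frame_embeddings, top_k=5):
    """Return the indices of the top_k frames most similar to the query.

    query_embedding has shape (D,) and frame_embeddings has shape (N, D);
    both are assumed to live in the same joint text-image embedding space.
    """
    q = query_embedding / np.linalg.norm(query_embedding)
    f = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    scores = f @ q                          # cosine similarity per frame
    return np.argsort(-scores)[:top_k]      # best match first
```

In an interactive setting, the user would inspect the returned frames, refine the text, and run the ranking again.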

diveXplore at the Video Browser Showdown 2023

Authors: Schoeffmann, Klaus; Stefanics, Daniela; Leibetseder, Andreas


The diveXplore system has been participating in the VBS since 2017 and uses a sophisticated content analysis stack as well as an advanced interface for concept, object, event, and text search. This year, we make several changes in order to make the system both much easier to use and more efficient. These changes include using shot-based analysis instead of uniformly sampled segments, integrating textual and image transformers (CLIP), and optimizing the user interface for faster browsing of video summaries.

Reinforcement Learning Enhanced PicHunter for Interactive Search

Authors: MA, Zhixin; WU, Jiaxin; LOO, Weixiong; NGO, Chong wah


With the tremendous increase in video data size, search performance can be impacted significantly. In an interactive setting, a real-time system allows a user to quickly browse, search, and refine a query. Without a speedy system, which is the main ingredient in keeping a user focused, an interactive system becomes less effective even with a sophisticated deep learning system. This paper addresses this challenge by leveraging approximate search, Bayesian inference, and reinforcement learning. For approximate search, we apply hierarchical navigable small world (HNSW), an efficient approximate nearest neighbor search algorithm. To quickly prune the search scope, we integrate PicHunter, one of the most popular engines in the Video Browser Showdown, with reinforcement learning. The integration enhances PicHunter with the ability of systematic planning. Specifically, PicHunter performs a Bayesian update with a greedy strategy to select a small number of candidates for display. With reinforcement learning, the greedy strategy is replaced with a policy network that learns to select candidates that will result in the minimum number of user iterations, which is analytically defined by a reward function. With these improvements, the interactive system only searches a subset of the video dataset relevant to a query while being able to quickly perform Bayesian updates with systematic planning to recommend the most probable candidates that can potentially lead to the minimum number of iteration rounds.
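
The Bayesian update at the heart of a PicHunter-style loop can be sketched as follows: the system keeps a probability for every item being the user's target and reweights it after each user choice. The user model here (a softmax over similarities to the displayed items), the `temperature` parameter, and all names are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def bayesian_update(prior, similarity, displayed, chosen, temperature=1.0):
    """Update the belief over which item is the user's target.

    prior:      (N,) current probability of each item being the target
    similarity: (N, N) pairwise similarity between items
    displayed:  list of indices of the items shown to the user
    chosen:     index of the displayed item the user picked
    """
    # Assumed user model: a displayed item is picked with probability
    # proportional to exp(similarity to the true target / temperature).
    logits = similarity[:, displayed] / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    likelihood = probs[:, list(displayed).index(chosen)]  # P(choice | target = i)
    posterior = prior * likelihood
    return posterior / posterior.sum()
```

A greedy strategy would then display the items with the highest posterior; the abstract describes replacing that selection step with a learned policy network.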

SNL - Sport & Nutrition Lifelogging

Place: Seminar 1 - Time: 11:00 - 12:30, Tuesday 10, 2023


Chair: TBA

Arctic HARE: A Machine Learning-based System for Performance Analysis of Cross-country Skiers

Authors: Nordmo, Tor-Arne Schmidt; Riegler, Michael Alexander; Johansen, Håvard Dagenborg; Johansen, Dag


Advances in sensor technology and big data processing enable new and improved performance analysis of sport athletes. With the increase in data variety and volume, both from on-body sensors and cameras, it has become possible to quantify the specific movement patterns that make a good athlete. This paper describes Arctic Human Activity Recognition on the Edge (Arctic HARE): a skiing-technique training system that captures movement of skiers to match those against optimal patterns in well-known cross-country techniques. Arctic HARE uses on-body sensors in combination with stationary cameras to capture movement of the skier, and provides classification of the perceived technique. We explore and compare two approaches for classifying data, and determine optimal representations that embody the movement of the skier. We achieve higher than 96% accuracy for real-time classification of cross-country techniques.

Soccer Athlete Data Visualization and Analysis with an Interactive Dashboard

Authors: Boeker, Matthias; Midoglu, Cise


Soccer is among the most popular and followed sports in the world. As its popularity increases, it becomes highly professionalized. Even though research on soccer makes up a big part of classic sports science, there is a greater potential for applied research in digitalization and data science. In this work we present SoccerDashboard, a user-friendly, interactive, modularly designed and extendable dashboard for the analysis of health and performance data from soccer athletes, which is open-source and publicly accessible over the Internet for coaches, players and researchers from fields such as sports science and medicine. We demonstrate a number of the applications of this dashboard on the recently released SoccerMon dataset from Norwegian elite female soccer players. SoccerDashboard can simplify the analysis of soccer datasets with complex data structures, serve as a reference implementation for multidisciplinary studies spanning various fields, and increase the level of scientific dialogue between professional soccer institutions and researchers.

Sport and Nutrition Digital Analysis: A Legal Assessment

Authors: Juliussen, Bjørn Aslak; Rui, Jon Petter; Johansen, Dag


This paper presents and evaluates legal aspects related to digital technologies applied in the elite soccer domain. Data protection regulations in Europe clearly indicate that compliance-by-design is needed when developing and deploying such technologies. This is particularly true when health data is involved, but a complicating factor is that the distinction between what is and is not health data is unclear. Add to this the fact that modern analysis algorithms might deduce personal medical-related data when correlating and sifting through data that might seem more harmless in isolation. We conclude with a set of recommendations rooted in current legal frameworks for developers of sports and wellness systems where privacy and data protection can be at risk.

Towards Deep Personal Lifestyle Models using Multimodal N-of-1 Data

Authors: Nagesh, Nitish; Azimi, Iman; Andriola, Tom; Rahmani, Amir M.; Jain, Ramesh


The rise of wearable technology has enabled users to collect data about food, exercise, sleep, bio-markers and other lifestyle parameters continuously and almost unobtrusively. However, there is untapped potential in developing personal models due to challenges in collecting longitudinal data. Therefore, we collect N-of-1 dense multimodal data for an individual over a period of three years that encompasses their food intake, physical activity, sleep and other physiological parameters. We formulate hypotheses to examine relationships between these parameters and test their validity through a combination of correlation, network mapping and causality techniques. While we use correlation analysis and GIMME (Group Iterative Multiple Model Estimation) network plots to investigate association between parameters, we use causal inference to estimate causal effects and check the robustness of causal estimates by performing refutation analysis. Through our experiments, we achieve statistical significance for the causal estimate thereby validating our hypotheses. We hope to motivate individuals to collect and share their long-term multimodal data for building personal models thereby revolutionizing future health approaches.

Capturing Nutrition Data for Sports: Challenges and Ethical Issues

Authors: Sharma, Aakash; Czerwinska, Katja Pauline; Johansen, Dag; Johansen, Håvard D.


Nutrition plays a key role in an athlete's performance, health, and mental well-being. Capturing nutrition data is crucial to analyze relations and perform necessary interventions. Using traditional methods to capture long-term nutritional data requires intensive labor, and is prone to errors and biases. Artificial Intelligence (AI) methods can be used to remedy such problems by using Image-Based Dietary Assessment (IBDA) methods where athletes can take pictures of their food before consuming it. However, the current state of IBDA is not perfect. In this paper, we discuss the challenges faced in employing such methods to capture nutrition data. We also discuss ethical and legal issues that must be addressed before using these methods on a large scale.

MDRE - Multimedia Datasets for Repeatable Experimentation

Place: Seminar 1 - Time: 14:30 - 16:30, Tuesday 10, 2023


Chair: TBA

The VTF Dataset and a Multi-Scale Thermal-to-Visible Face Synthesis System

Authors: Ho, Tsung-Han; Yu, Chen-Yin; Ko, Tsai-Yen; Chu, Wei-Ta


We propose a multi-scale thermal-to-visible face synthesis system to achieve thermal face recognition. A generative adversarial network is constructed from one generator that transforms a given thermal face into a face in the visible spectrum, and three discriminators that consider multi-scale feature matching and high-frequency components, respectively. In addition, we provide a new paired thermal-visible face dataset called VTF that mainly contains Asian subjects captured in various visual conditions. This new dataset not only poses technical challenges to thermal face recognition, but also enables us to point out the race bias issue in current thermal face recognition methods. Overall, the proposed system achieves state-of-the-art performance on both the EURECOM and VTF datasets.

Link-Rot in Web-Sourced Multimedia Datasets

Authors: Lakic, Viktor; Rossetto, Luca; Bernstein, Abraham


The Web is increasingly used as a source of content for datasets of various types, especially multimedia content. These datasets are then often distributed as a collection of URLs pointing to the original sources of the elements. As these sources go offline over time, the datasets experience decay in the form of link-rot. In this paper, we analyze 24 web-sourced datasets with a combined total of 270 million URLs and find that over 20% of the content is no longer available. We discuss the adverse effects of this decay on the reproducibility of work based on such data and make some recommendations on how they could be mitigated in the future.
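
The headline figure in the abstract above is a simple ratio: the share of a dataset's URLs that no longer resolve. The sketch below computes that ratio for any availability check passed in as a function. The name `is_alive` and the idea of injecting the check are assumptions for this example; the authors' actual crawling and availability criteria are not described in the abstract.

```python
def link_rot_fraction(urls, is_alive):
    """Return the fraction of URLs for which is_alive(url) is False.

    is_alive is a caller-supplied function (hypothetical here), for example
    one that issues an HTTP request and inspects the status code.
    """
    urls = list(urls)
    if not urls:
        return 0.0
    dead = sum(1 for url in urls if not is_alive(url))
    return dead / len(urls)
```

Keeping the availability check separate makes it possible to test the bookkeeping offline and to swap in stricter criteria, such as verifying that the returned content is still the original media file.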

People@Places and ToDY: Two Datasets for Scene Classification in Media Production and Archiving

Authors: Bailer, Werner; Fassold, Hannes


In order to support common queries in visual media production and archiving, we propose two datasets which cover the annotation of the bustle of a scene (i.e., populated to unpopulated), the cinematographic type of a shot, and the time of day and season of a shot. The dataset for bustle and shot type, called People@Places, adds annotations to the Places365 dataset, and the ToDY (time of day/year) dataset adds annotations to the SkyFinder dataset. For both datasets, we provide a toolchain to create automatic annotations, which have been manually verified and corrected for parts of the two datasets. We provide baseline results for these tasks using the EfficientNet-B3 model, pretrained on the Places365 dataset.

ScopeSense: An 8.5-month sport, nutrition, and lifestyle lifelogging dataset

Authors: Riegler, Michael; Thambawita, Vajira; Chatterjee, Ayan; Nguyen, Thu; Hicks, Steven; Telle-Hansen, Vibeke; Pettersen, Svein Arne; Johansen, Dag; Jain, Ramesh; Halvorsen, Pål


Nowadays almost every person has a smartphone with them that is tracking their everyday performance. More and more people also have advanced smart watches that, in addition to activity data, are able to track several biomarkers. However, it is still unclear how this data could actually be put to use to improve certain aspects of people’s lives. One of the main challenges is that the data collected is often not structured, and a link to other important information, such as when, what, and how much food was consumed or how the person collecting the data was feeling, is missing. It is widely believed that such detailed, structured longitudinal data about a person is essential to model a person and provide personalized and precise guidance. Despite the strong belief of researchers in the power of such a data-oriented approach, such datasets have been difficult to collect. In this work, we present a unique dataset where two individuals performed a structured data collection over more than 8 months. In addition to the sensor data, they also collected all their nutrition, training and well-being related data. Availability of food data together with many other important objective and subjective longitudinal data streams will facilitate research related to food for health and enjoyment. We present this dataset for two people and discuss its potential use. We are using this as a possible starting point for developing methods that can collect and make sense of this data for a larger cohort of people. We intend to make this dataset available to the community for research in food and health.

Fast Accurate Fish Recognition with Deep Learning based on a Domain-Specific Large-Scale Fish Dataset

Authors: Lin, Yuan; Chu, Zhaoqi; Korhonen, Jari; Xu, Jiayi; Liu, Xiangrong; Liu, Juan; Liu, Min; Fang, Lvping; Yang, Weidi; Ghose, Debasish; You, Junyong


Fish species recognition is an integral part of sustainable marine biodiversity and aquaculture. The rapid emergence of deep learning methods has shown great potential on classification and recognition tasks when trained on a large-scale dataset. Nevertheless, some practical challenges remain for automating the task, e.g., the lack of appropriate methods applied to a complicated fish habitat. In addition, most publicly accessible fish datasets are small-scale and low-resolution, and have imbalanced data distributions or limited labels and annotations. In this work, we aim to overcome the aforementioned challenges. First, we construct the FishNet database with higher image quality and resolution that covers a large scale and diversity of marine-domain fish species in the East China Sea. The current version covers 63,622 pictures of 136 fine-grained fish species. Accompanying the dataset, we propose a fish recognition testbed by incorporating two widely applied deep neural network based object detection models to exploit the facility of the enlarged dataset, which achieves a convincing performance in detection precision and speed. The scale and hierarchy of FishNet can be further enlarged by enrolling new fish species and annotations. We hope that the FishNet database and the fish recognition testbed can serve as a generalized benchmark that motivates further development in related research communities.

GIGO, Garbage in, Garbage out: An Urban Garbage Classification Dataset

Authors: Sukel, Maarten; Rudinac, Stevan; Worring, Marcel


This paper presents a real-world domain-specific dataset, which facilitates algorithm development and benchmarking on the challenging problem of multimodal classification of urban waste in street-level imagery. The dataset, which we have named "GIGO: Garbage in, Garbage out," consists of 25k images collected from a large geographic area of Amsterdam. It is created with the aim of helping cities to collect different types of garbage from the streets in a more sustainable fashion. The collected data differs from existing benchmarking datasets, introducing unique scientific challenges. In this fine-grained classification dataset, the garbage categories are visually heterogeneous with different sizes, origins, materials, and visual appearance of the objects of interest. In addition, we provide various open data statistics about the geographic area in which the images were collected. Examples are information about demographics, different neighborhood statistics, and information about buildings in the vicinity. This allows for experimentation with multimodal approaches. Finally, we provide several state-of-the-art baselines utilizing the different modalities of the dataset.

Marine Video Kit: A New Marine Video Dataset for Content-based Analysis and Retrieval

Authors: Truong, Quang-Trung; Vu, Tuan-Anh; Ha, Tan-Sang; Lokoč, Jakub; Tim, Yue Him Wong; Joneja, Ajay; Yeung, Sai-Kit


Effective analysis of unusual domain specific video collections represents an important practical problem, where state-of-the-art general purpose models still face limitations. Hence, it is desirable to design benchmark datasets that challenge novel powerful models for specific domains with additional constraints. It is important to remember that domain specific data may be noisier (e.g., endoscopic or underwater videos) and often require more experienced users for effective search. In this paper, we focus on single-shot videos taken from moving cameras in underwater environments, which constitute a nontrivial challenge for research purposes. The first shard of a new Marine Video Kit dataset is presented to serve for video retrieval and other computer vision challenges. Our dataset is used in a special session during Video Browser Showdown 2023. In addition to basic meta-data statistics, we present several insights based on low-level features as well as semantic annotations of selected keyframes. The analysis also contains experiments showing limitations of respected general purpose models for retrieval. Our dataset and code are publicly available at

ICDAR - Intelligent Cross-Data Analysis & Retrieval

Place: Seminar 1 - Time: 11:00 - 12:30, Wednesday, January 11, 2023


Chair: TBA

EvIs-Kitchen: Egocentric Human Activities Recognition with Video and Inertial Sensor data

Authors: Hao, Yuzhe; Uto, Kuniaki; Kanezaki, Asako; Sato, Ikuro; Kawakami, Rei; Shinoda, Koichi


Egocentric Human Activity Recognition (ego-HAR) has received attention in fields where human intentions in a video must be estimated. The performance of existing methods, however, is limited due to insufficient information about the subject's motion in egocentric videos. We consider that a dataset of egocentric videos, complemented by two inertial sensors attached to both wrists of the subject to obtain more information about the subject's motion, will be useful to study the problem in depth. Therefore, this paper provides a publicly available dataset, EvIs-Kitchen, which contains well-synchronized egocentric videos and two-hand inertial sensor data, as well as interaction-highlighted annotations. We also present a baseline multimodal activity recognition method with a two-stream architecture and score fusion to validate that such multimodal learning on egocentric videos and inertial sensor data is more effective in tackling the problem. Experiments show that our multimodal method outperforms other single-modal methods on EvIs-Kitchen.
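
As a rough illustration of score fusion in a two-stream setup, the sketch below averages the per-class probabilities of a video stream and an inertial-sensor stream. This is an assumed, generic late-fusion scheme, not the baseline described in the paper; the function names, the equal weighting, and the logit values are all hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(video_logits: Sequence[float],
                sensor_logits: Sequence[float],
                video_weight: float = 0.5) -> List[float]:
    """Weighted average of the per-class probabilities from the two streams."""
    pv = softmax(video_logits)
    ps = softmax(sensor_logits)
    return [video_weight * a + (1.0 - video_weight) * b for a, b in zip(pv, ps)]


# Hypothetical logits for three activity classes.
video_logits = [2.0, 1.0, 0.1]
sensor_logits = [0.5, 2.5, 0.2]

fused = fuse_scores(video_logits, sensor_logits)
predicted = max(range(len(fused)), key=fused.__getitem__)
print("fused scores:", [round(p, 3) for p in fused], "predicted class:", predicted)
```

In this toy input the video stream alone favors class 0, while the fused scores favor class 1, which shows how the second modality can change the decision.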

COMIM-GAN: Improved Text-to-Image Generation via Condition Optimization and Mutual Information Maximization

Authors: Zhou, Longlong; Wu, Xiao-Jun; Xu, Tianyang


Language-based image generation is a challenging task. Current studies normally employ conditional generative adversarial network (cGAN) as the model framework and have achieved significant progress. Nonetheless, a close examination of their methods reveals two fundamental issues. First, the discrete linguistic conditions make the training of cGAN extremely difficult and impair the generalization performance of cGAN. Second, the conditional discriminator cannot extract semantically consistent features based on linguistic conditions, which is not conducive to conditional discrimination. To address these issues, we propose a condition optimization and mutual information maximization GAN (COMIM-GAN). To be specific, we design (1) a text condition construction module, which can construct a compact linguistic condition space, and (2) a mutual information loss between images and linguistic conditions to motivate the discriminator to extract more features associated with the linguistic conditions. Extensive experiments on CUB-200 and MS-COCO datasets demonstrate that our method is superior to the existing methods.

A Study of a Cross-modal Interactive Search Tool Using CLIP and Temporal Fusion

Authors: Lokoc, Jakub; Peska, Ladislav


Recently, the CLIP model demonstrated impressive performance in text-image search and zero-shot classification tasks. Hence, CLIP was used as a primary model in many cross-modal search tools at evaluation campaigns. In this paper, we present a study performed with the model integrated into a successful video search tool at the respected Video Browser Showdown competition. The tool allows more complex querying actions on top of the primary model. Specifically, temporal querying and Bayesian-like relevance feedback were tested, as well as their natural combination, temporal relevance feedback. In a thorough analysis of the tool's performance, we show current limits of cross-modal searching with CLIP and also the impact of more advanced query formulation strategies. We conclude that current cross-modal search models enable users to solve some types of tasks trivially with a single query; however, for more challenging tasks it is necessary to rely also on interactive search strategies.

A Cross-modal Attention Model for Fine-Grained Incident Retrieval from Dashcam Videos

Authors: Pham, Dinh-Duy; Dao, Minh-Son; Nguyen, Thanh-Binh


Dashcam video has become popular recently because it supports the safety of both individuals and communities. While individuals can have undeniable evidence for legal and insurance purposes, communities can benefit from sharing these dashcam videos for further traffic education and criminal investigation. Moreover, relying on recent computer vision and AI development, a few companies have launched the so-called AI dashcam that can alert drivers to near-risk accidents (e.g., following distance detection, forward collision warning) to improve driver safety. Unfortunately, even though dashcam videos create a driver's travel log (i.e., a traveling diary), little research focuses on creating a valuable and friendly tool to find any incident or event with few described sketches by users. Inspired by these observations, we introduce an interactive incident detection and retrieval system for first-view travel-log data that can retrieve fine-grained incidents for both defined and undefined incidents. Moreover, the system gives promising results when evaluated on several public datasets and popular text-image retrieval methods. The source code is published at _Model_Attention

Textual Concept Expansion with Commonsense Knowledge to Improve Dual-Stream Image-Text Matching

Authors: Liang, Mingliang; Liu, Zhuoran; Larson, Martha


We propose a Textual Concept Expansion (TCE) approach for creating joint textual-visual embeddings. TCE uses a multi-label classifier that takes a caption as input and produces as output a set of concepts that are used to expand, i.e., enrich the caption. TCE addresses the challenge of the limited number of concepts common between an image and its caption by leveraging general knowledge about the world, i.e., commonsense knowledge. Following a recent trend, the commonsense knowledge is acquired by creative use of the training data. We test TCE within a popular dual-stream approach, Consensus-Aware Visual-Semantic Embedding (CVSE). This popular approach leverages a graph that encodes the co-occurrence of concepts, which it takes to represent a consensus between the textual and visual modality that captures commonsense knowledge. Experimental results demonstrate an improvement of image-text matching when TCE is used for the expansion of the background collection and the query. Query expansion, not possible in the original CVSE, is particularly helpful. TCE can be extended in the future to make use of data that is similar to the target domain, but is drawn from an additional, external data set.

Generation of Synthetic Tabular Healthcare Data Using Generative Adversarial Networks

Authors: Nik, Alireza Hossein Zadeh; Riegler, Michael A.; Halvorsen, Pål; Storås, Andrea M.


High-quality tabular data is a crucial requirement for developing data-driven applications, especially healthcare-related ones, because most of the data nowadays collected in this context is in tabular form. However, strict data protection laws complicate access to medical datasets. Thus, synthetic data has become an ideal alternative for data scientists and healthcare professionals to circumvent such hurdles. Although many healthcare institutions still use the classical de-identification and anonymization techniques for generating synthetic data, deep learning-based generative models such as Generative Adversarial Networks (GANs) have shown a remarkable performance in generating tabular datasets with complex structures. This paper examines the GANs’ potential and applicability within the healthcare industry, which often faces serious challenges with insufficient training data and patient records sensitivity. We investigate several state-of-the-art GAN-based models proposed for tabular synthetic data generation. Healthcare datasets with different sizes, numbers of variables, column data types, feature distributions, and inter-variable correlations are examined. Moreover, a comprehensive evaluation framework is defined to evaluate the quality of the synthetic records and the viability of each model in preserving the patients’ privacy. The results indicate that the proposed models can generate synthetic datasets that maintain the statistical characteristics, model compatibility and privacy of the original data. Moreover, synthetic tabular healthcare datasets can be a viable option in many data-driven applications. However, there is still room for further improvements in designing a perfect architecture for generating synthetic tabular data.

FL-Former: Flood Level Estimation with Vision Transformer for Images from Cameras in Urban Areas

Authors: Le, Quoc-Cuong; Le, Minh-Quan; Tran, Mai-Khiem; Le, Ngoc-Quyen; Tran, Minh-Triet


Flooding in urban areas is one of the serious problems and needs special attention in urban development and improving people's living quality. Flood detection to promptly provide data for hydrometeorological forecasting systems will help make timely forecasts for life. In addition, providing information about rain and flooding in many locations in the city will help people make appropriate decisions about traffic. Therefore, in this paper, we present our FL-Former solution for detecting and classifying rain and inundation levels in urban locations, specifically in Ho Chi Minh City, based on images recorded from cameras using Vision Transformer. We also build the HCMC-URF dataset with more than 10K images of various rainy and flooding conditions in Ho Chi Minh City to serve the community's research. Finally, we propose the software architecture and construction of an online API system to provide timely information about raining and flooding at several locations in the city as extra input for hydrometeorological analysis and prediction systems, as well as provide information to citizens via mobile or web applications.

Brave New Ideas

Place: Seminar 2 - Time: 11:00 - 12:00, Wednesday, January 11, 2023


Chair: TBA

Multimedia datasets: challenges and future possibilities

Authors: Nguyen, Thu; Storaas, Andrea M.; Thambawita, Vajira; Hicks, Steven A.; Halvorsen, Paal; Riegler, Michael A.


Public multimedia datasets can enhance knowledge discovery and model development as more researchers have the opportunity to contribute to exploring them. However, as these datasets become larger and more multimodal, besides analysis, efficient storage and sharing can become a challenge. Furthermore, there are inherent privacy risks when publishing any data containing sensitive information about the participants, especially when combining different data sources leading to unknown discoveries. Proposed solutions include standard methods for anonymization and new approaches that use generative models to produce fake data that can be used in place of real data. However, there are many open questions regarding whether these generative models hold information about the data used to train them and if this information could be retrieved, making them not as privacy-preserving as one may think. This paper discusses the long-term and short-term challenges associated with publishing open multimedia datasets, including questions regarding efficient sharing, data modeling, and ensuring that the data is appropriately anonymized.

The Importance of Image Interpretation: Patterns of Semantic Misclassification in Real-World Adversarial Images

Authors: Zhao, Zhengyu; Dang, Nga; Larson, Martha


Adversarial images are created with the intention of causing an image classifier to produce a misclassification. In this paper, we propose that adversarial images should be evaluated based on semantic mismatch, rather than label mismatch, as used in current work. In other words, we propose that an image of a "mug" would be considered adversarial if classified as "turnip", but not as "cup", as current systems would assume. Our novel idea of taking semantic misclassification into account in the evaluation of adversarial images offers two benefits. First, it is a more realistic conceptualization of what makes an image adversarial, which is important in order to fully understand the implications of adversarial images for security and privacy. Second, it makes it possible to evaluate the transferability of adversarial images to a real-world classifier, without requiring the classifier's label set to have been available during the creation of the images. The paper carries out an evaluation of a transfer attack on a real-world image classifier that is made possible by our semantic misclassification approach. The attack reveals patterns in the semantics of adversarial misclassifications that could not be investigated using conventional label mismatch.
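
The difference between the two evaluation criteria can be sketched in a few lines. This is only an illustration of the mug/cup/turnip example from the abstract: the `SEMANTIC_GROUP` table and both function names are hypothetical, and the paper does not state how semantic closeness is actually determined.

```python
# Hypothetical semantic groups; a real evaluation would rely on a proper
# lexical resource or human judgments rather than this hand-made table.
SEMANTIC_GROUP = {
    "mug": "drinking vessel",
    "cup": "drinking vessel",
    "turnip": "vegetable",
}


def label_mismatch(true_label: str, predicted: str) -> bool:
    """Conventional criterion: any differing label counts as adversarial."""
    return true_label != predicted


def semantic_mismatch(true_label: str, predicted: str) -> bool:
    """Semantic criterion: only a change of semantic group counts."""
    # Labels missing from the table fall back to themselves as their group.
    g_true = SEMANTIC_GROUP.get(true_label, true_label)
    g_pred = SEMANTIC_GROUP.get(predicted, predicted)
    return g_true != g_pred


for predicted in ("cup", "turnip"):
    print(predicted,
          "label mismatch:", label_mismatch("mug", predicted),
          "semantic mismatch:", semantic_mismatch("mug", predicted))
```

Under label mismatch both predictions count as successful attacks; under semantic mismatch only "turnip" does.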


Place: Seminar 2 - Time: 13:00 - 14:00, Wednesday, January 11, 2023


Chair: TBA

Social Relation Graph Generation on Untrimmed Video

Authors: Hu, Yibo; Yan, Chenghao; Cao, Chenyu; Wang, Haorui; Wu, Bin


For a more intuitive understanding of videos, we demonstrate SRGG-UnVi, a social relation graph generation system for untrimmed videos. Given a video, the demonstration can combine existing knowledge to build a dynamic relation graph and a static multi-relation graph. SRGG-UnVi integrates various multimodal technologies, including Automatic Speech Recognition (ASR), Natural Language Understanding (NLU), face recognition and clustering, multimodal video relation extraction, etc. The system consists of three modules: (1) The video process engine takes advantage of parallelization, efficiently providing multimodal information to other modules. (2) The relation recognition module utilizes multimodal information to extract the relationship between characters in each scene. (3) The graph generation module generates a social relation graph for users.

Improving Parent-Child Co-play in a Roblox Game

Authors: Geffen, Jonathan


Co-play of digital games between parents and their children is a fruitful but underutilized parental mediation strategy. Previous research on this topic has resulted in various design recommendations meant to support and encourage co-play. However, most of these recommendations have yet to be applied and systematically validated within co-play focused games. Based on such design recommendations, our demo paper bridges this research gap by advancing the co-play experience of an existing Roblox game, Funomena’s Magic Beanstalk. In our study, we departed from a subset of potential design recommendations to redesign two of Magic Beanstalk’s mini-games. The two in-house redesigned mini-games were then evaluated by parent-child dyads in a qualitative evaluation, comparing the co-play experience of the original and of our redesigned games. This initial evaluation demonstrates that designing games according to established design recommendations has the potential to improve co-play experiences.

Taylor – Impersonation of AI for Audiovisual Content Documentation and Search

Authors: de Jesus Oliveira, Victor Adriel; Rottermanner, Gernot; Boucher, Magdalena; Größbacher, Stefanie; Judmaier, Peter; Bailer, Werner; Thallinger, Georg; Kurz, Thomas; Frank, Jakob; Bauer, Christoph; Fröschl, Gabriele; Batlogg, Michael


While AI-based audiovisual analysis tools have without doubt made huge progress, integrating them in media production and archiving workflows is still challenging, as the provided annotations may not match needs in terms of type, granularity and accuracy of metadata, and do not well align with existing workflows. We propose a system for annotation and search in media archive applications, using a range of AI-based analysis methods. In order to facilitate communication of explanations and collect relevance feedback, an impersonation of the systems' intelligence, named Taylor, is included as an element of the user interfaces.

Virtual Try-On Considering Temporal Consistency for Videoconferencing

Authors: Shimizu, Daiki; Yanai, Keiji


Virtual fitting, in which a person's image is changed to an arbitrary clothing image, is expected to be applied to shopping sites and videoconferencing. In real-time virtual fitting, image-based methods using a knowledge distillation technique can generate high-quality fitting images by inputting only the image of arbitrary clothing and a person, without requiring additional data such as pose information. However, few studies perform fast and stable virtual fitting from arbitrary clothing images with real person images for situations such as videoconferencing while considering temporal consistency. Therefore, the purpose of this demo is to perform robust virtual fitting with temporal consistency for videoconferencing. First, we developed a virtual fitting system and verified how effective the existing fast image fitting method is for webcam video. The results showed that the existing methods do not adjust the dataset and do not consider temporal consistency, and thus are unstable for input images similar to videoconferencing. Therefore, we propose to train a model that adjusts the dataset to be similar to a videoconference and to add a temporal consistency loss. Qualitative evaluation of the proposed model confirms that the model exhibits less flicker than the baseline.


Place: Redaksjonsrom - Time: 13:00 - 17:00, Wednesday, January 11, 2023


Each poster will be displayed for 1h in the showroom (Zoom Room) in its scheduled slot, i.e., Poster 1 from 13:00 to 14:00, Poster 2 from 14:00 to 15:00, Poster 3 from 15:00 to 16:00, and Poster 4 from 16:00 to 17:00.

Online participants can see the posters, interact using Zoom and move freely between showrooms.


Poster 1 - Transparent Object Detection with Simulation Heatmap Guidance and Context Spatial Attention

Authors: Chen, Shuo; Li, Di; Ju, Bobo; Jiang, Linhua; Zhao, Dongfang


The texture scarcity properties make transparent object localization a challenging task in the computer vision community. This paper addresses this task in two aspects. (i) Additional guidance cues: we propose a Simulation Heatmap Guidance (SHG) to improve the localization ability of the model. Concretely, the target’s extreme points and inference centroids are used to generate simulation heatmaps to offer additional position guides. A high recall is rewarded even in extreme cases. (ii) Enhanced attention: we propose a Context Spatial Attention (CSA) combined with a unique backbone to build dependencies between feature points and to boost multi-scale attention fusion. CSA is a lightweight module and brings apparent perceptual gain. Experiments show that our method achieves more accurate detection for cluttered transparent objects in various scenarios and background settings, outperforming the existing methods.

Poster 2 - CCF-Net: A Cascade Center-based Framework Towards Efficient Human Parts Detection

Authors: Ye, Kai; Ji, Haoqin; Li, Yuan; Wang, Lei; Liu, Peng; Shen, Linlin


Human parts detection has made remarkable progress due to the development of deep convolutional networks. However, many SOTA detection methods incur a large computational cost and are still difficult to deploy on edge devices with limited computing resources. In this paper, we propose a lightweight Cascade Center-based Framework, called CCF-Net, for human parts detection. Firstly, a Gaussian-Induced penalty strategy is designed to ensure that the network can handle objects of various scales. Then, we use a Cascade Attention Module to capture relations between different feature maps, which refines intermediate features. With our novel cross-dataset training strategy, our framework fully explores the datasets with incomplete annotations and achieves better performance. Furthermore, Center-based Knowledge Distillation is proposed to enable student models to learn better representation without additional cost. Experiments show that our method achieves a new SOTA performance on Human-Parts and COCO Human Parts benchmarks.

Poster 3 - Dynamic-static Cross Attentional Feature Fusion Method for Speech Emotion Recognition

Authors: Dong, Ke; Peng, Hao; Che, Jie


The dynamic-static fusion features play an important role in speech emotion recognition (SER). However, the fusion methods of dynamic features and static features are generally simple addition or serial fusion, which might cause the loss of certain underlying emotional information. To address this issue, we propose a dynamic-static cross attentional feature fusion method (SD-CAFF) with a cross attentional feature fusion mechanism (Cross AFF) to extract superior deep dynamic-static fusion features. To be specific, the Cross AFF is utilized to fuse in parallel the deep features from the CNN/LSTM feature extraction module, which can extract the deep static features and the deep dynamic features from acoustic features (MFCC, Delta, and Delta-delta). In addition to the SD-CAFF framework, we also employ multi-label auxiliary learning in the training process to further improve the accuracy of emotion recognition. The experimental results on IEMOCAP demonstrate that the WA and UA of SD-CAFF are 75.78% and 74.89%, respectively, which outperform the current SOTAs. Furthermore, SD-CAFF achieves competitive performance (WA: 56.77%; UA: 56.30%) in the comparison experiments of cross-corpus capability on MSP-IMPROV. Finally, we conduct ablation experiments to verify the effectiveness of SD-CAFF and the necessity of each module.
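
The abstract does not define WA and UA. Assuming the usual SER convention, in which weighted accuracy (WA) is the overall share of correct predictions and unweighted accuracy (UA) is the mean of the per-class recalls, the two metrics could be computed as in the sketch below. The function name `wa_ua` and the toy labels are hypothetical.

```python
from collections import defaultdict
from typing import Sequence, Tuple


def wa_ua(y_true: Sequence[str], y_pred: Sequence[str]) -> Tuple[float, float]:
    """Return (WA, UA): overall accuracy and mean per-class recall."""
    correct = 0
    per_class_total = defaultdict(int)
    per_class_correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        per_class_total[t] += 1
        if t == p:
            correct += 1
            per_class_correct[t] += 1
    wa = correct / len(y_true)
    recalls = [per_class_correct[c] / n for c, n in per_class_total.items()]
    ua = sum(recalls) / len(recalls)
    return wa, ua


# Toy, imbalanced example: "angry" dominates the label distribution.
y_true = ["angry", "angry", "angry", "angry", "sad", "happy"]
y_pred = ["angry", "angry", "angry", "sad", "sad", "angry"]

wa, ua = wa_ua(y_true, y_pred)
print(f"WA = {wa:.4f}, UA = {ua:.4f}")
```

Because UA gives every emotion class the same weight, it drops below WA here: the single "happy" utterance is missed entirely, which barely affects WA but removes a third of the UA score.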

Poster 4 - Audio-Visual Sensor Fusion Framework using Person Attributes Robust to Missing Visual Modality for Person Recognition

Authors: John, Vijay Cornelius Kirubakaran; Kawanishi, Yasutomo


Audio-visual person recognition is the problem of recognizing a person’s identity from multimodal audio-visual data. Audio-visual person recognition has many applications in security, surveillance, biometrics, etc. Deep learning-based audio-visual person recognition methods report state-of-the-art person recognition accuracy. However, existing audio-visual frameworks require the presence of both modalities. This approach is limited by the problem of missing modalities, where one or more of the modalities could be missing. In this paper, we formulate an audio-visual person recognition framework where we define and address the missing visual modality problem. The proposed framework enhances the robustness of audio-visual person recognition even under the condition of missing visual modality using audio-based person attributes and a multi-head attention transformer-based network, termed the CTNet. The audio-based person attributes such as age, gender and race are predicted from the audio data using a deep learning model, termed the S2A network. The attributes predicted from the audio data, which are assumed to be always available, provide additional cues for the person recognition framework. The predicted attributes, the audio data and the image data, which may be missing, are given as input to the CTNet, which contains the multi-head attention branch. The multi-head attention branch addresses the problem of missing visual modality by assigning attention weights to the audio features, visual features and the audio-based attributes. The proposed framework is validated with the CREMA-D public dataset using comparative analysis and an ablation study. The results show that the proposed framework enhances the robustness of person recognition even under the condition of a missing visible camera.


Poster 1 - Deep3DSketch+: Rapid 3D Modeling from Single Free-hand Sketches

Authors: Chen, Tianrun; Fu, Chenglong; Zang, Ying; Zhu, Lanyun; Zhang, Jia; Mao, Papa; Sun, Lingyun


The rapid development of AR/VR brings tremendous demands for 3D content. While the widely-used Computer-Aided Design (CAD) method requires a time-consuming and labor-intensive modeling process, sketch-based 3D modeling offers a potential solution as a natural form of computer-human interaction. However, the sparsity and ambiguity of sketches make it challenging to generate high-fidelity content reflecting creators' ideas. Precise drawing from multiple views or strategic step-by-step drawings is often required to tackle the challenge but is not friendly to novice users. In this work, we introduce a novel end-to-end approach, Deep3DSketch+, which performs 3D modeling using only a single free-hand sketch without inputting multiple sketches or view information. Specifically, we introduce a lightweight generation network for efficient inference in real-time and a structural-aware adversarial training approach with a Stroke Enhancement Module (SEM) to capture the structural information to facilitate learning of the realistic and fine-detailed shape structures for high-fidelity performance. Extensive experiments demonstrated the effectiveness of our approach with the state-of-the-art (SOTA) performance on both synthetic and real datasets.

Poster 2 - Low-light Image Enhancement Under Non-uniform Dark

Authors: Li, Yuhang; Cai, Feifan; Tu, Yifei; Ding, Youdong


The low visibility of low-light images due to lack of exposure poses a significant challenge for vision tasks such as image fusion, detection and segmentation in low-light conditions. Real-world situations such as backlighting and shadow occlusion mostly involve non-uniform low light. Existing enhancement methods tend to brighten both low-light and normal-light regions, whereas we prefer to enhance dark regions while suppressing overexposed regions. To address this problem, we propose a new non-uniform dark visual network (NDVN) that uses the attention mechanism to enhance regions with different levels of illumination separately. Since deep learning is strongly data-driven, we carefully construct a non-uniform dark synthetic dataset (UDL) that is larger and more diverse than existing datasets and, more importantly, contains more non-uniform light states. We use the manually annotated luminance domain mask (LD-mask) in the dataset to drive the network to distinguish between low-light and extremely dark regions in the image. Guided by the LD-mask and the attention mechanism, the NDVN adaptively illuminates different light regions while enhancing the color and contrast of the image. More importantly, we introduce a new region loss function to constrain the network, resulting in better quality enhancement results. Extensive experiments show that our proposed network outperforms other state-of-the-art methods both qualitatively and quantitatively.

Poster 3 - Research on multi-task Semantic segmentation based on attention and feature Fusion method

Authors: Dong, Aimei; Liu, Sidi


Recently, single-task learning on semantic segmentation tasks has achieved good results. When multiple tasks are handled simultaneously, however, single-task learning requires an independent network structure for each task, with no interaction between tasks. This paper proposes a multi-task learning method based on feature fusion and an attention mechanism, which can simultaneously handle multiple related tasks (semantic segmentation, surface normal estimation, etc.). Our model includes a feature extraction module to extract semantic information at different scales, a feature fusion module to refine the extracted features, and an attention mechanism that processes information from the fusion modules to learn task-specific information. The network architecture proposed in this paper is trained in an end-to-end manner and simultaneously improves the performance of multiple tasks. Experiments are carried out on two well-known semantic segmentation datasets, and the accuracy of the proposed model is verified.

Poster 4 - Rumor Detection on Social Media by Using Global-Local Relations Encoding Network

Authors: Zhang, Xinxin; Pan, Shanliang; Qian, Chengwu; Yuan, Jiadong


With the rapid development of the Internet, social media has become the main platform for users to obtain news and share their opinions. While social media provides convenience to the life of people, it also offers advantageous conditions for publishing and spreading rumors. The rumors widely propagated on social media can cause mass panic and damage personal reputation. Since artificial detection methods take a lot of time, it becomes crucial to use intelligent methods for rumor detection. The recent rumor detection methods mostly use the meta-paths of post propagation to construct isomorphic graphs to find clues in the propagation structure. However, these methods do not fully use the global and local relations in the propagation graph and do not consider the correlation between different types of nodes. In this paper, we propose a Global-Local Relations Encoding Network (GLREN), which encodes node relations in the heterogeneous graph from global and local perspectives. First, we explore the semantic similarity between all source posts and comment posts to generate global and local semantic representations. Then, we construct user credibility levels and interaction relations to explore the potential relationship between users and misinformation. Finally, we introduce a root enhancement strategy to enhance the influence of source posts and publisher information. The experimental results show that our model can outperform the accuracy of the state-of-the-art methods by 3.0% and 6.0% on Twitter15 and Twitter16, respectively.


Poster 1 - Manga Text Detection with Manga-Specific Data Augmentation and Its Applications on Emotion Analysis

Authors: Yang, Yi-Ting; Chu, Wei-Ta


We target the detection of text in atypical font styles and on cluttered backgrounds in Japanese comics (manga). To enable the detection model to detect atypical text, we augment the training data with the proposed manga-specific data augmentation. A generative adversarial network is developed to generate atypical text regions, which are then blended into manga pages to greatly increase the volume and diversity of the training data. We verify the importance of manga-specific data augmentation. Furthermore, with the help of manga text detection, we fuse global visual features and local text features to enable more accurate emotion analysis.

Poster 2 - A Proposal-Improved Few-shot Embedding Model with Contrastive Learning

Authors: Gong, Fucai; Xie, Yuchen; Jiang, Le; Chen, Keming; Liu, Yunxin; Ye, Xiaozhou; Ouyang, Ye


Few-shot learning is increasingly popular in image classification. The key is to learn the significant features from source classes to match the support and query pairs. In this paper, we redesign the contrastive learning scheme in a few-shot manner with selected proposal boxes generated by a Navigator network. The main work of this paper includes: (i) We analyze the limitation of the hard sample generation used by current few-shot learning methods with contrastive learning and find that additional noise is introduced in the construction of the contrastive loss. (ii) We propose a novel embedding model with contrastive learning named infoPB, which improves hard samples with proposal boxes to improve Noise Contrastive Estimation. (iii) We demonstrate through an ablation study that infoPB is effective in few-shot image classification and benefits from the Navigator network. (iv) The performance of our method is evaluated thoroughly on typical few-shot image classification tasks. It achieves a new state-of-the-art performance compared with the best results of outstanding competitors on miniImageNet in 5-way, 5-shot, and tieredImageNet in 5-way, 1-shot/5-way, 5-shot.

Poster 3 - Space-time Video Super-resolution 3D Transformer

Authors: Zheng, Minyan; Luo, Jianping


Space-time video super-resolution aims to generate a high-resolution (HR), high-frame-rate (HFR) video from a low-frame-rate (LFR), low-resolution (LR) video. Simply combining video frame interpolation (VFI) and video super-resolution (VSR) networks does not yield satisfactory performance and also incurs a heavy computational burden. In this paper, we investigate a one-stage network to jointly up-sample video both in time and space. In our framework, a 3D pyramid structure with channel attention is proposed to fuse input frames and generate intermediate features. The features are fed into the 3D Transformer network to model global relationships between features. Our proposed network, 3DTFSR, can efficiently process videos without explicit motion compensation. Extensive experiments on benchmark datasets demonstrate that the proposed method achieves better quantitative and qualitative performance compared to a two-stage network.

Poster 4 - Unsupervised Encoder-Decoder Model For Anomaly Prediction Task

Authors: Wu, Jinmeng; Shu, Pengcheng; Hong, Hanyu; Li, Xingxun; Ma, Lei; Zhang, Yaozong; Zhu, Ying; Wang, Lei


For the anomaly detection task of video sequences, CNN-based methods have been able to learn to describe the normal situation without abnormal samples at training time by reconstructing the input frame or predicting the future frame, and then use the reconstruction error to represent the abnormal situation at testing time. Transformers, however, have achieved the same spectacular outcomes as CNN on many tasks after being utilized in the field of vision, and they have also been used in the task of anomaly detection. We present an unsupervised learning method based on Vision Transformer in this work. The model has an encoder-decoder structure, and the Memory module is used to extract and enhance the video sequence's local pattern features. We discover anomalies in various data sets and visually compare distinct scenes in the data set. The experimental results suggest that the model has a significant impact on the task of dealing with anomalies.


Poster 1 - SPEM: Self-adaptive Pooling Enhanced Attention Module for Image Recognition

Authors: Zhong, Shanshan; Wen, Wushao; Qin, Jinghui


Recently, many effective attention modules have been proposed to boost model performance by exploiting the internal information of convolutional neural networks in computer vision. In general, many previous works do not consider the design of the pooling strategy of the attention mechanism, since they take global average pooling for granted, which hinders further improvement of the performance of the attention mechanism. However, we empirically find and verify that a simple linear combination of global max-pooling and global min-pooling can produce pooling strategies that match or exceed the performance of global average pooling. Based on this empirical observation, we propose a simple-yet-effective attention module, SPEM, which adopts a self-adaptive pooling strategy based on global max-pooling and global min-pooling and a lightweight module for producing the attention map. The effectiveness of SPEM is demonstrated by extensive experiments on widely-used benchmark datasets and popular attention networks.
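
As a rough illustration of the pooling idea described in this abstract, the following sketch mixes global max-pooling and global min-pooling for a single channel. The weight `lam`, the convex-combination form, and the function names are assumptions made for illustration; the abstract does not specify how SPEM parameterizes or learns the combination.

```python
from typing import List

Channel = List[List[float]]  # one feature-map channel, H x W


def global_max_pool(channel: Channel) -> float:
    """Largest activation in the channel."""
    return max(v for row in channel for v in row)


def global_min_pool(channel: Channel) -> float:
    """Smallest activation in the channel."""
    return min(v for row in channel for v in row)


def global_avg_pool(channel: Channel) -> float:
    """Mean activation, the usual default in attention modules."""
    values = [v for row in channel for v in row]
    return sum(values) / len(values)


def mixed_pool(channel: Channel, lam: float) -> float:
    """Linear mix of max- and min-pooling.

    `lam` is a hypothetical mixing weight; in a self-adaptive module it
    would be predicted or learned rather than fixed by hand.
    """
    return lam * global_max_pool(channel) + (1.0 - lam) * global_min_pool(channel)


channel = [[1.0, 2.0], [3.0, 6.0]]
print(global_avg_pool(channel))   # mean of the four activations
print(mixed_pool(channel, 0.5))   # midpoint between min and max
```

With `lam = 1.0` the mix reduces to plain global max-pooling and with `lam = 0.0` to global min-pooling, so a single scalar per channel spans the whole family.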

Poster 2 - Weighted Multi-view Clustering Based on Internal Evaluation

Authors: Xu, Haoqi; Hou, Jian; Yuan, Huaqiang


As real-world data are often represented by multiple sets of features in different views, it is desirable to improve clustering results with respect to ordinary single-view clustering by making use of the consensus and complementarity among different views. For this purpose, weighted multi-view clustering is proposed to combine multiple individual views into one single combined view, which is used to generate the final clustering result. In this paper we present a simple yet effective weighted multi-view clustering algorithm based on internal evaluation of clustering results. Observing that an internal evaluation criterion can be used to estimate the quality of clustering results, we propose to weight different views to maximize the clustering quality in the combined view. We firstly introduce an implementation of the Dunn index and a heuristic method to determine the scale parameter in spectral clustering. Then an adaptive weight initialization and updating method is proposed to improve the clustering results iteratively. Finally we do spectral clustering in the combined view to generate the clustering result. In experiments with several publicly available image and text datasets, our algorithm compares favorably or comparably with some other algorithms.
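
For readers unfamiliar with internal evaluation, here is a minimal sketch of the classic Dunn index (smallest between-cluster distance divided by largest within-cluster diameter). It uses Euclidean distance and is not necessarily the variant implemented by the authors; the function name and input layout are assumptions.

```python
import math
from itertools import combinations
from typing import List, Sequence

Point = Sequence[float]


def dunn_index(clusters: List[List[Point]]) -> float:
    """Classic Dunn index: higher means compact, well-separated clusters."""
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    # Smallest distance between points that belong to different clusters.
    min_separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Largest distance between two points of the same cluster.
    max_diameter = max(
        (math.dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    if max_diameter == 0.0:
        raise ValueError("all clusters are single points; index undefined")
    return min_separation / max_diameter


clusters = [[(0.0, 0.0), (0.0, 1.0)], [(3.0, 0.0), (3.0, 1.0)]]
print(dunn_index(clusters))
```

A view-weighting scheme of the kind described above could, under these assumptions, score the clustering obtained in the combined view with such an index and adjust the view weights to increase it.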

Poster 3 - Graph-Based Data Association in Multiple Object Tracking: A Survey

Authors: Touska, Despoina; Gkountakos, Konstantinos; Ioannidis, Konstantinos; Tsikrika, Theodora; Vrochidis, Stefanos; Kompatsiaris, Ioannis


In Multiple Object Tracking (MOT), data association is a key component of the tracking-by-detection paradigm and endeavors to link a set of discrete object observations across a video sequence, yielding possible trajectories. Our intention is to provide a classification of numerous graph-based works according to the way they measure object dependencies and their footprint on the graph structure they construct. In particular, methods are organized into Measurement-to-Measurement (MtM), Measurement-to-Track (MtT), and Track-to-Track (TtT). At the same time, we include recent Deep Learning (DL) implementations among traditional approaches to present the latest trends and developments in the field and offer a performance comparison. In doing so, this work serves as a foundation for future research by providing newcomers with information about the graph-based bibliography of MOT.

Poster 4 - CTDA: Contrastive Temporal Domain Adaptation for Action Segmentation

Authors: Han, Hongfeng; Lu, Zhiwu; Wen, Ji-Rong


In video action segmentation scenarios, sufficient training data is needed for training a deep learning model. However, due to the high cost of human annotation for action segmentation, only very limited training videos can be accessible. Further, large spatio-temporal variations exist between the training and test data. Therefore, it is critical to obtain effective representations with limited training videos and efficiently utilize unlabeled test videos. To this end, based on self-supervised learning and domain adaptation, we propose a novel Contrastive Temporal Domain Adaptation (CTDA) framework for action segmentation. Specifically, in the self-supervised learning module, two auxiliary tasks are defined based on binary and sequential domain prediction and then are addressed by the combination of domain adaptation and contrastive learning. Further, a multi-stage architecture is devised with one prediction generation stage and three refinement stages to obtain the final results of action segmentation. Extensive experiments on two action segmentation benchmarks demonstrate that the proposed CTDA achieves state-of-the-art performance.


Poster 1 - Less is More: Similarity Models for Content-based Video Retrieval

Authors: Veselý, Patrik; Peska, Ladislav


The concept of object-to-object similarity plays a crucial role in interactive content-based video retrieval tools. Similarity (or distance) models are core components of several retrieval concepts, e.g. Query by Example or relevance feedback. In these scenarios, the common approach is to apply some feature extractor that transforms the object to a vector of features, i.e., positions it into an induced latent space. The similarity is then based on some distance metric in this space. Historically, feature extractors were mostly based on color histograms or hand-crafted descriptors such as SIFT, but nowadays state-of-the-art tools mostly rely on deep learning (DL) approaches. However, so far there has been no systematic study of how suitable individual feature extractors are in the video retrieval domain, or, in other words, to what extent human-perceived and model-based similarities are concordant. To fill this gap, we conducted a user study with over 4000 similarity judgements comparing over 20 variants of feature extractors. Results corroborate the dominance of deep learning approaches, but surprisingly favor smaller and simpler DL models over larger ones.
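
The feature-vector similarity model described in this abstract can be sketched as follows. Cosine similarity is used here only as one common choice of metric; the abstract does not say which metric each extractor was paired with, and the vectors and names below are made up for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> List[str]:
    """Return candidate ids ordered from most to least similar to the query."""
    return sorted(
        candidates,
        key=lambda name: cosine_similarity(query, candidates[name]),
        reverse=True,
    )


# Hypothetical 2-D "features"; real extractors output hundreds of dimensions.
frames = {"a": [2.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(rank_by_similarity([1.0, 0.0], frames))
```

Comparing such model-based rankings with human similarity judgements is the kind of concordance question the study addresses.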

Poster 2 - BENet: Boundary Enhance Network for Salient Object Detection

Authors: Yan, Zhiqi; Liang, Shuang


Although deep convolutional networks have achieved good results in the field of salient object detection, most of these methods cannot work well near the boundary. This results in poor boundary quality of network predictions, accompanied by a large number of blurred contours and hollow objects. To solve this problem, this paper proposes a Boundary Enhance Network (BENet) for salient object detection, which makes the network pay more attention to salient edge features by fusing auxiliary boundary information of objects. We adopt the Progressive Feature Extraction Module (PFEM) to obtain multi-scale edge and object features of saliency objects. In response to the semantic gap problem in feature fusion, we propose an Adaptive Edge Fusion Module (AEFM) to allow the network to adaptively and complementarily fuse edge features and salient object features. The Self Refinement (SR) module further repairs and enhances edge features. Moreover, in order to make the network pay more attention to the boundary, we design an edge enhance loss function, which uses the additional boundary maps to guide the network to learn rich boundary features at the pixel level. Experimental results show that our proposed method outperforms state-of-the-art methods on five benchmark datasets.

Poster 3 - Multi-view Adaptive Bone Activation from Chest X-Ray with Conditional Adversarial Nets

Authors: Niu, Chaoqun; Li, Yuan; Wang, Jian; Zhou, Jizhe; Xiong, Tu; Yu, Dong; Guo, Huili; Zhang, Lin; Liang, Weibo; Lv, Jiancheng


Activating bone from a chest X-ray (CXR) is significant for disease diagnosis and health equity in under-developed areas, while the complex overlap of anatomical structures in CXR constantly challenges bone activation performance and adaptability. Besides, due to high data collection and annotation costs, no large-scale labeled datasets are available. As a result, existing methods commonly use single-view CXR with annotations to activate bone. To address these challenges, in this paper, we propose an adaptive bone activation framework. This framework leverages Dual-Energy Subtraction (DES) images to form multi-view image pairs of the CXR and contrastive learning theory to construct training samples. In particular, we first devise a Siamese/Triplet architecture supervisor; correspondingly, we establish a cGAN-styled activator based on the learned skeletal information to generate the bone image from the CXR. To our knowledge, the proposed method is the first multi-view bone activation framework obtained without manual annotation and has more robust adaptability. The mean Relative Mean Absolute Error and the Fréchet Inception Distance are 3.45% and 1.12, respectively, which shows that the results activated by our method retain more skeletal details with few feature distribution changes. The visualized results show that our method can activate bone images from a single CXR while ignoring overlapping areas. Bone activation is drastically improved compared to the original images.

Poster 4 - Multi-scale and Multi-stage Deraining Network with Fourier Space Loss

Authors: Yan, Zhaoyong; Ma, Liyan; Luo, Xiangfeng; Sun, Yan


The goal of rain streak removal is to recover the rain-free background scene of an image degraded by rain streaks. Most current deep convolutional neural network methods have achieved dramatic performance. However, these methods still cannot capture discriminative features that clearly distinguish rain streaks from important image content. To solve this problem, we propose a Multi-scale and Multi-stage deraining network trained in an end-to-end manner. Specifically, we design a multi-scale rain streak extraction module to capture complex rain streak features across different scales through the multi-scale selection kernel attention mechanism. In addition, multi-stage learning is used to extract deeper feature representations of rain streaks and fuse different stages of background information. Furthermore, we introduce a Fourier space loss function to reduce the loss of high-frequency information in the background image and improve the quality of deraining results. Extensive experiments demonstrate that our network performs favorably against the state-of-the-art deraining methods.
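
To make the idea of a loss computed in Fourier space concrete, here is a small sketch that compares two grayscale images by the mean absolute difference of their 2-D DFT coefficients. The exact form of the authors' loss is not given in the abstract, so the L1 form, the naive DFT, and all names are assumptions.

```python
import cmath
from typing import List

Image = List[List[float]]  # grayscale image, H x W


def dft2(img: Image) -> List[List[complex]]:
    """Naive 2-D discrete Fourier transform (fine for tiny images only)."""
    h, w = len(img), len(img[0])
    spectrum = []
    for u in range(h):
        row = []
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    phase = -2j * cmath.pi * (u * y / h + v * x / w)
                    total += img[y][x] * cmath.exp(phase)
            row.append(total)
        spectrum.append(row)
    return spectrum


def fourier_l1_loss(pred: Image, target: Image) -> float:
    """Mean absolute difference between the two spectra (assumed form)."""
    fp, ft = dft2(pred), dft2(target)
    h, w = len(pred), len(pred[0])
    total = sum(abs(fp[u][v] - ft[u][v]) for u in range(h) for v in range(w))
    return total / (h * w)


clean = [[0.0, 1.0], [2.0, 3.0]]
derained = [[0.0, 1.0], [2.0, 2.5]]
print(fourier_l1_loss(derained, clean))
```

Because every frequency bin contributes to the sum, errors in fine detail are penalized directly instead of being averaged away as they can be in a pixel-space loss. A practical implementation would use an FFT routine from a tensor library rather than this quadruple loop.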


Poster 1 - Edge Assisted Asymmetric Convolution Network for MR Image Super-Resolution

Authors: Wang, Wanliang; Xing, Fangsen; Chen, Jiacheng; Tu, Hangyao


High-resolution magnetic resonance (MR) imaging is beneficial for accurate disease diagnosis and subsequent analysis. Currently, the single image super-resolution (SR) technique is an effective and less costly alternative for improving the spatial resolution of MR images. Structural information in MR images is crucial during clinical diagnosis, but it is often ignored by existing deep learning MR image SR techniques. Consequently, we propose an edge assisted feature extraction block (EAFEB), which can efficiently extract the content and edge features from low-resolution (LR) images, allowing the network to focus on both content and geometric structure. To fully utilize the features extracted by EAFEB, an asymmetric convolutional group (ACG) is proposed, which can balance structural feature preservation and content feature extraction. Moreover, we design a novel contextual spatial attention (CSA) method to help the network focus on critical information. Experimental results on various MR image sequences, including T1, T2, and PD, show that our Edge Assisted Asymmetric Convolution Network (EAACN) achieves superior results relative to recent leading SR models.

Poster 2 - PEFNet: Positional Embedding Feature for Polyp Segmentation

Authors: Nguyen Mau, Trong-Hieu; Trinh, Quoc-Huy; Bui, Nhat-Tan; Vo Thi, Phuoc-Thao; Nguyen, Minh-Van; Cao, Xuan-Nam; Tran, Minh-Triet; Nguyen, Hai-Dang


With the development of biomedical computing, the segmentation task plays an integral part in helping the doctor correctly identify the position of the polyps or the ache in the system. However, precise polyp segmentation is challenging because polyps of the same type vary in size, color, and texture, and previous methods cannot fully transfer information from encoder to decoder due to the lack of details and knowledge of previous layers. To deal with this problem, we propose PEFNet, a novel model using a modified UNet with a new Positional Embedding Feature block in the merging stage, which achieves better accuracy and generalization in polyp segmentation. The PEF block utilizes the information of the position, concatenated features, and extracted features to enrich the gained knowledge and improve the model's comprehension ability. With EfficientNetV2-L as the backbone, we obtain an IoU score of 0.8201 and a Dice coefficient of 0.8802 on the Kvasir-SEG dataset. With PEFNet, we also took second place in the task Medico: Transparency in Medical Image Segmentation at MediaEval 2021, which is clear proof of the effectiveness of our models.
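
The two metrics reported in this abstract can be computed for a single pair of binary masks as sketched below. The function name and the convention for two empty masks are assumptions; how the authors aggregate scores over the dataset is not stated here.

```python
from typing import List, Tuple

Mask = List[List[int]]  # binary segmentation mask, entries 0 or 1


def iou_and_dice(pred: Mask, target: Mask) -> Tuple[float, float]:
    """Intersection over Union and Dice coefficient for one mask pair."""
    intersection = sum(p & t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    pred_area = sum(v for row in pred for v in row)
    target_area = sum(v for row in target for v in row)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks empty: treated as a perfect match (a convention, assumed).
        return 1.0, 1.0
    iou = intersection / union
    dice = 2 * intersection / (pred_area + target_area)
    return iou, dice


pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]
print(iou_and_dice(pred, target))
```

Dice weights the overlap twice, so for any single mask pair it is at least as large as IoU.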

Poster 3 - Multimodal Reconstruct and Align Net for Missing Modality Problem in Sentiment Analysis

Authors: Luo, Wei; Xu, Mengying; Lai, Hanjiang


Multimodal Sentiment Analysis (MSA) aims at recognizing emotion categories from textual, visual, and acoustic cues. However, in real-life scenarios, one or two modalities may be missing for various reasons. When the text modality is missing, obvious deterioration is observed, since the text modality contains much more semantic information than the vision and audio modalities. To this end, we propose the Multimodal Reconstruct and Align Net (MRAN) to tackle the missing modality problem, especially to relieve the decline caused by the text modality's absence. We first propose the Multimodal Embedding and Missing Index Embedding to guide the reconstruction of the features of missing modalities. Then, visual and acoustic features are projected to the textual feature space, and all three modalities' features are learned to be close to the word embedding of their corresponding emotion category, aligning visual and acoustic features with textual features. In this text-centered way, the vision and audio modalities benefit from the more informative text modality. This improves the robustness of the network under different missing-modality conditions, especially when the text modality is missing. Experimental results on two multimodal benchmarks, IEMOCAP and CMU-MOSEI, show that our method outperforms baseline methods, achieving superior results under different kinds of missing-modality conditions.

Poster 4 - DHP: A Joint Video Download and Dynamic Bitrate Adaptation Algorithm for Short Video Streaming

Authors: Gao, Wenhua; Zhang, Lanju; Yang, Hao; Zhang, Yuan; Yan, Jinyao; Lin, Tao


With the development of multimedia technology and the upgrading of mobile terminal equipment, short video platforms and applications are becoming more and more popular. Compared with viewers of traditional long video, short video users tend to swipe away from the video they are currently viewing more frequently. Unviewed preloaded video chunks waste a large amount of bandwidth and do not contribute to improving the user QoE. Since bandwidth savings conflict with user QoE improvements, it is very challenging to satisfy both. To solve this problem, this paper proposes DHP, a joint video download and dynamic bitrate adaptation algorithm for short video streaming. DHP makes the chunk download decision based on the maximum buffer model and retention rate, and makes the dynamic bitrate adaptation decision according to past bandwidth and buffer size. Experimental results show that DHP can reduce the bandwidth waste by at most 66.74% and improve the QoE by at most 42.5% compared to existing solutions under various network conditions.


Poster 1 - An Occlusion Model for Spectral Analysis of Light Field Signal

Authors: Chen, Weiyan; Zhu, Changjian; Zhang, Shan; Xiang, Sen


Occlusion is a common phenomenon in real scenes and seriously affects the application of light field rendering (LFR) technology. We propose an occlusion model of the scene surface that approximates the surface as a set of concave and convex parabolas to solve the light field (LF) reconstruction problem. The first step in this model involves determining the occlusion function. After obtaining the occlusion function, we can then perform the plenoptic spectral analysis, through which the plenoptic spectrum reveals occlusion characteristics. Finally, the occlusion characteristics can be used to determine the minimal sampling rate, and a new reconstruction filter can be applied to calibrate the aliasing spectrum to achieve high-quality view synthesis. This extends previous work on LF reconstruction that considers the reconstruction filter. Experimental evaluation demonstrates that our occlusion model effectively addresses the occlusion problem while improving the rendering quality of the light field.

Poster 2 - MCOM-Live: A Multi-Codec Optimization Model at the Edge for Live Streaming

Authors: Lorenzi, Daniele; Tashtarian, Farzad; Amirpour, Hadi; Timmerer, Christian; Hellwagner, Hermann


HTTP Adaptive Streaming (HAS) is the predominant technique for delivering video content across the Internet, and demand for its applications keeps increasing. With the evolution of videos to deliver more immersive experiences, such as higher resolutions and frame rates, highly efficient video compression schemes are required to ease the burden on the delivery process. While AVC/H.264 remains the most widely adopted codec, we are experiencing an increase in the usage of new-generation codecs (HEVC/H.265, VP9, AV1, VVC/H.266, etc.). Compared to AVC/H.264, these codecs can either achieve the same quality at a reduced bitrate or improve the quality while targeting the same bitrate. In this paper, we propose a Mixed-Binary Linear Programming (MBLP) model called Multi-Codec Optimization Model at the edge for Live streaming (MCOM-Live) to jointly optimize (i) the overall streaming costs, and (ii) the visual quality of the content played out by the end-users by efficiently enabling multi-codec content delivery. Given a video content encoded with multiple codecs according to a fixed bitrate ladder, the model will choose among three available policies, i.e., fetch, transcode, or skip, the best option to handle the representations. We compare the proposed model with traditional approaches used in the industry. The experimental results show that our proposed method can reduce the additional latency by up to 23% and the streaming costs by up to 78%, besides improving the visual quality of the delivered segments by up to 0.5 dB, in terms of PSNR.
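
Since the quality gain above is reported in PSNR, the following sketch shows how PSNR is conventionally computed for 8-bit images. The function name and the list-of-lists image layout are assumptions for illustration; video evaluations typically average this value over frames or segments.

```python
import math
from typing import List

Image = List[List[float]]  # single-channel image, H x W


def psnr(reference: Image, distorted: Image, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; `max_value` is the peak pixel value."""
    squared_errors = [
        (r - d) ** 2
        for ref_row, dis_row in zip(reference, distorted)
        for r, d in zip(ref_row, dis_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [[100.0, 100.0], [100.0, 100.0]]
distorted = [[101.0, 99.0], [101.0, 99.0]]
print(round(psnr(reference, distorted), 2))
```

Because PSNR is logarithmic in the mean squared error, a gain of 0.5 dB corresponds to an MSE that is lower by a factor of 10^(-0.05), roughly 11%.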

Poster 3 - Lightweight image hashing via knowledge distillation and optimal transport for face retrieval

Authors: Feng, Ping; Zhang, Hanyun; Sun, Yingying; Tang, Zhenjun


This paper proposes a lightweight image hashing based on knowledge distillation and optimal transport for face retrieval. A key contribution is the attention-based triplet knowledge distillation, whose loss function includes attention loss, Kullback-Leibler (KL) loss and identity loss. It can significantly reduce network size with almost no decrease of retrieval performance. Another contribution is the hash quantization based on optimal transport. It partitions the face feature space by calculating class-centers and conducts binary quantization based on the optimal transport. It can make performance improvement in face retrieval with short bit length. In addition, an alternating training strategy is designed for tuning network parameters of our lightweight hashing. Many experiments on two face datasets are carried out to test performances of the proposed lightweight hashing. Retrieval comparisons illustrate that the proposed lightweight hashing outperforms some well-known hashing methods.

Poster 4 - Generating new paintings by semantic guidance

Authors: Pan, Ting; Wang, Fei; Xie, Junzhou; Liu, Weifeng


In order to facilitate the human painting process, numerous research efforts have been made on teaching machines how to "paint like a human", which is a challenging problem. Recent stroke-based rendering algorithms generate non-photorealistic imagery using a number of strokes to mimic a target image. However, previous methods can only draw the content of one target image on a canvas, which limits their generation ability. We propose a novel painting approach that teaches machines to paint with multiple target images and then generate new paintings. We consider the order of human painting and propose a combined stroke rendering method that can merge the content of multiple images into the same painting. We use semantic segmentation to obtain semantic information from multiple images, and add the semantic information from different images to the same painting process. Finally, our model can generate new paintings with content from different images under the guidance of this semantic information. Experimental results demonstrate that our model can effectively generate new paintings, which can assist humans in their creative work.


Poster 1 - Context-Guided Multi-view Stereo with Depth Back-Projection

Authors: Feng, Tianxing; Zhang, Zhe; Xiong, Kaiqiang; Wang, Ronggang


Depth map based Multi-view stereo (MVS) is a task that focuses on taking images from multiple views of the same scene as input, estimating depth in each view, and generating 3D reconstructions of objects in the scene. Though most matching-based MVS methods take features of the input images into account, few of them make the best of the underlying global information in images. They may struggle with difficult image regions, such as object boundaries, low-texture areas, and reflective surfaces. Human beings perceive these cases with the help of global awareness, that is to say, the context of the objects we observe. Similarly, we propose Context-guided Multi-view Stereo (ContextMVS), a coarse-to-fine pyramidal MVS network, which explicitly utilizes the context guidance in asymmetrical features to integrate global information into the 3D cost volume for feature matching. Also, with a low computational overhead, we adopt a depth back-projection refined up-sampling module to improve the non-parametric depth up-sampling between pyramid levels. Experimental results indicate that our method outperforms classical learning-based methods by a large margin on public benchmarks, DTU and Tanks and Temples, demonstrating the effectiveness of our method.

Poster 2 - LAE-Net: Light and Efficient Network for Compressed Video Action Recognition

Authors: Guo, Jinxin; Zhang, Jiaqiang; Zhang, Xiaojing; Ma, Ming


Action recognition is a crucial task in computer vision and video analysis. The Two-stream network and 3D ConvNets are representative works. Although both of them have achieved outstanding performance, optical flow and 3D convolution require huge computational effort and do not take the needs of real-time applications into account. Current work extracts motion vectors and residuals directly from the compressed video to replace optical flow. However, due to the noisy and inaccurate representation of the motion, the accuracy of the model decreases significantly when using motion vectors as input. Besides, current works focus only on improving accuracy or reducing computational cost, without exploring the tradeoff between them. In this paper, we propose a light and efficient multi-stream framework, including a motion time fusion module (MTFM) and a double compressed knowledge distillation module (DCKD). MTFM improves the network's ability to extract complete motion information and compensates to some extent for the inaccurate description of motion information by motion vectors in compressed video. DCKD allows the student network to gain more knowledge from the teacher with fewer parameters and input frames. Experimental results on two public benchmarks (UCF-101 and HMDB-51) show that our method outperforms the state of the art in the compressed domain.

Poster 3 - CMFG: Cross-Model Fine-Grained Feature Interaction for Text-Video Retrieval

Authors: Zhao, Shengwei; Liu, Yuying; Du, Shaoyi; Tian, Zhiqiang; Qu, Ting; Xu, Linhai


As a fundamental task in the multimodal domain, the text-to-video retrieval task has received great attention in recent years. Most of the current research focuses on the interaction between cross-modal coarse-grained features. However, the feature granularity of retrieval models has not been fully explored. Therefore, we introduce video internal region information into cross-modal retrieval and propose a cross-model fine-grained feature retrieval framework. Videos are represented as video-frame-region triple features, texts are represented as sentence-word dual features, and the cross-similarity between visual features and text features is computed through token-wise interaction. It effectively extracts the detailed information in the video, guides the model to pay attention to the effective video region information and keyword information in the sentence, and reduces the adverse effects of redundant words and interfering frames. On the most popular retrieval dataset MSRVTT, the framework achieves state-of-the-art results (51.1@1). Excellent experimental results demonstrate the superiority of fine-grained feature interaction.

Poster 4 - A Multi-Stream Fusion Network for Image Splicing Localization

Authors: Siopi, Maria; Kordopatis-Zilos, Giorgos; Charitidis, Polychronis; Kompatsiaris, Ioannis; Papadopoulos, Symeon


In this paper, we address the problem of image splicing localization with a multi-stream network architecture that processes the raw RGB image in parallel with other handcrafted forensic signals. Unlike previous methods that either use only the RGB images or stack several signals in a channel-wise manner, we propose an encoder-decoder architecture that consists of multiple encoder streams. Each stream is fed with either the tampered image or handcrafted signals and processes them separately to capture relevant information from each one independently. Finally, the extracted features from the multiple streams are fused in the bottleneck of the architecture and propagated to the decoder network that generates the output localization map. We experiment with two handcrafted algorithms, i.e., DCT and Splicebuster. Our proposed approach is benchmarked on three public forensics datasets, demonstrating very competitive performance against several competing methods and achieving state-of-the-art results, e.g., on CASIA with 0.898 AUC.


Poster 1 - RLSCNet: A Residual Line-shaped Convolutional Network for Vanishing Point Detection

Authors: Wang, Wei; Lu, Peng; Peng, Xujun; Yin, Wang; Zhao, Zhaoran


The convolutional neural network (CNN) is an effective model for vanishing point (VP) detection, but its success heavily relies on a massive amount of training data to ensure high accuracy. Without sufficient and balanced training data, the obtained CNN-based VP detection models can be easily overfitted with less generalization. By acknowledging that a VP in the image is the intersection of projections of multiple parallel lines in the scene and treating this knowledge as a geometric prior, we propose a prior-guided residual line-shaped convolutional network for VP detection to reduce the dependence of CNN on training data. In the proposed end-to-end approach, the probabilities of VP in the image are computed through an edge extraction subnetwork and a VP prediction subnetwork, which explicitly establishes the geometric relationships among edges, lines, and vanishing points by stacking the differentiable residual line-shaped convolutional modules. Our extensive experiments on various datasets show that the proposed VP detection network improves accuracy and outperforms previous methods in terms of both inference speed and generalization performance.

Poster 2 - DARTS-PAP: Differentiable neural architecture search by polarization of instance complexity weighted architecture parameters

Authors: Li, Yunhong; Li, Shuai; Yu, Zhenhua


Neural architecture search has attracted much attention because it can automatically find architectures with high performance. In recent years, differentiable architecture search has emerged as one of the main techniques for automatic network design. However, related methods suffer from performance collapse due to excessive skip-connect operations and discretization gaps between search and evaluation. To relieve performance collapse, we propose a polarization regularizer on instance-complexity weighted architecture parameters to push the probability of the most important operation in each edge to 1 and the probabilities of the other operations to 0. The polarization regularizer effectively removes the discretization gaps between the search and evaluation procedures, and instance-complexity aware learning of the architecture parameters gives higher weights to hard inputs and therefore further improves the network performance. Similar to existing methods, the search process is conducted in a differentiable way. Extensive experiments on a variety of search spaces and datasets show that our method can polarize the architecture parameters well and greatly reduce the number of skip-connect operations, which contributes to the performance of the searched network.

Poster 3 - Transferable Adversarial Attack on 3D Object Tracking in Point Cloud

Authors: Liu, Xiaoqiong; Lin, Yuewei; Yang, Qing; Fan, Heng


3D point cloud object tracking has recently witnessed considerable progress relying on deep learning. Such progress, however, mainly focuses on improving tracking accuracy. The risk of a tracker being attacked, especially considering that deep neural networks are vulnerable to adversarial perturbations, is often neglected and rarely explored. In order to draw attention to this potential risk and facilitate the study of robustness in point cloud tracking, we introduce a novel transferable attack network (TAN) to deceive 3D point cloud tracking. Specifically, TAN consists of a 3D adversarial generator, which is trained with a carefully designed multi-fold drift (MFD) loss. The MFD loss considers three common grounds, including classification, intermediate feature and angle drifts, across different 3D point cloud tracking frameworks for perturbation generation, leading to high transferability of TAN for attack. In our extensive experiments, we demonstrate that the proposed TAN is able to not only drastically degrade the victim 3D point cloud tracker, i.e., P2B [20], but also effectively deceive other unseen state-of-the-art approaches such as BAT [33] and M2Track [34], posing a new threat to 3D point cloud tracking. Code will be released upon publication.

Poster 4 - Fusion of multiple classifiers using self supervised learning for satellite image change detection

Authors: Oikonomidis, Alexandros; Pegia, Maria; Moumtzidou, Anastasia; Gialampoukidis, Ilias; Vrochidis, Stefanos; Kompatsiaris, Ioannis


Deep learning methods are widely used in the domain of change detection in remote sensing images. While datasets of that kind are abundant, annotated images specific to the task at hand are still scarce. Neural networks trained with self-supervised learning aim to harness large volumes of unlabeled high-resolution satellite images to help in finding better solutions for the change detection problem. In this paper, we experiment with this approach by presenting 4 different change detection methodologies. We propose a fusion method that, under specific parameters, can provide better results. We evaluate our results using two openly available datasets with Sentinel-2 satellite images, S2MTCP and OSCD, and we investigate the impact of using 2 different Sentinel-2 band combinations on our final predictions. Finally, we summarize the benefits of this approach and propose future areas of interest that could be of value in enhancing the outcomes of the change detection task.


Poster 1 - Energy Transfer Contrast Network for Unsupervised Domain Adaption

Authors: Ouyang, Jiajun; Lv, Qingxuan; Zhang, Shu; Dong, Junyu


The main goal of unsupervised domain adaptation is to improve the classification performance on unlabeled data in target domains. Many methods try to reduce the domain gap by treating multiple domains as one to enhance the generalization of a model. However, aligning domains as a whole does not account for instance-level alignment, which might lead to sub-optimal results. Currently, many researchers utilize meta-learning and instance segmentation approaches to tackle this problem. However, these approaches only further optimize the domain-invariant features learned by the model, rather than achieving instance-level alignment. In this paper, we interpret unsupervised domain adaptation from a new perspective, which exploits the energy difference between the source and target domains to reduce the performance drops caused by the domain gap. At the same time, we improve and exploit the contrastive learning loss, which can push the target domain away from the decision boundary. The experimental results on different benchmarks against a range of state-of-the-art approaches demonstrate the performance and effectiveness of the proposed method.

Poster 2 - Pseudo-Label Diversity Exploitation for Few-Shot Object Detection

Authors: Chen, Song; Wang, Chong; Liu, Weijie; Ye, Zhengjie; Deng, Jiacheng


The Few-Shot Object Detection (FSOD) task is widely used in various data-scarce scenarios, aiming to expand the object detector with a few novel class samples. The current mainstream FSOD models improve the accuracy by mining novel class instances in the training set and fine-tuning the detector with mined pseudo-label data. Substantial progress has been made using pseudo-label approaches, but the impact of pseudo-label diversity on FSOD tasks has not been explored. In our work, for the purpose of fully utilizing the pseudo-label data and exploring their diversity, we propose a new framework mainly including Novel Instance Bank (NIB) and Correlation-Guided Loss Correction (CGLC). Dynamically updated NIB stores the novel class instances to ensure the diversity of novel instances in each batch. Moreover, to better exploit the pseudo-label diversity, CGLC adaptively employs k-shot samples to guide correct and incorrect pseudo-labels to pull away from each other. Experimental results on the MS-COCO dataset demonstrate the effectiveness of our method, which does not require any additional training samples or parameters. Our code is available at:

Poster 3 - A Spectrum Dependent Depth Layered Model for Optimization Rendering Quality of Light Field

Authors: Gan, Xiangqi; Zhu, Changjian; Bai, Mengqin; Wei, Ying; Chen, Weiyan


Light field rendering is an important technique that uses a set of multi-view images to render realistic novel views through simple interpolation. However, the rendered novel views often suffer from distortions or low quality due to the complexity of the scene, e.g., occlusion and non-Lambertian reflections. In the spectrum of the light field signal, this distortion appears as periodic aliasing. In this paper, we propose a spectrum dependent depth layered (SDDL) model to eliminate the spectrum aliasing of light field signals and thereby improve the rendering quality of novel views. The SDDL model exploits the fact that the spectral structure of the light field signal is bounded only by the minimum depth and the maximum depth. By adding depth layers between the minimum and maximum depth, the sampling interval between cameras is reduced, and the spacing between two adjacent periods of the spectrum becomes larger. As a result, the aliasing in novel views becomes smaller and can be eliminated. Experimental results show that our method improves the rendering quality of the light field.

Poster 4 - Improving the Robustness to Variations of Objects and Instructions with a Neuro-Symbolic Approach for Interactive Instruction Following

Authors: Shinoda, Kazutoshi; Takezawa, Yuki; Suzuki, Masahiro; Iwasawa, Yusuke; Matsuo, Yutaka


An interactive instruction following task has been proposed as a benchmark for learning to map natural language instructions and first-person vision into sequences of actions to interact with objects in 3D environments. We found that an existing end-to-end neural model for this task tends to fail to interact with objects of unseen attributes and follow various instructions. We assume that this problem is caused by the high sensitivity of neural feature extraction to small changes in vision and language inputs. To mitigate this problem, we propose a neuro-symbolic approach that utilizes high-level symbolic features, which are robust to small changes in raw inputs, as intermediate representations. We verify the effectiveness of our model with the subtask evaluation on the ALFRED benchmark. Our experiments show that our approach significantly outperforms the end-to-end neural model by 9, 46, and 74 points in the success rate on the ToggleObject, PickupObject, and SliceObject subtasks in unseen environments respectively.


Poster 1 - Recombining Vision Transformer Architecture for Fine-Grained Visual Categorization

Authors: Deng, Xuran; Liu, Chuanbin; Lu, Zhiying


Fine-grained visual categorization (FGVC) is a challenging task in the image analysis field which requires comprehensive discriminative feature extraction and representation. To get around this problem, previous works focus on designing complex modules, the so-called necks and heads, over simple backbones, while bringing a huge computational burden. In this paper, we bring a new insight: Vision Transformer itself is an all-in-one FGVC framework that consists of a basic Backbone for feature extraction, a Neck for further feature enhancement and a Head for selecting discriminative features. We delve into the feature extraction and representation pattern of ViT for FGVC and empirically show that simply recombining the original ViT structure to leverage multi-level semantic representation, without introducing any other parameters, is able to achieve higher performance. Based on this insight, we propose RecViT, a simple recombination and modification of the original ViT, which can capture multi-level semantic features and facilitate fine-grained recognition. In RecViT, the deep layers of the original ViT serve as the Head, a few middle layers as the Neck and the shallow layers as the Backbone. In addition, we adopt an optional Feature Processing Module to enhance discriminative feature representation at each semantic level and align them for final recognition. With the above simple modifications, RecViT obtains significant improvement in accuracy on FGVC benchmarks: CUB-200-2011, Stanford Cars and Stanford Dogs.

Poster 2 - HSS: A Hierarchical Semantic Similarity Hard Negative Sampling Method for Dense Retrievers

Authors: Xie, Xinjia; Liu, Feng; Gai, Shun; Huang, Zhen; Hu, Minghao; Wang, Ankun


Dense Retriever (DR) for Open-domain textual question answering (OpenQA), which aims to retrieve passages from large data sources like Wikipedia or Google, has gained wide attention in recent years. Although DR models continuously refresh state-of-the-art performances, their improvement relies on negative sampling during the training process. Existing sampling strategies mainly focus on developing a complex algorithm based on computer science, and ignore the abundant semantic features of datasets. We discover that there exist obvious changes in semantic similarity and present a three-level hierarchy of semantic similarity: same topic, same class, other class, whose rationality is further demonstrated by an ablation study. Based on this, we propose a hard negative sampling strategy named Hierarchical Semantic Similarity (HSS). Our HISSIM model performs negative sampling at the semantic levels of topic and class, and experimental results on four datasets show that it achieves comparable or better retrieval performance compared with existing competitive baselines. The code is available in

Poster 3 - Transformer-based Cross-Modal Recipe Embeddings with Large Batch Training

Authors: Yang, Jing; Chen, Junwen; Yanai, Keiji


Cross-modal recipe retrieval concerns mutual retrieval between recipe images and texts, which is clear for humans but arduous to formulate. Although many previous works endeavored to solve this problem, most works did not efficiently exploit the cross-modal information among recipe data. In this paper, we present a frustratingly straightforward cross-modal recipe retrieval framework, Transformer-based Network for Large Batch Training (TNLBT), which is designed to efficiently exploit the rich cross-modal information and achieves high performance on both recipe retrieval and image generation tasks. In our proposed framework, Transformer-based encoders are applied for both image and text encoding for cross-modal embedding learning. Furthermore, we apply a generative adversarial network to force the learned recipe embedding to recover visual information, encouraging models to retain modality-specific information. We also adopt several loss functions, such as a self-supervised learning loss on recipe text, to further promote the cross-modal embedding learning. Since contrastive learning could benefit from a larger batch size according to the recent literature on self-supervised learning, we also adopt a large batch size during training and have validated its effectiveness. In the experiments, the proposed framework significantly outperformed the current state-of-the-art frameworks in both cross-modal recipe retrieval and image generation tasks on the benchmark Recipe1M by a huge margin. We also found that CLIP-ViT performs better than ViT-B as the image encoder backbone in the experiment. This is the first work which confirmed the effectiveness of large batch training on cross-modal recipe embedding learning.

Poster 4 - Interpretable Driver Fatigue Estimation Based on Hierarchical Symptom Representations

Authors: Lin, Jiaqin; Du, Shaoyi; Liu, Yuying; Tian, Zhiqiang; Qu, Ting; Zheng, Nanning


Traffic accidents caused by driver fatigue lead to millions of deaths and financial losses every year. Current end-to-end methods for driver fatigue detection are not capable of distinguishing the detailed fatigue symptoms and interpretably inferring the fatigue state. In this paper, we propose an interpretable driver fatigue detection method with hierarchical fatigue symptom representations. In pursuit of a more general and interpretable driver fatigue detection approach, we propose to detect detailed fatigue symptoms before inferring the driver state. First of all, we propose a hierarchical method that accurately classifies abnormal behaviors into detailed fatigue symptoms. Moreover, to fuse the fatigue symptom detection results accurately and efficiently, we propose an effective and interpretable fatigue estimation method with maximum a posteriori and experience constraints. Finally, we evaluate the proposed method on the driver fatigue detection benchmark dataset, and the experimental results confirm the feasibility and effectiveness of the proposed method.


Poster 1 - A Length-sensitive Language-bound Recognition Network for Multilingual Text Recognition

Authors: Gao, Ming; Wu, Shilian; Wang, Zengfu


Due to the widespread use of English, considerable attention has been paid to scene text recognition with English as the target language, rather than multilingual scene text recognition. However, it is increasingly necessary to recognize multilingual texts with the continuous advancement of global integration. In this paper, a Length-sensitive Language-bound Recognition Network (LLRN) is proposed for multilingual text recognition. LLRN follows the traditional encoder-decoder structure. We improve the encoder and decoder respectively to better adapt to multilingual text recognition. On the one hand, we propose a Length-sensitive Encoder (LE) to encode features of different scales for long-text images and short-text images respectively. On the other hand, we present a Language-bound Decoder (LD). LD leverages language prior information to constrain the original output of the decoder to further modify the recognition results. Moreover, to solve the problem of multilingual data imbalance, we propose a Language-balanced Data Augmentation (LDA) approach. Experiments show that our method outperforms English-oriented mainstream models and achieves state-of-the-art results on the MLT-2019 multilingual recognition benchmark.

Poster 2 - Realtime Sitting Posture Recognition on Embedded Device

Authors: Fang, Jingsen; Shi, Shoudong; Fang, Yi; Huo, Zheng


It is difficult to maintain a standard sitting posture for long periods of time, and a non-standard sitting posture can damage human health. Therefore, it is important to detect sitting posture in real time and remind users to adjust to a healthy sitting posture. Deep learning-based sitting posture recognition methods currently achieve high recognition accuracy, but the models cannot combine high accuracy and speed on embedded platforms, which in turn makes them difficult to apply in edge intelligence. To overcome this challenge, we propose a fast sitting posture recognition method based on OpenPose, using a ShuffleNetV2 network to replace the original backbone network to extract the underlying features, and using a cosine information distance to find redundant filters to prune and optimize the model. The lightweight model obtained after pruning enables more efficient real-time interaction. At the same time, the sitting posture recognition method is improved by fusing joint distance and angle features on the basis of skeletal joint features to improve the accuracy while maintaining the recognition speed. The optimized model can not only run at 8 fps on the Jetson Nano embedded device, but can also ensure recognition accuracy of 94.73%. Experimental results show that the improved model can meet the requirements of real-time sitting posture detection on embedded devices.
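
To make the idea of finding redundant filters with a cosine-based measure concrete, here is a minimal sketch that flags a flattened filter as redundant when it points in almost the same direction as a filter that has already been kept. Plain cosine similarity, the 0.95 threshold, the function names, and the toy weights are all assumptions for illustration; the abstract does not define the "cosine information distance" it uses.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two flattened filters."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def redundant_filters(filters, threshold=0.95):
    """Return indices of filters that nearly duplicate an earlier kept filter."""
    kept, redundant = [], []
    for i, f in enumerate(filters):
        # Compare only against filters we decided to keep so far.
        if any(cosine_similarity(f, filters[k]) >= threshold for k in kept):
            redundant.append(i)
        else:
            kept.append(i)
    return redundant

# Hypothetical flattened filter weights.
filters = [
    [1.0, 0.0, -1.0, 0.0],
    [0.0, 1.0, 0.0, -1.0],
    [2.0, 0.1, -2.0, 0.0],  # almost a scaled copy of filter 0
]
pruned = redundant_filters(filters)
```

Cosine similarity ignores the magnitude of the weights, so a scaled copy of a filter counts as redundant; a real pruning pipeline would also need to fine-tune the network after removing the flagged filters.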

Poster 3 - Self-supervised Multi-object Tracking with Cycle-consistency

Authors: Yin, Yuanhang; Hua, Yang; Song, Tao; Ma, Ruhui; Guan, Haibin


Multi-object tracking is a challenging video task that requires both locating the objects in the frames and associating the objects among the frames, which usually utilizes the tracking-by-detection paradigm. Supervised multi-object tracking methods have made stunning progress recently, however, the expensive annotation costs for bounding boxes and track ID labels limit the robustness and generalization ability of these models. In this paper, we learn a novel multi-object tracker using only unlabeled videos by designing a self-supervisory learning signal for an association model. Specifically, inspired by the cycle-consistency used in video correspondence learning, we propose to track the objects forwards and backwards, i.e., each detection in the first frame is supposed to be matched with itself after the forward-backward tracking. We utilize this cycle-consistency as the self-supervisory learning signal for our proposed multi-object tracker. Experiments conducted on the MOT17 dataset show that our model is effective in extracting discriminative association features, and our tracker achieves competitive performance compared to other trackers using the same pre-generated detections, including UNS20, Tracktor++, FAMNet, and CenterTrack.
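
The forward-backward idea can be sketched as a small loss computation: turn a detection-to-detection similarity matrix into soft forward and backward assignments, chain them, and penalise any detection that does not return to itself. The softmax temperature, the function names, and the toy similarity values are assumptions for illustration, not the formulation used by the authors.

```python
import math

def softmax_rows(matrix, temperature=0.1):
    """Row-wise softmax turning similarities into soft assignment probabilities."""
    out = []
    for row in matrix:
        peak = max(row)
        exps = [math.exp((x - peak) / temperature) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def cycle_consistency_loss(similarity):
    """Negative log-probability that each detection returns to itself."""
    forward = softmax_rows(similarity)              # frame 1 -> frame 2
    backward = softmax_rows(transpose(similarity))  # frame 2 -> frame 1
    round_trip = matmul(forward, backward)          # frame 1 -> frame 2 -> frame 1
    n = len(round_trip)
    return -sum(math.log(round_trip[i][i] + 1e-12) for i in range(n)) / n

# Hypothetical similarities between 2 detections in each of two frames.
discriminative = [[0.9, 0.1], [0.2, 0.8]]
ambiguous = [[0.5, 0.5], [0.5, 0.5]]

loss_discriminative = cycle_consistency_loss(discriminative)
loss_ambiguous = cycle_consistency_loss(ambiguous)
```

The loss is low only when the association features separate the detections well enough for the round trip to land on the starting detection, which is why no identity labels are needed to compute it.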

Poster 4 - VAISL: Visual-Aware Identification of Semantic Locations in Lifelog

Authors: Tran, Ly-Duyen; Nie, Dongyun; Gurrin, Cathal; Nguyen, Binh; Zhou, Liting


Organising and preprocessing are crucial steps in order to perform analysis on lifelogs. This paper presents a method for preprocessing, enriching, and segmenting lifelogs based on GPS trajectories and images captured from wearable cameras. The proposed method consists of four components: data cleaning, stop/trip point classification, post-processing, and event characterisation. The novelty of this paper lies in the incorporation of a visual module (using a pretrained CLIP model) to improve outlier detection, correct classification errors, and identify each event's movement mode or location name. This visual component is capable of addressing imprecise boundaries in GPS trajectories and the partition of clusters due to data drift. The results are encouraging, which further emphasises the importance of visual analytics for organising lifelog data.


Poster 1 - Lightweight Multi-Level Information Fusion Network for Facial Expression Recognition

Authors: Zhang, Yuan; Tian, Xiang; Zhang, Ziyang; Xu, Xiangmin


The increasing capability of networks for facial expression recognition with disturbing factors is often accompanied by a large computational burden, which imposes limitations on practical applications. In this paper, we propose a lightweight multi-level information fusion network with distillation loss, which can be more lightweight compared with other methods without losing accuracy. The multi-level information fusion block uses fewer parameters to focus on information from multiple levels with greater detail awareness, and the channel attention used in this block allows the network to concentrate more on sensitive information when processing facial images with disturbing factors. In addition, the distillation loss makes the network less susceptible to the errors of the teacher network. The proposed method has the fewest parameters of 0.98 million and GFLOPs of 0.142 compared with the state-of-the-art methods while achieving 88.95%, 64.77%, 60.63%, and 62.28% on the datasets RAF-DB, AffectNet-7, AffectNet-8, and SFEW, respectively. Abundant experimental results show the effectiveness of the method.

Poster 2 - Comparison of deep learning techniques for video-based automatic recognition of Greek folk dances

Authors: Loupas, Georgios; Pistola, Theodora; Diplaris, Sotiris; Ioannidis, Konstantinos; Vrochidis, Stefanos; Kompatsiaris, Ioannis


Folk dances constitute an important part of the Intangible Cultural Heritage (ICH) of each place. Nowadays, there is a large number of videos related to folk dances. An automatic dance recognition algorithm can ease the management of this content and strengthen the promotion of folk dances to the younger generations. Automatic dance recognition is still an open research area that belongs to the more general field of human activity recognition. Our work focuses on the exploration of existing deep neural network architectures for automatic recognition of Greek folk dances depicted in standard videos, as well as the experimentation with different representations of input. For our experiments, we have collected YouTube videos of Greek folk dances from north-eastern Greece. Specifically, we have validated three different deep neural network architectures using raw RGB and grayscale video frames, optical flow, as well as “visualised” multi-person 2D poses. In this paper, we describe our experiments, and, finally, we present the results and findings of the conducted research.

Poster 3 - Video-based Precipitation Intensity Recognition using Dual-dimension and Dual-scale Spatiotemporal Convolutional Neural Network

Authors: Lin, Chih-Wei; Chen, Zhongsheng; Huang, Xiuping; Yang, Suhui


This paper proposes the dual-dimension and dual-scale spatiotemporal convolutional neural network, namely DDS-CNN, which consists of two modules, the global spatiotemporal module (GSM) and the local spatiotemporal module (LSM), for precipitation intensity recognition. The GSM uses 3D LSTM operations to study the influence of the relationship between sampling points on precipitation. The LSM takes 4D convolution operations and forms the convolution branches with various convolution kernels to learn the rain pattern of different precipitation. We evaluate the performance of DDS-CNN using the self-collected dataset, IMLab-RAIN-2018, and compare it with the state-of-the-art 3D models. DDS-CNN has the highest overall accuracy and achieves 98.63%. Moreover, we execute the ablation experiments to prove the effectiveness of the proposed modules.

Poster 4 - Multi-Scale Gaussian Difference Preprocessing and Dual Stream CNN-Transformer Hybrid Network for Skin Lesion Segmentation

Authors: Zhao, Xin; Ren, Zhihang


Skin lesion segmentation from dermoscopic images has been a long-standing challenging problem, which is important for improving the analysis of skin cancer. Due to the large variation of melanin in the lesion area, the large number of hairs covering the lesion area, and the unclear boundary of the lesion, most previous works struggled to accurately segment the lesion area. In this paper, we propose a Multi-Scale Gaussian Difference Preprocessing and Dual Stream CNN-Transformer Hybrid Network for Skin Lesion Segmentation, which can accurately segment a high-fidelity lesion area from a dermoscopic image. Specifically, we design three specific sets of Gaussian difference convolution kernels to significantly enhance the lesion area and its edge information, conservatively enhance the lesion area and its edge information, and remove noise features such as hair. Through the information enhancement of multi-scale Gaussian convolution, the model can easily extract and represent the enhanced lesion information and lesion edge information while reducing the noise information. Secondly, we adopt a dual stream network to extract features from the Gaussian difference image and the original image separately and fuse them in the feature space to accurately align the feature information. Thirdly, we apply convolutional neural network (CNN) and vision transformer (ViT) hybrid architectures to better exploit the local and global information. Finally, we use the coordinate attention mechanism and the self-attention mechanism to enhance the sensitivity to the necessary features. Extensive experimental results on the ISIC 2016, PH2, and ISIC 2018 datasets demonstrate that our proposed approach achieves compelling performance in skin lesion segmentation.
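
For readers unfamiliar with Gaussian difference kernels, the sketch below builds a single difference-of-Gaussians kernel by subtracting a wide normalised Gaussian from a narrow one. The kernel size, the two sigma values, and the function names are assumptions for illustration; the three specific kernel sets designed in the paper are not given in the abstract.

```python
import math

def gaussian_kernel(size, sigma):
    """Normalised 2D Gaussian kernel of odd side length `size`."""
    half = size // 2
    kernel = [
        [math.exp(-(x * x + y * y) / (2 * sigma * sigma)) for x in range(-half, half + 1)]
        for y in range(-half, half + 1)
    ]
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]

def dog_kernel(size, sigma_narrow, sigma_wide):
    """Difference of Gaussians: narrow blur minus wide blur (a band-pass filter)."""
    narrow = gaussian_kernel(size, sigma_narrow)
    wide = gaussian_kernel(size, sigma_wide)
    return [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(narrow, wide)]

# Hypothetical parameters: 5x5 kernel, sigmas of 1.0 and 2.0.
kernel = dog_kernel(5, 1.0, 2.0)
kernel_sum = sum(sum(row) for row in kernel)
```

Since both Gaussians are normalised, the kernel sums to zero: flat regions are suppressed while structures at a scale between the two sigmas, such as edges, are emphasised. Varying the sigma pair changes which scales are enhanced.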


Poster 1 - Practical Analyses of How Common Social Media Platforms and Photo Storage Services Handle Uploaded Images

Authors: Dang-Nguyen, Duc-Tien; Sjøen, Vegard Velle; Le, Dinh-Hai; Dao, Thien-Phu; Tran, Anh-Duy; Tran, Minh-Triet


The research done in this study has delved deeply into the changes made to digital images that are uploaded to three of the major social media platforms and image storage services in today's society: Facebook, Flickr, and Google Photos. In addition to providing up-to-date data on an ever-changing landscape of different social media networks' digital fingerprints, a deep analysis of the social networks' filename conventions has resulted in two new approaches in (i) estimating the true upload date of Flickr photos, regardless of whether the dates have been changed by the user or not, and regardless of whether the image is available to the public or has been deleted from the platform; (ii) revealing the photo ID of a photo uploaded to Facebook based solely on the file name of the photo.

Poster 2 - Dynamic Feature Selection for Structural Image Content Recognition

Authors: Fu, Yingnan; Zheng, Shu; Cai, Wenyuan; Gao, Ming; Jin, Cheqing; Zhou, Aoying


Structural image content recognition (SICR) aims to transcribe a two-dimensional structural image (e.g., mathematical expression, chemical formula, or music score) into a token sequence. Existing methods are mainly encoder-decoder based and overlook the importance of feature selection and spatial relation extraction in the feature map. In this paper, we propose DEAL (short for Dynamic fEAture seLection) for SICR, which contains a dynamic feature selector and a spatial relation extractor as two cornerstone modules. Specifically, we propose a novel loss function and random exploration strategy to dynamically select useful image cells for target sequence generation. Further, we consider the positional and surrounding information of cells in the feature map to extract spatial relations. We conduct extensive experiments to evaluate the performance of DEAL. Experimental results show that DEAL significantly outperforms other state-of-the-art methods.

Poster 3 - Multi-Layered Projected Entangled Pair States for Image Classification

Authors: Li, Lei; Lai, Hong


Tensor networks are a numerical tool for quantum many-body systems and have been widely used in machine learning. The projected entangled pair states (PEPS) tensor network, whose structure is similar to that of an image, achieves significant superiority compared to other tensor network models in the image classification task. Based on the PEPS tensor network, this paper constructs a Multi-layered PEPS (MLPEPS) tensor network model for supervised learning. PEPS is used to extract features layer by layer from the image mapped to the Hilbert space, which fully utilizes the correlation between pixels while retaining the global structural information of the image. In classification tasks on the Fashion-MNIST and COVID-19 Radiography datasets, MLPEPS outperforms other tensor network models including PEPS. Under the same experimental conditions, the learning ability of MLPEPS is close to that of some well-known neural networks. MLPEPS can obtain different structures by setting different numbers of layers and PEPS block sizes, so it has great potential in machine learning.

Poster 4 - AutoRF: Auto Learning Receptive Fields with Spatial Pooling

Authors: Dong, Peijie; Niu, Xin; Pan, Hengyue; Li, Dongsheng; Huang, Zhen


The search space is crucial in neural architecture search (NAS) and can determine the upper limit of performance. Most methods focus on depth and width when designing the search space, ignoring the receptive field. With a larger receptive field, the model is able to aggregate hierarchical information and strengthen its representational power. However, expanding the receptive field directly with large convolution kernels incurs high computational complexity. We instead enlarge the receptive field by introducing pooling operations with little overhead. In this paper, we propose a pooling-based auto-learning approach for receptive field search named Auto Learning Receptive Fields (AutoRF), which is the first attempt at automatic attention module design with regard to the adaptive receptive field. In theory, our proposed search space encompasses typical multi-scale receptive field integration modules. Detailed experiments demonstrate the generalization ability of AutoRF and show that it outperforms various hand-crafted methods as well as NAS-based ones.
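As a rough illustration of why pooling is a cheap way to enlarge the receptive field, the sketch below applies the standard receptive-field recurrence to a stack of layers. It is not the AutoRF search procedure; the function name `receptive_field` and the example layer stacks are assumptions made for this illustration only.

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels) of a stack of layers.

    `layers` is a list of (kernel_size, stride) pairs applied in order.
    This helper is hypothetical and not part of AutoRF.
    """
    rf = 1    # a single unit initially sees one input pixel
    jump = 1  # distance in input pixels between adjacent units
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # each layer widens the view
        jump *= stride                  # strided layers spread units apart
    return rf


# Three 3x3 convolutions with stride 1.
convs_only = [(3, 1), (3, 1), (3, 1)]
# The same stack with one parameter-free 2x2 pooling layer (stride 2) inserted.
with_pooling = [(3, 1), (2, 2), (3, 1), (3, 1)]

print(receptive_field(convs_only))    # 7
print(receptive_field(with_pooling))  # 12
```

In this toy setting, a single pooling layer with no learned weights grows the receptive field from 7 to 12 pixels, whereas reaching the same size with convolutions alone would require larger kernels or more layers.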


Poster 3 - Low-light image enhancement based on U-Net and Haar wavelet pooling

Authors: Batziou, Elissavet; Ioannidis, Konstantinos; Patras, Ioannis; Vrochidis, Stefanos; Kompatsiaris, Ioannis


Inevitable environmental and technical limitations of image capture mean that many images are taken in inadequate and unbalanced lighting conditions. Low-light image enhancement has been very popular for improving the visual quality of image representations, while low-light images often require advanced techniques to improve the perception of information for a human viewer. One of the main objectives in improving the lighting conditions is to retain patterns, texture, and style with minimal deviations from the considered image. To this end, we propose a low-light image enhancement method with Haar wavelet-based pooling to preserve texture regions and increase their quality. The presented framework is based on the U-Net architecture to retain spatial information, with a multi-layer feature aggregation (MFA) method. The method obtains the details from the low-level layers in the stylization processing. The encoder is based on dense blocks, while the decoder is the reverse of the encoder and extracts features that reconstruct the image. Experimental results show that the combination of the U-Net architecture with dense blocks and the wavelet-based pooling mechanism is an efficient approach for low-light image enhancement applications. Qualitative and quantitative evaluation demonstrates that the proposed framework reaches state-of-the-art accuracy but with fewer resources than LeGAN.
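The abstract does not spell out how the Haar wavelet-based pooling is implemented. A minimal sketch, assuming the standard orthonormal 2x2 Haar transform on a single-channel array with even height and width, is shown below; the function names `haar_pool` and `haar_unpool` are hypothetical, and subband naming conventions vary between sources.

```python
import numpy as np


def haar_pool(x):
    """Split a 2D array into four half-resolution Haar subbands."""
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # local average (low-pass in both directions)
    lh = (a - b + c - d) / 2.0  # left-minus-right differences
    hl = (a + b - c - d) / 2.0  # top-minus-bottom differences
    hh = (a - b - c + d) / 2.0  # diagonal differences
    return ll, lh, hl, hh


def haar_unpool(ll, lh, hl, hh):
    """Invert haar_pool exactly, restoring the full-resolution array."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x


image = np.arange(16, dtype=np.float64).reshape(4, 4)
subbands = haar_pool(image)
restored = haar_unpool(*subbands)
```

Unlike max or average pooling, this decomposition discards nothing: the three detail subbands carry the texture information that the low-pass band drops, so a decoder that receives them can reconstruct the input exactly.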

Poster 4 - In-air handwritten Chinese text recognition with attention convolutional recurrent network

Authors: Wu, Zhihong; Qu, Xiwen; Huang, Jun; Wu, Xuangou


In-air handwriting is a new and more natural way of human-computer interaction with broad application prospects. One existing approach to online handwritten Chinese text recognition converts the trajectory data into an image-like representation and uses a two-dimensional convolutional neural network (2DCNN) for feature extraction; another directly processes the trajectory sequence with Long Short-Term Memory (LSTM). However, when using a 2DCNN, much information is lost in the conversion into images, and LSTM networks are prone to gradient problems. We therefore propose an attention convolutional recurrent network (ACRN) for in-air handwritten Chinese text, which introduces a one-dimensional convolutional neural network (1DCNN) with dilated convolution to extract features from the trajectory data directly. After that, the ACRN uses LSTM combined with a multi-head attention mechanism to focus on key words in the handwritten Chinese text, mines multi-level dependencies, and passes the output to softmax for classification. Finally, the ACRN uses the Connectionist Temporal Classification (CTC) objective function, which requires no input-output alignment, to decode the coding results. We conduct experiments on the CASIA-OLHWDB2.0-2.2 dataset and the in-air handwritten Chinese text dataset IAHCT-UCAS2018. Experimental results demonstrate that, compared with previous methods, our method obtains a more compact model with higher recognition accuracy.
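The abstract does not say which CTC decoding strategy the ACRN uses. A minimal sketch of the simplest option, greedy (best-path) decoding, is given below under that assumption: take the most likely label per frame, merge consecutive repeats, then drop blanks. The function name `ctc_greedy_decode`, the blank index 0, and the toy scores are all hypothetical.

```python
def ctc_greedy_decode(frame_scores, blank=0):
    """Greedy CTC decoding of per-frame label scores.

    `frame_scores` is a list of frames, each a list of scores per label.
    Assumes label index `blank` is the CTC blank symbol.
    """
    # Most likely label at every time step.
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]

    decoded = []
    previous = None
    for label in best_path:
        # Keep a label only when it starts a new run and is not the blank.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Toy scores over 3 labels (0 = blank); the per-frame argmax is 1,1,0,1,2,2,0.
scores = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
    [0.8, 0.1, 0.1],
]
print(ctc_greedy_decode(scores))  # [1, 1, 2]
```

The blank between the second and fourth frames is what lets the same label appear twice in a row in the output, which is why no frame-level alignment between trajectory points and characters is needed.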


Research2Biz: From Research to Prototype

Place: MCB Media Lab - Time: 11:00 - 14:30, Friday, January 13, 2023


Chair: TBA

Students Take Charge of Climate Communication

Authors: Jensen, Fredrik Håland; Nordberg, Oda Elise; Opel, Andy; Nyre, Lars


It is an arduous task to communicate the gravity and complexity of climate change in an engaging and fact-based manner. One might even call this a wicked problem: a problem that is difficult to define and does not have one specific solution. Climate communication is a complex societal challenge that can only be tackled through dialogue and argumentation between a variety of stakeholders. In this paper we present a pedagogical approach in which thirty-one undergraduate Media and Interaction Design students collaborated with media companies to explore innovative ways to communicate climate change and make it more engaging for citizens. The students conducted multi-method evaluations of existing journalistic work with citizens from a variety of demographics, and then conceptualized and developed innovative prototypes for communicating climate change causes, impacts, and future potentials. This project demonstrates the potential of innovation pedagogy to establish transdisciplinary collaboration between students and industry partners when dealing with the wicked problems of climate change communication. We explain how the pedagogical method works, describe the results of the collaboration, and discuss the outcome. While the approach has potential to improve climate change communication, its normative, problem-oriented nature also leads to tension among students.