Bidirectional Temporal Context Fusion with Bi-Modal Semantic Features using a gating mechanism for Dense Video Captioning

Khaled, Noorhan and Aref, M and Marey, Mohammed (2021) Bidirectional Temporal Context Fusion with Bi-Modal Semantic Features using a gating mechanism for Dense Video Captioning. International Journal of Intelligent Computing and Information Sciences, 21 (2). pp. 1-22. ISSN 2535-1710

Official URL: https://doi.org/10.21608/ijicis.2021.60216.1055

Abstract

Dense video captioning involves detecting interesting events in an untrimmed video and generating a textual description for each one. Many machine-intelligence applications, such as video summarization, search and retrieval, and automatic video subtitling to support blind people, benefit from an automated dense caption generator. Most recent works use an encoder-decoder neural network framework, which employs a 3D-CNN as an encoder to represent the frames of a detected event and an RNN as a decoder to generate the caption. They follow an attention-based mechanism to learn where to focus in the encoded video frames during caption generation. Although attention-based approaches have achieved excellent results, they link visual features directly to textual captions and ignore the rich intermediate and high-level video concepts such as people, objects, scenes, and actions. In this paper, we first propose to obtain a better event representation, one that discriminates between events ending at nearly the same time, by applying an attention-based fusion in which the hidden states of a bidirectional LSTM sequence video encoder, which encodes the past and future context surrounding a detected event, are fused with the event's visual (R3D) features. Second, we propose to explicitly extract bi-modal semantic concepts (nouns and verbs) from a detected event segment and to balance the contributions of the proposed event representation and the semantic concepts dynamically, using a gating mechanism during captioning. Experimental results demonstrate that the proposed attention-based fusion represents an event better for captioning, and that incorporating semantic concepts improves captioning performance.
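
The two ideas in the abstract, attention-based fusion of context and visual features followed by a gate that balances the fused event representation against semantic concepts, can be illustrated with a minimal sketch. The abstract does not give the paper's equations, so everything below is an assumption made for illustration: the dot-product scoring, the single scalar gate, the function names (`attention_fuse`, `gated_mix`), and the toy three-dimensional vectors are hypothetical and are not taken from the article.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention_fuse(query, candidates):
    """Fuse candidate vectors into one vector (hypothetical helper).

    Each candidate is scored by its dot product with the query; the
    scores are normalised with softmax and used as mixing weights.
    """
    scores = [sum(q * c for q, c in zip(query, cand)) for cand in candidates]
    weights = softmax(scores)
    dim = len(candidates[0])
    fused = [
        sum(w * cand[i] for w, cand in zip(weights, candidates))
        for i in range(dim)
    ]
    return fused, weights


def gated_mix(event_vec, semantic_vec, gate_weights, gate_bias):
    """Balance two vectors with one scalar gate (hypothetical helper).

    g = sigmoid(w . [event; semantic] + b)
    output = g * event + (1 - g) * semantic
    """
    concat = event_vec + semantic_vec  # list concatenation
    g = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    mixed = [g * e + (1.0 - g) * s for e, s in zip(event_vec, semantic_vec)]
    return mixed, g


# Toy inputs (assumed values, not data from the paper).
past_state = [0.2, -0.1, 0.4]     # stands in for a forward LSTM hidden state
future_state = [0.0, 0.3, -0.2]   # stands in for a backward LSTM hidden state
visual_feat = [0.5, 0.1, 0.1]     # stands in for pooled R3D features
decoder_state = [0.1, 0.2, 0.3]   # query, e.g. the caption decoder state

event_vec, attn = attention_fuse(
    decoder_state, [past_state, future_state, visual_feat]
)

semantic_vec = [0.9, 0.0, 0.1]    # stands in for noun/verb concept scores
gate_weights = [0.0] * 6          # untrained gate: all-zero weights
mixed, g = gated_mix(event_vec, semantic_vec, gate_weights, 0.0)
```

With all-zero gate weights and zero bias the gate evaluates to 0.5, so the two sources contribute equally. In a trained model the gate parameters would be learned, letting the balance shift at each decoding step.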

Item Type:	Article
Subjects:	STM Library > Computer Science
Depositing User:	Managing Editor
Date Deposited:	29 Jun 2023 03:42
Last Modified:	17 Oct 2023 05:07
URI:	http://open.journal4submit.com/id/eprint/2412
