Advances in Electrical and Computer Engineering (AECE)
FACTS & FIGURES

JCR Impact Factor: 0.800
JCR 5-Year IF: 1.000
SCOPUS CiteScore: 2.0
Issues per year: 4
Current issue: Feb 2024
Next issue: May 2024
Avg review time: 77 days
Avg accept to publ: 48 days
APC: 300 EUR


PUBLISHER

Stefan cel Mare University of Suceava
Faculty of Electrical Engineering and Computer Science
13, Universitatii Street
Suceava - 720229
ROMANIA

Print ISSN: 1582-7445
Online ISSN: 1844-7600
WorldCat: 643243560
doi: 10.4316/AECE


TRAFFIC STATS

2,546,293 unique visits
1,011,822 downloads
Since November 1, 2009






LATEST NEWS

2023-Jun-28
Clarivate Analytics published the InCites Journal Citation Reports (JCR) for 2022. The JCR Impact Factor of Advances in Electrical and Computer Engineering is 0.800 (0.700 without journal self-cites), and the JCR 5-Year Impact Factor is 1.000.

2023-Jun-05
SCOPUS published the CiteScore for 2022, computed by using an improved methodology, counting the citations received in 2019-2022 and dividing the sum by the number of papers published in the same time frame. The CiteScore of Advances in Electrical and Computer Engineering for 2022 is 2.0. For "General Computer Science" we rank #134/233 and for "Electrical and Electronic Engineering" we rank #478/738.
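The CiteScore methodology described in this news item is a simple ratio: citations received over a four-year window divided by the number of documents published in the same window. A minimal sketch follows; the example figures are hypothetical, not SCOPUS data.

```python
def citescore(citations: int, documents: int) -> float:
    """CiteScore for year Y: citations received in years Y-3..Y by
    documents published in Y-3..Y, divided by the count of those documents.
    Reported to one decimal place, as SCOPUS does."""
    if documents == 0:
        raise ValueError("no published documents in the window")
    return round(citations / documents, 1)

# Hypothetical example: 240 citations to 120 documents over a
# four-year window yields a CiteScore of 2.0.
print(citescore(240, 120))  # 2.0
```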

2022-Jun-28
Clarivate Analytics published the InCites Journal Citation Reports (JCR) for 2021. The JCR Impact Factor of Advances in Electrical and Computer Engineering is 0.825 (0.722 without journal self-cites), and the JCR 5-Year Impact Factor is 0.752.

2022-Jun-16
SCOPUS published the CiteScore for 2021, computed by using an improved methodology, counting the citations received in 2018-2021 and dividing the sum by the number of papers published in the same time frame. The CiteScore of Advances in Electrical and Computer Engineering for 2021 is 2.5, the same as for 2020 but better than all our previous results.

2021-Jun-30
Clarivate Analytics published the InCites Journal Citation Reports (JCR) for 2020. The JCR Impact Factor of Advances in Electrical and Computer Engineering is 1.221 (1.053 without journal self-cites), and the JCR 5-Year Impact Factor is 0.961.



    
 


Video Moment Localization Network Based on Text Multi-semantic Clues Guidance

WU, G., XU, T.

Download PDF (1,730 KB) | Downloads: 451 | Views: 620

Author keywords
information retrieval, machine learning, computer vision, natural language processing, pattern matching

References keywords
vision(24), video(16), cvpr(16), temporal(14), moment(14), localization(14), recognition(12), language(12), iccv(12), videos(11)

About this article
Date of Publication: 2023-08-31
Volume 23, Issue 3, Year 2023, On page(s): 85 - 92
ISSN: 1582-7445, e-ISSN: 1844-7600
Digital Object Identifier: 10.4316/AECE.2023.03010
Web of Science Accession Number: 001062641900010
SCOPUS ID: 85172352256

Abstract
With the rapid development of the Internet and information technology, people can create multimedia data such as pictures and videos anytime and anywhere, so efficient tools are needed to process the resulting vast amount of video data. The video moment localization task aims to locate the moment in an untrimmed video that best matches a text query. Existing text-guided methods consider only single-scale text features, which cannot fully represent the semantics of the text, and they do not account for text information masking crucial video content when text is used to guide the extraction of video features. To solve these problems, we propose a video moment localization network based on text multi-semantic clue guidance. Specifically, we first design a text encoder based on a fusion gate that better captures the semantic information in the text through multi-semantic clues composed of word embeddings, local features, and global features. A text guidance module then guides the extraction of video features using text semantic features, highlighting the video features related to the text semantics. Experimental results on two datasets, Charades-STA and ActivityNet Captions, show that our approach provides significant improvements over state-of-the-art methods.
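The gated fusion of text features mentioned in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the gate parameterization, feature dimensions, and the use of a mean-pooled sentence vector as the global feature are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fusion_gate(local_feats, global_feat, W, b):
    """Gated fusion of per-word (local) and sentence-level (global) features.

    A learned gate decides, per word and per dimension, how much of the
    local feature versus the broadcast global feature to keep:
        g = sigmoid([local; global] W + b)
        fused = g * local + (1 - g) * global
    """
    T, d = local_feats.shape
    global_tiled = np.tile(global_feat, (T, 1))                   # (T, d)
    concat = np.concatenate([local_feats, global_tiled], axis=1)  # (T, 2d)
    g = sigmoid(concat @ W + b)                                   # (T, d)
    return g * local_feats + (1.0 - g) * global_tiled

# Toy sizes: 5 words, 8-dim features; weights are random stand-ins
# for learned parameters.
T, d = 5, 8
local = rng.standard_normal((T, d))
global_f = local.mean(axis=0)             # simple global summary vector
W = rng.standard_normal((2 * d, d)) * 0.1
b = np.zeros(d)
fused = fusion_gate(local, global_f, W, b)
print(fused.shape)  # (5, 8)
```

Because the gate output lies in (0, 1), each fused element is an elementwise convex combination of the local and global features.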


References

[1] Z. Shou, D. Wang, and S.-F. Chang, "Temporal action localization in untrimmed videos via multi-stage CNNs," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA: IEEE, Jun. 2016, pp. 1049-1058.
[CrossRef] [Web of Science Times Cited 585] [SCOPUS Times Cited 776]


[2] T. Lin, X. Zhao, and Z. Shou, "Single shot temporal action detection," in Proceedings of the 25th ACM international conference on Multimedia, Mountain View California USA: ACM, Oct. 2017, pp. 988-996.
[CrossRef] [Web of Science Times Cited 269] [SCOPUS Times Cited 330]


[3] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang, "CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI: IEEE, Jul. 2017, pp. 1417-1426.
[CrossRef] [Web of Science Times Cited 282] [SCOPUS Times Cited 432]


[4] T. Lin, X. Liu, X. Li, E. Ding, and S. Wen, "BMN: Boundary-matching network for temporal action proposal generation," in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South): IEEE, Oct. 2019, pp. 3888-3897.
[CrossRef] [Web of Science Times Cited 322] [SCOPUS Times Cited 401]


[5] J. Gao, C. Sun, Z. Yang, and R. Nevatia, "TALL: Temporal activity localization via language query," in 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017, pp. 5277-5285.
[CrossRef] [Web of Science Times Cited 279] [SCOPUS Times Cited 444]


[6] L. Anne Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell, "Localizing moments in video with natural language," in 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017, pp. 5804-5813.
[CrossRef] [Web of Science Times Cited 381] [SCOPUS Times Cited 520]


[7] H. Wang, Z.-J. Zha, X. Chen, Z. Xiong, and J. Luo, "Dual path interaction network for video moment localization," in Proceedings of the 28th ACM International Conference on Multimedia, Seattle WA USA: ACM, Oct. 2020, pp. 4116-4124.
[CrossRef] [Web of Science Times Cited 31] [SCOPUS Times Cited 39]


[8] H. Zhang, A. Sun, W. Jing, and J. T. Zhou, "Span-based localizing network for natural language video localization," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online: Association for Computational Linguistics, 2020, pp. 6543-6554.
[CrossRef]


[9] M. Soldan, M. Xu, S. Qu, J. Tegner, and B. Ghanem, "VLG-Net: Video-language graph matching network for video grounding," in 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada: IEEE, Oct. 2021, pp. 3217-3227.
[CrossRef] [Web of Science Times Cited 9] [SCOPUS Times Cited 33]


[10] Y. Hu, M. Liu, X. Su, Z. Gao, and L. Nie, "Video moment localization via deep cross-modal hashing," IEEE Trans. Image Process., vol. 30, pp. 4667-4677, 2021.
[CrossRef] [Web of Science Times Cited 49] [SCOPUS Times Cited 54]


[11] S. Paul, N. C. Mithun, and A. K. Roy-Chowdhury, "Text-based localization of moments in a video corpus," IEEE Trans. Image Process., vol. 30, pp. 8886-8899, 2021.
[CrossRef] [Web of Science Times Cited 7] [SCOPUS Times Cited 9]


[12] D. Zhang, X. Dai, X. Wang, Y.-F. Wang, and L. S. Davis, "MAN: Moment alignment network for natural language moment retrieval via iterative graph adjustment," in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA: IEEE, Jun. 2019, pp. 1247-1257.
[CrossRef] [Web of Science Times Cited 147] [SCOPUS Times Cited 210]


[13] D. Liu, X. Qu, X.-Y. Liu, J. Dong, P. Zhou, and Z. Xu, "Jointly cross- and self-modal graph attention network for query-based moment localization," in Proceedings of the 28th ACM International Conference on Multimedia, Seattle WA USA: ACM, Oct. 2020, pp. 4070-4078.
[CrossRef] [Web of Science Times Cited 40] [SCOPUS Times Cited 73]


[14] T. Xu, H. Du, E. Chen, J. Chen, and Y. Wu, "Cross-modal video moment retrieval based on visual-textual relationship alignment," Sci. Sin. Informationis, vol. 50, no. 6, pp. 862-876, Jun. 2020.
[CrossRef] [SCOPUS Times Cited 12]


[15] M. Liu, X. Wang, L. Nie, X. He, B. Chen, and T.-S. Chua, "Attentive moment retrieval in videos," in the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor MI USA: ACM, Jun. 2018, pp. 15-24.
[CrossRef] [Web of Science Times Cited 156] [SCOPUS Times Cited 212]


[16] H. Xu, K. He, B. A. Plummer, L. Sigal, S. Sclaroff, and K. Saenko, "Multilevel language and vision integration for text-to-clip retrieval," Proc. AAAI Conf. Artif. Intell., vol. 33, no. 01, pp. 9062-9069, Jul. 2019.
[CrossRef]


[17] Z. Lin, Z. Zhao, Z. Zhang, Z. Zhang, and D. Cai, "Moment retrieval via cross-modal interaction networks with query reconstruction," IEEE Trans. Image Process., vol. 29, pp. 3750-3762, 2020.
[CrossRef] [Web of Science Times Cited 32] [SCOPUS Times Cited 37]


[18] J. Wang, L. Ma, and W. Jiang, "Temporally grounding language queries in videos by contextual boundary-aware prediction," Proc. AAAI Conf. Artif. Intell., vol. 34, no. 07, Art. no. 07, Apr. 2020.
[CrossRef]


[19] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA: IEEE, Jun. 2014, pp. 1725-1732.
[CrossRef] [Web of Science Times Cited 4013] [SCOPUS Times Cited 5255]


[20] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015, pp. 4489-4497.
[CrossRef] [Web of Science Times Cited 4988] [SCOPUS Times Cited 6789]


[21] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the kinetics dataset," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 4724-4733.
[CrossRef] [Web of Science Times Cited 4296] [SCOPUS Times Cited 5257]


[22] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, Cambridge, MA, USA: MIT Press, 2014, pp. 568-576.
[CrossRef]


[23] H. Xu, A. Das, and K. Saenko, "R-C3D: Region convolutional 3D network for temporal activity detection," in 2017 IEEE International Conference on Computer Vision (ICCV), Venice: IEEE, Oct. 2017, pp. 5794-5803.
[CrossRef] [Web of Science Times Cited 405] [SCOPUS Times Cited 545]


[24] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137-1149, Jun. 2017.
[CrossRef] [Web of Science Times Cited 32999] [SCOPUS Times Cited 20414]


[25] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao, "A Multi-stream bi-directional recurrent neural network for fine-grained action detection," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA: IEEE, Jun. 2016, pp. 1961-1970.
[CrossRef] [Web of Science Times Cited 278] [SCOPUS Times Cited 373]


[26] S. Buch, V. Escorcia, B. Ghanem, and J. C. Niebles, "End-to-end, single-stream temporal action detection in untrimmed videos," in Proceedings of the British Machine Vision Conference 2017, London, UK: British Machine Vision Association, 2017, p. 93.
[CrossRef] [SCOPUS Times Cited 166]


[27] L. Wang et al., "Temporal segment networks: Towards good practices for deep action recognition," in Computer Vision - ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., in Lecture Notes in Computer Science, vol. 9912. Cham: Springer International Publishing, 2016, pp. 20-36.
[CrossRef] [Web of Science Times Cited 1861] [SCOPUS Times Cited 1950]


[28] B. Zhou, A. Andonian, A. Oliva, and A. Torralba, "Temporal relational reasoning in videos," in Computer Vision - ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds., in Lecture Notes in Computer Science, vol. 11205. Cham: Springer International Publishing, 2018, pp. 831-846.
[CrossRef]


[29] M. Liu, X. Wang, L. Nie, Q. Tian, B. Chen, and T.-S. Chua, "Cross-modal moment localization in videos," in Proceedings of the 26th ACM international conference on Multimedia, Seoul Republic of Korea: ACM, Oct. 2018, pp. 843-851.
[CrossRef] [Web of Science Times Cited 111] [SCOPUS Times Cited 135]


[30] Y. Yuan, T. Mei, and W. Zhu, "To find where you talk: Temporal sentence localization in video with attention based location regression," Proc. AAAI Conf. Artif. Intell., vol. 33, pp. 9159-9166, Jul. 2019.
[CrossRef]


[31] Y.-W. Chen, Y.-H. Tsai, and M.-H. Yang, "End-to-end multi-modal video temporal grounding," in 35th Conference on Neural Information Processing Systems, NeurIPS 2021. Neural Information Processing Systems Foundation, 2021, pp. 28442-28453.

[32] Y. Liu, S. Li, Y. Wu, C. W. Chen, Y. Shan, and X. Qie, "UMT: Unified multi-modal transformers for joint video moment retrieval and highlight detection," in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA: IEEE, Jun. 2022, pp. 3032-3041.
[CrossRef] [Web of Science Times Cited 18] [SCOPUS Times Cited 41]


[33] Y. Hu, L. Nie, M. Liu, K. Wang, Y. Wang, and X.-S. Hua, "Coarse-to-fine semantic alignment for cross-modal moment localization," IEEE Trans. Image Process., vol. 30, pp. 5933-5943, 2021.
[CrossRef] [Web of Science Times Cited 22] [SCOPUS Times Cited 24]


[34] Y. Zeng, "Point prompt tuning for temporally language grounding," in Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid Spain: ACM, Jul. 2022, pp. 2003-2007.
[CrossRef] [Web of Science Times Cited 8] [SCOPUS Times Cited 10]


[35] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735-1780, Nov. 1997.
[CrossRef] [SCOPUS Times Cited 64529]


[36] Y. Gong and S. Bowman, "Ruminating reader: Reasoning with gated multi-hop attention," in Proceedings of the Workshop on Machine Reading for Question Answering, Melbourne, Australia: Association for Computational Linguistics, 2018, pp. 1-11.
[CrossRef]


[37] K. Cho et al., "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar: Association for Computational Linguistics, 2014, pp. 1724-1734.
[CrossRef] [SCOPUS Times Cited 10188]


[38] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA: IEEE, Jun. 2018, pp. 7794-7803.
[CrossRef] [Web of Science Times Cited 4536] [SCOPUS Times Cited 7796]


[39] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta, "Hollywood in homes: Crowdsourcing data collection for activity understanding," in Computer Vision - ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., in Lecture Notes in Computer Science, vol. 9905. Cham: Springer International Publishing, 2016, pp. 510-526.
[CrossRef] [Web of Science Times Cited 533] [SCOPUS Times Cited 474]


[40] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles, "Dense-captioning events in videos," in 2017 IEEE International Conference on Computer Vision (ICCV), Venice: IEEE, Oct. 2017, pp. 706-715.
[CrossRef] [Web of Science Times Cited 545] [SCOPUS Times Cited 766]


[41] Y. Yuan, L. Ma, J. Wang, W. Liu, and W. Zhu, "Semantic conditioned dynamic modulation for temporal sentence grounding in videos," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 5, pp. 2725-2741, May 2022.
[CrossRef] [Web of Science Times Cited 5] [SCOPUS Times Cited 8]


[42] J. Pennington, R. Socher, and C. Manning, "Glove: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar: Association for Computational Linguistics, 2014, pp. 1532-1543.
[CrossRef] [SCOPUS Times Cited 23937]


[43] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.

[44] K. Li, D. Guo, and M. Wang, "Proposal-free video grounding with contextual pyramid network," Proc. AAAI Conf. Artif. Intell., vol. 35, no. 3, pp. 1902-1910, May 2021.
[CrossRef]


[45] M. Hahn, "Tripping through time: Efficient localization of activities in videos," Br. Mach. Vis. Conf. BMVC, 2020

[46] C. Rodriguez-Opazo, E. Marrese-Taylor, F. S. Saleh, H. Li, and S. Gould, "Proposal-free temporal moment localization of a natural-language query in video using guided attention," in 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA: IEEE, Mar. 2020, pp. 2453-2462.
[CrossRef] [Web of Science Times Cited 42] [SCOPUS Times Cited 89]


[47] L. Zhang and R. J. Radke, "Natural language video moment localization through query-controlled temporal convolution," in 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA: IEEE, Jan. 2022, pp. 2524-2532.
[CrossRef] [Web of Science Times Cited 2] [SCOPUS Times Cited 7]


[48] Y. Zeng, D. Cao, X. Wei, M. Liu, Z. Zhao, and Z. Qin, "Multi-modal relational graph for cross-modal video moment retrieval," in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA: IEEE, Jun. 2021, pp. 2215-2224.
[CrossRef] [Web of Science Times Cited 32] [SCOPUS Times Cited 46]






References Weight

Web of Science® Citations for all references: 57,283 TCR
SCOPUS® Citations for all references: 152,381 TCR

Web of Science® Average Citations per reference: 1,146 ACR
SCOPUS® Average Citations per reference: 3,048 ACR

TCR = Total Citations for References / ACR = Average Citations per Reference
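The TCR and ACR definitions above reduce to a sum and an average. A minimal sketch follows, using the Web of Science citation counts of references [1]-[4] above purely as sample inputs; the site's reported totals are computed over the full reference list.

```python
def references_weight(citations_per_ref):
    """TCR = total citations across all references;
    ACR = TCR divided by the number of references."""
    tcr = sum(citations_per_ref)
    acr = tcr / len(citations_per_ref)
    return tcr, acr

# Web of Science counts for references [1]-[4] of this article.
tcr, acr = references_weight([585, 269, 282, 322])
print(tcr, acr)  # 1458 364.5
```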

In 2010 we introduced, for the first time in scientific publishing, the term "References Weight" as a quantitative indication of the quality ...

Citations for references updated on 2024-04-25 12:42 in 319 seconds.




Note1: Web of Science® is a registered trademark of Clarivate Analytics.
Note2: SCOPUS® is a registered trademark of Elsevier B.V.
Disclaimer: All queries to the respective databases were made by using the DOI record of every reference (where available). Due to technical problems beyond our control, the information is not always accurate. Please use the CrossRef link to visit the respective publisher site.

Copyright ©2001-2024
Faculty of Electrical Engineering and Computer Science
Stefan cel Mare University of Suceava, Romania


All rights reserved: Advances in Electrical and Computer Engineering is a registered trademark of the Stefan cel Mare University of Suceava. No part of this publication may be reproduced, stored in a retrieval system, photocopied, recorded or archived, without the written permission from the Editor. When authors submit their papers for publication, they agree that the copyright for their article be transferred to the Faculty of Electrical Engineering and Computer Science, Stefan cel Mare University of Suceava, Romania, if and only if the articles are accepted for publication. The copyright covers the exclusive rights to reproduce and distribute the article, including reprints and translations.

Permission for other use: The copyright owner's consent does not extend to copying for general distribution, for promotion, for creating new works, or for resale. Specific written permission must be obtained from the Editor for such copying. Direct linking to files hosted on this website is strictly prohibited.

Disclaimer: Whilst every effort is made by the publishers and editorial board to see that no inaccurate or misleading data, opinions or statements appear in this journal, they wish to make it clear that all information and opinions formulated in the articles, as well as linguistic accuracy, are the sole responsibility of the author.



