Multimodal Concept Detection and Annotation in Image and Video Collections

School of Science | Doctoral thesis (article-based) | Defence date: 2020-08-14
114 + app. 114
Aalto University publication series DOCTORAL DISSERTATIONS, 104/2020
The World Wide Web has become a commonplace source of information for all kinds of purposes. The amount of data a single user may deal with has become large and continues to grow. The data relevant to users have become not only large but also diverse. Hence, searching for relevant information in such large and diverse resources is a critical task. However, users cannot always formulate appropriate queries for finding the desired resources. In order to retrieve relevant information, the semantic relationships of the information in the different modalities need to be known and specified.

This thesis approaches the multimodal cross-domain semantic retrieval and fusion problem from the point of view of content-based visual analysis and statistical natural language analysis. It also aims at using cross-domain textual semantics to generate pseudo tags for images to improve the performance of the information retrieval task. The main focus of the thesis is on bridging the semantic gap between the textual and visual content domains.

In order to combine and project the unimodal information into a multimodal space, two approaches are used: one is the Multimodal Deep Boltzmann Machine (DBM) and the other is the late fusion of unimodal Support Vector Machines (SVMs). One problem of the non-linear SVM approach is its high computational cost. In this dissertation, the homogeneous kernel map method is used to improve the efficiency of the SVMs. In our experiments, we adopted deep convolutional neural network features, particularly GoogLeNet features, and the retrieval results of the SVM-based approaches improved to be nearly equal to those of the Multimodal DBM approach.

One drawback of the multimodal information retrieval task is the requirement to be able to perform queries in each unimodal domain. In our experiments, if the query for the image domain is missing or not appropriate, the approach reduces to ordinary text search.
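The homogeneous kernel map mentioned above replaces an additive non-linear kernel with an explicit finite-dimensional feature map, so that a fast linear SVM can stand in for the expensive kernelized one. A minimal NumPy sketch of the Vedaldi–Zisserman closed-form approximation for the chi-squared kernel follows; the sample period `L` and order `n` are illustrative choices, not the settings used in the thesis:

```python
import numpy as np

def chi2_feature_map(x, n=3, L=0.5):
    """Explicit (2n+1)-dimensional homogeneous kernel map for the
    chi-squared kernel k(x, y) = 2xy / (x + y), using the closed-form
    spectrum kappa(w) = sech(pi * w). Inputs must be positive."""
    x = np.asarray(x, dtype=float)
    kappa = lambda w: 1.0 / np.cosh(np.pi * w)  # spectrum of the chi2 kernel
    feats = [np.sqrt(x * L * kappa(0.0))]
    for j in range(1, n + 1):
        scale = np.sqrt(2.0 * x * L * kappa(j * L))
        feats.append(scale * np.cos(j * L * np.log(x)))
        feats.append(scale * np.sin(j * L * np.log(x)))
    return np.stack(feats, axis=-1)

def chi2_kernel(x, y):
    """Exact additive chi-squared kernel for positive scalars."""
    return 2.0 * x * y / (x + y)

# The inner product of the mapped features approximates the kernel value,
# so a linear SVM trained on mapped features behaves like a chi2-kernel SVM.
x, y = 0.3, 0.7
approx = float(chi2_feature_map(x) @ chi2_feature_map(y))
exact = chi2_kernel(x, y)  # 0.42; approx agrees to within a few percent
```

Even with this small order `n`, the approximation error is around one percent, which is why the explicit map makes large-scale training practical without a notable loss in accuracy.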
Additionally, the image content and its textual description do not always match. In order to improve multimodal information retrieval, a method of pseudo tag generation is proposed in this thesis. The generation of pseudo tags is based on a text–image semantic map, which is calculated from the co-occurrence of latent topics in the texts and visual concepts in the images of paired text–image data. In the experiments, the multimodal information retrieval results were considerably improved by using the pseudo tags.
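The pseudo-tag idea can be sketched with toy data: build a topic–concept map from the co-occurrence of text topics and visual concepts in paired training data, then propagate a text-only document's topic weights through the map to rank candidate image tags. The matrices, concept names, and the simple row normalization below are illustrative assumptions, not the thesis's exact formulation:

```python
import numpy as np

# Paired training data: each row is one text-image pair.
# topics[i]   = latent-topic distribution of the text (e.g. from a topic model)
# concepts[i] = visual-concept detection scores of the paired image
topics = np.array([[0.8, 0.1, 0.1],    # mostly a "sports" topic
                   [0.1, 0.8, 0.1],    # mostly a "nature" topic
                   [0.1, 0.1, 0.8]])   # mostly a "music" topic
concepts = np.array([[0.9, 0.1, 0.0],  # concept scores: ball, tree, guitar
                     [0.0, 0.9, 0.1],
                     [0.1, 0.0, 0.9]])
concept_names = ["ball", "tree", "guitar"]

# Topic-concept semantic map from co-occurrence, row-normalized so each
# topic distributes unit mass over the visual concepts.
S = topics.T @ concepts
S /= S.sum(axis=1, keepdims=True)

def pseudo_tags(topic_dist, k=1):
    """Rank visual concepts for a text-only document by propagating its
    topic weights through the semantic map; return the top-k as tags."""
    scores = np.asarray(topic_dist) @ S
    order = np.argsort(scores)[::-1][:k]
    return [concept_names[i] for i in order]

print(pseudo_tags([0.7, 0.2, 0.1]))  # sports-heavy text -> ['ball']
```

The generated tags can then serve as a surrogate image-domain query when the user supplies text only, which is the retrieval scenario the pseudo tags are meant to improve.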
Supervising professor
Kaski, Samuel, Prof., Aalto University, Department of Computer Science, Finland
Thesis advisor
Laaksonen, Jorma, Dr., Aalto University, Department of Computer Science, Finland
information retrieval, multimodal concepts, images, videos
Other note
  • [Publication 1]: Satoru Ishikawa, Jorma Laaksonen. Uni- and Multimodal Methods for Single- and Multi-label Recognition. Multimedia Tools and Applications, Volume 76, issue 21, pp.22405-22423, October 2017.
    DOI: 10.1007/s11042-017-4733-7
  • [Publication 2]: Mats Sjöberg, Markus Koskela, Satoru Ishikawa, Jorma Laaksonen. Real-Time Large-Scale Visual Concept Detection with Linear Classifiers. In Proceedings of 21st International Conference on Pattern Recognition, Tsukuba, Japan, pp.421-424, November 2012
  • [Publication 3]: Mats Sjöberg, Markus Koskela, Satoru Ishikawa, Jorma Laaksonen. Large-Scale Visual Concept Detection with Explicit Kernel Maps and Power Mean SVM. In Proceedings of ACM International Conference on Multimedia Retrieval (ICMR2013), Dallas, Texas, USA, pp.239-246, April 16-19 2013.
    DOI: 10.1145/2461466.2461505
  • [Publication 4]: Iftikhar Ahmad, Petri Rantanen, Pekka Sillberg, Jorma Laaksonen, Shuhua Liu, Thomas Fross, Aqdas Malik, Marko Nieminen, Rakshith Shetty, Satoru Ishikawa, Jarno Kallio, Jukka P. Saarinen, Moncef Gabbouj, Jari Soini. VisualLabel: An Integrated Multimedia Content Management and Access Framework. Frontiers in Artificial Intelligence and Applications, Volume 301: Information Modelling and Knowledge Bases XXIX, pp.321-342, 2018.
    DOI: 10.3233/978-1-61499-834-1-321
  • [Publication 5]: Ville Viitaniemi, Mats Sjöberg, Markus Koskela, Satoru Ishikawa, Jorma Laaksonen. Advances in Visual Concept Detection: Ten Years of TRECVID. Advances in Independent Component Analysis and Learning Machines, 1st Edition, pp.249-278, 2015.
    DOI: 10.1016/B978-0-12-802806-3.00012-9
  • [Publication 6]: Satoru Ishikawa, Jorma Laaksonen. Comparing and Combining Unimodal Methods for Multimodal Recognition. In Proceedings of the 14th International Workshop on Content-Based Multimedia Indexing, Bucharest, Romania, pp.1-6, June 2016.
    DOI: 10.1109/CBMI.2016.7500253
  • [Publication 7]: Satoru Ishikawa, Jorma Laaksonen, Juha Karhunen. Image Pseudo Tag Generation with Deep Boltzmann Machine and Topic–Concept Similarity Map. In Proceedings of the 30th International Joint Conference on Neural Networks, Anchorage, Alaska, USA, pp.1305-1312, June 2017.
    DOI: 10.1109/IJCNN.2017.7966003