Temporal modelling of first-person actions using hand-centric verb and object streams
dc.contributor | Aalto-yliopisto | fi |
dc.contributor | Aalto University | en |
dc.contributor.author | Gökce, Zeynep | en_US |
dc.contributor.author | Pehlivan, Selen | en_US |
dc.contributor.department | Department of Computer Science | en |
dc.contributor.groupauthor | Lecturer Laaksonen Jorma group | en |
dc.contributor.organization | TED University | en_US |
dc.date.accessioned | 2022-05-10T10:34:12Z | |
dc.date.available | 2022-05-10T10:34:12Z | |
dc.date.embargo | info:eu-repo/date/embargoEnd/2023-08-26 | en_US |
dc.date.issued | 2021-11 | en_US |
dc.description | Publisher Copyright: © 2021 Elsevier B.V. | |
dc.description.abstract | Analysis of first-person (egocentric) videos involving human actions could help solve many problems. These videos cover a large number of fine-grained action categories with hand–object interactions. In this paper, a compositional verb–noun model comprising two complementary temporal streams is proposed, with various fusion strategies, to recognize egocentric actions. The first step constructs verb and object video models as a decomposition of actions, with special attention to hands. In particular, the verb video model, a spatial–temporal encoding of hand actions, and the object video model, object scores combined with hand–object layout, are represented as two separate pathways. The second step is the fusion stage that identifies the action category, where the distinct verb and object models are combined to give their action judgments. We propose fusion strategies with recurrent steps that collect verb and object label judgments along a temporal video sequence. We evaluate recognition performance for the individual verb and object models, and we present extensive experimental evaluations of recurrent fusion approaches for action recognition on the EGTEA Gaze+ dataset. | en |
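The abstract describes a two-stream architecture whose per-timestep verb and object judgments are fused by a recurrent model. The following is a minimal sketch of that idea in PyTorch, not the authors' implementation: the module name, the GRU choice, the hidden size, and the class counts (chosen to roughly match EGTEA Gaze+) are all assumptions.

import torch
import torch.nn as nn

class RecurrentVerbObjectFusion(nn.Module):
    """Hypothetical recurrent fusion of per-timestep verb/object scores."""
    def __init__(self, n_verbs=19, n_objects=51, n_actions=106, hidden=128):
        super().__init__()
        # A single GRU consumes concatenated verb and object judgments
        # at each timestep and accumulates evidence along the sequence.
        self.rnn = nn.GRU(n_verbs + n_objects, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_actions)

    def forward(self, verb_scores, object_scores):
        # verb_scores:   (batch, time, n_verbs)   from the verb stream
        # object_scores: (batch, time, n_objects) from the object stream
        x = torch.cat([verb_scores, object_scores], dim=-1)
        out, _ = self.rnn(x)
        return self.classifier(out[:, -1])  # action logits at the last step

# Toy usage with random stream outputs; shapes are illustrative only.
verbs = torch.randn(2, 16, 19)
objects = torch.randn(2, 16, 51)
print(RecurrentVerbObjectFusion()(verbs, objects).shape)  # torch.Size([2, 106])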
dc.description.version | Peer reviewed | en |
dc.format.extent | 17 | |
dc.format.mimetype | application/pdf | en_US |
dc.identifier.citation | Gökce, Z & Pehlivan, S 2021, 'Temporal modelling of first-person actions using hand-centric verb and object streams', SIGNAL PROCESSING: IMAGE COMMUNICATION, vol. 99, 116436. https://doi.org/10.1016/j.image.2021.116436 | en |
dc.identifier.doi | 10.1016/j.image.2021.116436 | en_US |
dc.identifier.issn | 0923-5965 | |
dc.identifier.other | PURE UUID: b69bb7a0-2198-49ae-9a2f-47c32c5aa8e8 | en_US |
dc.identifier.other | PURE ITEMURL: https://research.aalto.fi/en/publications/b69bb7a0-2198-49ae-9a2f-47c32c5aa8e8 | en_US |
dc.identifier.other | PURE LINK: http://www.scopus.com/inward/record.url?scp=85113626751&partnerID=8YFLogxK | |
dc.identifier.other | PURE FILEURL: https://research.aalto.fi/files/82535978/Temporal_modelling_of_first_person_actions_using_hand_centric_verb_and_object_streams.pdf | en_US |
dc.identifier.uri | https://aaltodoc.aalto.fi/handle/123456789/114176 | |
dc.identifier.urn | URN:NBN:fi:aalto-202205103040 | |
dc.language.iso | en | en |
dc.publisher | Elsevier | |
dc.relation.ispartofseries | SIGNAL PROCESSING: IMAGE COMMUNICATION | en |
dc.relation.ispartofseries | Volume 99 | en |
dc.rights | openAccess | en |
dc.subject.keyword | Action recognition | en_US |
dc.subject.keyword | Egocentric vision | en_US |
dc.subject.keyword | First-person vision | en_US |
dc.subject.keyword | RNN | en_US |
dc.subject.keyword | Temporal models | en_US |
dc.title | Temporal modelling of first-person actions using hand-centric verb and object streams | en |
dc.type | A1 Original article in a scientific journal | en |
dc.type.version | acceptedVersion |