Data science for sociotechnical systems - from computational sociolinguistics to the smart grid

School of Science | Doctoral thesis (article-based) | Defence date: 2018-02-09
118 + app. 170
Aalto University publication series DOCTORAL DISSERTATIONS, 21/2018
We live in the Information Age, characterized by exponential growth in the technological capacity to produce and store data (big data) and to process them into information and knowledge (data science). In particular, large amounts of data are produced during the interaction between people and technology in diverse sociotechnical systems. Data science, as a set of theories and techniques for distilling knowledge from data, is recognized as an effective tool to support sociotechnical systems. This dissertation consists of four projects in which we apply data science for monitoring and interventions in concrete sociotechnical systems: human dynamics, social networks, smart grid and Web cybersecurity. By analyzing mobile phone communication from a developing country, we show how people's socio-economic factors correlate with their dynamics inferred from the data. Consequently, we demonstrate how monitoring a mobile phone network can serve as a proxy for census statistics. In developing countries, where censuses are infrequent, this can prove important. Using Twitter data, we investigate two social phenomena: homophily and the happiness paradox. In addition to finding evidence for the respective sociological theories, we also provide interesting hypotheses for further investigation. In another, theoretical study, we propose an epidemic spreading model for multiplex networks (representing, for instance, user engagement in several social networks). The simulations reveal when the spreading dynamics of the whole system is slower than that of any individual layer. Our model can be employed by governments, companies, and others who aim to spread information using several social media. In the project on the residential smart grid, we design an intervention targeting improved sustainability. We develop a social energy app to teach and engage people in energy-efficient practices.
In data centers, a better understanding of the interplay between computation and energy consumption is needed before interventions can be proposed. Our results are a step towards such an understanding. In the final project, given a Web crawl, we first show how the underlying distributions in this complex system differ between malicious and clean websites. Then we demonstrate how such knowledge can support detecting malware-affected websites. We conclude this dissertation by presenting a systematic overview of, and lessons learned from, the data science process undertaken in each project.
Supervising professor
Gionis, Aristides, Prof., Aalto University, Department of Computer Science, Finland
Thesis advisor
Hui, Pan, Prof., Hong Kong University of Science and Technology, Hong Kong
Nurminen, Jukka K., Prof., VTT Technical Research Centre of Finland, Finland
data science, human mobility, smart grid, computational sociolinguistics, Web cybersecurity
Other note
  • [Publication 1]: Sanja Šćepanović, Igor Mishkovski, Pan Hui, Jukka K Nurminen and Antti Ylä-Jääski. Mobile phone call data as a regional socio-economic proxy indicator. PLoS ONE, 10(4), p.e0124160, April 2015.
    DOI: 10.1371/journal.pone.0124160
  • [Publication 2]: Sanja Šćepanović, Igor Mishkovski, Bruno Gonçalves, Nguyen Trung Hieu and Pan Hui. Semantic homophily in online communication: Evidence from Twitter. Accepted for publication in Online Social Networks and Media, 2C, pp. 1-18, June 2017.
    DOI: 10.1016/j.osnem.2017.06.001
  • [Publication 3]: Igor Mishkovski, Sanja Šćepanović, Bruno Gonçalves and Pan Hui. On the traces of sentiment homophily and happiness paradox in online social network communication. Submitted to Physica A (pages: 27), March 2017.
  • [Publication 4]: Igor Mishkovski, Miroslav Mirchev, Sanja Šćepanović and Ljupco Kocarev. Interplay Between Spreading and Random Walk Processes in Multiplex Networks. IEEE Transactions on Circuits and Systems I: Regular Papers, May 2017.
    DOI: 10.1109/TCSI.2017.2700948
  • [Publication 5]: Sanja Šćepanović, Martijn Warnier and Jukka K Nurminen. The role of context in residential energy interventions: A meta review. Renewable and Sustainable Energy Reviews, vol. 77, pp. 1146-1168, September 2017.
    DOI: 10.1016/j.rser.2016.11.044
  • [Publication 6]: Yilin Huang, Hanna Hasselqvist, Giacomo Poderi, Sanja Šćepanović, Filip Kis, Cristian Bogdan, Martijn Warnier and Frances Brazier. YouPower: An Open Source Platform for Community-Oriented Smart Grid User Engagement. In Proceedings of the 14th IEEE International Conference on Networking, Sensing and Control, pp. 1-6, May 2017.
    DOI: 10.1109/ICNSC.2017.8000058
  • [Publication 7]: Kashif Nizam Khan, Sanja Šćepanović, Tapio Niemi, Jukka K. Nurminen, Sebastian Von Alfthan, Olli-Pekka Lehto. Analyzing the Power Consumption Behavior of a Large Scale Data Center. Accepted for publication in Computer Science Research and Development Journal, Springer (2017), (pages: 8; to appear), June 2017.
  • [Publication 8]: Jukka Ruohonen, Sanja Šćepanović, Sami Hyrynsalmi, Igor Mishkovski, Tuomas Aura and Ville Leppänen. A Post-Mortem Empirical Investigation of the Popularity and Distribution of Malware Files in the Contemporary Web-Facing Internet. In Proceedings of the European Intelligence and Security Informatics Conference, pp. 144-147, August 2016.
    DOI: 10.1109/EISIC.2016.037
  • [Publication 9]: Jukka Ruohonen, Sanja Šćepanović, Sami Hyrynsalmi, Igor Mishkovski, Tuomas Aura and Ville Leppänen. The Black Mark Beside My Name Server: Exploring the Importance of Name Server IP Addresses in Malware DNS Graphs. In Proceedings of the International Conference on Future Internet of Things and Cloud Workshops, pp. 264-269, August 2016.
    DOI: 10.1109/W-FiCloud.2016.61
  • [Publication 10]: Jukka Ruohonen, Sanja Šćepanović, Sami Hyrynsalmi, Igor Mishkovski, Tuomas Aura and Ville Leppänen. Correlating File-Based Malware Graphs Against the Empirical Ground Truth of DNS Graphs. In Proceedings of the European Conference on Software Architecture Workshops, pp. 30:1–30:6, Nov-Dec 2016.
    DOI: 10.1145/2993412.2993414
  • [Publication 11]: Sanja Šćepanović, Igor Mishkovski, Jukka Ruohonen, Frederick Ayala-Gómez, Tuomas Aura and Sami Hyrynsalmi. Malware and graph structure of the Web. Submitted to The Journal of Web Science (pages: 14), May 2017.