[dipl] Perustieteiden korkeakoulu (School of Science) / SCI

Permanent URI for this collection: https://aaltodoc.aalto.fi/handle/123456789/21

Recent Submissions

Now showing 1–20 of 6024
  • Comparative analysis: Software bill of materials generation tooling for large scale medical software
    (2025-12-24) Nicholson, Kohdy
    School of Science | Master's thesis
    Modern software systems rely heavily on open source and third-party components, making software supply chains increasingly complex and difficult to secure. For organizations operating in regulated domains such as medical device development, maintaining visibility into these components is essential for meeting cybersecurity requirements. Software Bills of Materials (SBOMs) have emerged as a core mechanism for achieving this visibility, yet relatively little research has examined the effectiveness of SBOM generation tools within the .NET ecosystem, despite its widespread industrial use. This thesis investigates the performance and automation usability of four SBOM generation tools across six .NET microservice projects. The study evaluates each tool (syft, cdxgen, cyclonedx-dotnet, sbom-tool) using quantitative metrics derived from comparisons against a Source of Truth generated using the dotnet CLI. Complementing this analysis, the thesis also assesses the developer experience of automating each tool, focusing on portability, installation requirements, command-line flexibility, and suitability for continuous integration pipelines. The findings reveal differences between general-purpose and ecosystem-specific tooling. CycloneDX-Dotnet achieved perfect accuracy due to its use of the dotnet CLI, while syft produced extensive false positives by including runtime assemblies. Sbom-tool demonstrated strong component detection, though not notably better than the other tools. Cdxgen showed consistent, balanced results with strong component detection and versatility, although it presented minor automation challenges due to execution constraints. Based on these findings, cdxgen is identified as the most suitable option for Varian, offering strong component detection, integration with the company's in-house license clearance solution, and a long-term vision of supporting a varied technology stack.
The thesis concludes by outlining opportunities for future research, including standardized evaluation frameworks and SBOM integration into broader vulnerability management workflows.
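    The accuracy comparison described in the abstract above can be illustrated with a minimal metric computation. Assuming components are identified by (name, version) pairs, precision and recall against a source-of-truth list might be computed as follows (an illustrative sketch; the package names and the exact metric definitions used in the thesis are not taken from it):

```python
def sbom_metrics(detected, truth):
    """Compare a tool's detected components against a source of truth.

    Components are (name, version) tuples. False positives are components
    the tool reported that the source of truth does not contain; missed
    components are ground-truth entries the tool failed to detect.
    """
    detected, truth = set(detected), set(truth)
    tp = detected & truth
    precision = len(tp) / len(detected) if detected else 1.0
    recall = len(tp) / len(truth) if truth else 1.0
    return {
        "precision": precision,
        "recall": recall,
        "false_positives": sorted(detected - truth),
        "missed": sorted(truth - detected),
    }
```

A tool that, like syft in the study, includes runtime assemblies absent from the dependency manifest would show up here as lowered precision with unchanged recall.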
  • Assessing DevSecOps maturity: A case study
    (2025-12-31) Klimenko, Artem
    School of Science | Master's thesis
    As software organizations transition towards cloud architecture and Agile development practices, they face complex security challenges. Modern software development extensively relies on cloud infrastructure, DevOps, microservices, and CI/CD pipelines. While these technologies enable faster release cycles and better scalability, they also introduce broader and more complex attack surfaces. As a result, existing security practices and traditional software development methods are no longer effective in addressing risks emerging from this new technological landscape. DevSecOps has emerged as a response to this problem, embedding security practices directly into the development lifecycle and deployment pipelines. While it has gained substantial attention in both academic literature and industry practice, empirical understanding of how organizations adopt these practices and what factors influence this adoption remains limited. This gap is particularly pronounced for mid-sized organizations that face significant security requirements without having dedicated security teams to address them. This thesis examines the security maturity of a mid-sized software organization using the OWASP DevSecOps Maturity Model. Through an embedded case study involving three development teams, the research draws on interviews, documentation analysis, and technical observations to understand both the current state of DevSecOps practices and the organizational factors shaping them. The findings reveal a characteristic pattern of asymmetric maturity: the organization demonstrates strong technical implementation with comprehensive security tooling integrated into CI/CD pipelines, yet exhibits significant gaps in process governance and security culture. Security initiatives are predominantly reactive, driven by external compliance requirements rather than proactive risk management.
The assessment identified that centralized DevOps functions and technological stack coherence serve as key enablers, while the absence of dedicated security personnel and formalized vulnerability management processes act as primary constraints. These findings contribute to the empirical understanding of DevSecOps adoption in mid-sized organizations and offer practical insights for similar contexts where security responsibilities must be distributed across development teams.
  • Evaluating privacy and ethical implications of AI-assisted documentation in healthcare: A case study
    (2025-12-18) Pulkkinen, Leevi
    School of Science | Master's thesis
    Clinical documentation is an essential part of patient care, but it can be time-consuming and reduce interaction time with patients. With the emergence of artificial intelligence (AI) technologies, such as automatic speech recognition (ASR), natural language processing (NLP), and large language models (LLMs), promising new tools have been proposed to make clinical documentation more efficient and patient-centric. However, the introduction of AI-powered documentation tools in healthcare settings raises a multitude of concerns regarding data protection, privacy, and ethics. This thesis examines these implications by conducting a Data Ethics Decision Aid (DEDA) evaluation of an AI-assisted documentation tool called Smart Care and contrasting the findings with the existing literature. The findings emphasized the importance of lawful data processing and robustness of AI services against attacks as key factors in maintaining the privacy of individuals. In addition, human oversight, accountability, and proper training were highlighted to ensure responsible use. Finally, transparency towards patients and proper acquisition of consent were seen as essential for creating trust between patients, users, and technology. Further research in practical settings is needed to perform a sophisticated evaluation of these issues and the proposed solutions.
  • Cross-platform plugin architecture for middleware-based robotic system
    (2025-12-31) Zhamashev, Yerzhan
    School of Science | Master's thesis
    This thesis presents a WebAssembly-based sandboxed plugin architecture for Robot Operating System 2. It addresses security issues inherent to dynamic plugin architectures, whose trust assumptions are incompatible with multi-tenant architectures and remote deployment. With systems becoming increasingly multi-tenant, software components from various vendors require appropriate levels of isolation. This thesis uses WebAssembly as a sandboxing substrate to achieve improved security, API compatibility, workflow alignment, and acceptable tradeoffs. The work addresses core parts of the design by patching ROS 2 libraries (rcl, rclcpp) to enable compilation to WebAssembly targets, modifying the WASI SDK toolchain to enable C++ exception handling, implementing parts of the ROS 2 middleware using the wasi-messaging proposal to allow plugins to use ROS 2 APIs without significant modifications to the codebase, and extending the Wasmtime runtime to support wasi-messaging for debugging purposes. The evaluation shows partial satisfaction of the framework requirements, documenting the impact of ROS 2 modifications, compatibility of APIs, and tradeoff characteristics. The ROS 2 threading patches deactivate some of the expected ROS 2 features, and the RMW pull-based subscription model had to be adapted to the wasi-messaging push-based model. The qualitative and quantitative tradeoffs that result from adopting WebAssembly-based sandboxing are characterized but require further empirical validation. In general, the architectural approach demonstrates viability and aligns with broader industry developments. The identified tradeoffs guide future deployment decisions for WebAssembly-based secure third-party component integration in middleware-based systems such as ROS 2.
As a result, this work lays the groundwork for the design and development of sandboxed plugin architectures in robotics systems, with potential applicability in adjacent safety-critical domains, such as automotive and industrial automation systems.
  • Efficacy of a domain name based policy as the mechanism for zero trust browsing
    (2025-12-31) Campbell, Aaron
    School of Science | Master's thesis
    Zero Trust Architecture is a security model based around ephemeral per-request access to constrained resources. This architecture has been shown to be effective in a variety of security fields, securing both local processes and communications with network services, but applications of zero trust in the field of network security have focused primarily on securing enterprise endpoints, leaving general network access to be managed by traditional methods. This work explores an emerging model, DNS-RoT, whereby zero trust principles are applied not just to managed network services but to all network communications into and out of the managed network. The model operates around the principle of DNS as a root of trust, requiring that a DNS request be made to a validated domain name through the controlled DNS in order to open an ephemeral bi-directional tunnel for communication. In the model, a policy-engine-backed firewall enforces zero trust principles by intercepting all network requests and allowing ephemeral, time-limited access tunnels between two endpoints only when the remote device is identified through a validated domain name rather than a raw IP address. Direct IP-based connection attempts are automatically blocked, while temporary secure channels with defined TTLs are provisioned for approved domain-qualified communications. This thesis discusses the efficacy of the DNS-RoT approach from a usability and security standpoint, addresses the weaknesses of the model, and compares it to previous models in real-world scenarios.
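    The core DNS-RoT mechanism described above, permitting traffic only toward addresses resolved through the controlled DNS and only for a bounded lifetime, can be sketched as a minimal policy engine. This is an illustrative model under assumed names (`DnsRootOfTrustPolicy`, `record_resolution`, `permit`), not the thesis implementation:

```python
import time

class DnsRootOfTrustPolicy:
    """Ephemeral allow-list keyed on DNS resolutions.

    A destination IP is reachable only if the controlled resolver recently
    answered a validated domain name with that IP, and only until the
    record's TTL expires. Raw-IP connections never appear in the map,
    so they are blocked by default.
    """

    def __init__(self):
        self._allowed = {}  # destination IP -> expiry timestamp

    def record_resolution(self, domain, ip, ttl, now=None):
        # Called when the controlled DNS answers a query for a validated domain.
        now = time.time() if now is None else now
        self._allowed[ip] = now + ttl

    def permit(self, dst_ip, now=None):
        # Firewall hook: allow the connection only inside the TTL window.
        now = time.time() if now is None else now
        expiry = self._allowed.get(dst_ip)
        return expiry is not None and now <= expiry
```

In the real model the `permit` decision would drive firewall rule provisioning (opening a time-limited tunnel) rather than a boolean return.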
  • High performance state space search by compilation to program code
    (2025-12-31) Nguyen, Son
    School of Science | Master's thesis
    State space search systems commonly represent actions as a pair of precondition and effect. The precondition is evaluated to check whether the transition from the action is applicable, and the effect is applied to a copy of the current state to construct a successor. Even though this process has linear time complexity, there remains potential to optimize the implementation further. This thesis develops an automated planning implementation that compactly represents the state variables in 64-bit words and generates low-level code for evaluating the preconditions and effects of actions from a multi-valued planning task. The compact state representation has logarithmic space complexity in the size of the variable domain. For each planning problem, low-level code is generated to evaluate the preconditions and effects of actions from that problem. The generated code is incorporated into an implementation of the Fast Forward heuristic and two search algorithms: greedy best-first search and simulated annealing. The system is evaluated on a number of domain problems from planning competitions and compared with the performance of the Fast Downward planner. The results show good performance on the planning problems, especially during the successor generation phase, demonstrating the potential of the method as a viable path for optimizing state space search.
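    The compact state representation described above, giving each multi-valued variable ceil(log2 |dom(v)|) bits inside 64-bit words, can be sketched as follows. This is a simplified generic accessor for illustration; the thesis instead generates specialized low-level code per planning task:

```python
import math

class PackedState:
    """Pack multi-valued state variables into 64-bit words.

    Each variable v gets ceil(log2(|dom(v)|)) bits, so space grows
    logarithmically in the domain size. A field never straddles a
    word boundary, keeping reads and writes single-word operations.
    """

    def __init__(self, domain_sizes):
        self.widths, self.offsets = [], []
        off = 0
        for d in domain_sizes:
            w = max(1, math.ceil(math.log2(d)))
            if off % 64 + w > 64:            # would straddle: start a new word
                off = (off // 64 + 1) * 64
            self.widths.append(w)
            self.offsets.append(off)
            off += w
        self.num_words = off // 64 + 1

    def set(self, words, var, value):
        word, shift = divmod(self.offsets[var], 64)
        mask = (1 << self.widths[var]) - 1
        words[word] = (words[word] & ~(mask << shift)) | ((value & mask) << shift)

    def get(self, words, var):
        word, shift = divmod(self.offsets[var], 64)
        return (words[word] >> shift) & ((1 << self.widths[var]) - 1)
```

The generated code in the thesis can hard-code each variable's word index, shift, and mask as constants, which is precisely what makes precondition checks and effect application fast.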
  • Automatic threat detection in terahertz imaging
    (2025-12-30) Ashraf, Syed
    School of Science | Master's thesis
    Passive terahertz cameras provide a non-ionizing, high clothing-penetration alternative to traditional security imaging systems, but their deployment is limited by low visual contrast, sensor noise, and the practical difficulty of collecting large and diverse annotated datasets for machine learning. These constraints create challenges for supervised training, as models can overfit to narrow operating conditions and struggle to generalize to new subjects, poses, or environments. To address these limitations, this work develops a scalable synthetic data generation pipeline in Unreal Engine 5 capable of producing structured, repeatable training datasets that approximate key aspects of terahertz image formation. Synthetic samples are combined with real images to form controlled dataset configurations, enabling systematic evaluation of real-to-synthetic data ratios for threat-object detection. Experiments show that incorporating synthetic data into the training process can improve detection performance compared to models trained exclusively on real data. Gains are especially evident when the real dataset exhibits temporal correlation or low sample diversity, conditions which otherwise limit model generalization. Across the tested configurations, mixed datasets outperform the real-only baseline, suggesting that synthetic augmentation can compensate for gaps in the real data distribution. While synthetic data does not replace real data, the results indicate that it can serve as an effective supplement, increasing the range of examples available to the model without requiring additional data collection or manual annotation. The proposed pipeline also enables reproducible experimentation, controlled scenario variation, and the generation of targeted examples that are difficult or impractical to capture with physical hardware. 
These findings highlight the practical value of data-centric methodologies for terahertz security imaging and show that synthetic data can improve model stability and contribute to better generalization.
  • Regulating the cloud: Cross-border data transfer regulation and the geography of hyperscale cloud infrastructure
    (2025-12-31) Ngo, Linh
    School of Science | Master's thesis
    Cloud computing has become a central component of digital economic activity and depends on infrastructure that operates across national borders. As data are stored, processed, and accessed through globally distributed systems, cross-border data flows have become a routine feature of cloud-based services used by firms, governments, and individuals. At the same time, the growing concentration of cloud infrastructure among a small number of global providers has raised questions about regulatory reach, jurisdictional control, and economic dependence. These concerns have brought issues of digital sovereignty into debates on data governance and industrial policy. In response, governments increasingly rely on cross-border data transfer (CBDT) regulation to shape data flows, influence cloud market development, and manage reliance on foreign-controlled infrastructure. This thesis examines whether stricter CBDT regulation is associated with changes in the geographic expansion of hyperscale cloud infrastructure. In this study, stricter regulation refers to CBDT regimes that exceed GDPR requirements by imposing additional legal conditions on cross-border data transfers. To address this question, the analysis adopts a cross-national empirical design that combines country–year data on new hyperscale cloud availability zone announcements with systematic information on the timing and restrictiveness of CBDT regimes. Exploiting variation in the adoption of GDPR-equivalent or stricter CBDT frameworks across countries, the study applies an interaction-weighted difference-in-differences event-study design to examine infrastructure deployment patterns before and after regulatory adoption and to estimate regulatory effects under standard identifying assumptions. 
The results show no evidence that the adoption of stricter CBDT regimes leads to increased hyperscale cloud infrastructure deployment, either in the full sample or within subsamples of middle- and high-income economies and the largest economies. This finding suggests that regulatory tightening alone does not substantially affect the factors that shape hyperscale infrastructure siting decisions. Instead, infrastructure deployment appears to be driven primarily by structural considerations—such as energy availability, economies of scale, and existing network configurations—and by providers’ ability to comply with CBDT requirements through legal and organizational adjustments rather than physical relocation. From the perspective of digital sovereignty, the results point to a gap between regulatory intent and infrastructure outcomes, indicating that CBDT regulation may need to be complemented by broader approaches that address infrastructural dependence and market concentration more directly.
  • ORECA: Evaluating cadence and elasticity impacts on root cause analysis in edge-cloud continuum
    (2025-12-31) Nguyen, Dung
    School of Science | Master's thesis
    End-to-end microservice pipelines are increasingly deployed across the edge-cloud continuum, which introduces substantial heterogeneity in resource capabilities for highly elastic workloads. As service dependencies span edge and cloud computing nodes, identifying the root cause of system and application faults becomes increasingly challenging. For developers and application providers to choose suitable root cause analysis (RCA) techniques for their diverse edge-cloud fault scenarios, evaluation frameworks are needed to systematically assess the techniques' strengths and limitations across diverse operational conditions. Existing RCA evaluation frameworks allow customization of techniques and evaluation metrics, but they neglect configuration aspects related to fault scenarios, particularly those involving elasticity, fault severity, and observability cadences in the edge-cloud continuum. In this thesis, we propose a holistic scenario-based RCA evaluation workflow, encompassing both fault scenario generation and RCA method assessment. With this foundation, we implement ORECA for evaluating RCA under diverse runtime behaviors, observability configurations, application models, and edge-cloud environments. ORECA provides fault scenario specifications and integrates observability tools along with chaos engineering to allow customized, scenario-based end-to-end RCA evaluation workflows, covering scenario setup, dataset creation, and performance evaluation. We use ORECA to evaluate current state-of-the-art RCA algorithms for ML-intensive applications in the edge-cloud continuum under various observability cadences, fault severities, and elasticity behaviors. Based on these findings, we provide recommendations for further development of RCA algorithms for microservice-based pipelines from various perspectives.
  • Utilizing Linux for a network address translation solution
    (2025-12-19) Routila, Maija
    School of Science | Master's thesis
    The problem of Internet Protocol version 4 (IPv4) address exhaustion has required solutions such as Network Address Translation (NAT) to be created. NAT is used to translate private network IPv4 addresses into public IPv4 addresses within a network device like a router or gateway. While providing a solution to address exhaustion, it has also caused issues. Some of these issues have to do with Virtual Private Networks (VPN). VPNs like IP Security (IPsec) in certain scenarios use two IP headers: an outer and an inner packet header. The inner IP header contains the original IP address, which may be a private IPv4 address due to a NAT device between the two endpoints. When multiple clients from different private networks communicate, several clients may share the same IPv4 address, causing routing conflicts and security risks. This thesis proposes a NAT-based solution for the issues caused by overlapping address spaces on a Linux-based VPN gateway device. The proposed solution is manually implemented in a device within a virtual environment using the Linux XFRM framework for IPsec processing and the Nftables framework for packet management. The implementation is done using user-space tools. The proposed solution supports multiple NAT pools for different traffic and multiple tunnels, and resolves the issues caused by multiple clients sharing the same IPv4 address.
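    Conceptually, the overlapping-address fix gives each tunnel its own translation pool, so two clients that happen to share a private address map to distinct gateway-side addresses. Below is a toy model of that mapping only; the thesis realizes it with the XFRM and Nftables frameworks on the gateway, and all names and addresses here are invented for illustration:

```python
class PerTunnelNat:
    """Map (tunnel_id, private_ip) to a unique address from that
    tunnel's pool, so overlapping client subnets no longer collide.
    """

    def __init__(self, pools):
        # pools: tunnel_id -> list of translated addresses available to it
        self._free = {t: list(addrs) for t, addrs in pools.items()}
        self._map = {}  # (tunnel_id, private_ip) -> translated address

    def translate(self, tunnel_id, private_ip):
        key = (tunnel_id, private_ip)
        if key not in self._map:
            if not self._free[tunnel_id]:
                raise RuntimeError(f"NAT pool exhausted for tunnel {tunnel_id}")
            # Stable one-to-one assignment from the tunnel's own pool.
            self._map[key] = self._free[tunnel_id].pop(0)
        return self._map[key]
```

Because the pool is chosen by tunnel, two IPsec peers whose inner headers both carry, say, 10.0.0.5 are distinguishable after translation, which is what removes the routing conflict.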
  • Application of supervised machine learning in molecular chemistry
    (2025-12-18) Paavola, Pihla
    School of Science | Master's thesis
    This thesis investigates how supervised machine learning can be applied in molecular chemistry. It begins by presenting the background from the literature most relevant for understanding the experiments. Two experiments are then described. The first focuses on predicting pKa values from experimental data. The second predicts the total energies and energy differences of N2 structural isomers from calculated data. The results show that multiple ML models have predictive capabilities, but their accuracy is not yet at a practically usable level. The second experiment also suffers from datasets that are too small, which leads to overfitting. The overall conclusion is that machine learning has a lot of potential for molecular chemistry purposes, but the methods should be developed to be more accurate, understandable, and reliable. Access to large datasets should also be improved for supervised machine learning to become more usable in molecular chemistry.
  • Bayesian optimization of multivariate multi-output biotechnological processes
    (2025-12-22) Tiainen, Vilma
    School of Science | Master's thesis
    Many industrial biotechnology applications require the optimization of complex multi-output processes, often with limited data and costly, laborious experiments. Hence we have developed and evaluated a customizable Bayesian optimization (BO) framework, BAIO2, using AI surrogates for multivariate, multi-output biotechnological applications. We focus on BO frameworks and potential surrogate models for two application cases: media optimization and genetic engineering for production of desired molecules using micro-organisms. Predictive analyses and simulation experiments are used to assess surrogate model performance and guide experimental design. Performance of BAIO2 is experimentally evaluated for the media optimization case to reach a target lipid distribution. The predictive analyses and simulations show that surrogate model performance varies by data size and round: multi-layer perceptrons (MLP) and Gaussian processes (GP) were generally strong with good exploration capabilities, whereas XGBoost regression (XGBR) often excelled in prediction but struggled to explore the search space. Simulations provide further insights into optimal experimental design and model selection. We conclude that BAIO2 is effective for multi-output biotechnological optimization and adaptable to different experimental setups. Future work will expand on a wider variety of acquisition functions and improve uncertainty estimation.
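    One round of the Bayesian optimization loop described above, fitting a surrogate, scoring candidates with an acquisition function, and querying the maximizer, can be sketched for a single-output 1-D case with a tiny Gaussian process surrogate and an upper-confidence-bound acquisition. This generic sketch is not BAIO2 itself, whose surrogates, outputs, and acquisition functions are considerably richer:

```python
import numpy as np

def rbf(a, b, ls=0.2):
    """Squared-exponential kernel for 1-D inputs (unit prior variance)."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """GP posterior mean and standard deviation at candidate points Xs."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = 1.0 - np.sum(v ** 2, axis=0)     # prior variance of this RBF is 1
    return mu, np.sqrt(np.maximum(var, 1e-12))

def bo_maximize(f, n_rounds=10, kappa=2.0):
    """Minimal BO loop: UCB acquisition over a fixed 1-D candidate grid."""
    X = np.array([0.0, 0.5, 1.0])          # initial design
    y = np.array([f(x) for x in X])
    grid = np.linspace(0.0, 1.0, 101)
    for _ in range(n_rounds):
        mu, sd = gp_posterior(X, y, grid)
        x_next = grid[np.argmax(mu + kappa * sd)]   # explore/exploit tradeoff
        X = np.append(X, x_next)
        y = np.append(y, f(x_next))
    return X[np.argmax(y)], y.max()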
  • Sum–product networks for learning from demonstration with tractable exact inference and Bayesian experimental design
    (2025-12-31) Nguyen, Kien
    School of Science | Master's thesis
    One of the main challenges for robotic systems is to operate reliably in physical environments with latent (unobserved) parameters. Learning from Demonstration (LfD) offers a practical approach to tackle this problem by acquiring policies from expert demonstrations; however, limited data makes learned policies sensitive to data scarcity and distribution shifts. Bayesian Experimental Design (BED) addresses this by actively selecting queries that are maximally informative about latent environment parameters, which typically requires many repeated evaluations of conditional and marginal distributions. This thesis investigates whether Sum-Product Networks (SPNs) can serve as a tractable generative model that supports tractable likelihood evaluation, efficient conditioning, and marginalization over arbitrary subsets of trajectory variables. We study this question through two main proofs of concept: (a) synthetic Van der Pol and Lotka–Volterra systems and real handwriting trajectories from the LASA dataset; and (b) a toy Bayesian Experimental Design for Learning from Demonstration task, demonstrating that SPNs’ exact inference enables accurate and reliable expected information gain (EIG) calculations, leading to more efficient query selection. Across these experiments, the SPN model in this thesis matches the key qualitative structure of the studied nonlinear systems, supports inpainting of missing segments under the considered conditioning patterns, and enables numerically stable EIG computation via exact likelihood and marginal queries for the studied BED settings. Overall, the results provide initial evidence that SPNs can support tractable trajectory-based inference and small-scale BED in LfD, while scaling to higher-dimensional observations and design spaces remains future work.
  • Data platform design for electricity metered data analytics
    (2025-12-07) Hämäläinen, Ilmari
    School of Science | Master's thesis
    The continuous increase in data volumes managed by organizations has created opportunities for deriving more comprehensive insights. However, many traditional systems are primarily designed for operational functionality and therefore lack either the capacity or scalability to simultaneously handle computationally intensive analytical workloads. This limitation has created a need to offload these workloads into external systems optimized for analytics. This has been the case for Fingrid Datahub, which operates Datahub, a centralized data exchange system for Finland’s electricity retail market. Datahub manages valuable electricity retail market information, including detailed metered data from approximately 4 million metering points in Finland. Consequently, there is an increasing demand to provide reports based on this data for various stakeholders, including market parties, authorities and researchers. However, this large-scale reporting disrupts the system’s core operational data exchange functionality. Therefore, Fingrid Datahub has recognized the need to offload these analytics into an external system. This thesis aims to address the above-mentioned need by designing a software architecture for a data analytics platform, which would extract data from Datahub and facilitate the heavy reporting tasks. A set of key requirements for this platform was identified through analysing the source system and its current reporting, as well as interviewing two core reporting stakeholders. The resulting solution is a data warehouse architecture composed of five core layers, designed to support large-scale metered data aggregations through a dimensional data model. The architecture was validated through a self-evaluation involving requirement coverage analysis and scenario-based assessment, as no alternative validation methods were possible in this work. 
Although the self-evaluation indicated that the designed architecture should theoretically fulfil the stated requirements and use cases, future work is required to develop a proof-of-concept system that ensures the architecture’s applicability.
  • Workflow and capacity management system for the validation laboratory
    (2025-12-31) Huang, Guting
    School of Science | Master's thesis
    This thesis investigates how workflow and capacity management systems can support the daily operations of MFILAB, the validation laboratory of Murata Finland. StressED, the laboratory’s existing system for managing a specific set of test flows, offers limited visibility into ongoing validation activities, resource usage, and overall laboratory capacity. To address these issues, the study explores how StressED could be extended into a more comprehensive workflow management solution that enhances activity visibility and reduces manual coordination effort, ultimately improving operational efficiency. The study integrates principles from workflow management systems and requirements engineering literature to produce a high-level system design, which is evaluated through the implementation of a proof-of-concept prototype and the application of the Software Architecture Analysis Method (SAAM). User evaluations and scenario-based architectural assessments indicate that the proposed solution fits the current use case to improve test visibility and support more informed decision-making. Nonetheless, limitations related to the lack of full-scale deployment of the prototype in the laboratory, reliance on simplified data during prototype testing, and the constraints of legacy systems highlight the need for further refinement. Consequently, the study provides recommendations for modernizing StressED components, developing the system into a minimum viable product, exploring scheduling algorithms as operational data accumulates, and integrating laboratory equipment and analytical tools to enable future automation. The study demonstrates that the proposed workflow and capacity management design is both feasible and adds value to the lab, establishing a basis for continued development at MFI.
  • The impact of filtering social media push notifications on well-being and social media usage
    (2025-12-31) Uhari, Viivi
    School of Science | Master's thesis
    Social media (SM) use has been shown to affect well-being both positively and negatively, with push notifications contributing negatively by causing distraction and mental fatigue. In addition, social media notifications encourage platform use, potentially reinforcing the negative effects of these platforms. However, prior research has also shown that muting notifications altogether can negatively affect well-being, suggesting a need for more nuanced approaches to managing SM push notifications. This thesis investigated whether filtering SM push notifications based on their importance affects well-being, social media usage, and user experience. Specifically, notifications related to direct messages and offline events were assumed important, based on previous findings regarding the importance of smartphone notifications. An experimental field study was conducted with 16 young adults using Facebook, Instagram, and TikTok. Participants were divided into a control group with default notification settings and a treatment group with filtered notifications. Well-being was measured through surveys administered before, during, and after the two-week experiment, while social media activity was tracked using platform-provided data donations. Interviews were conducted to understand participants’ experiences. The results showed that SM notifications about one-on-one messages were valued most, followed by traditional SM notifications, while most notification types were perceived as unimportant. Filtering notifications did not significantly affect well-being between the groups, but mindfulness increased within the treatment group. Social media usage remained largely unchanged, although the number of searches increased within the treatment group. Additionally, daily reactions were associated with higher stress and lower positive affect on that day. 
Beyond these findings, the analysis of data donations revealed differences in data quality and structure across platforms, and highlighted the challenges and time required due to missing documentation. Facebook data donations were particularly problematic. Overall, the findings suggest that filtering SM notifications based on perceived importance is a feasible and non-harmful alternative to more restrictive notification management strategies, and that the use of data donations should be carefully considered during study design.
  • Leveraging large language models to identify and analyze repetitive customer service contacts
    (2025-12-25) Reinilä, Jade
    School of Science | Master's thesis
    This master’s thesis examines the feasibility of using large language models (LLMs) to identify and analyze repetitive customer contacts at a Finnish energy company. The study aims to determine whether pre-trained LLMs can automate the detection and categorization of recall-related calls so that they serve as a tool to support human analysis and decision-making rather than replace it. A significant portion of customer service demand often represents failure demand caused by systemic shortcomings. These contacts inflate overall service demand without adding value, ultimately leading to inefficient use of resources. The data consists of 558 Finnish customer service call transcripts, of which 299 are recalls. The study applies both qualitative and quantitative analysis using OpenAI’s GPT-4o and evaluates four approaches: minimal prompting, advanced prompting, minimal prompting combined with retrieval-augmented generation (RAG), and advanced prompting combined with RAG. The results indicate that LLMs can generate structured and actionable insights to a significant extent. Advanced approaches consistently yielded the strongest performance. Advanced prompting alone proved most effective for accurate recall categorization and for suggesting preventative actions to reduce follow-up calls. When combined with RAG-retrieved context, this approach achieved the highest ratings for generating open-ended explanations for recalls and for determining whether a call was linked to a previous customer interaction. Overall, the findings indicate that LLMs offer a practical and scalable solution for analyzing repetitive customer contacts and delivering insights that support operational efficiency improvements.
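    As a rough illustration of the prompting setups compared above, the sketch below assembles a chat request combining an advanced (instruction-rich) prompt with RAG-retrieved context. The category list, prompt wording, and message structure are hypothetical, not the thesis's actual prompts:

```python
# Hypothetical sketch of "advanced prompting + RAG": the system prompt fixes
# the task and output format, and retrieved prior-interaction snippets are
# prepended to the transcript in the user message.
CATEGORIES = ["billing", "outage", "contract", "metering", "other"]  # assumed

SYSTEM_PROMPT = (
    "You classify Finnish customer service call transcripts. "
    f"Answer with one category from {CATEGORIES} and state whether the "
    "call is a recall (a repeat contact about an earlier issue)."
)

def build_messages(transcript: str, retrieved_context: list[str]) -> list[dict]:
    """Assemble chat messages for an LLM call (e.g. GPT-4o via an API client)."""
    context = "\n".join(f"- {c}" for c in retrieved_context)
    user = (
        f"Earlier interactions for this customer:\n{context}\n\n"
        f"Transcript:\n{transcript}"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]

msgs = build_messages("Soitan taas laskusta...", ["2024-01: billing complaint"])
```

    The minimal-prompting variants would drop the detailed instructions and/or the retrieved context from these messages.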
  • Diagnosing and improving a maintenance contract churn prediction model
    (2025-12-08) Niemi, Veikka
    School of Science | Master's thesis
    Because any company serves a finite pool of potential customers, it is important for a service provider to retain existing customers and keep them paying for the service. A customer may cease to do business with a company at any time and end their contract, which is known as churning. With modern data-driven approaches to business problems, it is possible to uncover the intricate reasons why churn happens and use that information to retain customers who are at risk of churning. This thesis explores an existing maintenance contract churn prediction tool and suggests ways to improve the performance of the underlying machine learning model. The existing model performs suboptimally, and its results are underutilized within sales teams due to both inaccurate predictions and questionable explainability. The model uses CatBoost, a gradient-boosted decision tree framework, to classify churners. The first part of the research process reviews the existing solution, fixing logic problems and implementing new features. Three improvement strategies are then proposed. First, an improved time-horizon-based target definition, which immediately improves prediction performance. Second, segmenting the data by contract length and training separate models for each segment. Lastly, an experimental alternative model is built to test whether model choice is a limiting factor for performance. Additionally, a thorough prediction error analysis is conducted to gain insight into why and where the model struggles to predict churning contracts. As a result, a model with improved accuracy is obtained, along with actionable insights on how to improve the underlying contract-related processes to further maximize the gains from the prediction tool.
  • Developing multi-factor authentication in the palse.fi portal
    (2025-12-31) Kyllönen, Walerius
    School of Science | Master's thesis
    Authentication is a critical part of Internet services. Traditionally, authentication is handled with usernames and passwords. However, passwords are no longer reliable enough as the only authentication method when handling sensitive information such as health data. Multi-factor authentication has been proposed as a way to strengthen authentication. In this thesis, I selected and implemented a new multi-factor authentication method for the palse.fi portal, which is maintained by Polycon Oy. The portal connects patients, wellbeing services counties, and their service providers. First, a literature survey was performed to discover possible multi-factor authentication schemes. Then, requirements for scheme selection were formed. After this, the schemes were compared against the requirements to select the one to implement. Finally, the selected scheme was implemented and reviewed against the requirements. FIDO2 was selected as the multi-factor authentication scheme, and the WebAuthn API was implemented to support it. FIDO2 supports a wide range of authentication devices, including smartphones and physical security tokens. The implementation fulfilled most of the requirements, with the exception of usability. The usability of FIDO2 depends heavily on the combination of authentication device and platform, and the user can influence it by selecting a combination with better usability; the WebAuthn API implementation itself does not affect usability. In ideal conditions, authentication with FIDO2 is fast and easy, but with certain combinations the process can be challenging, especially for an inexperienced user.
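    In a WebAuthn flow, the server issues a random challenge and credential-creation options that the browser passes to `navigator.credentials.create()`. The sketch below builds such options as a plain dictionary; the field names follow the WebAuthn `PublicKeyCredentialCreationOptions` structure, while the relying-party details are assumed for illustration:

```python
import os
import base64

def registration_options(user_id: bytes, user_name: str) -> dict:
    """Build WebAuthn PublicKeyCredentialCreationOptions for FIDO2 registration.
    The challenge must be random, single-use, and verified server-side."""
    b64 = lambda b: base64.urlsafe_b64encode(b).rstrip(b"=").decode()
    return {
        "challenge": b64(os.urandom(32)),
        "rp": {"id": "palse.fi", "name": "palse.fi portal"},  # assumed RP details
        "user": {"id": b64(user_id), "name": user_name, "displayName": user_name},
        # COSE algorithm identifiers: -7 = ES256, -257 = RS256
        "pubKeyCredParams": [{"type": "public-key", "alg": -7},
                             {"type": "public-key", "alg": -257}],
        "timeout": 60000,  # milliseconds
    }

opts = registration_options(b"user-1234", "patient@example.com")
```

    The authenticator's signed response is then verified server-side against the stored challenge, which is what makes FIDO2 resistant to phishing and credential replay.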
  • Improvements in drug-target interaction prediction with multimodal deep learning
    (2025-12-27) Tulkki, Ilari
    School of Science | Master's thesis
    Drug-target interaction (DTI) prediction is an important topic of research in the field of computational chemistry with applications in drug discovery and repurposing. This thesis investigates whether integrating predicted 3D structures of drug-target complexes with sequence-based representations improves DTI prediction accuracy. A bimodal deep learning ensemble, BimodalDTI, is introduced. It consists of three components: two graph neural networks operating on complex structures predicted by the Boltz-1 diffusion model, and a sequence-based model that integrates the ChemBERTa and ProtT5 language models. DTI prediction is formulated as a regression task predicting interaction strength. The models are evaluated in bioactivity imputation and new drug scenarios. BimodalDTI consistently outperforms all its individual components and other baseline models. These results indicate that combining predicted structural information with sequence-based representations improves DTI prediction accuracy.