Input data processing strategies for software reliability growth models
School of Science
Master's thesis
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for your own personal use. Commercial use is prohibited.
Mcode
SCI3084
Language
en
Pages
83 + 16
Abstract
Software reliability growth models produce important software reliability information. They are typically nonlinear regression models that are fit to data pairing time with the cumulative number of issue reports. A processing phase converts raw data into this fitting data. The processing phase has received little attention in recent reliability model literature. This work therefore aims to gather information on recently used input data processing strategies for reliability growth models, discover their effects on model performance, and provide guidance on strategy usage. Recent scientific literature was searched for input data processing strategies. Information on strategy effects was produced empirically. A data set of software project problem reports was collected, and input data processing strategies were applied to it. Log-logistic, Yamada-Raleigh, and Weibull models were fit to the produced data, and their goodness of fit and predictive accuracy were measured. Finally, these performance metrics were compared and analyzed with Spearman correlation. Guidelines were derived from the empirical results. A total of eighteen input data processing strategies were found in recent literature. They are built from filtration, transformation, grouping, and other types of processing actions. Some actions are shared between strategies. Seven strategies were applied to 85 software projects using an automated analysis tool called STRAIT. The strategy effect varied between software projects. For some projects, results improved, while for others they degraded. For some projects this effect was extreme. The distribution of project results improved for strategies that shortened the analysis time frame. Removing open issue reports made a reliability growth trend visible for several projects.
Project size in kilobytes, the number of contributors, the number of development days, and the initial and processed issue counts typically have negligible to low Spearman correlations with the model performance results and their changes. Based on the results, it is beneficial to include many data processing strategy options in reliability analysis tools, as no strategy exclusively improved model results for all projects, and results are difficult to predict from project properties. It also appears beneficial to remove open problem reports from the report data set and to limit analysis to a single project phase instead of the full project period.

Software reliability growth models are typically nonlinear regression models that produce reliability information and are fit to time series data describing the number of software problems. The data is produced in a processing phase, which has received little emphasis in the literature of the field. This study therefore aims to gather information on data processing strategies, determine how they affect growth model performance, and provide guidelines. Strategies were searched for in recent scientific literature. Their effects were studied empirically by applying them to the problem reports of several software projects, fitting the Yamada-Raleigh, Weibull, and Log-logistic models to the data, measuring goodness of fit, predictive accuracy, and Spearman correlations, and comparing the measurements. Eighteen strategies were found in the literature. They consist of processing steps that can be classified as filtration, transformation, grouping, and other steps. In the empirical part, 7 strategies were applied to the report data of 85 software projects using an automated analysis tool. The effect of the strategies on model performance differed between software projects. Sometimes the effect was positive, sometimes negative, and sometimes very strong.
Strategies that shortened the analysis time frame shifted the distribution of all results in a better direction. Strategies that filtered out open problem reports often revealed a reliability growth trend in the data. Project size in megabytes and the numbers of developers, development days, and original and processed problem reports generally had weak Spearman correlations with model performance and its changes. Based on the results, automated tools used in reliability analysis should include several selectable data processing steps, as no strategy had an unambiguous effect. The effect of the strategies on the models is also hard to predict from project properties alone. Still, removing open problem reports and limiting the analysis time frame appear to have fairly general positive effects. Further research with a larger set of strategies and software projects is recommended.
Description
Supervisor
Fagerholm, Fabian
Thesis advisor
Chren, Stanislav
Other note
This work includes a digital appendix that contains additional data and tool scripts.
The folder structure of the digital appendix is the following:
A) Empirical study software projects
This folder contains an expanded table of the software projects used for the empirical study.
B) Experimental results from STRAIT
This folder contains several strategy-specific sub-folders. The sub-folders contain the output of the STRAIT analyses for each strategy. The output includes HTML result pages for each software project and a batch analysis report that presents the results in CSV table form.
C) GitHub project collector script
This folder contains the Python script that was used to find projects via the GitHub API. A Pipfile with the script dependencies is included.
D) Result analysis Jupyter notebook
This folder contains the Jupyter Notebook that was used to analyze the batch results provided by STRAIT. It produces different plots and tables of the results.
E) Visual analysis scatter plots
During the empirical study, a visual analysis with scatter plots was used to find connections between software project properties and SRGM performance. The scatter plots are included in this folder.
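As an illustration of the kind of input data processing the abstract describes, the following minimal sketch combines filtration (dropping open reports), a time frame limit, grouping by day, and transformation into cumulative counts. It is not taken from the thesis or from STRAIT: the function name `to_fitting_data`, its parameters, and the sample reports are all hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical raw issue reports as (date opened, state) pairs.
# This is made-up sample data, not data from the thesis.
RAW_REPORTS = [
    (date(2021, 1, 1), "closed"),
    (date(2021, 1, 1), "open"),
    (date(2021, 1, 2), "closed"),
    (date(2021, 1, 4), "closed"),
    (date(2021, 1, 4), "closed"),
    (date(2021, 2, 1), "closed"),
]


def to_fitting_data(reports, drop_open=True, start=None, end=None):
    """Convert raw reports into (day offset, cumulative report count) pairs."""
    # Filtration: optionally drop reports that are still open.
    kept = [(d, s) for d, s in reports if not (drop_open and s == "open")]
    # Time frame limit: keep only reports dated within [start, end].
    kept = [
        (d, s)
        for d, s in kept
        if (start is None or d >= start) and (end is None or d <= end)
    ]
    if not kept:
        return []
    # Grouping: count the reports on each calendar day.
    per_day = Counter(d for d, _ in kept)
    first_day = min(per_day)
    # Transformation: dates become day offsets, counts become running totals.
    points, total = [], 0
    for day in sorted(per_day):
        total += per_day[day]
        points.append(((day - first_day).days, total))
    return points


# Drop open reports and restrict the analysis to January 2021.
fitting_data = to_fitting_data(RAW_REPORTS, end=date(2021, 1, 31))
print(fitting_data)
```

The resulting pairs are the time and cumulative-count form that a nonlinear regression model such as Weibull could then be fit to. Toggling `drop_open` or the date bounds changes which strategy is applied without touching the rest of the pipeline.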