Software productivity has always been one of the most critical metrics in software development. However, with the rise of open-source communities (e.g., GitHub), new software development models are emerging, and traditional productivity metrics do not measure them comprehensively. It is therefore necessary to build a productivity measurement model for open source software ecosystems that suits the production activities of the open-source community. By analogy with the natural ecosystem, this paper proposes concepts related to the productivity of open source software ecosystems, analyzes the factors that influence this productivity, and constructs a measurement model from these factors. Model validation experiments show that the model fits a large portion of the open source software ecosystems on GitHub. This study can help participants in open-source software ecosystems choose suitable types of ecosystems, and it provides a basis for ecosystem health assessment for researchers interested in ecosystem quality.
The Open Source Software Ecosystem (OSSECO) is a new kind of ecosystem that draws on two research fields: open source software and software ecosystems. In the past few years, OSSECO, as a growing research field in software engineering, has attracted researchers’ attention. Manikas [
In fact, in the process of continuous development of OSSECO, productivity has an increasing impact on the health and evolution of OSSECO. The measurement of ecosystem productivity of open source software helps maintain the development efficiency of ecosystem participants and the ecosystem’s stability. How to evaluate and improve ecosystem productivity is a problem that cannot be ignored. Currently, Jansen [
GitHub is the most popular open-source platform, and more and more well-known open-source projects conduct their production activities there. To address these issues, we use the production characteristics of the open-source software ecosystem to define open-source software ecosystem productivity. Referring to the influencing factors of traditional software productivity and considering the data characteristics of OSSECO on GitHub, we use more than 200,000 data items from 10 popular OSSECOs on GitHub to analyze the factors affecting OSSECO productivity and draw conclusions. On this basis, we construct the Open Source Software Ecosystem Productivity model and the Net Productivity model. The validity of the models is then verified on 8 GitHub ecosystems that did not take part in model construction. Our research mainly involves the following questions:
This paper makes the following contributions: (1) It defines the productivity concept of OSSECO by combining the natural ecosystem and the business ecosystem; (2) It analyzes the factors that influence the productivity of OSSECO; (3) It constructs a productivity model of OSSECO and verifies the validity of the model.
The remainder of this article is structured as follows. Section 2 discusses the background of the OSSECO. Section 3 defines the concepts related to the productivity of the OSSECO. Sections 4 and 5 analyze the factors that influence the open source software ecosystem and construct a productivity model of the open source software ecosystem. Section 6 verifies the validity of the model. Section 7 presents the threats to the validity of our study. Section 8 concludes by summarising the main research findings and outlining future work.
In the last century, researchers believed that the focus of software process improvement was to improve software development productivity. A large number of studies discussed the influencing factors and measurement methods of software development productivity. In 1977, Walston et al. [
In the 21st century, the concept of “open source” has been recognized by more and more developers, and the open-source community is growing rapidly. Developers focus more on the quick update and iteration of code. The traditional factors and measures of software productivity are not suitable for the open-source software ecosystem.
In OSSECO research, researchers mostly regard productivity as a part of Open Source Software Ecosystem Health (OSSECOH) research. Wahyudin et al. [
More and more researchers regard “productivity” as an essential indicator of OSSECOH. According to the definition of software ecosystem health proposed by Mcgregor et al. [
However, these studies only explain the factors affecting ecosystem productivity at the qualitative level. They do not quantitatively explain the specific results (positive or negative) of these factors on productivity or explain why they affect ecosystem productivity.
GitHub is the most popular open-source platform, and more and more well-known open-source projects operate on it. This paper therefore conducts its research on open-source ecosystem productivity on the GitHub platform. Meanwhile, this section also answers
As with software ecosystems in general, it is difficult to give a unified definition of the open source software ecosystem, although many researchers have studied it from various angles. In virtually every paper, the authors propose a definition they consider reasonable. What is certain, however, is that the understanding of the open-source software ecosystem centers on two main perspectives: (1) Ecosystem perspective: researchers regard the open-source software ecosystem as a network consisting of participants, organizations, and symbiotic companies, so the research focuses on business goals; (2) Project-community perspective: researchers pay more attention to a set of projects and to the community’s technical and social influence. This paper adopts the project-community perspective.
The concept of productivity first appeared in the study of natural ecosystems, which generally refers to organisms’ ability to produce material and energy. Later, when measuring the business ecosystem’s vitality, the researchers used the concept of productivity and defined the business ecosystem’s productivity as the ability of interacting organizations and individuals to solve business problems [
Natural ecosystems also have a concept of net productivity. Net productivity refers to the accumulation rate of the organic matter that remains after respiratory consumption is removed. It is a significant indicator of the ecosystem and represents its actual production. In the OSSECO, not all Issues and PRs contribute to the final ecosystem; only PRs that have been merged contribute to the final ecosystem product. We therefore define the net productivity of the OSSECO by analogy with the natural ecosystem.
This section mainly studies RQ2 and RQ3. To analyze the factors affecting OSSECO productivity, we first use the API provided by GitHub to obtain the original data needed for the experiment and collate the original data set. We then conduct correlation analysis and sample covariance experiments to obtain the analysis results. The specific steps are shown in
This paper selects ten popular open-source software ecosystems on GitHub as research examples: bootstrap, awesome-python, rails, node, freeCodeCamp, TensorFlow, vue, oh-my-zsh, electron, and flutter. These ten ecosystems were selected for the following reasons: (1) High popularity and large data scale: all of them are currently under active development, and each has more than 30,000 stars; (2) Long life cycle: the earliest, rails, was released in April 2009, and the latest, TensorFlow, in November 2015; (3) Diversity: their release times and development languages differ, so the characteristics of the ecosystems themselves differ in addition to the characteristics of their participants. We collected data for all ten projects over their entire life cycles to determine the impact of project age on ecosystem productivity.
When processing the raw data, we mainly extract the author, release time, status (Open or Closed), closing time, number of comments, and other Issue and PR data. While inspecting the raw data, however, we found that some Issues consist only of meaningless remarks by visitors (people who neither develop nor use the project); these Issues are invalid and were left out. In addition, new Issues and PRs are not generated every day in every project, so we use one month as the statistics period. Even so, months with a count of 0 cannot be avoided; how we handle them is described later. The data set after deleting meaningless data is shown in
Ecosystem | Start time | End time | Issue | PR |
---|---|---|---|---|
awesome-python | 2014-06 | 2019-03 | 134 | 1129 |
bootstrap | 2011-08 | 2019-03 | 18353 | 9706 |
electron | 2013-05 | 2019-03 | 10494 | 7016 |
flutter | 2015-04 | 2019-03 | 18548 | 11701 |
freeCodeCamp | 2014-12 | 2019-03 | 13452 | 21844 |
node | 2014-11 | 2019-03 | 9677 | 17095 |
oh-my-zsh | 2009-08 | 2019-03 | 2782 | 4928 |
rails | 2009-04 | 2019-03 | 12480 | 23227 |
tensorflow | 2015-11 | 2019-03 | 16774 | 10420 |
vue | 2013-09 | 2019-03 | 7815 | 1473 |
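The monthly statistics period described above can be illustrated with a minimal sketch. It assumes that the creation date of each Issue or PR has already been extracted; the function name `monthly_counts` and the sample dates are hypothetical and are not taken from the paper's data set. Months without any new item are kept with a count of 0, which is the situation the text refers to.

```python
from collections import Counter
from datetime import date


def monthly_counts(created_dates, start, end):
    """Count items per calendar month from start to end (inclusive).

    Months in which nothing was created are kept with a count of 0.
    """
    counts = Counter((d.year, d.month) for d in created_dates)
    result = {}
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        result[f"{year:04d}-{month:02d}"] = counts.get((year, month), 0)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return result


# Hypothetical creation dates for illustration only.
sample_dates = [date(2019, 1, 5), date(2019, 1, 20), date(2019, 3, 2)]
per_month = monthly_counts(sample_dates, date(2019, 1, 1), date(2019, 3, 31))
```

The same helper can be applied separately to Issue and PR creation dates to obtain the two monthly series used in the analysis.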
Note that the amount of information produced by the ecosystem is collected per unit of time, with one month as the statistics period. The numbers of Issues and PRs represent the ecosystem’s ability to produce information. The speed at which Issues and PRs are closed indicates the ecosystem’s problem-solving ability. We consider a problem solved when an Issue or a PR is closed. The specific calculation method is
Barros [
Based on the questionnaire results, we finally selected the top six influencing factors for related research on the productivity of the OSSECO. These factors were chosen because more than 20 survey participants felt they would have an impact on the productivity of the OSSECO. Also these factors are easier to quantify on GitHub for our subsequent research.
Besides, in natural ecosystems, the factors affecting ecosystems’ net productivity are mainly ecosystem producers and consumers. Analogous to natural ecosystems, we suspect that the factors affecting net productivity in the OSSECO are mostly PR publishers (
In Section 4.1, we mentioned that some data in our dataset is zero. This is because some projects may not have new Issues or PRs in a month. We adopt a zero-inflated negative binomial regression model [
This section mainly conducts experimental analysis based on the factors affecting the productivity of the OSSECO selected in Section 4.1. The specific analysis results are as follows:
There is a strong correlation between the total number of participants and the total number of Issues. The lowest correlation is found in rails, with a coefficient of 0.608; in the other nine ecosystems the coefficient is above 0.9. The monthly change in the number of Issues also correlates strongly with the flow of participants: the lowest coefficient is 0.657 (rails) and the highest is 0.981 (freeCodeCamp). For PRs, the correlation coefficients between the total number of participants and the total number of PRs are all above 0.8 in the 10 ecosystems, and those between the flow of participants and the change in PRs are all above 0.65. We therefore conclude that the flow of participants strongly affects the capacity of an OSSECO to produce information, and that the number of participants directly affects the total amount of information the ecosystem produces.
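As an illustration of how such a coefficient is obtained, the following sketch computes the Pearson correlation coefficient between two monthly series in plain Python. The two series are hypothetical values chosen for the example and do not come from the paper's data set.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical monthly series (participants and new Issues per month).
participants = [12, 15, 21, 30, 28, 41]
issues = [20, 26, 33, 52, 47, 70]
r = pearson(participants, issues)
```

A value of `r` close to 1 means that months with more participants also tend to have more Issues, which is the kind of relationship reported above.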
Ecosystem | Issue | PR |
---|---|---|
awesome-python | –4.18333 | –56.79654676 |
bootstrap | –140.634 | –358.4186673 |
electron | –60.1554 | –97.97207303 |
flutter | –300.527 | –73.13788044 |
freeCodeCamp | –169.203 | –248.6738278 |
node | –313.414 | –91.20910575 |
oh-my-zsh | –218.244 | –331.0295191 |
rails | –126.141 | –70.34121343 |
tensorflow | –90.9408 | –402.1742851 |
vue | –21541.4 | –149.6525869 |
Since these two groups of data do not meet the requirements for calculating the Pearson correlation coefficient, we instead compute their sample covariance to determine whether the two groups of data change in the same direction or in opposite directions. The experimental results are shown in
This conclusion is surprising. To explain this phenomenon, we looked at Issues in the flutter project. We found that when participants publish more comments, other participants in the ecosystem have to deal with more related questions, so the original problem is solved more slowly. Of course, this does not mean that posting more comments is bad for the ecosystem: more comments can lead to the Issue being solved with higher quality.
The result is surprising because the number of stars in awesome-python is negatively correlated with productivity. Although the two are strongly correlated in the remaining nine ecosystems (the correlation coefficient between the number of stars and the number of Issues in the vue project is as high as 0.89), the negative correlation still needs to be explored further.
After examining the Issues and PRs of awesome-python one by one, we found that although the number of stars in the ecosystem increases linearly and steadily, the people who publish Issues and PRs each month are a fixed group of participants. To explain this phenomenon, it is necessary to know more about the nature of the awesome-python ecosystem: awesome-python is a Python resource list initiated and maintained by vinta. Users do not have many questions about the project, and the resource list does not change much; for most users it is enough simply to use it. So even though the number of awesome-python stars is negatively correlated with its productivity, the remaining nine ecosystems still indicate that the number of stars has a positive impact on an ecosystem’s ability to produce information.
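The sign test that the sample covariance provides can be sketched as follows. The function names and the example series are hypothetical; only the sign of the result is interpreted, because the magnitude of a covariance depends on the units of both series.

```python
def sample_covariance(x, y):
    """Unbiased sample covariance (divides by n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def trend(x, y):
    """Report whether two series tend to move together or in opposition."""
    cov = sample_covariance(x, y)
    if cov > 0:
        return "same direction"
    if cov < 0:
        return "opposite direction"
    return "no linear trend"


# Hypothetical example: one series rises while the other falls.
direction = trend([1, 2, 3], [3, 2, 1])
```

A negative covariance, as in this example, indicates that the two series change in opposite directions.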
Ecosystem | Issue | PR |
---|---|---|
awesome-python | –0.01974 | –0.177094527 |
bootstrap | –0.25225 | –0.274429471 |
electron | 0.160635 | 0.466614022 |
freeCodeCamp | –0.01661 | –0.158009826 |
node | 0.0859 | 0.201114512 |
rails | 0.066409 | 0.008779556 |
tensorflow | 0.108249 | –0.082886264 |
vue | 0.267018 | 0.404303975 |
Ecosystem | Issue | PR |
---|---|---|
awesome-python | –221.2716154 | –40.38022356 |
bootstrap | –240.9347433 | –35.19020435 |
electron | –464.3365183 | –60.24960226 |
flutter | –311.8260386 | –14.67524084 |
freeCodeCamp | –547.3907237 | –211.9252707 |
node | –2062.593537 | –153.9466754 |
oh-my-zsh | –1589.669618 | –888.101723 |
rails | –77.04126059 | –25.73836021 |
tensorflow | –138.4922134 | –94.05316346 |
vue | –140.5387838 | –126.7499909 |
The number of participants involved in the ecosystem did not positively affect the Issue and PR’s closing speed. On the contrary, the more participants involved in the ecosystem, the slower the closing speed of the Issue and PR is. This rule can also be found in the scatter diagram of Issue and PR closing speed changing with the number of participants participating in the project.
GitHub allows users to participate in multiple ecosystems simultaneously, which is quite different from traditional software production environments. Although participants gain experience as they take part in more ecosystems, working on numerous ecosystems at the same time makes them less focused. Based on this, this paper concludes that the more ecosystems participants take part in, the less attention they can devote to any one project, and the ecosystem’s ability to solve problems is reduced. However, this does not mean that participants’ experience harms the productivity of the ecosystem. In follow-up research, a more comprehensive measure of participants’ experience can be chosen for further study.
It can be seen from
The reason for these three different trends is that the ten ecosystems selected in this paper are currently at different stages. The awesome-python, electron, vue, and oh-my-zsh ecosystems are presently in the early stage of development. Users have not yet discovered many problems with these projects, so they do not open many Issues. Meanwhile, PRs come mostly from core members of the ecosystem, so the number of PRs is relatively stable. Node, flutter, and TensorFlow are in the active development stage. More and more users are involved in the development of these projects; they are willing to raise more Issues to help develop the ecosystem and to contribute their own PRs. Rails, freeCodeCamp, and bootstrap are in a stable development stage. Most Issues have been solved after a period of active development, and the project functions are relatively complete. Few PRs are needed because only a few bugs remain to be fixed, and the ecosystem gradually enters a more stable development stage.
However, it can still be determined that no matter what stage the ecosystem is in, the ecosystem’s age will impact its productivity.
The distribution of the number of PR publishers involved in the ecosystem and the number of followers of the publishers in 10 ecosystems are calculated, as shown in
In summary, this paper concludes that when PR publishers participate in more ecosystems or are followed by more people, they cannot invest much energy in any single ecosystem. Hence, the number of ecosystems that publishers participate in and their number of followers negatively affect ecosystems’ net productivity.
It can be seen that the more people are involved in PR review in the ecosystem, the lower the probability that a PR is adopted. The Pearson correlation coefficient between the two groups of data is –0.648, indicating a strong negative correlation. This paper concludes that the net productivity of an ecosystem tends to be lower when there are more reviewers, and that it is highest when there are between 10 and 20 reviewers.
Based on the analysis of the above factors, we got
To better understand the results of the quantitative analysis, we interviewed 36 OSSECO researchers about them. The interview results show that their views on the impact of factors such as the age of the OSSECO, the degree of popularity, the number of projects involved, and the number of reviewers on productivity are consistent with our analysis.
Besides, 91.7% of the researchers believed that the use of popular development languages would positively impact the productivity of the OSSECO. In contrast, three people believed that the development language would not affect the productivity of open-source software. They explained that on GitHub, developers tend to be more willing to join projects they are good at and like; once they have entered a project, however, the development language no longer affects their production activities. That is to say, the development language affects only the ecosystem’s ability to attract participants, not the productivity of the ecosystem. This explains why the popularity of development languages does not affect OSSECO productivity in our analysis.
Secondly, 83.3% of the researchers think that when participants communicate more, the open-source software ecosystem has a stronger ability to solve problems. However, our analysis results show that when participants publish comments, other participants in the ecosystem have to deal with more questions related to the Issue, so the Issue is closed more slowly. That is to say, the degree of communication among participants reduces ecosystem productivity. Six researchers accepted this result because they encounter the same situation in the ecosystems they participate in: a comment often leads to a series of more difficult problems, and solving an Issue or PR with high quality takes more time.
This section answers RQ2 and RQ3. By combining quantitative and qualitative research, we have identified the factors that affect ecosystem productivity and determined whether the effect of each factor on productivity is positive or negative.
This section builds an OSSECO productivity model based on the analysis results in Section 4. Since the analysis of the factors affecting productivity in Section 4 deals only with the relationship between the factors and productivity, it does not consider whether the factors affect each other. Therefore, the collinearity among various influencing factors should be considered when constructing the productivity model, and some variables should be eliminated. This also answers
In Section 4, it was found that the number of participants, age of ecosystems, and the degree of popularity all affected the ability of ecosystems to produce information. The number of participants involved in ecosystems and the degree of communication impact the ability of the ecosystem to solve problems.
First, we use multiple linear regression to model the information production capacity of the ecosystem. The number of participants, the ecosystem age, and the degree of popularity are added to the multiple regression model. We then check whether any two independent variables are linearly related. If they are, one of the two variables is removed, and its coefficient is set to 0 in the final regression equation. Performing multiple linear regression on all ten ecosystems yielded ten regression equations (see in
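The procedure described above can be sketched in plain Python as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the pairwise screening threshold of 0.9, the function names, and the sample data are all hypothetical. A variable that is strongly correlated with an already retained variable is dropped and reported with a coefficient of 0; the remaining variables are fitted by ordinary least squares through the normal equations.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def solve(matrix, rhs):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_with_screening(columns, y, threshold=0.9):
    """Least-squares fit that drops pairwise collinear variables.

    columns maps a variable name to its list of observations. Dropped
    variables are reported with a coefficient of 0. The threshold is an
    assumed value for illustration.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    rows = [[1.0] + [columns[k][i] for k in kept] for i in range(len(y))]
    p = len(kept) + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    coefs = {name: 0.0 for name in columns}
    coefs.update(dict(zip(kept, beta[1:])))
    coefs["const"] = beta[0]
    return coefs


# Hypothetical data: DP is an exact multiple of A, so DP is dropped.
data = {
    "A": [1, 2, 3, 4, 5, 6],
    "DP": [2, 4, 6, 8, 10, 12],
    "NP": [3, 1, 4, 1, 5, 9],
}
target = [2 * a + 1.5 * n + 4 for a, n in zip(data["A"], data["NP"])]
coefs = fit_with_screening(data, target)
```

Which of two collinear variables is kept depends on the order in which they are examined; the sketch keeps the first one it meets, which is a simplification.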
Ecosystems | Regression equation | A | DP | NP | c (constant) |
---|---|---|---|---|---|
awesome-python | OSEPP = 1.095c – 0.004 | 0 | 0 | 1.095 | –0.004 |
bootstrap | OSEPP = –0.005b + 1.489c + 131.538 | 0 | –0.005 | 1.489 | 131.538 |
electron | OSEPP = 0.928a – 0.002b + 1.413c + 3.529 | 0.928 | –0.002 | 1.413 | 3.529 |
flutter | OSEPP = –0.011b + 2.108c + 154.241 | 0 | –0.011 | 2.108 | 154.241 |
freeCodeCamp | OSEPP = 1.581c + 6.319 | 0 | 0 | 1.581 | 6.319 |
node | OSEPP = –4.237a + 0.005b + 1.152c − 0.12 | –4.237 | 0.005 | 1.152 | –0.120 |
oh-my-zsh | OSEPP = –0.04a + 1.121c + 0.648 | –0.040 | 0 | 1.121 | 0.648 |
rails | OSEPP = –3.640a + 0.01b + 1.643c + 13.426 | –3.640 | 0.010 | 1.643 | 13.426 |
tensorflow | OSEPP = –2.998a + 1.388c + 7.949 | –2.998 | 0 | 1.388 | 7.949 |
vue | OSEPP = –0.001b + 1.352c + 11.450 | 0 | –0.001 | 1.352 | 11.450 |
It can be found that the regression equations of ecosystems are different, and the coefficient of the degree of popularity in 10 ecosystems is small, which does not contribute much to the ecosystem productivity model. At the same time, this paper needs to find a regression model suitable for most ecosystems, so for the coefficients of age and number of participants, this paper uses the real discovery algorithm [
The real discovery algorithm takes the mean or median of the data sources as the initial center point. It uses a similarity (distance) measure to calculate a credibility weight for each data source relative to the other data. The position of the center point is then recalculated according to these weights, and the procedure is iterated until the center point no longer changes. The process is shown in
As noted in Section 4, ecosystem productivity gradually reaches a peak and stabilizes with age, so the ecosystem age no longer affects ecosystem productivity after it reaches a specific value. The final regression equation is Formula 1.
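The iteration described above can be sketched as follows. The paper does not state the exact weighting function in the text shown here, so the weight `1 / (1 + distance)` is an assumption made for illustration, as are the function name, the tolerance, and the iteration limit. The sample input is the NP column of the regression table.

```python
from statistics import median


def weighted_center(values, tol=1e-9, max_iter=100):
    """Iteratively re-weighted center of a list of numbers.

    Starts from the median, weights each value by its closeness to the
    current center (assumed weight: 1 / (1 + distance)), recomputes the
    center as the weighted mean, and stops when the center stabilizes.
    """
    center = median(values)
    for _ in range(max_iter):
        weights = [1.0 / (1.0 + abs(v - center)) for v in values]
        new_center = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_center - center) < tol:
            return new_center
        center = new_center
    return center


# NP coefficients of the ten per-ecosystem regression equations.
np_coefficients = [1.095, 1.489, 1.413, 2.108, 1.581,
                   1.152, 1.121, 1.643, 1.388, 1.352]
np_center = weighted_center(np_coefficients)
```

Values far from the bulk of the data receive smaller weights, so the resulting center is less sensitive to outlying coefficients than a plain arithmetic mean.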
when
Then we model the ability of the ecosystem to solve problems. The relationship between the degree of communication, the number of participants participating in the ecosystem, and the problem-solving ability of the ecosystem is nonlinear. Therefore, multiple linear regression cannot be used directly for modeling. Combining this with the definition of the closing speed of Issues in Section 4, we can perform a regression analysis on the variables of
OSSEP_V represents the ability of the OSSECO to solve problems, C is the degree of communication (comment number of Issue and PR), and N is the number of participants participating in the ecosystem. Formula 1 and Formula 2 together constitute the productivity model of OSSECO.
In Section 4, the factors affecting the net productivity of the OSSECO have been analyzed. The analysis results show that the more projects PR publishers participate in and the more people follow PR publishers, the lower the net productivity of the ecosystem. At the same time, the net productivity of the ecosystem decreases with more PR reviewers. The influence of PR publisher related factors and reviewers related factors are summarized as shown in
However,
The obtained model is Formula 3
In the formula, OSSENP refers to the net productivity of the OSSECO, that is, the proportion of PRs that are adopted. NR refers to the number of people participating in PR review in the ecosystem.
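For comparison with the modeled value, the observed adoption ratio can be computed directly from PR records. The sketch below assumes that each PR is represented as a dictionary with hypothetical `state` and `merged` fields and that the ratio is taken over closed PRs; both the field names and the choice of denominator are assumptions, not details confirmed by the paper.

```python
def observed_net_productivity(pull_requests):
    """Share of closed PRs that were merged (0.0 if there are none)."""
    closed = [pr for pr in pull_requests if pr["state"] == "closed"]
    if not closed:
        return 0.0
    merged = sum(1 for pr in closed if pr["merged"])
    return merged / len(closed)


# Hypothetical records for illustration only.
sample_prs = [
    {"state": "closed", "merged": True},
    {"state": "closed", "merged": False},
    {"state": "closed", "merged": False},
    {"state": "closed", "merged": False},
    {"state": "open", "merged": False},
]
ratio = observed_net_productivity(sample_prs)
```

Open PRs are ignored here because their outcome is not yet known.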
This section answers RQ4. In the construction of the productivity model, we eliminated the collinearity variables through multiple linear regression, and finally only retained the
To verify the effectiveness of the productivity model constructed in this paper, we use eight ecosystems that were not involved in the model construction: react-native, backbone, meteor, angular, jquery, axios, express, and puppeteer. The validation dataset includes all Issue, PR, and star data for these ecosystems from their first release until April 2019. This includes the release time, closing time, and publisher attributes of each Issue and PR, as well as the user and time of each star.
The
The difference between the model value and the real value of the high ecosystem productivity is tested. The results show that the
Then, the
Source of variation | SS | df | MS | F | P-value | F crit |
---|---|---|---|---|---|---|
Between groups | 693.8388 | 1 | 693.8388 | 1.617302 | 0.204902 | 3.886996 |
Within groups | 88376.08 | 206 | 429.0101 | | | |
Total | 89069.92 | 207 | | | | |
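The entries of this table can be checked with a few lines of arithmetic. The sketch below reads the columns as the usual one-way analysis of variance quantities (sum of squares, degrees of freedom, mean square, F statistic, p-value, critical F value); this reading is an assumption, since the column labels did not survive in the table.

```python
# Values copied from the table; their interpretation as SS, df and
# critical F value is an assumption.
ss_between, df_between = 693.8388, 1
ss_within, df_within = 88376.08, 206
f_critical = 3.886996

ms_between = ss_between / df_between   # mean square between groups
ms_within = ss_within / df_within      # mean square within groups
f_statistic = ms_between / ms_within   # ratio of the two mean squares

# The difference is significant only if F exceeds the critical value.
significant = f_statistic > f_critical
```

Under this reading, the F statistic is below the critical value, so the test does not indicate a significant difference between the two groups.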
Frustratingly, when the
To explain why this is the case, we examined all closed PRs in meteor. We found that in the ecosystem’s early period (before October 2015), only 13.1% of the PRs were merged. The net productivity of this period drags down the net productivity of the ecosystem over its whole life cycle. This is not the case for the 10 ecosystems used to construct the open source software ecosystem net productivity model, which is why there is a large gap between the modeled and actual values.
Therefore, the OSSECO productivity model constructed in this paper performs well and can be applied to most open-source software ecosystems on GitHub. However, the ecosystem net productivity model can only be applied to ecosystems whose net productivity is evenly distributed over time. More work needs to be done on the performance of the model.
There are two primary threats to the validity of this study. First, to obtain sufficiently large and realistic data sets, we chose the ecosystems with the highest star counts on GitHub. These ecosystems are mature and have stable productivity. However, it cannot be ignored that many ecosystems are in the initial stage of development and that their characteristics may differ from those of the ecosystems used to construct the models. Young ecosystems and ecosystems with few stars were also selected for model validation, and the validation results were satisfactory; nevertheless, the characteristics of the ecosystems used to build the model limit its scope of application. We need to select a broader range of ecosystems and use the research methods in this paper to build more accurate models in order to strengthen the internal validity of our work.
Secondly, only GitHub data are used in our research for model construction and verification. This is because GitHub is currently the largest open-source platform, and its ecosystem production data are sufficiently complete. However, the same ecosystem is often developed in several open source communities. Although the factors we chose to build the model can be measured with data available on each open-source platform, the validity of the model in other open source communities has not been verified. In future work, we need to integrate the production data of the same ecosystem from multiple platforms to optimize the model further and to ensure the external validity of our work.
First, this paper defines the concept of open-source software ecosystem productivity by analogy with related research and concepts from the natural ecosystem and the business ecosystem, and analyzes the factors that influence open source software ecosystem productivity with reference to research on traditional software productivity. Then this paper builds an open source software ecosystem productivity model. Based on the GitHub open-source platform, this productivity model measures the number of Issues and PRs in the ecosystem and how quickly Issues and PRs are resolved (closed). Finally, the validity of the model is verified. The ecosystem productivity model constructed in this paper performs well, and the results of the difference tests are satisfactory. However, the net productivity model needs to be optimized further.
Although this paper has analyzed the factors affecting the productivity of the open-source software ecosystem in some depth and built a productivity model on this basis, it still has shortcomings that call for more in-depth research in future work: (1) As mentioned in Section 6, the net productivity model of the open-source software ecosystem constructed in this paper does not perform well for ecosystems with an uneven distribution of net productivity. To improve the universality, validity, and fault tolerance of the net productivity model, more factors affecting net productivity need to be taken into account in subsequent studies. (2) The experimental data selected in this paper all come from GitHub, but the production activities of many open source software ecosystems are synchronized across multiple open-source platforms. A model built only from GitHub data can only be applied to GitHub. In future research, we need to consider using data from multiple platforms to study the productivity of open source software ecosystems from a more comprehensive perspective.