User:MWang (WMF)/Draft/Experiment analysis, December 2021

In November 2019, the Growth team added the "newcomer tasks" feature to the newcomer homepage. This feature was deployed in a controlled experiment where users were randomly assigned to either a control group that did not get access to any of the team's features or a treatment group that did. We published an analysis of the effects of this feature in November 2020, based on data from the first six months after deployment.

We decided to gather new data and do another round of analysis focusing on the four key metrics from the first analysis: activation, retention, productivity, and revert rate. This was done for three reasons:


 * 1) The Guidance feature was not deployed during the original analysis.
 * 2) The original analysis used data from four wikis, whereas in 2021 the Growth tools were available on many more wikis.
 * 3) There were no significant changes to the Newcomer Tasks feature during this time period as the team was focusing on the Add a Link structured task.

We also took the opportunity to dig deeper into the effects of the Growth features by adding in data on whether a user appeared to be editing at the time of registration, and their responses to the Welcome Survey should they choose to respond to it. This provided us with new insight into areas where the features appear to be struggling as well as areas where they appear to be very successful.

Summary of findings
In general, the analysis of key metrics finds results similar to those of the 2020 analysis, except for productivity, where we find no change. Specifically, the results are:


 * Newcomers who get the Growth features are more likely to be "activated" (i.e. making a first article edit within 24 hours of registration).
 * We strongly believe they are also more likely to be retained as editors (i.e. returning to the wiki to edit on a different day) as a result of being more active on their first day.
 * The features do not appear to increase or decrease productivity (i.e. number of article edits made), and the treatment and control groups make constructive edits at the same rate (i.e. the revert rate is the same).

We provide more details on the specific results below.

Glossary

 * As of December 2021, almost all Wikipedia wikis have the Growth features. During the data gathering for this analysis, the number of Wikipedia wikis with the features was still limited. We gathered data from 16 wikis from February 2021, 17 wikis from March 2021, and 28 wikis from April 2021.
 * Not all newcomers receive Growth features; 20% of them are randomly chosen to get the default experience. The group with the features is the treatment group and the group with the default experience is the control group. Numbers that come from the default experience are called baseline numbers.
 * Activation is defined as a newcomer making their first edit within 24 hours of registration. The baseline activation rate is the activation rate with the default features, not the Growth features.
 * Constructive activation is defined as a newcomer making their first edit within 24 hours of registration, and that edit not being reverted within 48 hours. The baseline constructive activation rate is the rate for users with the default features, not the Growth features.
 * Retention is defined as a newcomer coming back on a different day in the two weeks following activation and making another edit. The baseline retention rate is the rate for users with the default features, not the Growth features. We can limit retention to constructive edits in the same way as we did for activation, and then get a baseline constructive retention rate.
 * Edit volume is the overall count of edits made in a user's first two weeks. The baseline edit volume is the count for users with the default features, not the Growth features.
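
The definitions above can be sketched in code. This is a minimal illustration, not the pipeline used for the analysis: the edit records (dictionaries with hypothetical `time` and `reverted_at` fields) are assumptions, "a different day" is interpreted as a different calendar date, and the two-week windows are taken literally from the glossary.

```python
from datetime import datetime, timedelta

DAY = timedelta(hours=24)
TWO_WEEKS = timedelta(days=14)
REVERT_WINDOW = timedelta(hours=48)


def activating_edit(registered, edits):
    """Earliest edit made within 24 hours of registration, or None."""
    early = [e for e in edits if registered <= e["time"] <= registered + DAY]
    return min(early, key=lambda e: e["time"]) if early else None


def is_constructive(edit):
    """True if the edit was not reverted within 48 hours of being made."""
    reverted_at = edit.get("reverted_at")
    return reverted_at is None or reverted_at - edit["time"] > REVERT_WINDOW


def newcomer_metrics(registered, edits):
    """Compute the glossary metrics for one newcomer (hypothetical data layout)."""
    first = activating_edit(registered, edits)
    activated = first is not None
    # Retained: another edit on a different calendar date within two weeks
    # of the activating edit.
    retained = activated and any(
        first["time"] < e["time"] <= first["time"] + TWO_WEEKS
        and e["time"].date() != first["time"].date()
        for e in edits
    )
    return {
        "activated": activated,
        "constructively_activated": activated and is_constructive(first),
        "retained": retained,
        "edit_volume": sum(
            1 for e in edits if registered <= e["time"] <= registered + TWO_WEEKS
        ),
    }


# Made-up example: the first edit is reverted two hours later, and the
# newcomer returns three days afterwards.
registered = datetime(2021, 2, 1, 12, 0)
edits = [
    {"time": datetime(2021, 2, 1, 13, 0), "reverted_at": datetime(2021, 2, 1, 15, 0)},
    {"time": datetime(2021, 2, 4, 9, 0), "reverted_at": None},
]
metrics = newcomer_metrics(registered, edits)
print(metrics)
```

In this made-up example the newcomer counts as activated and retained, but not as constructively activated, because the activating edit was reverted within 48 hours.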

Detailed findings
In this section we describe the specific impacts we've estimated from the controlled experiment. These are based on 244,060 new accounts registered across the wikis with the Growth features in February, March, and April 2021. For more specifics about our methodology, see "Methodology" below.





Activation
For this analysis, we focus on the Article and Article talk namespaces because 1) Newcomer Tasks is asking users to edit articles, and 2) the 2020 analysis found a significant positive effect on activation.


 * Activation: newcomers who get the Growth features are 2.3% more likely to make a first article edit. Across our dataset, the baseline activation rate in the Control group is 29.7%. In the Treatment group the activation rate is 30.4%, which is a 2.3% relative increase over the baseline.
 * Constructive activation: we find a larger effect of the Growth features when it comes to non-reverted edits in the Article and Article talk namespaces. Here the baseline constructive activation rate is 23.1%. The rate for users getting the Growth features is 23.8%, which is a 5.6% relative increase over the baseline.
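
To make the distinction between absolute and relative differences concrete, here is a small sketch using the rounded activation rates quoted above. The helper name `relative_increase` is ours. Because the published rates are rounded to one decimal, recomputing from them may not reproduce the reported relative figures exactly.

```python
def relative_increase(baseline, treatment):
    """Relative change of treatment over baseline, in percent."""
    return (treatment - baseline) / baseline * 100


control_rate = 29.7    # baseline activation rate (%), as reported above
treatment_rate = 30.4  # Treatment group activation rate (%), as reported above

absolute_diff = treatment_rate - control_rate
relative_diff = relative_increase(control_rate, treatment_rate)

print(f"absolute difference: {absolute_diff:.1f} percentage points")
print(f"relative increase:   {relative_diff:.1f}%")
```

The absolute difference is expressed in percentage points, while the relative increase expresses that difference as a share of the baseline rate.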

Retention
As in the 2020 experiment, retention is a much rarer occurrence than activation, and we have not found significant differences between the Treatment and Control groups as a whole. Instead, we continue to find evidence that retention is strongly associated with the amount of activity a newcomer has on their first day. Since the Growth features significantly increase the likelihood that a newcomer makes an article edit, and we find no difference between the Treatment and Control groups in the probability that activated users are retained, we can expect the increase in activation to translate into an increase in retention. For example, the baseline retention rate through constructive edits in the Article and Article talk namespaces is 3.3%. The rate for users getting the Growth features is 3.5%. This is a relative increase of 6.1% over the baseline, but the difference is not large enough for us to conclude that retention is significantly higher in the Treatment group.

Further below, we'll dig into retention for certain subsets of newcomers using an augmented dataset and show that there are certain subpopulations where retention is increased.

Productivity
Whereas our 2020 experiment found a significant increase in newcomer productivity, as measured by the number of edits made in the first two weeks after registration, in our 2021 experiment we find no difference between the Treatment and Control groups. The geometric mean number of Article and Article talk edits in the Control group is 1.69, whereas in the Treatment group it is 1.68. If we instead only count constructive Article and Article talk edits, the mean in the Control group is 1.36, whereas in the Treatment group it is 1.37. While it's positive to see that users getting the Growth features appear to make slightly more constructive article edits on average, these differences are too small to conclude that the groups are different.

Revert rate
As in our 2020 experiment, we find no clear difference between the Treatment and Control groups when it comes to the revert rates of the edits they make in the first two weeks after registration.

Editing at registration
Early on in the life of the current iteration of the Growth team, we did a short analysis to understand what context newcomers were in when they registered their account (see this analytics update from 2018-10-18). We applied a similar approach to our 2021 dataset, and then examined if there were differences in activation for users who were editing at registration.

We find significant differences in constructive article activation between users who appeared to be editing at registration and those who appeared to be reading, as shown in the table below.

In the table above, we can see that a large proportion (26%) of users appear to be editing at the time of registration. We can also see the difference in baseline activation rate, where those who were already editing are much more likely to activate, as one would expect. For users who were editing, getting the Growth features is associated with a relatively small but significant decrease in constructive article activation. The Growth features appear to have the opposite effect on the 74% of users who were not editing, where we see a moderate and significant increase in activation (+8.1%).

Users who sign up to read Wikipedia
In our initial report on Welcome Survey responses in Czech and Korean Wikipedia back in December 2018, one of the findings we highlighted was that many users answered the question “Why did you create your account today?” by saying they signed up to read Wikipedia (29% in Korean, 18% in Czech).

During the current analysis, we decided to augment our dataset with responses from the Welcome Survey to see what we could learn from them. One pattern that stood out to us was that users who responded that they signed up to read also saw a large improvement in activation rate. For example, users who signed up on the desktop website and responded that they had not edited Wikipedia before went from 3.8% constructive article activation in the Control group to 6.4% in the Treatment group (a relative increase of 67%).

In our analysis, we found that users who responded to the survey saying they had edited Wikipedia before had a much higher probability of constructive article activation, which is not surprising. We therefore chose to use this as a splitting factor in our analysis. We also removed all users who responded that they signed up to read Wikipedia but were editing at the time of registration, as these two signals contradict each other.

We also analyzed constructive article retention for these users and found results similar to those above. When we control for the amount of activity in the first 24 hours, there is no significant difference between the Control and Treatment groups.

In summary, we find results indicating that the Growth features greatly increase activation in newcomers who respond to the Welcome Survey saying they signed up to read Wikipedia, and that these users will be retained at a similar rate to users in the Control group who were as active as them during the first 24 hours after registration.

Retention for certain subsets
The section above that describes our overall results for retention found that there is no difference between the Treatment and Control groups when it comes to retention. This was found to be the case regardless of whether we control for the amount of activity the user has in their first 24 hours after registration.

When we added responses to the Welcome Survey, we found two subsets of users where the Treatment group has significantly increased retention relative to the Control group. Both of these subsets consist of desktop registrations by users who were not editing at registration. The difference between them is whether they responded that they had edited Wikipedia before.

For users who responded that they had edited Wikipedia before we found a significant increase in constructive article retention for the Treatment group. The baseline constructive article retention in the Control group for this subset of users is 16.7%. In the treatment group, retention is 18.9%, which is a relative increase of 12.6%. The Growth features also increase activation in this group, meaning that if we instead combine both measurements and calculate retained users based on all registrations we find a larger difference between the groups: a baseline overall retention of 5.8% in the Control group, and a retention of 6.9% in the Treatment group (a relative increase of 18.9%).
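
The way the two measurements combine can be illustrated with a back-of-the-envelope sketch. It assumes that the 16.7% and 18.9% figures are retention rates among activated users, so that retention across all registrations is the activation rate multiplied by retention among activated users. The implied activation rates computed here are derived from rounded figures for illustration only; they are not results reported by the analysis.

```python
def relative_increase(baseline, treatment):
    """Relative change of treatment over baseline, as a fraction."""
    return (treatment - baseline) / baseline


# Figures reported above for this subset, written as fractions.
retention_control, retention_treatment = 0.167, 0.189  # among activated users (assumed)
overall_control, overall_treatment = 0.058, 0.069      # among all registrations

# Assumption: overall = activation rate * retention among activated users,
# so the activation rate implied by the rounded figures is overall / retention.
activation_control = overall_control / retention_control
activation_treatment = overall_treatment / retention_treatment

print(f"implied activation, Control:   {activation_control:.1%}")
print(f"implied activation, Treatment: {activation_treatment:.1%}")
print(
    "relative increase in overall retention: "
    f"{relative_increase(overall_control, overall_treatment):.1%}"
)
```

Because both factors are higher in the Treatment group, their relative increases compound, which is why the difference based on all registrations is larger than the difference among activated users alone.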

For the group of users who responded that they had not edited Wikipedia before, we found no difference between the groups when controlling for the amount of activity during the first 24 hours, but a significant increase in retention when this control is removed. This means that for this subset of users, the Growth features lead to an increase in retention by making users more likely to edit on their first day.

Methodology
The Growth team initially deployed the newcomer tasks module to the Homepage on Czech, Korean, Vietnamese, and Arabic Wikipedias on November 21, 2019. In the time since then, the Growth features have been brought to many other Wikipedias. In this analysis, we use data from 27 different Wikipedia editions. On all of these, users were randomly assigned to either a treatment or control group. 80% of registrations were assigned to the treatment group, and these users received all Growth features (the Newcomer Homepage, Newcomer Tasks, the Help Panel, etc.) by default. The remaining 20% of registrations were assigned to the control group, where users do not have these features accessible by default.
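
As a rough illustration of an 80/20 random assignment, the sketch below draws a uniform random number per registration. This is an assumption made for illustration; the actual assignment mechanism used in production is not described here, and the names `TREATMENT_SHARE` and `assign_group` are hypothetical.

```python
import random

TREATMENT_SHARE = 0.8  # 80% treatment, 20% control, as described above


def assign_group(rng):
    """Assign one registration to a group with an independent random draw."""
    return "treatment" if rng.random() < TREATMENT_SHARE else "control"


rng = random.Random(2021)  # fixed seed so the sketch is reproducible
groups = [assign_group(rng) for _ in range(10_000)]
share = groups.count("treatment") / len(groups)
print(f"treatment share in this simulated sample: {share:.3f}")
```

With independent draws, the realized share in any sample is close to, but not exactly, 80%.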

Users can turn the Growth features on and off in their user preferences at any point. If we find indications that they've done so, we exclude them from analysis. We also exclude known test accounts, users who registered through the API (these are mainly app registrations), bot accounts, and accounts that are autocreated.
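
The exclusion rules above amount to a simple filter. The sketch below is one way to express it; the `Account` record and its field names are hypothetical and do not correspond to an actual database schema.

```python
from dataclasses import dataclass


@dataclass
class Account:
    """Hypothetical record with one flag per exclusion rule."""
    user_id: int
    changed_growth_preference: bool = False  # turned the features on or off
    is_test_account: bool = False
    registered_via_api: bool = False         # mainly app registrations
    is_bot: bool = False
    is_autocreated: bool = False


def include_in_analysis(account):
    """Keep only accounts that trigger none of the exclusion rules."""
    return not (
        account.changed_growth_preference
        or account.is_test_account
        or account.registered_via_api
        or account.is_bot
        or account.is_autocreated
    )


accounts = [
    Account(user_id=1),
    Account(user_id=2, registered_via_api=True),
    Account(user_id=3, changed_growth_preference=True),
]
kept = [a.user_id for a in accounts if include_in_analysis(a)]
print(kept)
```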

The dataset for this analysis contains 244,060 accounts registered in February, March, and April 2021. During that period, the features were rolled out on several of these wikis. In those cases, we exclude the month of deployment and instead use the following full months of data. This means that wikis that got the features in April 2021 are not part of the dataset.
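
The month-selection rule can be sketched as follows. The function name and the representation of months by their first day are our own choices. The sketch assumes that the deployment month is always excluded, regardless of the day of the month on which the deployment happened.

```python
from datetime import date

# The three study months, each represented by its first day.
STUDY_MONTHS = [date(2021, 2, 1), date(2021, 3, 1), date(2021, 4, 1)]


def months_in_dataset(deployed_on):
    """Study months that start after the month in which the features were deployed."""
    deployment_month = date(deployed_on.year, deployed_on.month, 1)
    return [month for month in STUDY_MONTHS if month > deployment_month]


# A wiki from the initial deployment contributes all three months.
print(months_in_dataset(date(2019, 11, 21)))
# A wiki that got the features during February 2021 contributes March and April.
print(months_in_dataset(date(2021, 2, 15)))
# A wiki that got the features during April 2021 contributes nothing.
print(months_in_dataset(date(2021, 4, 6)))
```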

Our analysis makes extensive use of multilevel (hierarchical) regression models, using the wiki as the grouping variable. This allows us to account for differences between the wikis in our analysis. For example, our activation models are multilevel logistic regression models, which means that they account for the inherent differences in activation rate between the wikis. We also know that editing activity follows a long-tailed distribution, and therefore model the number of edits made using a zero-inflated negative binomial distribution. This model is also multilevel to allow both zero-inflation and the negative binomial distribution to vary by wiki. Lastly, our revert rate analysis uses a zero-one-inflated beta distribution. This is because revert rates calculated across a time window tend to fall into one of three categories: 1) the user has all of their edits reverted (one-inflation), 2) the user has none of their edits reverted (zero-inflation), and 3) the user has some of their edits reverted (resulting in a beta distribution). We again use a multilevel model so that these are estimated per wiki.
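
To give a feel for the structure of two of these models, the sketch below shows how a random-intercept multilevel logistic model turns a wiki-specific offset and a treatment effect into an activation probability, and how a user's revert rate maps onto the three components of a zero-one-inflated beta model. It does not fit any model. All parameter values and wiki labels are made up for illustration and are not estimates from the analysis, and treating the wiki effect as an intercept offset only is an assumption.

```python
import math


def inverse_logit(x):
    """Map a value on the log-odds scale to a probability."""
    return 1 / (1 + math.exp(-x))


def activation_probability(global_intercept, wiki_offset, treatment_effect, treated):
    """Random-intercept logistic model: each wiki shifts the baseline log-odds."""
    log_odds = global_intercept + wiki_offset + (treatment_effect if treated else 0.0)
    return inverse_logit(log_odds)


def revert_rate_component(n_edits, n_reverted):
    """Component of a zero-one-inflated beta model that a revert rate falls into."""
    if n_edits <= 0:
        raise ValueError("revert rate is undefined for users without edits")
    if n_reverted == 0:
        return "zero-inflation"  # none of the edits were reverted
    if n_reverted == n_edits:
        return "one-inflation"   # all of the edits were reverted
    return "beta"                # rate strictly between 0 and 1


# Hypothetical parameters on the log-odds scale, for illustration only.
GLOBAL_INTERCEPT = -0.85
TREATMENT_EFFECT = 0.03
wiki_offsets = {"wiki_a": 0.20, "wiki_b": -0.15}

for wiki, offset in wiki_offsets.items():
    p_control = activation_probability(GLOBAL_INTERCEPT, offset, TREATMENT_EFFECT, False)
    p_treatment = activation_probability(GLOBAL_INTERCEPT, offset, TREATMENT_EFFECT, True)
    print(f"{wiki}: control {p_control:.3f}, treatment {p_treatment:.3f}")

print(revert_rate_component(n_edits=5, n_reverted=2))
```

The same treatment effect on the log-odds scale yields different absolute changes in probability on wikis with different baselines, which is one reason to model the wiki explicitly rather than pool all registrations.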