Topic on Talk:Continuous integration/Codehealth Pipeline

Should SonarQube Bot vote?

Summary by KHarlan (WMF)

We've changed the bot to use Verified +1 for quality gate pass, and 0 (comment only) for failure. Thanks everyone!

KHarlan (WMF) (talkcontribs)

And if so, how?

We recently transitioned the codehealth pipeline from reporting via a Jenkins pipeline to a bot application on Toolforge that comments (and votes) based on the SonarQube quality gate result.

In the #wikimedia-codehealth channel, we agreed that the bot could leave a +1 CR if the quality gate passed, and a 0 (comment only) if it failed.

Alternatives would be:

  • 0 (comment only) for pass or fail
  • 0 (comment only) for pass and -1 for fail
  • +1 verified for pass and 0 (comment only) for fail

As far as I can tell, the "Verified" label is open to interpretation. The Gerrit docs say:

> The Verified label was originally invented by the Android Open Source Project to mean 'compiles, passes basic unit tests'. Some CI tools expect to use the Verified label to vote on a change after running.

And for +1 it says "Compiled (and ran tests) successfully." along with "Any +1 enables submit." That seems a bit problematic, but in theory, if someone +2'ed a change that the bot had marked Verified +1 and that didn't pass the test pipeline, the test pipeline would still have to pass in gate-and-submit, so it would be safe enough.
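
For concreteness, here's roughly the shape of that mapping, assuming the bot reads the quality gate status from SonarQube's web API and posts its comment/vote through Gerrit's set-review REST endpoint (the host names, auth handling and helper names below are illustrative, not the bot's actual code):

```python
import requests

SONARQUBE = "https://sonarcloud.io"        # illustrative
GERRIT = "https://gerrit.wikimedia.org/r"  # illustrative

def quality_gate_status(project_key, branch):
    """Return the quality gate status ('OK' or 'ERROR') for an analysed branch."""
    resp = requests.get(
        f"{SONARQUBE}/api/qualitygates/project_status",
        params={"projectKey": project_key, "branch": branch},
    )
    resp.raise_for_status()
    return resp.json()["projectStatus"]["status"]

def post_review(change_id, revision, status, auth):
    """Comment on the change; vote Verified +1 on pass, 0 (comment only) on failure."""
    vote = 1 if status == "OK" else 0
    review = {
        "message": "SonarQube quality gate " + ("passed." if status == "OK" else "failed."),
        "labels": {"Verified": vote},
        "tag": "autogenerated:codehealth",  # lets the Gerrit UI filter bot messages
    }
    resp = requests.post(
        f"{GERRIT}/a/changes/{change_id}/revisions/{revision}/review",
        json=review,
        auth=auth,
    )
    resp.raise_for_status()
```

Whichever option we pick, only the status-to-vote mapping changes (e.g. -1 instead of 0 on failure), so switching later is cheap.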

KHarlan (WMF) (talkcontribs)

"Verified" seems a little problematic because the quality gate fails frequently enough (almost always due to insufficient code coverage in a patch) that the dashboard view of gerrit would show loads of red X indicators.

KHarlan (WMF) (talkcontribs)

It looks like "+1 Verified" in our instance means "Checked", so maybe we could use that and it would be meaningful enough.

GLederrey (WMF) (talkcontribs)

A few comments in no particular order:

The Gerrit docs also specify that "By default, a change is submittable when it gets at least one highest vote on each label and has no lowest vote (aka veto vote) on any label". We could create a new label for SonarQube so that its signal doesn't get mixed up with our regular builds, and we could encode higher-level logic as custom submit rules.

In the end, we already allow a human to override the tools and tag a build as Verified +2. I don't think we should encourage that, but it is possible. And I think it is good practice to make it clear that humans are smarter than tools.

In the short term, getting good feedback from SonarQube, in the form of textual comments that make it clear why a quality gate is failing, is much more important than having a vote. In the longer term, we should trust the SonarQube analysis enough to have an active vote that can both promote and block a CR from being merged.

If we get a lot of failed quality gates (because of insufficient code coverage or because of other rules), we should review our rules and thresholds. Or we should make it clear that we believe our current rules are the right ones and make sure we improve our code to meet the standard. Having a standard of quality which breaks often means that the standard does not agree with reality. We need to change either the standard or the reality.

I think that at the moment we don't have enough confidence in the quality gate result, and a less aggressive 0/+1 or 0/0 strategy makes sense. In the long term, I think we should aim to have SonarQube vote -1/+1.

AKlapper (WMF) (talkcontribs)

SonarQube Bot using CR+1 would make it impossible to find patches which have received a positive review from a human (and to ping someone who dares to potentially +2): see the Code Review backlog. Hence I'd prefer using the Verified label in Gerrit. (I would not see a problem with CR−1, though.)

KHarlan (WMF) (talkcontribs)

I think you can do `-reviewedby:kharlan+sonarqubebot@wikimedia.org` in the search query to filter out those reviews. You can also use "Show comments only" in the UI to hide messages from SonarQube Bot, but that doesn't help with hiding the CR+1.

> SonarQube Bot using CR+1 would make it impossible to find patches which have received a positive review from a human (and to ping someone who dares to potentially +2)

My hope would be that the presence of the CR+1 might result in more patches merged. But maybe Verified can help with that too.

AKlapper (WMF) (talkcontribs)

Hmm, true for Gerrit, and probably also for wikimedia.biterg.io if I can find the name of the reviewedby key in the Gerrit schema. But this would require communicating that knowledge: using CR-1 would basically require documenting another bullet point in Community metrics#Behavior that might surprise you when it comes to wikimedia.biterg.io.

KHarlan (WMF) (talkcontribs)

> If we get a lot of failed quality gates (because of insufficient code coverage or because of other rules), we should review our rules and thresholds.

Extensions which have the codehealth pipeline don't have the phpunit coverage pipeline job anymore; that job used to report a failed build if coverage failed to increase. It didn't vote, it just showed up as build succeeded or failed, and said whether coverage went up (not by how much). In contrast, SonarQube is set to check for 80% coverage on new code by default, so it's asking a lot more. Further, it wants to see that coverage for JavaScript code, whereas the previous job only looked at PHPUnit. And finally, our implementation only uses unit tests for calculating code coverage (see this thread). It doesn't seem possible to tell SonarQube to only assess PHP code for coverage metrics.

GLederrey (WMF) (talkcontribs)

Does this mean that we should lower the threshold from 80% to something more manageable? Note that "new code" in SonarQube means code over the last 30 days (configurable, it could be the previous version or a specific date), which means that the coverage does not need to increase systematically on each commit, but the overall trend should be over 80% coverage.

I remember that there was a way to configure quality gates to fail if coverage decreases (not using a specific threshold), but I can't find it in the interface. Maybe that was only with the hosted version? But that would have allowed something closer to what we had up to now.

Up to now, we did not fail on low JavaScript unit test coverage. Is it something we want to keep? Or does it make sense to also encourage / enforce unit testing of JS code? We can configure exclusions for the coverage report, but I think they apply globally, not just to the quality gates.

KHarlan (WMF) (talkcontribs)

> Note that "new code" in SonarQube means code over the last 30 days (configurable, it could be the previous version or a specific date),

I'm not so sure about that. Looking at https://sonarcloud.io/dashboard?id=mediawiki-extensions-GrowthExperiments&branch=558550-11 it seems that "new code" refers to the new lines of code added by the patch to these two files: https://sonarcloud.io/component_measures?branch=558550-11&id=mediawiki-extensions-GrowthExperiments&metric=new_coverage&view=list

> Up to now, we did not fail on low JavaScript unit test coverage. Is it something we want to keep? Or does it make sense to also encourage / enforce unit testing of JS code?

I can't find a way to ignore JS coverage as part of the quality gate. If you find a way in the SonarQube UI, please let me know.

It does make sense to encourage unit testing of JS code, but so much of our code is written in a way that makes it hard to cover with node-qunit tests. See also T212521 and the patch(es) there. Maybe that will change with the upcoming frontend architecture work.

MModell (WMF) (talkcontribs)

From Google's code review guidelines:

"In general, reviewers should favor approving a CL once it is in a state where it definitely improves the overall code health of the system being worked on, even if the CL isn’t perfect."

I think that's generally a good standard for code review, and it makes sense in this context as well. But I don't think tools should override the judgement of people. The status of code coverage is just one thing that can help decide whether the code under review is improving the health of the system being worked on. Ideally it would vote V+1 if coverage increases substantially, 0 if coverage remains nearly constant, and _maybe_ V-1 if coverage actually decreases substantially. It seems problematic that it considers 30 days of code instead of only the changes in the current Gerrit review.
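
Just to make that mapping concrete, here's a rough sketch assuming we could read overall coverage for the patch branch and the target branch from SonarQube's measures API; the "substantially" margin, branch names and helper names below are placeholders for illustration, not a worked-out proposal:

```python
import requests

SONARQUBE = "https://sonarcloud.io"  # illustrative

def coverage(project_key, branch):
    """Overall coverage (%) reported by SonarQube for one analysed branch."""
    resp = requests.get(
        f"{SONARQUBE}/api/measures/component",
        params={"component": project_key, "branch": branch, "metricKeys": "coverage"},
    )
    resp.raise_for_status()
    measures = resp.json()["component"]["measures"]
    return float(measures[0]["value"]) if measures else 0.0

def coverage_vote(project_key, patch_branch, target_branch="master", margin=1.0):
    """+1 if coverage rises by more than `margin` points, -1 if it drops by more, else 0."""
    delta = coverage(project_key, patch_branch) - coverage(project_key, target_branch)
    if delta > margin:
        return 1
    if delta < -margin:
        return -1
    return 0
```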

Tgr (WMF) (talkcontribs)

I'd avoid giving a -1 unless there is something definitely wrong with the patch. (I'm not a fan of -1 for style violations either, but that's another discussion.) As for +1: V+1 does not make the code submittable (we currently use it for non-trusted tests, i.e. automated tests on code written by someone not on the CI whitelist, in which case all tests which would execute that code are skipped and only linting and such is done), so I would use that if anything (or a new label, if it's worth the effort) and reserve CR for human judgement. Also, we currently use CR for finding patches which are easy to merge, or for filtering out already-reviewed patches, and automated code quality tests are not useful for either purpose.

Jdforrester (WMF) (talkcontribs)

> we currently use it for non-trusted tests, i.e. automated tests on code written by someone not on the CI whitelist, in which case all tests which would execute that code are skipped and only linting and such is done

Not for several months. V+1 isn't used by CI's main tools now.