Talk:Continuous integration/Codehealth Pipeline


"Overriding methods should do more than simply call the same method in the super class" doesn't check visibility

Daimona Eaytoy (talkcontribs)

See here: the method is redeclared to widen visibility from private to public.

The description on sonarcloud is, as usual, very interesting:

> "Overriding a method just to call the same method from the super class without performing any other actions is useless and misleading. The only time this is justified is in final overriding methods, where the effect is to lock in the parent class behavior."

Welp, sounds like they missed another compliant case.
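
To illustrate, a minimal sketch of the pattern (class and method names are hypothetical). The case above widens visibility from private to public; since PHP only allows a `parent::` call when the parent method is at least protected, this runnable sketch widens protected to public instead:

class Base {
   protected function getName(): string {
      return 'base';
   }
}

class Sub extends Base {
   // Flagged by the rule as a useless override, yet it isn't: it widens
   // visibility so code outside the class hierarchy can call getName().
   public function getName(): string {
      return parent::getName();
   }
}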

Reply to ""Overriding methods should do more than simply call the same method in the super class" doesn't check visibility"

"Remove this redundant jump" giving bad advice

Daimona Eaytoy (talkcontribs)

See here. Simplified example:

foreach ( $foo as $bar ) {
   if ( foo1() ) {
      doSomething1();
   } elseif ( foo2() ) {
      continue; // <-- faulty line
   } else {
      doSomething2();
   }
}

The continue line is reported by SonarQube as non-compliant, with the message "Remove this redundant jump." On sonarcloud.io there's an interesting description for this issue: "jump statements that direct the control flow to the original direction are just a waste of keystrokes."

I have a very different idea of "waste of keystrokes". Jump statements make code like that more resilient and reliable. Without the continue, I might some day add instructions after the conditional (but inside the loop) without any tool warning me about my mistake. So no, this is not a waste of keystrokes.
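
To sketch that failure mode (same hypothetical names as above): if the continue is removed and someone later appends code inside the loop, the foo2() branch silently starts executing it too:

foreach ( $foo as $bar ) {
   if ( foo1() ) {
      doSomething1();
   } elseif ( foo2() ) {
      // continue removed, per SonarQube's advice
   } else {
      doSomething2();
   }
   doSomething3(); // added months later; now also runs when foo2() is true
}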

Reply to ""Remove this redundant jump" giving bad advice"

Code coverage for extensions

André Costa (WMSE) (talkcontribs)

Looks like all extensions are getting a 0.0% coverage rating. Since there are other jobs for checking coverage, could that report simply be suppressed, at least for extensions? (Fixing the underlying issue is also fine but feels like more work).

KHarlan (WMF) (talkcontribs)

@André Costa (WMSE) you have to sort from the other direction, see here. The extensions with 0% coverage either 1) were part of an initial import and have not been added to the codehealth pipeline in CI (we had capacity issues, so we didn't add all extensions), or 2) don't have unit tests, only integration tests, and we only use unit tests for calculating the coverage.

André Costa (WMSE) (talkcontribs)

@KHarlan (WMF) Thanks, I missed that there were some extensions among the services.

I actually came to this issue from a patch which got a weird coverage report, but I guess the reason might be that the master branch hasn't been re-analysed since the initial import, meaning coverage comparisons fail in a weird way.

André Costa (WMSE) (talkcontribs)
KHarlan (WMF) (talkcontribs)

Hi @André Costa (WMSE), it looks like it's scenario 2 that I referenced earlier -- the tests in that extension don't extend MediaWikiUnitTestCase. (Those tests also need to be in `tests/phpunit/unit`.) See also task T230423.
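
For reference, a minimal example of a test that would be counted (the "Example" extension and class names are hypothetical; the base class and directory layout are the actual requirements):

<?php
// tests/phpunit/unit/ExampleTest.php
namespace MediaWiki\Extension\Example\Tests\Unit;

use MediaWikiUnitTestCase;

/**
 * Counted toward coverage: extends MediaWikiUnitTestCase and lives under
 * tests/phpunit/unit. An integration test extending
 * MediaWikiIntegrationTestCase would not be counted.
 */
class ExampleTest extends MediaWikiUnitTestCase {
   public function testSomething() {
      $this->assertTrue( true );
   }
}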

André Costa (WMSE) (talkcontribs)
KHarlan (WMF) (talkcontribs)

Yes, the documentation is in need of updating for sure. I added a small entry there but please feel free to expand or modify based on your experience. Thanks!

DannyS712 (talkcontribs)
KHarlan (WMF) (talkcontribs)
DannyS712 (talkcontribs)

Ah, I didn't realize that was missing - thanks

Reply to "Code coverage for extensions"

Functions should not contain too many return statements

KHarlan (WMF) (talkcontribs)

In task T256942, there is a request to deactivate the "Functions should not contain too many return statements" rule. I've deactivated it (for PHP analysis) while we discuss whether it's useful to re-enable.

Personally, I sometimes find it helpful as a reminder that a method is doing too many things; other times I ignore it as it doesn't make sense. I don't have an opinion on whether we should keep it enabled or disabled for the MediaWiki PHP profile.

DannyS712 (talkcontribs)

It's a personal coding style choice, and often more returns can make something more readable, not less; e.g. for any possible error that is found, return early rather than wrapping the rest of the code in a giant `if` block to ensure there isn't an error. I fully support deactivating the inline comments from the bot, though if it wants to take the returns into account and note them in its general comment for the patch set, that would be not as annoying imo.
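
For example (hypothetical code), the early-return style the rule penalizes versus the nested style it implicitly encourages:

// Early returns: each error check exits immediately.
function processEarly( ?array $input ): ?string {
   if ( $input === null ) {
      return null;
   }
   if ( !isset( $input['name'] ) ) {
      return null;
   }
   return strtoupper( $input['name'] );
}

// Single return: the happy path ends up buried inside guard conditions.
function processNested( ?array $input ): ?string {
   $result = null;
   if ( $input !== null ) {
      if ( isset( $input['name'] ) ) {
         $result = strtoupper( $input['name'] );
      }
   }
   return $result;
}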

KHarlan (WMF) (talkcontribs)

> I fully support deactivating the inline comments from the bot, though if it wants to take the returns into account and note them in its general comment for the patch set, that would be not as annoying imo

Yes, that's an option. I started prototyping that when we were on Gerrit 2.15, because robot comments were barely supported there. Because you can't use HTML in your comments, you'd end up with the current general SonarQube bot comment followed by something like:

I guess this isn't so bad; the overall goal is for a human to at least scan the comment to see if it's something worth addressing, which the previous implementation didn't do (I imagine a lot of people just ignored the "SonarQube build failed" message). Do others have feedback on this?

Reply to "Functions should not contain too many return statements"

Duplicate lines density was miscalculated

Huji (talkcontribs)
Reply to "Duplicate lines density was miscalculated"

Code coverage statistics are misleading

TJones (WMF) (talkcontribs)

Here's an example of a recent SonarQube report, which is shown on Gerrit as failing because only 67.9% of "new code" (not just from my patch) is covered by tests. However, after merging this patch, it estimates that about 74.1% of the code will be covered. Digging deeper, I found this report, which shows 96.5% coverage on my new code. I feel like the 67.9% number is effectively making my patch responsible for all the uncovered code in the repo. Is that what we want to do? I don't really feel qualified to write tests for some of the code outside the analysis/ directory I'm working in with this patch, so I could never get this to pass—and it might be a pain for my reviewers to review a bunch of completely unrelated tests at the same time, too.

I think the new code coverage metric (96.5%, in this case) should be the pass/fail metric, while the before-and-after repo-level coverage (67.9% → 74.1%, in this case) would be very informative, but not part of the pass/fail (or maybe pass only if it increases). A link to the new code report would also do more to encourage me to improve the coverage of my new code, and of the existing code that I'm working with and am possibly more familiar with.

All that said, I love the SonarQube reports and I am very happy to have them available. I'd just like the snapshot shown on Gerrit to be more relevant and informative. Thanks!

Update: Ugh, I just realized that the 96.5% coverage report that I linked to is for that directory, which in this case happens to line up with my patch, but in general it is not that easy to get the info. Bummer. If coverage for the current patch is not available, showing the "67.9% → 74.1%" metric would still be useful and let us know we are moving in the right direction. (And if coverage for just the current patch is available, please tell me where to find it!)

GLederrey (WMF) (talkcontribs)

Having a good quality gate on code coverage is hard! At the moment, new code is defined as code created over the last 30 days. We could also configure that as code since the previous commit, which would give feedback on just your work. The drawback is that there are a lot of cases where it makes sense for an individual patch to have low coverage (some code makes less sense to test than other code). Having a sliding window over 30 days is a compromise: it tries to care about the overall quality of a project, but with a focus on new code versus old code that we're never going to touch (and so should not invest in improving).

I think we should be able to get the info about coverage of a specific patch, but I could not find it in the UI (at least not without reconfiguring the quality gates to only care about changes since the previous commit).
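
For reference, this is roughly how that window is configured (the property name is from the SonarQube documentation of this era; it is usually set per project in the UI or via the API, shown here in properties form):

# "New code" / leak period definition:
sonar.leak.period=30                   # code changed in the last 30 days
# sonar.leak.period=previous_version   # or: since the previous analyzed version
# sonar.leak.period=2019-12-01         # or: since a fixed date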

Reply to "Code coverage statistics are misleading"
Bawolff (talkcontribs)
KHarlan (WMF) (talkcontribs)

Well, that's a first. I'm not sure what would have caused this.

Should SonarQube Bot vote?

Summary by KHarlan (WMF)

We've changed the bot to use Verified +1 for quality gate pass, and 0 (comment only) for failure. Thanks everyone!

KHarlan (WMF) (talkcontribs)

And if so, how?

We recently transitioned the codehealth pipeline from reporting via a Jenkins pipeline to a bot application on Toolforge that comments (and votes) based on the SonarQube quality gate result.

In the #wikimedia-codehealth channel, we agreed that the bot could leave a +1 CR if the quality gate passed, and a 0 (comment only) if it failed.

Alternatives would be:

  • 0 (comment only) for pass or fail
  • 0 (comment only) for pass and -1 for fail
  • +1 verified for pass and 0 (comment only) for fail

As far as I can tell, the "Verified" label is open to interpretation. The Gerrit docs say:

> The Verified label was originally invented by the Android Open Source Project to mean 'compiles, passes basic unit tests'. Some CI tools expect to use the Verified label to vote on a change after running.

And for +1 it says "Compiled (and ran tests) successfully.", along with "Any +1 enables submit." That seems a bit problematic, but in theory, if someone +2'ed something that was Verified +1 by the bot but didn't pass the test pipeline, we'd still need the test pipeline to pass in gate-and-submit, so it would be safe enough.

KHarlan (WMF) (talkcontribs)

"Verified" seems a little problematic because the quality gate fails frequently enough (almost always due to insufficient code coverage in a patch) that the dashboard view of gerrit would show loads of red X indicators.

KHarlan (WMF) (talkcontribs)

It looks like "+1 Verified" in our instance means "Checked", so maybe we could use that and it would be meaningful enough.

GLederrey (WMF) (talkcontribs)

A few comments in no particular order:

The Gerrit docs also specify that "By default, a change is submittable when it gets at least one highest vote on each label and has no lowest vote (aka veto vote) on any label". We could create a new label for SonarQube so the signal doesn't get mixed up between our regular builds and the SonarQube analysis. And we could encode higher-level logic as custom submit rules.

In the end, we already allow a human to override the tools and tag a build as +2 verified. I don't think we should encourage that, but it is possible. And I think it is a good practice to make it clear that humans are smarter than tools.

In the short term, getting good feedback from SonarQube, in the form of textual comments that are clear about why a quality gate is failing, is much more important than having a vote. In the longer term, we should trust the SonarQube analysis enough to have an active vote, both to promote and to block a CR from being merged.

If we get a lot of failed quality gates (because of insufficient code coverage or because of other rules), we should review our rules and thresholds. Or we should make it clear that we believe our current rules are the right ones and make sure we improve our code to meet the standard. Having a standard of quality which breaks often means that the standard does not agree with reality. We need to change either the standard or the reality.

I think that at the moment we don't have enough confidence in the Quality Gate result, and a less aggressive 0/+1 or 0/0 strategy makes sense. In the long term, I think we should aim to have SonarQube vote -1/+1.

AKlapper (WMF) (talkcontribs)

SonarQube bot using CR+1 would make it impossible to find patches which have received a positive review from humans (and to ping someone who dares to potentially +2). Code Review backlog. Hence I'd prefer using the Verified label in Gerrit. (I would not see a problem with CR−1, though.)

KHarlan (WMF) (talkcontribs)

I think you can do `-reviewedby:kharlan+sonarqubebot@wikimedia.org` in the search query to filter out those reviews. You can also do "Show comments only" in the UI to filter out messages from SonarQube Bot but that doesn't help with hiding the CR+1.

> SonarQube bot using CR+1 would make it impossible to find patches which have received a positive review from humans (and to ping someone who dares to potentially +2)

My hope would be that the presence of the CR+1 might result in more patches merged. But maybe Verified can help with that too.

AKlapper (WMF) (talkcontribs)

Hmm, true for Gerrit, and probably also for wikimedia.biterg.io if I can find the name of the reviewedby key in the Gerrit scheme, but this would require communicating that knowledge: using CR-1 would basically require documenting another bullet point in Community metrics#Behavior that might surprise you when it comes to wikimedia.biterg.io.

KHarlan (WMF) (talkcontribs)

> If we get a lot of failed quality gates (because of insufficient code coverage or because of other rules), we should review our rules and thresholds.

Extensions on the codehealth pipeline don't have the phpunit coverage pipeline job anymore; that job used to report a failed build if coverage failed to increase. It didn't vote, it just showed up as a succeeded or failed build and said whether coverage went up (not by how much). In contrast, SonarQube is set to check for 80% coverage on new code by default, so it's asking a lot more. Further, it wants to see that coverage for JavaScript code, whereas the previous job only looked at PHPUnit. And finally, our implementation only uses unit tests for calculating code coverage (see this thread). It doesn't seem possible to tell SonarQube to only assess PHP code for coverage metrics.

GLederrey (WMF) (talkcontribs)

Does this mean that we should lower the threshold from 80% to something more manageable? Note that "new code" in SonarQube means code over the last 30 days (configurable, it could be the previous version or a specific date), which means that the coverage does not need to increase systematically on each commit, but the overall trend should be over 80% coverage.

I remember that there was a way to configure quality gates to fail if coverage decreases (not using a specific threshold), but I can't find it in the interface. Maybe that was only with the hosted version? But that would have allowed something closer to what we had up to now.

Up to now, we did not fail on low javascript unit test coverage. Is it something we want to keep? Or does it make sense to also encourage / enforce unit testing of JS code? We can configure exclusions to coverage report, but I think it applies globally, not just to the quality gates.

KHarlan (WMF) (talkcontribs)
Note that "new code" in SonarQube means code over the last 30 days (configurable, it could be the previous version or a specific date),

I'm not so sure about that. Looking at https://sonarcloud.io/dashboard?id=mediawiki-extensions-GrowthExperiments&branch=558550-11 it seems to indicate that "new code" means the new lines of code added in the patch to these two files: https://sonarcloud.io/component_measures?branch=558550-11&id=mediawiki-extensions-GrowthExperiments&metric=new_coverage&view=list

> Up to now, we did not fail on low javascript unit test coverage. Is it something we want to keep? Or does it make sense to also encourage / enforce unit testing of JS code?

I can't find a way to ignore JS coverage as part of the quality gate. If you find a way in the SonarQube UI please let me know.

It does make sense to encourage unit testing JS code, but so much of our code is written in a way that is hard to write node-qunit tests for. See also T212521 and the patch(es) there. Maybe that will change with the upcoming frontend architecture work.

MModell (WMF) (talkcontribs)

From Google's code review guidelines:

"In general, reviewers should favor approving a CL once it is in a state where it definitely improves the overall code health of the system being worked on, even if the CL isn’t perfect."

I think that's generally a good standard for code review, and I think it makes sense in this context as well. But generally, I don't think that tools should override the judgement of people. The status of code coverage is just one thing that can help decide whether the code under review is improving the health of the system being worked on. I'd say ideally it should vote V+1 if coverage increases substantially, 0 if coverage remains nearly constant, and _maybe_ V-1 if code coverage actually decreases substantially. It seems problematic that it considers 30 days of code instead of only the changes in the current gerrit review.

Tgr (WMF) (talkcontribs)

I'd avoid giving a -1 unless there is something definitely wrong with the patch. (I'm not a fan of -1 for style violations either, but that's another discussion.) As for +1: V+1 does not make the code submittable (we currently use it for non-trusted tests, i.e. automated tests on code written by someone not on the CI whitelist, in which case all tests which would execute that code are skipped and only linting and such is done), so I would use that if anything (or a new label, if it's worth the effort) and reserve CR for human judgement -- also because we currently use CR for finding patches which are easy to merge, or for filtering out already-reviewed patches, and automated code quality tests are not useful for either purpose.

Jdforrester (WMF) (talkcontribs)

> we currently use it for non-trusted tests, i.e. automated tests on code written by someone not on the CI whitelist, in which case all tests which would execute that code are skipped and only linting and such is done

Not for several months. V+1 isn't used by CI's main tools now.

Javascript code coverage collection broke between 18th and 20th December 2019

Michael Große (WMDE) (talkcontribs)

There must have been some change to the SonarQube configuration between 2019-12-18 and 2019-12-20 that caused our JavaScript code coverage collection to fail:


Patch Set 2 at 2019-12-18: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/558150#message-0d1eb9d153d132b84a5888510886209927bf23f5

Patch Set 3 at 2019-12-20: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/558150#message-aa2cef394e340d66bfc2dcd0a5a809cef82e4ce1


Also note how the estimated total drops from 3.7% to 0.0% :/


Any idea what is going on? @KHarlan (WMF)

Jdforrester (WMF) (talkcontribs)

Timeline fits with 19e934d1b3a9e5677fa75a7a9e1b426adb5a47c5 / f1c8efea9604e66881c9f0800a745aec63726efe.

KHarlan (WMF) (talkcontribs)

I'm not sure how those changes would have impacted this.

JavaScript coverage is supposed to be generated by `npm run-script test:unit`, which Wikibase's package.json doesn't have, so I'm not sure how coverage was gathered before.
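
For reference, that means a package.json entry along these lines (the specific coverage tooling shown is just one plausible setup, not necessarily what the pipeline expects):

{
  "scripts": {
    "test:unit": "nyc --reporter=lcov qunit tests/unit/"
  }
}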

Michael Große (WMDE) (talkcontribs)

Oh, this is embarrassing. It seems we (WMDE) broke this ourselves in 91ba57? I really should have thought of checking this first :/

Thank you for pointing me to npm run-script test:unit!


I'm very sorry for all the noise 🙈


PS: I noticed that the code health pipeline is also run for WIP builds, but the result is not added as a comment in Gerrit. See https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/562833 and https://integration.wikimedia.org/ci/job/mwext-codehealth-patch/10147/

Is that intentional?

KHarlan (WMF) (talkcontribs)

No worries!

There are some quirks with how the bot posts to gerrit reviews (sometimes we get a 403 -- maybe that's related to the WIP status?). I'll file an issue and look into it more.

Conditional branch coverage indication

EGardner (WMF) (talkcontribs)

Having visual indicators for which conditional branches within a method are covered by tests (green) and which are not (red) is really helpful. I find that this encourages thoroughness and helps in remembering all the different situations that need to be addressed in a test suite.

You can see an example of such a report here: https://sonarcloud.io/component_measures?branch=515002&id=mediawiki-extensions-WikibaseMediaInfo&metric=new_coverage&selected=mediawiki-extensions-WikibaseMediaInfo%3Aresources%2Fstatements%2FNewQualifierWidget.js&view=list
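
As a tiny illustration of what those indicators correspond to (hypothetical code): if the test suite only ever calls format( true ), the report marks the first return green and the second one red:

function format( bool $short ): string {
   if ( $short ) {
      return 'S';
   }
   return 'LONG';
}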

KHarlan (WMF) (talkcontribs)