Topic on User talk:Brooke Vibber/Compacting the revision table round 2

User entry table and revision table

Jtmorgan (talkcontribs)

Most of the time, when I query the revision table (or revision_userindex), I'm grouping by or joining on user_ids. Having to join to an additional table (user_entry) sounds like it will make this process more labor-intensive. What do we gain by removing user_id from this table?
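
To make the concern concrete, here is a rough sketch of what a per-user edit count might look like once the user id lives in a separate table. The schema is a simplified guess for illustration only: the column names `ue_id`, `ue_user_id`, `ue_text` and `rev_user_entry` are assumptions, not the names in the actual proposal, and SQLite stands in for the production database.

```python
import sqlite3

# Hypothetical, simplified schema; the real proposal's column names may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_entry (
        ue_id      INTEGER PRIMARY KEY,
        ue_user_id INTEGER,          -- NULL for IP (anonymous) editors
        ue_text    TEXT NOT NULL     -- user name or IP address
    );
    CREATE TABLE revision (
        rev_id         INTEGER PRIMARY KEY,
        rev_user_entry INTEGER NOT NULL REFERENCES user_entry(ue_id)
    );
    INSERT INTO user_entry VALUES (1, 42, 'Alice'), (2, NULL, '192.0.2.7');
    INSERT INTO revision VALUES (100, 1), (101, 1), (102, 2);
""")


def edits_per_registered_user(conn):
    """Count revisions per user id; needs one join that grouping on rev_user did not."""
    return conn.execute("""
        SELECT ue.ue_user_id, COUNT(*) AS edits
        FROM revision AS r
        JOIN user_entry AS ue ON ue.ue_id = r.rev_user_entry
        WHERE ue.ue_user_id IS NOT NULL
        GROUP BY ue.ue_user_id
        ORDER BY ue.ue_user_id
    """).fetchall()


print(edits_per_registered_user(conn))
```

Under these assumptions the extra work for an analyst is one `JOIN` clause per query; whether it also costs query time is a separate question that depends on the indexes.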

Jdforrester (WMF) (talkcontribs)
  • Different IPs become distinguishable without having to examine user_text in a special case.
  • Renaming accounts (currently a very DB-intensive action that can't happen for accounts with "too many" edits even for stewards) becomes a much simpler operation.
  • It makes possible any future change to how IPs are handled (e.g. offer them the ability to "upgrade" to an account and claim their last few edits)
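
A minimal sketch of the first two points, assuming a hypothetical `user_entry` table keyed by `ue_id` (the table layout and column names here are illustrative guesses, not the proposed schema): a rename touches one row no matter how many revisions the account has, and distinct IP editors can be counted by entry id alone.

```python
import sqlite3

# Hypothetical schema for illustration; SQLite stands in for the real database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_entry (
        ue_id      INTEGER PRIMARY KEY,
        ue_user_id INTEGER,          -- NULL for IP editors
        ue_text    TEXT NOT NULL
    );
    CREATE TABLE revision (
        rev_id         INTEGER PRIMARY KEY,
        rev_user_entry INTEGER NOT NULL
    );
    INSERT INTO user_entry VALUES
        (1, 42, 'OldName'), (2, NULL, '192.0.2.7'), (3, NULL, '198.51.100.9');
    INSERT INTO revision VALUES (100, 1), (101, 1), (102, 2), (103, 3);
""")


def rename_user(conn, user_id, new_name):
    """Rename an account; returns the number of rows that had to be rewritten."""
    cur = conn.execute(
        "UPDATE user_entry SET ue_text = ? WHERE ue_user_id = ?",
        (new_name, user_id),
    )
    conn.commit()
    return cur.rowcount


def distinct_ip_editors(conn):
    """Count distinct IP editors using only entry ids, never comparing IP strings."""
    return conn.execute("""
        SELECT COUNT(DISTINCT r.rev_user_entry)
        FROM revision AS r
        JOIN user_entry AS ue ON ue.ue_id = r.rev_user_entry
        WHERE ue.ue_user_id IS NULL
    """).fetchone()[0]


print(rename_user(conn, 42, "NewName"), distinct_ip_editors(conn))
```

In this toy data set the rename rewrites a single `user_entry` row even though the account has two revisions; the revision rows themselves are never touched.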
This, that and the other (talkcontribs)
"Different IPs become distinguishable without having to examine user_text in a special case."
This seems like an extremely minor benefit, if it is even one at all. In what part of the MW codebase do we need to distinguish different IPs without looking at their actual IP addresses? (I can think of counting the number of distinct IPs to have edited a page, but that information wouldn't be very useful, and I can't imagine why we would be doing that.)
"Renaming accounts (currently a very DB-intensive action that can't happen for accounts with "too many" edits even for stewards) becomes a much simpler operation."
This could be achieved by simply making rev_user_text NULL when rev_user is nonzero, and requiring a join onto user to fetch user names.
"It makes possible any future change to how IPs are handled (e.g. offer them the ability to "upgrade" to an account and claim their last few edits)"
Surely this would already be possible if we wanted to do it.
Jynus (talkcontribs)

But in the worst-case scenario (only if the join interferes with the indexes, which isn't necessarily true), it means 2 requests with a negligible round trip, both on much smaller tables. Even current searches on large tables are very slow (>0.01 seconds), just because the buffer pool cannot be used efficiently. I expect that to be much better if the "pool" of IPs is smaller (not 1 per revision). Plus, the advantages are real; the counterpoint is only "it could be done in another way".

In any case, performance claims should be backed by actual measurements. There is no denying that the tables will be smaller that way, and that is *potentially* a large win in maintenance and performance. It has to be measured quantitatively rather than qualitatively.
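
As a rough illustration of what such a measurement could look like, the sketch below times the same lookup against a "wide" revision table (name stored per revision) and a "narrow" one joined to a small entry table. Everything here is an assumption made for the example: the table and column names are invented, the data is synthetic, and an in-memory SQLite database says nothing about InnoDB buffer pool behaviour on production-sized tables. It only shows the method, not a result.

```python
import sqlite3
import time


def build(n_revisions=50_000, n_editors=500):
    """Create both hypothetical layouts with identical synthetic data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE revision_wide (rev_id INTEGER PRIMARY KEY, rev_user_text TEXT NOT NULL);
        CREATE TABLE user_entry (ue_id INTEGER PRIMARY KEY, ue_text TEXT NOT NULL);
        CREATE TABLE revision_narrow (rev_id INTEGER PRIMARY KEY, rev_user_entry INTEGER NOT NULL);
        CREATE INDEX wide_user ON revision_wide (rev_user_text);
        CREATE INDEX narrow_user ON revision_narrow (rev_user_entry);
    """)
    conn.executemany(
        "INSERT INTO user_entry VALUES (?, ?)",
        [(i, f"editor-{i}") for i in range(1, n_editors + 1)],
    )
    conn.executemany(
        "INSERT INTO revision_wide VALUES (?, ?)",
        [(r, f"editor-{r % n_editors + 1}") for r in range(1, n_revisions + 1)],
    )
    conn.executemany(
        "INSERT INTO revision_narrow VALUES (?, ?)",
        [(r, r % n_editors + 1) for r in range(1, n_revisions + 1)],
    )
    conn.commit()
    return conn


def timed(conn, sql, params=(), repeats=20):
    """Run a query several times; return its rows and the best wall-clock time in seconds."""
    best = float("inf")
    rows = []
    for _ in range(repeats):
        start = time.perf_counter()
        rows = conn.execute(sql, params).fetchall()
        best = min(best, time.perf_counter() - start)
    return rows, best


conn = build()
wide_rows, wide_s = timed(
    conn,
    "SELECT COUNT(*) FROM revision_wide WHERE rev_user_text = ?",
    ("editor-7",),
)
narrow_rows, narrow_s = timed(
    conn,
    """SELECT COUNT(*) FROM revision_narrow AS r
       JOIN user_entry AS ue ON ue.ue_id = r.rev_user_entry
       WHERE ue.ue_text = ?""",
    ("editor-7",),
)
print(f"wide: {wide_s:.6f}s  narrow+join: {narrow_s:.6f}s  same result: {wide_rows == narrow_rows}")
```

A meaningful comparison would run equivalent queries on a replica with realistic data volume and a cold and warm buffer pool, since that is where the table-size argument is expected to matter.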
