User talk:Brooke Vibber/Compacting the revision table round 2

About this board

Jynus (talkcontribs)

In which cases are sha1 used now? If not common, maybe it could be partitioned vertically, also making easier to change the hashing method in the future? Not sure about this, just a suggestion as soon as I saw the sha1 still there.

Roan Kattouw (WMF) (talkcontribs)

AFAIK sha-1 is unused and was only added because it seemed like a good thing to have that we might use for something in the future. One use case that came up is revert detection, but I think we might be better off with an explicit field for that, which I'll start a separate topic for.

Graham87 (talkcontribs)

I use it for comparing the current Wikipedia database with old dumps to find missing edits. See my user subpage on the topic. This work would not have been possible without *some* sort of hash field ... the actual hashing method used would be immaterial, AFAIK.

Reply to "sha_256 instead?"

Revert detection / flagging and other edit review data

4
Roan Kattouw (WMF) (talkcontribs)

Some people (cc @Halfak (WMF)) have been talking for a while about storing various bits of data relevant to edit review and vandalism fighting. There's an ancient bug asking for the "patrolled" and "bot" flags to be stored in the revision table rather than the recentchanges table so that this data doesn't go away after 30 days.

In addition to the patrol flag (or instead of it, on wikis where it's not enabled for historical reasons), the other bit of information that's important for coordinating patrol/review work is whether a revision has been reverted already. We don't currently have any standardized way of even determining this (researchers tend to DIY their revert detection AFAIK), and we also don't have a place to store the result for others to reuse. Currently you can kind of do something with rev_sha1, but with that moving to the content table you wouldn't be able to get a hash for an entire revision easily. If this were built into MediaWiki, we might be able to take the user's action (e.g. using the rollback button or an undo link) into account as well.

So this would mean adding something like rev_reverted as a boolean or rev_reverted_by as an integer (pointing to the rev_id that reverts it), as well as rev_bot and rev_patrolled (both booleans). I'm sensitive to not wanting to make the revision table wider than it needs to be, so if people feel like this is too much, perhaps we could move these things (along with rev_minor_edit) out to a revision_flags table or something?

Brooke Vibber (talkcontribs)

I think I replied to this briefly in passing on a task, but my inclination is yes, a revision_flags or more specific revert_tracking table sounds most appropriate for this sort of thing, where we're sorta tracking meta-information about the actions.

Roan Kattouw (WMF) (talkcontribs)

Cool, thanks. A separate table keyed by rev_id would also not be blocked on this big schema change, which is another advantage. We could start building it quite soon.

Roan Kattouw (WMF) (talkcontribs)

@Daniel Kinzler (WMDE) points out that change tags could be used for revert tracking and possibly other things, but that we'd probably want to optimize the schema for tags first (which he said @Ladsgroup is looking into)

Reply to "Revert detection / flagging and other edit review data"

On hashes and comments

1
Jynus (talkcontribs)

I have positioned against the use of hashes as identifiers (not in general). Hashes are not arbitrary ids- they are completely coupled to the content, and they have many disadvantages compared to autoincrements (when used as internal identifiers). I have commented extensively about that here: https://phabricator.wikimedia.org/T158724#3063877

That doesn't mean we shouldn't use hashes- we can have a comment table with (auto_increment id PK, hash, blob) and an index on the hash for easy location "fake hash index", then we do `SELECT * FROM comments WHERE hash = hash($text) and blob = $text` and we will have the same efficiency (it still uses the index on hash) while only comparing 1 full blob. Fast, and collision resistant.

BTW, we can create this table ASAP, and it would have a really good compression ratio for InnoDB.

Reply to "On hashes and comments"
Summary by Jdforrester (WMF)
Daniel Kinzler (WMDE) (talkcontribs)
Brooke Vibber (talkcontribs)
Brooke Vibber (talkcontribs)
Reply to "IRC meeting notes"

Proposed comment table

7
MZMcBride (talkcontribs)

Is comment.comment_id intended to match to revision.rev_id? If so, can this be made explicit in the proposal? If not, how are comments intended to be matched to revisions?

It looks like you're currently proposing keeping comment.comment_text as varbinary(767). (I'm not sure Wikimedia wikis are using this larger value on bigger, older wikis.) Several people have requested a blob or a pointer to a blob instead so that they could write Git-like commit messages for edits. I guess we're not planning to support that? Related task: phabricator:T6715.

Would this comment table eventually take over for logging.log_comment?

Brooke Vibber (talkcontribs)

It's a separate id, allowing both for reuse of comment rows and use of the comment table by other things like uploads and logs as well, though I haven't rolled that into the update plans yet.

definitely should update to a blob so can enforce and change the size limit in php side.

Halfak (WMF) (talkcontribs)

When would we re-use comment rows? It seems that there's a 1:1 relationship between revisions and comments.

Anomie (talkcontribs)

Leaving the comment empty is common. On talk pages, variations on "reply" as the comment are pretty common too, so multiple replies in the same section may end up using the same comment. Bots and scripts often use the same comment for many similar edits.

Roan Kattouw (WMF) (talkcontribs)

But how would we detect that a comment was reused? Their IDs are auto-increment IDs, not hashes, so we couldn't efficiently find out if the comment that the user just submitted is already in the comment table somewhere. When multiple revisions / log entries / etc with the same comment are created at the same time, we would know, but when does that ever happen? Maybe for null revisions associated with log entries (like page moves and protections), I suppose?

Anomie (talkcontribs)

Presumably if we decided to reuse comments we'd add an appropriate index (possibly on a hash column) to the comment table so we could do a query for it.

Anomie (talkcontribs)

revision.rev_comment_id points to comment.comment_id.

Reply to "Proposed comment table"
Summary by Jdforrester (WMF)

Discuss in Topic:Tkyw4mg4sak6wxo1 instead.

Roan Kattouw (WMF) (talkcontribs)

In your list of indexes, I saw some comments like "is this used directly?" and "this looks weird". I agree with you in that I can't imagine what uses those indexes, but I've also been surprisingly wrong about that before. Perhaps we could ask the DBAs to get us index usage statistics? @Jynus has gotten those for me before.

Roan Kattouw (WMF) (talkcontribs)

... and I see that there's a "Revision indexes" thread below where uses of those indexes have already been identified :)

user_entry and IPs and hex

4
Legoktm (talkcontribs)

In phab:T156318#3037271 there is a proposal to create a table that mostly adds a hex version of the IP addresses for easy searching of contributions in an IP range (what CheckUser already does). Is this something that should be considered in this revision schema re-do as it's basically an aggregate of the revision table?

CC @MusikAnimal

MusikAnimal (talkcontribs)

Yes, glad you brought this up :) With the revision re-do it would be wonderful if we could include a IP hex column or whatever you feel is best, such that we could easily query for IP ranges. That way when the time comes, we can adapt the Special:RangeContributions code to work off of it. In the interim we are hoping to move forward with our new table as planned, under the assumption the revision schema re-do is not going to happen anytime soon.

Legoktm (talkcontribs)

Oh yes, very much agreed on not waiting for this to implement the table for RangeContributions.

Brooke Vibber (talkcontribs)

If you need the index tied in with revision/timestamp order, then I think it'll work best as a separate table. That keeps the specialized indexes isolated, so they don't bloat the main revision table.

Reply to "user_entry and IPs and hex"

Bikeshedding the name "user_entry"

3
Halfak (WMF) (talkcontribs)

So, I'm taking minor issue with the name because it re-uses the word "user" which we've historically reserved for registered accounts. I'd love if we could just change the table name "user" to "account", but that would require a deeper refactoring.

So instead, I propose that we use the term "actor" instead of "user_entry". An "actor" is anyone who can perform some action. IP editors can perform edits. Registered editors can perform edits and other types of actions that appear in "logging". They both share the property that they are empowered to "act".

This is just one possible change, but generally, I'm strongly in favor of taking the "user" out of "user_entry" -- whatever the new table name is.

Jdforrester (WMF) (talkcontribs)

Seems fine. I don't really care. :-) ±1 from me.

Brooke Vibber (talkcontribs)

I like it. Done. :)

Halfak (WMF) (talkcontribs)

So, I saw that Brian VIBBER had mentioned elsewhere the idea of having the logging table reference the newly proposed comments table. It seems that this might strain the metaphors here a little bit and duplicate some fields. I wanted to throw out an alternative idea.

Currently, I we'd have something like this:

  • revision
    • rev_comment_id (comment.comment_id)
    • rev_user_entry (user_entry.ue_id)
    • rev_timestamp
    • ...
  • logging
    • log_comment_id (comment.comment_id)
    • log_user_entry (user_entry.ue_id)
    • rev_timestamp
    • ...
  • user_entry
    • ue_id
    • ...
  • comment
    • comment_id
    • ...

Alternatively, we could follow the like-with-like policy and have an event table that has the common attributes across revision and logging that looks like this:

  • event
    • event_id
    • event_type ("log" or "rev" or "new" or whatever)
    • event_comment (comment.comment_id)
    • event_user_entry (user_entry.ue_id)
    • event_timestamp
    • ...
  • revision
    • rev_event
    • ...
  • logging
    • log_event
    • ...
  • comment
    • comment_id
    • ...
  • user_entry
    • ue_id
Anomie (talkcontribs)

On the down side, that event table would be taller than revision. It looks like you'd save one timestamp, one tinyint (xxx_deleted), one int (xxx_page), and two bigints (xxx_comment and xxx_user_entry) in exchange for xxx_event (int or bigint?) out of revision and logging, and a possibility to reuse the event row when an edit creates both a log entry and a dummy revision.

Glancing at indexes, we might have to lose the indexes on logging that combine log_type and user and/or timestamp, which would make it hard to have a query fetch logs of a particular log_type ordered by timestamp (i.e. Special:Log) without filesorting.

Reply to "Event table?"