User talk:Brooke Vibber/Compacting the revision table round 2

About this board

sha_256 instead?

3 comments • 15:45, 8 December 2017 6 years ago

3

Jynus (talkcontribs)

In which cases are sha1 used now? If not common, maybe it could be partitioned vertically, also making easier to change the hashing method in the future? Not sure about this, just a suggestion as soon as I saw the sha1 still there.

Reply 21:25, 6 March 2017 7 years ago

Roan Kattouw (WMF) (talkcontribs)

AFAIK sha-1 is unused and was only added because it seemed like a good thing to have that we might use for something in the future. One use case that came up is revert detection, but I think we might be better off with an explicit field for that, which I'll start a separate topic for.

Reply 21:38, 8 March 2017 7 years ago

Graham87 (talkcontribs)

I use it for comparing the current Wikipedia database with old dumps to find missing edits. See my user subpage on the topic. This work would not have been possible without *some* sort of hash field ... the actual hashing method used would be immaterial, AFAIK.

Reply 15:45, 8 December 2017 6 years ago

Reply to "sha_256 instead?"

Revert detection / flagging and other edit review data

4 comments • 20:14, 10 May 2017 6 years ago

4

Roan Kattouw (WMF) (talkcontribs)

Some people (cc @Halfak (WMF)) have been talking for a while about storing various bits of data relevant to edit review and vandalism fighting. There's an ancient bug asking for the "patrolled" and "bot" flags to be stored in the revision table rather than the recentchanges table so that this data doesn't go away after 30 days.

In addition to the patrol flag (or instead of it, on wikis where it's not enabled for historical reasons), the other bit of information that's important for coordinating patrol/review work is whether a revision has been reverted already. We don't currently have any standardized way of even determining this (researchers tend to DIY their revert detection AFAIK), and we also don't have a place to store the result for others to reuse. Currently you can kind of do something with rev_sha1, but with that moving to the content table you wouldn't be able to get a hash for an entire revision easily. If this were built into MediaWiki, we might be able to take the user's action (e.g. using the rollback button or an undo link) into account as well.

So this would mean adding something like rev_reverted as a boolean or rev_reverted_by as an integer (pointing to the rev_id that reverts it), as well as rev_bot and rev_patrolled (both booleans). I'm sensitive to not wanting to make the revision table wider than it needs to be, so if people feel like this is too much, perhaps we could move these things (along with rev_minor_edit) out to a revision_flags table or something?

Reply 21:55, 8 March 2017 7 years ago

Brooke Vibber (talkcontribs)

I think I replied to this briefly in passing on a task, but my inclination is yes, a revision_flags or more specific revert_tracking table sounds most appropriate for this sort of thing, where we're sorta tracking meta-information about the actions.

Reply 20:00, 10 May 2017 6 years ago

Roan Kattouw (WMF) (talkcontribs)

Cool, thanks. A separate table keyed by rev_id would also not be blocked on this big schema change, which is another advantage. We could start building it quite soon.

Reply 20:09, 10 May 2017 6 years ago

Roan Kattouw (WMF) (talkcontribs)

@Daniel Kinzler (WMDE) points out that change tags could be used for revert tracking and possibly other things, but that we'd probably want to optimize the schema for tags first (which he said @Ladsgroup is looking into)

Reply 20:14, 10 May 2017 6 years ago

Reply to "Revert detection / flagging and other edit review data"

On hashes and comments

One comment • 08:41, 9 March 2017 7 years ago

1

Jynus (talkcontribs)

I have positioned against the use of hashes as identifiers (not in general). Hashes are not arbitrary ids- they are completely coupled to the content, and they have many disadvantages compared to autoincrements (when used as internal identifiers). I have commented extensively about that here: https://phabricator.wikimedia.org/T158724#3063877

That doesn't mean we shouldn't use hashes- we can have a comment table with (auto_increment id PK, hash, blob) and an index on the hash for easy location "fake hash index", then we do `SELECT * FROM comments WHERE hash = hash($text) and blob = $text` and we will have the same efficiency (it still uses the index on hash) while only comparing 1 full blob. Fast, and collision resistant.

BTW, we can create this table ASAP, and it would have a really good compression ratio for InnoDB.

Reply 08:41, 9 March 2017 7 years ago

Reply to "On hashes and comments"

RFC meeting log

One comment • 23:56, 8 March 2017 7 years ago

1

Summary by Jdforrester (WMF)

See Topic:Tl3xtaeq79lrbqvn

Daniel Kinzler (WMDE) (talkcontribs)

For the record, here's the log of today's RFC meeting on this topic: https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-03-08-22.03.html

23:01, 8 March 2017 7 years ago

IRC meeting notes

2 comments • 23:03, 8 March 2017 7 years ago

2

Brooke Vibber (talkcontribs)

Great feedback everybody -- I'll make more updates and we'll circle back for another discussion in a week or two!

Meeting summary (full logs linked from there):

https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-02-15-22.01.html

Reply Edited 23:09, 15 February 2017 7 years ago

Brooke Vibber (talkcontribs)

March 8 followup meeting notes at https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-03-08-22.03.html

Reply 23:03, 8 March 2017 7 years ago

Reply to "IRC meeting notes"

Proposed comment table

7 comments • 22:32, 8 March 2017 7 years ago

7

MZMcBride (talkcontribs)

Is comment.comment_id intended to match to revision.rev_id? If so, can this be made explicit in the proposal? If not, how are comments intended to be matched to revisions?

It looks like you're currently proposing keeping comment.comment_text as varbinary(767). (I'm not sure Wikimedia wikis are using this larger value on bigger, older wikis.) Several people have requested a blob or a pointer to a blob instead so that they could write Git-like commit messages for edits. I guess we're not planning to support that? Related task: phabricator:T6715.

Would this comment table eventually take over for logging.log_comment?

Reply 02:23, 14 February 2017 7 years ago

Brooke Vibber (talkcontribs)

It's a separate id, allowing both for reuse of comment rows and use of the comment table by other things like uploads and logs as well, though I haven't rolled that into the update plans yet.

definitely should update to a blob so can enforce and change the size limit in php side.

Reply 18:05, 14 February 2017 7 years ago

Halfak (WMF) (talkcontribs)

When would we re-use comment rows? It seems that there's a 1:1 relationship between revisions and comments.

Reply 18:21, 7 March 2017 7 years ago

Anomie (talkcontribs)

Leaving the comment empty is common. On talk pages, variations on "reply" as the comment are pretty common too, so multiple replies in the same section may end up using the same comment. Bots and scripts often use the same comment for many similar edits.

Reply 14:54, 8 March 2017 7 years ago

Roan Kattouw (WMF) (talkcontribs)

But how would we detect that a comment was reused? Their IDs are auto-increment IDs, not hashes, so we couldn't efficiently find out if the comment that the user just submitted is already in the comment table somewhere. When multiple revisions / log entries / etc with the same comment are created at the same time, we would know, but when does that ever happen? Maybe for null revisions associated with log entries (like page moves and protections), I suppose?

Reply 21:42, 8 March 2017 7 years ago

Anomie (talkcontribs)

Presumably if we decided to reuse comments we'd add an appropriate index (possibly on a hash column) to the comment table so we could do a query for it.

Reply 22:32, 8 March 2017 7 years ago

Anomie (talkcontribs)

revision.rev_comment_id points to comment.comment_id.

Reply 14:23, 14 February 2017 7 years ago

Reply to "Proposed comment table"

Index usage

2 comments • 22:27, 8 March 2017 7 years ago

2

Summary by Jdforrester (WMF)

Discuss in Topic:Tkyw4mg4sak6wxo1 instead.

Roan Kattouw (WMF) (talkcontribs)

In your list of indexes, I saw some comments like "is this used directly?" and "this looks weird". I agree with you in that I can't imagine what uses those indexes, but I've also been surprisingly wrong about that before. Perhaps we could ask the DBAs to get us index usage statistics? @Jynus has gotten those for me before.

22:01, 8 March 2017 7 years ago

Roan Kattouw (WMF) (talkcontribs)

... and I see that there's a "Revision indexes" thread below where uses of those indexes have already been identified :)

Edited by Jdforrester (WMF) 22:27, 8 March 2017 7 years ago

user_entry and IPs and hex

4 comments • 19:25, 8 March 2017 7 years ago

4

Legoktm (talkcontribs)

In phab:T156318#3037271 there is a proposal to create a table that mostly adds a hex version of the IP addresses for easy searching of contributions in an IP range (what CheckUser already does). Is this something that should be considered in this revision schema re-do as it's basically an aggregate of the revision table?

CC @MusikAnimal

Reply 08:20, 1 March 2017 7 years ago

MusikAnimal (talkcontribs)

Yes, glad you brought this up :) With the revision re-do it would be wonderful if we could include a IP hex column or whatever you feel is best, such that we could easily query for IP ranges. That way when the time comes, we can adapt the Special:RangeContributions code to work off of it. In the interim we are hoping to move forward with our new table as planned, under the assumption the revision schema re-do is not going to happen anytime soon.

Reply 19:45, 1 March 2017 7 years ago

Legoktm (talkcontribs)

Oh yes, very much agreed on not waiting for this to implement the table for RangeContributions.

Reply 20:14, 1 March 2017 7 years ago

Brooke Vibber (talkcontribs)

If you need the index tied in with revision/timestamp order, then I think it'll work best as a separate table. That keeps the specialized indexes isolated, so they don't bloat the main revision table.

Reply 19:25, 8 March 2017 7 years ago

Reply to "user_entry and IPs and hex"

Bikeshedding the name "user_entry"

3 comments • 19:19, 8 March 2017 7 years ago

3

Halfak (WMF) (talkcontribs)

So, I'm taking minor issue with the name because it re-uses the word "user" which we've historically reserved for registered accounts. I'd love if we could just change the table name "user" to "account", but that would require a deeper refactoring.

So instead, I propose that we use the term "actor" instead of "user_entry". An "actor" is anyone who can perform some action. IP editors can perform edits. Registered editors can perform edits and other types of actions that appear in "logging". They both share the property that they are empowered to "act".

This is just one possible change, but generally, I'm strongly in favor of taking the "user" out of "user_entry" -- whatever the new table name is.

18:41, 7 March 2017 7 years ago

Jdforrester (WMF) (talkcontribs)

Seems fine. I don't really care. :-) ±1 from me.

19:30, 7 March 2017 7 years ago

Brooke Vibber (talkcontribs)

I like it. Done. :)

19:18, 8 March 2017 7 years ago

Event table?

2 comments • 15:13, 8 March 2017 7 years ago

2

Halfak (WMF) (talkcontribs)

So, I saw that Brian VIBBER had mentioned elsewhere the idea of having the logging table reference the newly proposed comments table. It seems that this might strain the metaphors here a little bit and duplicate some fields. I wanted to throw out an alternative idea.

Currently, I we'd have something like this:

revision
- rev_comment_id (comment.comment_id)
- rev_user_entry (user_entry.ue_id)
- rev_timestamp
- ...
logging
- log_comment_id (comment.comment_id)
- log_user_entry (user_entry.ue_id)
- rev_timestamp
- ...
user_entry
- ue_id
- ...
comment
- comment_id
- ...

Alternatively, we could follow the like-with-like policy and have an event table that has the common attributes across revision and logging that looks like this:

event
- event_id
- event_type ("log" or "rev" or "new" or whatever)
- event_comment (comment.comment_id)
- event_user_entry (user_entry.ue_id)
- event_timestamp
- ...
revision
- rev_event
- ...
logging
- log_event
- ...
comment
- comment_id
- ...
user_entry
- ue_id

Reply Edited 18:35, 7 March 2017 7 years ago

Anomie (talkcontribs)

On the down side, that event table would be taller than revision. It looks like you'd save one timestamp, one tinyint (xxx_deleted), one int (xxx_page), and two bigints (xxx_comment and xxx_user_entry) in exchange for xxx_event (int or bigint?) out of revision and logging, and a possibility to reuse the event row when an edit creates both a log entry and a dummy revision.

Glancing at indexes, we might have to lose the indexes on logging that combine log_type and user and/or timestamp, which would make it hard to have a query fetch logs of a particular log_type ordered by timestamp (i.e. Special:Log) without filesorting.

Reply 15:13, 8 March 2017 7 years ago

Reply to "Event table?"