Topic on Talk:Reading/Web/EventLogging best practices

Tbayer (WMF) (talkcontribs)

Marking this as a draft because some of these points need a bit of review before we can rely on this document for general usage.

E.g. I have just removed the advice to "Work on the assumption that at 2 million records a database table becomes unusable, so if you want 6 months' worth of data, 5 events per second would get you there". Large tables can become a problem, but in that form the statement is plainly wrong. There currently exist about 140 EventLogging tables with more than 2 million rows (SELECT table_name, TABLE_ROWS FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'log' ORDER BY TABLE_ROWS DESC;), and 16 with more than 100 million rows. Many of these large tables are enjoying a happy, productive life. Of course, 100 million rows may be too much to query at once, but as I understand it, all EL tables are indexed by timestamp, so you can reduce computational effort by restricting queries to certain timespans.
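For illustration, restricting a query to a one-month timespan might look like the sketch below ("SomeSchema_12345678" is a hypothetical table name, and I'm assuming the usual 14-digit yyyymmddHHMMSS string format of the EL timestamp column):

-- Count only January 2016 events; with the timestamp index, the
-- database can skip all rows outside this window instead of
-- scanning the whole table.
SELECT COUNT(*)
FROM log.SomeSchema_12345678
WHERE timestamp >= '20160101000000'
  AND timestamp < '20160201000000';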

Jdlrobson (talkcontribs)

This was advice from Nuria Ruiz on the mailing list. I'd suggest you discuss it with her. IMO these large tables are generally unusable for analysis - but it depends on the type of analysis you are doing. If you want quick answers, 2 million records is not going to get you there. @Milimetric (WMF) might have some thoughts...

Tbayer (WMF) (talkcontribs)

OK - I don't think we need to dig further into this right now, as the statement is patently untrue as written (see above). Maybe something was lost in translation, but it's also worth recalling that there were some erroneous assumptions floating around at the time, which have since been corrected with the help of our database experts (see e.g. https://phabricator.wikimedia.org/T123595 ).

Of course it is always a good idea to limit data collection to the amount needed, but people have been doing successful analysis on much larger EL tables for years. To pick an arbitrary example, this result was based on a query of 4.7 million rows (and still lacked resolution for some "smaller" countries).

Tbayer (WMF) (talkcontribs)

PS: even the basic math in the given example was wrong (6 months * 30 days * 24 hours * 3600 seconds * 5 events/second = 77,760,000, i.e. almost 80 million events, not 2 million).

Milimetric (WMF) (talkcontribs)

Agreed that 2 million rows is totally fine. It's hard to come up with rules of thumb for "too large", but a good general rule is not to collect any more data than you need.

Reply to "Draft"