Architecture meetings/RFC review 2013-11-20

Wednesday, November 20, 2013 at 10:00 PM UTC at

Requests for Comment to review

 * Requests for comment/Page deletion
 * Requests for comment/Configuration database 2
 * Requests for comment/Text extraction

Summary and logs
The content below is an export of the original MeetBot notes using Pandoc.

Meeting summary
 * Details of this RFC review meeting are available at https://www.mediawiki.org/wiki/Architecture_meetings/RFC_review_2013-11-20 (qgil, 22:03:40)

 * Page deletion (qgil, 22:05:30)
 * 1) https://www.mediawiki.org/wiki/Requests_for_comment/Page_deletion (qgil, 22:05:43)
 * 2) https://bugzilla.wikimedia.org/show_bug.cgi?id=55398 proposed new field; AaronSchulz suggests we go with new table (Leucosticte, 22:06:55)
 * 3) <TimStarling> maybe it should present a single option (a table), and maybe give a few more details about how that will work. Then we'll accept it (qgil, 22:22:57)
 * 4) ACTION: Leucosticte to expand on "new table" design, optionally start on prototype (TimStarling, 22:23:31)
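As a rough illustration of the "new table" option, the sketch below shows the kind of indicative queries that Special:Contributions and Special:DeletedContributions might run if deleting a page moved its row into an archived_page table while leaving its revisions in place. This is not the RFC's design. The schema is heavily simplified, the `ap_*` column names and all function names are assumptions made for this example, and SQLite stands in for the real database.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a simplified, hypothetical schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
        -- Hypothetical table: ap_id reuses the original page_id, so the
        -- ID survives deletion and can be restored on undeletion.
        CREATE TABLE archived_page (ap_id INTEGER PRIMARY KEY, ap_title TEXT);
        CREATE TABLE revision (
            rev_id INTEGER PRIMARY KEY,
            rev_page INTEGER,
            rev_user_text TEXT
        );
        """
    )
    return db


def delete_page(db, page_id):
    """Move the page row to archived_page; revision rows are not touched."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO archived_page (ap_id, ap_title) "
            "SELECT page_id, page_title FROM page WHERE page_id = ?",
            (page_id,),
        )
        db.execute("DELETE FROM page WHERE page_id = ?", (page_id,))


def contributions(db, user):
    """Indicative query for live contributions: join revision to page."""
    return db.execute(
        "SELECT rev_id, page_title FROM revision "
        "JOIN page ON page_id = rev_page "
        "WHERE rev_user_text = ? ORDER BY rev_id",
        (user,),
    ).fetchall()


def deleted_contributions(db, user):
    """Indicative query for deleted contributions: join to archived_page."""
    return db.execute(
        "SELECT rev_id, ap_title FROM revision "
        "JOIN archived_page ON ap_id = rev_page "
        "WHERE rev_user_text = ? ORDER BY rev_id",
        (user,),
    ).fetchall()
```

Under these assumptions the two listings differ only in which table they join against, which is consistent with the remark in the log that the two special pages could share more code.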

 * Configuration database 2 (qgil, 22:26:44)
 * 1) https://www.mediawiki.org/wiki/Requests_for_comment/Configuration_database_2 (qgil, 22:26:53)
 * 2) <TimStarling> neither JSON nor CDB are really appropriate backends for a web interface (qgil, 22:34:11)
 * 3) https://www.mediawiki.org/wiki/Thread:Talk:Requests_for_comment/Configuration_database/Storing_setting_on_wiki_pages (^d, 22:38:32)
 * 4) https://www.mediawiki.org/wiki/Thread:Talk:Requests_for_comment/Configuration_database/Storing_setting_on_wiki_pages (qgil, 22:38:51)
 * 5) <TimStarling> I think we need to talk about requirements (qgil, 22:49:18)
 * 6) ACTION: legoktm and other interested devs to develop requirements list on wiki (TimStarling, 22:49:20)
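One requirement raised in the log is that several configuration backends be merged, with only a subset of variables editable from the web and everything else coming from a file in version control. A minimal sketch of that merge rule follows. The function name, its parameters, and the example keys are all assumptions for illustration, not part of the RFC or of MediaWiki.

```python
def resolve_config(file_settings, web_settings, web_editable):
    """Merge two configuration layers into one resolved mapping.

    file_settings: values from a version-controlled source file (the base).
    web_settings:  values stored by a hypothetical web interface.
    web_editable:  set of keys the web layer is allowed to override.
    """
    resolved = dict(file_settings)  # copy so the base layer is not mutated
    for key, value in web_settings.items():
        # Web-stored values for keys outside the allow-list are ignored.
        if key in web_editable:
            resolved[key] = value
    return resolved
```

For example, with `file_settings = {"sitename": "Example", "dbserver": "db1"}`, `web_settings = {"sitename": "My Wiki", "dbserver": "other"}` and `web_editable = {"sitename"}`, the site name is overridden while the database server keeps its file value. The resolved mapping is also the kind of result that could be cached locally on each server, as discussed in the meeting.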

 * Text extraction (qgil, 22:49:48)
 * 1) https://www.mediawiki.org/wiki/Requests_for_comment/Text_extraction (qgil, 22:50:00)
 * 2) <MaxSem> this RFC is asking for decisions on 2 things: core vs. extension and whether to store the extracts in page_props, unconditionally (qgil, 22:50:23)
 * 3) extracts are definitely too big for the core DB, need some other storage backend (TimStarling, 22:57:52)
 * 4) ACTION: MaxSem and other interested devs to discuss storage backend options on RFC (TimStarling, 23:00:17)
 * 5) ACTION: ^d to comment on RFC sharing experience with similar problem in ElasticSearch (TimStarling, 23:00:52)

Meeting ended at 23:04:06 UTC (full logs).

Action items

 * 1) Leucosticte to expand on "new table" design, optionally start on prototype
 * 2) legoktm and other interested devs to develop requirements list on wiki
 * 3) MaxSem and other interested devs to discuss storage backend options on RFC
 * 4) ^d to comment on RFC sharing experience with similar problem in ElasticSearch

Action items, by person

 * ^d
 * 1) ^d to comment on RFC sharing experience with similar problem in ElasticSearch
 * legoktm
 * 1) legoktm and other interested devs to develop requirements list on wiki
 * Leucosticte
 * 1) Leucosticte to expand on "new table" design, optionally start on prototype
 * MaxSem
 * 1) MaxSem and other interested devs to discuss storage backend options on RFC

People present (lines said)

 * 1) TimStarling (71)
 * 2) ^d (35)
 * 3) AaronSchulz (34)
 * 4) MaxSem (27)
 * 5) legoktm (22)
 * 6) qgil (20)
 * 7) aude (14)
 * 8) Leucosticte (14)
 * 9) ori-l (10)
 * 10) vvv (5)
 * 11) bd808 (5)
 * 12) Krenair (4)
 * 13) robla (3)
 * 14) meetbot-wm` (2)
 * 15) drdee (1)

Full log
22:03:23 <qgil> #startmeeting
22:03:23 <meetbot-wm`> Meeting started Wed Nov 20 22:03:23 2013 UTC. The chair is qgil. Information about MeetBot at https://bugzilla.wikimedia.org/46377.
22:03:23 <meetbot-wm`> Useful Commands: #action #agreed #help #info #idea #link #topic.
22:03:31 <TimStarling> who added page deletion to the list at https://www.mediawiki.org/wiki/Architecture_meetings/RFC_review_2013-11-20 ?
22:03:40 <qgil> #info Details of this RFC review meeting are available at https://www.mediawiki.org/wiki/Architecture_meetings/RFC_review_2013-11-20
22:03:41 <Leucosticte> I did
22:03:59 <qgil> TimStarling, which should be the first topic?
22:04:17 <TimStarling> we also have Configuration database 2, and we have legoktm online for that one
22:04:26 <legoktm> :D
22:04:31 <TimStarling> and text extraction, and we have maxSem online for that?
22:04:36 <MaxSem> yup
22:05:18 <TimStarling> ok, let's start with page deletion
22:05:30 <qgil> #topic Page deletion
22:05:35 <TimStarling> Leucosticte: you have made some edits to this RFC recently?
22:05:43 <qgil> #link https://www.mediawiki.org/wiki/Requests_for_comment/Page_deletion
22:05:50 <Leucosticte> yes, the two options being considered were new field and new table
22:06:23 <qgil> Everybody: please rememberto add "#info"at the beginning of sentences you want to be written in the meeting minutes. Thank you!
22:06:55 <Leucosticte> #info: https://bugzilla.wikimedia.org/show_bug.cgi?id=55398 proposed new field; AaronSchulz suggests we go with new table
22:07:25 <^d> I like this rfc and all I've read is the intro.
22:07:43 <Leucosticte> oh, I guess only use #info on major items, right?
22:07:46 <TimStarling> so you have just fleshed out these two options a bit?
22:08:01 <Leucosticte> pretty much. Platonides was the one who originally put the two options forward
22:08:13 * AaronSchulz actually coded the table option ages ago in branch
22:08:24 <qgil> Leucosticte, you're doing fine :)
22:08:29 <^d> Question: has any thought been given to oversight migration?
22:08:30 <AaronSchulz> not that any of that would be usable now ;)
22:08:38 <MaxSem> It took us quite a time to get rev_deleted right, and page_delted will be much, much harder due to widespread use
22:08:47 <^d> We've still not migrated oversight data to rev_del, and whatever new idea should allow oversight to be migrated too.
22:08:56 <Krenair> I have a patch in Gerrit for that.
22:08:57 <legoktm> ^d: I think Krenair was working on a script to convert oversight data to revdel
22:09:17 <Krenair> It got too complex and I haven't bothered working on it for months.
22:09:19 <TimStarling> Leucosticte: are you offering to write the code for this, or are you just interested in the design?
22:09:59 <TimStarling> AaronSchulz: a branch in git or subversion?
22:10:06 <AaronSchulz> svn
22:10:10 <Leucosticte> TimStarling: offering to write code, but it looks like Aaron already started on the code, if we're going with the table option.
22:10:16 <AaronSchulz> this was the old days when brion had to review everything
22:10:29 <TimStarling> that's going back a way :)
22:10:47 <AaronSchulz> like I said, it would be from scratch to do that now
22:10:52 <TimStarling> Leucosticte may just want to look at your diff for ideas
22:11:44 * aude waves
22:12:15 * Krenair waves back
22:12:29 <drdee> is this an RFC that should be discussed at the Architecture Meeting in January?
22:12:39 <TimStarling> so everyone's preferred option is adding an archived_page table?
22:12:48 <^d> I'm on the fence.
22:13:15 <MaxSem> yep
22:13:21 <TimStarling> if revisions are left in the revision table, that makes deletion and undeletion fairly efficient
22:13:39 <MaxSem> which is a separate whee
22:13:42 <TimStarling> but I guess that is a property of both proposals
22:14:08 <Leucosticte> TimStarling, I preferred field since it would be nice to keep page IDs the same across deletions and recreations, and because deletion just is really just hiding pages from certain viewers, so the page may as well stay in the table -- that's my thinking, but the new table would be more secure by default; Aaron's right about that
22:14:09 <AaronSchulz> right
22:14:25 <AaronSchulz> you can still preserve page IDs either way
22:14:45 <TimStarling> same way we preserve revision IDs currently
22:14:49 <MaxSem> also, adding page_deleted to many queries would require a lot of index tuning and potentially more indexes
22:15:26 <TimStarling> how would Special:Contributions work?
22:16:06 <TimStarling> I guess there is a join on page already
22:16:26 <Leucosticte> oh, well if the page IDs stay the same then new field doesn't have a lot of advantages listed, other than "All page_ids live in one table." Were there other advantages?
22:16:30 <TimStarling> would we be looking at merging the code of Special:Contributions and Special:DeletedContributions?
22:17:06 <AaronSchulz> they could share more code at least
22:17:17 * ^d wonders what the schema would look like if a 1.5-style complete refactor happened...throw out all assumptions and do things the Right Way.
22:17:39 <vvv> And how many extensions will break as a result?..
22:18:10 <MaxSem> ^d, and a migration couple orders of magnitude harder...
22:18:11 <MaxSem> :P
22:18:42 <^d> Hey I was just wondering, not advocating :p
22:18:49 <TimStarling> would there be multiple archived_page rows per title?
22:18:58 <TimStarling> oh, sorry
22:19:01 <TimStarling> that is in the RFC document
22:19:29 <TimStarling> ^d: it would probably be a field
22:20:16 <TimStarling> but a table has some advantages for b/c
22:20:34 <TimStarling> anyway, should we accept this or ask for more detail?
22:21:28 <TimStarling> maybe it should present a single option (a table), and maybe give a few more details about how that will work
22:21:34 <TimStarling> then we'll accept it
22:21:36 <TimStarling> sound good?
22:21:42 <^d> Yeah I can agree to that.
22:22:01 <^d> I'm not unconvinced on the new table, I think with some more arguments you'll have me sold.
22:22:57 <qgil> #info <TimStarling> maybe it should present a single option (a table), and maybe give a few more details about how that will work. Then we'll accept it
22:23:31 <TimStarling> #action Leucosticte to expand on "new table" design, optionally start on prototype
22:23:55 <Leucosticte> so you need info on how Special:Contributions will work, or did you figure that out?
22:24:05 * AaronSchulz can only find the patch on https://bugzilla.wikimedia.org/show_bug.cgi?id=11402
22:25:18 <qgil> TimStarling, let us know when you wnt to move to a next topic, and which one it will be
22:25:36 <TimStarling> Leucosticte: maybe just give some indicative SQL showing the kind of query Special:Contributions and Special:DeletedContributions would do
22:26:09 * AaronSchulz doubts the former would need much change
22:26:32 <TimStarling> qgil: on to Configuration database 2
22:26:44 <qgil> #topic Configuration database 2
22:26:53 <qgil> #link https://www.mediawiki.org/wiki/Requests_for_comment/Configuration_database_2
22:26:58 <Leucosticte> TimStarling: okay. so, ^d, did you have any more detail regarding what you were wondering aloud, or is that pretty much only an option we're going with if new table ends up being rejected after more details are given -- okay I guess we'll discuss later/elsewhere
22:27:36 <TimStarling> Leucosticte: obviously a change to deletion is needed, we're not going to reject the whole concept
22:28:09 <TimStarling> it's just a matter of getting the right level of detail before coding starts, so that you don't end up wasting too much time on back-and-forth in code review
22:28:39 <Leucosticte> ah, okay
22:28:44 <AaronSchulz> Leucosticte: you want to show the Special:Undelete SQL smaple in the rfc too
22:29:15 <Leucosticte> okay.
22:29:18 <AaronSchulz> and don't make it a separate page like I did in that patch ;)
22:29:19 <^d> Also, I'm curious what happens when a page is deleted & recreated. Not once, many times (it happens)
22:29:22 <AaronSchulz> ok, config now
22:30:10 <AaronSchulz> the rfc could say more about storage
22:30:23 <TimStarling> legoktm: you know that performant is not a word, right?
22:30:31 <TimStarling> just saying
22:30:34 <AaronSchulz> at some point, you need config files on each apache
22:30:34 <Leucosticte> it's in wiktionary
22:30:40 <legoktm> TimStarling: oops.
22:30:46 <TimStarling> yeah, lots of stupid things are in wiktionary
22:30:47 * AaronSchulz isn't sure where DataStore would really fit it
22:31:03 <Leucosticte> i thought the same thing when I saw it
22:31:10 <AaronSchulz> legoktm: it's used all over the web
22:31:21 <AaronSchulz> and in person...that is generally how things become new words
22:31:44 <legoktm> AaronSchulz: Really there just needs to be some kind of key-value storage. JSON was an easy choice since its human readable, and easy to use. CDB was also included if JSON isn't fast enough
22:32:16 <TimStarling> neither JSON nor CDB are really appropriate backends for a web interface
22:32:21 <AaronSchulz> right, but when people say key/value storage it's easy to get hung up on things like Cassandra, Mongo ect that are neither here nor there
22:32:38 <MaxSem> DataStore!
22:32:40 <AaronSchulz> could could be a blob in swift really...we don't even need key value
22:32:53 <^d> DataStore requires you having a connection to the database.
22:32:54 <AaronSchulz> well, one key, that would be for all the config
22:33:07 <legoktm> TimStarling: What would you suggest instead?
22:33:11 <vvv> It is not clear to me how this proposal interacts with extensions, especially ones which are not aware of new configuration backend
22:33:12 <AaronSchulz> but that is just for canonical storage, the apaches still need a copy locally
22:33:27 <TimStarling> legoktm: MySQL
22:33:42 <bd808> json is not the greatest config file format since it doesn't support comments
22:33:47 <AaronSchulz> b/c definitely needs mentioning in the rfc
22:33:50 <legoktm> vvv: you would still be able to use globals for backwards compatability
22:34:08 <legoktm> vvv: and extensions could use a hook to add their own configuration options
22:34:09 <robla> oh dear, bd808 is starting to hint at yaml
22:34:11 <qgil> #info <TimStarling> neither JSON nor CDB are really appropriate backends for a web interface
22:34:14 * robla grabs popcorn
22:34:21 <^d> Heh, so basically we're back to my old config proposal then.
22:34:24 <^d> Database.
22:34:39 <legoktm> Is using a database fast enough?
22:34:39 <vvv> YAML is what I've seen commonly used for those purposes
22:34:48 <robla> I believe Wikia has a db based solution. OwynD?
22:34:53 <AaronSchulz> legoktm: not without a local shared-nothing cache
22:35:14 <AaronSchulz> really lots of ways of doing canonical storage + local caching could work
22:35:17 * bd808 thinks robla is projecting just because I happen to maintain the best PHP YAML extension
22:35:21 <AaronSchulz> including using MySQL
22:35:46 <aude> legoktm: we stick php into local settings to do tricks, sometimes, like add a hook for a specific wiki
22:35:56 <aude> how would such things be handeld with the config db
22:35:56 <TimStarling> we need to have multiple backends merged together, right?
22:35:58 <aude> ?
22:36:15 <TimStarling> and for WMF, we only want a subset of configuration variables to be web-editable
22:36:17 <AaronSchulz> aude: yeah, config management works a lot better with declarative conf
22:36:30 <legoktm> aude: You would just stick that into LocalSettings.php like is currently done
22:36:33 <TimStarling> for things that are not web-editable, we want a source file in version control
22:36:36 <TimStarling> with comments
22:36:46 <AaronSchulz> things like $wgJobClasses would have some trickiness since extensions dynamically add on to it
22:36:56 <aude> legoktm: after extract globals or whatnot?
22:36:57 <TimStarling> for things that are web editable, we want logs with usernames, and diffs and such like
22:37:04 <Krenair> I was about to say - I'm wondering how web-editable things are going to work with extensions
22:37:26 <TimStarling> yeah, that's another thing, extensions are not mentioned in this new RFC
22:37:27 <legoktm> TimStarling: I was thinking that everything is stored in the backend, and we use userrights to restrict what gets edited on the web
22:37:51 <^d> re: extensions & how to present a UI & so forth, Daniel Kinzler had an idea on my last rf.
22:37:52 <legoktm> Krenair: what do you mean?
22:37:52 <^d> *rfc
22:38:01 <legoktm> aude: yes
22:38:18 <TimStarling> legoktm: you can't give the web user the ability to edit a file in version control
22:38:22 <AaronSchulz> if there was a convention that dynamically altered config vars were only a function of the core config plus the site json/whatever config plus extension presence I guess one could compile and distribute the config
22:38:32 <^d> https://www.mediawiki.org/wiki/Thread:Talk:Requests_for_comment/Configuration_database/Storing_setting_on_wiki_pages
22:38:51 <qgil> #link https://www.mediawiki.org/wiki/Thread:Talk:Requests_for_comment/Configuration_database/Storing_setting_on_wiki_pages
22:39:23 <AaronSchulz> if it was resolved down, you could even have the config code be isolated from the rest of MW and other things could read the config for different wikis
22:39:26 <legoktm> TimStarling: right. everything would be stored in the backend, just not everything would be web editable
22:40:01 <AaronSchulz> you also wouldn't need the horribleness of things like getCachedConfigVar in JobQueueGroup
22:40:06 <legoktm> ^d: main problem with that is that its not easy to access another wiki's page text. there's also a security issue with non-public configs
22:40:26 <AaronSchulz> any rfc must deal with fetching foreign wiki configuration and extension altered config
22:40:34 <legoktm> AaronSchulz: sorry, i'm lost. what are you referring to?
22:40:41 <^d> Gives us an excuse to have an interwiki transclusion rfc ;-)
22:40:56 <TimStarling> well, the existing system gives us that for core conf vars
22:41:05 <TimStarling> we just forgot to account for extensions
22:41:09 <^d> Well, that's part of what I was advocating.
22:41:15 <^d> Start with the current system.
22:41:20 <^d> And incrementally make it better.
22:41:27 <TimStarling> the current system is moderately complex
22:41:32 <vvv> One of the issues I discovered some time ago, is that you can't really tell anything about extension config unless extensions are aware of your system
22:41:47 <TimStarling> it's not just a key/value store, it has post-processing, like $stdLogo
22:41:54 <AaronSchulz> we already resolve config down for caching in wmf-config sort of
22:42:04 <ori-l> what sort of abstraction would tie different json configuration blobs together? how would you navigate between them?
22:42:25 <ori-l> a 10k-line file is horrible, but you can follow it by scrolling up and down
22:42:39 <legoktm> ori-l: the current proposal just has it stored in one big json file
22:42:44 <legoktm> each organized by "category"
22:42:56 <ori-l> that doesn't seem that compelling to me
22:43:10 <legoktm> site related vars, RL related vars, etc.
22:43:46 <ori-l> JSON is very limited
22:43:52 <TimStarling> the current system has ways to apply settings to groups of wikis
22:43:53 <ori-l> you can't reference or compose values
22:44:01 <^d> TimStarling: Which is one thing you have to keep.
22:44:15 <legoktm> TimStarling: my proposal included using db lists
22:44:19 <^d> (And I always thought was the hardest part to get right if you do things from scratch)
22:45:09 <TimStarling> so we are talking about a global web UI?
22:45:37 <ori-l> legoktm: ?
22:45:49 <TimStarling> would there be a local UI, for non-WMF wikis?
22:46:03 <^d> I think talking about a UI is putting the cart before the horse when we can't even come to agreement on architecture.
22:46:13 <ori-l> I don't think so
22:46:17 <legoktm> I was really only thinking of a local UI
22:46:22 <legoktm> but there probably should be a global one
22:47:14 <TimStarling> I think we need to talk about requirements
22:47:22 <bd808> Why not just put everything in the db except enough bootstrap config to find the database?
22:47:42 <ori-l> What would that solve?
22:47:50 <TimStarling> I think our discussions on architecture are fairly directionless because we don't have clear requirements
22:48:25 <TimStarling> so, I would like to make that an action item and move on to the last RFC in the last 10 minutes of the meeting
22:48:28 <ori-l> I want to flag an additional issue
22:48:31 <ori-l> oh, go ahead.
22:48:42 <bd808> ori-l: Just trying to think of how editing would work I guess. Serializing to disk seems gross.
22:49:18 <qgil> #info <TimStarling> I think we need to talk about requirements
22:49:20 <TimStarling> #action legoktm and other interested devs to develop requirements list on wiki
22:49:27 <legoktm> ok
22:49:48 <qgil> #topic Text extraction
22:49:53 <MaxSem> Okay, so basically this RFC is asking for decisions on 2 things: core vs. extension and whether to store the extracts in page_props, unconditionally
22:50:00 <qgil> #link https://www.mediawiki.org/wiki/Requests_for_comment/Text_extraction
22:50:23 <qgil> #info <MaxSem> this RFC is asking for decisions on 2 things: core vs. extension and whether to store the extracts in page_props, unconditionally
22:50:44 <TimStarling> MaxSem: what is the total data size?
22:51:14 <TimStarling> we can't store much in page_props, increasing core database size is quite expensive
22:51:33 <MaxSem> HTML extracts are rendered pages minus 30-50%
22:51:47 <AaronSchulz> no core DB bloat please :)
22:51:49 <TimStarling> so, enormous?
22:51:50 <MaxSem> text extracts are much shorter than wikitext
22:51:50 <^d> Can we cache them $somewhere?
22:52:07 <^d> If we know we'll get a decent hit rate.
22:52:08 <aude> seems odd to mix that with other stuff in page props
22:52:22 <vvv> Wouldn't that fit well into whatever backend for parser cache we are currently using?
22:52:24 <aude> it's big enough to go in it's own place
22:52:35 <TimStarling> roughly how many bytes per page?
22:53:10 <MaxSem> not sure about the stats
22:53:30 <^d> You know where we could stash them...elasticsearch.
22:53:41 <^d> It would be kind of perfect for that.
22:53:48 <MaxSem> what for?
22:54:24 <MaxSem> ES is a search engine first of all, KV storage is btter done with something else
22:54:28 <TimStarling> MaxSem: the extension thing, you would have a WikitionaryExtract or WikimediaExtract extension or something like that?
22:54:29 <^d> We've already got text extracts (more or less) living in elasticsearch.
22:54:53 <MaxSem> TimStarling, say WIkimediaExtracts
22:54:57 <aude> it would be good not to have dependency on elastic search, but possibly as a "store" option
22:55:12 <MaxSem> but I was thinking about making it doable with config settings only
22:55:23 <^d> MaxSem: Indeed. You can search for a single document's extract ;-)
22:55:37 <^d> I think it should be in core, tbh
22:55:51 <MaxSem> ^d, I'm more interested in batch retrieval than in searching
22:56:03 <TimStarling> I was under the misapprehension that extracts were from truncated versions of articles
22:56:14 <TimStarling> that is not the case, there is no truncation, right?
22:56:26 <MaxSem> you can request a lede extract
22:56:31 <^d> MaxSem: Also doable. Give me pages [1,500]
22:56:31 <MaxSem> or first N sentences
22:56:45 <TimStarling> but the whole thing is stored somewhere?
22:57:05 <MaxSem> currently, it's cached in memcached
22:57:12 <^d> Meh, we only store current revisions though.
22:57:17 <^d> -10 :(
22:57:29 <bd808> Would stored parsoid DOMs make this easier? Just an xslt transform?
22:57:39 <MaxSem> but on-demand rendering doesn't allow batch retrieval
22:57:52 <TimStarling> #info extracts are definitely too big for the core DB, need some other storage backend
22:58:23 <MaxSem> TimStarling, extension DB like AFT and Echo?
22:58:42 <TimStarling> we don't have time today to go through all the options
22:58:55 <aude> MaxSem: i assume we'd have per-content type implementations of formatting, or even the option to opt-out for certain content types?
22:59:13 <MaxSem> aude, I'm only interested in articles
22:59:16 <aude> e.g. extracts for wikidata entities would be pretty different or perhaps not make sense
22:59:23 <aude> MaxSem: ok, so wikitext content
22:59:24 <TimStarling> it is quite a similar problem to what we are doing with ElasticSearch
22:59:31 <MaxSem> if someone has a different use case... patches welcome!:P
22:59:42 <TimStarling> let me do a couple of action items...
22:59:47 <^d> Indeed, which is what lead to HtmlFormatter making its way to core.
23:00:17 <TimStarling> #action MaxSem and other interested devs to discuss storage backend options on RFC
23:00:52 <TimStarling> #action ^d to comment on RFC sharing experience with similar problem in ElasticSearch
23:01:08 <MaxSem> anything on core vs. extension?:)
23:01:25 <TimStarling> it's fine by me to have WikimediaExtracts
23:01:30 <^d> Core makes it easier for me to reuse more for Cirrus
23:01:48 <^d> But if everyone else thinks extension I can live with it.
23:02:04 <qgil> Anything else?
23:02:08 <TimStarling> maybe you should say on the RFC what exactly will be in WikimediaExtracts and why it is needed
23:02:22 <MaxSem> I was proposing basics in core with possibly WMF-specific stuff in an ext
23:02:30 <aude> and make sure to mention which content type(s) :)
23:03:38 <aude> and that other content types could "hook" into this and implement the extracts in a different format
23:03:39 <TimStarling> ok, all done?
23:03:41 <MaxSem> aude, I liek getText moar than getContent, will use it until it disappears:P
23:03:53 <aude> hah
23:04:02 <qgil> Thank you everyone! The next meeting is scheduled on 2013-12-04, same time -- https://www.mediawiki.org/wiki/Architecture_meetings
23:04:06 <qgil> #endmeeting