Extension talk:ParserFunctions/String functions/Archive

sub + len
Sounds good. How about adding a "pos" (first position of substring in string) to the set? For example, splitting a string at the first colon / space / slash can be interesting. -- Omniplex (w:t) 04:43, 16 May 2006 (UTC)


 * Unfortunately, all such functions have complexity of O(mn), where m and n are the sizes of the substring and string, respectively. I could impose a hard limit on the substring size in order to reduce the complexity, but such a limitation seems arbitrary and awkward.  Alternatively, I could limit searches to a single character, but this could severely hamper the function's usefulness in non-ASCII environments.  In short, I'm at a loss as to how to proceed.  Suggestions are welcome. --Algorithm 02:18, 17 May 2006 (UTC)


 * It's been some decades since I had to know what O(mn) stands for, but if m is the length of the needle and n the length of the haystack you'd start at most n-m+1 comparisons. Actually comparing m bytes (ignoring potential UTF-8 optimizations) is the worst case, but unlikely; often you'd get "no match" much earlier. Reduced to ASCII and m > 2, the chance that you need more than two bytes is already near 1/(128*128), let's say below 1/10000. Overall I'd guess that 2*(n-m+1) is very pessimistic, about 2n. Is O(nm) possible, in theory? I see where you might need m at a certain position, but then the next position can't take m again, or can it? -- Omniplex (w:t) 02:43, 17 May 2006 (UTC)


 * The cases in which it matters are few, but they're still important, as they are exploitable. For example, you could try to find "11111111111111111111111111" inside (assuming no linebreaks):

11111111111111111111111110111111111111111111111111101111111111111111111111111011111111111111111111 11111011111111111111111111111110111111111111111111111111101111111111111111111111111011111111111111 11111111111011111111111111111111111110111111111111111111111111101111111111111111111111111011111111 11111111111111111011111111111111111111111110111111111111111111111111101111111111111111111111111011 11111111111111111111111011111111111111111111111110111111111111111111111111101111... (ad nauseum)
 * This process would eventually return false, but it would have to check far more than O(n) possibilities. --Algorithm 07:42, 17 May 2006 (UTC)


 * That example is skewed, you've reduced it to (in essence) two characters instead of my 128 for ASCII. To get below a probability of 1/10000 you need 2**14 (= about 14 comparisons) instead of 128**2 (= about 2 comparisons) until no match is very likely (99.99%).
 * For your example I'd expect about 14n comparisons as already too pessimistic. Checking that wild theory with your example I get:
 * M = 26, N = 488, no match after 6303 comparisons. And 6303 is 12.91*488 near the expected 14n, less than mn = 26n. -- Omniplex (w:t) 00:59, 18 May 2006 (UTC)


 * You could get m(n-m+1) from checking "111111111111111111111111112" against a string of only ones, when you didn't expect the string to contain only ones. Any result worse than that is impossible, but that's pretty bad. --Brilliand 04:03, 6 February 2007 (UTC)

Thinking about that experimental result, maybe n*m/c is realistic, where c is the number of common characters. Uncommon characters are irrelevant, they only accelerate the no match conditions. Maybe c = 100 makes sense in practice, with that we'd get n*m/100. So if you limit m to 100 or fewer characters you should arrive at n comparisons as the average case. -- Omniplex (w:t) 01:20, 18 May 2006 (UTC)
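The comparison counts discussed above can be reproduced with a small counter. This is a minimal sketch, assuming a plain left-to-right search that restarts at every position; `countNaiveComparisons()` is a hypothetical helper, not the code #pos: uses. It is run here on Brilliand's worst case, a needle of ones ending in "2" against a haystack of only ones.

```php
<?php
// Count byte comparisons made by a naive left-to-right substring search.
// countNaiveComparisons() is a hypothetical helper for illustration only.
function countNaiveComparisons(string $haystack, string $needle): array
{
    $n = strlen($haystack);
    $m = strlen($needle);
    $comparisons = 0;
    for ($i = 0; $i + $m <= $n; $i++) {
        $j = 0;
        while ($j < $m) {
            $comparisons++;
            if ($haystack[$i + $j] !== $needle[$j]) {
                break;              // mismatch: give up on this start position
            }
            $j++;
        }
        if ($j === $m) {
            return ['pos' => $i, 'comparisons' => $comparisons];
        }
    }
    return ['pos' => -1, 'comparisons' => $comparisons];
}

$needle   = str_repeat('1', 26) . '2';  // m = 27
$haystack = str_repeat('1', 500);       // n = 500
$result   = countNaiveComparisons($haystack, $needle);
echo $result['comparisons'], "\n";      // 12798, i.e. m * (n - m + 1)
```

Every one of the n - m + 1 start positions needs all m comparisons before the final "2" fails, which is why this input reaches the m(n-m+1) bound.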

Hello, heard of the Boyer-Moore string search algorithm? The longer the key the faster the search. Unless you want to reinvent the wheel, Wikipedia is a good place to start. 59.112.41.191 17:56, 19 June 2006 (UTC)


 * Heard: yes, studied or implemented: no. Cute, thanks for the link. Firmly in O(n) territory for this. But probably #pos: uses an existing function, not this compare backwards and jump optimization. -- Omniplex (w:t) 16:19, 20 June 2006 (UTC)
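The "compare backwards and jump" idea can be sketched with the Horspool simplification of Boyer-Moore, which keeps only the bad-character shift table. This is an illustrative sketch, not the full Boyer-Moore algorithm and not what #pos: does; `horspoolPos()` is a hypothetical name, and it works on bytes rather than characters. Note that this simplified variant is fast on typical text but still has an O(mn) worst case.

```php
<?php
// Boyer-Moore-Horspool search: compare the window right to left, then jump
// by a precomputed shift based on the window's last byte.
// horspoolPos() is a hypothetical helper; returns -1 when nothing is found.
function horspoolPos(string $haystack, string $needle): int
{
    $n = strlen($haystack);
    $m = strlen($needle);
    if ($m === 0 || $m > $n) {
        return -1;
    }
    // For every byte except the last needle byte, remember the distance
    // from its last occurrence to the end of the needle.
    $shift = [];
    for ($k = 0; $k < $m - 1; $k++) {
        $shift[$needle[$k]] = $m - 1 - $k;
    }
    $i = 0;
    while ($i + $m <= $n) {
        $j = $m - 1;
        while ($j >= 0 && $haystack[$i + $j] === $needle[$j]) {
            $j--;
        }
        if ($j < 0) {
            return $i;                      // whole window matched
        }
        $last = $haystack[$i + $m - 1];
        $i += $shift[$last] ?? $m;          // bytes not in the needle skip m
    }
    return -1;
}

echo horspoolPos('hello world', 'world'), "\n";  // 6
```

A longer needle allows longer jumps whenever the last byte of the window does not occur in it, which is the sense in which a longer key can make the search faster.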

#urlencode
The new #pos: is nice, if it's enabled here together with #len: and #sub:.

For #urlencode: the difference from the new magic word urlencode: would be interesting, I know sh*t about PHP, it has apparently two similar functions, one styling itself as "raw", and that's not your #urlencode: (?)

For #urldecode: I've no clue where that could be ever useful, can they together by chance protect the infamous "{", "|", "}", and "=" in templates? -- Omniplex (w:t) 18:13, 21 May 2006 (UTC)


 * #urlencode should be absolutely identical to the new urlencode: function. They both use the same PHP function.  As for #urldecode, it will indeed protect those characters, but #urlencode won't be able to encode all of them.  Hence, if you want a "|" in the output, put %7C in its place and #urldecode it when it's no longer dangerous. (Note that this has its own issues; namely, you can't put raw "+"s or "%"s into encoded output and expect them to be unaffected by the decode.) --Algorithm 22:18, 21 May 2006 (UTC)
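The %7C trick and its caveat can be seen with PHP's own functions. This is a small sketch using `urlencode()`, `rawurlencode()` and `urldecode()` directly; it assumes the parser functions simply pass their argument through to them, as described above.

```php
<?php
// Write the pipe as %7C and decode it once it is no longer dangerous.
echo urldecode('a%7Cb'), "\n";          // a|b

// The two encoders differ in how they treat a space.
echo urlencode('a b'), "\n";            // a+b
echo rawurlencode('a b'), "\n";         // a%20b

// Caveat: decoding also rewrites raw "+" and "%XX" in the input.
echo urldecode('1+1=2'), "\n";          // 1 1=2
echo urldecode('100%25 sure'), "\n";    // 100% sure
```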


 * Ganglieri's (sp?) proposal on Talk:ParserFunctions was interesting, a function keeping all (top level) "|" as is. As if it accepts only one argument (ignoring "|"), actually accepting zero or more arguments returning the concatenation of the zero or more result strings separated by "|". Something that's better than my template:! kludge. Probably #urldecode: %7C isn't the solution, unless it works as Wiki table markup like . -- Omniplex (w:t) 16:44, 22 May 2006 (UTC)


 * FYI: 6219, maybe this also hits #urlencode:. -- Omniplex (w:t) 11:50, 6 June 2006 (UTC)

Character substitution/replace
Hi,

I thought about adding a very simple replace function that would allow substituting/replacing strings very easily (unlike "subst"). I thought about it and wrote the following function:

function runReplace( &$parser, $inStr = '', $ReplaceFrom = '', $ReplaceTo = '' ) {
    // Replace every occurrence of $ReplaceFrom in $inStr with $ReplaceTo.
    $inStr = str_replace( $ReplaceFrom, $ReplaceTo, $inStr );
    return $inStr;
}

I think that it's a good candidate for a generic function. I used it to convert Windows paths ("\\server\folder1\folder2\file name with space.doc") to file:// URIs ("file:\\server\folder1\folder2\file%20name%20with%20space.doc"). URL encode is not good because it replaces the "\".

Thanks, Meir :->


 * I also think that it is a useful function with a simple code. --jsimlo 13:46, 18 July 2006 (UTC)
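Meir's path conversion can be sketched in a few lines. The path below is a shortened, made-up example; the point is that `str_replace()` touches only the spaces, while `rawurlencode()` would escape the backslashes as well.

```php
<?php
// Escape only the spaces of a Windows UNC path (example path is invented).
$path = '\\\\server\\folder1\\file name with space.doc';
echo 'file:' . str_replace(' ', '%20', $path), "\n";
// file:\\server\folder1\file%20name%20with%20space.doc

// A URL encoder also rewrites every backslash.
echo rawurlencode('\\'), "\n";   // %5C
```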

Wikipedia
Is there any plan on implementing this StringFunctions on Wikimedia projects? Borgx 08:13, 6 June 2006 (UTC)
 * I would really find some of them useful creating templates. --81.231.179.17 11:28, 23 July 2006 (UTC)


 * So would I. Especially since we're indexing articles by the second letter at sv-wikt. //Shell 16:10, 14 October 2006 (UTC)


 * I agree. 72.139.119.165 19:03, 17 October 2006 (UTC)


 * Please implement it in wikinews, too. It can be used for automatically generated teasers. --84.178.44.163 15:06, 3 December 2006 (UTC)


 * StringFunctions would be invaluable for increasing the complexity of template calls to decrease needed work by humans on Wikipedia. I can think of a concrete example: automatic template categorization. Under the categorization system, items are by default categorized by letter in a category based on their title. This is appropriate, but the system fails to recognize that titles starting with "The" should be categorized as though their titles were "Title, The". While this is irrelevant while ordinarily titling a page, since humans will add the category, it is relevant for categories embedded within a template, since the template is not even directly edited for a single article. StringFunctions would allow an article to be automatically sorted under such a proper title, by checking only the first 4 characters for the string "The " (note the space character, added so that articles with titles like "Thespian" are not mistreated). This functionality exists under StringFunctions. I hope it is implemented throughout Wikimedia as a useful tool. Nihiltres 05:02, 17 February 2007 (UTC)


 * Is anybody going to answer the above questions? When will these functions be implemented on Wikimedia projects?
 * See 6455. Hillgentleman | 書 |2007年03月25日( Sun ), 15:02:37 15:02, 25 March 2007 (UTC)
 * Just to chime in my support, I too would find changes to respect whitespaces very useful for the exact reason that Nihiltres mentions above... using something like  where title = "The Facts of Life" leads to alphabetization under " Facts of Life" instead of "Facts of Life".  71.137.18.65 05:15, 27 March 2007 (UTC)
 * Unfortunately, this is (AFAIK) not possible directly in StringFunctions. The parameters are being parsed and handled by the global wiki Parser, which is also responsible for the trimming. StringFunctions receive all parameters already trimmed, unless you use the <nowiki> workaround. --jsimlo(talk 07:13, 27 March 2007 (UTC)
 * Is it possible to use the nowiki workaround in the second parameter (the one specifying the string you're searching for) rather than the third parameter (the string you're putting in as a replacement)? Until then, I think I've figured out a work around... To change "The Facts of Life" to "Facts of Life":  It's a bit cumbersome, but I think it will work... 71.137.18.65 05:05, 28 March 2007 (UTC)
 * And what about  instead? --jsimlo(talk 20:59, 28 March 2007 (UTC)
 * This seems to be a simple problem using the #sub function and a couple of ParserFunctions: Nihiltres 07:07, 16 April 2007 (UTC)
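The "The " check discussed in this thread looks roughly like this in PHP terms. It is only a sketch of the logic, not wikitext that can be pasted into a template; `sortKey()` is a hypothetical helper.

```php
<?php
// Build a sort key that moves a leading "The " to the end of the title.
// sortKey() is a hypothetical helper for illustration.
function sortKey(string $title): string
{
    // Compare the first 4 bytes, space included, so "Thespian" is untouched.
    if (substr($title, 0, 4) === 'The ') {
        return substr($title, 4) . ', The';
    }
    return $title;
}

echo sortKey('The Facts of Life'), "\n";  // Facts of Life, The
echo sortKey('Thespian'), "\n";           // Thespian
```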

Proposals
There is still no reaction on several requests for this extension. I guess it is mainly because there is no one to handle it. Correct me, if I am wrong.. ;) Otherwise, I would like to improve the extension and satisfy some of the reasonable requests here, if there is no one against it. Please, add comments and votes for adding the proposed functions below into the main code. Thank you. --jsimlo(talk 11:36, 10 September 2006 (UTC)

#explode
A tokenizer that would split a string into pieces and return one of them, exposing the php explode function? Usage like this:  would give you the second sub-name of the current sub-sub-article. E.g.  would result in. Code:

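function runExplode ( &$parser, $inStr = '', $inDiv = '', $inPos = 0 ) {
    // explode() cannot split on an empty delimiter.
    if ( $inDiv === '' ) return '';
    $tokens = explode( $inDiv, $inStr );
    // Return an empty string when the requested token does not exist.
    if ( !isset( $tokens[intval( $inPos )] ) ) return '';
    return $tokens[intval( $inPos )];
}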

Votes:
 * Add per nom. --jsimlo(talk 11:36, 10 September 2006 (UTC)

#replace
Replaces each occurrence of a needle in the haystack, exposing the php str_replace function. Usage like this:  would replace all slashes in the current page name with colons. E.g.  would result in. Code:

function runReplace ( &$parser, $inStr = '', $ReplaceFrom = '', $ReplaceTo = '' ) {
    return str_replace( $ReplaceFrom, $ReplaceTo, $inStr );
}

Votes:
 * Add per nom. --jsimlo(talk 11:36, 10 September 2006 (UTC)

Would like to have the following chunk added to the replace function:

if ( $ReplaceFrom == '' ) {
    $ReplaceFrom = ' ';
}

The system assumes that passing a blank character means that nothing was passed, so with this the function can search for the space character. Unless someone can think of a better way of doing it. Denomales 03:09, 22 October 2006 (UTC)
 * Change Request
 * Fixed in ver. 1.6. --jsimlo(talk 08:48, 26 October 2006 (UTC)


 * A way to use $ReplaceTo = ' ', but preserving the possibility of using $ReplaceTo = '', would be useful too.--Patrick 13:15, 24 January 2007 (UTC)


 * I found a way: To use a space as to-string, put it in nowiki tags.--Patrick 13:28, 24 January 2007 (UTC)

#pad
Pads the input string to a specified width, aligning it to the left, right or center. Usage like this:  would result in. Code:

function runStrPad ( &$parser, $inStr = '', $inLen = 0, $inWith = ' ', $inDirection = 'left' ) {
    // Map the direction keyword to a str_pad() constant; unknown values fall back to STR_PAD_LEFT.
    switch ( strtolower( $inDirection ) ) {
        case 'left':
        default:
            $direction = STR_PAD_LEFT;
            break;
        case 'center':
            $direction = STR_PAD_BOTH;
            break;
        case 'right':
            $direction = STR_PAD_RIGHT;
            break;
    }
    return str_pad( $inStr, intval( $inLen ), $inWith, $direction );
}

Votes:
 * Add per nom. --jsimlo(talk 11:36, 10 September 2006 (UTC)
 * Remove. This function is extremely abusable in DoS attacks. Ex:  --67.171.232.100 00:08, 29 October 2006 (UTC)
 * Comment. What about limiting the length value instead, like it was limited with #pos function? --jsimlo(talk 09:54, 30 October 2006 (UTC)
 * Comment. Since MediaWiki 1.8 there are padleft and padright core functions which do much the same, I think. See Help:Magic_words. --Majoran 05:17, 30 November 2006 (UTC)
 * Remove or limit See the page - pad is O(10^n) (remember, n is the number of digits, not the value itself) and replace is O(n^2). Suggest limiting the "from" and "to" values of replace and the "delimiter" of explode to 30 characters in length (as with pos - I would support anything from 10 to 30 characters, I'd vote against anything outside this range) and the length value of pad to 99. I'd also recommend limiting "value" or "string" for everything EXCEPT len and sub to some length limit in the range of .5K - 5K, preferably 1K. Even O(n) is a DOS attack if it is easy to make n be a 100K page templated from somewhere. And finally, even with all these limits, it would be easy to make a "replace" call that returned a 30K-long string - there needs to be a further limit targeted to the output of replace (and possibly urlencode?). For ease of programming (just check limits then expose the PHP, rather than building your own function), this could be that len(to)*len(value)/len(from) can't be over twice the limit on len(value) - that is, the "to" can only be twice as long as the "from" unless you're sure that your "value" string is a fraction of the length limit. The "len" in question is byte-length, ie 2 for each standard-codepage unicode char and 1 for each ascii. --Homunq
 * Okay, then I shall update the code ASAP. --jsimlo(talk 16:08, 24 January 2007 (UTC)
 * Done.. :))) --jsimlo(talk 13:31, 30 January 2007 (UTC)
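The "check limits, then expose the PHP" approach proposed by Homunq might look like the sketch below. The limit values are picked from the suggested ranges, `limitedReplace()` is a hypothetical name, and returning the input unchanged when a limit is exceeded is an assumption; none of this is taken from the code that was actually committed.

```php
<?php
// Guarded replace: reject oversized arguments before calling str_replace().
// limitedReplace() and both limits are hypothetical, for illustration only.
function limitedReplace(string $value, string $from, string $to): string
{
    $maxNeedleBytes = 30;    // limit for "from" and "to" (as with pos)
    $maxValueBytes  = 1024;  // limit for the input string (1K)

    if ($from === '' || strlen($from) > $maxNeedleBytes || strlen($to) > $maxNeedleBytes) {
        return $value;
    }
    if (strlen($value) > $maxValueBytes) {
        return $value;
    }
    // Output bound: len(to) * len(value) / len(from) must not exceed
    // twice the limit on len(value).
    if (strlen($to) * strlen($value) / strlen($from) > 2 * $maxValueBytes) {
        return $value;
    }
    return str_replace($from, $to, $value);
}

echo limitedReplace('Main/Sub/Leaf', '/', ':'), "\n";  // Main:Sub:Leaf
```

All lengths here are byte lengths from `strlen()`, matching the note above that the limits are meant in bytes.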

Do these functions work?
I have just one question.
 * Do these functions work?

I am asking because all of my tests with #sub: and #len: have failed. --70.49.162.137 20:55, 24 August 2006 (UTC)


 * Yes, if you have them installed. They are, however, not installed here: . --jsimlo 21:56, 24 August 2006 (UTC)


 * How do I install them? 70.49.117.62 23:28, 27 August 2006 (UTC)


 * My apologies, I had not realized that there was nothing about installation in the article. I have just added an Installation section. If any trouble occurs, leave a note here. --jsimlo 12:49, 28 August 2006 (UTC)

Update for 1.8.0
I added the following to stop the code from throwing errors after upgrading to MediaWiki 1.8.0, and after a quick test everything seems to be working well.--Raran 17:31, 11 October 2006 (UTC)

$wgHooks['LanguageGetMagic'][] = 'wfStringFunctionsLanguageGetMagic';

// Register the magic words for every language (no per-language aliases).
function wfStringFunctionsLanguageGetMagic( &$magicWords, $langCode ) {
    switch ( $langCode ) {
        default:
            $magicWords['len']       = array( 0, 'len' );
            $magicWords['pos']       = array( 0, 'pos' );
            $magicWords['rpos']      = array( 0, 'rpos' );
            $magicWords['sub']       = array( 0, 'sub' );
            $magicWords['pad']       = array( 0, 'pad' );
            $magicWords['replace']   = array( 0, 'replace' );
            $magicWords['explode']   = array( 0, 'explode' );
            $magicWords['urlencode'] = array( 0, 'urlencode' );
            $magicWords['urldecode'] = array( 0, 'urldecode' );
    }
    return true;
}

Documentation and Testing
As far as I can tell, replace and pad are not working, or at the very least the usage info is wrong. Also, several of these functions have NO docs. Do you think someone who is using them could, at least, edit the main page with some more complete information? Thanks! --Billwsmithjr 15:44, 9 November 2006 (UTC)

trim
Maybe a #trim function could be useful, to get rid of whitespaces. (Available Workaround:  ) --Majoran 04:57, 30 November 2006 (UTC)

Doesn't work on 1.6.8, requires modification to work
This extension adds function like len and so on, but parser.php looks for #len (maybe, it was changed in later versions?). So, the modification is following:  to. At least, works fine for me.


 * This is the same problem as with ParserFunctions for MediaWiki below 1.7. See section about Installation. Most of the stuff there applies here as well, only the names are changed. I guess those workarounds should be added to the StringFunctions as well, shouldn't they? --jsimlo(talk 21:27, 4 December 2006 (UTC)


 * We could do a simple trick like

$prefix = version_compare( $wgVersion, '1.7', '<' ) ? '#' : '';
$wgParser->setFunctionHook( $prefix . 'len', ...
 * to make it portable. I don't know in which version wiki switched from  to   style, so put 1.7 for example.
 * As for ParserFunction, it's generally a bad idea to use PHP5-specific stuff when one doesn't really need it :-\
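One caveat about the version check in the trick above: comparing a version string against a number misorders releases such as 1.10, so `version_compare()` is the safer test. A short sketch follows; `hookPrefix()` is a hypothetical helper, and 1.7 is only the guess made above for the version where the hook names changed.

```php
<?php
// A numeric comparison reads "1.10" as 1.1, so it sorts before 1.7.
var_dump('1.10' < 1.7);                          // bool(true)

// version_compare() compares the dot-separated parts one by one.
var_dump(version_compare('1.10', '1.7', '<'));   // bool(false)
var_dump(version_compare('1.6.8', '1.7', '<'));  // bool(true)

// hookPrefix() is a hypothetical helper; the 1.7 cutoff is a guess.
function hookPrefix(string $wgVersion): string
{
    return version_compare($wgVersion, '1.7', '<') ? '#' : '';
}

echo hookPrefix('1.6.8'), 'len', "\n";  // #len
```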

PCREs
Also, I'd love to have preg_-family functions here for splitting and replacing tokens by PCREs. I can write those functions myself, just asking if there're any objections to have them in this extension.


 * I guess this will be a looong fight to it. PCREs are a bit slow. Look all over this talk page - it is all about whether something is exploitable by a DOS attack or not. --jsimlo(talk 21:27, 4 December 2006 (UTC)


 * They are not so slow :) At least, not slower than POSIX regexes. Perl itself has only regexes to manage strings. Anyway, if page is cached they're not executed; otherwise, db queries will take much more time than those functions. But I don't insist on adding them, of course.

#explode
I'd like to add to #explode the ability for the position argument to be negative, returning tokens from the end of the string. This is similar to several of the other functions using negative indexes. Currently there is no way to know how many tokens there will be, so you can't figure out what the index of the last token is otherwise; this at least lets you grab tokens at the end. Vash 18:46, 29 March 2007 (UTC)
 * Sounds reasonable and simple. Will add asap. --jsimlo(talk 08:12, 30 May 2007 (UTC)
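The negative position requested here could work along these lines. This is a sketch of the idea only, with a hypothetical `explodeAt()` helper, and not the change that was later made to the extension.

```php
<?php
// Return one token of a split string; negative positions count from the end.
// explodeAt() is a hypothetical helper for illustration.
function explodeAt(string $inStr, string $inDiv, int $inPos): string
{
    if ($inDiv === '') {
        return '';                   // explode() rejects an empty delimiter
    }
    $tokens = explode($inDiv, $inStr);
    if ($inPos < 0) {
        $inPos += count($tokens);    // -1 becomes the index of the last token
    }
    return $tokens[$inPos] ?? '';    // out-of-range positions give ''
}

echo explodeAt('Main/Sub/Leaf', '/', -1), "\n";  // Leaf
echo explodeAt('Main/Sub/Leaf', '/', 1), "\n";   // Sub
```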

Bugzilla: add on wikimedia
The bug requesting these to be added for the big sites (wikipedia, etc.) is 6455. If you really want these functions, go (register and) vote for that bug. If you have an objection, that bug would also be an appropriate place to express it.--201.216.136.95 18:23, 10 January 2007 (UTC)


 * Is there a public Wiki with StringFunctions installed?--Hillgentleman | 書 |2007年02月26日( Mon ), 19:53:30


 * has StringFunctions and DynamicFunctions. Polonium 21:27, 13 March 2007 (UTC)
 * explode seems not to be functioning over there. I came up with a bad substitute:  convert the string into unicode, edit the monobook so that "%" are automatically converted into "|", then cut and paste.---Hillgentleman | 書 |2007年03月22日( Thu ), 10:06:56 10:06, 22 March 2007 (UTC)


 * Problems on mediawiki 1.10; I can’t get StringFunctions going on mediawiki 1.10. is this a general problem or just mine?? thx --Bartleby 15:50, 18 July 2007 (UTC)

Locating a pipe in text
I'm trying to write an #explode expression on my wiki that uses a pipe ( "|" ) as the delimiter. It seems that this is impossible, though, as StringFunctions for whatever reason does not consider the {{!}} template (which contains only "|" ) a pipe. I can't use {{!}} to look for a pipe with #replace, #pos, or any StringFunction I've tried. Any help would be greatly appreciated! -69.122.203.50 22:11, 7 April 2007 (UTC)
 * Nevermind, got it! I'm using &#124; now to find pipes. -69.122.203.50 22:14, 7 April 2007 (UTC)

Page name variations
I am trying to figure out how to get all page name variations in order to list redirects for Dynamic Page List. If I have a page called "cutscene", I can easily get prefix variations like "cutscenes", "cutscened", "cutscened", etc, but how would I get "cut scene", "cut scenes", "cutscening", etc? I'm thinking string functions would be how but I'm not much of a programmer... -Eep² 06:26, 10 August 2007 (UTC)


 * Hello? I also need to be able to get other variations like "cities" from "city" and a way to remove the last "s" and replace it with "ies" too. How would I do these things with StringFunctions? —Eep² 14:01, 20 August 2007 (UTC)


 * Yes, I read you. Unfortunately, I have no idea how to implement such things. What you are talking about is based on some kind of dictionary and grammar rules. Such things differ from language to language. Though, you can replace parts of strings and retrieve substrings, so you should be able to create "cut scene" from "cutscene". I am not sure about making "cities" from "city". How would you treat the word "stress"? -- jsimlo(talk 09:14, 23 August 2007 (UTC)


 * How would I get "cut scene" from "cutscene"? As for stress, that's easy: just add an "es" to make "stresses" or an "or" to make "stressor". A per-word(s) style is fine but I just need to know how to do it programmatically. For "cities", how would I remove/replace the last 3 letters and add (or replace them with) a "y"? —Eep² 14:03, 23 August 2007 (UTC)
 * The first works only for "city", the second for every word ending in -ies, the third shows how to get "cut scene" from "cutscene" (it works only with this word):
 * y
 * y
 * 
 * Arath 17:19, 25 August 2007 (UTC)


 * Thanks! —Eep² 22:36, 26 August 2007 (UTC)
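In PHP terms, the three kinds of manipulation asked about in this thread come down to taking substrings from either end. The sketch below only illustrates that logic; the helper names are hypothetical, and it is not the wikitext Arath posted.

```php
<?php
// "city" -> "cities": drop a trailing "y" and append "ies".
function pluralY(string $word): string
{
    return substr($word, -1) === 'y' ? substr($word, 0, -1) . 'ies' : $word;
}

// "cities" -> "city": drop a trailing "ies" and append "y".
function singularIes(string $word): string
{
    return substr($word, -3) === 'ies' ? substr($word, 0, -3) . 'y' : $word;
}

echo pluralY('city'), "\n";        // cities
echo singularIes('cities'), "\n";  // city

// "cutscene" -> "cut scene": split at a fixed offset (works for this word only).
echo substr('cutscene', 0, 3) . ' ' . substr('cutscene', 3), "\n";  // cut scene
```

As noted above, rules like these are language-specific: `pluralY()` leaves "stress" alone, but it has no way to know that the right plural there is "stresses".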

Documentation style and wording
Hi, sorry about reverting your entire work, but it seemed to me to be better this way because:
 * I strongly disagree with replacing the word returns with ->. I believe the latter is much less readable and comprehensible for common users. Somehow, I think that wordy and lengthy documentation is better than packed short notation. Why? Because lengthy documentation is simple, light and clear for the brain to process. Short notations need to be parsed and translated while reading.. :))
 * Lengthy documentation is not simple, light, and clear--that's why it's lengthy, which implies more complex, heavy, and wordy (cluttered). —Eep² 18:47, 27 August 2007 (UTC)
 * When you say the same amount of information in a lengthy style, the result is light. When you say the same in a shorter way, the result is heavy. What I wanted to say is that I am against using short notations instead of words. E.g. do not use ->, use results in or returns. -- jsimlo(talk 20:17, 27 August 2007 (UTC)


 * I find your version of #sub much the same (in the sense of information), but more complicated (in the sense of order). I would prefer to follow GNU docs rules while writing these docs. Important description first; notes, borders, side-effects and deviations later.
 * GNU has doc rules too? What's next, GNU rules on how to brush your teeth? I prefer contextual documentation where any notes are directly relevant to that parameter instead of at the bottom, requiring reference back to the parameter in question. —Eep² 18:47, 27 August 2007 (UTC)
 * Yes. Every bigger project has some rules. And GNU projects are usually big, working ones. So, look, any usable docs should contain ALL relevant info that may become necessary. However, a lot of such info can be quite boring to most common users. Therefore, projects like GNU, Microsoft, Sun and many others have a style in which they first lightly describe the overall main purpose; then describe the common purpose of each parameter; and only then describe the notes and remarks that are not very common, but may come in handy when specific troubles arise. Then the reader may read until his questions are answered. He does not need to read through all the docs just because there are some side-effects that may happen somewhere. Btw: Browse through the docs of the mediawiki software. You can find there what I am talking about. -- jsimlo(talk 20:17, 27 August 2007 (UTC)


 * There is no need to use <tt> for links to the php.net site.
 * The monospace is for PHP functions, not links. —Eep² 18:47, 27 August 2007 (UTC)


 * I do not find your examples to be the common examples of usage. They are rather special cases of what can be done, if wanted. Each example introduces a hidden question of "What the author wanted to say with this example?" Keep that in mind when creating examples. Always have a clear and simple vision of what you are trying to say and then ask yourself whether the readers will be able to follow you. Confusing examples can make the docs hard to read and understand. -- jsimlo(talk 11:25, 27 August 2007 (UTC)
 * And all the Žmržlina examples are common? Come on... Besides, a simple description for the examples I provided can easily answer any "hidden questions". —Eep² 18:47, 27 August 2007 (UTC)
 * Well, in the sense of multibyte utf-8 characters, yes, I think they are. Why? Because the Ž is a multibyte character. So I assume that when the reader reads the example, he follows the clue of Ž being a multibyte character, which is nevertheless correctly counted as one single character. -- jsimlo(talk 20:17, 27 August 2007 (UTC)
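The difference between bytes and characters in the Žmržlina example can be checked directly. This sketch assumes the source file is saved as UTF-8, and it counts characters with a /u regular expression so that it does not depend on the mbstring extension; how the extension itself counts is not shown here.

```php
<?php
$word = 'Žmržlina';

// strlen() counts bytes: "Ž" and "ž" take two bytes each in UTF-8.
echo strlen($word), "\n";                   // 10

// With the /u modifier, "." matches one whole UTF-8 character at a time.
echo preg_match_all('/./us', $word), "\n";  // 8
```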

Again, I did not mean to offend you or argue about petty things. I simply read your version and got confused. So I asked myself: is it just me, or is it the text I am reading? Well. After a while I decided the latter was the problem and I reverted you. Then I realized and reproduced some of the good work you have done.

Since I am the current developer and maintainer of the extension, I believe I should be able to continue my work (updating the code and docs) whenever necessary. I can not do that when I get confused. But I also admit that my version is not the perfect one. -- jsimlo(talk 20:17, 27 August 2007 (UTC)