Extension talk:ParserFunctions/String functions/Archive

sub + len
Sounds good. How about adding a "pos" (first position of substring in string) to the set? Example, splitting a string at the first colon / space / slash can be interesting. --&#160;Omniplex&#160;(w:t) 04:43, 16 May 2006 (UTC)


 * Unfortunately, all such functions have complexity of O(mn), where m and n are the sizes of the substring and string, respectively. I could impose a hard limit on the substring size in order to reduce the complexity, but such a limitation seems arbitrary and awkward.  Alternatively, I could limit searches to a single character, but this could severely hamper the function's usefulness in non-ASCII environments.  In short, I'm at a loss as to how to proceed.  Suggestions are welcome. --Algorithm 02:18, 17 May 2006 (UTC)


 * It's some decades ago when I had to know what O(mn) stands for, but if m is the length of the needle and n the length of the haystack you'd start at most n-m+1 comparisons. Actually comparing m bytes (ignoring potential UTF-8 optimizations) is the worst case, but unlikely, often you'd get "no match" much earlier. Reduced to ASCII and m > 2 the chance that you need more than two bytes are already near 1/(128*128), let's say below 1/10000. Overall I'd guess that 2*(n-m+1) is very pessimistic, about 2n. Is O(nm) possible, in theory? I see where you might need m at a certain position, but then the next position can't take m again, or can it? --&#160;Omniplex&#160;(w:t) 02:43, 17 May 2006 (UTC)


 * The cases in which it matters are few, but they're still important, as they are exploitable. For example, you could try to find "11111111111111111111111111" inside (assuming no linebreaks):

11111111111111111111111110111111111111111111111111101111111111111111111111111011111111111111111111 11111011111111111111111111111110111111111111111111111111101111111111111111111111111011111111111111 11111111111011111111111111111111111110111111111111111111111111101111111111111111111111111011111111 11111111111111111011111111111111111111111110111111111111111111111111101111111111111111111111111011 11111111111111111111111011111111111111111111111110111111111111111111111111101111... (ad nauseum)
 * This process would eventually return false, but it would have to check far more than O(n) possibilities. --Algorithm 07:42, 17 May 2006 (UTC)


 * That example is skewed, you've reduced it to (in essence) two characters instead of my 128 for ASCII. To get below a probability 1/10000 you need 2**14 (= about 14 comparisons) instead of 128**2 (= about 2 comparisons) until no match is very likely (99.999%).
 * For your example I'd expect about 14n comparisons as already too pessimistic. Checking that wild theory with your example I get:
 * M = 26, N = 488, no match after 6303 comparisons. And 6303 is 12.91*488 near the expected 14n, less than mn = 26n. --&#160;Omniplex&#160;(w:t) 00:59, 18 May 2006 (UTC)


 * You could get m(n-m+1) from checking "111111111111111111111111112" against a string of only ones, when you didn't expect the string to contain only ones. Any result worse than that is impossible, but that's pretty bad. --Brilliand 04:03, 6 February 2007 (UTC)

Thinking about that experimental result, maybe n*m/c is realistic, where c is the number of common characters. Uncommon characters are irrelevant, they only accelerate the no match conditions. Maybe c = 100 makes sense in practice, with that we'd get n*m/100. So if you limit m to 100 or less characters you should arrive at n comparisons as average case. --&#160;Omniplex&#160;(w:t) 01:20, 18 May 2006 (UTC)

Hello, heard of the Boyer-Moore string search algorithm? The longer the key the faster the search. Unless you want to reinvent the wheel, Wikipedia is a good place to start. 59.112.41.191 17:56, 19 June 2006 (UTC)


 * Heard: yes, studied or implemented: no. Cute, thanks for the link. Firmly in O(n) territory for this. But probably #pos: uses an existing function, not this compare backwards and jump optimization. --&#160;Omniplex&#160;(w:t) 16:19, 20 June 2006 (UTC)

#urlencode
The new #pos: is nice, if it's enabled here together with #len: and #sub:.

For #urlencode: the difference from the new magic word urlencode: would be interesting, I know sh*t about PHP, it has apparently two similar functions, one styling itself as "raw", and that's not your #urlencode: (?)

For #urldecode: I've no clue where that could be ever useful, can they together by chance protect the infamous "{", "|", "}", and "=" in templates? --&#160;Omniplex&#160;(w:t) 18:13, 21 May 2006 (UTC)


 * #urlencode should be absolutely identical to the new urlencode: function. They both use the same PHP function.  As for #urldecode, it will indeed protect those characters, but #urlencode won't be able to encrypt all of them.  Hence, if you want a "|" in the output, put %7C in its place and #urldecode it when it's no longer dangerous. (Note that this has its own issues; namely, you can't put raw "+"s or "%"s into encoded output and expect them to be unaffected by the decode.) --Algorithm 22:18, 21 May 2006 (UTC)


 * Ganglieri's (sp?) proposal on Talk:ParserFunctions was interesting, a function keeping all (top level) "|" as is. As if it accepts only one argument (ignoring "|"), actually accepting zero or more arguments returning the concatenation of the zero or more result strings separated by "|". Something that's better than my ! kludge. Probably #urldecode: %7C isn't the solution, unless it works as Wiki table markup like . --&#160;Omniplex&#160;(w:t) 16:44, 22 May 2006 (UTC)


 * FYI: 6219, maybe this also hits #urlencode:. --&#160;Omniplex&#160;(w:t) 11:50, 6 June 2006 (UTC)

Character substitution/replace
Hi,

I thought about adding very simple replace function that would all to substitutin/replace string very easily (unlike the "subst"). I though about and wrote the following function: function runReplace( &$parser, $inStr = , $ReplaceFrom = , $ReplaceTo = '' ) {   	$inStr = str_replace($ReplaceFrom,$ReplaceTo,$inStr); return $inStr; }

I think that its good candidate for generic function. I used it to convert windows paths ("\\server\folder1\folder2\file name with space.doc" to file:// URI: "file:\\server\folder1\folder2\file%20name%20with%20space.doc". URL encode is not good because if replaces the "\").

Thanks, Meir :->


 * I also think that it is a useful function with a simple code. --jsimlo 13:46, 18 July 2006 (UTC)

Wikipedia
Is there any plan on implementing this StringFunctions on Wikimedia projects? Borgx 08:13, 6 June 2006 (UTC)
 * I would really find some of them useful creating templates. --81.231.179.17 11:28, 23 July 2006 (UTC)


 * So would I. Especially since we're indexing articles by the second letter at sv-wikt. //Shell 16:10, 14 October 2006 (UTC)


 * I agree. 72.139.119.165 19:03, 17 October 2006 (UTC)


 * Please implement it in wikinews, too. It can be used for automatically generated teasers. --84.178.44.163 15:06, 3 December 2006 (UTC)


 * StringFunctions would be invaluable for increasing the complexity of template calls to decrease needed work by humans on Wikipedia. I can think of a concrete example: automatic template categorization. Under the categorization system, items are by default categorized by letter in a category based on their title. This is appropriate, but the system fails to recognize that titles starting with "The" should be categorized as though their titles were "Title, The". While this is irrelevant while ordinarily titling a page, since humans will add the category, it is relevant for categories embedded within a template, since the template is not even directly edited for a single article. StringFunctions would allow an article to be automatically sorted under such a proper title, by checking only the first 4 characters for the string "The " (note the space character, added so that articles with titles like "Thespian" are not mistreated). This functionality exists under StringFunctions. I hope it is implemented throughout Wikimedia as a useful tool. Nihiltres 05:02, 17 February 2007 (UTC)


 * Is anybody going to answer the above questions? When will these functions be implemented on Wikimedia projects?
 * See 6455. Hillgentleman | 書 |2007年03月25日( Sun ), 15:02:37 15:02, 25 March 2007 (UTC)
 * Just to chime in my support, I too would find changes to respect whitespaces very useful for the exact reason that Nihiltres mentions above... using something like  where title = "The Facts of Life" leads to alphabetization under " Facts of Life" instead of "Facts of Life".  71.137.18.65 05:15, 27 March 2007 (UTC)
 * Unfortunatelly, this is (AFAIK) not possible directly in StringFunctions. The parameters are being parsed and handled by the global wiki Parser, which is also responsible for the trimming. StringFunctions receive all parameters already trimmed, unless you use &lt;nowiki> workaround. --jsimlo(talk 07:13, 27 March 2007 (UTC)

Proposals
There is still no reaction on several requests for this extension. I guess it is mainly because there is no one to handle it. Correct me, if I am wrong.. ;) Otherwise, I would like to improve the extension and satisfy some of the reasonable requests here, if there is no one against it. Please, add comments and votes for adding the proposed functions below into the main code. Thank you. --jsimlo(talk 11:36, 10 September 2006 (UTC)

#explode
A tokenizer that would split a string into pieces and return one of them, exposing the php explode function? Usage like this:  would give you the second sub-name of the current sub-sub-article. E.g.  would result in. Code:

function runExplode (&$parser, $inStr = , $inDiv = , $inPos = 0) {      $tokens = explode ($inDiv, $inStr); if (!isset ($tokens[intval ($inPos)])) return ""; return $tokens[intval ($inPos)]; }

Votes:
 * Add per nom. --jsimlo(talk 11:36, 10 September 2006 (UTC)

#replace
Replaces each occurence of a needle in the haystack, exposing the php str_replace function. Usage like this:  would replace all shashes in the current page name with colons. E.g.  would result in. Code:

function runReplace ( &$parser, $inStr = , $ReplaceFrom = , $ReplaceTo = '' )   { return str_replace ($ReplaceFrom, $ReplaceTo, $inStr); }

Votes:
 * Add per nom. --jsimlo(talk 11:36, 10 September 2006 (UTC)

Would like to have the following chunk into the replace function: if ($ReplaceForm = '') {       $ReplaceForm = ' '; } The system assumes that passing a blank character means that nothing will get passed, so with this the function can search for the space character. Unless someone can think of a better way of doing it. Denomales 03:09, 22 October 2006 (UTC)
 * Change Request
 * Fixed in ver. 1.6. --jsimlo(talk 08:48, 26 October 2006 (UTC)


 * A way to use $ReplaceTo = ' ', but preserving the possiblity of using $ReplaceTo = '', would be useful too.--Patrick 13:15, 24 January 2007 (UTC)


 * I found a way: To use a space as to-string, put it in nowiki tags.--Patrick 13:28, 24 January 2007 (UTC)

#pad
Padds input string to a specified width, aligning it to left, right or center. Usage like this:  would result in. Code:

function runStrPad (&$parser, $inStr = '', $inLen = 0, $inWith = ' ', $inDirection = 'left') {       switch (strtolower ($inDirection)) {       case 'left': default: $direction = STR_PAD_LEFT; break; case 'center': $direction = STR_PAD_BOTH; break; case 'right': $direction = STR_PAD_RIGHT; break; }       return str_pad ($inStr, intval ($inLen), $inWith, $direction); }

Votes:
 * Add per nom. --jsimlo(talk 11:36, 10 September 2006 (UTC)
 * Remove. This function is extremely abusable in DoS attacks. Ex:  --67.171.232.100 00:08, 29 October 2006 (UTC)
 * Comment. What about limiting the length value instead, like it was limited with #pos function? --jsimlo(talk 09:54, 30 October 2006 (UTC)
 * Comment. Since MediaWiki 1.8 their is a padleft and padright core function which does quite the same, I think. See Help:Magic_words. --Majoran 05:17, 30 November 2006 (UTC)
 * Remove or limit See the page - pad is O(10^n) (remember, n is the number of digits, not the value itself) and replace is O(n^2). Suggest limiting the "from" and "to" values of replace and the "delimiter" of explode to 30 characters in length (as with pos - I would support anything from 10 to 30 characters, I'd vote against anything outside this range) and the length value of pad to 99. I'd also recommend limiting "value" or "string" for everything EXCEPT len and sub to some length limit in the range of .5K - 5K, preferably 1K. Even O(n) is a DOS attack if it is easy to make n be a 100K page templated from somewhere. And finally, even with all these limits, it would be easy to make a "replace" call that returned a 30K-long string - there needs to be a further limit targeted to the output of replace (and possibly urlencode?). For ease of programming (just check limits then expose the PHP, rather than building your own function), this could be that len(to)*len(value)/len(from) can't be over twice the limit on len(value) - that is, the "to" can only be twice as long as the "from" unless you're sure that your "value" string is a fraction of the length limit. The "len" in question is byte-length, ie 2 for each standard-codepage unicode char and 1 for each ascii. --Homunq
 * Okay, then I shall update the code ASAP. --jsimlo(talk 16:08, 24 January 2007 (UTC)
 * Done.. :))) --jsimlo(talk 13:31, 30 January 2007 (UTC)

Do these functions work?
I have jsut one question.
 * Do these functions work?

I am asking because all of my tests with #sub: and #len: have failed. --70.49.162.137 20:55, 24 August 2006 (UTC)


 * Yes, if you have them installed. They are, however, not installed here: . --jsimlo 21:56, 24 August 2006 (UTC)


 * How do I install them? 70.49.117.62 23:28, 27 August 2006 (UTC)


 * My appologies, I have not realized that there is no word about the installation in the article. I have just added such section of Installation. If any troubles should occur, leave a note here. --jsimlo 12:49, 28 August 2006 (UTC)

Update for 1.8.0
I added the following to get the code from throwing errors after upgrading to MediaWiki 1.8.0, and after a quick test everything seems to be working well.--Raran 17:31, 11 October 2006 (UTC)

$wgHooks['LanguageGetMagic'][] = 'wfStringFunctionsLanguageGetMagic';

function wfStringFunctionsLanguageGetMagic( &$magicWords, $langCode ) { switch ( $langCode ) { default: $magicWords['len']		= array( 0, 'len' ); $magicWords['pos']		= array( 0, 'pos' ); $magicWords['rpos']		= array( 0, 'rpos' ); $magicWords['sub']		= array( 0, 'sub' ); $magicWords['pad']		= array( 0, 'pad' ); $magicWords['replace']	= array( 0, 'replace' ); $magicWords['explode']	= array( 0, 'explode' ); $magicWords['urlencode']	= array( 0, 'urlencode' ); $magicWords['urldecode']	= array( 0, 'urldecode' ); }   return true; }

Documentation and Testing
As far as I can tell, replace and pad are not working, or at the very least the usage info is wrong. Also, several of these functions have NO docs. Do you think someone who is using them could, at least, edit the main page with some more complete information? Thanks! --Billwsmithjr 15:44, 9 November 2006 (UTC)

trim
Maybe a #trim function could be useful, to get rid of whitespaces. (Available Workaround:  ) --Majoran 04:57, 30 November 2006 (UTC)

Doesn't work on 1.6.8, requires modification to work
This extension adds function like len and so on, but parser.php looks for #len (maybe, it was changed in later versions?). So, the modification is following:  to. At least, works fine for me.


 * This is the same problem as with ParserFunctions for MediaWiki below 1.7. See section about Installation. Most of the stuff there applies here as well, only the names are changed. I guess those workarounds should be added to the StringFunctions as well, shouldn't they? --jsimlo(talk 21:27, 4 December 2006 (UTC)


 * We could do a simple trick like

$prefix = $wgVersion < 1.7 ? '#' : ''; $wgParser->setFunctionHook ( $prefix.'len', ...
 * to make it portable. I don't know in which version wiki switched from  to   style, so put 1.7 for example.
 * As for ParserFunction, it's generally a bad idea to use PHP5-specific stuff when one doesn't really need it :-\
 * As for ParserFunction, it's generally a bad idea to use PHP5-specific stuff when one doesn't really need it :-\

PCREs
Also, I'd love to have preg_-family functions here for splitting and replacing tokens by PCREs. I can write those functions myself, just asking if there're any objections to have them in this extension.


 * I guess this will be a looong fight to it. PCREs are a bit slow. Look all over this talk page - it is all about whether something is exploitable by a DOS attack or not. --jsimlo(talk 21:27, 4 December 2006 (UTC)


 * They are not so slow :) At least, not slower than POSIX regexes. Perl itself has only regexes to manage strings. Anyway, if page is cached they're not executed; otherwise, db queries will take much more time than those functions. But I don't insist on adding them, of course.

Bugzilla: add on wikimedia
The bug requesting these to be added for the big sites (wikipedia, etc.) is 6455. If you really want these functions, go (register and) vote for that bug. If you have an objection, that bug would also be an appropriate place to express it.--201.216.136.95 18:23, 10 January 2007 (UTC)


 * Is there a public Wiki with StringFunctions installed?--Hillgentleman | 書 |2007年02月26日( Mon ), 19:53:30


 * has StringFunctions and DynamicFunctions. Polonium 21:27, 13 March 2007 (UTC)
 * explode seems not to be functioning over there. I came up with a bad substitute:  convert the string into unicode, edit the monobook so that "%" are automatically converted into "|", then cut and paste.---Hillgentleman | 書 |2007年03月22日( Thu ), 10:06:56 10:06, 22 March 2007 (UTC)