Topic on Project:Support desk

How to allow crawling of my mediawiki pages

DksDev064 (talkcontribs)

MediaWiki 1.24.1
PHP 5.6.5 (cgi-fcgi)
MySQL 5.6.22-log

The wiki is set up and working just fine, with a few extensions for our company (WYSIWYG, LDAP authentication, and ApprovedRevs), but we would also like to incorporate the wiki pages into our intranet search. We use SharePoint Server 2010 and are trying to crawl using FAST Search for SharePoint, but it keeps erroring out, saying that the pages have the NOINDEX meta tag.

I have looked into changing $wgNamespaceRobotPolicies to a blank array, as well as setting $wgDefaultRobotPolicy to 'index, follow' (although that may be the default), but I must admit I am not certain what these two option changes even do.

Any advice or suggestions on how to allow this wiki to be crawled would be appreciated.
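
For reference, the two settings mentioned above control the robots meta tag that MediaWiki writes into each page's <head>: $wgDefaultRobotPolicy is the wiki-wide default, and $wgNamespaceRobotPolicies overrides it per namespace. A minimal sketch of how they might look in LocalSettings.php follows; the commented-out namespace override is only an illustration and is not taken from this wiki's configuration. These settings only change the meta tag, so they do not help if the crawler cannot read the page in the first place.

```php
// LocalSettings.php (sketch)

// Wiki-wide robots policy; 'index,follow' is also MediaWiki's default.
$wgDefaultRobotPolicy = 'index,follow';

// Per-namespace overrides of the default. An empty array means
// "no overrides", so every namespace uses $wgDefaultRobotPolicy.
$wgNamespaceRobotPolicies = array();

// Illustrative override only: keep talk pages out of the index.
// $wgNamespaceRobotPolicies = array( NS_TALK => 'noindex' );
```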

Ciencia Al Poder (talkcontribs)

Have you inspected the generated HTML, connecting anonymously, to see if it actually outputs NOINDEX meta tags? Maybe your MediaWiki is configured to not allow read access to users who are not logged in, and the crawler just sees error pages about not being logged in.
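
One way to run the check described above without a browser session is a small script that fetches a page without cookies and reports the robots meta tag it finds. This is only a sketch: findRobotsMeta is a hypothetical helper name, the URL in the usage comment is a placeholder, and fetching with file_get_contents assumes allow_url_fopen is enabled on the machine running it.

```php
<?php
// Return the content of the first <meta name="robots"> tag in $html,
// or null if the page has no such tag.
function findRobotsMeta($html) {
    $doc = new DOMDocument();

    // Real-world HTML is rarely strictly valid; keep parser warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // The name attribute is matched case-insensitively.
        if (strcasecmp($meta->getAttribute('name'), 'robots') === 0) {
            return $meta->getAttribute('content');
        }
    }
    return null;
}

// Usage (placeholder URL; the request carries no login cookies,
// so it sees what an anonymous crawler would see):
// $html = file_get_contents('http://wiki.example.internal/index.php?title=Some_Page');
// var_dump(findRobotsMeta($html));
```

If the result is a noindex value for a page that should be public, it is worth looking at the page title and body as well, since the crawler may be receiving a login-required page rather than the article.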

DksDev064 (talkcontribs)

The HTML meta tag does not have NOINDEX, but elsewhere in the source, there is a section called mw.config.set where both "INDEX" and "NOINDEX" are listed among other things like "TOC" and "NOTOC" -- is this just a list of available magic words?

I have these settings enabled

$wgGroupPermissions['*']['edit'] = false;
$wgGroupPermissions['user']['edit'] = true;
$wgGroupPermissions['*']['createaccount'] = false;
$wgWhitelistRead = array( "Main Page", "Special:Userlogin", "-", "MediaWiki:Monobook.css" );
$wgGroupPermissions['*']['read'] = false;

TheDJ (talkcontribs)

"is this just a list of available magic words?" - Yes

SharePoint is so stupid, I wouldn't be surprised if that is actually the noindex that SharePoint is complaining about, btw. That should be easy to test by setting up a static HTML page with just the word NOINDEX somewhere in the <head> of the page.

DksDev064 (talkcontribs)

We set up a test page, took out the noindex there, and it crawled. We are still getting an error of this nature: 'The URL was permanently moved. ( URL redirected to ... )', but the noindex issue does seem to be what you described. I am not sure how to resolve the permanently moved issue now, because that should be a case-sensitivity problem, and I am following proper procedure as far as I know.

Ciencia Al Poder (talkcontribs)

I'd say the problem is that your site can't be read anonymously, as I said in my previous message:

 $wgGroupPermissions['*']['read'] = false;

DksDev064 (talkcontribs)

Changing this to "true" does not seem to have fixed my problem, but then again I seem to be having more than one problem (see above).

Planetenxin (talkcontribs)

Have you been able to solve your problem? In SharePoint 2010 you need to define a crawl rule with the "match case" option enabled. I'm having an issue with SharePoint 2013 at the moment: it is not crawling my MediaWiki. In SharePoint 2013, MS has removed the "match case" option from crawl rules...
