Topic on Project:Support desk

Disallowing /w/index.php? in robots.txt - will it stop crawling the whole site?

5 comments • 13:58, 7 May 2018 5 years ago

Summary by Biologically

Explains how to use robots.txt for MediaWiki sites (and others) installed in the web root directory (in this case public_html).

Biologically

If I put this in the robots.txt file:

Disallow: /w/index.php?

Will it stop crawling of the whole site? In other words, will it drop the site from Google's index?

16:27, 5 May 2018 5 years ago

Bawolff

Depends on your URL setup. Also, you need to specify a user-agent.

If your pages are served as /wiki/page_name_here or /w/index.php/page_name_here, then the answer is no.
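As a rough illustration of why: a robots.txt rule is matched as a prefix of the URL path plus query string, so Disallow: /w/index.php? only catches URLs that begin with exactly that string. The sketch below uses a hypothetical helper, is_blocked, and deliberately ignores wildcards and Allow lines, so it is a simplification rather than a full robots.txt parser.

```python
def is_blocked(url_path: str, disallow_prefixes: list[str]) -> bool:
    """Return True if url_path starts with any non-empty Disallow prefix.

    Simplified sketch: plain prefix matching only, with no '*' or '$'
    wildcards and no Allow rules. An empty Disallow value blocks nothing.
    """
    return any(p and url_path.startswith(p) for p in disallow_prefixes)


rules = ["/w/index.php?"]

# Short URLs and path-style index.php URLs do not start with the prefix.
print(is_blocked("/wiki/Main_Page", rules))         # False
print(is_blocked("/w/index.php/Main_Page", rules))  # False

# Query-string URLs (edit, history, diffs and so on) do.
print(is_blocked("/w/index.php?title=Main_Page&action=edit", rules))  # True
```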

If you really don't want the site indexed at all, you should probably just use:

User-agent: *
Disallow: /

(or, if there are other things on your domain, Disallow: /w/)

18:42, 6 May 2018 5 years ago

Biologically

Thank you so much for explaining. My site is installed in the web root folder (public_html in Apache), and my short URLs have the form http://site_name.com/all/page_name_here, where I use "all" in place of the "wiki" in your example.

I also don't want to no-index the site completely. So, although my site is in the "/" directory (the web root, public_html), I probably can't use the following (can you please confirm?):

User-agent: *
Disallow: /

as opposed to:

User-agent: *
Disallow: /w/

which most wikis use because they are installed in the /w/ directory.

So, using:

User-agent: *
Disallow: /index.php?

Would it completely block my site from being crawled?

05:04, 7 May 2018 5 years ago

Bawolff

If you just want to allow your /all/ directory, you can do something like:

User-agent: *
Disallow: /
Allow: /all/

Blocking /index.php? should block all non-normal page views (normal page views are still possible via /index.php/page_here).
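To make the difference between the two approaches concrete, here is a minimal sketch of how a crawler that follows the longest-match convention (the one Google documents) would evaluate them. The function name is_allowed and the rule-list format are assumptions made for illustration; wildcards are not handled, and crawlers that apply rules in file order instead may treat the Disallow: / plus Allow: /all/ combination differently.

```python
def is_allowed(url_path: str, rules: list[tuple[str, str]]) -> bool:
    """Evaluate ("allow" | "disallow", prefix) rules for one user-agent group.

    Sketch of longest-match precedence: the longest matching prefix wins,
    "allow" wins a tie, and a URL that matches no rule is allowed.
    Wildcards ('*', '$') are not supported here.
    """
    best_len = -1
    allowed = True
    for kind, prefix in rules:
        if not prefix or not url_path.startswith(prefix):
            continue
        is_allow = kind == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# Approach 1: block everything, then re-open only the /all/ short URLs.
allow_only_all = [("disallow", "/"), ("allow", "/all/")]
print(is_allowed("/all/Main_Page", allow_only_all))                          # True
print(is_allowed("/index.php?title=Main_Page&action=edit", allow_only_all))  # False
print(is_allowed("/index.php/Main_Page", allow_only_all))                    # False

# Approach 2: block only query-string URLs under /index.php.
block_queries = [("disallow", "/index.php?")]
print(is_allowed("/all/Main_Page", block_queries))                              # True
print(is_allowed("/index.php/Main_Page", block_queries))                        # True
print(is_allowed("/index.php?title=Main_Page&action=history", block_queries))   # False
```

Under these assumptions, the first approach leaves only /all/ crawlable, while the second leaves everything crawlable except URLs that begin with /index.php?.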

13:41, 7 May 2018 5 years ago

Biologically

Thank you.

13:58, 7 May 2018 5 years ago