Topic on User talk:Ammarpad/Flow archive

How to Exclude NoIndex Pages From Sitemap? The noindex-tag is added via $wgNamespaceRobotPolicies or $wgDefaultRobotPolicies in LocalSettings.php (NOT via the on-page magic word: __NOINDEX__)

Summary by MarkAHershberger
Goodman Andrew (talkcontribs)

The following patch for the maintenance script generateSitemap.php, from https://gerrit.wikimedia.org/r/c/620746, works only for the behavior switch magic word (__NOINDEX__); it does not remove from the sitemap pages that are marked 'noindex' via LocalSettings.php.

I think there must be a solution to this, because otherwise Wikipedia would have a problem excluding talk pages from its sitemap, and I don't think it does: https://en.wikipedia.org/wiki/Wikipedia:Controlling_search_engine_indexing

Now, the wiki in question is noindex by default, because of $wgDefaultRobotPolicies = true; in LocalSettings.php. Pages that are to be indexed have {{INDEX}} added to them. The desired solution is therefore to generate a sitemap only for pages that contain __INDEX__ or {{INDEX}}, or whose HTML output indicates 'index'.
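
To make the desired behaviour concrete, here is a minimal sketch of the decision the script would have to make for each page. It is only an illustration under stated assumptions: shouldIncludeInSitemap() is a hypothetical helper, not a MediaWiki function; the policies are assumed to be strings such as 'noindex,nofollow'; and the on-page switches are assumed to be stored in page_props as 'index' and 'noindex'. Other settings that may influence the robot policy are ignored.

```php
<?php
// Hypothetical helper, not part of MediaWiki: decides whether a page belongs
// in the sitemap. $defaultPolicy and $namespacePolicies stand in for the
// values configured in LocalSettings.php (assumed to be strings such as
// 'noindex,nofollow'); $pageProp is the page's page_props entry, if any.
function shouldIncludeInSitemap(
	int $namespace,
	?string $pageProp,
	string $defaultPolicy,
	array $namespacePolicies
): bool {
	// A namespace-specific policy, when present, replaces the wiki-wide default.
	$policy = $namespacePolicies[$namespace] ?? $defaultPolicy;

	// Treat the policy as a comma-separated list and look for the 'noindex' token.
	$tokens = array_map( 'trim', explode( ',', strtolower( $policy ) ) );
	$indexable = !in_array( 'noindex', $tokens, true );

	// Assumption: an on-page __INDEX__ or __NOINDEX__ switch overrides the
	// configured policy for that page.
	if ( $pageProp === 'index' ) {
		$indexable = true;
	} elseif ( $pageProp === 'noindex' ) {
		$indexable = false;
	}

	return $indexable;
}
```

On a wiki that is noindex by default, shouldIncludeInSitemap( 0, 'index', 'noindex,nofollow', [] ) keeps the page, while the same call with null in place of 'index' drops it. Note that the join in the patch below matches only pp_propname = 'noindex', so it would also need to fetch the 'index' property before a check like this could be applied.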

diff --git a/maintenance/generateSitemap.php b/maintenance/generateSitemap.php
index 6060567..bc5e865 100644
--- a/maintenance/generateSitemap.php
+++ b/maintenance/generateSitemap.php

@@ -305,15 +305,27 @@
 	 * @return IResultWrapper
 	 */
 	private function getPageRes( $namespace ) {
-		return $this->dbr->select( 'page',
+		return $this->dbr->select(
+			[ 'page', 'page_props' ],
 			[
 				'page_namespace',
 				'page_title',
 				'page_touched',
-				'page_is_redirect'
+				'page_is_redirect',
+				'pp_propname',
 			],
 			[ 'page_namespace' => $namespace ],
-			__METHOD__
+			__METHOD__,
+			[],
+			[
+				'page_props' => [
+					'LEFT JOIN',
+					[
+						'page_id = pp_page',
+						'pp_propname' => 'noindex'
+					]
+				]
+			]
 		);
 	}
 
@@ -335,7 +347,13 @@
 			$fns = $contLang->getFormattedNsText( $namespace );
 			$this->output( "$namespace ($fns)\n" );
 			$skippedRedirects = 0; // Number of redirects skipped for that namespace
+			$skippedNoindex = 0; // Number of pages with __NOINDEX__ switch for that NS
 			foreach ( $res as $row ) {
+				if ( $row->pp_propname === 'noindex' ) {
+					$skippedNoindex++;
+					continue;
+				}
+
 				if ( $this->skipRedirects && $row->page_is_redirect ) {
 					$skippedRedirects++;
 					continue;
@@ -380,6 +398,10 @@
 				}
 			}
 
+			if ( $skippedNoindex > 0 ) {
+				$this->output( "  skipped $skippedNoindex page(s) with __NOINDEX__ switch\n" );
+			}
+
 			if ( $this->skipRedirects && $skippedRedirects > 0 ) {
 				$this->output( "  skipped $skippedRedirects redirect(s)\n" );
 			}
MarkAHershberger (talkcontribs)

Phabricator is a more appropriate place for this discussion.

Goodman Andrew (talkcontribs)

@MarkAHershberger: I don't know how to use Phabricator. Could you please help me transfer this thread there?

Ammarpad (talkcontribs)
Goodman Andrew (talkcontribs)