Purpose

This alert detects MySQL daily backup failures. It is triggered at the MySQL instance level.

Runners

Two runners work together to trigger the ICM incidents:

1. The passive runner "MySqlBackupFaiureAlert"

Code path:  $/SQL Azure/Proactive Analytics/Dev-OnPrem/Src/MdsRunners/MdsRunners/Runners/ManagedBackup/MySqlBackupFaiureAlert.cs

Function: This runner queries the critical error log for backup failures and creates new health properties when MySQL backup failures are found.

2. The passive runner "OSSDBMSBackUpRouting"

Code path: $/SQL Azure/Proactive Analytics/Dev-OnPrem/Src/MdsRunners/MdsRunners/Runners/ManagedBackup/OSSDBMSBackUpRouting.cs

Function: This runner creates an ICM incident if there are active health properties for MySQL backup failures.

Query Run By the Passive Runner

AlrRdmsCriticalEvent
| where Text contains "Failed to take daily backup in the planned hour"
| where AppTypeName == "Worker.PAL.MySQL"
| project TIMESTAMP, LogicalServerName, AppTypeName, AppName, ClusterName, NodeName, Text
| summarize by LogicalServerName, AppTypeName, AppName, ClusterName, NodeName, Text
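
To make the effect of the query concrete, here is a minimal Python sketch of the same filter-and-deduplicate logic applied to in-memory rows. The row dictionaries and the function name `distinct_backup_failures` are illustrative assumptions; the runner itself executes the Kusto query, not this code.

```python
# Columns that the query's `summarize by` clause groups on.
KEY_COLUMNS = ("LogicalServerName", "AppTypeName", "AppName",
               "ClusterName", "NodeName", "Text")

FAILURE_TEXT = "Failed to take daily backup in the planned hour"


def distinct_backup_failures(rows):
    """Return one key tuple per distinct failing instance, in first-seen order.

    `rows` is an iterable of dicts keyed by column name (hypothetical shape).
    """
    seen = set()
    result = []
    for row in rows:
        # Kusto `contains` is case-insensitive, so compare in lower case.
        if FAILURE_TEXT.lower() not in row.get("Text", "").lower():
            continue
        if row.get("AppTypeName") != "Worker.PAL.MySQL":
            continue
        # TIMESTAMP is deliberately left out of the key, as in `summarize by`.
        key = tuple(row.get(col) for col in KEY_COLUMNS)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result
```

Because TIMESTAMP is not part of the grouping key, repeated failures logged for the same instance with identical text collapse into a single row.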

Mitigation

'''A known issue with the MariaDB PITR runner test is currently being mitigated. If the incident was triggered by an elastic server whose name matches the pattern "elasticserver-mariadb-b-**-pitr-src- -**-pitr", assign it to Shuode Li (shuodl). If the name matches the pattern "elasticserver-mysql-b-**-pitr-src- -**-pitr", transfer it to the MySQL DRI team.'''
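
The routing rule in the note above can be sketched as a small Python helper. This is a simplification under stated assumptions: it only checks the engine prefix and the trailing "-pitr" of the server name and does not validate the wildcard middle of the pattern. The function name `route_pitr_incident` and the returned strings are hypothetical.

```python
def route_pitr_incident(server_name):
    """Suggest an owner for an incident based on the elastic server name.

    Simplifying assumption: only the engine prefix and the "-pitr" suffix
    are checked; the wildcard middle of the name is not validated.
    """
    name = server_name.lower()
    if not name.endswith("-pitr"):
        return "general mitigation"
    if name.startswith("elasticserver-mariadb-b-"):
        return "assign to Shuode Li (shuodl)"
    if name.startswith("elasticserver-mysql-b-"):
        return "transfer to MySQL DRI team"
    return "general mitigation"
```

Any server name that does not match either prefix falls through to the general mitigation steps.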

General mitigation steps:
The MySQL daily backup currently locks all databases while it runs, so the customer's service can be impacted during the backup.

Step 1: Check the exception raised during the daily backup in the MDS table MonRdmsInstanceAgent.

Sample query to find the backup logs:

https://sqladhoc.kusto.windows.net:443/sqlazure1

MonRdmsInstanceAgent
| extend Message = message_systemmetadata
| where TIMESTAMP > ago(1d)
| where AppTypeName == "Worker.PAL.MySQL" and LogicalServerName == "mysqlfailovertest"
| where Message contains "Backup" or Message contains "backup"

Step 2: If the customer's service is unhealthy due to backup, restart the MySQL service (SOP0007: RESTARTING A SERVER USING CAS COMMANDS).

Mitigation steps for Special Scenario 1:
If the backup error message looks like the following:

[MySqlBackupActor(ccf67ce6c97c)].TakeDailyBackupAsync: Failed to take backup once with excption. Instance Name : ccf67ce6c97c, Backup Time: 11/18/2016 01:56:20, Exception: System.NullReferenceException: Object reference not set to an instance of an object.

at Microsoft.MySqlaaS.InstanceAgent.SqlModel.BackupModel.<>c__DisplayClass6_0.b__0(SnapshotEntity s) in d:\_work\13\s\src\App\Worker.MySQL\MySqlaaS.InstanceAgent\SqlModel\BackupModel.cs:line 162

at System.Linq.Enumerable.WhereListIterator`1.MoveNext

at System.Linq.Enumerable.Count[TSource](IEnumerable`1 source)

at Microsoft.MySqlaaS.InstanceAgent.SqlModel.BackupModel.d__6.MoveNext in d:\_work\13\s\src\App\Worker.MySQL\MySqlaaS.InstanceAgent\SqlModel\BackupModel.cs:line 162

--- End of stack trace from previous location where exception was thrown ---

at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)

at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)

at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult

at Microsoft.MySqlaaS.InstanceAgent.SqlInstanceManagement.MySqlBackupActor.d__14.MoveNext in d:\_work\13\s\src\App\Worker.MySQL\MySqlaaS.InstanceAgent\SqlInstanceManagement\MySqlBackupActor.cs:line 139.

The mitigation steps:
Step 1: Access the MySQL storage blob container for this service by following TSG SOP0014: How to access the MySQL data files.

Step 2: Find the error snapshot zip files in this blob container.

Note:

a. The snapshot zip files should be under the folder "backup".

b. Check the snapshot zip file's metadata. If it has no "Snapshots" metadata, treat it as an error snapshot.

Step 3: Delete this error snapshot.
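
The check in Steps 2 and 3 can be sketched in Python as a filter over a blob listing. This is only an illustration: it assumes the listing has already been retrieved as (name, metadata) pairs, that "Snapshots" is a metadata key, and that folder membership shows up as a "backup/" name prefix. The function name `find_error_snapshots` is hypothetical and no storage SDK is called.

```python
def find_error_snapshots(blobs):
    """Return names of zip blobs under "backup/" that lack "Snapshots" metadata.

    `blobs` is an iterable of (name, metadata) pairs, where `metadata` is a
    dict of blob metadata or None. This shape is a hypothetical stand-in for
    whatever the storage listing actually returns.
    """
    suspects = []
    for name, metadata in blobs:
        if not name.startswith("backup/"):
            continue
        if not name.lower().endswith(".zip"):
            continue
        # A missing or empty metadata dict also counts as "no Snapshots".
        if "Snapshots" not in (metadata or {}):
            suspects.append(name)
    return suspects
```

The helper only lists candidates. Review each one manually before deleting it.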

Mitigation steps for Special Scenario 2:
If the backup error message looks like the following:

[AzureShareFilesCompressModel(AzureFileShareModel(MySqlBackupRestoreModel(f2bacb532053)))].Compress: Zip action error.

System.AggregateException: One or more errors occurred. ---> System.AggregateException: One or more errors occurred. ---> System.InvalidOperationException: The file mysql/firewall_rules.CSV.Copy modified after 03/01/2017 01:57:20 +00:00

at Microsoft.MySqlaaS.Common.Storage.AzureShareFilesCompressModel.d__16.MoveNext in D:\_work.0\6\s\src\App\Worker.MySQL\MySqlaaS.Common\Storage\AzureShareFilesCompressModel.cs:line 262

--- End of stack trace from previous location where exception was thrown ---

at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)

at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)

at Microsoft.MySqlaaS.Common.Storage.AzureShareFilesCompressModel.d__15.MoveNext in D:\_work.0\6\s\src\App\Worker.MySQL\MySqlaaS.Common\Storage\AzureShareFilesCompressModel.cs:line 245

--- End of inner exception stack trace ---

--- End of inner exception stack trace ---

<---

This happens because the backup actor finds that the firewall_rules.CSV.Copy file changed while it was building the zip package. It is a temporary file used by the socket duplicator and is recreated frequently. The file should be zipped, and this failure is a bug in the MySQL backup actor. The fix has been submitted but is not deployed yet.

This will not cause customer impact.

Until the fix is deployed, ignore this error and wait for the next backup iteration.
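
As a triage aid, the two special scenarios can be told apart by substrings taken from the sample error messages in this guide. The Python sketch below is a heuristic under that assumption. The function name `classify_backup_error` and the returned labels are hypothetical, and real messages may differ from the samples.

```python
def classify_backup_error(message):
    """Map a backup log message to the mitigation section it most likely matches."""
    # Scenario 1: NullReferenceException thrown from BackupModel.
    if ("System.NullReferenceException" in message
            and "BackupModel" in message):
        return "Special Scenario 1"
    # Scenario 2: firewall_rules.CSV.Copy changed during the zip step.
    if ("firewall_rules.CSV.Copy" in message
            and "modified after" in message):
        return "Special Scenario 2"
    return "General mitigation"
```

Anything that matches neither signature should go through the general mitigation steps.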