Fundraising tech/Failmail zoo

A guide to the species of failmail encountered in our systems

Fail Mail (civi1001) run-job: XXX failed with code 1
This phylum of failmail is generated by the process-control script. The subject line indicates which job failed, and with which exit code, but for any other details you will have to log on to civi1001 and read the log indicated in the body of the failmail.

Get recipient data from Silverpop failed with code 1
This happens fairly often with error message 'Login Failed'. Should we be retrying logins? Or is this usually during Silverpop's planned maintenance?

Fredge Multiqueue Consumer failed with code 1
This indicates either a DB access problem, or a coding error on our part. Look in the log referenced in the failmail. If it's a DB access issue, check with FR-ops and turn off the fredge queue consumer if necessary.

If it's an issue with one of the messages, like a badly formatted or missing field, determine the source of the message from the source_ fields and fix the offending component.

Aborting! Previous process (XXX) is still alive. Remove lockfile manually if in error: /tmp/ .lock
This indicates that a job has run longer than expected. When process-control starts a job and finds that the previous installment of the job is still running (as indicated by the lockfile still being pressent), it aborts the next run and sends this email. This email can be suppressed by adding "allow-overtime: true" to the job's yaml configuration file. When allow-overtime is true, the next job still won't run, but process-control won't send this email.

Error validating Amazon message. Firewall problem?
Our IPN listener receives messages from Amazon Simple Notification Services when donors send an Amazon Pay donation. Each notification contains a URL for a certificate, and part of the Amazon Pay SDK's logic retrieves that cert and uses it to check the signature. This failmail means something went wrong getting the cert.

There is an IP address in this failmail, but it's not the one we're trying to connect to and failing. The IP logged for all Amazon SNS messages is the SOURCE of the notification, not the IP address we try to contact to get the cert. To find the offending IP address, you will have to ssh into thulium / frpig1001 and try to fetch the cert from the URL mentioned in the log.

For example:

Fundraising ops will have to make sure that IP address is whitelisted in IPtables on thulium / frpig1001.

FIXME: this species tends to arrive in herds of thousands, mercilessly trampling inboxes until someone mangles the SmashPig config files to disable failmail

INVALID_MESSAGE Cannot create contribution, civi error!
This is usually due to a DB lock issue. Our queue consumer creates the contact and their contribution within a single transaction involving upwards of a dozen tables. Depending on what else is happening in Civi, this can result in a deadlock and the contact insert can fail. In this case, the queue message is generally shunted to the 'damaged' table. Follow the link at the bottom of the failmail and click 'resend' at the bottom of the damaged message UI to try the insert again.

MISSING_PREDECESSOR Message -XXXXXX indicates a pending DB entry with order ID YYYY, but none was found
Amazon and Astropay send IPN messages indicating payment completion, but without much donor info. The IPN listeners forward these messages along to the donations queue with a key 'completion_message_id' indicating how to find the rest of the information in the pending DB. The donations queue consumer looks for this row in the pending table, combines the pending info with the IPN message, and deletes the pending row.

Only sometimes, the pending row isn't found. Maybe it was deleted early? Maybe the donation came in from outer space (or Alexa) and there never was a pending row? Maybe the pending queue consumer is turned off? When this happens, the message is kicked to the damaged table with a requeue date. After enough retries, it's left on the damaged table with no requeue date and the queue consumer sends this email.

TODO: instead of the donations queue consumer doing this recombination, it could happen in a job on a new jobs-amazon queue. That job could fall back to an Amazon API call if it doesn't find the info in the pending table (https://phabricator.wikimedia.org/T194317).

BAD_AUDIT_LINE (paypal-audit)
Usually due to a missing subscr_id. This is a thing that randomly breaks every few months in the audit-file-producing machinery at PayPal, and confounds our process because we use the subscr_id to associate the donation with the parent subscription.

To deal with it:
 * 1) Email our tech contact at PayPal asking them to re-generate the file (filename should be included
 * 2) Use the one_off process-control job to delete the file from the /var/spool/audit/paypal/incoming or /var/spool/audit/paypal/completed directory
 * 3) The next nightly audit download & parse job should get the fixed file from PayPal

TODO: email once per file (or per parser run) with a summary of errors. Once per error is too darn much!