Fundraising tech/Failmail zoo

A guide to the species of failmail encountered in our systems

Fail Mail (civi1001) run-job: XXX failed with code 1
This phylum of failmail is generated by the process-control script. The subject line indicates which job failed, and with which exit code, but for any other details you will have to log on to civi1001 and read the log indicated in the body of the failmail.

If the nightly Silverpop EXPORT jobs fail, this could lead to bulk email complaints. Someone should log into the Civi box and run run-job silverpop_daily (if building the export files failed) or run-job silverpop_emails_upload_files (if that job failed).

Get recipient data from Silverpop failed with code 1
This happens fairly often with error message 'Login Failed'. Should we be retrying logins? Or is this usually during Silverpop's planned maintenance?

Fredge Multiqueue Consumer failed with code 1
This indicates either a DB access problem, or a coding error on our part. Look in the log referenced in the failmail. If it's a DB access issue, check with FR-ops and turn off the fredge queue consumer if necessary.

If it's an issue with one of the messages, like a badly formatted or missing field, determine the source of the message from the source_ fields and fix the offending component.

Aborting! Previous process (XXX) is still alive. Remove lockfile manually if in error: /tmp/ .lock
This indicates that a job has run longer than expected. When process-control starts a job and finds that the previous run of the job is still going (as indicated by the lockfile still being present), it aborts the new run and sends this email. This email can be suppressed by adding "allow-overtime: true" to the job's yaml configuration file. When allow-overtime is true, the new run still won't start, but process-control won't send this email.
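
The decision described above can be sketched roughly as follows. This is not the process-control source: the function names, the idea that the lockfile holds the previous run's PID, and the treatment of a stale lockfile as safe to start over are all assumptions made for illustration.

```python
import os


def read_lock_pid(lockfile):
    """Return the PID stored in the lockfile, or None if there is no usable lock.

    Assumption: the lockfile contains the PID of the run that created it.
    """
    try:
        with open(lockfile) as handle:
            return int(handle.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def pid_alive(pid):
    """Check whether a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def should_start(lockfile, allow_overtime=False):
    """Return (start_job, send_failmail) for a job guarded by a lockfile."""
    pid = read_lock_pid(lockfile)
    if pid is None or not pid_alive(pid):
        # Assumption: a missing or stale lock does not block the new run.
        return True, False
    # The previous run is still alive: never start, and only stay quiet
    # when allow-overtime is set for the job.
    return False, not allow_overtime
```

Note that in both overtime cases the first element is False: allow-overtime only changes whether the email is sent, not whether the new run starts.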

Error validating Amazon message. Firewall problem?
Our IPN listener receives messages from Amazon Simple Notification Services when donors send an Amazon Pay donation. Each notification contains a URL for a certificate, and part of the Amazon Pay SDK's logic retrieves that cert and uses it to check the signature. This failmail means something went wrong getting the cert.

There is an IP address in this failmail, but it's not the one we're trying to connect to and failing. The IP logged for all Amazon SNS messages is the SOURCE of the notification, not the IP address we try to contact to get the cert. To find the offending IP address, you will have to ssh into thulium / frpig1001 and try to fetch the cert from the URL mentioned in the log.

For example:
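
A minimal sketch in Python, assuming the cert URL has been copied out of the log. The function name is hypothetical, and the URL in the usage comment is a placeholder rather than a real log entry. It only resolves the URL's hostname to the addresses we would try to contact; it does not fetch the cert.

```python
import socket
from urllib.parse import urlparse


def cert_host_addresses(cert_url):
    """Return (hostname, sorted IP addresses) for the host serving the cert."""
    parsed = urlparse(cert_url)
    host = parsed.hostname
    port = parsed.port or 443  # cert URLs are assumed to be https
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    # Each entry's sockaddr starts with the IP address as a string.
    return host, sorted({info[4][0] for info in infos})


# Usage, with the URL taken from the log:
#   host, addresses = cert_host_addresses("https://<host-from-log>/<cert>.pem")
#   print(host, addresses)
```

The addresses can change over time, so it is worth running this from the box that is failing, at roughly the time the failmail arrives.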

Fundraising ops will have to make sure that IP address is whitelisted in IPtables on thulium / frpig1001.

FIXME: this species tends to arrive in herds of thousands, mercilessly trampling inboxes until someone mangles the SmashPig config files to disable failmail

INVALID_MESSAGE Cannot create contribution, civi error!
This is usually due to a DB lock issue. Our queue consumer creates the contact and their contribution within a single transaction involving upwards of a dozen tables. Depending on what else is happening in Civi, this can result in a deadlock and the contact insert can fail. In this case, the queue message is generally shunted to the 'damaged' table. Follow the link at the bottom of the failmail and click 'resend' at the bottom of the damaged message UI to try the insert again.

MISSING_PREDECESSOR Message -XXXXXX indicates a pending DB entry with order ID YYYY, but none was found
Amazon and Astropay send IPN messages indicating payment completion, but without much donor info. The IPN listeners forward these messages along to the donations queue with a key 'completion_message_id' indicating how to find the rest of the information in the pending DB. The donations queue consumer looks for this row in the pending table, combines the pending info with the IPN message, and deletes the pending row.

Only sometimes, the pending row isn't found. Maybe it was deleted early? Maybe the donation came in from outer space (or Alexa) and there never was a pending row? Maybe the pending queue consumer is turned off? When this happens, the message is kicked to the damaged table with a requeue date. After enough retries, it's left on the damaged table with no requeue date and the queue consumer sends this email.
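
The flow above can be sketched roughly like this. It is an illustration under stated assumptions, not the queue consumer's code: the pending table is modelled as a dict keyed by order ID, the damaged table as a list, and the names consume_completion, order_id, retries and max_retries are all hypothetical.

```python
def consume_completion(message, pending, damaged, max_retries=3):
    """Combine an IPN completion message with its pending row.

    Returns a (status, combined_message) pair, where status is one of
    "combined", "requeue" or "failmail".
    """
    row = pending.pop(message["order_id"], None)  # a found row is deleted
    if row is not None:
        combined = dict(row)
        combined.update(message)  # IPN fields take precedence over pending ones
        return "combined", combined

    retries = message.get("retries", 0)
    if retries < max_retries:
        # No pending row yet: park the message with a requeue date.
        retried = dict(message, retries=retries + 1)
        damaged.append({"message": retried, "requeue": True})
        return "requeue", None

    # Out of retries: leave it on the damaged table and send the failmail.
    damaged.append({"message": dict(message), "requeue": False})
    return "failmail", None
```

In this sketch the failmail only goes out once the retries are used up, which is why a single email usually means the pending row has been missing for a while.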

TODO: instead of the donations queue consumer doing this recombination, it could happen in a job on a new jobs-amazon queue. That job could fall back to an Amazon API call if it doesn't find the info in the pending table (https://phabricator.wikimedia.org/T194317).

BAD_AUDIT_LINE (paypal-audit)
Usually due to a missing subscr_id. This is a thing that randomly breaks every few months in the audit-file-producing machinery at PayPal, and confounds our process because we use the subscr_id to associate the donation with the parent subscription.

To deal with it:
 * 1) Email our tech contact at PayPal asking them to re-generate the file (filename should be included)
 * 2) Use the one_off process-control job to delete the file from the /var/spool/audit/paypal/incoming or /var/spool/audit/paypal/completed directory
 * 3) The next nightly audit download & parse job should get the fixed file from PayPal

TODO: email once per file (or per parser run) with a summary of errors. Once per error is too darn much!

CIVI_REQ_FIELD Positive amount required
If the gateway is PayPal, probably someone is messing with our old account and sending donations below the minimum we should be allowing. Email our tech contact there and ask them to reinstate the block on sub-minimum donations. Long term plan: when we're totally happy with Express Checkout, turn off the old account.

INVALID_RECURRING Subscription account does not exist
Currently always PayPal-related. The donation has come in with a subscr_id value that doesn't correspond to any trxn_id in the civicrm_contribution_recur table. For PayPal, we create civicrm_contribution_recur records when we get a specific type of IPN message or audit line, so maybe that failed? Look at earlier damaged messages to see if there's a subscr_start one with the missing trxn_id that you can re-send.

Check forget me processing failed with code 1
A GDPR data erasure job has failed at Acoustic (formerly known as Silverpop). When someone makes a forgetme request from the Civi UI, Civi sends Acoustic a GDPR erasure request for that email address for each of the lists we maintain there. Acoustic queues the requests and returns job IDs. We store those in Civi as OmnimailJobProgress entities. The omnimail_check_progress process-control job looks at this table and queries Acoustic to see which of the jobs have finished. When an Acoustic job doesn't finish after a certain amount of time, the job sends failmail.

To check on which address triggered the failing job, use the API4 explorer to check on the jobs at /civicrm/api4#/explorer/OmnimailJobProgress/get. The job identifier should indicate the email address, and the retrieval parameters should include the list ID at Acoustic. Log into Acoustic and search for that email address in the indicated list. If the email address is not in the list, you can just delete the OmnimailJobProgress entity, again using the API4 explorer. See T244811.

EMAIL_SYSTEM_FAILURE: civi1001 (thank_you)
This one mostly pops up when we try to send an email to an invalid email address. It also happens for addresses that start with a hyphen (-). Such addresses are valid, e.g. -dashatstart@example.com, but postfix rejects them when running the default config (see https://serverfault.com/questions/893648/postfix-bad-recipient-address-syntax-email-starting-with-dash-hyphen/893657#893657). More discussion is needed on whether we want to change our setup to support this.
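
When triaging one of these, a quick check like the following can tell the leading-hyphen case apart from a genuinely invalid address. The function name is hypothetical and this is only a sketch; it is not part of the thank_you job.

```python
def starts_with_hyphen(address):
    """Return True if the local part of the address begins with a hyphen."""
    local, at_sign, _domain = address.rpartition("@")
    # Without an "@" there is no local part to inspect.
    return bool(at_sign) and local.startswith("-")
```

For the example above, starts_with_hyphen("-dashatstart@example.com") is True, so the address is valid and the rejection comes from the postfix config.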