Automated deadlock detection and mitigation

    Erth Paradine

    Server Admin & Bug Reporter
    Deadlocks are really irritating bugs, and it looks like the latest update is causing a number of them, so I figured this script might come in handy for other server operators.

    The script is fairly basic: it searches a jstack report for entries suggesting a deadlock and, if it finds one, kills the server process and tarballs the server logs along with the jstack and jmap dumps. It also reports its actions to syslog and can optionally send an email.
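
    As a rough illustration of the detection step, here is a minimal sketch that runs the same grep pattern against a sample thread dump. The helper name has_deadlock and the sample excerpt are assumptions made up for this example; the excerpt is only meant to resemble what jstack prints when it finds a deadlock.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): succeeds if the
# given jstack report contains a line announcing a Java-level deadlock.
has_deadlock() {
  grep -q "Found.*Java-level deadlock" "$1"
}

# Sample excerpt, assumed to resemble real jstack output for a deadlock.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Found one Java-level deadlock:
=============================
"Thread-1":
  waiting to lock monitor 0x00007f0a1c003f08 (object 0x00000000d6e0a1b0, a java.lang.Object),
  which is held by "Thread-0"
EOF

if has_deadlock "$sample"; then
  echo "deadlock detected"
else
  echo "no deadlock"
fi

rm -f "$sample"
```

    Using grep -q keeps the check to a simple exit status, so there is no need to count matching lines.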

    We've been running this in production since April 2016 and so far have seen zero false alarms. It has proven quite handy both for gathering reports for Schine to debug further and for quickly recovering from a deadlock without an admin/operator needing to touch anything.

    Code:
    #!/bin/bash
    #
    # Server deadlocked, grab jstack, logs, and package them for sending to Schine.
    #
    # License: CC0
    #
    # Warranty: no warranties, no promise of support.
    #
    # Dependencies:
    #  CentOS7
    #  Functional SSMTP install.
    #
    # To install: save this entire file as /usr/local/bin/restartCrashReportStarmade.sh,
    # make it executable, and then add the following to crontab:
    #   */5 * * * * /usr/local/bin/restartCrashReportStarmade.sh
    #
    # CHANGELOG:
    # 18APR2016 Created initial script
    # 20JUL2016 Removed attempts at clean shutdown of deadlocked instance
    # 21SEP2016 Added PAFKA syslog hooks
    #
    # TODO:
    # Does not support spaces in file/folder names. Maybe look into this someday.
    #
    
    # Put your preferred values in the following fields.
    emailAddressRecipient=""
    emailAddressSender="[email protected]"
    pathSaveReports="/path/to/save/reports/"
    pathTemp="/path/to/temp/folder/"
    pathSMLogs="/path/to/starmade/logs/"
    timestamp=$(date +%s)
    
    # Capture a memory map and a thread dump from the running server JVM
    jmap $(pgrep -f "StarMade.jar") > ${pathTemp}jmap-${timestamp}.txt
    sleep 2
    jstack $(pgrep -f "StarMade.jar") > ${pathTemp}jstack-${timestamp}.txt
    
    # Test for and auto-mitigate deadlocks
    # There's no point in attempting clean shutdown, server is non-responsive w/ deadlock
    if grep -q "Found.*Java-level deadlock" ${pathTemp}jstack-${timestamp}.txt; then
      kill -9 $(pgrep -f "StarMade.jar")
      sleep 10
      tar -czf ${pathSaveReports}deadlockDump-${timestamp}.tgz ${pathTemp} ${pathSMLogs}
      fileReportTarball=$(md5sum ${pathSaveReports}deadlockDump-${timestamp}.tgz | cut -d " " -f 1)
      mv ${pathSaveReports}deadlockDump-${timestamp}.tgz ${pathSaveReports}${fileReportTarball}.tgz
      logger -t SMPrd "[PAFKA] Deadlock detected, forced unclean restart. Tarball at ${pathSaveReports}${fileReportTarball}.tgz"
      if [ ! -z "$emailAddressRecipient" ]; then
        printf "From:<${emailAddressSender}>\nSubject: StarMade Deadlock forced restart \n\nStarMade Deadlock forced restart" | /usr/sbin/ssmtp -4 ${emailAddressRecipient}
        logger -t SMPrd "[PAFKA] Deadlock alert sent to ${emailAddressRecipient}"
      fi
    fi
    
    # We're done, cleanup
    rm -f ${pathTemp}jstack-${timestamp}.txt ${pathTemp}jmap-${timestamp}.txt
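
    The packaging step in the script (tar the dumps and logs, then rename the archive after its own MD5 checksum) can be sketched as a standalone function. This is only a sketch under stated assumptions: the function name package_report, its arguments, and the temporary demo directories are hypothetical and not part of the original script. Unlike the original, it quotes its paths, so it also tolerates spaces in file names.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): archive the given
# paths into destDir, rename the archive after its MD5 checksum, and print
# the final path on stdout.
package_report() {
  local destDir="${1%/}"
  shift
  local tmpTar="${destDir}/deadlockDump-$(date +%s).tgz"
  tar -czf "$tmpTar" "$@" || return 1
  local sum
  sum=$(md5sum "$tmpTar" | cut -d " " -f 1)
  mv "$tmpTar" "${destDir}/${sum}.tgz"
  printf '%s\n' "${destDir}/${sum}.tgz"
}

# Demo with throwaway directories standing in for the real log and report paths.
workDir=$(mktemp -d)
mkdir -p "$workDir/logs" "$workDir/reports"
echo "example log line" > "$workDir/logs/server.log"

finalPath=$(package_report "$workDir/reports" "$workDir/logs")
echo "Report saved as $finalPath"

rm -rf "$workDir"
```

    Naming the archive after its checksum gives each report a content-derived name, and it lets whoever receives the file verify it by re-running md5sum.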