Get information about a failed Platform component

If a Caplin Platform component fails, follow the instructions on this page to help you and Caplin Support diagnose the failure.

These instructions assume your Platform components run under the Caplin Deployment Framework.

Prerequisites

Perform the following preparatory tasks on all Caplin Platform servers before going live:

Set system clock to UTC

To make log file analysis easier in the event of component failure, we recommend that the system clocks of all servers in your Caplin Platform deployment are set to UTC.
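On RHEL 7 and later, this can be done with timedatectl (a sketch, not part of the original instructions; requires root and assumes systemd is in use):

```
# Set the system time zone to UTC, then confirm the change (requires root):
sudo timedatectl set-timezone UTC
timedatectl | grep "Time zone"
```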

Enable core dumps

To enable core dumps on Red Hat Enterprise Linux (RHEL), edit the /etc/security/limits.conf file and set the core-file limit for the user that runs Caplin Platform components to unlimited. For more information on setting user limits, see How to set ulimit values on the Red Hat website.
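For example, if the components run as a hypothetical user caplin, the limits.conf entries might look like this (a sketch; substitute your own service account):

```
# /etc/security/limits.conf
# Allow unlimited-size core files for the user that runs Caplin components.
caplin soft core unlimited
caplin hard core unlimited
```

After the user next logs in, `ulimit -c` should report `unlimited`.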

Enable Java garbage collection logging

To enable GC logging for JVMs embedded in Liberator, Transformer, and C-based adapters (for example, the TREP Adapter), add the following jvm-options configuration items to the component’s java.conf override file:

File: java.conf
jvm-options -Xloggc:var/gc.log
jvm-options -XX:+PrintGCDetails
jvm-options -XX:+PrintGCDateStamps
jvm-options -XX:+UseGCLogFileRotation
jvm-options -XX:NumberOfGCLogFiles=10
jvm-options -XX:GCLogFileSize=5M
jvm-options -XX:+PrintConcurrentLocks
jvm-options -XX:+PrintCommandLineFlags
jvm-options -XX:+HeapDumpOnOutOfMemoryError
jvm-options -XX:HeapDumpPath=var

To enable GC logging for JVMs that run integration adapters, add GC logging options to the java command in the adapter’s startup script kits/<blade_name>/DataSource/bin/start-jar.sh:

File: kits/<blade_name>/DataSource/bin/start-jar.sh
java \
  -Xloggc:var/gc.log \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+UseGCLogFileRotation \
  -XX:NumberOfGCLogFiles=10 \
  -XX:GCLogFileSize=5M \
  -XX:+PrintConcurrentLocks \
  -XX:+PrintCommandLineFlags \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=var \
  ...original arguments
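Note that the -XX:+PrintGC* and GC-log-rotation flags shown above are JDK 8 options; JDK 9 and later removed them in favour of unified logging. If your adapters run on a later JDK, an approximately equivalent option (a sketch; verify against your JDK's -Xlog documentation) is:

```
-Xlog:gc*:file=var/gc.log:time,uptime,level,tags:filecount=10,filesize=5m
```

The -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath options remain valid on later JDKs.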

Install the GNU Debugger

To install the GNU Debugger on Red Hat Enterprise Linux, run the command below:

$ sudo yum install gdb

The Red Hat gdb package includes the commands gcore and gdb, which are used in the instructions below to generate and analyse core files.

Diagnostic information for a running component

For components that are failing but have not crashed (for example, an unresponsive component or a component with a suspected memory leak), follow the steps below to gather diagnostic information:

Incident date, software versions, and user activity

Record the following information about the environment at the time the component failed:

  • Date and time the component failed

  • Caplin component versions

    $ ./dfw versions > versions.out
  • Host operating system version

  • User activity (if known). For example, "We were testing clustered Liberators with this configuration (supplied) when one of the Liberators failed."

CPU, memory, and disk usage

Run the top command and record the output over 1 minute:

$ for i in {1..12}; do echo; echo; top -b -n 1; sleep 5; done > top.out

Run the top command for the component’s PID and record the output over 1 minute:

$ for i in {1..12}; do echo; echo; top -H -p <pid> -b -n 1; sleep 5; done > top-<pid>.out
If only one instance of the component is running on the host, a quick way to include the PID in the command above is with the pidof command. For example, to get the PID of a running Liberator, use $(pidof rttpd).

Record disk usage:

$ df -kh > df.out

Record memory usage:

$ free -h -w > free.out

Record memory, I/O, and CPU usage over 30 seconds:

$ vmstat -S m 1 30 > vmstat.out

GDB thread backtraces

Execute the following command three times at intervals of 5 minutes:

$ gdb <binary> -p <pid> \
> -ex "set confirm off" \
> -ex "set pagination off" \
> -ex "thread apply all bt full" \
> -ex "quit" > debug-$(date +%Y%m%d%H%M%S).log
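The three captures can be scripted as a single loop; a sketch, where <binary> and <pid> are the same placeholders as in the command above:

```
# Capture three full backtraces, five minutes apart.
for i in 1 2 3; do
  gdb <binary> -p <pid> \
    -ex "set confirm off" \
    -ex "set pagination off" \
    -ex "thread apply all bt full" \
    -ex "quit" > debug-$(date +%Y%m%d%H%M%S).log
  [ "$i" -lt 3 ] && sleep 300
done
```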

JVM stack trace and heap dump

Run the commands below to dump thread stack traces and the heap of a running JVM:

Print stack traces for a running JVM
$ jstack -l <pid> > jstack.<pid>.output
Dump the heap of a running JVM (binary HPROF format)
$ jmap -dump:file=jmap.<pid>.output <pid>
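If jstack or jmap are unavailable on the host, the jcmd utility (shipped with JDK 7 and later) provides equivalents; a sketch using the same <pid> placeholder:

```
# Thread dump, equivalent to jstack -l:
jcmd <pid> Thread.print -l > jcmd-threads.<pid>.output

# Binary heap dump in HPROF format, equivalent to jmap -dump:
jcmd <pid> GC.heap_dump jcmd-heap.<pid>.hprof
```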

System call trace

Use the strace command to trace the system calls made, and the signals received, by a running Caplin component:

$ mkdir strace
$ sudo strace -ff -tt -o strace/<process_name>_strace -p <pid>

Core file

Follow the steps below:

  1. Generate a core file for a running component:

    $ sudo gcore -o core <pid>

    Alternatively, if the gcore command is not available on the host, you can generate a core file by terminating a running component with a SIGSEGV signal (11):

    $ kill -11 <pid>
  2. Generate a thread backtrace from the core file:

    $ gdb <binary> <core_file>
    (gdb) set logging file debug.log
    (gdb) set logging on
    (gdb) set pagination off
    (gdb) thread apply all bt full
    (gdb) quit
  3. Generate a list of shared libraries from the core file:

    $ gdb <binary> <core_file>
    (gdb) set logging file libs-list.out
    (gdb) set logging on
    (gdb) set pagination off
    (gdb) info sharedlibrary
    (gdb) quit
  4. Execute the command below to clean the libs-list.out output file:

    $ grep "Yes\|No" ./libs-list.out | awk '{if ($5 != "")print $5}' &> libs-list.txt
  5. Execute the command below to package the file libs-list.txt in a tar file for Caplin Support:

    $ cat libs-list.txt | sed "s/\/\.\.\//\//g" | xargs tar -chvf $(hostname)-libs-list.tar

Java garbage collection log

Copy the Java garbage collection (GC) log (if present) for the failed component(s).

Garbage collection log locations

Component            Location
Liberator            servers/Liberator/var/gc.log
Transformer          servers/Transformer/var/gc.log
Integration adapter  kits/<blade_name>/Latest/DataSource/var/gc.log

Caplin log files for the period when the component failed

Copy the Caplin log files that cover the period during which the component(s) failed.

Log file locations

Component            Location
Liberator            servers/Liberator/var/
Transformer          servers/Transformer/var/
Integration adapter  kits/<blade_name>/Latest/DataSource/var/

Useful commands for working with log files
Print a log file’s start and end date
$ (head -n 1 input_file; tail -n 1 input_file) | cut -c 1-20
Extract a log file’s lines between two dates
$ sed -n '/2016-07-13/,/2016-07-19/p' input_file > output_file
Download logs from a Deployment Framework (DFW) on a server
$ ssh host "shopt -s nullglob globstar && nice -n 10 tar --warning=no-file-changed --ignore-failed-read -czf - path_to_dfw/{kits,servers}/**/var/*.{log,[0-9]{[0-9],}}" > logs.tar.gz
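If you need finer granularity than whole days, and assuming each log line begins with a "YYYY-MM-DD HH:MM:SS" timestamp, a lexicographic comparison in awk can extract a time window (a sketch; input_file and the timestamps are illustrative):

```shell
# Example log with a leading "YYYY-MM-DD HH:MM:SS" timestamp per line
# (a stand-in for a real Caplin log file):
printf '%s\n' \
  "2016-07-13 08:00:00 INFO early line" \
  "2016-07-13 10:00:00 INFO inside window" \
  "2016-07-13 18:00:00 INFO late line" > input_file

# Keep lines whose timestamp falls inside the window (inclusive); plain string
# comparison works because this timestamp format sorts lexicographically.
awk -v from="2016-07-13 09:00:00" -v to="2016-07-13 17:30:00" \
    '($1 " " $2) >= from && ($1 " " $2) <= to' input_file > output_file

cat output_file   # prints only the in-window line
```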

Caplin configuration files

If the host is offline or can be taken offline, run the command below to stop all Caplin components on the host and generate a configuration dump.

$ ./dfw dump
Do not run dfw dump on a live production system. The command stops all Caplin components.

If you cannot take the host offline now or at a scheduled time, then run the command below from the root directory of the Deployment Framework to create an archive of raw configuration files:

$ tar czf $(hostname)-config.tar.gz global_config/*.conf global_config/overrides

Diagnostic information for a crashed component

After a crash, collate the following files and information for Caplin Support:

Incident date, software versions, and user activity

Record the following information about the environment at the time the component failed:

  • Date and time the component failed

  • Caplin component versions:

    $ ./dfw versions > versions.out
  • Operating system version

  • User activity (if known). For example, "We were testing clustered Liberators with this configuration (supplied) when one of the Liberators failed."

CPU, memory, and disk usage

Run the top command and record the output over 1 minute:

$ for i in {1..12}; do echo; echo; top -b -n 1; sleep 5; done > top.out

Record disk usage:

$ df -kh > df.out

Record memory usage:

$ free -h -w > free.out

Record memory, I/O, and CPU usage over 30 seconds:

$ vmstat -S m 1 30 > vmstat.out

Core file

Copy the core files for the failed component(s).

Core file locations

Component        Deployment Framework path
Liberator        servers/Liberator/core.<pid>
Transformer      servers/Transformer/core.<pid>
C-based adapter  kits/<adapter-blade-name>/Latest/core.<pid>

Java-based adapters generate a HotSpot JVM error file on failure.

Core files may be named differently on your system. See Naming of core dump files in the core man page.
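To check how the kernel names core files on a given host (a quick inspection, assuming Linux):

```shell
# Print the kernel's core-file naming template; see core(5) for the % specifiers.
cat /proc/sys/kernel/core_pattern
```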

GDB thread backtrace and shared-library list

Generate a thread backtrace and list of shared libraries from the core file. This is easier to perform on the host on which the failure occurred, but can be performed on a different host if required. Contact Caplin Support for details.

To generate a thread backtrace from the core file, execute the following GDB commands on the host machine:

$ gdb <binary> <core_file>
(gdb) set logging file debug.log
(gdb) set logging on
(gdb) set pagination off
(gdb) thread apply all bt full
(gdb) quit

To generate a list of shared libraries from the core file, follow the steps below on the host machine:

  1. Execute the following GDB commands:

    $ gdb <binary> <core_file>
    (gdb) set logging file libs-list.out
    (gdb) set logging on
    (gdb) set pagination off
    (gdb) info sharedlibrary
    (gdb) quit
  2. Execute the command below to clean the libs-list.out output file:

    $ grep "Yes\|No" ./libs-list.out | awk '{if ($5 != "")print $5}' &> libs-list.txt
  3. Execute the command below to package the file libs-list.txt in a tar file for Caplin Support:

    $ cat libs-list.txt | sed "s/\/\.\.\//\//g" | xargs tar -chvf $(hostname)-libs-list.tar

HotSpot JVM error file

Copy the HotSpot JVM error file (if present) for the failed component(s).

HotSpot error file locations

Component            Location
Liberator            servers/Liberator/hs_err_pid<pid>.log
Transformer          servers/Transformer/hs_err_pid<pid>.log
Integration adapter  kits/<blade_name>/Latest/hs_err_pid<pid>.log

Java garbage collection log

Copy the Java garbage collection (GC) log (if present) for the failed component(s).

Garbage collection log locations

Component            Location
Liberator            servers/Liberator/var/gc.log
Transformer          servers/Transformer/var/gc.log
Integration adapter  kits/<blade_name>/Latest/DataSource/var/gc.log

Caplin log files for the period when the component failed

Copy the Caplin log files that cover the period during which the component(s) failed.

Log file locations

Component            Location
Liberator            servers/Liberator/var/
Transformer          servers/Transformer/var/
Integration adapter  kits/<blade_name>/Latest/DataSource/var/

Useful commands for working with log files
Print a log file’s start and end date
$ (head -n 1 input_file; tail -n 1 input_file) | cut -c 1-20
Extract a log file’s lines between two dates
$ sed -n '/2016-07-13/,/2016-07-19/p' input_file > output_file
Download logs from a Deployment Framework (DFW) on a server
$ ssh host "shopt -s nullglob globstar && nice -n 10 tar --warning=no-file-changed --ignore-failed-read -czf - path_to_dfw/{kits,servers}/**/var/*.{log,[0-9]{[0-9],}}" > logs.tar.gz

Caplin configuration files

If the host is offline or can be taken offline, run the command below to stop all Caplin components on the host and generate a configuration dump.

$ ./dfw dump
Do not run dfw dump on a live production system. The command stops all Caplin components.

If you cannot take the host offline now or at a scheduled time, then run the command below from the root directory of the Deployment Framework to create an archive of raw configuration files:

$ tar czf $(hostname)-config.tar.gz global_config/*.conf global_config/overrides

Uploading diagnostic information to Caplin Support

To upload diagnostic information to Caplin Support, follow the steps below.

  1. Add all diagnostic files to a single compressed archive (.tar.gz, .zip).

  2. If you don’t yet have a JIRA ticket for the issue, navigate to https://jira.caplin.com and create a ticket.

    Do not attach any files to the ticket.

  3. Navigate to https://www.caplin.com/account and log in with your Caplin account credentials.

  4. Click Uploads.

  5. In the JIRA issue ID field, enter the ID of your JIRA ticket.

  6. Select 'I acknowledge and consent that files will be temporarily uploaded to encrypted Amazon S3 filestore…'.

  7. Click Submit.

  8. Drag the archive you created in step 1 to the Upload box.

  9. Click Upload.

There is no upload file-size limit.

If the upload is interrupted, start it again; it will resume from where it left off.

Statistics displayed in the progress bar are estimates only and can be incorrect after resuming an upload.

Files are transferred to Caplin servers using an encrypted TLS connection, where they are saved to an encrypted Amazon S3 file store. At no point is any data transferred in plain text, and access is restricted to the Caplin Support team only.

If you require more information regarding this service, please contact Caplin Support.


See also: