Checking the health of Triton

Modified: 26 Sep 2019 21:20 UTC

Periodically, and whenever you investigate a problem, you should check the overall health of your Triton installation.

This page describes a number of checks you can perform using the tools that come with Triton, along with other techniques.

Status of core services and agents

The status of the Triton core services and agents can be checked using the sdc-healthcheck command as follows:

# sdc-healthcheck
ZONE                                 STATE           AGENT               STATUS
global                               running         -                   online
assets                               running         -                   online
sapi                                 running         -                   online
binder                               running         -                   online
amonredis                            running         -                   online
ufds                                 running         -                   online
redis                                running         -                   online
workflow                             running         -                   online
papi                                 running         -                   online
sdc                                  running         -                   online
amon                                 running         -                   online
napi                                 running         -                   online
rabbitmq                             running         -                   online
cnapi                                running         -                   online
dhcpd                                running         -                   online
dapi                                 running         -                   online
fwapi                                running         -                   online
vmapi                                running         -                   online
adminui                              running         -                  offline
imgapi                               running         -                   online
cloudapi                             running         -                   online
manatee                              running         -                   online
moray                                running         -                   online
global                               running         provisioner         online
global                               running         zonetracker         online
global                               running         heartbeat           online
global                               running         ur                  online
global                               running         smartlogin          online

The STATUS field can show one of four values:

Value     Status description
online    All good.
offline   The zone or agent is stopped.
error     The zone-specific check failed.
svc-err   One or more services in the zone are not online.

The standard checks for each zone confirm that it is running and that all of its services are online (svcs -x).

The standard check for each agent is an attempt to connect to it using /opt/smartdc/agents/bin/ping-agent. For the smartlogin agent, svcs is used to verify that the service is running.

Zone-specific checks

The following specific checks are made in each zone to verify it is functioning as expected. Most comprise a call to an API endpoint.

Zone       Check
amon       sdc-amon /pub/admin/probes
cloudapi   sdc-listdatacenters
cnapi      sdc-cnapi /servers?headnode=true
fwapi      sdc-fwapi /rules
imgapi     sdc-imgapi /images?name=imgapi
napi       sdc-napi /networks?name=admin
sapi       sdc-sapi /services?name=sapi
ufds       sdc-ldap search login=admin
vmapi      vmadm lookup -1 tags.smartdc_role=vmapi
workflow   sdc-workflow /workflows

Parsing sdc-healthcheck output

Output from sdc-healthcheck can be generated in a parseable format using the -p flag.

# sdc-healthcheck -p
global:running:-:online
assets:running:-:online
sapi:running:-:online
binder:running:-:online
amonredis:running:-:online
--snip--
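
Because each line of the parseable output has the form zone:state:agent:status, it is easy to filter. The following is a minimal sketch, not part of Triton: the function name unhealthy_entries is hypothetical, and it assumes only the four colon-separated fields shown above.

```shell
# Hypothetical helper (not shipped with Triton).
# Reads `sdc-healthcheck -p` output on stdin and prints the zone, agent
# and status of every entry whose status field is not "online".
# Lines without four colon-separated fields are ignored.
unhealthy_entries() {
    awk -F: 'NF >= 4 && $4 != "online" { printf "%s %s %s\n", $1, $3, $4 }'
}
```

On the head node it could be used as sdc-healthcheck -p | unhealthy_entries, which prints nothing when every zone and agent is online.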

Resolving health check errors

Any zone or agent that returns an error is potentially a serious problem and could impact the ability to provision instances or perform other jobs in Triton. End user instances WILL NOT be affected by error conditions in the core services or agents.

Note: Many SmartOS commands have a -z flag to allow you to call the command from the Global Zone (GZ) for a specific zone. Check the man pages or use --help to see if -z is available on a specific command. Commands can also be run inside a core service zone using sdc-login, e.g.

# sdc-login adminui svcs -x

Both methods are used below.

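Error: offline

If a zone is reported as offline, boot it from the global zone and then confirm that it is running: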

# zoneadm -z $(sdc-vmname adminui) boot

[root@headnode (mxpa) ~]# zoneadm -z $(sdc-vmname adminui) list -v
  ID NAME             STATUS     PATH                           BRAND    IP
  28 faa81fbe-ffc5-4b51-bcee-2d2562f01daf running    /zones/faa81fbe-ffc5-4b51-bcee-2d2562f01daf joyent-minimal excl

Error: svc-err

Use svcs -x to list the services in the zone that are not online:

# svcs -x -z $(sdc-vmname imgapi)
svc:/smartdc/site/imgapi:default (Triton Image API)
  Zone: 234c005d-63ca-49d2-b27d-7af560cae951
 Alias: imgapi0
 State: maintenance since  6 March 2014 09:26:37 UTC
Reason: Restarting too quickly.
   See: http://illumos.org/msg/SMF-8000-L5
   See: /zones/234c005d-63ca-49d2-b27d-7af560cae951/root/var/svc/log/smartdc-site-imgapi:default.log
Impact: This service is not running.

For any service in the maintenance state, you can attempt to restart it using svcadm. This command requires an action and a service name, but it is usually enough to provide just enough of the name to identify the service uniquely. For example:

svc:/smartdc/site/imgapi:default

can be abbreviated to:

imgapi

Either of the following commands clears the maintenance state, the first from the global zone and the second from inside the zone:

# svcadm -z $(sdc-vmname imgapi) clear imgapi
# sdc-login imgapi svcadm clear imgapi

Re-check the status of the services using svcs -x -z $(sdc-vmname imgapi). If the service still appears in the output, it is time to dig into the log files.

The log file name is shown in the svcs -x output above. It can also be obtained using the -L flag to svcs:

# svcs -L -z $(sdc-vmname imgapi) imgapi
/zones/234c005d-63ca-49d2-b27d-7af560cae951/root/var/svc/log/smartdc-site-imgapi:default.log

On examining the log file you may be able to understand the underlying problem and resolve it. However, it is most likely you will need to raise a support issue with Joyent at help.joyent.com. Please provide a support bundle with all issues relating to the operation of the head node and core services.

Additional health checks

The following checks should be built into a regular overall health check of Triton. These can and should be automated via cron jobs or as part of a monitoring framework such as Nagios or Zabbix.
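
As one possible starting point for such automation, the sketch below turns the output of sdc-healthcheck -p into a one-line summary and an exit code. It follows the common Nagios plugin convention (0 for OK, 2 for CRITICAL, 3 for UNKNOWN). The function name check_triton_health is hypothetical and the summary format is an assumption, not something Triton provides.

```shell
# Hypothetical wrapper (not shipped with Triton).
# Reads `sdc-healthcheck -p` output (zone:state:agent:status) on stdin.
# Returns 0 if every entry is online, 2 if any entry is not online,
# and 3 if no parseable entries were seen at all.
check_triton_health() {
    awk -F: '
        NF >= 4 {
            total++
            if ($4 != "online") bad = bad " " $1 ":" $3 ":" $4
        }
        END {
            if (total == 0) { print "UNKNOWN: no healthcheck output"; exit 3 }
            if (bad != "")  { print "CRITICAL:" bad; exit 2 }
            print "OK: " total " entries online"
        }'
}
```

A cron job or monitoring agent on the head node could run sdc-healthcheck -p | check_triton_health and alert on any non-zero exit code.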

Compute node agent checks

The health of the agents on the Compute nodes should be checked using svcs -x. This can be done from the Head Node using sdc-oneachnode as follows.

[root@headnode (mxpa) ~]# sdc-oneachnode -c svcs -x
=== Output from 44454c4c-3700-1039-8034-c2c04f445131 (CN1):
    svc:/network/ntp:default (Network Time Protocol (NTP) Version 4)
     State: maintenance since Tue Mar 11 15:11:57 2014
    Reason: Maintenance requested by "svc:/smartdc/agent/ur:default"
       See: /var/svc/log/smartdc-agent-ur:default.log
       See: http://illumos.org/msg/SMF-8000-R4
       See: ntpd(1M)
       See: ntp.conf(4)
       See: ntpq(1M)
       See: /var/svc/log/network-ntp:default.log
    Impact: This service is not running.

The -c flag tells sdc-oneachnode to run the command on Compute Nodes only and not on the head node.

For any failed or maintenance services, follow the same procedure as described under "Error: svc-err" above.
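
When there are many compute nodes, it can help to reduce this output to a list of the nodes that need attention. The sketch below is a hypothetical helper, not part of Triton. It assumes that each node's section starts with a header of the form "=== Output from UUID (HOSTNAME):", as in the example above, and that a healthy node's section contains no further non-blank lines, since svcs -x prints nothing when all services are healthy.

```shell
# Hypothetical helper (not shipped with Triton).
# Reads `sdc-oneachnode -c svcs -x` output on stdin and prints the
# hostname of each compute node whose section contains any output.
nodes_with_faults() {
    awk '
        /^=== Output from / {
            node = $NF                  # last field, e.g. "(CN1):"
            gsub(/[():]/, "", node)     # strip punctuation, leaving "CN1"
            reported = 0
            next
        }
        # First non-blank line after a header: report that node once.
        NF > 0 && node != "" && !reported { print node; reported = 1 }
    '
}
```

For example, sdc-oneachnode -c svcs -x | nodes_with_faults would print CN1 for the output shown above.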

Compute node networking

Compute nodes do not have plumbed interfaces in the Global Zone for the networks used by instances. It is therefore not possible to simply ping a Compute Node to determine whether its instance networking is functioning. DO NOT be tempted to add a plumbed interface for any of these networks in the Global Zone of a compute node. Doing so could significantly compromise the security of Compute Nodes, which must remain isolated from the public internet.

CloudAPI endpoint

Although sdc-healthcheck performs a direct query on CloudAPI it does so from within the global zone of the Head Node. This does not verify that there is publicly accessible internet connectivity to CloudAPI.

You should regularly poll the CloudAPI endpoint using sdc-listdatacenters to ensure it is responding in a timely manner. This should be done from a location that requires communication to pass over the wider internet.
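
One way to check both that the endpoint responds and that it does so quickly is to time the command. The sketch below is a generic, hypothetical helper: the name timed_check, the whole-second resolution, and the exit codes (1 for slow, 2 for failed) are assumptions chosen for illustration.

```shell
# Hypothetical helper (not shipped with Triton).
# Usage: timed_check MAX_SECONDS COMMAND [ARGS...]
# Runs COMMAND, discarding its output. Returns 2 if the command fails,
# 1 if it succeeds but takes longer than MAX_SECONDS, and 0 otherwise.
# Timing uses whole seconds, so very short limits are only approximate.
timed_check() {
    max_seconds=$1
    shift
    start=$(date +%s)
    if ! "$@" >/dev/null 2>&1; then
        echo "FAIL: '$*' returned an error"
        return 2
    fi
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -gt "$max_seconds" ]; then
        echo "SLOW: '$*' took ${elapsed}s (limit ${max_seconds}s)"
        return 1
    fi
    echo "OK: '$*' took ${elapsed}s"
}
```

From an external host that already has the CloudAPI client tools configured, this could be run as timed_check 5 sdc-listdatacenters, where the 5 second limit is an arbitrary example to adjust for your environment.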

Where to go next

You may want to review Troubleshooting Triton in order to become familiar with handling error conditions and common problems.