Checking the health of Triton
Periodically, and whenever you are investigating problems, you should check the overall health of your Triton installation.
This page describes a number of checks you can perform using tools that come with Triton, along with other techniques.
The status of the Triton core services and agents can be checked using the
sdc-healthcheck command as follows:
# sdc-healthcheck
ZONE       STATE    AGENT        STATUS
global     running  -            online
assets     running  -            online
sapi       running  -            online
binder     running  -            online
amonredis  running  -            online
ufds       running  -            online
redis      running  -            online
workflow   running  -            online
papi       running  -            online
sdc        running  -            online
amon       running  -            online
napi       running  -            online
rabbitmq   running  -            online
cnapi      running  -            online
dhcpd      running  -            online
dapi       running  -            online
fwapi      running  -            online
vmapi      running  -            online
adminui    running  -            offline
imgapi     running  -            online
cloudapi   running  -            online
manatee    running  -            online
moray      running  -            online
global     running  provisioner  online
global     running  zonetracker  online
global     running  heartbeat    online
global     running  ur           online
global     running  smartlogin   online
The status field can show one of four values: online, as in most rows of the output above, or one of the following:
|offline||The Zone or Agent is stopped.|
|error||The Zone specific check failed.|
|svc-err||One or more services in the Zone is not online.|
The standard checks for each zone confirm that it is running and that all of its services are online.
The standard check for the agents is to attempt to connect to the agent using /opt/smartdc/agents/bin/ping-agent. For the Smartlogin agent, svcs is used to verify that the service is running.
The following specific checks are made in each zone to verify it is functioning as expected. Most comprise a call to an API endpoint.
The output of sdc-healthcheck can be generated in a parseable, colon-separated format using the -p flag:
# sdc-healthcheck -p
global:running:-:online
assets:running:-:online
sapi:running:-:online
binder:running:-:online
amonredis:running:-:online
--snip--
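The parseable format lends itself to simple filtering. The following is a minimal sketch, not part of Triton itself: the function name unhealthy_entries is hypothetical, and the sketch assumes the four colon-separated fields (zone, state, agent, status) seen in the sample output above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Triton): reads lines in the
# zone:state:agent:status layout on stdin and prints every entry whose
# status field is anything other than "online".
# Lines with fewer than four fields (such as "--snip--") are ignored.
unhealthy_entries() {
    awk -F: 'NF >= 4 && $4 != "online" {
        printf "%s (agent: %s) is %s\n", $1, $3, $4
    }'
}
```

On a head node it could be used as `sdc-healthcheck -p | unhealthy_entries`. An empty result means every reported zone and agent is online.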
Any zone or agent that returns an error is potentially a serious problem and could impact the ability to provision instances or perform other jobs in Triton. End user instances WILL NOT be affected by error conditions in the core services or agents.
Note: Many SmartOS commands have a -z flag that allows you to call the command from the Global Zone (GZ) for a specific zone. Check the man pages or use --help to see if -z is available on a specific command. Commands can also be run inside a core service zone using sdc-login:
# sdc-login adminui svcs -x
Both methods are used below.
- Attempt to restart the failed zone, e.g. boot up the adminui zone using
# zoneadm -z $(sdc-vmname adminui) boot
- Re-check the zone state:
[root@headnode (mxpa) ~]# zoneadm -z $(sdc-vmname adminui) list -v
  ID NAME                                  STATUS   PATH                                         BRAND           IP
  28 faa81fbe-ffc5-4b51-bcee-2d2562f01daf  running  /zones/faa81fbe-ffc5-4b51-bcee-2d2562f01daf  joyent-minimal  excl
- Then re-run sdc-healthcheck.
- Check which service or services have failed, e.g. for the imgapi zone:
# svcs -x -z $(sdc-vmname imgapi)
svc:/smartdc/site/imgapi:default (Triton Image API)
  Zone: 234c005d-63ca-49d2-b27d-7af560cae951
 Alias: imgapi0
 State: maintenance since 6 March 2014 09:26:37 UTC
Reason: Restarting too quickly.
   See: http://illumos.org/msg/SMF-8000-L5
   See: /zones/234c005d-63ca-49d2-b27d-7af560cae951/root/var/svc/log/smartdc-site-imgapi:default.log
Impact: This service is not running.
For any service showing a state of maintenance you can attempt to restart it using svcadm. This command requires an action and a service name, but it is typically only necessary to provide enough of the name to uniquely identify a service. For example, the full service name svc:/smartdc/site/imgapi:default can be abbreviated to imgapi, using either of the two methods:
# svcadm -z $(sdc-vmname imgapi) clear imgapi
# sdc-login imgapi svcadm clear imgapi
Re-check the status of the services using svcs -x -z $(sdc-vmname imgapi). If the services are still showing in this output, it is time to dig into the log files.
The log file name is shown in the output of svcs above. It can also be obtained using the -L flag to svcs:
# svcs -L -z $(sdc-vmname imgapi) imgapi
/zones/234c005d-63ca-49d2-b27d-7af560cae951/root/var/svc/log/smartdc-site-imgapi:default.log
On examining the log file you may be able to understand the underlying problem and resolve it. However, it is most likely you will need to raise a support issue with Joyent at help.joyent.com. Please provide a support bundle with all issues relating to the operation of the head node and core services.
The following checks should be built into a regular overall health check of Triton. These can and should be automated via cron jobs or as part of a monitoring framework such as Nagios or Zabbix.
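As one illustration of such automation, the parseable sdc-healthcheck output described earlier could be wrapped in a Nagios-style check. This is a sketch under stated assumptions: the function name triton_nagios_check is hypothetical, the input is assumed to use the zone:state:agent:status layout, and the return codes follow the usual Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/sh
# Hypothetical Nagios-style wrapper (not shipped with Triton).
# Reads zone:state:agent:status lines on stdin, prints a one-line summary,
# and returns 0 (OK), 2 (CRITICAL) or 3 (UNKNOWN).
triton_nagios_check() {
    input=$(cat)

    # Count usable lines so that empty or malformed input is not reported as OK.
    total=$(printf '%s\n' "$input" | awk -F: 'NF >= 4 { n++ } END { print n + 0 }')
    if [ "$total" -eq 0 ]; then
        echo "UNKNOWN - no health check data received"
        return 3
    fi

    # Build a comma-separated list of entries that are not online.
    # Agent rows are labelled by agent name, zone rows by zone name.
    failed=$(printf '%s\n' "$input" | awk -F: '
        NF >= 4 && $4 != "online" {
            name = ($3 == "-") ? $1 : $3
            printf "%s%s=%s", sep, name, $4
            sep = ", "
        }')

    if [ -z "$failed" ]; then
        echo "OK - $total zones and agents online"
        return 0
    fi
    echo "CRITICAL - $failed"
    return 2
}
```

A wrapper script run from cron or by a monitoring agent on the head node might call `sdc-healthcheck -p | triton_nagios_check` and exit with the function's return code. Treating every non-online status as critical is a deliberate simplification; a site may prefer to map some conditions to a warning instead.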
The health of the agents on the Compute nodes should be checked using
svcs -x. This can be done from the Head Node using
sdc-oneachnode as follows.
[root@headnode (mxpa) ~]# sdc-oneachnode -c svcs -x
=== Output from 44454c4c-3700-1039-8034-c2c04f445131 (CN1):
svc:/network/ntp:default (Network Time Protocol (NTP) Version 4)
 State: maintenance since Tue Mar 11 15:11:57 2014
Reason: Maintenance requested by "svc:/smartdc/agent/ur:default"
   See: /var/svc/log/smartdc-agent-ur:default.log
   See: http://illumos.org/msg/SMF-8000-R4
   See: ntpd(1M)
   See: ntp.conf(4)
   See: ntpq(1M)
   See: /var/svc/log/network-ntp:default.log
Impact: This service is not running.
The -c flag tells sdc-oneachnode to run the command on Compute Nodes only and not on the Head Node.
Follow the same procedure as described under
svc-err in the above table for any failed/maintenance services.
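When many Compute Nodes are involved, it can help to condense this output to one line per failed service. The sketch below is an illustration only: the function name failed_services_by_node is hypothetical, and the parsing assumes the layout of the sample output above (a "=== Output from UUID (NAME):" header per node, followed by svcs -x records).

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Triton): reads sdc-oneachnode
# output wrapping "svcs -x" records on stdin and prints
# "<node uuid> (<node name>): <service FMRI> is in <state>" per record.
failed_services_by_node() {
    awk '
        # Node header, e.g. "=== Output from <uuid> (CN1):"
        /^=== Output from / {
            node = $4 " " $5
            sub(/:$/, "", node)   # drop the trailing colon
            next
        }
        # First line of a svcs -x record starts with the service FMRI.
        /^svc:\// { fmri = $1; next }
        # The State line carries the current state as its second word.
        /^[[:space:]]*State:/ { print node ": " fmri " is in " $2 }
    '
}
```

From the Head Node this could be run as `sdc-oneachnode -c svcs -x | failed_services_by_node`.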
Compute nodes do not have plumbed interfaces in the Global Zone for the networks used in instances. Thus it is not possible to simply ping a Compute Node to determine if its networking is functioning. DO NOT be tempted to add a plumbed interface for any networks in the Global Zone of a compute node. This could significantly compromise the security of Compute Nodes which must remain isolated from the public internet.
Although sdc-healthcheck performs a direct query on CloudAPI, it does so from within the Global Zone of the Head Node. This does not verify that CloudAPI is reachable from the public internet.
You should regularly poll the CloudAPI endpoint using sdc-listdatacenters to ensure it is responding in a timely manner. This should be done from a location that requires communication to pass over the wider internet.
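A polling check of this kind needs to catch both outright failures and slow responses. The following is a minimal sketch: the function name poll_within and its return codes are hypothetical, timing is only accurate to whole seconds, and it assumes a date command that supports +%s.

```shell
#!/bin/sh
# Hypothetical helper: run a command and report whether it succeeded
# within a time limit given in whole seconds.
# Returns 0 if it succeeded in time, 1 if it was too slow, 2 if it failed.
poll_within() {
    max_seconds=$1
    shift

    start=$(date +%s)
    if ! "$@" >/dev/null 2>&1; then
        echo "FAIL: '$*' exited with an error"
        return 2
    fi
    end=$(date +%s)
    elapsed=$((end - start))

    if [ "$elapsed" -gt "$max_seconds" ]; then
        echo "SLOW: '$*' took ${elapsed}s (limit ${max_seconds}s)"
        return 1
    fi
    echo "OK: '$*' answered in ${elapsed}s"
    return 0
}
```

For example, `poll_within 5 sdc-listdatacenters` could be run on an external host, assuming the CloudAPI client tools are installed and configured there. The five second limit is an arbitrary placeholder to be tuned for your environment.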
You may want to review Troubleshooting Triton in order to become familiar with handling error conditions and common problems.