K12531: Troubleshooting health monitors


Issue

A monitor is a BIG-IP feature that verifies connections to pool members or nodes. A health monitor is designed to report the status of a pool, pool member, or node on an ongoing basis, at a set interval. When a health monitor marks a pool, pool member, or node as down, the BIG-IP system stops sending traffic to the device.

A failing or misconfigured health monitor may cause traffic management issues similar to, but not limited to, the following:

  • Connections to the virtual server are interrupted or fail.
  • Web pages or applications fail to load or run.
  • Certain pool members or nodes receive more connections than others.

Any of these symptoms may indicate that a health monitor is marking a pool, pool member, or node as indefinitely down or that a monitor is repeatedly marking a pool member or node as down and then as back up (often called "bouncing"). For example, if a misconfigured health monitor repeatedly marks pool members as down and then as back up, connections to the virtual server may be interrupted or fail altogether. If this occurs, you need to determine whether the monitor is misconfigured, the device or application is failing, or some other factor, such as a network-related issue, is causing the monitor to fail. The troubleshooting steps you take depend on the monitor type and the symptoms you observe.

You can use the following procedures to troubleshoot health monitor issues:

Identifying a failing health monitor

You can use the Configuration utility, command line utilities, logs, or SNMP to help identify when a health monitor marks a pool, pool member, or node as down.

Configuration utility

The following table lists Configuration utility pages where you can check the status of pools, pool members, and nodes.

Configuration utility page | Description | Location
Network map | Summary of pools, pool members, and nodes | Local Traffic > Network Map
Pools | Current status of pools | Local Traffic > Pools > Statistics
Pool members | Current status of pool members | Local Traffic > Pools > Statistics
Nodes | Current status of nodes | Local Traffic > Nodes > Statistics

Command line utilities

The following table lists command line utilities that you can use to monitor the status of pools, pool members, and nodes.

Command line utility | Description | Example commands
TMOS Shell (tmsh) (BIG-IP 10.x and later) | Statistical information about pools, pool members, and nodes | tmsh show /ltm pool <pool_name>; tmsh show /ltm node <node_IP>
bigtop | Live statistics for pool members and nodes | bigtop -n
bigpipe (BIG-IP 10.x) | Statistical information about pools, pool members, and nodes | bigpipe pool show; bigpipe node show
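
For example, assuming a pool named http_pool (a hypothetical name used here only for illustration) and the node address 10.10.65.1 used in the log examples later in this article, the following tmsh commands display the current availability of the pool members and the node:

tmsh show /ltm pool http_pool members
tmsh show /ltm node 10.10.65.1

The output includes the current availability and state of each object, along with a status reason that indicates why a monitor marked it up or down.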

Logs

The BIG-IP system logs messages related to health monitors to the /var/log/ltm file. You can review log files to determine the frequency with which the system marks pool members and nodes as down.

  • Pools

    When a health monitor marks all members of a pool as down or up, the BIG-IP system logs messages to the /var/log/ltm file which appear similar to the following example:

    tmm err tmm[4779]: 01010028:3: No members available for pool <Pool_name>
    tmm err tmm[4779]: 01010221:3: Pool <Pool_name> now has available members

  • Pool members

    When a health monitor marks pool members as down or up, the BIG-IP system logs messages to the /var/log/ltm file which appear similar to the following example:

    notice mcpd[2964]: 01070638:5: Pool <Pool_name> member <ServerIP_port> monitor status down [ <MonitorA_name>: down, <MonitorB_name>: down ] [ was up for <#>hrs:<#>mins:<#>sec ]
    notice mcpd[2964]: 01070727:5: Pool <Pool_name> member <ServerIP_port> monitor status up. [ <MonitorA_name>: down, <MonitorB_name>: up ] [ was down for <#>hrs:<#>mins:<#>sec ]


    When a pool member is forced offline by the administrator, the BIG-IP system logs messages to the /var/log/ltm file which appear similar to the following example:

    notice mcpd[5897]: 01070638:5: Pool <Pool_name> member <ServerIP_port> monitor status forced down. [ <MonitorA_name>: down, <MonitorB_name>: up ] [ was up for <#>hrs:<#>mins:<#>sec ]

  • Nodes

    When a health monitor marks a node as down or up, the BIG-IP system logs messages to the /var/log/ltm file which appear similar to the following example:

    notice mcpd[2964]: 01070640:5: Node <ServerIP> monitor status down.
    notice mcpd[2964]: 01070728:5: Node <ServerIP> monitor status up.
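
Because each status change is logged with a distinct message ID, you can use those IDs to get a rough count of how often the system marks objects down or up. For example, the following commands (a simple sketch using the message IDs shown above) count pool member down transitions, including administratively forced down events, and up transitions in the current log file:

grep -c '01070638' /var/log/ltm
grep -c '01070727' /var/log/ltm

A count that keeps climbing between checks suggests that a pool member is bouncing rather than staying down.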

Monitor logging

In BIG-IP 11.5.0 and later, the Monitor Logging option allows the system to log more verbose monitor messages for each pool member or node. The BIG-IP system stores the log for the respective pool member or node in the /var/log/monitors/ directory. The system does not save the Monitor Logging option setting in the system configuration; instead, it disables the option when the configuration loads. Additionally, the BIG-IP system does not include the Monitor Logging option in configuration sync operations.

The log file has the following file naming format:

<MonitorPartition>_<MonitorName>-<NodePartition>_<NodeName>-<port>.log

For example, if the gateway_icmp monitor is set to monitor pool member 10.10.12.200 and the Monitor Logging option is set to Enabled, the BIG-IP system generates the following log file for the pool member:

/var/log/monitors/Common_gateway_icmp-Common_10.10.12.200-0.log
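
Once monitor logging is enabled (see the procedures below), you can follow the verbose monitor output in real time. For example, using the file name generated above:

tail -f /var/log/monitors/Common_gateway_icmp-Common_10.10.12.200-0.log

Press Ctrl+C to stop following the log, and remember to disable monitor logging when you finish troubleshooting.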

Enabling monitor logging for a pool member

Impact of procedure: The /var/log directory may become full if you leave monitor logging enabled for a long period of time. Be sure to disable monitor logging after troubleshooting.

  1. Log in to the Configuration utility.
  2. Go to Local Traffic > Pools > Pool List.
  3. Select the name of the pool that contains the pool member for which you want to enable monitor logging.
  4. Select the Members tab.
  5. In the Current Members list, select the name of the pool member for which you want to enable monitor logging.
  6. For Monitor Logging, select the Enable check box.
  7. Select Update.

Enabling monitor logging for a node

Impact of procedure: The /var/log directory may become full if you leave monitor logging enabled for a long period of time. Be sure to disable monitor logging after troubleshooting.

  1. Log in to the Configuration utility.
  2. Go to Local Traffic > Nodes > Node List.
  3. Select the name of the node for which you want to enable monitor logging.
  4. For Monitor Logging, select the Enable check box.
  5. Select Update.

SNMP

When you configure the BIG-IP system to send SNMP traps and a health monitor marks a pool member or node as down or up, the system sends the following traps:

  • Pool members

    alert BIGIP_MCPD_MCPDERR_POOL_MEMBER_MON_STATUS {
    snmptrap OID=".1.3.6.1.4.1.3375.2.4.0.10"
    }
    alert BIGIP_MCPD_MCPDERR_POOL_MEMBER_MON_STATUS_UP {
    snmptrap OID=".1.3.6.1.4.1.3375.2.4.0.11"
    }

  • Nodes

    alert BIGIP_MCPD_MCPDERR_NODE_ADDRESS_MON_STATUS {
    snmptrap OID=".1.3.6.1.4.1.3375.2.4.0.12"
    }
    alert BIGIP_MCPD_MCPDERR_NODE_ADDRESS_MON_STATUS_UP {
    snmptrap OID=".1.3.6.1.4.1.3375.2.4.0.13"
    }
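
To confirm that these trap definitions are present on your system, you can search the alert configuration. The following commands are a minimal sketch and assume the default alert definitions are stored in /etc/alertd/alert.conf:

grep -A 2 'POOL_MEMBER_MON_STATUS' /etc/alertd/alert.conf
grep -A 2 'NODE_ADDRESS_MON_STATUS' /etc/alertd/alert.conf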

Verifying monitor settings

You must verify that monitor settings are properly defined for your environment. F5 recommends that in most cases the timeout value should be equal to three times the interval value, plus one. For example, the default interval/timeout ratio is 5/16 (three times 5, plus one, equals 16). This ratio prevents the monitor from marking the node as down before it sends the final check.

Simple monitors

You can use a simple monitor to verify the status of a destination node (or the path to the node through a transparent device). Simple monitors only monitor the node address itself, not individual protocols, services, or applications on a node. The BIG-IP system provides the following pre-configured simple monitor types: gateway_icmp, icmp, tcp_echo, tcp_half_open. If you determine that a simple monitor is marking a node as down, you can verify the following settings:

Note: There are other monitor settings that can be defined for simple monitors. For more information, refer to the Configuration Guide for BIG-IP Local Traffic Management. For information about how to locate F5 product manuals, refer to K98133564: Tips for searching AskF5 and finding product documentation.

  • Interval/timeout ratio

    You must configure an appropriate interval/timeout ratio for simple monitors. In most cases, the timeout value should be equal to three times the interval value, plus one. For example, the default ratio is 5/16 (three times 5 plus one equals 16). Verify that the ratio is properly defined.

  • Transparent

    A transparent monitor uses a path through the associated node to monitor the aliased destination. Verify that the destination target device is reachable and configured properly for the monitor.
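
    If you use the transparent setting, you can also review or create the monitor from the command line. The following tmsh command is a sketch only; the monitor name tcp_transparent_check and the destination 10.10.10.10:80 are hypothetical values that you would replace with your own:

    tmsh create ltm monitor tcp tcp_transparent_check defaults-from tcp destination 10.10.10.10:80 transparent enabled

    You then assign the monitor to the node (or pool member) that provides the path, while the monitor itself checks the aliased destination address.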

Extended content verification (ECV) monitors

ECV monitors use Send and Receive string settings to retrieve content from pool members or nodes. The BIG-IP system provides the following pre-configured monitor types: tcp, http, https, and https_443. If you determine that an ECV monitor is marking a pool member or node as down, you can verify the following settings:

Note: There are other monitor settings that can be defined for ECV monitors. For more information, refer to the Configuration Guide for BIG-IP Local Traffic Management. For information about how to locate F5 product manuals, refer to K98133564: Tips for searching AskF5 and finding product documentation.

Note: HTTPS monitors use OpenSSL for cipher negotiations.

  • Interval/timeout ratio

    As with simple monitors, you need to set an appropriate interval/timeout ratio for ECV monitors. In most cases, the timeout value should be equal to three times the interval value, plus one. For example, the default ratio is 5/16 (three times 5 plus one equals 16). Verify that the ratio is properly defined.

  • Send string

    The Send string is a text string that the monitor sends to the pool member. The default setting is GET /, which retrieves a default HTML file for a website. If the Send string is not properly constructed, the server may send an unexpected response and be subsequently marked as down by the monitor. For example, if the server requires the monitor request to be HTTP/1.1 compliant, you must adjust the monitor’s Send string (see the sketch after this list).

    Note: For information about modifying HTTP requests for use with HTTP or HTTPS application health monitors, refer to the following articles:

  • Receive string

    The Receive string is the regular expression representing the text string that the monitor looks for in the returned resource. ECV monitor requests may fail and mark the pool member as down if the Receive string is not configured properly. For example, if the Receive string appears too late in the server response, or the server responds with a redirect, the monitor marks the pool member as down.

    Note: For information about modifying the monitor to issue a request to a redirection target, refer to K3224: HTTP health checks may fail even though the node is responding correctly.

  • User name and password

    ECV monitors have User Name and Password fields, which can be used for resources that require authentication. Verify whether the pool member requires authentication and ensure that these fields contain valid credentials.
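
As one way to apply the Send and Receive string guidance above, you can create a custom HTTP monitor from the command line. The following tmsh command is a sketch only; the monitor name http_11_check, the Host value www.yoursite.com (taken from the telnet example later in this article), and the Receive string 200 OK are illustrative values to adapt to your application:

tmsh create ltm monitor http http_11_check defaults-from http send "GET / HTTP/1.1\r\nHost: www.yoursite.com\r\nConnection: close\r\n\r\n" recv "200 OK"

When you use an HTTP/1.1 Send string, include a Host header and a Connection: close header so that the server does not hold the monitor connection open.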

Troubleshooting monitor types

Simple monitors

If you determine that a simple monitor is marking a node as down (or if the node is bouncing), you can use the following steps to troubleshoot:

  1. Determine the IP address of the nodes being marked as down.

    You can determine the IP addresses of the nodes that the monitor is marking as down by using the Configuration utility, command line utilities, or log files. You can quickly search the /var/log/ltm file for node status messages by typing the following command:

    # grep 'Node' /var/log/ltm |grep 'status'

    Output will appear similar to the following example:

    Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070640:5: Node 10.10.65.1 monitor status down.
    Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070640:5: Node 172.24.64.4 monitor status down.
    Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 10.1.0.200 monitor status down.
    Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 10.10.65.122 monitor status down.
    Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 10.1.0.100 monitor status unchecked.
    Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 11.1.1.1 monitor status down.
    Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 172.16.65.3 monitor status down.
    Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 172.16.65.229 monitor status down.

    Note: If a large number of nodes are being marked as down (or bouncing), you can sort the results by IP addresses by typing the following command.

    grep 'Node' /var/log/ltm |grep 'status' | sort -t . -k 3,3n -k 4,4n

  2. Check connectivity to the node.

    If there are occurrences of node addresses being marked as down and not back up, or of nodes bouncing, use commands such as ping and traceroute (BIG-IP 10.x and 11.x) to check the connectivity to the nodes from the BIG-IP system. For example, if you determine that a simple monitor is marking the node address 10.10.65.1 as down, you can attempt to ping the resource from the BIG-IP system, as shown in the following example:

    # ping -c 4 10.10.65.1
    PING 10.10.65.1 (10.10.65.1) 56(84) bytes of data.
    64 bytes from 10.10.65.1: icmp_seq=1 ttl=64 time=11.32 ms
    64 bytes from 10.10.65.1: icmp_seq=2 ttl=64 time=8.989 ms
    64 bytes from 10.10.65.1: icmp_seq=3 ttl=64 time=10.981 ms
    64 bytes from 10.10.65.1: icmp_seq=4 ttl=64 time=9.985 ms

    Note: The ping output in the previous example shows high round-trip times, which may indicate a network issue or a slowly responding node.

    In addition, make sure that the node is configured to respond to the simple monitor. For example, tcp_echo is a simple monitor type that requires that you enable TCP echo service on the monitored nodes. The BIG-IP system sends a SYN segment with information that the receiving device echoes.

  3. Check the monitor settings.

    Use the Configuration utility or command line utilities to verify that the monitor settings (such as the interval/timeout ratio) are appropriate for the node.

    Type the following tmsh command to list the configuration for the icmp_new monitor:

    tmsh list /ltm monitor icmp_new

  4. Create a custom monitor (if needed).

    If you are using a default monitor and have determined that the settings are not appropriate for your environment, consider creating and testing a new monitor with custom settings.
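
    For example, the following tmsh commands sketch how you might create a custom ICMP monitor with a longer interval/timeout pair (following the 3n + 1 guideline) and assign it to a node. The monitor name icmp_custom is hypothetical, and the node address 10.10.65.1 is taken from the log example in step 1; substitute your own values:

    tmsh create ltm monitor icmp icmp_custom defaults-from icmp interval 10 timeout 31
    tmsh modify ltm node 10.10.65.1 monitor icmp_custom
    tmsh save sys config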

ECV monitors

If you determine that an ECV monitor is marking a pool member as down (or if the pool member is bouncing), you can use the following steps to troubleshoot the issue:

  1. Determine the IP address of the pool members that the monitor is marking as down by using the Configuration utility, command line utilities, or log files.

    For example, you can search the /var/log/ltm file for pool member status messages by typing the following command:

    # grep -i 'pool member' /var/log/ltm | grep 'status'

    Output appears similar to the following example:

    Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070638:5: Pool member 10.10.65.1:21 monitor status node down.
    Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070638:5: Pool member 10.10.65.1:80 monitor status node down.
    Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070638:5: Pool member 10.10.65.1:80 monitor status node down.
    Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070638:5: Pool member 10.10.65.1:80 monitor status node down.
    Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070638:5: Pool member 172.16.65.3:80 monitor status node down.
    Jan 21 15:05:05 local/3400a notice mcpd[2964]: 01070638:5: Pool member 172.16.65.3:80 monitor status unchecked.

  2. Check connectivity to the pool member.

    Check the connectivity to the pool members from the BIG-IP system using the ping or traceroute commands.

  3. Check the ECV monitor settings.

    Use the Configuration utility or command line utilities to verify that the monitor settings (such as the interval/timeout ratio) are appropriate for the pool members.

    The following tmsh command lists the configuration for the http_new monitor:

    tmsh list /ltm monitor http_new

  4. Create a custom monitor (if needed).

    If you are using a default monitor and have determined that the settings are not appropriate for your environment, consider creating and testing a new monitor with custom settings.

  5. Test the response from the application.

    Use a command line utility on the BIG-IP system to test the response from the web application. For example, the following command uses curl, timed with the time utility, to transfer data from the web server while measuring the response time:

    # time curl http://10.10.65.1

    Output appears similar to the following example:

    <html>
    <head>
    ---
    </body>
    </html>
    real 0m18.032s
    user 0m0.030s
    sys 0m0.060s

    Note: If you want to test a specific HTTP request, including HTTP headers, you can use the telnet command to connect to the pool member.

    For example:

    telnet <serverIP> <serverPort>

    At the prompt, enter an appropriate HTTP request line and HTTP headers, pressing Enter once after each line.

    For example:

    GET / HTTP/1.1 <enter>
    Host: www.yoursite.com <enter>
    Connection: close <enter>
    <enter>
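
    If the pool member is monitored with an HTTPS monitor, you can also test the TLS handshake manually. HTTPS monitors use OpenSSL for cipher negotiations (see the note earlier in this article), so a failed handshake here often explains a failing monitor. The following command is a sketch using the example address 10.10.65.1; replace the address and port with your pool member's values:

    openssl s_client -connect 10.10.65.1:443

    If the handshake completes, you can type an HTTP request (as in the telnet example above) to test the application response over TLS.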

Troubleshooting daemons related to health monitoring

The bigd process manages health checking for pool members, nodes, and services on the BIG-IP LTM system. The bigd process collects health checking status and communicates the status information to the mcpd process, which stores the data in shared memory so that the Traffic Management Microkernel (TMM) can read it. If you are having monitoring issues, you can check the memory utilization of the bigd process. If the %MEM is unusually high, or continually increases, the process may be leaking memory.

For example, to check the current memory utilization of bigd, type the ps command:

# ps aux |grep bigd

Output appears similar to the following example:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 3020 0.0 0.6 28208 10488 ? S 2010 5:08 /usr/bin/bigd
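
To see whether the %MEM value is continually increasing rather than just high, you can repeat the check at intervals. The following loop is a simple sketch that records a timestamped sample every five minutes; press Ctrl+C to stop it:

while true; do date; ps aux | grep '[b]igd'; sleep 300; done

The [b]igd pattern prevents the grep command itself from appearing in the output.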

Note: If the bigd process fails, the health check status of pool members, nodes, and services remains in its current state until the bigd process restarts. For more information, refer to K6967: When the BIG-IP LTM bigd daemon fails, the health check status of pool members, nodes, and services remain unchanged until the bigd daemon restarts.

Additionally, you can run the bigd process in debug mode. Debug logging for the bigd process is extremely verbose as it logs multiple messages for every monitor attempt. For information about running bigd in debug mode, contact F5 Technical Support.

Using tcpdump to capture the monitor traffic

If you are unable to determine the cause of a failing health monitor, you may need to perform packet captures on the BIG-IP system. To use the tcpdump command to capture monitor traffic, perform the following steps:

Impact of procedure: You should only run tcpdump packet captures during active troubleshooting sessions.

  1. Log into the BIG-IP command line.
  2. Use the following command syntax to determine the self IP address that the BIG-IP system uses for health monitoring:

    ip route get <server ip address>

    Note: Replace <server ip address> with the IP address of the destination server.

    Output appears similar to the following example, which uses the destination server address 10.20.4.100:

    ip route get 10.20.4.100
    10.20.4.100 dev internal_vlan  src 10.20.4.3
    cache

    Note: In the example, the server 10.20.4.100 is associated with VLAN internal_vlan and the self IP address for health monitoring is 10.20.4.3.

  3. Use the following tcpdump syntax to capture monitor traffic.

    tcpdump -nnvi <internal_vlan_name>:nnn -s0 -w /var/tmp/<filename>.pcap host <self-ip address>

    For example,

    tcpdump -nnvi internal_vlan:nnn -s0 -w /var/tmp/monitortraffic.pcap host 10.20.4.3

  4. When you have captured the appropriate amount of monitor traffic, press Ctrl+C to terminate the tcpdump capture.
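
    After stopping the capture, you can review it directly on the BIG-IP system before copying it off-box for deeper analysis. For example, using the file name from step 3:

    tcpdump -nnr /var/tmp/monitortraffic.pcap | head -20

    This prints the first 20 packets so you can confirm that the capture contains the monitor traffic you expect.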

Note: For more information about running tcpdump, refer to K411: Overview of packet tracing with the tcpdump utility.

Verifying connectivity between the BIG-IP system and pool members

  1. Send a ping from the BIG-IP system to a pool member.
  2. Identify any intermediate device between the BIG-IP system and the pool member, and ping that device's IP address from the BIG-IP system.
  3. If the intermediate device is a switch, check for an ARP entry in the BIG-IP ARP table by using the arp -a command (see the example commands after this list).
  4. Verify the VLAN and VLAN tagging configuration on the BIG-IP system and the connected switch or Layer 3 switch.
  5. If ICMP is blocked, perform a telnet test to the pool member's service port.
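
The following commands are a minimal sketch of these checks from the BIG-IP command line; the pool member address 10.10.65.1 and port 80 are example values taken from earlier in this article and should be replaced with your own:

ping -c 4 10.10.65.1          # basic reachability check
arp -a | grep 10.10.65.1      # confirm an ARP entry exists for the pool member
telnet 10.10.65.1 80          # test the service port if ICMP is blocked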