Wednesday, January 8, 2014

How to troubleshoot "takeover of partner is disabled due to unsynchronized log"

Issue:
You receive one or more of the following messages from your NetApp controllers on your syslog server.

2011-10-03 09:49:21 Kernel.Debug netapp2 Nov 3 09:48:21 [iwarp-vfiler@netapp2: ctrl.rdma.failConnect:debug]: Failed to connect.

2011-10-03 09:58:20 Kernel.Notice netapp2 Nov 3 09:57:20 [netapp2: cf.fsm.partnerNotResponding:notice]: Failover monitor: partner not responding

2011-10-03 10:01:08 Kernel.Warning netapp2 Nov 3 09:00:08 [netapp2: cf.takeover.disabled:warning]: Controller Failover is licensed but takeover of partner is disabled due to reason : unsynchronized log.

FilerView may report the following:
Controller failover of hostname is not possible: unsynchronized log
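To scan a saved syslog file for the three messages above, a small filter like the following may help. This is a minimal Python sketch, not a NetApp tool: the regular expression is inferred only from the sample lines in this post, and the names EMS_PATTERN, WATCHED_EVENTS, parse_ems and interconnect_events are made up for illustration.

```python
import re

# Bracketed part of the sample messages: [<source>: <event>:<severity>]: <text>
# Assumption: this layout is inferred from the three sample lines only.
EMS_PATTERN = re.compile(
    r"\[(?P<source>[^:\]]+): (?P<event>[\w.]+):(?P<severity>\w+)\]: (?P<text>.*)"
)

# Event names taken from the sample messages in this post.
WATCHED_EVENTS = {
    "ctrl.rdma.failConnect",
    "cf.fsm.partnerNotResponding",
    "cf.takeover.disabled",
}


def parse_ems(line):
    """Return source, event, severity and text for one syslog line, or None."""
    match = EMS_PATTERN.search(line)
    return match.groupdict() if match else None


def interconnect_events(lines):
    """Keep only parsed lines whose event name is one of WATCHED_EVENTS."""
    parsed = (parse_ems(line) for line in lines)
    return [event for event in parsed if event and event["event"] in WATCHED_EVENTS]


if __name__ == "__main__":
    sample = (
        "2011-10-03 09:58:20 Kernel.Notice netapp2 Nov 3 09:57:20 "
        "[netapp2: cf.fsm.partnerNotResponding:notice]: "
        "Failover monitor: partner not responding"
    )
    for event in interconnect_events([sample]):
        print(event["source"], event["event"], event["severity"], event["text"])
```

Lines that do not contain the bracketed pattern are skipped, so the filter can be run over a mixed syslog file.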

Cause:
Typical causes of the interconnect going up and down are:
  • Loose connections on the interconnect cable.
  • A bad interconnect cable.
  • A bad internal connection or port on the NetApp controller.

Solution:
In general, call NetApp Support to help diagnose and troubleshoot these error messages.  If the problem requires parts replacement, have NetApp Support perform the replacement.

More Information:
Troubleshooting steps you can take on your own.

1)  Run the following command
cf status

a.  If you receive a message similar to this, call NetApp Support as soon as possible.

netapp1 is up, takeover disabled because of reason (unsynchronized log)
netapp2 has disabled takeover by netapp1 (unsynchronized log)
Interconnect status: down.

2)  Check the interconnect cables between the two NetApp controllers.
a.  The type of connection varies by NetApp controller model. In general:
  i.  Check for loose connections.
  ii.  Check for amber lights (a green light is the normal status).
b.  If the interconnect is down, you can disconnect and reconnect the interconnect cable(s) as needed to try to bring the connection back up.
3)  Check the lighted arrows for the interconnect ports. Depending on your controller type, this step may or may not apply.
a.  If your controller uses two cables/ports (for example, the FAS 3240 uses ports c0a and c0b), there should be 4 lighted arrows corresponding to the ports as follows:
  i.  Top port: corresponds to the 2 lighted arrows pointing up.
  ii.  Bottom port: corresponds to the 2 lighted arrows pointing down.
b.  The following should therefore hold true for two-port interconnect configurations:
  i.  Two up arrows or two down arrows (colored amber) indicate a bad cable or a bad individual port.
·  In this case you should also see corresponding Link down messages in the syslog for that port, similar to the following:

netif.linkDown:info Ethernet c0b: Link down, check cable.

  ii.  One up arrow and one down arrow (colored amber) indicate a bad internal connection on one of the NetApp controllers.
·  In this case you should not see Link down messages in the syslog, but you should see link down and link up messages when you disconnect and reconnect the cable at the port.
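The arrow rules in step 3 can be written down as a small lookup. The following Python sketch only encodes the two patterns described above for two-port configurations; the function name diagnose_arrows and its parameters are hypothetical, and any pattern the post does not describe is reported as not covered rather than guessed.

```python
def diagnose_arrows(amber_up, amber_down):
    """Map the number of amber arrows to the diagnosis described in step 3.

    amber_up   -- amber arrows pointing up (top port), 0 to 2
    amber_down -- amber arrows pointing down (bottom port), 0 to 2
    """
    for count in (amber_up, amber_down):
        if count not in (0, 1, 2):
            raise ValueError("each port has exactly two arrows")

    pattern = (amber_up, amber_down)
    if pattern in ((2, 0), (0, 2)):
        # Both arrows of one port are amber: cable or that individual port.
        port = "top" if amber_up == 2 else "bottom"
        return f"bad cable or bad {port} port; expect Link down messages for that port"
    if pattern == (1, 1):
        # One arrow of each port is amber: internal connection on a controller.
        return "bad internal connection on one controller; no Link down messages expected"
    if pattern == (0, 0):
        return "no amber arrows; the arrow rules do not point to a fault"
    # Any other combination is not described in this post.
    return "pattern not covered by the rules above; call NetApp Support"


if __name__ == "__main__":
    print(diagnose_arrows(2, 0))
    print(diagnose_arrows(1, 1))
```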

Effects or ramifications of the interconnect being down
The NetApp controller interconnect is used in the failover (takeover/giveback) process. While the controller interconnect is down, this process will not work.  If a takeover is initiated while the interconnect is down, the controller will crash, requiring a manual restart of the logical controller on either of the physical NetApp controllers.
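Because a takeover attempted in this state can crash the controller, it may be worth checking the text of cf status before initiating one. The Python sketch below assumes you have already captured that output as a string by some means; the function name takeover_blockers is hypothetical, and the phrases it looks for come only from the sample output in step 1. An empty result does not prove that a takeover is safe, since the healthy output format is not assumed here.

```python
def takeover_blockers(cf_status_output):
    """Return the lines of `cf status` text that indicate a takeover would fail."""
    blockers = []
    for raw_line in cf_status_output.splitlines():
        line = raw_line.strip()
        lowered = line.lower()
        # Phrases taken from the sample output in step 1 of this post.
        if "takeover disabled" in lowered or "disabled takeover" in lowered:
            blockers.append(line)
        elif lowered.startswith("interconnect status:") and "down" in lowered:
            blockers.append(line)
    return blockers


if __name__ == "__main__":
    sample = (
        "netapp1 is up, takeover disabled because of reason (unsynchronized log)\n"
        "netapp2 has disabled takeover by netapp1 (unsynchronized log)\n"
        "Interconnect status: down.\n"
    )
    for reason in takeover_blockers(sample):
        print("Do not initiate a takeover:", reason)
```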

The NetApp controller interconnect is also used to pass Misconfigured Partner Path (also known as secondary path) IOs back to the primary controller for the given volume/LUN.  When the controller interconnect is down, this functionality is not available, so IOs over secondary paths will fail.

If this happens, you will need to correct the server configuration to use the primary paths, or disable the secondary paths as needed on your servers, until the controller interconnect is back up.

If you are using the Microsoft DSM, your servers should detect the secondary paths as unavailable automatically.

The NetApp DSM may not detect the interconnect as being down, and may instead report an error with the LUN.

Definitions
RDMA (Remote Direct Memory Access): used for high-performance data transfer between controllers configured in a High Availability (HA) pair, and carried over the controller interconnect.  This technology is part of the controller-to-controller communications that provide the takeover/giveback and secondary path functionality of an HA pair.


Known Bug
Bug ID 489576
Title FAS/V3200 series storage failover disabled due to "unsynchronized log"

Description
After some period of uptime with storage failover enabled, a FAS/V3200 series system may suddenly encounter an "unsynchronized log" condition that it cannot recover from. This will result in the loss of high-availability failover and SFO will be disabled.

This event is triggered by a timing error in the onboard 10Gb controller firmware that is responsible for storage cluster communications on this platform. This problem may occur with any FAS/V3200A or FAS/V3200AE configuration running 7.3.5 or 8.0.1 versions of Data ONTAP.

Workaround
A reboot will clear the condition temporarily.

The bug is fixed in the following versions
Data ONTAP 7.3.6RC1 (First Fixed) - Fixed
Data ONTAP 7.3.6 (GA) - Fixed
Data ONTAP 8.0.2 (GA) - Fixed
Data ONTAP 8.1RC1 (RC) - Fixed
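The version lists above can be kept as a simple lookup when checking several systems. This Python sketch uses only the versions named in the bug description and the fixed list; the function name bug_489576_status is hypothetical, and any version not listed is reported as unknown rather than inferred by comparing version numbers.

```python
# Versions named in the bug description as affected.
AFFECTED_VERSIONS = {"7.3.5", "8.0.1"}

# Versions listed above as containing the fix.
FIXED_VERSIONS = {"7.3.6RC1", "7.3.6", "8.0.2", "8.1RC1"}


def bug_489576_status(version):
    """Classify a bare Data ONTAP version string such as '8.0.1'."""
    normalized = version.strip().upper()
    if normalized in AFFECTED_VERSIONS:
        return "affected"
    if normalized in FIXED_VERSIONS:
        return "fixed"
    # Not named in this post: check the bug report instead of guessing.
    return "unknown"


if __name__ == "__main__":
    for candidate in ("7.3.5", "8.0.2", "8.1rc1", "7.3.4"):
        print(candidate, bug_489576_status(candidate))
```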

1 comment:

  1. Hi Rajat Garg,
    Thanks for nice and helpful explanation.
    My N3220A disk array shows the error below.
    What could be the cause?
    N3220A> [N3220A:cf.takeover.disabled:warning]: Controller Failover is licensed but takeover of partner is disabled due to reason : partner halted in notakeover mode.
    Thanks in Advance, Appreciate your support.