HANA High Availability Cluster Testing – Part 2

By | 01/11/2018

This is the second post on HANA High Availability Cluster Testing. Please refer to HANA High Availability Cluster Testing – Part 1 before going through the tests below.

HANA High Availability Cluster Testing

TEST 5: CRASH PRIMARY SITE NODE (NODE 1)

Simulate a crash of the primary site node running the primary HANA database.

TEST PROCEDURE

Crash the primary node by sending a ‘fast-reboot’ system request.
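One way to do this (a sketch; any method that reboots the node immediately, without a clean shutdown, will serve) is the kernel's magic SysRq interface, run as root on node 1:

HAPRD# echo b > /proc/sysrq-trigger

This reboots the node instantly without syncing or unmounting file systems, so the running primary HANA database is lost exactly as in a hardware crash.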

RECOVERY PROCEDURE

  1. If SBD fencing is used, pacemaker will not automatically restart after the node has been fenced. In this case, clear the fencing flag on all SBD devices and then start pacemaker (see the example after this list).
     HAPRD# systemctl start pacemaker
  2. Manually register the old primary (on node 1) with the new primary after takeover (on node 2) as hspadm.
     HAPRD# hdbnsutil -sr_register --remoteHost=HAPRDSHD --remoteInstance=02 --replicationMode=sync --name=NODEA
  3. Restart the HANA database (now secondary) on node 1 by cleaning up the failed resource as root; the cluster will then start the database.
     HAPRD# crm resource cleanup rsc_SAPHana_HSP_HDB02 HAPRD
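As an illustration of step 1, clearing the fencing flag on a single SBD device might look as follows; the device path is a placeholder and must be replaced with the SBD device actually configured for this cluster:

HAPRD# sbd -d /dev/disk/by-id/<SBD-device> list
HAPRD# sbd -d /dev/disk/by-id/<SBD-device> message HAPRD clear
HAPRD# systemctl start pacemaker

The list call shows the node slots and any pending fence message; the clear message resets the slot for node 1 so that pacemaker can be started again safely.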

Expected:

  • The cluster detects the failed node (node 1) and declares it UNCLEAN and sets the secondary node (node 2) to status “partition WITHOUT quorum”.
  • The cluster fences the failed node (node 1).
  • The cluster declares the failed node (node 1) OFFLINE.
  • The cluster promotes the secondary HANA database (on node 2) to take over as primary.
  • The cluster migrates the IP address to the new primary (on node 2).
  • After some time, the cluster shows the sync_state of the stopped primary (on node 1) as SFAIL (this can be watched with the commands shown after this list).
  • If SBD fencing is used, then the manual recovery procedure will be used to clear the fencing and restart pacemaker on the node.
  • Because AUTOMATED_REGISTER=”false” the cluster does not restart the failed HANA database or register it against the new primary.
  • After the manual register and resource cleanup, the system replication pair is marked as in sync (SOK).
  • The cluster “failed actions” are cleaned up after following the recovery procedure.
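To watch these state transitions, one option (assuming the crmsh and SAPHanaSR packages from the cluster setup described in Part 1) is to check the resource state and the replication attributes from the surviving node:

HAPRDSHD# crm_mon -r -1
HAPRDSHD# SAPHanaSR-showAttr

After the fenced node is back, registered, and cleaned up, the sync_state shown by SAPHanaSR-showAttr should move from SFAIL back to SOK.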

TEST 6: CRASH SECONDARY SITE NODE (NODE 2)

Simulate a crash of the secondary site node running the primary HANA database.

TEST PROCEDURE

Crash the secondary node by sending a ‘fast-reboot’ system request.

RECOVERY PROCEDURE

  1. If SBD fencing is used, pacemaker will not automatically restart after the node has been fenced. In this case, clear the fencing flag on all SBD devices and then start pacemaker.
     HAPRDSHD# systemctl start pacemaker
  2. Manually register the old primary (on node 2) with the new primary after takeover (on node 1) as hspadm.
     HAPRDSHD# hdbnsutil -sr_register --remoteHost=HAPRD --remoteInstance=02 --replicationMode=sync --name=NODEB
  3. Restart the HANA database (now secondary) on node 2 by cleaning up the failed resource as root; the replication status check after this list can be used to confirm the result.
     HAPRDSHD# crm resource cleanup rsc_SAPHana_HSP_HDB02 HAPRDSHD
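Once node 2 is registered as secondary again, the replication status can be checked on the new primary (node 1) with the same script referenced in Test 9 below; a sketch, assuming the default hspadm environment where HDBSettings.sh is on the PATH:

HAPRD# su - hspadm -c "HDBSettings.sh systemReplicationStatus.py"

When replication is running again, the overall system replication status reported by the script should return to ACTIVE.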

Expected:

  • The cluster detects the failed secondary node (node 2) and declares it UNCLEAN and sets the primary node (node 1) to status “partition WITHOUT quorum”.
  • The cluster fences the failed secondary node (node 2).
  • The cluster declares the failed secondary node (node 2) OFFLINE.
  • The cluster promotes the secondary HANA database (on node 1) to take over as primary.
  • The cluster migrates the IP address to the new primary (on node 1).
  • After some time, the cluster shows the sync_state of the stopped primary (on node 2) as SFAIL.
  • If SBD fencing is used, then the manual recovery procedure will be used to clear the fencing and restart pacemaker on the node.
  • Because AUTOMATED_REGISTER=”false” the cluster does not restart the failed HANA database or register it against the new primary.
  • After the manual register and resource cleanup the system replication pair is marked as in sync (SOK).
  • The cluster “failed actions” are cleaned up after following the recovery procedure.

TEST 7: STOP THE SECONDARY DATABASE ON NODE 2

The secondary HANA database is stopped during normal cluster operation.

TEST PROCEDURE

Stop the secondary HANA database gracefully as sidadm.

HAPRDSHD# HDB stop
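To confirm the database is really down before the cluster reacts, the instance process list can be queried; a sketch, assuming instance number 02 as used throughout this series:

HAPRDSHD# su - hspadm -c "sapcontrol -nr 02 -function GetProcessList"

Shortly after HDB stop completes, all HDB processes should be reported as stopped (GRAY).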

RECOVERY PROCEDURE

  1. Clean up the failed resource status of the secondary HANA database (on node 2) as root.
     HAPRDSHD# crm resource cleanup rsc_SAPHana_HSP_HDB02 HAPRDSHD

Expected:

  • The cluster detects the stopped secondary database (on node 2) and marks the resource failed.
  • The cluster detects the broken system replication and marks it as failed (SFAIL).
  • The cluster restarts the secondary HANA database on the same node (node 2).
  • The cluster detects that the system replication is in sync again and marks it as ok (SOK).
  • The cluster “failed actions” are cleaned up after following the recovery procedure.

TEST 8: CRASH THE SECONDARY DATABASE ON NODE 2

TEST PROCEDURE

Kill the secondary database system using signals as sidadm.

HAPRDSHD# HDB kill-9

RECOVERY PROCEDURE

  1. Clean up the failed resource status of the secondary HANA database (on node 2) as root; the fail count check below shows what is being cleaned up.
     HAPRDSHD# crm resource cleanup rsc_SAPHana_HSP_HDB02 HAPRDSHD
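Before running the cleanup, the failure recorded by the cluster can be inspected with a one-shot monitor run that includes fail counts; a sketch using the standard pacemaker tooling:

HAPRDSHD# crm_mon -1 -f

The output should show a fail count and a failed monitor action for rsc_SAPHana_HSP_HDB02 on HAPRDSHD, which disappear again after the cleanup.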

Expected:

  • The cluster detects the stopped secondary database (on node 2) and marks the resource failed.
  • The cluster detects the broken system replication and marks it as failed (SFAIL).
  • The cluster restarts the secondary HANA database on the same node (node 2).
  • The cluster detects that the system replication is in sync again and marks it as ok (SOK).
  • The cluster “failed actions” are cleaned up after following the recovery procedure.

TEST 9: FAILURE OF REPLICATION LAN

Loss of replication LAN connectivity between the primary and secondary node.

TEST PROCEDURE

Break the connection between the cluster nodes on the replication LAN.
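One way to simulate this without touching any cabling is to drop traffic from the peer's replication address with a temporary firewall rule; a sketch, assuming 192.168.201.12 is the replication-LAN address of node 2 (an example address, not taken from this setup), run as root on node 1:

HAPRD# iptables -I INPUT -s 192.168.201.12 -j DROP

Deleting the rule again with iptables -D INPUT -s 192.168.201.12 -j DROP re-establishes the connection for the recovery procedure.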

RECOVERY PROCEDURE

  • Re-establish the connection between the cluster nodes on the replication LAN.

Expected:

  • After some time, the cluster shows the sync_state of the secondary (on node 2) as SFAIL.
  • On the primary HANA database (node 1), “HDBSettings.sh systemReplicationStatus.py” shows “CONNECTION TIMEOUT”, and the secondary HANA database (node 2) is not able to reach the primary database (node 1).
  • The primary HANA database continues to operate as “normal”, but no system replication takes place, so the secondary is no longer a valid takeover destination.
  • Once the LAN connection is re-established, HDB automatically detects connectivity between the HANA databases and restarts the system replication process.
  • The cluster detects that the system replication is in sync again and marks it as ok (SOK).

Using Maintenance Mode

Every now and then, you need to perform upgrade or maintenance tasks on individual cluster components or the whole cluster—be it changing the cluster configuration, updating software packages for individual nodes, or upgrading the cluster to a higher product version.

Using Maintenance Mode via HAWK

Applying Maintenance Mode to Nodes

Sometimes it is necessary to put single nodes into maintenance mode. A command-line alternative is sketched after this list.

  1. Start a Web browser and log in to the cluster as described in Starting Hawk and Logging In.
  2. On the Home page, select the Nodes tab.
  3. In the individual nodes’ view, click the options menu next to the node and switch it to Maintenance.
  4. This adds the instance attribute maintenance=”on” to the node. The resources previously running on the maintenance-mode node become unmanaged, and no new resources are allocated to the node until it leaves maintenance mode.
  5. After you have finished, remove the maintenance mode so the node returns to normal cluster operation (status ready):
  6. Start a Web browser and log in to the cluster as described in Starting Hawk and Logging In.
  7. On the Home page, select the Nodes tab.
  8. In the individual nodes’ view, click the options menu next to the node and switch Maintenance off.
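The same switch can be made from the crm shell on any cluster node; a minimal sketch, assuming crmsh is installed as in the cluster setup from Part 1:

HAPRD# crm node maintenance HAPRD
HAPRD# crm node ready HAPRD

The first command sets maintenance=”on” for the node (its resources become unmanaged), and the second removes the attribute again so the node returns to normal operation.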
