HANA High Availability Cluster Testing – Part 1

01/11/2018


In the earlier posts, we covered the concepts of High Availability in SAP HANA. In this post we will test whether our HANA High Availability cluster configuration works as expected. As a quick recap, there are two basic approaches:

Auto-failover – in this method you deploy an additional host in the existing HANA system and configure it as a standby. If an active node fails, the standby host automatically takes over its role. This solution requires shared storage.

System replication – in this solution you install a separate HANA system and configure replication of data changes to it. On its own, system replication does not provide full High Availability, because the HANA database does not trigger the takeover automatically.

However, you can use the SUSE Linux High Availability features to automate the failover and enhance the base solution.
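On SUSE, this automation is typically provided by the SAPHanaSR package together with the Pacemaker cluster. As a hedged sketch (the resource and parameter names follow the rest of this post; the timeout values are illustrative assumptions, not recommendations), the resource agent that drives the takeover might be defined like this:

```shell
# Sketch only: Pacemaker resource definition for the SAPHana agent
# (SID HSP, instance 02, as used in the tests below). Timeout values
# are illustrative assumptions; tune them for your landscape.
crm configure primitive rsc_SAPHana_HSP_HDB02 ocf:suse:SAPHana \
    params SID=HSP InstanceNumber=02 \
        PREFER_SITE_TAKEOVER=true AUTOMATED_REGISTER=false \
    op monitor interval=60 role=Master timeout=700 \
    op monitor interval=61 role=Slave timeout=700

# Run the agent as a two-node master/slave clone
crm configure ms msl_SAPHana_HSP_HDB02 rsc_SAPHana_HSP_HDB02 \
    meta clone-max=2 clone-node-max=1 interleave=true
```

The two `params` values shown here are exactly the ones the test descriptions below assume.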

Testing of Cluster

The following test descriptions assume the resource agent parameters PREFER_SITE_TAKEOVER="true" and AUTOMATED_REGISTER="false".
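Before starting the tests, it is worth confirming these two values in the live cluster configuration, for example with `crm configure show`. A small sketch against a saved sample of that output (the excerpt below is hypothetical; on a real cluster node you would pipe the command itself):

```shell
# Hypothetical excerpt of `crm configure show` saved to a file;
# on a live node you would run the command and grep its output directly.
cat > /tmp/crm_config.txt <<'EOF'
primitive rsc_SAPHana_HSP_HDB02 ocf:suse:SAPHana \
    params SID=HSP InstanceNumber=02 \
        PREFER_SITE_TAKEOVER=true AUTOMATED_REGISTER=false
EOF

# Extract the two parameters the tests below depend on
grep -oE '(PREFER_SITE_TAKEOVER|AUTOMATED_REGISTER)=[a-z]+' /tmp/crm_config.txt
```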

TEST 1: STOP PRIMARY DATABASE ON NODE 1

The primary HANA database is stopped during normal cluster operation.

TEST PROCEDURE

Stop the primary HANA database gracefully as hspadm.

HANAPRD# HDB stop

RECOVERY PROCEDURE

  • Manually register the old primary (on node 1) with the new primary after takeover (on node 2) as hspadm.
  • HANAPRD# hdbnsutil -sr_register --remoteHost=HANAPRDSHD --remoteInstance=02 --replicationMode=sync --name=NODEA
  • Restart the HANA database (now secondary) on node 1 as root.
  • HANAPRD# crm resource cleanup rsc_SAPHana_HSP_HDB02 HANAPRD

Expected:

  1.  The cluster detects the stopped primary HANA database (on node 1) and marks the resource failed.
  2.  The cluster promotes the secondary HANA database (on node 2) to take over as primary.
  3.  The cluster migrates the IP address to the new primary (on node 2).
  4.  After some time, the cluster shows the sync_state of the stopped primary (on node 1) as SFAIL.
  5.  Because AUTOMATED_REGISTER=”false” the cluster does not restart the failed HANA database or register it against the new primary.
  6.  After the manual register and resource cleanup the system replication pair is marked as in sync (SOK).
  7.  The cluster “failed actions” are cleaned up after following the recovery procedure.
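The sync_state in steps 4 and 6 can be observed with `crm_mon` or with the SAPHanaSR-showAttr tool shipped with SAPHanaSR. A sketch that pulls the per-host state out of a saved sample of its output (the sample below is abridged and hypothetical; on a live node you would pipe the real command):

```shell
# Abridged, hypothetical SAPHanaSR-showAttr output saved to a file;
# on a live node: SAPHanaSR-showAttr | awk 'NR > 1 { print $1, $4 }'
cat > /tmp/showattr.txt <<'EOF'
Hosts       clone_state roles                        sync_state
HANAPRD     DEMOTED     4:S:master1:master:worker    SFAIL
HANAPRDSHD  PROMOTED    4:P:master1:master:worker    PRIM
EOF

# Print host and sync_state, skipping the header row
awk 'NR > 1 { print $1, $4 }' /tmp/showattr.txt
```

After the recovery procedure, the SFAIL entry would be expected to change to SOK.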

TEST 2: STOP PRIMARY DATABASE ON NODE 2

The primary HANA database is stopped during normal cluster operation.

 TEST PROCEDURE

Stop the primary HANA database gracefully as hspadm.

HANAPRDSHD# HDB stop

RECOVERY PROCEDURE

  • Manually register the old primary (on node 2) with the new primary after takeover (on node 1) as hspadm.
  • HANAPRDSHD# hdbnsutil -sr_register --remoteHost=HANAPRD --remoteInstance=02 --replicationMode=sync --name=NODEB
  • Restart the HANA database (now secondary) on node 2 as root.
  • HANAPRDSHD# crm resource cleanup rsc_SAPHana_HSP_HDB02 HANAPRDSHD
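The hdbnsutil register command differs between the node 1 and node 2 recovery procedures only in the remote host and the site name, so it can help to generate it from variables. A sketch (the SID, instance, and site names follow this post; the command string is printed rather than executed, so it can be reviewed before running it as hspadm):

```shell
# Build the sr_register command from variables instead of retyping it.
# SID HSP / instance 02 and the site names NODEA/NODEB follow this post.
build_register_cmd() {
  local remote_host=$1 site_name=$2
  printf 'hdbnsutil -sr_register --remoteHost=%s --remoteInstance=02 --replicationMode=sync --name=%s\n' \
      "$remote_host" "$site_name"
}

# Re-register node 1 against the new primary on node 2 (tests 1 and 3)
build_register_cmd HANAPRDSHD NODEA
# Re-register node 2 against the new primary on node 1 (tests 2 and 4)
build_register_cmd HANAPRD NODEB
```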

Expected:

  1.  The cluster detects the stopped primary HANA database (on node 2) and marks the resource failed.
  2.  The cluster promotes the secondary HANA database (on node 1) to take over as primary.
  3.  The cluster migrates the IP address to the new primary (on node 1).
  4. After some time, the cluster shows the sync_state of the stopped primary (on node 2) as SFAIL.
  5.  Because AUTOMATED_REGISTER=”false” the cluster does not restart the failed HANA database or register it against the new primary.
  6.  After the manual register and resource cleanup the system replication pair is marked as in sync (SOK).
  7.  The cluster “failed actions” are cleaned up after following the recovery procedure.

TEST 3: CRASH PRIMARY DATABASE ON NODE 1

Simulate a complete break-down of the primary database system.

TEST PROCEDURE

Kill the primary database system using signals as hspadm.

HANAPRD# HDB kill-9

RECOVERY PROCEDURE

  • Manually register the old primary (on node 1) with the new primary after takeover (on node 2) as hspadm.
  • HANAPRD# hdbnsutil -sr_register --remoteHost=HANAPRDSHD --remoteInstance=02 --replicationMode=sync --name=NODEA
  • Restart the HANA database (now secondary) on node 1 as root.
  • HANAPRD# crm resource cleanup rsc_SAPHana_HSP_HDB02 HANAPRD

Expected:

  1.  The cluster detects the stopped primary HANA database (on node 1) and marks the resource failed.
  2.  The cluster promotes the secondary HANA database (on node 2) to take over as primary.
  3.  The cluster migrates the IP address to the new primary (on node 2).
  4.  After some time, the cluster shows the sync_state of the stopped primary (on node 1) as SFAIL.
  5.  Because AUTOMATED_REGISTER=”false” the cluster does not restart the failed HANA database or register it against the new primary.
  6.  After the manual register and resource cleanup the system replication pair is marked as in sync (SOK).
  7.  The cluster “failed actions” are cleaned up after following the recovery procedure.

TEST 4: CRASH PRIMARY DATABASE ON NODE 2

Simulate a complete break-down of the primary database system.

TEST PROCEDURE

Kill the primary database system using signals as hspadm.

HANAPRDSHD# HDB kill-9

RECOVERY PROCEDURE

  • Manually register the old primary (on node 2) with the new primary after takeover (on node 1) as hspadm.
  • HANAPRDSHD# hdbnsutil -sr_register --remoteHost=HANAPRD --remoteInstance=02 --replicationMode=sync --name=NODEB
  • Restart the HANA database (now secondary) on node 2 as root.
  • HANAPRDSHD# crm resource cleanup rsc_SAPHana_HSP_HDB02 HANAPRDSHD

Expected:

  1.  The cluster detects the stopped primary HANA database (on node 2) and marks the resource failed.
  2.  The cluster promotes the secondary HANA database (on node 1) to take over as primary.
  3.  The cluster migrates the IP address to the new primary (on node 1).
  4.  After some time, the cluster shows the sync_state of the stopped primary (on node 2) as SFAIL.
  5.  Because AUTOMATED_REGISTER=”false” the cluster does not restart the failed HANA database or register it against the new primary.
  6.  After the manual register and resource cleanup, the system replication pair is marked as in sync (SOK).
  7.  The cluster “failed actions” are cleaned up after following the recovery procedure.

Continue reading in the next post: HANA High Availability Cluster Testing – Part 2.

