There has been a lot of interest in Disaster Recovery and Cloud Computing. Specifically, is it possible to use a public cloud as a Disaster Recovery site? The answer is not that simple. First, an on-premises production system will be behind at least one firewall and often more! Second, the traffic to and from the public cloud is exposed to all of the networks in between (and possibly some untrustworthy people!). Encryption is an absolute must for any public cloud based DR! DB2’s HADR is an industry-leading High Availability and Disaster Recovery technology, but how well will it work under these conditions and restrictions? Let's find out...
HADR requires two direct TCP connections: one from the primary to the standby and the other from the standby to the primary. The two connections are required because HADR is a symmetrical configuration in that the primary and standby can switch roles. However, can you imagine a connection from the raw Internet directly through the corporate firewall into your production system? Yeah right! A production environment may be able to connect out to the Internet but not vice versa. We need a secure tunnel through the Internet in order to use HADR... fortunately ssh provides what we need for this experiment.
SSH provides two options for tunneling: -L and -R. The -L option sets up a local port that is tunneled to a remote system/port. The -R option sets up a remote port that is tunneled back to a local system/port. Together, these can facilitate a direct TCP/IP connection to and from our DR site and production server.
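As a rough sketch of the syntax, the snippet below assembles (but does not run) a single ssh command that sets up both forwards. The port numbers are the ones used later in this post. The mapping of port 60002 to port 60001 on each side, the `DR_USER` and `DR_HOST` placeholders, and the use of `-N` are assumptions for illustration, not necessarily the exact command used in this experiment.

```shell
#!/bin/bash
# Hypothetical placeholders: substitute your own DR login and host name.
DR_USER="user"
DR_HOST="dr-host.example.com"

# -L local_port:dest_host:dest_port
#    The ssh client listens on local_port; traffic is forwarded through the
#    tunnel to dest_host:dest_port as seen from the remote side.
LOCAL_FWD="60002:127.0.0.1:60001"

# -R remote_port:dest_host:dest_port
#    sshd listens on remote_port on the remote side; traffic is forwarded
#    back through the tunnel to dest_host:dest_port as seen from the local side.
REMOTE_FWD="60002:127.0.0.1:60001"

# -N: do not run a remote command, only forward ports.
TUNNEL_CMD="ssh -N -L $LOCAL_FWD -R $REMOTE_FWD $DR_USER@$DR_HOST"

# Print the command instead of running it, so it can be reviewed first.
echo "$TUNNEL_CMD"
```

With a mapping like this, each side talks to 127.0.0.1:60002 and the traffic comes out on port 60001 of the other machine, which is why both simulator commands can use the loopback address.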
There were two systems involved in the DR to cloud experiment:
On-premise "production" system: production1.ibm.com (system name not important)
Remote cloud DR system: ec2-174-129-117-64.compute-1.amazonaws.com
db2inst1@production1:~> ssh -i db2cloudkey.pem
The diagram below gives a bit more information on how this works under the covers. For the ssh -L command, the ssh connection goes from our production system to the cloud (black line). The ssh executable itself listens on port 60002 and forwards traffic to the cloud (blue line). For ssh -R, the ssh connection still goes from the production system to the cloud (black line) but in this case, the ssh daemon (sshd) listens on port 60002 and forwards traffic back to the production system (blue line).
In any case, with the tunnels in place, I tried out the HADR simulator to test the waters. On the production system, I ran the HADR simulator as follows:
db2inst1@production1:~> ./simhadr.Linux -role P -lhost 127.0.0.1 -lport 60001 -rhost 127.0.0.1 -rport 60002 -n 100 -syncmode ASYNC
On the cloud DR system, I ran the standby side of the simulator:
db2inst1@domU-12-31-39-07-B0-61:~> ./simhadr.Linux -role S -rhost 127.0.0.1 -rport 60002 -lhost 127.0.0.1 -lport 60001
Once again, the two IP addresses are the same (127.0.0.1) because all traffic goes through our secure tunnel.
With this ssh tunneling method, there is a bit of a quirk with the HADR simulator. If I start the standby simulator first, it connects to one of the tunneled ports, receives the following error message and then exits:
Zero byte received. Remote end closed connection.
This is because the ssh daemon is listening on the tunneled port but has no connection on the other side of the socket (an artifact of how the tunneling works). This is why I said "semi-normal TCP/IP connection" before. So instead of getting a "connection refused" (what the HADR simulator expects), the HADR simulator gets a successful connection followed by a zero-byte receive. It would be interesting to see how HADR itself reacts to this error (hopefully it is more forgiving than the simulator!). In any case, we can still test the simulator by starting the primary first.
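The difference between "connection refused" and "connected, then closed" can be seen with a small probe. The sketch below is a hypothetical helper (`port_accepts` is not part of the HADR simulator) that assumes bash and its `/dev/tcp` feature. It only reports whether a TCP connect succeeds, so against a tunneled port it would report success even when nothing is listening at the far end of the tunnel.

```shell
#!/bin/bash
# port_accepts HOST PORT
# Returns 0 if a TCP connection to HOST:PORT can be opened, non-zero otherwise.
# Uses bash's /dev/tcp pseudo-device; the subshell closes the socket on exit.
port_accepts() {
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Example: probe the tunneled port used in this post.
if port_accepts 127.0.0.1 60002; then
    echo "60002 accepts connections (this may be only the tunnel endpoint)"
else
    echo "60002 refused the connection"
fi
```

A check like this is enough to tell whether the tunnel itself is up, but not whether the simulator on the other side has started, which is why the start order still matters.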
And it works... The simulation started right away and was completely unaware of the secure tunnel. Here is the table of average results from a series of 10 benchmarks:
Sync mode   Throughput
SYNC        1.83 MB/s
NEARSYNC    1.84 MB/s
ASYNC       4.84 MB/s
So... ASYNC is about 2.6x faster than SYNC. This makes sense... the latency to the cloud will hurt a synchronous workload. The difference between SYNC and NEARSYNC is negligible, also because of the latency to the cloud. Changes to the network options in the HADR simulator had little effect on throughput (~4%) for this configuration, although they could be more important for other configurations and network speeds.
It is also worth noting that the connection speed to the DR site will depend heavily on the networks in between. Out of curiosity, I also tested point-to-point speeds from different cities around North America and saw everything from very poor performance (80KB/s) to very good performance (34MB/s) when measured from city to city using the Internet.
Summary
The ssh tunnels and the HADR simulator worked well throughout all of the tests and matched the raw speed of the network connection. Although this isn't a thoroughly tested configuration, it was quick and easy to configure DR to the public cloud and it worked well, at least for the HADR simulator. DR to a public cloud certainly seems to be a plausible option, at least from a technical point of view.
On the down side, this configuration is far from perfect. If the ssh tunnels were to go down, the HADR setup would go down with them. In addition, nothing special was done to protect the data once it was on the EC2 instance. We could have encrypted the data in DB2 or we could have tested this with Amazon's newly announced Virtual Private Cloud (which has better network isolation than the host firewall). Lastly, this experiment was done with the HADR simulator and not the real HADR. In part #2, we'll test HADR itself...