
MySQL 5 Cluster with Solaris™ 10 Zones/ZFS/Resource Control

Derek Crudgington

Solaris 10 introduced several pieces of technology that can benefit systems administrators, particularly when used together. Zones provide application isolation and assist in server consolidation efforts. Combining the Fair Share Scheduler (FSS) with zones gives the administrator control over the amount of system resources allocated to each zone. Running zones on top of the ZFS file system allows each zone to be set up in a matter of minutes and provides other benefits such as compression, snapshots, and self-healing.

For this configuration, I use Solaris Express Community Release build 45, which can be downloaded from:

http://www.opensolaris.org 
            
Solaris Express Community Release is the post-Solaris 10 code base, called Nevada, which will eventually become Solaris 11. I'll combine these features with MySQL 5 to show how admins can benefit from data replication, redundancy, and application isolation, with each zone restricted to only the resources it is allowed.

All zones will be installed in /zones. Setup will be as follows:

mysqlmanage zone - 192.168.0.10 
mysqldata1 zone - 192.168.0.11 
mysqldata2 zone - 192.168.0.12 
First, let's set up our interface in the global zone:

# ifconfig bcme0 plumb 
# ifconfig bcme0 192.168.0.2 up 
Zone Configuration

We will begin with creating and setting up the zones. We will loopback mount the /opt directory (using add inherit-pkg-dir) on all zones so we aren't copying a load of packages over to each zone. This also allows us to install MySQL in the global zone in /opt and access it from all zones so we don't have to install MySQL three times. To begin, create the zone that will be working as the MySQL management node; it will be called mysqlmanage:

# zonecfg -z mysqlmanage 
mysqlmanage: No such zone configured 
Use 'create' to begin configuring a new zone. 
zonecfg:mysqlmanage> create 
zonecfg:mysqlmanage> set zonepath=/zones/mysqlmanage 
zonecfg:mysqlmanage> add inherit-pkg-dir 
zonecfg:mysqlmanage:inherit-pkg-dir> set dir=/opt 
zonecfg:mysqlmanage:inherit-pkg-dir> end 
zonecfg:mysqlmanage> add net 
zonecfg:mysqlmanage:net> set address=192.168.0.10 
zonecfg:mysqlmanage:net> set physical=bcme0 
zonecfg:mysqlmanage:net> end 
zonecfg:mysqlmanage> info 
zonecfg:mysqlmanage> verify 
zonecfg:mysqlmanage> commit 
zonecfg:mysqlmanage> exit 
Create the ZFS zone file system and set the mount point to /zones. (Note: ZFS is the Zettabyte File System, an extremely powerful storage file system that provides features like unlimited snapshots, RAID-Z, compression, self-healing, and more without the traditional storage complexity. It was released in Solaris 10 Update 2.):

# zfs create mypool/zones 
# zfs set mountpoint=/zones mypool/zones 
Install the mysqlmanage zone:

# zoneadm -z mysqlmanage install 
A ZFS file system has been created for this zone. 
Preparing to install zone <mysqlmanage>. 
Now we will create the mysqldata1 and mysqldata2 zones. Since they will be configured the same, we will clone the mysqlmanage zone, which is much quicker than installing each zone separately. Cloning, introduced in Solaris Express Community Release b33, copies all configurations, packages, etc. from the source zone to the target zone. The source zone must be halted during this procedure. To show how fast cloning works, we will time it.

Configure mysqldata1 and mysqldata2 nodes:

# zonecfg -z mysqldata1 
mysqldata1: No such zone configured 
Use 'create' to begin configuring a new zone. 
zonecfg:mysqldata1> create 
zonecfg:mysqldata1> set zonepath=/zones/mysqldata1 
zonecfg:mysqldata1> add inherit-pkg-dir 
zonecfg:mysqldata1:inherit-pkg-dir> set dir=/opt 
zonecfg:mysqldata1:inherit-pkg-dir> end 
zonecfg:mysqldata1> add net 
zonecfg:mysqldata1:net> set physical=bcme0 
zonecfg:mysqldata1:net> set address=192.168.0.11 
zonecfg:mysqldata1:net> end 
zonecfg:mysqldata1> verify 
zonecfg:mysqldata1> commit 
zonecfg:mysqldata1> exit  

# zonecfg -z mysqldata2 
mysqldata2: No such zone configured 
Use 'create' to begin configuring a new zone. 
zonecfg:mysqldata2> create 
zonecfg:mysqldata2> set zonepath=/zones/mysqldata2 
zonecfg:mysqldata2> add inherit-pkg-dir 
zonecfg:mysqldata2:inherit-pkg-dir> set dir=/opt 
zonecfg:mysqldata2:inherit-pkg-dir> end 
zonecfg:mysqldata2> add net 
zonecfg:mysqldata2:net> set physical=bcme0 
zonecfg:mysqldata2:net> set address=192.168.0.12 
zonecfg:mysqldata2:net> end 
zonecfg:mysqldata2> verify 
zonecfg:mysqldata2> commit 
zonecfg:mysqldata2> exit 
Clone mysqlmanage to be mysqldata1:

# time zoneadm -z mysqldata1 clone -m copy mysqlmanage 
A ZFS file system has been created for this zone. 
Copying /zones/mysqlmanage... 
 
real    0m19.813s 
user    0m0.179s 
sys     0m2.198s 
And the same for mysqldata2:

# time zoneadm -z mysqldata2 clone -m copy mysqlmanage 
A ZFS file system has been created for this zone. 
Copying /zones/mysqlmanage... 

real    0m18.691s 
user    0m0.176s 
sys     0m2.162s 
Now /zones/mysqlmanage, /zones/mysqldata1, and /zones/mysqldata2 are set up and ready to go.

Resource Configuration

The next step is to configure the resource limitations for each zone. There are a few different ways to divide up resources: fixed CPU, varying CPU, and FSS. We will use the Fair Share Scheduler (FSS) here because we have only one CPU; FSS lets us assign each zone a number of shares, or portions, of the CPU.

In the global zone, enable the pools and create the default configuration file:

# pooladm -e && pooladm -s 
To make sure it has been enabled, check for the poold process or pools:default in SMF:
# svcs -a | grep pool 
disabled       11:53:54 svc:/system/pools/dynamic:default 
online         13:22:56 svc:/system/pools:default 
Create the resource pools: pool1 for the management node, pool2 and pool3 for the data nodes, and pool4 for the global zone:

# poolcfg -c 'create pool pool1 ( string pool.scheduler = "FSS" )'
# poolcfg -c 'create pool pool2 ( string pool.scheduler = "FSS" )'
# poolcfg -c 'create pool pool3 ( string pool.scheduler = "FSS" )'
# poolcfg -c 'create pool pool4'
If you type "pooladm" now, the pools won't show up in the running configuration; run "pooladm -c" to commit the configuration saved in /etc/pooladm.conf to the running system:

# pooladm -c 
    
Now we can type "pooladm" and see the configuration in place:

# pooladm 
[output stripped out..] 
This output shows the pool configuration now in place. You don't need to understand every detail -- just read over it. Next, set the system's default scheduling class to FSS and move all existing processes to the FSS class:

# dispadmin -d FSS 
# priocntl -s -c FSS -i all 
When you run ps -efc, you should see all processes listed as "FSS" class.

Next we will configure each zone to use the pools we created earlier, and give these pools a certain number of shares of the CPU. Here's how we will divide up the CPU:

mysqlmanage - pool1 - 10 shares (20% CPU) 
mysqldata1  - pool2 - 20 shares (40% CPU) 
mysqldata2  - pool3 - 20 shares (40% CPU) 
It is important to understand here that zone resource limitations will only take effect if the zones are all consuming to capacity. For instance, if mysqlmanage, mysqldata1, and mysqldata2 are all resource hungry, they will only go to their limits, which are 20%, 40%, and 40%. If mysqldata1 is the only node that needs resources and the others are idle, it will consume 100% of the CPU.
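The share arithmetic above can be sketched in plain shell (a minimal illustration using the share counts from the table; the numbers, not the commands, come from the article):

```shell
# Under FSS, a busy zone gets cpu_shares / total_busy_shares of the CPU.
# With all three zones busy: 10 + 20 + 20 = 50 shares total.
total=$((10 + 20 + 20))
echo "mysqlmanage: $((10 * 100 / total))%"   # 20%
echo "mysqldata1:  $((20 * 100 / total))%"   # 40%
echo "mysqldata2:  $((20 * 100 / total))%"   # 40%

# If mysqlmanage goes idle, only the busy zones' shares count:
busy=$((20 + 20))
echo "each data zone: $((20 * 100 / busy))%" # 50%
```

The same arithmetic explains the later figures once the global zone is given 10 shares: 20/60 is roughly 33% for each data zone and 10/60 roughly 16% for mysqlmanage and global.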

Starting with the mysqlmanage zone: because this process needs few resources, we will give this zone only 10 shares, which is equal to 20% of the CPU:

# zonecfg -z mysqlmanage 
zonecfg:mysqlmanage> set pool=pool1 
zonecfg:mysqlmanage> add rctl 
zonecfg:mysqlmanage:rctl> set name=zone.cpu-shares 
zonecfg:mysqlmanage:rctl> add value (priv=privileged,limit=10,action=none) 
zonecfg:mysqlmanage:rctl> end 
zonecfg:mysqlmanage> verify 
zonecfg:mysqlmanage> exit 
mysqldata1 and mysqldata2 zones will each be allocated 40% of the CPU:

# zonecfg -z mysqldata1 
zonecfg:mysqldata1> set pool=pool2 
zonecfg:mysqldata1> add rctl 
zonecfg:mysqldata1:rctl> set name=zone.cpu-shares 
zonecfg:mysqldata1:rctl> add value (priv=privileged,limit=20,action=none) 
zonecfg:mysqldata1:rctl> end 
zonecfg:mysqldata1> verify 
zonecfg:mysqldata1> exit 
# zonecfg -z mysqldata2 
zonecfg:mysqldata2> set pool=pool3 
zonecfg:mysqldata2> add rctl 
zonecfg:mysqldata2:rctl> set name=zone.cpu-shares 
zonecfg:mysqldata2:rctl> add value (priv=privileged,limit=20,action=none) 
zonecfg:mysqldata2:rctl> end 
zonecfg:mysqldata2> verify 
zonecfg:mysqldata2> exit 

Testing Resources
Boot the zones one at a time and set them up with proper configurations:

# zoneadm -z mysqlmanage boot 
# zlogin -C mysqlmanage 
[Connected to zone 'mysqlmanage' console] 
Each zone will prompt you through the initial system configuration. Repeat these steps for each zone, setting the proper locale and TCP/IP settings: mysqlmanage 192.168.0.10, mysqldata1 192.168.0.11, mysqldata2 192.168.0.12.

Once these zones are booted and running, it's a good idea to go into each of them and shut down services that are not needed. Use svcadm and check the /etc/rcX.d directories. Also make sure you can ping each zone's IP from the other zones.

Before we start setting up the MySQL cluster, we must verify that the resource limitations are working. We will test the resource limitations by using cpuhog.pl in Listing 1.

Put the script in /opt so that all of the zones can read it. Remember /opt is loopback mounted read-only in each zone. Use zlogin and log in to each zone and execute /opt/cpuhog.pl.
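Listing 1 is not reproduced here; cpuhog.pl is simply a tight loop that pins a CPU. A rough shell stand-in (my sketch, not the original listing -- the duration argument is my addition so the loop can terminate; the original runs until killed):

```shell
# cpu_hog: spin in a busy loop for $1 seconds, consuming one CPU.
# A stand-in for cpuhog.pl (Listing 1), which loops forever.
cpu_hog() {
    end=$(( $(date +%s) + $1 ))
    while [ "$(date +%s)" -lt "$end" ]; do
        : # busy-wait
    done
}

cpu_hog 1
```

Start one copy in each zone and watch the CPU column in prstat -Z from the global zone.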

In the global zone, run prstat -Z:

ZONEID    NPROC  SIZE   RSS MEMORY      TIME  CPU ZONE 
     2       34  100M   51M   4.9%   0:00:37  39% mysqldata1 
     3       34  100M   51M   5.0%   0:00:22  38% mysqldata2 
     1       34  100M   51M   4.9%   0:00:23  17% mysqlmanage 
     0       78 1017M  330M    32%   0:00:58 1.6% global 
The CPU column shows that the zones are each hovering where they should be -- 40% for both data zones, and 20% for the management zone. If we killed off the mysqlmanage cpuhog.pl, we would see mysqldata1 and mysqldata2 both go to 50%, sharing the remaining available CPU between them.

However, the global zone doesn't have any resources assigned to it. By default, the global zone gets only one share; we will give it more so that if something bizarre happens, the global zone will still have resources to use.

Let's view the shares the global zone has now:

# prctl -n zone.cpu-shares -i zone global 
zone: 0: global 
NAME    PRIVILEGE       VALUE    FLAG   ACTION         RECIPIENT 
zone.cpu-shares 
        privileged          1       -   none                   - 
        system          65.5K     max   none                   - 
Only one share is shown here; we will give it 10 (20% of the CPU):

# prctl -n zone.cpu-shares -v 10 -r -i zone global 
# prctl -n zone.cpu-shares -i zone global 
zone: 0: global 
NAME    PRIVILEGE       VALUE    FLAG   ACTION         RECIPIENT 
zone.cpu-shares 
        privileged         10       -   none                   - 
        system          65.5K     max   none                   - 
The global zone now has enough resources, and we are ready to set up the MySQL cluster. To verify, test with cpuhog.pl running in all zones: mysqldata1 and mysqldata2 should each get around 33% (20 of 60 shares); mysqlmanage and global should each get around 16% (10 of 60 shares):

ZONEID    NPROC  SIZE   RSS MEMORY      TIME  CPU ZONE 
     6       35  126M   62M   3.1%   0:00:51  29% mysqldata2 
     4       35  126M   60M   3.0%   0:00:51  29% mysqldata1 
     2       35  126M   57M   2.8%   0:00:46  19% mysqlmanage 
     0       98 1426M  558M    28%   1:34:16  18% global 
MySQL Cluster Configuration

Download the latest production release of MySQL from mysql.com; get the Max package, because it includes the cluster support that the standard package does not.

Add the mysql group and user to the global zone. We won't run MySQL in the global zone, but they must be added because the MySQL package requires them before installing:

# groupadd mysql && useradd -g mysql mysql 
Add the MySQL package. The -G flag installs to the global zone only; we don't want pkgadd copying the package into every zone, which would take a long time. Because /opt is loopback mounted in each zone, every zone can use the package anyway:

# gzip -d mysql-max-5.0.22-solaris10-x86_64.pkg.gz 
# pkgadd -G -d mysql-max-5.0.22-solaris10-x86_64.pkg 
# chgrp -R mysql /opt/mysql/mysql/* 
To have a truly redundant MySQL cluster, you must have at least three nodes. If you use only two nodes, you risk the split-brain scenario: when only one node remains up, the cluster portion (ndbd) will shut itself down. The mysqld process will still be running, but any data added will not be replicated. A common setup is two data nodes and one management node, which is what we use here. The data nodes run the ndbd process and hold all of the database information, which is replicated between data nodes. The management server runs the ndb_mgmd process, which provides the management interface and keeps the cluster in sync.

To set up the mysqlmanage zone:

# zlogin mysqlmanage 
Add the mysql group and user:

# groupadd mysql && useradd -g mysql mysql  
The mysqlmanage node only needs the ndb_mgm (client) and ndb_mgmd (server) programs. We'll start with the configuration file for ndb_mgmd and the data nodes.

Create a directory to use for the config.ini:

# mkdir /var/lib/mysql-cluster  
Create /var/lib/mysql-cluster/config.ini, which should look like Listing 2. The comments in Listing 2 explain what each setting does.

If you work with larger databases, I recommend increasing the values of MaxNoOfOrderedIndexes, MaxNoOfAttributes, MaxNoOfUniqueHashIndexes, DataMemory, and IndexMemory in the management node's config.ini. This may save you time if you run into errors importing or using larger databases.
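Listing 2 itself is not reproduced here. For reference, a minimal config.ini for this two-data-node, one-management-node layout might look like the following (a hedged sketch matching the node layout shown by ndb_mgm later in the article; the memory values are illustrative, not the article's):

```ini
[ndbd default]
NoOfReplicas=2           # each row stored on both data nodes
DataMemory=80M           # illustrative; raise for larger databases
IndexMemory=18M          # illustrative; raise for larger databases

[ndb_mgmd]
hostname=192.168.0.10    # mysqlmanage zone

[ndbd]
hostname=192.168.0.11    # mysqldata1 zone

[ndbd]
hostname=192.168.0.12    # mysqldata2 zone

[mysqld]
[mysqld]                 # two empty API slots: any host may connect
```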

Start the MySQL management daemon, ndb_mgmd. It already knows to look in /var/lib/mysql-cluster for config.ini. If you keep config.ini in a different directory, use ndb_mgmd -f <path to config.ini>:

# /opt/mysql/mysql/bin/ndb_mgmd 
# ps -ef | grep mgmd 
    root  3595  3320   0 12:42:31 pts/9   0:00 grep mgmd 
    root  3593  1215   0 12:42:29 ?       0:00 /opt/mysql/ \
                                               mysql/bin/ndb_mgmd 
Log in with the MySQL management client ndb_mgm and make sure everything looks okay:

# /opt/mysql/mysql/bin/ndb_mgm 
-- NDB Cluster -- Management Client -- 
ndb_mgm> show 
Connected to Management Server at: localhost:1186 
Cluster Configuration 
--------------------- 
[ndbd(NDB)]     2 node(s) 
id=2 (not connected, accepting connect from 192.168.0.11) 
id=3 (not connected, accepting connect from 192.168.0.12) 

[ndb_mgmd(MGM)] 1 node(s) 
id=1    @192.168.0.10  (Version: 5.0.22) 

[mysqld(API)]   2 node(s) 
id=4 (not connected, accepting connect from any host) 
id=5 (not connected, accepting connect from any host) 
-------------------------------------- 
We can see here it is waiting for connections from mysqldata1 and mysqldata2. Let's set up the data nodes and repeat the following on mysqldata1 and mysqldata2:

# groupadd mysql && useradd -g mysql mysql 
# cd /opt/mysql/mysql/ 
# scripts/mysql_install_db --user=mysql 
In each data node's /etc/my.cnf put:

[mysqld] 
ndbcluster 
# IP address of the mysqlmanage (management) server 
ndb-connectstring=192.168.0.10 
[mysql_cluster] 
# IP address of the mysqlmanage (management) server 
ndb-connectstring=192.168.0.10 
This tells our data nodes who the management server is.

Next, we will make the data directory and start the data node. You must use --initial when starting the data node for the first time or when you've made configuration changes:

# mkdir /var/lib/mysql-cluster 
# cd /var/lib/mysql-cluster/ 
# /opt/mysql/mysql/bin/ndbd --initial 
# /opt/mysql/mysql/support-files/mysql.server start 
After performing this on both mysqldata1 and mysqldata2, go back to mysqlmanage with the MySQL management client running and type "show":

ndb_mgm> show 
Cluster Configuration 
--------------------- 
[ndbd(NDB)]     2 node(s) 
id=2    @192.168.0.11  (Version: 5.0.22, Nodegroup: 0, Master) 
id=3    @192.168.0.12  (Version: 5.0.22, Nodegroup: 0) 

[ndb_mgmd(MGM)] 1 node(s) 
id=1    @192.168.0.10  (Version: 5.0.22) 

[mysqld(API)]   2 node(s) 
id=4    @192.168.0.11  (Version: 5.0.22) 
id=5    @192.168.0.12  (Version: 5.0.22) 
    
Here both data nodes have fully joined. If they still show "starting", wait until they fully join the cluster before adding any data.

Test the Cluster

We need to test the cluster to make sure it's working properly. On mysqldata1 and mysqldata2, log in to mysql and create the cluster database on each node:

# /opt/mysql/mysql/bin/mysql -u root 
mysql> create database cluster; 
Query OK, 1 row affected (0.00 sec) 
    
On mysqldata2, create the test1 table using the NDBCLUSTER storage engine:

mysql> create table test1 ( id int(11) primary key not null, \
  name varchar(300) not null ) ENGINE=NDBCLUSTER; 
Query OK, 0 rows affected (1.27 sec) 
Go back to mysqldata1 and see if the table showed up:

mysql> use cluster; 

Database changed 
mysql> show tables; 
+-------------------+ 
| Tables_in_cluster | 
+-------------------+ 
| test1             | 
+-------------------+ 
1 row in set (0.01 sec) 
Now we can test the failover on the cluster. Kill off mysqldata1's ndbd:

# pkill ndbd 
Go to mysqldata2 and try a select on the test1 table:

mysql> select * from test1; 
Empty set (0.13 sec) 
The cluster is still answering queries with one data node down, which means failover is working. If you run "show" in the MySQL management client, it will show the node as down.

Conclusion

While this type of configuration will protect you from a database failure, it will not protect you from a hardware failure, because everything is located on one physical machine. The same approach could easily be applied to other clustering solutions such as Tomcat or WebLogic, and you could go further by combining this with the Service Management Facility. With the features of Solaris 10, the possibilities are almost endless. For more information on zones and resource management, see:

http://www.sun.com/blueprints/browsesubject.html#resource
Derek Crudgington is a contractor for The InSource Group and a Solaris addict. You can check out his blog at http://hell.jedicoder.net or email him at dacrud@gmail.com.