
Questions and Answers

Amy Rich

Q We just installed a new batch of Solaris 9 machines where I work. These machines are of different types (280Rs, V120s, V240s, V490s, V1280s, etc.), and on some of them, it looks like ipfilter refuses to work on the public interface. This doesn’t seem to be consistent across hardware types, though. In each case where ipfilter is failing, I’ve reduced the rulesets to the absolute bare minimum. Here’s an example from one of our 280Rs:

block in log level local3.info quick on hme0 all
block out log level local3.info quick on hme0 all
block in log level local3.info quick on eri0 all
block out log level local3.info quick on eri0 all
To prove that it’s picking up the rules correctly, here’s the ipfstat -io output:

block out log level local3.info quick on hme0 all
block out log level local3.info quick on eri0 all
block in log level local3.info quick on hme0 all
block in log level local3.info quick on eri0 all
When I try to ping the machine using the private interface (eri0), the packets are blocked, and I get a timeout on the ping. Here’s the corresponding ipfstat -d output:

bad packets:            in 0 out 0
IPv6 packets:           in 0 out 0
input packets:          blocked 49 passed 0 nomatch 0 counted 0 short 0
output packets:         blocked 0 passed 0 nomatch 0 counted 0 short 0
input packets logged:   blocked 49 passed 0
output packets logged:  blocked 0 passed 0
packets logged:         input 0 output 0
log failures:           input 0 output 0
fragment state(in):     kept 0  lost 0  not fragmented 0
fragment state(out):    kept 0  lost 0  not fragmented 0
packet state(in):       kept 0  lost 0
packet state(out):      kept 0  lost 0
ICMP replies:  0        TCP RSTs sent:  0
Invalid source(in):     0
Result cache hits(in):  29      (out):  0
IN Pullups succeeded:   0       failed: 0
OUT Pullups succeeded:  0       failed: 0
Fastroute successes:    0       failures:       0
TCP cksum fails(in):    0       (out):  0
IPF Ticks:      265
Packet log flags set:   (0)
       none
And when I ping the external interface (hme0), it answers right away! If I do another ipfstat -d, it doesn’t show any packets passed, which is very odd. When I check the per rule hits with ipfstat -hi, I only see blocks on eri0:

0 block in log level local3.info quick on hme0 all
154 block in log level local3.info quick on eri0 all
If I take a look at the ifconfig output for both interfaces, they appear fine:

eri0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
        inet 10.1.1.10 netmask ffffff00 broadcast 10.1.1.255
        ether xx:xx:xx:xx:xx:xx
hme0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 5
        inet 192.168.1.6 netmask ffffff00 broadcast 192.168.1.255
        ether yy:yy:yy:yy:yy:yy
There’s nothing about packets being blocked in the log file, either. I’m completely mystified, and hoping that you can point me in the right direction.

A Does every host that’s having a problem have multiple interface driver types (e.g., hme and eri, eri and ce, dmfe and ce), while every machine that works has only one driver type (e.g., eri0 and eri1 for public and private interfaces)? You don’t specify your version of ipfilter, but it sounds like your problem lies with pfil, ipfilter’s interface to the kernel, not with ipfilter itself. If you run the following command, do you see one QIF entry or two?

ndd /dev/pfil qif_status
If you’re only seeing QIF1, then you’re missing a device entry in the pfil configuration file /etc/opt/pfil/iu.ap. In your case, the file should contain:

eri -1 0 pfil
hme -1 0 pfil 
I suspect you’re missing the hme line. Add it, reboot the machine, and your filters should start working properly. When you run ndd /dev/pfil qif_status again, you should see two QIF lines, one for each interface:

ifname ill q OTHERQ ipmp num sap hl nr nw bad copy copyfail drop notip nodata notdata
QIF3 0x0 0x30001588a58 0x30001588b48 0x0 3 806 0 6 9 0 0 0 0 0 0 0
hme0 0x30000062530 0x30000fd2530 0x30000fd2620 0x0 2 800 14 0 0 0 0 0 0 0 0 0
QIF1 0x0 0x30000fd27c0 0x30000fd28b0 0x0 1 806 0 3290 39 0 0 0 0 0 0 0
eri0 0x300000622b0 0x30000fd3490 0x30000fd3580 0x0 0 800 14 191 196 0 0 0 0 0 0 0
If you have a large number of machines with this issue, you might want to automate the process of adding all of the correct entries into the pfil configuration file. On each machine, make sure all of your interfaces are up, and run the following bit of code based on the pfil install script:

PFIL='/etc/opt/pfil/iu.ap'
if [ -f ${PFIL} ]; then
  # ignore the loopback and fcip devices
  IFACES=`ifconfig -a | \
    sed -ne 's/\([a-z0-9]*\)[0-9]\{1,\}[:0-9]*: .*/\1/p' | \
    egrep -v 'fcip|lo' | sort -u`
  for IFACE in ${IFACES}; do
    # if the interface is not already in the config file, add it
    # make sure you use tabs, not spaces below
    /usr/xpg4/bin/grep -q "	${IFACE}	" ${PFIL}
    if [ $? -ne 0 ]; then
      echo "Adding ${IFACE} interface to ${PFIL}"
      # make sure you use tabs, not spaces below
      echo "	${IFACE}	-1	0	pfil" >> ${PFIL}
    fi
  done
fi
To avoid this problem in the future, make sure all of your ethernet interfaces are up when you’re installing the pfil package. If you can’t have all the interfaces up for some reason (security during the installation, perhaps), run the above script after you’ve configured all of your interfaces.
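If you’re curious what the sed expression in that script is doing, it strips the trailing instance number from each interface name so the script ends up with one entry per driver. Here’s a stand-in demonstration that pipes sample ifconfig-style lines (rather than live output) through the same pipeline:

```shell
# Sample `ifconfig -a` header lines stand in for live output here.
# Each interface name is reduced to its driver name (hme0 -> hme),
# and the loopback and fcip entries are filtered out.
printf '%s\n' \
  'lo0: flags=1000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4> mtu 8232 index 1' \
  'hme0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 5' \
  'eri0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4' |
  sed -ne 's/\([a-z0-9]*\)[0-9]\{1,\}[:0-9]*: .*/\1/p' |
  egrep -v 'fcip|lo' | sort -u
# prints:
# eri
# hme
```

Running a snippet like this before touching iu.ap is a cheap way to confirm the pipeline picks up exactly the driver names you expect on a given box.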

Q One of my clients has a Web site that they manage via ftp. The person who was previously managing the content left the company, and the client doesn’t have a copy of everything that’s on their Web site. They’d like to pull down a copy of everything, so I attempted to use wget to do so. Unfortunately, their username has an @ symbol in the name (username@www.host.domain is their login). As far as I can tell, this breaks wget since it’s looking for the hostname after the @ and doesn’t recognize that it’s part of the username. Is there some other tool I can use that will mirror the content of this site via ftp?

A You can actually use wget for this purpose, but the way to do it isn’t obvious from the documentation. In the section on --http-user and --http-passwd, the man page says:

    Another way to specify username and password is in the URL itself. Either method reveals your password to anyone who bothers to run "ps". To prevent the passwords from being seen, store them in .wgetrc or .netrc, and make sure to protect those files from other users with "chmod". If the passwords are really important, do not leave them lying in those files either -- edit the files and delete them after Wget has started the download.

So you just need to set up a .netrc file (which most ftp-like programs will honor) that specifies the username. Your .netrc should contain an entry such as the following:

machine www.host.domain
login username@www.host.domain
password yourpasshere
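For example, you could create the file like this (the host, login, and password are the placeholders from your question; the chmod is the protection the man page excerpt above recommends):

```shell
# Host, login, and password below are placeholders; substitute your own.
cat > "$HOME/.netrc" <<'EOF'
machine www.host.domain
login username@www.host.domain
password yourpasshere
EOF
# Keep the credentials private, per the man page's chmod advice
chmod 600 "$HOME/.netrc"
```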
So, to grab the entire site, your wget command line might be:

wget --passive-ftp -r -l 20 -I site ftp://www.host.domain/
Q We have an external array attached to a Sun box. We set up the disks in the array as a JBOD stripe. Someone in our data center accidentally unplugged the SCSI cable to the array, and all of the disks fell off line, taking the stripe with it. Unfortunately, we can’t get SVM to re-recognize the stripe, even after we plug the array back in. All of the disks are there; we can see them with format. The metastat -s vol01 output says:

vol01/d360: Mirror
   Submirror 0: vol01/d361
     State: Needs maintenance
   Pass: 1
   Read option: roundrobin (default)
   Write option: parallel (default)
   Size: 324000768 blocks (154 GB)
vol01/d361: Submirror of vol01/d360
   State: Needs maintenance
   Invoke: after replacing "Maintenance" components:
               metareplace vol01/d360 d19s0 <new device>
   Size: 324000768 blocks (154 GB)
   Stripe 0: (interlace: 32 blocks)
       Device   Start Block  Dbase        State Reloc Hot Spare
       d27s0           0     No            Okay   No  
       d19s0           0     No      Last Erred   No  
Since the data is all there, I attempted to run:

metareplace -s vol01 -e d360 /dev/did/dsk/d19s0
But that gave me the error:

Attempt to replace a component on the last running submirror
I even tried using -f to force it, but that didn’t work, either. How can I get my device back so I can access my data?

A What you’ve done is set up a one-way mirror (a mirror with only one submirror), where the submirror is a stripe across two devices. As the error indicates, you can’t recover a one-way mirror by metareplacing a device with itself. You have two options: delete the mirror device and rebuild it, or delete the mirror and simply mount the stripe metadevice on its own, without re-creating the one-way mirror. In either case, the first step is to clear the mirror:

metaclear -f -s vol01 d360
You can now either change /etc/vfstab to reference /dev/md/vol01/dsk/d361 directly or rebuild the superfluous mirror. To rebuild the mirror:

metainit -s vol01 d360 -m d361
Because of the circumstances, you might also have a corrupt file system on your hands. If you get an error when you try to mount the file system, it should tell you whether an fsck is necessary. You can run fsck without accepting any changes to get a sense of what would need to be fixed, without actually modifying the data on disk (substitute d361 for d360 if you chose not to rebuild the mirror):

fsck -n /dev/md/vol01/rdsk/d360
If there are a lot of errors, you have no backups, and your data is very important, you may want to try sending the disk out for data recovery instead of using fsck to fix it. If there aren’t many errors and/or your data is backed up or not as important, run fsck -y instead of fsck -n to accept all changes that fsck requires.

Q We use a software tool called orca to monitor various trending information on our Solaris machines. We have the same orca package running on various versions of Solaris on various hardware platforms. On a small handful of machines, orcallator, the collector process, refuses to start, though. The init script we use comes with orca and is pretty basic. The start_orcallator and stop_orcallator scripts are also those that come with orca. The three scripts are exactly the same on all of our machines.

When I execute /etc/init.d/orcallator start, I get the following output:

Writing data into /usr/local/orca/orcallator/myhost/
Using www access log file /var/log/httpd/access.log
Starting logging
Sending output to nohup.out
If you do a ps and look for the orcallator process, though, nothing is running. The file /nohup.out contains the following error message:

Fatal: subscript: 3 out of range for: GLOBAL_net[3]
What does this error mean, why am I getting it, and how can I fix it?

A The error is a result of the GLOBAL_net array being too small to hold entries for all of the machine’s interfaces. The typical fix, short of upgrading to the latest development branch (which theoretically fixes this issue), is to modify the file /opt/RICHPse/etc/se_defines to include the line:

force MAX_IF 15
Increase the number as needed until you stop seeing the error in /nohup.out and the orcallator.se process stops dying. If you use a reasonably large number and this still fails to work (I’ve occasionally seen this happen), you can instead modify the /usr/local/bin/start_orcallator script to change this line:

SE_PATCHES=
to:

SE_PATCHES="-DMAX_IF=15"
If you need to go this route, you’ll also have to modify /usr/local/bin/stop_orcallator, because it performs a pgrep looking for the string orcall. With the addition of the -DMAX_IF=15 to the command line, the output from pgrep is too long and the regex for orcall won’t match. Modify the file so that this line:

pids=`/usr/bin/pgrep -P 1 -f orcall`
instead reads:

pids=`/usr/bin/pgrep -P 1 -f se.sparc`
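If you have several machines to patch, both edits are mechanical enough to script with sed. The sketch below runs the substitutions against stand-in copies of the two stock lines; on a real host you’d apply the same sed commands to /usr/local/bin/start_orcallator and /usr/local/bin/stop_orcallator (after backing them up):

```shell
# Stand-in copies of the stock lines from the two scripts
printf 'SE_PATCHES=\n' > /tmp/start_orcallator.demo
printf 'pids=`/usr/bin/pgrep -P 1 -f orcall`\n' > /tmp/stop_orcallator.demo

# Substitute the MAX_IF define and the pgrep pattern
sed 's/^SE_PATCHES=$/SE_PATCHES="-DMAX_IF=15"/' /tmp/start_orcallator.demo
sed 's/-f orcall`/-f se.sparc`/' /tmp/stop_orcallator.demo
# prints:
# SE_PATCHES="-DMAX_IF=15"
# pids=`/usr/bin/pgrep -P 1 -f se.sparc`
```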
Amy Rich has more than a decade of Unix systems administration experience in various types of environments. Her current roles include that of Senior Systems Administrator for the University Systems Group at Tufts University, Unix systems administration consultant, author, and charter member of LOPSA. She can be reached at: qna@oceanwave.com.