Questions and Answers
Amy Rich
Q We just installed a new batch of Solaris 9 machines
where I work. These machines are of different types (280Rs, V120s, V240s,
V490s, V1280s, etc.), and on some of them, it looks like ipfilter refuses to
work on the public interface. This doesn’t seem to be consistent across
hardware types, though. In each case where ipfilter is failing, I’ve reduced
the rulesets to the absolute bare minimum. Here’s an example from one of our
280Rs:
block in log level local3.info quick on hme0 all
block out log level local3.info quick on hme0 all
block in log level local3.info quick on eri0 all
block out log level local3.info quick on eri0 all
To prove that it’s picking up the rules correctly, here’s
the ipfstat -io output:
block out log level local3.info quick on hme0 all
block out log level local3.info quick on eri0 all
block in log level local3.info quick on hme0 all
block in log level local3.info quick on eri0 all
When I try to ping the machine using the private
interface (eri0), the packets are blocked, and I get a timeout on the ping.
Here’s the corresponding ipfstat -d output:
bad packets: in 0 out 0
IPv6 packets: in 0 out 0
input packets: blocked 49 passed 0 nomatch 0 counted 0 short 0
output packets: blocked 0 passed 0 nomatch 0 counted 0 short 0
input packets logged: blocked 49 passed 0
output packets logged: blocked 0 passed 0
packets logged: input 0 output 0
log failures: input 0 output 0
fragment state(in): kept 0 lost 0 not fragmented 0
fragment state(out): kept 0 lost 0 not fragmented 0
packet state(in): kept 0 lost 0
packet state(out): kept 0 lost 0
ICMP replies: 0 TCP RSTs sent: 0
Invalid source(in): 0
Result cache hits(in): 29 (out): 0
IN Pullups succeeded: 0 failed: 0
OUT Pullups succeeded: 0 failed: 0
Fastroute successes: 0 failures: 0
TCP cksum fails(in): 0 (out): 0
IPF Ticks: 265
Packet log flags set: (0)
none
And when I ping the external interface (hme0), it answers
right away! If I do another ipfstat -d, it doesn’t show any packets passed,
which is very odd. When I check the per rule hits with ipfstat -hi, I only see
blocks on eri0:
0 block in log level local3.info quick on hme0 all
154 block in log level local3.info quick on eri0 all
If I take a look at ifconfig output for both the
interfaces, they appear fine:
eri0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
inet 10.1.1.10 netmask ffffff00 broadcast 10.1.1.255
ether xx:xx:xx:xx:xx:xx
hme0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 5
inet 192.168.1.6 netmask ffffff00 broadcast 192.168.1.255
ether yy:yy:yy:yy:yy:yy
There’s nothing about packets being blocked in the log
file, either. I’m completely mystified, and hoping that you can point me in the
right direction.
A Does every host that’s having a problem have multiple
interface drivers (e.g., hme and eri, eri and ce, dmfe and ce), while every
machine that works has only one driver type (e.g., eri0 and eri1 for public
and private interfaces)? You don’t specify your version of ipfilter, but it
sounds like your problem is with pfil, the interface to the kernel, not
ipfilter itself. If you run the following command, do you see one or two QIF
entries?
ndd /dev/pfil qif_status
If you’re only seeing QIF1, then you’re missing a device
entry in the pfil configuration file /etc/opt/pfil/iu.ap. In your case, the
file should contain:
eri -1 0 pfil
hme -1 0 pfil
I suspect you’re missing the hme line. Add it, reboot the
machine, and your filters should start working properly. When you run ndd
/dev/pfil qif_status again, you should see two QIF lines, one for each
interface:
ifname ill q OTHERQ ipmp num sap hl nr nw bad copy copyfail drop notip nodata notdata
QIF3 0x0 0x30001588a58 0x30001588b48 0x0 3 806 0 6 9 0 0 0 0 0 0 0
hme0 0x30000062530 0x30000fd2530 0x30000fd2620 0x0 2 800 14 0 0 0 0 0 0 0 0 0
QIF1 0x0 0x30000fd27c0 0x30000fd28b0 0x0 1 806 0 3290 39 0 0 0 0 0 0 0
eri0 0x300000622b0 0x30000fd3490 0x30000fd3580 0x0 0 800 14 191 196 0 0 0 0 0 0 0
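If you want to check this on many hosts, a small helper can pull the interface names out of the qif_status output. This is only a sketch: the function name pfil_ifaces is made up for illustration, and it assumes the output looks like the sample above, where rows named after a real interface (hme0, eri0) are the ones pfil has attached to and the QIFn rows and the header are not.

```shell
# Hypothetical helper: read "ndd /dev/pfil qif_status" output on stdin and
# print the first column of every row that is neither the header line
# ("ifname ...") nor a QIFn placeholder row.
pfil_ifaces() {
    awk 'NF > 0 && $1 != "ifname" && $1 !~ /^QIF/ { print $1 }'
}
```

On a real host you would run: ndd /dev/pfil qif_status | pfil_ifaces. An interface that is up in ifconfig -a but missing from this list is a candidate for a missing iu.ap entry.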
If you have a large number of machines with this issue,
you might want to automate the process of adding all of the correct entries
into the pfil configuration file. On each machine, make sure all of your
interfaces are up, and run the following bit of code based on the pfil install
script:
PFIL='/etc/opt/pfil/iu.ap'
# the fields in the pfil configuration file are separated by tabs, not spaces
TAB=`printf '\t'`
if [ -f ${PFIL} ]; then
    # collect the driver names (hme, eri, ...) of all configured interfaces,
    # ignoring the loopback and fcip devices
    IFACES=`ifconfig -a |
        sed -ne 's/\([a-z0-9]*\)[0-9]\{1,\}[:0-9]*: .*/\1/p' |
        egrep -v 'fcip|lo' | sort -u`
    for IFACE in ${IFACES}; do
        # if the interface is not already in the config file, add it
        /usr/xpg4/bin/grep -q "${TAB}${IFACE}${TAB}" ${PFIL}
        if [ $? -ne 0 ] ; then
            echo "Adding ${IFACE} interface to ${PFIL}"
            printf '\t%s\t-1\t0\tpfil\n' "${IFACE}" >> ${PFIL}
        fi
    done
fi
To avoid this problem in the future, make sure all of
your ethernet interfaces are up when you’re installing the pfil package. If you
can’t have all the interfaces up for some reason (security during the
installation, perhaps), run the above script after you’ve configured all of
your interfaces.
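Before changing anything, you may want a report of which drivers are missing from the pfil configuration file. The following is a minimal, read-only sketch of that check. The function name missing_pfil_drivers is hypothetical, it reuses the sed expression from the script above, and it assumes POSIX versions of grep and awk are first in your PATH (on Solaris 9, that means /usr/xpg4/bin ahead of /usr/bin).

```shell
# Hypothetical helper: read "ifconfig -a" output on stdin and print each
# driver name (hme, eri, ...) that has no entry in the pfil configuration
# file named by $1. Nothing is modified.
missing_pfil_drivers() {
    conf=$1
    sed -n 's/\([a-z0-9]*\)[0-9]\{1,\}[:0-9]*: .*/\1/p' |
        grep -v -e fcip -e lo | sort -u |
        while read drv; do
            # the driver name is the first whitespace-separated field
            # of each configuration line
            awk '{ print $1 }' "$conf" | grep -x "$drv" >/dev/null ||
                echo "$drv"
        done
}
```

A typical invocation would be: ifconfig -a | missing_pfil_drivers /etc/opt/pfil/iu.ap. No output means every driver already has an entry.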
Q One of my clients has a Web site that they manage via
ftp. The person who was previously managing the content left the company, and
the client doesn’t have a copy of everything that’s on their Web site. They’d
like to pull down a copy of everything, so I attempted to use wget to do so.
Unfortunately, their username has an @ symbol in the name
(username@www.host.domain is their login). As far as I can tell, this breaks
wget since it’s looking for the hostname after the @ and doesn’t recognize that
it’s part of the username. Is there some other tool I can use that will mirror
the content of this site via ftp?
A You can actually use wget for this purpose, but the way
to do so isn’t obviously documented. In the section under --http-user and
--http-passwd, the man page says:
Another way to specify username and password is in the
URL itself. Either method reveals your password to anyone who bothers to run
"ps". To prevent the passwords from being seen, store them in .wgetrc or
.netrc, and make sure to protect those files from other users with "chmod". If
the passwords are really important, do not leave them lying in those files
either -- edit the files and delete them after Wget has started the
download.
So you just need to set up a .netrc file (which most
ftp-like programs will honor) that specifies the username. Your .netrc should
contain an entry such as the following:
machine www.host.domain
login username@www.host.domain
password yourpasshere
So, to grab the entire site, your wget command line might
be:
wget --passive-ftp -r -l 20 -I site ftp://www.host.domain/
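Because the .netrc file holds a cleartext password, it should be readable only by you, as the man page excerpt suggests. Here is one possible way to create the entry with safe permissions. The function name write_netrc is made up for this example, and editing the file by hand followed by chmod 600 works just as well.

```shell
# Hypothetical helper: append a .netrc entry and keep the file private.
# Usage: write_netrc FILE MACHINE LOGIN PASSWORD
write_netrc() {
    netrc=$1
    (
        umask 077    # a newly created file is readable by its owner only
        printf 'machine %s\nlogin %s\npassword %s\n' "$2" "$3" "$4" >> "$netrc"
    )
    chmod 600 "$netrc"    # also tighten a file that already existed
}
```

For the example above, you would call it as: write_netrc $HOME/.netrc www.host.domain username@www.host.domain yourpasshere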
Q We have an external array attached to a Sun box. We set
up the disks in the array as a JBOD stripe. Someone in our data center
accidentally unplugged the SCSI cable to the array, and all of the disks went
offline, taking the stripe with them. Unfortunately, we can’t get SVM to
re-recognize the stripe, even after plugging the array back in. All of the disks
are there; we can see them with format. The metastat -s vol01 output says:
vol01/d360: Mirror
Submirror 0: vol01/d361
State: Needs maintenance
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 324000768 blocks (154 GB)
vol01/d361: Submirror of vol01/d360
State: Needs maintenance
Invoke: after replacing "Maintenance" components:
metareplace vol01/d360 d19s0 <new device>
Size: 324000768 blocks (154 GB)
Stripe 0: (interlace: 32 blocks)
Device Start Block Dbase State Reloc Hot Spare
d27s0 0 No Okay No
d19s0 0 No Last Erred No
Since the data is all there, I attempted to run:
metareplace -s vol01 -e d360 /dev/did/dsk/d19s0
But that gave me the error:
Attempt to replace a component on the last running submirror
I even tried using -f to force it, but that didn’t work,
either. How can I get my device back so I can access my data?
A What you’ve done is set up a one-way mirror (a mirror
with only one sub-mirror). Your sub-mirror is a stripe with two devices in it.
As the error indicates, you can’t recover a one-way mirror by metareplacing the
same device. You have two options: delete the mirror device and rebuild it, or
delete the mirror and mount the stripe metadevice on its own without creating
a one-way mirror again. In either case, the first step is to clear the mirror:
metaclear -f -s vol01 d360
You can now either change /etc/vfstab to reference
/dev/md/vol01/dsk/d361 directly or rebuild the superfluous mirror. To rebuild
the mirror:
metainit -s vol01 d360 -m d361
Because of the circumstances, you might also have a
corrupt file system on your hands. If mount gives you an error when you try to
mount the file system, the message should tell you whether an fsck is
necessary. You can run fsck -n to see what changes need to be made without
actually modifying the data on the disk (substitute d361 for d360 if you chose
not to rebuild the mirror):
fsck -n /dev/md/vol01/rdsk/d360
If there are a lot of errors, you have no backups, and
your data is very important, you may want to try sending the disk out for data
recovery instead of using fsck to fix it. If there aren’t many errors and/or
your data is backed up or not as important, run fsck -y instead of fsck -n to
accept all changes that fsck requires.
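To keep the two recovery paths straight, it can help to print the command sequence before running anything. The sketch below only echoes commands and never executes them. The function name svm_recovery_plan and its "rebuild" and "direct" mode names are invented for illustration; the commands themselves are the ones given above.

```shell
# Hypothetical helper: print (do not run) the recovery commands for a
# one-way mirror.
# Usage: svm_recovery_plan SET MIRROR SUBMIRROR rebuild|direct
svm_recovery_plan() {
    set_name=$1
    mirror=$2
    submirror=$3
    mode=$4
    echo "metaclear -f -s $set_name $mirror"
    if [ "$mode" = rebuild ]; then
        # recreate the one-way mirror on top of the existing stripe
        echo "metainit -s $set_name $mirror -m $submirror"
        dev=$mirror
    else
        # use the stripe directly; /etc/vfstab must reference this device
        dev=$submirror
    fi
    echo "fsck -n /dev/md/$set_name/rdsk/$dev"
}
```

For the devices in this question, svm_recovery_plan vol01 d360 d361 rebuild prints the metaclear, metainit, and fsck -n lines shown above, while the direct mode omits metainit and checks d361 instead.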
Q We use a software tool called orca to monitor various
trending information on our Solaris machines. We have the same orca package
running on various versions of Solaris on various hardware platforms. On a
small handful of machines, orcallator, the collector process, refuses to start,
though. The init script we use comes with orca and is pretty basic. The
start_orcallator and stop_orcallator scripts are also those that come with orca.
The three scripts are exactly the same on all of our machines.
When I execute /etc/init.d/orcallator start, I get the
following output:
Writing data into /usr/local/orca/orcallator/myhost/
Using www access log file /var/log/httpd/access.log
Starting logging
Sending output to nohup.out
If I do a ps and look for the orcallator process,
though, nothing is running. The file /nohup.out contains the following error
message:
Fatal: subscript: 3 out of range for: GLOBAL_net[3]
What does this error mean, why am I getting it, and how
can I fix it?
A The error is a result of the GLOBAL_net array being too
small to hold all of the entries for the machine’s interfaces. The typical fix
for this, other than upgrading to the latest development branch which
theoretically fixes this issue, is to modify the file
/opt/RICHPse/etc/se_defines to include the line:
force MAX_IF 15
Increase the number as needed until you stop seeing the
error in /nohup.out and the orcallator.se process stops dying. If you use a
reasonably large number and this still fails to work (I’ve occasionally seen
this happen), you can instead modify the /usr/local/bin/start_orcallator script
to change this line:
SE_PATCHES=
to:
SE_PATCHES="-DMAX_IF=15"
If you need to go this route, you’ll also have to modify
/usr/local/bin/stop_orcallator, because it performs a pgrep looking for the
string orcall. With -DMAX_IF=15 added to the command line, the argument string
that pgrep -f matches against is cut off before it reaches orcall, so the
regex no longer matches. Modify the file so that this line:
pids=`/usr/bin/pgrep -P 1 -f orcall`
instead reads:
pids=`/usr/bin/pgrep -P 1 -f se.sparc`
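While you are experimenting with MAX_IF values, a quick check of nohup.out tells you whether the last start attempt still hit the array limit. This is a minimal sketch, and the function name has_max_if_error is hypothetical. Because nohup appends to an existing nohup.out, empty or remove the file before each attempt so that you are not looking at a stale error.

```shell
# Hypothetical helper: succeed (exit status 0) if the file named by $1
# contains the GLOBAL_net subscript error quoted in the question.
has_max_if_error() {
    grep 'subscript: [0-9]* out of range for: GLOBAL_net' "$1" >/dev/null 2>&1
}
```

For example: if has_max_if_error /nohup.out; then echo "raise MAX_IF and restart orcallator"; fi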
Amy Rich has more than a decade of Unix systems
administration experience in various types of environments. Her current roles
include that of Senior Systems Administrator for the University Systems Group
at Tufts University, Unix systems administration consultant, author, and
charter member of LOPSA. She can be reached at: qna@oceanwave.com.