OpenBSD bug: CARP + Arpbalance on i386 and IPv4
Introduction
This is a brief article outlining the bugs found in the arpbalance code in
every release of OpenBSD since 3.5. The problem in 3.5 was replaced with a
different problem in 3.6 onwards. I have written patches (below) which
correct these problems. Please note: the original subnet addresses in this
documentation have been replaced with "10.0.0." for privacy purposes.
This article only addresses issues in the IPv4 portion of the code, and no
assumptions are made as to whether the problems outlined below affect the IPv6
subsystem.
The systems used to detect and diagnose these problems were both Intel machines.
The first setup is a load-balancing firewall, with 3 physical interfaces. For
load balancing, this requires a total of 6 CARP interfaces.
The second setup consisted of two basic PIII machines with single interfaces,
requiring only two CARP interfaces for load balancing.
DMESGs for these systems can be found at the end of this article.
Acknowledgements
I would like to acknowledge Jason Dixon for his initial support and the time
he gave me at the beginning to help me try to fix my issue.
I would also like to acknowledge Sorot Panichprecha for his assistance with
C, and for his help in determining the problem.
Problem Description (3.5)
For /25 and /24 subnets, the arpbalance code in the OpenBSD 3.5 kernel does
not function as documented. Since this is no longer a current problem, it will
not be explained in detail here; suffice it to say that the source address of
the incoming packet was not being converted from network byte order to host
byte order. On i386, this means that the order of the bytes in the address
needs to be reversed before the host uses it as a number. Since the bytes were
not being reversed, the numbers did not represent their proper addresses, and
it turned out that, for the subnets mentioned above, ALL addresses were being
interpreted by the system as either all-odd or all-even numbers. For an
arp-balancing system of only two hosts, this means that only one host was
ever selected for use.
We should mention that if the CARP interfaces had been facing the internet,
this problem would not have surfaced, as hosts from many different networks
would have accessed the CARP host. It was discovered that the last digit of
the first portion of the IP address (i.e. __x.___.___.___) was the number that
decided which host serviced the request.
This was corrected by using the function ntohl() in the carp_iamatch() code before
performing a modulus on the source address. This then ordered the number
correctly for i386 architectures, such that it was the last digit of the last
portion (i.e. ___.___.___.__x) which determined which host serviced the request,
which is more desirable for subnets. A patch to apply this to the 3.5 kernel
source can be downloaded below.
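The effect of the missing conversion can be reproduced in a few lines of
user-space C. This is only a sketch, not the kernel code: the helper names
as_read_on_i386(), in_host_order() and pick_host() are made up for this
illustration, and the little-endian view is built by hand so that the example
behaves the same on any machine. In the kernel, the conversion itself is what
ntohl() provides.

```c
#include <assert.h>
#include <stdint.h>

/*
 * The 32-bit value an i386 (little-endian) host sees when it reads the
 * four network-order bytes a.b.c.d of an IPv4 address WITHOUT converting:
 * the first octet ends up in the low-order byte.
 */
uint32_t as_read_on_i386(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return (uint32_t)a | ((uint32_t)b << 8) |
           ((uint32_t)c << 16) | ((uint32_t)d << 24);
}

/*
 * The same address in host order, i.e. what ntohl() yields on i386:
 * the last octet ends up in the low-order byte.
 */
uint32_t in_host_order(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8) | (uint32_t)d;
}

/* Hypothetical selector: index of the balancing host for a given key. */
unsigned pick_host(uint32_t key, unsigned nhosts)
{
    assert(nhosts > 0);
    return key % nhosts;
}
```

With two hosts, pick_host(as_read_on_i386(10, 0, 0, 73), 2) and
pick_host(as_read_on_i386(10, 0, 0, 4), 2) both give 0, because only the first
portion (10) reaches the low-order bit. Using in_host_order() instead, the same
two sources give 1 and 0 respectively.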
Problem Description (3.6+)
For a /25 and /24 subnet, the arpbalance option did not work as documented.
Further to this, logs frequently reported that the CARP IP was duplicated on the
network.
The carp_iamatch() function from sys/netinet/ip_carp.c had been re-written and
differs from the 3.5 kernel, so the 3.5 problem no longer applies to this source.
The function prototype had also changed, passing the source MAC address instead of
the source packet header, amongst other things.
The insertion of the following line into carp_iamatch() revealed the problem:
    log(LOG_DEBUG, "Address being modulated: %u (%s)\n",
        ia->ia_addr.sin_addr.s_addr,
        inet_ntoa(ia->ia_addr.sin_addr));
where ia->ia_addr.sin_addr.s_addr is what is being modulated against to
select which virtual host should handle this request, further down in the code.
The debug log showed the following:
/bsd: Address being modulated: 2114368899 (10.0.0.126)
/bsd: Address being modulated: 2114368899 (10.0.0.126)
/bsd: Address being modulated: 2114368899 (10.0.0.126)
/bsd: duplicate IP address 10.0.0.126 sent from ethernet address 00:00:5e:00:01:15
Although this doesn't seem out of the ordinary, the IP addresses being modulated
are actually the addresses of the CARP interfaces receiving the packet, NOT
the source IP of the packet. For proof:
[root@secure-master ~]# ifconfig carp
...
carp11: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
carp: MASTER carpdev em1 vhid 11 advbase 1 advskew 0
inet 10.0.0.126 netmask 0xffffff80 broadcast 10.0.0.127
...
carp21: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
carp: BACKUP carpdev em1 vhid 21 advbase 1 advskew 100
inet 10.0.0.126 netmask 0xffffff80 broadcast 10.0.0.127
(only relevant carp entries shown)
The actual sources for those log entries were 10.0.0.73, 10.0.0.4 and
10.0.0.23; each entry was generated by SSHing from one of those hosts to
10.0.0.126.
This means that both hosts were using a fixed IP to modulate against, once
again resulting in only one host being chosen to handle requests, regardless
of the source IP. This also seems to have resulted in the "duplicate IP
address" message shown in the log above.
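The behaviour described above can be summarised with a small user-space
sketch. It is not the kernel code and is deliberately simplified:
pick_on_interface_addr() and pick_on_source_addr() are hypothetical helpers,
addresses are plain host-order integers built with a made-up ipv4() helper,
and the selection is reduced to a bare modulus.

```c
#include <assert.h>
#include <stdint.h>

/* Build a host-order IPv4 address from its four octets. */
uint32_t ipv4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8) | (uint32_t)d;
}

/*
 * Simplified view of the 3.6+ behaviour: the key is the CARP interface's
 * own address, so the source address has no influence on the result.
 */
unsigned pick_on_interface_addr(uint32_t src, uint32_t ifaddr, unsigned nhosts)
{
    (void)src;                  /* never consulted */
    assert(nhosts > 0);
    return ifaddr % nhosts;
}

/* What arpbalance is meant to do: key on the source of the request. */
unsigned pick_on_source_addr(uint32_t src, uint32_t ifaddr, unsigned nhosts)
{
    (void)ifaddr;               /* irrelevant to the choice */
    assert(nhosts > 0);
    return src % nhosts;
}
```

Calling pick_on_interface_addr() with the three sources listed above and the
interface address 10.0.0.126 returns the same index every time, whereas
pick_on_source_addr() spreads the same three sources over both hosts.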
Solution 1: MAC-Based Balancing
It was noted that, even though the source MAC address was being passed to
carp_iamatch(), it was never actually used. Secondly, since the source IP of
the packet is no longer being passed to carp_iamatch(), one's first thought
would be to modulate on the source MAC. This was implemented easily enough,
and a patch which works on 3.6 through 3.8 can be downloaded below.
Only the last portion of the MAC address (i.e. __:__:__:__:__:xx) was
considered, which makes this fairly efficient. Anything more than this is
probably just going to waste cycles, as it's the last digits that determine
the result of the modulus (strictly, this holds when the number of hosts is a
power of two, such as the two-host setups used here).
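The idea can be sketched in user-space C as follows. This is an illustration
under my own naming, not an extract from the patch: pick_by_mac() and MAC_LEN
are hypothetical, and the real code works on the kernel's own structures.

```c
#include <assert.h>
#include <stdint.h>

#define MAC_LEN 6   /* an Ethernet address is six octets long */

/*
 * Select a balancing host using only the last octet of the source MAC.
 * The key is therefore always in the range 0..255.
 */
unsigned pick_by_mac(const uint8_t mac[MAC_LEN], unsigned nhosts)
{
    assert(nhosts > 0);
    return mac[MAC_LEN - 1] % nhosts;
}
```

For example, with two hosts, a source MAC of 00:04:23:b3:6a:f6 maps to host 0
and 00:04:23:b0:4c:e9 maps to host 1. Because the key never exceeds 255, any
host index above 255 could never be selected.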
There are caveats with this solution. Firstly, since the last portion is
only 8 bits, this limits arp-balancing to 256 virtual hosts.
Although this isn't a problem in practice (using 256 virtual hosts is probably
not the best solution to begin with), it is a limitation that the IP-based
solution does not suffer from.
Secondly, in my environments I happened, by chance, to have a large number of
MAC addresses whose last portion was even. This resulted in uneven
distribution of traffic between the load-balancing hosts.
Solution 2: IP-Based Balancing
I preferred to balance based on IP, because at least the traffic can be
somewhat equally distributed if the range of IPs in use is being filled
consecutively.
Unfortunately, this required the modification of the carp_iamatch() function
prototype, along with some of its source, and modification of
sys/netinet/if_ether.c so carp_iamatch() was called correctly.
This now forces the function to modulate based on source IP, as per the
original function in the 3.5 kernel. This also removes the restriction
of the first solution being limited to 256 virtual hosts.
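To illustrate why consecutive source addresses balance evenly, here is a short
user-space sketch. It is not taken from the patches: count_served() is a
hypothetical helper, addresses are host-order integers, and the selection is
reduced to a plain modulus on the source address.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count how many source addresses in [first, last] (host-order IPv4
 * addresses, inclusive) would be served by balancing host `host`
 * when the host index is simply (source address % nhosts).
 */
unsigned count_served(uint32_t first, uint32_t last,
                      unsigned host, unsigned nhosts)
{
    unsigned count = 0;

    assert(nhosts > 0 && first <= last);
    for (uint32_t addr = first; ; addr++) {
        if (addr % nhosts == host)
            count++;
        if (addr == last)       /* stop here to avoid wrapping past last */
            break;
    }
    return count;
}
```

For the range 10.0.0.1 to 10.0.0.126 (0x0A000001 to 0x0A00007E), two hosts
would each serve 63 sources and three hosts 42 each, provided every address
in the range is in use as a source.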
Downloads for both these patches can be found below. A different
patch is required for 3.8 to do IP-based balancing, because changes in
if_ether.c prevent the 3.6-3.7 patch from applying correctly.
Patches
Source should be downloaded and patches applied as follows:
# wget http://some.mirror.com/pub/OpenBSD/3.8/sys.tar.gz
# tar -zxvf sys.tar.gz
# patch -p0 < patchname.patch
OpenBSD 3.5 kernel - Source IP-Based Balancing
OpenBSD 3.6-3.8 kernel - Source MAC-Based Balancing
OpenBSD 3.6-3.7 kernel - Source IP-Based Balancing
OpenBSD 3.8 kernel - Source IP-Based Balancing
DMESG - P4 Blade Firewalls
OpenBSD 3.7 (GENERIC) #1: Wed Nov 2 09:05:34 EST 2005
root@secure-master:/usr/src/sys/arch/i386/compile/GENERIC
cpu0: Intel(R) Pentium(R) 4 CPU 3.00GHz ("GenuineIntel" 686-class) 3 GHz
cpu0: FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,ACPI,
MMX,FXSR,SSE,SSE2,SS,HTT,TM,SBF,PNI,MWAIT,CNXT-ID
real mem = 536387584 (523816K)
avail mem = 482652160 (471340K)
using 4278 buffers containing 26923008 bytes (26292K) of memory
mainbus0 (root)
bios0 at mainbus0: AT/286+(00) BIOS, date 12/03/04, BIOS32 rev. 0 @ 0xf0010
pcibios0 at bios0: rev 2.1 @ 0xf0000/0x10000
pcibios0: PCI IRQ Routing Table rev 1.0 @ 0xf51b0/208 (11 entries)
pcibios0: no compatible PCI ICU found: ICU vendor 0x8086 product 0x25a1
pcibios0: PCI bus #3 is the last bus
bios0: ROM list: 0xc0000/0x8000
cpu0 at mainbus0
pci0 at mainbus0 bus 0: configuration mode 1 (no bios)
pchb0 at pci0 dev 0 function 0 "Intel 82875P Host" rev 0x02
ppb0 at pci0 dev 3 function 0 "Intel 82875P PCI-CSA" rev 0x02
pci1 at ppb0 bus 1
em0 at pci1 dev 1 function 0 "Intel PRO/1000CT (82547EI)" rev 0x00: irq 3, address: 00:04:23:b3:6a:f6
ppb1 at pci0 dev 28 function 0 "Intel 6300ESB PCIX" rev 0x02
pci2 at ppb1 bus 2
em1 at pci2 dev 2 function 0 "Intel PRO/1000MT DP (82546EB)" rev 0x03: irq 9, address: 00:04:23:b0:4c:e8
em2 at pci2 dev 2 function 1 "Intel PRO/1000MT DP (82546EB)" rev 0x03: irq 9, address: 00:04:23:b0:4c:e9
uhci0 at pci0 dev 29 function 0 "Intel 6300ESB USB" rev 0x02: irq 5
usb0 at uhci0: USB revision 1.0
uhub0 at usb0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1 at pci0 dev 29 function 1 "Intel 5300ESB USB" rev 0x02: irq 9
usb1 at uhci1: USB revision 1.0
uhub1 at usb1
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
"Intel 6300ESB WDT" rev 0x02 at pci0 dev 29 function 4 not configured
"Intel 6300ESB APIC" rev 0x02 at pci0 dev 29 function 5 not configured
ehci0 at pci0 dev 29 function 7 "Intel 6300ESB USB" rev 0x02: irq 7
ehci0: EHCI version 1.0
ehci0: companion controllers, 2 ports each: uhci0 uhci1
usb2 at ehci0: USB revision 2.0
uhub2 at usb2
uhub2: Intel EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub2: single transaction translator
uhub2: 4 ports with 4 removable, self powered
ppb2 at pci0 dev 30 function 0 "Intel 82801BA AGP" rev 0x0a
pci3 at ppb2 bus 3
vga1 at pci3 dev 0 function 0 "ATI Rage XL" rev 0x27
wsdisplay0 at vga1: console (80x25, vt100 emulation)
wsdisplay0: screen 1-5 added (80x25, vt100 emulation)
fxp0 at pci3 dev 1 function 0 "Intel 82557" rev 0x10, i82550: irq 11, address 00:04:23:b3:6a:f7
inphy0 at fxp0 phy 1: i82555 10/100 PHY, rev. 4
ichpcib0 at pci0 dev 31 function 0 "Intel 6300ESB LPC" rev 0x02
pciide0 at pci0 dev 31 function 2 "Intel 6300ESB SATA" rev 0x02: DMA,
channel 0 configured to compatibility, channel 1 configured to compatibility
atapiscsi0 at pciide0 channel 0 drive 1
scsibus0 at atapiscsi0: 2 targets
cd0 at scsibus0 targ 0 lun 0: <TOSHIBA, DVD-ROM SD-C2102, 1139> SCSI0 5/cdrom removable
cd0(pciide0:0:1): using PIO mode 4, DMA mode 2
wd0 at pciide0 channel 1 drive 0: <WDC WD800JD-23JNA1>
wd0: 16-sector PIO, LBA, 76324MB, 156312576 sectors
wd0(pciide0:1:0): using PIO mode 4, Ultra-DMA mode 5
"Intel 6300ESB SMBus" rev 0x02 at pci0 dev 31 function 3 not configured
isa0 at ichpcib0
isadma0 at isa0
pckbc0 at isa0 port 0x60/5
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0 (mux 1 ignored for console): console keyboard, using wsdisplay0
pcppi0 at isa0 port 0x61
midi0 at pcppi0: <PC speaker>
sysbeep0 at pcppi0
lm0 at isa0 port 0x290/8: W83627HF
npx0 at isa0 port 0xf0/16: using exception 16
pccom0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
fdc0 at isa0 port 0x3f0/6 irq 6 drq 2
fd0 at fdc0 drive 0: 1.44MB 80 cyl, 2 head, 18 sec
biomask f7e5 netmask ffed ttymask ffef
pctr: user-level cycle counter enabled
dkcsum: wd0 matched BIOS disk 80
root on wd0a
rootdev=0x0 rrootdev=0x300 rawdev=0x302
DMESG - PIII Simple Configuration
OpenBSD 3.5 (GENERIC) #6: Fri Oct 28 17:07:09 EST 2005
root@faker-dmz:/usr/src/sys/arch/i386/compile/GENERIC
cpu0: Intel Pentium III ("GenuineIntel" 686-class) 599 MHz
cpu0: FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE
real mem = 133791744 (130656K)
avail mem = 117886976 (115124K)
using 1658 buffers containing 6791168 bytes (6632K) of memory
mainbus0 (root)
bios0 at mainbus0: AT/286+(f1) BIOS, date 02/23/00, BIOS32 rev. 0 @ 0xfd7a0
apm0 at bios0: Power Management spec V1.2
apm0: AC on, battery charge unknown
pcibios0 at bios0: rev. 2.1 @ 0xfd7a0/0x860
pcibios0: PCI IRQ Routing Table rev. 1.0 @ 0xfdf30/176 (9 entries)
pcibios0: PCI Interrupt Router at 000:07:0 ("Intel 82371FB ISA" rev 0x00)
pcibios0: PCI bus #1 is the last bus
bios0: ROM list: 0xc0000/0x8000 0xc8000/0x800 0xe0000/0x4000! 0xe4000/0xc000
pci0 at mainbus0 bus 0: configuration mode 1 (no bios)
pchb0 at pci0 dev 0 function 0 "Intel 82443BX AGP" rev 0x03
ppb0 at pci0 dev 1 function 0 "Intel 82443BX AGP" rev 0x03
pci1 at ppb0 bus 1
vga1 at pci1 dev 0 function 0 "Matrox MGA G400/G450 AGP" rev 0x04
wsdisplay0 at vga1: console (80x25, vt100 emulation)
wsdisplay0: screen 1-5 added (80x25, vt100 emulation)
pcib0 at pci0 dev 7 function 0 "Intel 82371AB PIIX4 ISA" rev 0x02
pciide0 at pci0 dev 7 function 1 "Intel 82371AB IDE" rev 0x01: DMA, channel 0
wired to compatibility, channel 1 wired to compatibility
wd0 at pciide0 channel 0 drive 0: <ST310212A>
wd0: 32-sector PIO, LBA, 9768MB, 20005650 sectors
wd0(pciide0:0:0): using PIO mode 4, Ultra-DMA mode 2
atapiscsi0 at pciide0 channel 1 drive 0
scsibus0 at atapiscsi0: 2 targets
cd0 at scsibus0 targ 0 lun 0: <ATAPI, CD-ROM DRIVE-40X, N0CP> SCSI0 5/cdrom removable
cd0(pciide0:1:0): using PIO mode 4, Ultra-DMA mode 2
uhci0 at pci0 dev 7 function 2 "Intel 82371AB USB" rev 0x01: irq 9
usb0 at uhci0: USB revision 1.0
uhub0 at usb0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
"Intel 82371AB Power Mgmt" rev 0x02 at pci0 dev 7 function 3 not configured
yds0 at pci0 dev 12 function 0 "Yamaha 740C" rev 0x03: irq 10
ac97: codec id 0x41445303 (Analog Devices AD1819)
ac97: codec features Analog Devices Phat Stereo
audio0 at yds0
xl0 at pci0 dev 14 function 0 "3Com 3c905C 100Base-TX" rev 0x74: irq 5 address 00:01:02:53:4f:04
exphy0 at xl0 phy 24: Broadcom 3C905C internal PHY, rev. 6
isa0 at pcib0
isadma0 at isa0
pckbc0 at isa0 port 0x60/5
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard, using wsdisplay0
pcppi0 at isa0 port 0x61
midi0 at pcppi0: <PC speaker>
sysbeep0 at pcppi0
lpt0 at isa0 port 0x378/4 irq 7
npx0 at isa0 port 0xf0/16: using exception 16
pccom0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
pccom1 at isa0 port 0x2f8/8 irq 3: ns16550a, 16 byte fifo
fdc0 at isa0 port 0x3f0/6 irq 6 drq 2
fd0 at fdc0 drive 0: 1.44MB 80 cyl, 2 head, 18 sec
opl0 at yds0: model OPL3
midi1 at opl0: <DS-1 integrated Yamaha OPL3>
mpu at yds0 not configured
mpu at yds0 not configured
mpu at yds0 not configured
mpu at yds0 not configured
biomask c240 netmask c260 ttymask c2e2
pctr: 686-class user-level performance counters enabled
mtrr: Pentium Pro MTRR support
dkcsum: wd0 matched BIOS disk 80
root on wd0a
rootdev=0x0 rrootdev=0x300 rawdev=0x302
Author: Matt Bradford <openbsd at mattbradford dot com>
Created: 2 Nov, 2005
Modified: 2 Jan, 2006