Monday, October 8, 2012

On binding datagram (UDP) sockets to the ANY addresses

This story goes back a long time. For around 10 years now, people have been requesting that PowerDNS learn how to automatically listen on all available IP addresses. And for slightly less than that time, we've been telling people we would not be adding that feature.

For one, if you run a nameserver, you should *know* what IP addresses you listen on! How else could people delegate to you, or rely on you to resolve their queries? Secondly, running services by default on 'all' IP addresses is a security risk. The PowerDNS Recursor for this reason binds to 127.0.0.1 by default.

But still, people wanted this feature, and we didn't do it. Because we knew it'd be hard work. There, the truth is out. But we finally bit the bullet and had to figure out how to do it. This page shares that knowledge, including the fact that the Linux manpages tell you to do the wrong thing.

There are two ways to listen on all addresses. The first is to enumerate all interfaces, grab all their IP addresses, and bind to all of them. Lots of work, and non-portable work too. You also need to monitor for new addresses arriving. We really did not want to do that.

The second is to just bind to 0.0.0.0 and ::! This works very well for TCP and other connection-oriented protocols, but can fail silently for UDP and other connectionless protocols. How come? When a packet comes in on a socket bound to 0.0.0.0, we don't know which IP address it was sent to. And this is a problem when replying to such a packet - what would the correct source address be? Because we are connectionless (and therefore stateless), the kernel doesn't know what to do.

So it picks what it considers the most appropriate address, and that may be the wrong one. There are some heuristics that make some kernels do the right thing more reliably, but there are no guarantees.

When receiving packets on datagram sockets, we usually use recvfrom(2), but this does not provide the missing bit of data: which IP address the packet was actually sent to. There is no recvfromto(). Enter the very powerful recvmsg(2), which can deliver a boatload of parameters per datagram, as requested via setsockopt().

One of the parameters we can request is the original destination IP address of the packet.

IPv6

For IPv6, this is actually standardized in RFC 3542, which tells us to request parameter IPV6_RECVPKTINFO via setsockopt(), which will lead to the delivery of the IPV6_PKTINFO parameter when we use recvmsg(2).

This parameter is sent to us as a struct in6_pktinfo, and its ipi6_addr member contains the original destination IPv6 address of the query.
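The receiving side could look roughly like the sketch below. The function names (requestDestination6, findDestination6, recvWithDestination6) and the 256 byte control buffer are made up for this illustration and are not taken from PowerDNS; only the setsockopt(), recvmsg() and CMSG_* calls are the real API.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // glibc only declares struct in6_pktinfo with this set
#endif
#ifndef __APPLE_USE_RFC_3542
#define __APPLE_USE_RFC_3542   // OSX needs this for the RFC 3542 names
#endif
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <cstring>

// Ask the kernel to attach an IPV6_PKTINFO parameter to every datagram on fd.
bool requestDestination6(int fd)
{
  int one = 1;
  return setsockopt(fd, IPPROTO_IPV6, IPV6_RECVPKTINFO, &one, sizeof(one)) == 0;
}

// Search the parameters recvmsg() left in msg for the destination address.
bool findDestination6(msghdr* msg, in6_addr* dest)
{
  for (cmsghdr* cmsg = CMSG_FIRSTHDR(msg); cmsg != nullptr; cmsg = CMSG_NXTHDR(msg, cmsg)) {
    if (cmsg->cmsg_level == IPPROTO_IPV6 && cmsg->cmsg_type == IPV6_PKTINFO) {
      in6_pktinfo info;
      std::memcpy(&info, CMSG_DATA(cmsg), sizeof(info)); // copy out, data may be unaligned
      *dest = info.ipi6_addr;
      return true;
    }
  }
  return false;
}

// Receive one datagram; *haveDest tells whether *dest was filled in.
ssize_t recvWithDestination6(int fd, void* buf, size_t len, sockaddr_in6* from,
                             in6_addr* dest, bool* haveDest)
{
  alignas(cmsghdr) char control[256]; // room for the per-datagram parameters
  iovec iov;
  iov.iov_base = buf;
  iov.iov_len = len;

  msghdr msg;
  std::memset(&msg, 0, sizeof(msg));
  msg.msg_name = from;
  msg.msg_namelen = sizeof(*from);
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = control;
  msg.msg_controllen = sizeof(control);

  ssize_t got = recvmsg(fd, &msg, 0);
  *haveDest = got >= 0 && findDestination6(&msg, dest);
  return got;
}
```

Call requestDestination6() once after creating the socket, then use recvWithDestination6() wherever you would otherwise have used recvfrom().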

When replying to a packet from a socket bound to ::, we have the reverse problem: how to specify which *source* address to use. To do so, use sendmsg(2) and specify an IPV6_PKTINFO parameter, which again contains a struct in6_pktinfo.
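A minimal sketch of the sending side, under the same caveat that the names (setSource6, sendFrom6) are invented for this page and not the PowerDNS ones. Leaving ipi6_ifindex at zero is an assumption that you are happy to let the kernel pick the outgoing interface.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // glibc only declares struct in6_pktinfo with this set
#endif
#ifndef __APPLE_USE_RFC_3542
#define __APPLE_USE_RFC_3542   // OSX needs this for the RFC 3542 names
#endif
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <cstring>

// Attach one IPV6_PKTINFO parameter to msg, naming src as the source address.
// cbuf must stay alive until sendmsg() has been called.
bool setSource6(msghdr* msg, void* cbuf, size_t cbuflen, const in6_addr& src)
{
  if (cbuflen < CMSG_SPACE(sizeof(in6_pktinfo)))
    return false;
  std::memset(cbuf, 0, cbuflen);
  msg->msg_control = cbuf;
  msg->msg_controllen = CMSG_SPACE(sizeof(in6_pktinfo));

  cmsghdr* cmsg = CMSG_FIRSTHDR(msg);
  cmsg->cmsg_level = IPPROTO_IPV6;
  cmsg->cmsg_type = IPV6_PKTINFO;
  cmsg->cmsg_len = CMSG_LEN(sizeof(in6_pktinfo));

  in6_pktinfo info;
  std::memset(&info, 0, sizeof(info)); // ipi6_ifindex 0: kernel chooses the interface
  info.ipi6_addr = src;
  std::memcpy(CMSG_DATA(cmsg), &info, sizeof(info));
  return true;
}

// Send one datagram to 'to', using src as the source address.
ssize_t sendFrom6(int fd, const void* buf, size_t len,
                  const sockaddr_in6& to, const in6_addr& src)
{
  alignas(cmsghdr) char control[256];
  iovec iov;
  iov.iov_base = const_cast<void*>(buf);
  iov.iov_len = len;

  msghdr msg;
  std::memset(&msg, 0, sizeof(msg));
  msg.msg_name = const_cast<sockaddr_in6*>(&to);
  msg.msg_namelen = sizeof(to);
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  if (!setSource6(&msg, control, sizeof(control), src))
    return -1;
  return sendmsg(fd, &msg, 0);
}
```

For a nameserver, src would simply be the destination address that was recorded when the query came in.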

And we are done!

To get this to work on OSX, please #define __APPLE_USE_RFC_3542 before including <netinet/in.h>. Apart from that, this feature is portable across FreeBSD, OSX and Linux. (Please let me know about Windows; I want to make this page as valuable as possible.)

IPv4

For IPv4 the situation is more complicated. Linux and the BSDs each picked a slightly different way to do things, since they did not have an RFC to guide them. Confusingly, the Linux manpages document this incorrectly (I'll submit a patch to the manpages as soon as everybody agrees that this page describes things correctly).

For BSD, use a setsockopt() called IP_RECVDSTADDR to request the original destination address. This then arrives as an IP_RECVDSTADDR option over recvmsg(), which carries a struct in_addr. Note that this holds only the IPv4 address, and NOT, for example, the destination port number.

For Linux, use the setsockopt() called IP_PKTINFO. Each datagram read with recvmsg() then comes with a parameter also called IP_PKTINFO, which carries a struct in_pktinfo, which has a 4 byte IP address hiding in its ipi_addr field.
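Both variants can be folded into one pair of helpers, sketched below. The names requestDestination4 and findDestination4 are hypothetical, and preferring IP_PKTINFO whenever the headers define it is a choice made for this example. findDestination4() expects a msghdr that recvmsg() has just filled in, with msg_control pointing at a buffer large enough for the parameter.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // makes glibc declare struct in_pktinfo
#endif
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <cstring>

// Ask the kernel to report the original destination address of each datagram.
bool requestDestination4(int fd)
{
  int one = 1;
#if defined(IP_PKTINFO)
  return setsockopt(fd, IPPROTO_IP, IP_PKTINFO, &one, sizeof(one)) == 0;      // Linux
#elif defined(IP_RECVDSTADDR)
  return setsockopt(fd, IPPROTO_IP, IP_RECVDSTADDR, &one, sizeof(one)) == 0;  // BSD
#else
  (void)fd;
  (void)one;
  return false;
#endif
}

// Search the parameters recvmsg() left in msg for the destination address.
bool findDestination4(msghdr* msg, in_addr* dest)
{
  for (cmsghdr* cmsg = CMSG_FIRSTHDR(msg); cmsg != nullptr; cmsg = CMSG_NXTHDR(msg, cmsg)) {
    if (cmsg->cmsg_level != IPPROTO_IP)
      continue;
#if defined(IP_PKTINFO)
    if (cmsg->cmsg_type == IP_PKTINFO) {
      in_pktinfo info;
      std::memcpy(&info, CMSG_DATA(cmsg), sizeof(info));
      *dest = info.ipi_addr;   // the 4 byte destination address
      return true;
    }
#endif
#if defined(IP_RECVDSTADDR)
    if (cmsg->cmsg_type == IP_RECVDSTADDR) {
      std::memcpy(dest, CMSG_DATA(cmsg), sizeof(*dest)); // payload is a bare in_addr
      return true;
    }
#endif
  }
  return false;
}
```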

Conversely, for sending on Linux, pass an IP_PKTINFO parameter using sendmsg() and make it contain a struct in_pktinfo, with the desired source address in its ipi_spec_dst field.

On FreeBSD, pass the IP_SENDSRCADDR option, and make it contain a struct in_addr. Again, this holds only an address: there is no source port to set, which makes sense, as your socket is bound to exactly one port number (even if it covers many IP addresses).
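Sketched in code, sending from a chosen IPv4 source address could look like this. As before, setSource4 and sendFrom4 are names invented for the example. On Linux the source address goes into the ipi_spec_dst member of struct in_pktinfo, while the other fields are left at zero here.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // makes glibc declare struct in_pktinfo
#endif
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <cstring>

// Attach one parameter to msg that names src as the source address.
// cbuf must stay alive until sendmsg() has been called.
bool setSource4(msghdr* msg, void* cbuf, size_t cbuflen, const in_addr& src)
{
#if defined(IP_PKTINFO)
  in_pktinfo payload;                    // Linux
  std::memset(&payload, 0, sizeof(payload));
  payload.ipi_spec_dst = src;            // on send, this is the local source address
  const int type = IP_PKTINFO;
#elif defined(IP_SENDSRCADDR)
  in_addr payload = src;                 // FreeBSD: a bare address, no port
  const int type = IP_SENDSRCADDR;
#else
#error "no known way to set the IPv4 source address on this platform"
#endif
  if (cbuflen < CMSG_SPACE(sizeof(payload)))
    return false;
  std::memset(cbuf, 0, cbuflen);
  msg->msg_control = cbuf;
  msg->msg_controllen = CMSG_SPACE(sizeof(payload));

  cmsghdr* cmsg = CMSG_FIRSTHDR(msg);
  cmsg->cmsg_level = IPPROTO_IP;
  cmsg->cmsg_type = type;
  cmsg->cmsg_len = CMSG_LEN(sizeof(payload));
  std::memcpy(CMSG_DATA(cmsg), &payload, sizeof(payload));
  return true;
}

// Send one datagram to 'to', using src as the source address.
ssize_t sendFrom4(int fd, const void* buf, size_t len,
                  const sockaddr_in& to, const in_addr& src)
{
  alignas(cmsghdr) char control[256];
  iovec iov;
  iov.iov_base = const_cast<void*>(buf);
  iov.iov_len = len;

  msghdr msg;
  std::memset(&msg, 0, sizeof(msg));
  msg.msg_name = const_cast<sockaddr_in*>(&to);
  msg.msg_namelen = sizeof(to);
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  if (!setSource4(&msg, control, sizeof(control), src))
    return -1;
  return sendmsg(fd, &msg, 0);
}
```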

Binding to :: for IPv6 *and* IPv4 purposes

On Linux, one can bind to :: and get packets destined for both IPv6 and IPv4. The good news is that this combines well with the above, and Linux delivers an IPv4 IP_PKTINFO for IPv4 packets, and will also honour the IP_PKTINFO for outgoing IPv4 packets on such a combined IPv4/IPv6 socket.

On FreeBSD, and probably other BSD-derived systems, one should bind explicitly to both :: and 0.0.0.0 to cover IPv6 and IPv4. Keeping the two separate is probably the better approach. To get this behaviour on Linux, set the IPV6_V6ONLY option with setsockopt(), or set /proc/sys/net/ipv6/bindv6only to 1.
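As a rough sketch of the two-socket approach: one UDP socket on 0.0.0.0 and one on :: with IPV6_V6ONLY set before bind(). The helper names (anyAddress4, anyAddress6, bindAny4, bindAny6Only) are illustrative only, and error handling is reduced to returning -1.

```cpp
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

// 0.0.0.0:port
sockaddr_in anyAddress4(uint16_t port)
{
  sockaddr_in sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sin_family = AF_INET;
  sa.sin_addr.s_addr = htonl(INADDR_ANY);
  sa.sin_port = htons(port);
  return sa;
}

// [::]:port
sockaddr_in6 anyAddress6(uint16_t port)
{
  sockaddr_in6 sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sin6_family = AF_INET6;
  sa.sin6_addr = in6addr_any;
  sa.sin6_port = htons(port);
  return sa;
}

// UDP socket bound to 0.0.0.0:port; returns the descriptor, or -1 on failure.
int bindAny4(uint16_t port)
{
  int fd = socket(AF_INET, SOCK_DGRAM, 0);
  if (fd < 0)
    return -1;
  sockaddr_in sa = anyAddress4(port);
  if (bind(fd, reinterpret_cast<sockaddr*>(&sa), sizeof(sa)) < 0) {
    close(fd);
    return -1;
  }
  return fd;
}

// UDP socket bound to [::]:port that accepts IPv6 only, so it does not
// overlap with the IPv4 socket. IPV6_V6ONLY has to be set before bind().
int bindAny6Only(uint16_t port)
{
  int fd = socket(AF_INET6, SOCK_DGRAM, 0);
  if (fd < 0)
    return -1;
  int one = 1;
  sockaddr_in6 sa = anyAddress6(port);
  if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &one, sizeof(one)) < 0 ||
      bind(fd, reinterpret_cast<sockaddr*>(&sa), sizeof(sa)) < 0) {
    close(fd);
    return -1;
  }
  return fd;
}
```

With IPV6_V6ONLY set, both sockets can share the same port number, and each one only needs the destination-address handling for its own address family.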

Actual source code

To see all this in action, head over to http://wiki.powerdns.com/trac/browser/trunk/pdns/pdns/nameserver.cc - it contains the relevant setsockopt(), sendmsg() and recvmsg() calls. 
