T/TCP -- Transaction TCP
Source Changes for Sun OS 4.1.3

Release 1.0
September 14, 1994

Prepared by Bob Braden
USC Information Sciences Institute


INTRODUCTION

This README file describes a package of extensions to Sun OS 4.1.3
(basically, BSD Reno TCP) that implement the transaction extensions to
TCP described in RFC1644, "T/TCP -- TCP Extensions for Transactions
Functional Specification".

To use this package, you will need kernel sources for 4.1.3. These
sources have also been used with 4.1.3U1 without difficulty.

This package includes the following files:

    README         -- This file
    netinet.diffs  -- Context diffs for files in the netinet/ directory
                      of the kernel source tree.
    os.diffs       -- Ditto for the os/ directory
    sys.diffs      -- Ditto for the sys/ directory
    test/*         -- A set of test and diagnostic programs. See the
                      README file there.


BUILDING A KERNEL

Each of these *.diffs files can be applied to the corresponding source
directory using the 'patch' program, and a new kernel generated. For
example, starting in the base system build tree directory:

    cd netinet
    patch < ../4.1.3TTCP/netinet.diffs
    cd ../os
    patch < ../4.1.3TTCP/os.diffs
    cd ../sys
    patch < ../4.1.3TTCP/sys.diffs

To include the T/TCP changes, you will also need to add the following
line to your kernel config file:

    options XACT    # T/TCP transaction TCP extensions

If you omit this line, you will build a standard 4.1.3x kernel.


TESTING THE KERNEL

First of all, the new kernel should be fully compatible with normal TCP
unless you exercise its new features. The only anomaly you should
observe is visible with tcpdump: your SYN segments will be sent with a
new TCP option (CC.NEW, kind = 12; see RFC1644). A listening TCP that
does not implement T/TCP should just ignore this option.

(If you think you observe any problems with standard-TCP compatibility,
try to capture a tcpdump trace of it and send it to me.)


USING T/TCP

1. New setsockopts

Two new set/getsockopts are defined for TCP (level = IPPROTO_TCP):

    TCP_NOTPUSH

        This option turns the NotPush flag on or off for the specified
        socket (a nonzero value turns the flag on, which suppresses
        PUSH).

        Basically, this supplies a capability implied by the TCP
        specification but not implemented in standard BSD TCP, which
        always has an implied PUSH for any data write, i.e., BSD TCP
        always forwards any data passed to it for output. Turning on
        NOTPUSH removes this implied PUSH, so that data may be buffered
        within TCP and combined with data from later writes to form
        full-sized segments for transmission. This buffering may
        continue until the connection is closed or NOTPUSH is turned
        off.

        This facility was added for T/TCP, to be set on the server
        socket. This will ensure that the SYN, response data, and FIN
        from the server are piggy-backed on the same segment whenever
        possible.

        [Note: I am not sure this bit is really required for T/TCP;
        this needs further experimentation.]

        Note: NOTPUSH should not be confused with the existing
        setsockopt TCP_NODELAY, which suppresses the Nagle algorithm.
        The two are related, but different.

    TCP_NOOPTS

        This option suppresses sending any options on the initial SYN
        segment. It is intended for communicating with a (broken!) host
        that cannot accept and ignore TCP options on SYN segments.

Both the TCP_NOTPUSH and TCP_NOOPTS flags are inherited from the LISTEN
socket when a new socket is created as the result of an accept() call.

2. New send flag MSG_EOF

A new send flag, MSG_EOF, is defined in socket.h. This flag causes an
"end of file" after the data in a send, sendto, or sendmsg operation.
That is, it closes the send side of the connection after sending the
data, all in one system call. This forces the FIN bit to be piggy-backed
on the last data segment. Note that this flag does NOT give
the effect of a send (or write) followed by a close() call.
The BSD socket abstraction is full-duplex, while the underlying TCP
protocol has a more complex dual-simplex logic: a TCP connection is
allowed to be half-open, i.e., open in only one direction. This is
essential for T/TCP. Setting MSG_EOF in the send call sends the data and
closes the send side of the connection; it does not affect the receive
side or delete the socket. A subsequent close call must still be made to
delete the socket.

3. Sendto Call for TCP

A very small change in the socket layer allows the sendto call to be
used for a SOCK_STREAM socket, i.e., for TCP. This is the basic call
used by a client application to issue a normal transaction request:

    sendto(so, msg, len, MSG_EOF, to, tolen)

This call does an implied connect() to the socket address *to, sends the
data in msg with the specified length, and closes the send side. If the
length len is short enough to fit into a single packet, it will send one
packet containing the SYN and FIN bits as well as the data. Unlike a
sendto call on a UDP socket, there is no limit on the size of a message
with TCP.

If the flag parameter is zero rather than MSG_EOF, the call will do an
implied connect and send the data but not close the connection; further
send[to] or write calls can then be issued.

Note: We did not implement a TCP version of recvfrom(); receipt of T/TCP
transactions uses the standard TCP listen/accept mechanism.

4. Client, Server Application Code

Here is a cryptic outline of basic client and server code using T/TCP.
All code to handle errors or other exceptions is omitted.

CLIENT:

    so = socket(AF_INET, SOCK_STREAM, 0);

    /* Send request message */
    sendto(so, reqbuf, reqlen, MSG_EOF, &peeraddr_in, peeraddr_len);

    /* Read reply message into buffer (assumed here to be
     * large enough). Reply msg is delimited by EOF (FIN).
     */
    read(so, rcvbuf, sizeof(rcvbuf));

    /* Delete socket. */
    close(so);

SERVER:

    so = socket(AF_INET, SOCK_STREAM, 0);
    bind(so, local_addr, addr_len);
    setsockopt(so, IPPROTO_TCP, TCP_NOTPUSH, &One, sizeof(int));
    listen(so, n);

    while (1) {
        new_so = accept(so, foreign_addr, &addr_len);

        /* Read request message into buffer (assumed here to be
         * large enough). Request msg is delimited by EOF (FIN).
         */
        read(new_so, reqbuf, sizeof(reqbuf));

        /* Send reply message and close the send side. */
        sendto(new_so, replybuf, replylen, MSG_EOF, 0, 0);

        /* Delete socket. */
        close(new_so);
    }


IMPLEMENTATION

Here are some brief notes on the T/TCP kernel changes.

1. Support TCP options

The original BSD code was not structured properly to allow TCP options.
Correcting this required significant changes in a number of the TCP
routines. These changes are delimited by #ifdef USE_OPTIONS, while the
rest of the T/TCP changes are delimited by #ifdef XACT. (Defining XACT
defines USE_OPTIONS for you.)

(Explanatory note: the original source from which this release is
derived has *both* T/TCP and RFC-1323 changes, controlled by separate
#ifdefs. We bound and removed the RFC-1323 #ifdefs for this T/TCP
release, because the combination with RFC-1323 changes gets quite
confusing. As a result of the way this was done, USE_OPTIONS sometimes
delimits code that is not strictly for supporting options, but rather
for any extension that uses options.)

2. Large TCPCB

The standard BSD TCP places a connection control block (tcpcb) in a
128-byte mbuf, and it JUST fits. To extend TCP, the tcpcb must be
expanded, and this requires a new way to allocate a tcpcb. This version
uses the dumbest and most inefficient way: it uses an entire 1K cluster
mbuf for each tcpcb.

The mbuf.h and uipc_mbuf.c files include a new mechanism (XCLGET and
free_xclfun) that will suballocate a cluster mbuf to hold 5 tcpcb's.
This code has worked, but to be conservative we do not use it in this
version.

3. RTT Calculations

It is important to maintain RTT estimates across multiple transactions
(see RFC1644).
The cache maintained by T/TCP therefore records the SRTT and RTT_VAR
values for a particular remote host, and uses them to initialize the
values for the next transaction to/from that host. However, this code
forces a slow start (initializes cwnd to 1*maxseg) if the destination is
not on the local network.

4. Header Prediction

This code adds the CC count to header prediction, assuming that the CC
option arrives in the same format in which this code sends it.
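
ILLUSTRATION: RTT CACHE SKETCH

As a rough illustration of the RTT caching described in item 3 above,
here is a minimal user-space sketch in C. It is NOT the kernel code from
netinet.diffs: the structure layouts, the direct-mapped table, and the
names rtt_cache_save() and rtt_cache_seed() are all hypothetical, chosen
only to show saved SRTT and RTT_VAR values seeding the next connection
and cwnd being reset to 1*maxseg for a non-local destination. What the
real code does to cwnd for a local destination is not stated in these
notes; the sketch simply leaves it unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RTT_CACHE_SIZE 64          /* hypothetical table size */

/* Hypothetical cache entry: one slot per remote host. */
struct rtt_cache_entry {
    int      valid;                /* nonzero once the slot holds data */
    uint32_t host;                 /* remote IPv4 address, host byte order */
    int      srtt;                 /* smoothed RTT from the last transaction */
    int      rttvar;               /* RTT variance from the last transaction */
};

/* Hypothetical subset of per-connection state. */
struct conn_state {
    int      srtt;
    int      rttvar;
    unsigned cwnd;                 /* congestion window, in bytes */
    unsigned maxseg;               /* maximum segment size, in bytes */
};

/* Static storage is zero-initialized, so every slot starts out invalid. */
static struct rtt_cache_entry rtt_cache[RTT_CACHE_SIZE];

/* Direct-mapped lookup (an assumption of this sketch): a later host that
 * maps to the same slot simply overwrites the earlier one. */
static struct rtt_cache_entry *rtt_cache_slot(uint32_t host)
{
    return &rtt_cache[host % RTT_CACHE_SIZE];
}

/* Record the estimates when a transaction with `host` finishes. */
void rtt_cache_save(uint32_t host, const struct conn_state *conn)
{
    struct rtt_cache_entry *e;

    assert(conn != NULL);
    e = rtt_cache_slot(host);
    e->valid = 1;
    e->host = host;
    e->srtt = conn->srtt;
    e->rttvar = conn->rttvar;
}

/* Seed a new connection to `host`. Returns 1 on a cache hit, 0 on a miss. */
int rtt_cache_seed(uint32_t host, int is_local, struct conn_state *conn)
{
    struct rtt_cache_entry *e;
    int hit;

    assert(conn != NULL);
    e = rtt_cache_slot(host);
    hit = e->valid && e->host == host;
    if (hit) {
        conn->srtt = e->srtt;
        conn->rttvar = e->rttvar;
    }
    if (!is_local)
        conn->cwnd = conn->maxseg; /* force slow start: cwnd = 1*maxseg */
    return hit;
}
```

In this sketch a caller would invoke rtt_cache_save() when a transaction
completes and rtt_cache_seed() when the next connection to the same host
is set up; on a miss the connection keeps whatever initial estimates the
caller supplied.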