TCP IP Illustrated Volume 3

TCP/IP Illustrated, Volume 3: TCP for Transactions, HTTP, NNTP, and the UNIX Domain Protocols

ACRONYMS

ACK       acknowledgment flag; TCP header
ANSI      American National Standards Institute
API       application program interface
ARP       Address Resolution Protocol
ARPANET   Advanced Research Projects Agency network
ASCII     American Standard Code for Information Interchange
BPF       BSD Packet Filter
BSD       Berkeley Software Distribution
CC        connection count; T/TCP
CERT      Computer Emergency Response Team
CR        carriage return
DF        don't fragment flag; IP header
DNS       Domain Name System
EOL       end of option list
FAQ       frequently asked question
FIN       finish flag; TCP header
FTP       File Transfer Protocol
GIF       graphics interchange format
HTML      Hypertext Markup Language
HTTP      Hypertext Transfer Protocol
ICMP      Internet Control Message Protocol
IEEE      Institute of Electrical and Electronics Engineers
INN       InterNet News
INND      InterNet News Daemon
IP        Internet Protocol
IPC       interprocess communication
IRTP      Internet Reliable Transaction Protocol
ISN       initial sequence number
ISO       International Organization for Standardization
ISS       initial send sequence number
LAN       local area network
LF        line feed
MIME      multipurpose Internet mail extensions
MSL       maximum segment lifetime
MSS       maximum segment size
MTU       maximum transmission unit
NCSA      National Center for Supercomputing Applications
NFS       Network File System
NNRP      Network News Reading Protocol
NNTP      Network News Transfer Protocol
NOAO      National Optical Astronomy Observatories
NOP       no operation
OSF       Open Software Foundation
OSI       open systems interconnection
PAWS      protection against wrapped sequence numbers
PCB       protocol control block
POSIX     Portable Operating System Interface
PPP       Point-to-Point Protocol
PSH       push flag; TCP header
RDP       Reliable Datagram Protocol
RFC       Request for Comment
RPC       remote procedure call
RST       reset flag; TCP header
RTO       retransmission time out
RTT       round-trip time
SLIP      Serial Line Internet Protocol
SMTP      Simple Mail Transfer Protocol
SPT       server processing time
SVR4      System V Release 4
SYN       synchronize sequence numbers flag; TCP header
TAO       TCP accelerated open
TCP       Transmission Control Protocol
TTL       time-to-live
Telnet    remote terminal protocol
UDP       User Datagram Protocol
URG       urgent pointer flag; TCP header
URI       uniform resource identifier
URL       uniform resource locator
URN       uniform resource name
VMTP      Versatile Message Transaction Protocol
WAN       wide area network
WWW       World Wide Web

Praise for TCP/IP Illustrated, Volume 3: TCP for Transactions, HTTP, NNTP, and the UNIX Domain Protocols

"An absolutely wonderful example of how to apply scientific thinking and analysis to a technological problem... it is the highest caliber of technical writing and thinking."
- Marcus J. Ranum, Firewall Architect

"A worthy successor that continues the series' standards of excellence for both clarity and accuracy. The coverage of T/TCP and HTTP is particularly timely, given the explosion of the World Wide Web."
- Vern Paxson, Network Research Group, Lawrence Berkeley National Laboratory

"The coverage of the HTTP protocol will be invaluable to anyone who needs to understand the detailed behavior of web servers."
- Jeffrey Mogul, Digital Equipment Corporation

"Volume 3 is a natural addition to the series. It covers the network aspects of Web services and transaction TCP in depth."
- Pete Haverlock, Program Manager, IBM

"In this latest volume of TCP/IP Illustrated, Rich Stevens maintains the high standards he set up in the previous volumes: clear presentation and technical accuracy to the finest detail."
- Andras Olah, University of Twente

"This volume maintains the superb quality of the earlier ones in the series, extending the in-depth examination of networking implementation in new directions. The entire series is a must for anybody who is seriously interested in how the Internet works today."
- Ian Lance Taylor, Author of GNU/Taylor UUCP

Addison-Wesley Professional Computing Series
Brian W. Kernighan, Consulting Editor

Matthew H.
Austern, Generic Programming and the STL: Using and Extending the C++ Standard Template Library
David R. Butenhof, Programming with POSIX Threads
Brent Callaghan, NFS Illustrated
Tom Cargill, C++ Programming Style
William R. Cheswick/Steven M. Bellovin/Aviel D. Rubin, Firewalls and Internet Security, Second Edition: Repelling the Wily Hacker
David A. Curry, UNIX System Security: A Guide for Users and System Administrators
Stephen C. Dewhurst, C++ Gotchas: Avoiding Common Problems in Coding and Design
Erich Gamma/Richard Helm/Ralph Johnson/John Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software
Erich Gamma/Richard Helm/Ralph Johnson/John Vlissides, Design Patterns CD: Elements of Reusable Object-Oriented Software
Peter Haggar, Practical Java Programming Language Guide
David R. Hanson, C Interfaces and Implementations: Techniques for Creating Reusable Software
Mark Harrison/Michael McLennan, Effective Tcl/Tk Programming: Writing Better Programs with Tcl and Tk
Michi Henning/Steve Vinoski, Advanced CORBA Programming with C++
Brian W. Kernighan/Rob Pike, The Practice of Programming
S. Keshav, An Engineering Approach to Computer Networking: ATM Networks, the Internet, and the Telephone Network
John Lakos, Large-Scale C++ Software Design
Scott Meyers, Effective C++ CD: 85 Specific Ways to Improve Your Programs and Designs
Scott Meyers, Effective C++, Second Edition: 50 Specific Ways to Improve Your Programs and Designs
Scott Meyers, More Effective C++: 35 New Ways to Improve Your Programs and Designs
Scott Meyers, Effective STL: 50 Specific Ways to Improve Your Use of the Standard Template Library
Robert B. Murray, C++ Strategies and Tactics
David R. Musser/Gillmer J. Derge/Atul Saini, STL Tutorial and Reference Guide, Second Edition: C++ Programming with the Standard Template Library
John K.
Ousterhout, Tcl and the Tk Toolkit
Craig Partridge, Gigabit Networking
Radia Perlman, Interconnections, Second Edition: Bridges, Routers, Switches, and Internetworking Protocols
Stephen A. Rago, UNIX System V Network Programming
Curt Schimmel, UNIJ
laptop.8888: ack 402 win 8312

Figure 3.1 T/TCP client reboots and sends a transaction to server.

In line 1 we see from the CCnew option that the client's tcp_ccgen is 1. In line 2 the server echoes the client's CCnew value, and the server's tcp_ccgen is 18. The server ACI
laptop.8888: SFP 101619844:101620144(300) win 8712
    laptop.8888 > bsdi.1026: SFP 140211128:140211528(400) ack 101620146 win 8712
3   0.029330 (0.0011)   bsdi.1026 > laptop.8888: ack 402 win 8312

Figure 3.6 Normal T/TCP transaction.

The server is expecting a CC value greater than 2 from this client, so the received SYN with a CC of 3 passes the TAO test.

3.5 Server Reboot

We now reboot the server and then send a transaction from the client once the server has rebooted, and once the listening server process has been restarted. Figure 3.7 shows the exchange.

1   0.0                 bsdi.1027 > laptop.8888: SFP 146513089:146513389(300) win 8712
2   0.025420 (0.0254)   arp who-has bsdi tell laptop
3   0.025872 (0.0005)   arp reply bsdi is-at 0:20:af:9c:ee:95
4   0.033731 (0.0079)   laptop.8888 > bsdi.1027: S 27338882:27338882(0) ack 146513090 win 8712
5   0.034697 (0.0010)   bsdi.1027 > laptop.8888: F 301:301(0) ack 1 win 8712
6   0.044284 (0.0096)   laptop.8888 > bsdi.1027: ack 302 win 8412
7   0.066749 (0.0225)   laptop.8888 > bsdi.1027: FP 1:401(400) ack 302 win 8712
8   0.067613 (0.0009)   bsdi.1027 > laptop.8888: . ack 402 win 8312

Figure 3.7 T/TCP packet exchange after server reboots.

Since the client does not know that the server has rebooted, it sends a normal T/TCP request with its CC of 4 (line 1). The server sends an ARP request and the client responds with an ARP reply since the client's hardware address on the server was lost when the server rebooted. The server forces a normal three-way handshake to occur (line 4), because it doesn't know the value of the last CC received from this client.
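The TAO test referred to for Figures 3.6 and 3.7 can be sketched in C as follows. This is only an illustration under stated assumptions, not the kernel code: the names cc_gt and tao_test_passes are hypothetical, a cached value of 0 is assumed to mean that no CC is known for the client (as after the server reboot), and the wraparound-tolerant comparison is an assumption about how CC values are ordered.

```c
#include <stdint.h>

/* Hypothetical helper: nonzero if a is "greater than" b, treating the
 * 32-bit connection counts as values that may wrap around. */
static int cc_gt(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}

/* Hypothetical sketch of the server-side TAO test.
 * cached_cc: last valid CC cached for this client, 0 if unknown (assumed).
 * syn_cc:    CC value carried by the received SYN.
 * Returns 1 if the three-way handshake can be bypassed, else 0. */
int tao_test_passes(uint32_t cached_cc, uint32_t syn_cc)
{
    if (cached_cc == 0)
        return 0;               /* nothing cached: normal handshake */
    return cc_gt(syn_cc, cached_cc);
}
```

With a cached CC of 2 and a received CC of 3 (Figure 3.6) the test passes; with nothing cached and a received CC of 4 (Figure 3.7) it fails and the three-way handshake takes place.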
Similar to what we saw in Figure 3.1, the client completes the three-way handshake with an ACK that also contains its FIN; the 300 bytes of data are not resent. The client's data is retransmitted only when the client's retransmission timer expires, which we'll see occur in Figure 3.11. Upon receiving this third segment, the server immediately ACKs the data and the FIN. The server sends its reply (line 7), which the client acknowledges in line 8.

After the exchange in Figure 3.7 we expect to see another minimal T/TCP transaction between this client and server the next time they communicate, which is what we show in Figure 3.8.

1   0.0                 bsdi.1028 > laptop.8888: SFP 152213061:152213361(300) win 8712
2   0.034851 (0.0349)   laptop.8888 > bsdi.1028: SFP 32869470:32869870(400) ack 152213363 win 8712
3   0.035955 (0.0011)   bsdi.1028 > laptop.8888: ack 402 win 8312

Figure 3.8 Normal T/TCP client-server transaction.

3.6 Request or Reply Exceeds MSS

In all our examples so far, the client sends less than one MSS of data, and the server replies with less than one MSS of data. If the client application sends more than one MSS of data, and the client TCP knows that the peer understands T/TCP, multiple segments are sent. Since the peer's MSS is saved in the TAO cache (tao_mssopt in Figure 2.5), the client TCP knows the MSS of the server host, but the client TCP does not know the receive window of the peer process. (Sections 18.4 and 20.4 of Volume 1 talk about the MSS and window size, respectively.) Unlike the MSS, which is normally constant for a given peer host, the window can be changed by the application if it changes the size of its socket receive buffer. Furthermore, even if the peer advertises a large window (say, 32768), if the MSS is 512, there may well be intermediate routers that cannot handle an initial burst of 64 segments from the client to the server (i.e., slow start should not be skipped). T/TCP handles these problems with the following two restrictions:

1. T/TCP assumes an initial send window of 4096 bytes. In Net/3 this is the snd_wnd variable, which controls how much data TCP output can send. The initial value of 4096 will be changed when the first segment is received from the peer with a window advertisement.

2. T/TCP starts a connection using slow start only if the peer is nonlocal. Slow start is when TCP sets the variable snd_cwnd to one segment. This local/nonlocal test is given in Figure 10.14 and is based on the kernel's in_localaddr function. A peer is considered local if (a) it shares the same network and subnet as the local host, or (b) it shares the same network but a different subnet, but the kernel's subnetsarelocal variable is nonzero.

Net/3 starts every connection using slow start (p. 902 of Volume 2) but this prevents a transaction client from sending multiple segments to start a transaction. The compromise is to allow multiple segments, for a total of up to 4096 bytes, but only for a local peer.

Whenever TCP output is called, it sends up to the minimum of snd_wnd and snd_cwnd bytes of data. The latter starts at the maximum value of a TCP window advertisement, which we assume to be 65535. (It is actually 65535 x 2^14, or almost 1 gigabyte, when the window scale option is being used.) For a local peer snd_wnd starts at 4096 and snd_cwnd starts at 65535. TCP output will initially send up to 4096 bytes until a window advertisement is received. If the peer's window advertisement is 32768, then TCP can continue sending until the peer's window is filled (since the minimum of 32768 and 65535 is 32768). Slow start is avoided and the amount of data sent is limited by the advertised window.

But if the peer is nonlocal, snd_wnd still starts at 4096 but now snd_cwnd starts at one segment (assume the saved MSS for this peer is 512). TCP will initially send just one segment, and when the peer's window advertisement is received, snd_cwnd will increase by one segment for each ACK. Slow start is now in control and the amount of data sent is limited by the congestion window, until the congestion window exceeds the peer's advertised window.
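The interplay of the two windows can be modeled with a small C sketch. The structure and function names below (send_limits, ttcp_initial_limits, sendable_bytes) are hypothetical and exist only for this illustration; the constants are the values quoted in the text, and window scaling is ignored.

```c
#include <stdint.h>

#define MAXWIN_ASSUMED   65535u  /* assumed maximum window advertisement */
#define TTCP_INITIAL_WND 4096u   /* initial send window assumed by T/TCP */

/* Hypothetical model of the two variables that bound TCP output. */
struct send_limits {
    uint32_t snd_wnd;   /* send window (peer's advertised window) */
    uint32_t snd_cwnd;  /* congestion window */
};

/* Initial values for a new T/TCP connection.
 * peer_is_local: result of the local/nonlocal test.
 * mss:           MSS saved for this peer. */
struct send_limits ttcp_initial_limits(int peer_is_local, uint32_t mss)
{
    struct send_limits l;
    l.snd_wnd = TTCP_INITIAL_WND;
    /* local peer: slow start is skipped; nonlocal peer: one segment */
    l.snd_cwnd = peer_is_local ? MAXWIN_ASSUMED : mss;
    return l;
}

/* Bytes that TCP output may send: the minimum of the two windows. */
uint32_t sendable_bytes(const struct send_limits *l)
{
    return l->snd_wnd < l->snd_cwnd ? l->snd_wnd : l->snd_cwnd;
}
```

For a local peer the limit is 4096 bytes until a window advertisement replaces snd_wnd (32768 in the example above); for a nonlocal peer with a saved MSS of 512 the limit is one 512-byte segment.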
As an example we modified our T/TCP client and server from Chapter 1 to send a request of 3300 bytes and a reply of 3400 bytes. Figure 3.9 shows the packet exchange.

This example shows a bug in Tcpdump's printing of relative sequence numbers for multisegment T/TCP exchanges. The acknowledgment number printed for segments 6, 8, and 10 should be 3302, not 1.

1   0.0                 bsdi.1057 > laptop.8888: S 3846892142:3846893590(1448) win 8712
2   0.001556 (0.0016)   bsdi.1057 > laptop.8888: 3846893591:3846895043(1452) win 8712
3   0.002672 (0.0011)   bsdi.1057 > laptop.8888: FP 3846895043:3846895443(400) win 8712
4   0.138283 (0.1356)   laptop.8888 > bsdi.1057: S 3786170031:3786170031(0) ack 3846895444 win 8712
5   0.139273 (0.0010)   bsdi.1057 > laptop.8888: ack 1 win 8712
6   0.179615 (0.0403)   laptop.8888 > bsdi.1057: . 1:1453(1452) ack 1 win 8712
7   0.180558 (0.0009)   bsdi.1057 > laptop.8888: ack 1453 win 7260
8   0.209621 (0.0291)   laptop.8888 > bsdi.1057: . 1453:2905(1452) ack 1 win 8712
9   0.210565 (0.0009)   bsdi.1057 > laptop.8888: ack 2905 win 7260
10  0.223822 (0.0133)   laptop.8888 > bsdi.1057: FP 2905:3401(496) ack 1 win 8712
11  0.224719 (0.0009)   bsdi.1057 > laptop.8888: ack 3402 win 8216

Figure 3.9 Client request of 3300 bytes and server reply of 3400 bytes.

Since the client knows that the server supports T/TCP it can send up to 4096 bytes immediately. Segments 1, 2, and 3 are sent in the first 2.6 ms. The first segment carries the SYN flag, 1448 bytes of data, and 12 bytes of TCP options (MSS and CC). The second segment has no flags, 1452 bytes of data, and 8 bytes of TCP options. The third segment carries the FIN and PSH flags, the remaining 400 bytes of data, and 8 bytes of TCP options. Segment 2 is unique in that none of the six TCP flags is set, not even the ACK flag. Normally the ACK flag is always on, except for a client's active open, which carries the SYN flag. (A client can never send an ACK until it receives a segment from the server.)
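The segment sizes in Figure 3.9 follow from simple arithmetic, which the C sketch below reproduces. It assumes that data plus TCP options in a segment may not exceed the MSS, and it takes the MSS to be 1460, a value inferred here from 1448 + 12 and 1452 + 8 rather than stated in the text. The function name split_request is hypothetical.

```c
#include <stddef.h>

/* Split `total` bytes of data into segments. Each segment carries at most
 * (mss - option bytes) of data; the first segment has `first_opts` bytes
 * of options, later segments have `other_opts`.
 * Stores the data length of each segment in out[0..max-1] and returns the
 * number of segments, or 0 if the options leave no room for data or if
 * `out` is too small. */
size_t split_request(size_t total, size_t mss, size_t first_opts,
                     size_t other_opts, size_t *out, size_t max)
{
    size_t n = 0;

    if (first_opts >= mss || other_opts >= mss)
        return 0;
    while (total > 0) {
        size_t room = mss - (n == 0 ? first_opts : other_opts);
        size_t len = total < room ? total : room;

        if (n == max)
            return 0;           /* caller's array is too small */
        out[n++] = len;
        total -= len;
    }
    return n;
}
```

Under these assumptions, calling split_request(3300, 1460, 12, 8, seg, 4) yields the three data lengths 1448, 1452, and 400 seen in segments 1 to 3, and a 3400-byte reply with 8 bytes of options per segment yields 1452, 1452, and 496, as in segments 6, 8, and 10.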
Segment 4 is the server's SYN and it also acknowledges everything the client sent: SYN, data, and FIN. The client immediately ACI
svr4.8888: F 301:301(0) ack 1 win 9216
4   0.012279 (0.0052)   svr4.8888 > bsdi.1031: ack 302 win 3796
5   0.071683 (0.0594)   svr4.8888 > bsdi.1031: P 1:401(400) ack 302 win 4096
6   0.072451 (0.0008)   bsdi.1031 > svr4.8888: ack 401 win 8816
7   0.078373 (0.0059)   svr4.8888 > bsdi.1031: F 401:401(0) ack 302 win 4096
8   0.079642 (0.0013)   bsdi.1031 > svr4.8888: ack 402 win 9216

Figure 3.10 T/TCP client sends transaction to TCP server.

The client TCP still sends a first segment containing the SYN, FIN, and PSH flags, along with the 300 bytes of data. A CCnew option is sent since the client TCP does not have a cached value for this server. The server responds with the normal second segment of the three-way handshake (line 2), which the client ACKs in line 3. Notice that the data is not retransmitted in line 3.

When the server receives the segment in line 3 it immediately ACI
sun.8888: SFP 2693814107:2693814407(300) win 8712
2   0.002808 (0.0028)   sun.8888 > bsdi.1033: S 3179040768:3179040768(0) ack 2693814108 win 8760 (DF)
3   0.003679 (0.0009)   bsdi.1033 > sun.8888: F 301:301(0) ack 1 win 8760
4   1.287379 (1.2837)   bsdi.1033 > sun.8888: FP 1:301(300) ack 1 win 8760
5   1.289048 (0.0017)   sun.8888 > bsdi.1033: ack 302 win 8760 (DF)
6   1.291323 (0.0023)   sun.8888 > bsdi.1033: P 1:401(400) ack 302 win 8760 (DF)
7   1.292101 (0.0008)   bsdi.1033 > sun.8888: ack 401 win 8360
8   1.292367 (0.0003)   sun.8888 > bsdi.1033: F 401:401(0) ack 302 win 8760 (DF)
9   1.293151 (0.0008)   bsdi.1033 > sun.8888: ack 402 win 8360

Figure 3.11 T/TCP client sending transaction to TCP server on Solaris 2.4.

Lines 1, 2, and 3 are the same as in Figure 3.10: a SYN, FIN, PSH, and the client's 300-byte request, followed by the server's SYN/ACK, followed by the client's ACK. This is the normal three-way handshake. Also, the client TCP sends a CCnew option, since it doesn't have a cached value for this server.
The presence of the "don't fragment" flag (DF) on each segment from the Solaris host is path MTU discovery (RFC 1191 [Mogul and Deering 1990]).

Unfortunately we now encounter a bug in the Solaris implementation. It appears the server's TCP discards the data that was sent on line 1 (the data is not acknowledged in segment 2), causing the client TCP to time out and retransmit the data on line 4. The FIN is also retransmitted. The server then ACI

            so_proto->pr_flags & PR_CONNREQUIRED) &&
                (so->so_proto->pr_flags & PR_IMPLOPCL) == 0) {
                if ((so->so_state & SS_ISCONFIRMING) == 0 &&
                    !(resid == 0 && clen != 0))
                    snderr(ENOTCONN);
            } else if (addr == 0)
                snderr(so->so_proto->pr_flags & PR_CONNREQUIRED ?
                       ENOTCONN : EDESTADDRREQ);
336     }
---------------------------------- uipc_socket.c

Figure 5.1 sosend function: error checking.

The next change to sosend is shown in Figure 5.2 and replaces lines 399-403 on p. 499 of Volume 2.

---------------------------------- uipc_socket.c
415     s = splnet();
416     /*
417      * If the user specifies MSG_EOF, and the protocol
418      * understands this flag (e.g., T/TCP), and there's
419      * nothing left to send, then PRU_SEND_EOF instead
420      * of PRU_SEND.  MSG_OOB takes priority, however.
421      */
422     req = (flags & MSG_OOB) ? PRU_SENDOOB :
423         ((flags & MSG_EOF) &&
424          (so->so_proto->pr_flags & PR_IMPLOPCL) &&
425          (resid <= 0)) ? PRU_SEND_EOF : PRU_SEND;
426     error = (*so->so_proto->pr_usrreq)(so, req, top, addr, control);
427     splx(s);
---------------------------------- uipc_socket.c

Figure 5.2 sosend function: protocol dispatch.

This is our first encounter with the comment XXX. It is a warning to the reader that the code is obscure, contains nonobvious side effects, or is a quick solution to a more difficult problem. In this case, the processor priority is being raised by splnet to prevent the protocol processing from executing. The processor priority is restored at the end of Figure 5.2 by splx. Section 1.12 of Volume 2 describes the various Net/3 interrupt levels.
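The choice of request made in Figure 5.2 can be restated as a small stand-alone C function. The enum names and flag values below are placeholders invented for this sketch; they are not the kernel's definitions, and only the order of the tests is taken from the figure and its comment.

```c
/* Placeholder flag values for this sketch only (not the kernel's). */
enum { SK_MSG_OOB = 0x1, SK_MSG_EOF = 0x100 };  /* per-call flags */
enum { SK_PR_IMPLOPCL = 0x20 };                 /* protocol flag  */

enum request { REQ_SEND, REQ_SENDOOB, REQ_SEND_EOF };

/* flags:    flags given by the caller of the send function.
 * pr_flags: flags of the protocol behind the socket.
 * resid:    bytes still to be handed to the protocol after this pass. */
enum request choose_request(int flags, int pr_flags, long resid)
{
    if (flags & SK_MSG_OOB)
        return REQ_SENDOOB;             /* out-of-band takes priority */
    if ((flags & SK_MSG_EOF) && (pr_flags & SK_PR_IMPLOPCL) && resid <= 0)
        return REQ_SEND_EOF;            /* last piece of the request */
    return REQ_SEND;
}
```

For a 3300-byte write with the end-of-file flag, a first pass that leaves 1252 bytes selects REQ_SEND and the final pass, with nothing left, selects REQ_SEND_EOF.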
416-427 If the MSG_OOB flag is specified, the PRU_SENDOOB request is issued. Otherwise, if the MSG_EOF flag is specified, and the protocol supports the PR_IMPLOPCL flag, and there is no more data to be sent to the protocol (resid is less than or equal to 0), then the PRU_SEND_EOF request is issued instead of the normal PRU_SEND request.

Recall our example in Section 3.6. The application calls sendto to write 3300 bytes, specifying the MSG_EOF flag. The first time around the loop in sosend the code in Figure 5.2 issues a PRU_SEND request for the first 2048 bytes of data (an mbuf cluster). The second time around the loop in sosend a request of PRU_SEND_EOF is issued for the remaining 1252 bytes of data (in another mbuf cluster).

5.4 Summary

T/TCP adds an implied open-

        rt_refcnt == 0) {       /* this is first reference */
74          if (rt->rt_prflags & RTPRF_OURS) {
75              rt->rt_prflags &= ~RTPRF_OURS;
76              rt->rt_rmx.rmx_expire = 0;
77          }
78      }
79      return (rn);
80  }
---------------------------------- in_rmx.c

Figure 6.6 in_matroute function.

Call rn_match to look up route

rn_match looks up the route in the routing table. If the route is found and the reference count is 0, this is the first reference to the routing table entry. If the entry was being timed out, that is, if the RTPRF_OURS flag is set, that flag is turned off and the rmx_expire timer is set to 0. This occurs when a route is closed, but then reused before the route is deleted.

6.9 in_clsroute Function

We mentioned earlier that a new function pointer, rnh_close, is added to the radix_node_head structure with the T/TCP changes. This function is called by rtfree when the reference count reaches 0. This causes in_clsroute, shown in Figure 6.7, to be called.
---------------------------------- in_rmx.c
 89 static void
 90 in_clsroute(struct radix_node *rn, struct radix_node_head *head)
 91 {
 92     struct rtentry *rt = (struct rtentry *) rn;

 93     if (!(rt->rt_flags & RTF_UP))
 94         return;
 95     if ((rt->rt_flags & (RTF_LLINFO | RTF_HOST)) != RTF_HOST)
 96         return;
 97     if ((rt->rt_prflags & (RTPRF_WASCLONED | RTPRF_OURS))
 98         != RTPRF_WASCLONED)
 99         return;
100     /*
101      * If rtq_reallyold is 0, just delete the route without
102      * waiting for a timeout cycle to kill it.
103      */
104     if (rtq_reallyold != 0) {
105         rt->rt_prflags |= RTPRF_OURS;
106         rt->rt_rmx.rmx_expire = time.tv_sec + rtq_reallyold;
107     } else {
108         rtrequest(RTM_DELETE,
109                   (struct sockaddr *) rt_key(rt),
110                   rt->rt_gateway, rt_mask(rt),
111                   rt->rt_flags, 0);
112     }
113 }
---------------------------------- in_rmx.c

Figure 6.7 in_clsroute function.

Check flags

93-99 The following tests are made: the route must be up, the RTF_HOST flag must be on (i.e., this is not a network route), the RTF_LLINFO flag must be off (this flag is turned on for ARP entries), RTPRF_WASCLONED must be on (the entry was cloned), and RTPRF_OURS must be off (we are not currently timing out this entry). If any of these tests fail, the function returns.

Set expiration time for routing table entry

100-112 In the common case, rtq_reallyold is nonzero, causing the RTPRF_OURS flag to be turned on and the rmx_expire time to be set to the current time in seconds (time.tv_sec) plus rtq_reallyold (normally 3600 seconds, or 1 hour). If the administrator sets rtq_reallyold to 0 using the sysctl program, then the route is immediately deleted by rtrequest.

6.10 in_rtqtimo Function

The in_rtqtimo function was called for the first time by in_inithead in Figure 6.4. Each time in_rtqtimo is called, it schedules itself to be called again in rtq_timeout seconds in the future (whose default is 600 seconds or 10 minutes).
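The tests that in_clsroute (Figure 6.7) applies before it starts timing out a route can be exercised outside the kernel with a small model. Everything named below (route_model, close_route, the RM_ and RMP_ flag values) is hypothetical and chosen only for this sketch; the logic mirrors the flag checks and the rtq_reallyold handling described above.

```c
/* Placeholder flag values for this sketch only (not the kernel's). */
enum { RM_UP = 0x1, RM_HOST = 0x4, RM_LLINFO = 0x400 };   /* models rt_flags   */
enum { RMP_WASCLONED = 0x1, RMP_OURS = 0x2 };             /* models rt_prflags */

struct route_model {
    int  flags;     /* models rt_flags */
    int  prflags;   /* models rt_prflags */
    long expire;    /* models rt_rmx.rmx_expire, in seconds */
};

enum close_action { CLOSE_IGNORE, CLOSE_START_TIMER, CLOSE_DELETE_NOW };

/* Decide what happens when the last reference to a route goes away.
 * now:       current time in seconds.
 * reallyold: lifetime of an unreferenced cloned route, 0 = delete at once. */
enum close_action close_route(struct route_model *rt, long now, long reallyold)
{
    if (!(rt->flags & RM_UP))
        return CLOSE_IGNORE;                        /* route is down */
    if ((rt->flags & (RM_LLINFO | RM_HOST)) != RM_HOST)
        return CLOSE_IGNORE;                        /* network route or ARP entry */
    if ((rt->prflags & (RMP_WASCLONED | RMP_OURS)) != RMP_WASCLONED)
        return CLOSE_IGNORE;                        /* not cloned, or already timing out */
    if (reallyold == 0)
        return CLOSE_DELETE_NOW;
    rt->prflags |= RMP_OURS;
    rt->expire = now + reallyold;
    return CLOSE_START_TIMER;
}
```

A cloned host route closed at time 100 with a lifetime of 3600 seconds gets an expiration time of 3700, while an entry with the link-layer flag set is left alone for ARP to time out.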
The purpose of in_rtqtimo is to walk the entire IP routing table (using the generic rn_walktree function), calling in_rtqkill for every entry. in_rtqkill makes the decision whether to delete the entry or not. Information needs to be passed from in_rtqtimo to in_rtqkill (recall Figure 6.1), and vice versa, and this is done through the third argument to rn_walktree. This argument is a pointer that is passed by rn_walktree to in_rtqkill. Since the argument is a pointer, information can be passed in either direction between in_rtqtimo and in_rtqkill.

The pointer passed by in_rtqtimo to rn_walktree is a pointer to an rtqk_arg structure, shown in Figure 6.8.

---------------------------------- in_rmx.c
114 struct rtqk_arg {
115     struct radix_node_head *rnh;    /* head of routing table */
116     int     found;                  /* found that we're timing out */
117     int     killed;                 /* #entries deleted by in_rtqkill */
118     int     updating;               /* set when deleting excess entries */
119     int     draining;               /* normally 0 */
120     time_t  nextstop;               /* time when to do it all again */
121 };
---------------------------------- in_rmx.c

Figure 6.8 rtqk_arg structure: information from in_rtqtimo to in_rtqkill and vice versa.

We'll see how these members are used as we look at the in_rtqtimo function, shown in Figure 6.9.

---------------------------------- in_rmx.c
159 static void
160 in_rtqtimo(void *rock)
161 {
162     struct radix_node_head *rnh = rock;
163     struct rtqk_arg arg;
164     struct timeval atv;
165     static time_t last_adjusted_timeout = 0;
166     int     s;

167     arg.rnh = rnh;
168     arg.found = arg.killed = arg.updating = arg.draining = 0;
169     arg.nextstop = time.tv_sec + rtq_timeout;
170     s = splnet();
171     rnh->rnh_walktree(rnh, in_rtqkill, &arg);
172     splx(s);

173     /*
174      * Attempt to be somewhat dynamic about this:
175      * If there are 'too many' routes sitting around taking up space,
176      * then crank down the timeout, and see if we can't make some more
177      * go away.  However, we make sure that we will never adjust more
178      * than once in rtq_timeout seconds, to keep from cranking down too
179      * hard.
180      */
181     if ((arg.found - arg.killed > rtq_toomany) &&
182         (time.tv_sec - last_adjusted_timeout >= rtq_timeout) &&
183         rtq_reallyold > rtq_minreallyold) {
184         rtq_reallyold = 2 * rtq_reallyold / 3;
185         if (rtq_reallyoldrnh_walktree(rnh, in_rtqkill, &arg);
            splx(s);
        atv.tv_usec = 0;
        atv.tv_sec = arg.nextstop;
        timeout(in_rtqtimo, rock, hzto(&atv));
---------------------------------- in_rmx.c

Figure 6.9 in_rtqtimo function.

Set rtqk_arg structure and call rn_walktree

167-172 The rtqk_arg structure is initialized by setting rnh to the head of the IP routing table, the counters found and killed to 0, the draining and updating flags to 0, and nextstop to the current time (in seconds) plus rtq_timeout (600 seconds, or 10 minutes). rn_walktree walks the entire IP routing table, calling in_rtqkill (Figure 6.11) for every entry.

Check for too many routing table entries

173-189 There are too many routing table entries if the following three conditions are all true:

1. The number of entries still in the routing table that we are timing out (found minus killed) exceeds rtq_toomany (which defaults to 128).

2. The number of seconds since we last performed this adjustment exceeds rtq_timeout (600 seconds, or 10 minutes).

3. rtq_reallyold exceeds rtq_minreallyold (which defaults to 10).

If these are all true, rtq_reallyold is set to two-thirds of its current value (using integer division). Since its value starts at 3600 seconds (60 minutes), it takes on the values 3600, 2400, 1600, 1066, 710, and so on. But this value is never allowed to go below rtq_minreallyold (which defaults to 10 seconds). The current time is saved in the static variable last_adjusted_timeout and a debug message is sent to the syslogd daemon. (Section 13.4.2 of [Stevens 1992] shows how the log function sends messages to the syslogd daemon.) The purpose of this code and the decreasing value of rtq_reallyold is to process the routing table more frequently, timing out old routes, as the routing table fills.

190-195 The counters found and killed in the rtqk_arg structure are initialized to 0 again, the updating flag is set to 1 this time, and rn_walktree is called again.

196-198 The in_rtqkill function sets the nextstop member of the rtqk_arg structure to the next time at which in_rtqtimo should be called again. The kernel's timeout function schedules this event in the future.

How much overhead is involved in walking through the entire routing table every 10 minutes? Obviously this depends on the number of entries in the table. In Section 14.10 we simulate the size of the T/TCP routing table for a busy Web server and find that even though the server is contacted by over 5000 different clients over a 24-hour period, with a 1-hour expiration time on the host routes, the routing table never exceeds about 550 entries. Some backbone routers on the Internet have tens of thousands of routing table entries today, but these are routers, not hosts. We would not expect a backbone router to require T/TCP and then have to walk through such a large routing table on a regular basis, purging old entries.

6.11 in_rtqkill Function

in_rtqkill is called by rn_walktree, which is called by in_rtqtimo. The purpose of in_rtqkill, which we show in Figure 6.11, is to delete IP routing table entries when necessary.

Only process entries that we are timing out

134-135 This function only processes entries with the RTPRF_OURS flag set, that is, entries that have been closed by in_clsroute (i.e., their reference counts have reached 0), and then only after a timeout period (normally 1 hour) has expired. This function does not affect routes that are currently in use (since the route's RTPRF_OURS flag will not be set).

136-146 If either the draining flag is set (which it never is in the current implementation) or the timeout has expired (the rmx_expire time is less than the current time), the route is deleted by rtrequest. The found member of the rtqk_arg structure counts the number of entries in the routing table with the RTPRF_OURS flag set, and the killed member counts the number of these that are deleted.

147-151
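The timeout adjustment made by in_rtqtimo and the per-entry decision made by in_rtqkill can be tried out with the following C sketch. The names shrink_reallyold, walk_state, and visit_entry are hypothetical, times are plain seconds held in a long, and treating an entry whose expiration time equals the current time as expired is an assumption of this model.

```c
/* One adjustment step: two-thirds of the current value (integer
 * division), never below the configured minimum. */
long shrink_reallyold(long reallyold, long minreallyold)
{
    reallyold = 2 * reallyold / 3;
    if (reallyold < minreallyold)
        reallyold = minreallyold;
    return reallyold;
}

/* Models the rtqk_arg members used by the per-entry decision. */
struct walk_state {
    int  found;     /* entries seen that are being timed out */
    int  killed;    /* entries deleted */
    int  updating;  /* nonzero on the second pass over the table */
    int  draining;  /* normally 0 */
    long nextstop;  /* next time the table should be walked */
};

/* Visit one entry that is being timed out.
 * Returns 1 if the entry should be deleted, else 0. When `updating` is
 * set, an expiration time too far in the future is pulled closer. */
int visit_entry(struct walk_state *ap, long *expire, long now, long reallyold)
{
    ap->found++;
    if (ap->draining || *expire <= now) {   /* expired (assumed <=) */
        ap->killed++;
        return 1;
    }
    if (ap->updating && *expire - now > reallyold)
        *expire = now + reallyold;
    if (*expire < ap->nextstop)
        ap->nextstop = *expire;
    return 0;
}
```

Starting from 3600, repeated calls to shrink_reallyold with a minimum of 10 give 2400, 1600, 1066, and 710. An entry due to expire at time 3700 that is visited at time 600 with updating set and a lifetime of 2400 has its expiration time moved to 3000.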
This else clause is executed when the current entry has not timed out. If the updating flag is set (which we saw in Figure 6.9 occurs when there are too many routes being expired and the entire routing table is processed a second time), and if the expiration time (which must be in the future for the subtraction to yield a positive result) is too far in the future, the expiration time is reset to the current time plus rtq_reallyold. To understand this, consider the example shown in Figure 6.10.

[Figure 6.10: timeline with marks at times 100 (in_clsroute), 600 (in_rtqtimo and in_rtqkill), 3000, and 3700. The expiration time is initially set to 3600 seconds in the future; at time 600 the difference is 3100, and the expiration time is reset to 2400 seconds in the future.]

Figure 6.10 in_rtqkill resetting an expiration time in the future.

---------------------------------- in_rmx.c
127 static int
128 in_rtqkill(struct radix_node *rn, void *rock)
129 {
130     struct rtqk_arg *ap = rock;
131     struct radix_node_head *rnh = ap->rnh;
132     struct rtentry *rt = (struct rtentry *) rn;
133     int     err;

134     if (rt->rt_prflags & RTPRF_OURS) {
135         ap->found++;
136         if (ap->draining || rt->rt_rmx.rmx_expire <= time.tv_sec) {
137             if (rt->rt_refcnt > 0)
138                 panic("rtqkill route really not free");
139             err = rtrequest(RTM_DELETE,
140                             (struct sockaddr *) rt_key(rt),
141                             rt->rt_gateway, rt_mask(rt),
142                             rt->rt_flags, 0);
143             if (err)
144                 log(LOG_WARNING, "in_rtqkill: error %d\n", err);
145             else
146                 ap->killed++;
147         } else {
148             if (ap->updating &&
149                 (rt->rt_rmx.rmx_expire - time.tv_sec > rtq_reallyold)) {
150                 rt->rt_rmx.rmx_expire = time.tv_sec + rtq_reallyold;
151             }
152             ap->nextstop = lmin(ap->nextstop, rt->rt_rmx.rmx_expire);
153         }
154     }
155     return (0);
156 }
---------------------------------- in_rmx.c

Figure 6.11 in_rtqkill function.

The x-axis is time in seconds. A route is closed by in_clsroute at time 100 (when its reference count reaches 0) and rtq_reallyold has its initial value of 3600 (1 hour).
The expiration time for the route is then 3700. But at time 600, in_rtqtimo executes and the route is not deleted (since its expiration time is 3100 seconds in the future), but there are too many entries, causing in_rtqtimo to reset rtq_reallyold to 2400, set updating to 1, and rn_walktree processes the entire IP routing table again. This time in_rtqkill finds updating set to 1 and the route will expire in 3100 seconds. Since 3100 is greater than 2400, the expiration time is reset to 2400 seconds in the future, namely, time 3000. As the routing table grows, the expiration times get shorter.

Calculate next timeout time

152-153 This code is executed every time an entry is found that is being expired but whose expiration time has not yet been reached. nextstop is set to the minimum of its current value and the expiration time of this routing table entry. Recall that the initial value of nextstop was set by in_rtqtimo to the current time plus rtq_timeout (i.e., 10 minutes in the future).

Consider the example shown in Figure 6.12. The x-axis is time in seconds and the large dots at times 0, 600, etc., are the times at which in_rtqtimo is called.

[Figure 6.12: timeline with marks at 0, 600, 1200, 1800, 2400, 3000, 3600, and 4500; in_clsroute is called at times 100 and 300, and the routes expire at times 3700 and 3900.]

Figure 6.12 Execution of in_rtqtimo based on expiration of routes.

An IP route is created by in_addroute and then closed by in_clsroute at time 100.
Its expiration time is set to 3700 (1 hour in the future). A second route is created and later closed at time 300, causing its expiration time to be set to 3900. in_rtqtimo executes every 10 minutes, at times 0, 600, 1200, 1800, 2400, 3000, and 3600. At times 0 through 3000 nextstop is set to the current time plus 600, so when in_rtqkill is called for each of the two routes at time 3000, nextstop is left at 3600 because 3600 is less than 3700 and less than 3900. But when in_rtqkill is called for each of these two routes at time 3600, nextstop becomes 3700 since 3700 is less than 3900 and less than 4200. This means in_rtqtimo will be called again at time 3700, instead of at time 4200. Furthermore, when in_rtqkill is called at time 3700, the other route that is due to expire at time 3900 causes nextstop to be set to 3900. Assuming there are no other IP routes expiring, after in_rtqtimo executes at time 3900, it will execute again at time 4500, 5100, and so on.

Interactions with Expiration Time

There are a few subtle interactions involving the expiration time of routing table entries and the rmx_expire member of the rt_metrics structure. First, this member is also used by ARP to time out ARP entries (Chapter 21 of Volume 2). This means the routing table entry for a host on the local subnet (along with its associated TAO information) is deleted when the ARP entry for that host is deleted, normally every 20 minutes. This is sooner than the default expiration time used by in_rtqkill (1 hour). Recall that in_clsroute explicitly ignored these ARP entries (Figure 6.7), which have the RTF_LLINFO flag set, allowing ARP to time them out, instead of in_rtqkill.
Second, executing the route program to fetch and print the metrics and expiration time for a cloned T/TCP routing table entry has the side effect of resetting the expiration time. This happens as follows. Assume a route is in use and then closed (its reference count becomes 0). When it is closed, its expiration time is set to 1 hour in the future. But 59 minutes later, 1 minute before it would have expired, the route program is used to print the metrics for the route. The following kernel functions execute as a result of the route program: route_output calls rtalloc1, which calls in_matroute (the Internet-specific rnh_matchaddr function), which increments the reference count, say, from 0 to 1. When this is complete, assuming the reference count goes from 1 to 0, rtfree calls in_clsroute, which resets the expiration time to 1 hour in the future.

6.12 Summary

With T/TCP we add 16 bytes to the rt_metrics structure. Ten of these bytes are used by T/TCP as the TAO cache: tao_cc, the latest CC received in a valid SYN from this peer, tao_ccsent, the latest CC sent to the peer, and tao_mssopt, the latest MSS received from the peer.

One new function pointer is added to the radix_node_head structure: the rnh_close member, which (if defined) is called when the reference count for a route reaches 0.

Four new functions are provided that are specific to the Internet protocols:

1. in_inithead initializes the Internet radix_node_head structure, setting the four function pointers that we're currently describing.

2. in_addroute is called when a new route is added to the IP routing table. It turns on the cloning flag for every IP route that is not a host route and is not a route to a multicast address.

3. in_matroute is called each time an IP route is looked up. If the route was currently being timed out by in_clsroute, its expiration time is reset to 0 since it is being used again.
4. in_clsroute is called when the last reference to an IP route is closed. It sets the expiration time for the route to be 1 hour in the future. We also saw that this time can be decreased if the routing table gets too large.

7 T/TCP Implementation: Protocol Control Blocks

7.1 Introduction

One small change is required to the PCB functions (Chapter 22 of Volume 2) for T/TCP. The function in_pcbconnect (Section 22.8 of Volume 2) is now divided into two pieces: an inner routine named in_pcbladdr, which assigns the local interface address, and the function in_pcbconnect, which performs the same function as before (and which calls in_pcbladdr).

We split the functionality because it is possible with T/TCP to issue a connect when a previous incarnation of the same connection (i.e., the same socket pair) is still in the TIME_WAIT state. If the duration of the previous connection was less than the MSL, and if both sides used the CC options, then the existing connection in the TIME_WAIT state is closed, and the new connection is allowed to proceed. If we didn't make this change, and T/TCP used the unmodified in_pcbconnect, the application would receive an "address already in use" error when the existing PCB in the TIME_WAIT state was encountered.

in_pcbconnect is called not only for a TCP connect, but also when a new TCP connection request arrives, for a UDP connect, and for a UDP sendto. Figure 7.1 summarizes these Net/3 calls, before our modifications.

Figure 7.1 Summary of Net/3 calls to in_pcbconnect. (Callers shown: UDP, TCP input, and the PRU_CONNECT request in tcp_usrreq.)

The calls by TCP input and UDP to in_pcbconnect remain the same, but the processing of a TCP connect (the PRU_CONNECT request) now calls the new function tcp_connect (Figures 12.2 and 12.3), which in turn calls the new function in_pcbladdr. Additionally, when a T/TCP client implicitly opens a connection using sendto or sendmsg, the resulting PRU_SEND or PRU_SEND_EOF request also calls tcp_connect. We show this new arrangement in Figure 7.2.
Figure 7.2 New arrangement of in_pcbconnect and in_pcbladdr.

7.2 in_pcbladdr Function

The first part of in_pcbladdr is shown in Figure 7.3. This portion shows only the arguments and the first two lines of code, which are identical to lines 138-139 on p. 736 of Volume 2.

-------------------------------------------------------------- in_pcb.c
136 int
137 in_pcbladdr(inp, nam, plocal_sin)
138 struct inpcb *inp;
139 struct mbuf *nam;
140 struct sockaddr_in **plocal_sin;
141 {
142     struct in_ifaddr *ia;
143     struct sockaddr_in *ifaddr;
144     struct sockaddr_in *sin = mtod(nam, struct sockaddr_in *);

145     if (nam->m_len != sizeof(*sin))
146         return (EINVAL);
-------------------------------------------------------------- in_pcb.c
Figure 7.3 in_pcbladdr function: first part.

136-140 The first two arguments are the same as those for in_pcbconnect and the third argument is a pointer to a pointer through which the local address is returned.

The remainder of this function is identical to Figures 22.25, 22.26, and most of Figure 22.27 of Volume 2. The final two lines in Figure 22.27, lines 225-226 on p. 739, are replaced with the code shown in Figure 7.4.

-------------------------------------------------------------- in_pcb.c
232         /*
233          * Don't call in_pcblookup here; return interface in
234          * plocal_sin and exit to caller, who will do the lookup.
235          */
236         *plocal_sin = &ia->ia_addr;
237     }
238     return (0);
239 }
-------------------------------------------------------------- in_pcb.c
Figure 7.4 in_pcbladdr function: final part.

If the caller specifies a wildcard local address, a pointer to the sockaddr_in structure is returned through the third argument.

Basically all that is done by in_pcbladdr is some error checking, special case handling of a destination address of 0.0.0.0 or 255.255.255.255, followed by an assignment of the local IP address (if the caller hasn't assigned one yet). The remainder of the processing required by a connect is handled by in_pcbconnect.
7.3 in_pcbconnect Function

Figure 7.5 shows the in_pcbconnect function. This function performs a call to in_pcbladdr, shown in the previous section, followed by the code from Figure 22.28, p. 739 of Volume 2.

Assign local address

255-259 The local IP address is calculated by in_pcbladdr, and returned through the ifaddr pointer, if the caller hasn't bound one to the socket yet.

Verify socket pair is unique

260-266 in_pcblookup verifies that the socket pair is unique. In the normal case of a TCP client calling connect (when the client has not bound a local port or a local address to the socket), the local port is 0, so in_pcblookup always returns 0, since a local port of 0 will not match any existing PCB.

Bind local address and local port, if not already bound

267-271 If a local address and a local port have not been bound to the socket, in_pcbbind assigns both. If a local address has not been bound, but the local port is nonzero, the local address returned by in_pcbladdr is stored in the PCB. It is not possible to bind a local address but still have a local port of 0, since the call to in_pcbbind to bind the local address also causes an ephemeral port to be assigned to the socket.

272-273 The foreign address and foreign port (arguments to in_pcbconnect) are stored in the PCB.

-------------------------------------------------------------- in_pcb.c
247 int
248 in_pcbconnect(inp, nam)
249 struct inpcb *inp;
250 struct mbuf *nam;
251 {
252     struct sockaddr_in *ifaddr;
253     struct sockaddr_in *sin = mtod(nam, struct sockaddr_in *);
254     int     error;

255     /*
256      * Call inner function to assign local interface address.
257      */
258     if (error = in_pcbladdr(inp, nam, &ifaddr))
259         return (error);

260     if (in_pcblookup(inp->inp_head,
261                      sin->sin_addr,
262                      sin->sin_port,
263                      inp->inp_laddr.s_addr ? inp->inp_laddr : ifaddr->sin_addr,
264                      inp->inp_lport,
265                      0))
266         return (EADDRINUSE);

267     if (inp->inp_laddr.s_addr == INADDR_ANY) {
268         if (inp->inp_lport == 0)
269             (void) in_pcbbind(inp, (struct mbuf *) 0);
270         inp->inp_laddr = ifaddr->sin_addr;
271     }
272     inp->inp_faddr = sin->sin_addr;
273     inp->inp_fport = sin->sin_port;
274     return (0);
275 }
-------------------------------------------------------------- in_pcb.c
Figure 7.5 in_pcbconnect function.

7.4 Summary

The T/TCP modifications remove all the code from the in_pcbconnect function that calculates the local address and create a new function named in_pcbladdr to perform this task. in_pcbconnect calls this function, and then completes the normal connection processing. This allows the processing of a T/TCP client connection request (either explicit using connect or implicit using sendto) to call in_pcbladdr to calculate the local address. The T/TCP client processing then duplicates the processing steps shown in Figure 7.5, but T/TCP allows a connection request to proceed even if there exists a previous incarnation of the same connection that is in the TIME_WAIT state. Normal TCP would not allow this to happen; instead in_pcbconnect would return EADDRINUSE from Figure 7.5.

8 T/TCP Implementation: TCP Overview

8.1 Introduction

This chapter covers the global changes that are made to the TCP data structures and functions for T/TCP. Two global variables are added: tcp_ccgen, the global CC counter, and tcp_do_rfc1644, a flag that specifies whether the CC options should be used. The protocol switch entry for TCP is modified to allow an implied open-close and four new variables are added to the TCP control block.

A simple change is made to the tcp_slowtimo function that measures the duration of every connection. Given the duration of a connection, T/TCP will truncate the TIME_WAIT state if the duration is less than MSL, as we described in Section 4.4.
8.2 Code Introduction

There are no new source files added with T/TCP, but some new variables are required.

Global Variables

Figure 8.1 shows the new global variables that are added with T/TCP, which we encounter throughout the TCP functions.

Variable          Datatype   Description
tcp_ccgen         tcp_cc     next CC value to send
tcp_do_rfc1644    int        if true (default), send CC or CCnew options

Figure 8.1 Global variables added with T/TCP.

We showed some examples of the tcp_ccgen variable in Chapter 3. We mentioned in Section 6.5 that the tcp_cc data type is typedefed to be an unsigned long. A value of 0 for a tcp_cc variable means undefined. The tcp_ccgen variable is always accessed as

    tp->cc_send = CC_INC(tcp_ccgen);

where cc_send is a new member of the TCP control block (shown later in Figure 8.3). The macro CC_INC is defined as

    #define CC_INC(c)   (++(c) == 0 ? ++(c) : (c))

Since the value is incremented before it is used, tcp_ccgen is initialized to 0 and its first value is 1.

Four macros are defined to compare CC values using modular arithmetic: CC_LT, CC_LEQ, CC_GT, and CC_GEQ. These four macros are identical to the four SEQ_xx macros defined on p. 810 of Volume 2.

The variable tcp_do_rfc1644 is similar to the variable tcp_do_rfc1323 introduced in Volume 2. If tcp_do_rfc1644 is 0, TCP will not send a CC or a CCnew option to the other end.

Statistics

Five new counters are added with T/TCP, which we show in Figure 8.2. These are added to the tcpstat structure, which is shown on p. 798 of Volume 2.

tcpstat member     Description
tcps_taook         #received SYNs with TAO OK
tcps_taofail       #received SYNs with CC option but fail TAO test
tcps_badccecho     #SYN/ACK segments with incorrect CCecho option
tcps_impliedack    #new SYNs that imply ACK of previous incarnation
tcps_ccdrop        #segments dropped because of invalid CC option

Figure 8.2 Additional T/TCP statistics maintained in the tcpstat structure.

The netstat program must be modified to print the values of these new members.
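As a rough illustration of the CC_INC macro and the modular CC comparisons described in Section 8.2, the following sketch can be compiled on its own. It assumes a 32-bit tcp_cc (the text typedefs it to unsigned long, which is 32 bits on a 32-bit system), and it writes the four comparison macros in the signed-difference style of the SEQ_xx macros, which is an assumption about their exact form. The helper cc_next is hypothetical and exists only to exercise the macro.

```c
#include <stdint.h>

/* Assumption: a 32-bit connection count, so the arithmetic wraps at 2^32. */
typedef uint32_t tcp_cc;

/* From the text: increment before use, and skip 0 (0 means "undefined"). */
#define CC_INC(c)    (++(c) == 0 ? ++(c) : (c))

/* Assumed form: modular comparisons via a signed 32-bit difference. */
#define CC_LT(a, b)  ((int32_t)((a) - (b)) < 0)
#define CC_LEQ(a, b) ((int32_t)((a) - (b)) <= 0)
#define CC_GT(a, b)  ((int32_t)((a) - (b)) > 0)
#define CC_GEQ(a, b) ((int32_t)((a) - (b)) >= 0)

/* Hypothetical helper: return the next CC value from a generator variable. */
static tcp_cc
cc_next(tcp_cc *gen)
{
    return CC_INC(*gen);
}
```

Starting from a generator of 0, the first value handed out is 1, and a generator at the maximum 32-bit value wraps past 0 to 1. Because the comparisons are modular, a CC of 1 compares greater than a CC of 0xffffffff.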
8.3 TCP protosw Structure

We mentioned in Chapter 5 that the pr_flags member of the TCP protosw entry, inetsw[2] (p. 801 of Volume 2), changes with T/TCP. The new socket-layer flag PR_IMPLOPCL must be included, along with the existing flags PR_CONNREQUIRED and PR_WANTRCVD. In sosend, this new flag allows a sendto on an unconnected socket if the caller specifies a destination address, and it causes the PRU_SEND_EOF request to be issued instead of the PRU_SEND request when the MSG_EOF flag is specified.

A related change to the protosw entry, which is not required by T/TCP, is to define the function named tcp_sysctl as the pr_sysctl member. This allows the system administrator to modify the values of some of the variables that control the operation of TCP by using the sysctl program with the prefix net.inet.tcp. (The Net/3 code in Volume 2 only provided sysctl control over some IP, ICMP, and UDP variables, through the functions ip_sysctl, icmp_sysctl, and udp_sysctl.) We show the tcp_sysctl function in Figure 12.6.

8.4 TCP Control Block

Four new variables are added to the TCP control block, the tcpcb structure shown on pp. 804-805 of Volume 2. Rather than showing the entire structure, we show only the new members in Figure 8.3.

Variable      Datatype   Description
t_duration    u_long     connection duration in 500-ms ticks
t_maxopd      u_short    MSS plus length of normal options
cc_send       tcp_cc     CC value to send to peer
cc_recv       tcp_cc     CC value received from peer

Figure 8.3 New members of tcpcb structure added with T/TCP.

t_duration is needed to determine if T/TCP can truncate the TIME_WAIT state, as we discussed in Section 4.4. Its value starts at 0 when a control block is created and is then incremented every 500 ms by tcp_slowtimo (Section 8.6).

t_maxopd is maintained for convenience in the code. It is the value of the existing t_maxseg member, plus the number of bytes normally occupied by TCP options.
t_maxseg is the number of bytes of data per segment. For example, on an Ethernet with an MTU of 1500 bytes, if both timestamps and T/TCP are in use, t_maxopd will be 1460 and t_maxseg will be 1440. The difference of 20 bytes accounts for 12 bytes for the timestamp option plus 8 bytes for the CC option (Figure 2.4). t_maxopd and t_maxseg are both calculated and stored in the tcp_mssrcvd function.

The last two variables are taken from RFC 1644 and examples were shown of these three variables in Chapter 2. If the CC options were used by both hosts for a connection, cc_recv will be nonzero.

Six new flags are defined for the t_flags member of the TCP control block. These are shown in Figure 8.4 and are in addition to the nine flags shown in Figure 24.14, p. 805 of Volume 2.

t_flags          Description
TF_SENDSYN       send SYN (hidden state flag for half-synchronized connection)
TF_SENDFIN       send FIN (hidden state flag)
TF_SENDCCNEW     send CCnew option instead of CC option for active open
TF_NOPUSH        do not send segment just to empty send buffer
TF_RCVD_CC       set when other side sends CC option in SYN
TF_REQ_CC        have/will request CC option in SYN

Figure 8.4 New t_flags values with T/TCP.

Don't confuse the T/TCP flag TF_SENDFIN, which means TCP needs to send a FIN, with the existing flag TF_SENTFIN, which means a FIN has been sent.

The names TF_SENDSYN and TF_SENDFIN are taken from Bob Braden's T/TCP implementation. The FreeBSD implementation changed these two names to TF_NEEDSYN and TF_NEEDFIN. We chose the former names, since the new flags specify that the control flags must be sent, whereas the latter have the incorrect implication of needing a SYN or a FIN to be received. Be careful, however, because with the chosen names there is only one character difference between the T/TCP TF_SENDFIN flag and the existing TF_SENTFIN flag (which indicates that TCP has already sent a FIN).

We describe the TF_NOPUSH and TF_SENDCCNEW flags in the next chapter, Figures 9.3 and 9.7 respectively.
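The t_maxopd and t_maxseg arithmetic given for Figure 8.3 can be checked with a small sketch. This is only an illustration of the numbers in the text, not the tcp_mssrcvd code: the function name, the structure, and the constants are hypothetical, and it assumes a 20-byte IP header and a 20-byte TCP header with no IP options.

```c
/* Assumed header sizes (no IP options) and the option sizes from the text. */
#define IP_HDRLEN     20
#define TCP_HDRLEN    20
#define TSTMP_OPTLEN  12   /* timestamp option, padded with two NOPs */
#define CC_OPTLEN      8   /* CC option, padded with two NOPs */

/* Hypothetical container for the two values discussed in the text. */
struct seg_sizes {
    int t_maxopd;   /* data plus normal TCP options per segment */
    int t_maxseg;   /* data bytes per segment */
};

/* Hypothetical helper: derive both sizes from an interface MTU. */
static struct seg_sizes
seg_sizes_for_mtu(int mtu, int use_tstmp, int use_cc)
{
    struct seg_sizes s;
    int optlen = 0;

    if (use_tstmp)
        optlen += TSTMP_OPTLEN;
    if (use_cc)
        optlen += CC_OPTLEN;

    s.t_maxopd = mtu - IP_HDRLEN - TCP_HDRLEN;
    s.t_maxseg = s.t_maxopd - optlen;
    return s;
}
```

For an Ethernet MTU of 1500 with both options enabled, this gives a t_maxopd of 1460 and a t_maxseg of 1440, matching the example in the text. With neither option in use the two values are equal.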
8.5 tcp_init Function

No explicit initialization is required of any T/TCP variables, and the tcp_init function in Volume 2 is unchanged. The global tcp_ccgen is an uninitialized external that defaults to 0 by the rules of C. This is OK because the CC_INC macro defined in Section 8.2 increments the variable before using it, so the first value of tcp_ccgen after a reboot will be 1.

T/TCP also requires that the TAO cache be cleared on a reboot, but that is handled implicitly because the IP routing table is initialized on a reboot. Each time a new rtentry structure is added to the routing table, rtrequest initializes the structure to 0 (p. 610 of Volume 2). This means the three TAO variables in the rmxp_tao structure (Figure 6.3) default to 0. An initial value of 0 for tao_cc is required by T/TCP when a new TAO entry is created for a new host.

8.6 tcp_slowtimo Function

A one-line addition is made to one of the two TCP timing functions: for each TCP control block the t_duration member is incremented each time the 500-ms timer is processed, the tcp_slowtimo function shown on p. 823 of Volume 2. The following line

    tp->t_duration++;

is added between lines 94 and 95 of this figure. The purpose of this variable is to measure the length of each connection in 500-ms ticks. If the connection duration is less than the MSL, the TIME_WAIT state can be truncated, as discussed in Section 4.4.

Related to this optimization is the addition of the following constant to the header:

    #define TCPTV_TWTRUNC   8   /* RTO factor to truncate TIME_WAIT */

We'll see in Figures 11.17 and 11.19 that when a T/TCP connection is actively closed, and the value of t_duration is less than TCPTV_MSL (sixty 500-ms ticks, or 30 seconds), then the duration of the TIME_WAIT state is the current retransmission timeout (RTO) times TCPTV_TWTRUNC. On a LAN, where the RTO is normally 3 clock ticks or 1.5 seconds, this decreases the TIME_WAIT state to 12 seconds.
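The TIME_WAIT truncation rule just described can be sketched as a small calculation. This is a simplified illustration, not the code of Figures 11.17 and 11.19: the function time_wait_ticks is hypothetical, it ignores any other conditions the real code checks, and it assumes the untruncated TIME_WAIT state lasts twice the MSL.

```c
#define PR_SLOWHZ       2                   /* 500-ms ticks per second */
#define TCPTV_MSL       (30 * PR_SLOWHZ)    /* sixty ticks, or 30 seconds */
#define TCPTV_TWTRUNC   8                   /* RTO factor to truncate TIME_WAIT */

/*
 * Hypothetical helper: length of the TIME_WAIT state in 500-ms ticks for an
 * actively closed T/TCP connection, given its duration and current RTO,
 * both measured in ticks.
 */
static int
time_wait_ticks(unsigned long t_duration, int rto_ticks)
{
    if (t_duration < TCPTV_MSL)
        return rto_ticks * TCPTV_TWTRUNC;   /* truncated TIME_WAIT */
    return 2 * TCPTV_MSL;                   /* assumed normal 2MSL wait */
}
```

With the LAN figures from the text, a short connection with an RTO of 3 ticks waits 24 ticks (12 seconds), while a connection that lasted 60 ticks or more waits the full 120 ticks under the stated assumption.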
8.7 Summary

T/TCP adds two new global variables (tcp_ccgen and tcp_do_rfc1644), four new members to the TCP control block, and five new counters to the TCP statistics structure.

The tcp_slowtimo function is also changed to count the duration of each TCP connection in 500-ms clock ticks. This duration determines whether T/TCP can truncate the TIME_WAIT state if the connection is actively closed.

9 T/TCP Implementation: TCP Output

9.1 Introduction

This chapter describes the changes made to the tcp_output function to support T/TCP. This function is called from numerous places within TCP to determine if a segment should be sent, and then to send one if necessary. The following changes are made with T/TCP:

• The two hidden state flags can turn on the TH_SYN and TH_FIN flags.

• T/TCP allows multiple segments to be sent in the SYN_SENT state, but only if we know that the peer understands T/TCP.

• Sender silly window avoidance must take into account the new TF_NOPUSH flag, which we described in Section 3.6.

• The new T/TCP options (CC, CCnew, and CCecho) can be sent.

9.2 tcp_output Function

Automatic Variables

Two new automatic variables are declared within tcp_output:

    struct rmxp_tao *taop;
    struct rmxp_tao tao_noncached;

The first is a pointer to the TAO cache for the peer. If no TAO cache entry exists (which shouldn't happen), taop points to tao_noncached and this latter structure is initialized to 0 (therefore its tao_cc value is undefined).

Add Hidden State Flags

At the beginning of tcp_output the TCP flags corresponding to the current connection state are fetched from the tcp_outflags array. Figure 2.7 shows the flags for each state. The code shown in Figure 9.1 logically ORs in the TH_FIN flag and the TH_SYN flag, if the corresponding hidden state flag is on.

---------------------------------------------------------- tcp_output.c
 71 again:
 72     sendalot = 0;
 73     off = tp->snd_nxt - tp->snd_una;
 74     win = min(tp->snd_wnd, tp->snd_cwnd);

 75     flags = tcp_outflags[tp->t_state];
 76     /*
 77      * Modify standard flags, adding SYN or FIN if requested by the
 78      * hidden state flags.
 79      */
 80     if (tp->t_flags & TF_SENDFIN)
 81         flags |= TH_FIN;
 82     if (tp->t_flags & TF_SENDSYN)
 83         flags |= TH_SYN;
---------------------------------------------------------- tcp_output.c
Figure 9.1 tcp_output: add hidden state flags.

This code is located on pp. 853-854 of Volume 2.

Don't Resend SYN in SYN_SENT State

Figure 9.2 fetches the TAO cache for this peer and a check is made to determine whether a SYN has already been sent. This code is located at the beginning of Figure 26.3 of Volume 2.

Fetch TAO cache entry

117-119 The TAO cache for the peer is fetched, and if one doesn't exist, the automatic variable tao_noncached is used, and is initialized to 0.

If this all-zero entry is used, it is never modified. Therefore the tao_noncached structure could be statically allocated and initialized to 0, instead of being set to 0 by bzero.

Check if client request exceeds MSS

121-133 If the state indicates that a SYN is to be sent, and if a SYN has already been sent, then the TH_SYN flag is turned off. This can occur when the application sends more than one MSS of data to a peer using T/TCP (Section 3.6). If the peer understands T/TCP, then multiple segments can be sent, but only the first one should have the SYN flag set. If we don't know that the peer understands T/TCP (tao_ccsent is 0) then we do not send multiple segments of data until the three-way handshake is complete.

---------------------------------------------------------- tcp_output.c
116     len = min(so->so_snd.sb_cc, win) - off;

117     if ((taop = tcp_gettaocache(tp->t_inpcb)) == NULL) {
118         taop = &tao_noncached;
119         bzero(taop, sizeof(*taop));
120     }
121     /*
122      * Turn off SYN bit if it has already been sent.
123      * Also, if the segment contains data, and if in the SYN-SENT state,
124      * and if we don't know that foreign host supports TAO, suppress
125      * sending segment.
126      */
127     if ((flags & TH_SYN) && SEQ_GT(tp->snd_nxt, tp->snd_una)) {
128         flags &= ~TH_SYN;
129         off--, len++;
130         if (len > 0 && tp->t_state == TCPS_SYN_SENT &&
131             taop->tao_ccsent == 0)
132             return (0);
133     }
134     if (len ...

        ...

288     if ((tp->t_flags & (TF_REQ_TSTMP | TF_NOOPT)) == TF_REQ_TSTMP &&
289         (flags & TH_RST) == 0 &&
290         ((flags & TH_ACK) == 0 ||
291          (tp->t_flags & TF_RCVD_TSTMP))) {
---------------------------------------------------------- tcp_output.c
Figure 9.5 tcp_output: send a timestamp option?

With T/TCP the first half of the third test changes because we want to send the timestamp option on all initial segments from the client to the server (in the case of a multisegment request, as shown in Figure 3.9), not just the first segment that contains the SYN. The new test for all these initial segments is the absence of the ACK flag.

Send T/TCP CC options

The first test for sending one of the three new CC options is that the TF_REQ_CC flag is on (which is enabled by tcp_newtcpcb if the global tcp_do_rfc1644 is nonzero), and the TF_NOOPT flag is off, and the RST flag is not on. Which CC option to send depends on the status of the SYN flag and the ACK flag in the outgoing segment. This gives four potential combinations, the first two of which are shown in Figure 9.6. (This code goes between lines 268-269 on p. 873 of Volume 2.)

The TF_NOOPT flag is controlled by the new TCP_NOOPT socket option. This socket option appeared in Thomas Skibo's RFC 1323 code (Section 12.7). As noted in Volume 2, this flag (but not the socket option) has been in the Berkeley code since 4.2BSD, but there has normally been no way to turn it on. If the option is set, TCP does not send any options with its SYN. The option was added to cope with nonconforming TCP implementations that do not ignore unknown TCP options (since the RFC 1323 changes added two new TCP options).
The T/TCP changes do not change the code that determines whether the MSS option should be sent (p. 872 of Volume 2). This code does not send an MSS option if the TF_NOOPT flag is set. But Bob Braden notes in his RFC 1323 code that there is really no reason to suppress sending an MSS option. The MSS option was part of the original RFC 793 specification.

---------------------------------------------------------- tcp_output.c
299     /*
300      * Send CC-family options if our side wants to use them (TF_REQ_CC),
301      * options are allowed (!TF_NOOPT) and it's not a RST.
302      */
303     if ((tp->t_flags & (TF_REQ_CC | TF_NOOPT)) == TF_REQ_CC &&
304         (flags & TH_RST) == 0) {
305         switch (flags & (TH_SYN | TH_ACK)) {

306             /*
307              * This is a normal ACK (no SYN);
308              * send CC if we received CC from our peer.
309              */
310         case TH_ACK:
311             if (!(tp->t_flags & TF_RCVD_CC))
312                 break;
313             /* FALLTHROUGH */

314             /*
315              * We can only get here in T/TCP's SYN_SENT* state, when
316              * we're sending a non-SYN segment without waiting for
317              * the ACK of our SYN.  A check earlier in this function
318              * assures that we only do this if our peer understands T/TCP.
319              */
320         case 0:
321             opt[optlen++] = TCPOPT_NOP;
322             opt[optlen++] = TCPOPT_NOP;
323             opt[optlen++] = TCPOPT_CC;
324             opt[optlen++] = TCPOLEN_CC;
325             *(u_int32_t *) & opt[optlen] = htonl(tp->cc_send);
326             optlen += 4;
327             break;
---------------------------------------------------------- tcp_output.c
Figure 9.6 tcp_output: send one of the CC options, first part.

SYN off, ACK on

310-313 If the SYN flag is off but the ACK ...

---------------------------------------------------------- tcp_output.c
            ...
338             *(u_int32_t *) & opt[optlen] = htonl(tp->cc_send);
339             optlen += 4;
340             break;

341             /*
342              * This is a SYN,ACK (server response to client active open).
343              * Send CC and CCecho if we received CC or CCnew from peer.
344              */
345         case (TH_SYN | TH_ACK):
346             if (tp->t_flags & TF_RCVD_CC) {
347                 opt[optlen++] = TCPOPT_NOP;
348                 opt[optlen++] = TCPOPT_NOP;
349                 opt[optlen++] = TCPOPT_CC;
350                 opt[optlen++] = TCPOLEN_CC;
351                 *(u_int32_t *) & opt[optlen] = htonl(tp->cc_send);
352                 optlen += 4;
353                 opt[optlen++] = TCPOPT_NOP;
354                 opt[optlen++] = TCPOPT_NOP;
355                 opt[optlen++] = TCPOPT_CCECHO;
356                 opt[optlen++] = TCPOLEN_CC;
357                 *(u_int32_t *) & opt[optlen] = htonl(tp->cc_recv);
358                 optlen += 4;
359             }
360             break;
361         }
362     }
363     hdrlen += optlen;
---------------------------------------------------------- tcp_output.c
Figure 9.7 tcp_output: send one of the CC options, second part.

SYN on, ACK off (client active open)

328-340 The SYN flag is on and the ACK flag is off when the client performs an active open. The code in Figure 12.3 sets the flag TF_SENDCCNEW if a CCnew option should be sent instead of a CC option, and also sets the value of cc_send.

SYN on, ACK on (server response to client SYN)

341-360 If both the SYN flag and the ACK flag are on, this is a server's response to the peer's active open. If the peer sent either a CC or a CCnew option (TF_RCVD_CC is set), then we send both a CC option (cc_send) and a CCecho of the peer's CC value (cc_recv).

Adjust TCP header length for TCP options

363 The length of the TCP header is increased by all the TCP options (if any).

Adjust Data Length for TCP Options

t_maxopd is a new member of the tcpcb structure and is the maximum length of data and options in a normal TCP segment. It is possible for the options in a SYN segment (Figures 2.2 and 2.3) to require more room than the options in a non-SYN segment (Figure 2.4), since both the window scale option and the CCecho option appear only in SYN segments. The code in Figure 9.8 adjusts the amount of data to send, based on the size of the TCP options. This code replaces lines 270-277 on p. 873 of Volume 2.

---------------------------------------------------------- tcp_output.c
364     /*
365      * Adjust data length if insertion of options will
366      * bump the packet length beyond the t_maxopd length.
367      * Clear the FIN bit because we cut off the tail of
368      * the segment.
369      */
370     if (len + optlen > tp->t_maxopd) {
371         /*
372          * If there is still more to send, don't close the connection.
373          */
374         flags &= ~TH_FIN;
375         len = tp->t_maxopd - optlen;
376         sendalot = 1;
377     }
---------------------------------------------------------- tcp_output.c
Figure 9.8 tcp_output: adjust amount of data to send based on size of TCP options.

364-377 If the size of the data (len) plus the size of the options exceeds t_maxopd, the amount of data to send is adjusted downward, the FIN flag is turned off (in case it is on), and sendalot is turned on (which forces another loop through tcp_output after the current segment is sent).

This code is not specific to T/TCP. It should be used with any TCP option that appears on a segment carrying data (e.g., the RFC 1323 timestamp option).

9.3 Summary

T/TCP adds about 100 lines to the 500-line tcp_output function. Most of this code involves sending the new T/TCP options, CC, CCnew, and CCecho.

Additionally, with T/TCP tcp_output allows multiple segments to be sent in the SYN_SENT state, if the peer understands T/TCP.

10 T/TCP Implementation: TCP Functions

10.1 Introduction

This chapter covers the miscellaneous TCP functions that change with T/TCP. That is, all the functions other than tcp_output (previous chapter), tcp_input, and tcp_usrreq (next two chapters). This chapter defines two new functions, tcp_rtlookup and tcp_gettaocache, that look up entries in the TAO cache.

The tcp_close function is modified to save the round-trip time estimators (smoothed estimators of the mean and mean deviation) in the routing table when a connection is closed that used T/TCP. Normally these estimators are saved only if at least 16 full-sized segments were sent on the connection. T/TCP, however, normally sends much less data, but these estimators should be maintained across different connections to the same peer.
The handling of the MSS option also changes with T/TCP. Some of this change is to clean up the overloaded tcp_mss function from Net/3, dividing it into one function that calculates the MSS to send (tcp_msssend) and another function that processes a received MSS option (tcp_mssrcvd). T/TCP also saves the last MSS value received from that peer in the TAO cache. This initializes the sending MSS when T/TCP sends data with a SYN, before receiving the server's SYN and MSS.

The tcp_dooptions function from Net/3 is changed to recognize the three new T/TCP options: CC, CCnew, and CCecho.

10.2 tcp_newtcpcb Function

This function is called when a new socket is created by the PRU_ATTACH request. The five lines of code in Figure 10.1 replace lines 177-178 on p. 833 of Volume 2.

------------------------------------------------------------ tcp_subr.c
180     tp->t_maxseg = tp->t_maxopd = tcp_mssdflt;

181     if (tcp_do_rfc1323)
182         tp->t_flags = (TF_REQ_SCALE | TF_REQ_TSTMP);
183     if (tcp_do_rfc1644)
184         tp->t_flags |= TF_REQ_CC;
------------------------------------------------------------ tcp_subr.c
Figure 10.1 tcp_newtcpcb function: T/TCP changes.

180 As mentioned with regard to Figure 8.3, t_maxopd is the maximum number of bytes of data plus TCP options that are sent in each segment. Both it and t_maxseg default to 512 (tcp_mssdflt). Since the two are equal, this assumes no TCP options will be sent in each segment. In Figures 10.13 and 10.14, shown later, t_maxseg is decreased if either the timestamp option or the CC option (or both) will be sent in each segment.

183-184 If the global tcp_do_rfc1644 is nonzero (it defaults to 1), the TF_REQ_CC flag is set, which causes tcp_output to send a CC or a CCnew option with a SYN (Figure 9.6).

10.3 tcp_rtlookup Function

The first operation performed by tcp_mss (p. 898 of Volume 2) is to fetch the cached route for this connection (which is stored in the inp_route member of the Internet PCB), calling rtalloc to look up the route if one has not been cached yet for this connection. This operation is now placed into a separate function, tcp_rtlookup, which we show in Figure 10.3. This is done because the same operation is performed more often by T/TCP since the routing table entry for the connection contains the TAO information.

438-452 If a route is not yet cached for this connection, rtalloc calculates the route. But a route can only be calculated if the foreign address in the PCB is nonzero. Before rtalloc is called, the sockaddr_in structure within the route structure is filled in.

Figure 10.2 shows the route structure, one of which is contained within each Internet PCB.

--------------------------------------------------------------- route.h
46 struct route {
47     struct rtentry *ro_rt;      /* pointer to struct with information */
48     struct sockaddr ro_dst;     /* destination of this route */
49 };
--------------------------------------------------------------- route.h
Figure 10.2 route structure.
Figure 10.4 summarizes these structures, assuming the foreign address is 128.32.33.5.

------------------------------------------------------------ tcp_subr.c
432 struct rtentry *
433 tcp_rtlookup(inp)
434 struct inpcb *inp;
435 {
436     struct route *ro;
437     struct rtentry *rt;

438     ro = &inp->inp_route;
439     rt = ro->ro_rt;
440     if (rt == NULL) {
441         /* No route yet, so try to acquire one */
442         if (inp->inp_faddr.s_addr != INADDR_ANY) {
443             ro->ro_dst.sa_family = AF_INET;
444             ro->ro_dst.sa_len = sizeof(ro->ro_dst);
445             ((struct sockaddr_in *) &ro->ro_dst)->sin_addr =
446                 inp->inp_faddr;
447             rtalloc(ro);
448             rt = ro->ro_rt;
449         }
450     }
451     return (rt);
452 }
------------------------------------------------------------ tcp_subr.c

Figure 10.3 tcp_rtlookup function.

Figure 10.4 Summary of cached route within Internet PCB.

10.4 tcp_gettaocache Function

The TAO information for a given host is maintained in that host's routing table entry, specifically in the rmx_filler field of the rt_metrics structure (Section 6.5). The function tcp_gettaocache, shown in Figure 10.5, returns a pointer to the host's TAO cache.

------------------------------------------------------------ tcp_subr.c
458 struct rmxp_tao *
459 tcp_gettaocache(inp)
460 struct inpcb *inp;
461 {
462     struct rtentry *rt = tcp_rtlookup(inp);

463     /* Make sure this is a host route and is up. */
464     if (rt == NULL ||
465         (rt->rt_flags & (RTF_UP | RTF_HOST)) != (RTF_UP | RTF_HOST))
466         return (NULL);

467     return (rmx_taop(rt->rt_rmx));
468 }
------------------------------------------------------------ tcp_subr.c

Figure 10.5 tcp_gettaocache function.

460-468 tcp_rtlookup returns the pointer to the foreign host's rtentry structure. If that succeeds and if both the RTF_UP and RTF_HOST flags are on, the rmx_taop macro (Figure 6.3) returns a pointer to the rmxp_tao structure.

10.5 Retransmission Timeout Calculations

Net/3 TCP calculates the retransmission timeout (RTO) by measuring the round-trip time of data segments and keeping track of a smoothed RTT estimator (srtt) and a smoothed mean deviation estimator (rttvar). The mean deviation is a good approximation of the standard deviation, but easier to compute since, unlike the standard deviation, the mean deviation does not require square root calculations. [Jacobson 1988] provides additional details on these RTT measurements, which lead to the following equations:

\[
\begin{aligned}
\mathit{delta}  &= \mathit{data} - \mathit{srtt} \\
\mathit{srtt}   &\leftarrow \mathit{srtt} + g \times \mathit{delta} \\
\mathit{rttvar} &\leftarrow \mathit{rttvar} + h\,(|\mathit{delta}| - \mathit{rttvar}) \\
\mathit{RTO}    &= \mathit{srtt} + 4 \times \mathit{rttvar}
\end{aligned}
\]

where delta is the difference between the measured round-trip time just obtained (data) and the current smoothed RTT estimator (srtt); g is the gain applied to the RTT estimator and equals 1/8; and h is the gain applied to the mean deviation estimator and equals 1/4. The two gains and the multiplier 4 in the RTO calculation are purposely powers of 2, so they can be calculated using shift operations instead of multiplying or dividing. Chapter 25 of Volume 2 provides details on how these values are maintained using fixed-point integers.

On a normal TCP connection there are usually multiple RTTs to sample when calculating the two estimators srtt and rttvar, at least two samples given the minimal TCP connection shown in Figure 1.9. Furthermore, under certain conditions Net/3 will maintain these two estimators over multiple connections between the same hosts. This is done by the tcp_close function, when a connection is closed, if at least 16 RTT samples were obtained and if the routing table entry for the peer is not the default route.
The values are stored in the rmx_rtt and rmx_rttvar members of the rt_metrics structure in the routing table entry. The two estimators srtt and rttvar are initialized to the values from the routing table entry by tcp_mssrcvd (Section 10.8) when a new connection is initialized.

The problem that arises with T/TCP is that a minimal connection involves only one RTT measurement, and since fewer than 16 samples is the norm, nothing is maintained between successive T/TCP connections between the same peers. This means T/TCP never has a good estimate of what the RTO should be when the first segment is sent. Section 25.8 of Volume 2 discusses how the initialization done by tcp_newtcpcb causes the first RTO to be 6 seconds.

While it is not hard to have tcp_close save the smoothed estimators for a T/TCP connection even if fewer than 16 samples are collected (we'll see the changes in Section 10.6), the question is this: how are the new estimators merged with the previous estimators? Unfortunately, this is still a research problem [Paxson 1995a].

To understand the different possibilities, consider Figure 10.6. One hundred 400-byte UDP datagrams were sent from one of the author's hosts across the Internet (on a weekday afternoon, normally the most congested time on the Internet) to the echo server on another host. Ninety-three datagrams were returned (7 were lost somewhere on the Internet) and we show the first 91 of these in Figure 10.6. The samples were collected over a 30-minute period and the amount of time between each datagram was a uniformly distributed random number between 0 and 30 seconds. The actual RTTs were obtained by running Tcpdump on the client host. The bullets are the measured RTTs. The other three solid lines (RTO, srtt, and rttvar, from top to bottom) are calculated from the measured RTT using the formulas shown at the beginning of this section.
The calculations were done using floating-point arithmetic, not with the fixed-point integers actually used in Net/3. The RTO that is shown is the value calculated using the corresponding data point. That is, the RTO for the first data point (about 2200 ms) is calculated using the first data point, and would be used as the RTO for the next segment that is sent.

Although the measured RTTs average just under 800 ms (the author's client system is behind a dialup PPP link to the Internet and the server was across the country), the 26th sample has an RTT of almost 1400 ms and a few more after this point are around 1000 ms. As noted in [Jacobson 1994], "whenever there are competing connections sharing a path, transient RTT fluctuations of twice the minimum are completely normal (they just represent other connections starting or restarting after a loss) so it is never reasonable for RTO to be less than 2 x RTT."

Figure 10.6 RTT measurements and corresponding RTO, srtt, and rttvar. [Plot: the horizontal axis is the sample number.]

When new values of the estimators are stored in the routing table entry, a decision must be made about how much of the new information is stored, versus the past history. That is, the formulas are

$$
\begin{aligned}
\mathit{savesrtt} &= g \times \mathit{savesrtt} + (1 - g) \times \mathit{srtt} \\
\mathit{saverttvar} &= g \times \mathit{saverttvar} + (1 - g) \times \mathit{rttvar}
\end{aligned}
$$

This is a low-pass filter where g is a filter gain constant with a value between 0 and 1, and savesrtt and saverttvar are the values stored in the routing table entry. When Net/3 updates the routing table values using these equations (when a connection is closed, assuming 16 samples have been made), it uses a gain of 0.5: the new value stored in the routing table is one-half of the old value in the routing table plus one-half of the current estimator. Bob Braden's T/TCP code, however, uses a gain of 0.75.
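
The estimator equations from the start of this section and the low-pass filter just described can be sketched together in floating-point C. This is only an illustration, not the Net/3 code, which uses scaled fixed-point integers. The names rtt_est, rtt_update, and rtt_save_merge are hypothetical, and the caller is assumed to have initialized srtt and rttvar before the first update.

```c
/* Floating-point RTT estimator state. Hypothetical type: the field names
 * follow the text (srtt, rttvar), not the Net/3 tcpcb members. */
struct rtt_est {
    double srtt;    /* smoothed RTT estimator */
    double rttvar;  /* smoothed mean deviation estimator */
};

/* Apply one measured round-trip time ("data" in the equations, in the same
 * units as the estimators) and return RTO = srtt + 4 * rttvar.
 * Assumes the caller has already initialized *e. */
double rtt_update(struct rtt_est *e, double data)
{
    const double g = 1.0 / 8.0;  /* gain applied to srtt */
    const double h = 1.0 / 4.0;  /* gain applied to rttvar */
    double delta = data - e->srtt;

    e->srtt += g * delta;
    if (delta < 0.0)
        delta = -delta;          /* |delta| */
    e->rttvar += h * (delta - e->rttvar);
    return e->srtt + 4.0 * e->rttvar;
}

/* Low-pass filter for the value saved in the routing table entry:
 * gain is 0.5 in Net/3 and 0.75 in the T/TCP code, per the text. */
double rtt_save_merge(double saved, double current, double gain)
{
    return gain * saved + (1.0 - gain) * current;
}
```

For example, starting from srtt = 800 and rttvar = 100, a sample of 1000 moves srtt to 825 and rttvar to 125, for an RTO of 1325. Merging a saved value of 800 with a current value of 1000 stores 850 with a gain of 0.75, or 900 with a gain of 0.5.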
Figure 10.7 provides a comparison of the normal TCP calculations from Figure 10.6 and the smoothing performed with a filter gain of 0.75. The three dotted lines are the three variables from Figure 10.6 (RTO on top, then srtt in the middle, and rttvar on the bottom). The three solid lines are the corresponding variables assuming that each data point is a separate T/TCP connection (one RTT measurement per connection) and that the value saved in the routing table between each connection uses a filter gain of 0.75. Realize the difference: the dotted lines assume a single TCP connection with 91 RTT samples over a 30-minute period, whereas the solid lines assume 91 separate T/TCP connections, each with a single RTT measurement, over the same 30-minute period. The solid lines also assume that the two estimators are merged into the two routing table metrics between each of the 91 connections.

Figure 10.7 Comparison of TCP smoothing versus T/TCP smoothing. [Plot: the horizontal axis is the sample number.]

The solid and dotted lines for srtt do not differ greatly, but there is a larger difference between the solid and dotted lines for rttvar. The solid line for rttvar (the T/TCP case) is normally larger than the dotted line (the single TCP connection), giving a higher value for the T/TCP retransmission timeout.
Other factors affect the RTT measurements made by T/TCP. From the client's perspective the measured RTT normally includes either the server processing time or the server's delayed-ACK timer, since the server's reply is normally delayed until either of these events occurs. In Net/3 the delayed-ACK timer expires every 200 ms and the RTT measurements are in 500-ms clock ticks, so the delay of the reply shouldn't be a large factor. Also, the processing of T/TCP segments normally involves the slow path through the TCP input processing (the segments are usually not candidates for header prediction, for example), which can add to the measured RTT values. (The difference between the slow path and the fast path, however, is probably negligible compared to the 200-ms delayed-ACK timer.) Finally, if the values stored in the routing table are "old" (say, they were last updated an hour ago), perhaps the current measurements should just replace the values in the routing table when the current transaction is complete, instead of merging in the new measurements with the existing values.

As noted in RFC 1644, more research is needed into the dynamics of TCP, and especially T/TCP, RTT estimation.

10.6 tcp_close Function

The only change required to tcp_close is to save the RTT estimators for a T/TCP transaction, even if 16 samples have not been obtained. We described the reasoning behind this in the previous section. Figure 10.8 shows the code.

 252     if (SEQ_LT(tp->iss + so->so_snd.sb_hiwat * 16, tp->snd_max) &&
 253         (rt = inp->inp_route.ro_rt) &&
 254         ((struct sockaddr_in *) rt_key(rt))->sin_addr.s_addr != INADDR_ANY) {

             /* pp. 895-896 of Volume 2 */

 304     } else if (tp->cc_recv != 0 &&
 305                (rt = inp->inp_route.ro_rt) &&
 306         ((struct sockaddr_in *) rt_key(rt))->sin_addr.s_addr != INADDR_ANY) {
 307         /*
 308          * For transactions we need to keep track of srtt and rttvar
 309          * even if we don't have enough data for above.
 310          */
 311         u_long  i;

 312         if ((rt->rt_rmx.rmx_locks & RTV_RTT) == 0) {
 313             i = tp->t_srtt *
 314                 (RTM_RTTUNIT / (PR_SLOWHZ * TCP_RTT_SCALE));
 315             if (rt->rt_rmx.rmx_rtt && i)
 316                 /*
 317                  * Filter this update to 3/4 the old plus
 318                  * 1/4 the new values, converting scale.
 319                  */
 320                 rt->rt_rmx.rmx_rtt =
 321                     (3 * rt->rt_rmx.rmx_rtt + i) / 4;
 322             else
 323                 rt->rt_rmx.rmx_rtt = i;
 324         }
 325         if ((rt->rt_rmx.rmx_locks & RTV_RTTVAR) == 0) {
 326             i = tp->t_rttvar *
 327                 (RTM_RTTUNIT / (PR_SLOWHZ * TCP_RTTVAR_SCALE));
 328             if (rt->rt_rmx.rmx_rttvar && i)
 329                 rt->rt_rmx.rmx_rttvar =
 330                     (3 * rt->rt_rmx.rmx_rttvar + i) / 4;
 331             else
 332                 rt->rt_rmx.rmx_rttvar = i;
 333         }
 334     }

Figure 10.8 tcp_close function: save RTT estimators for T/TCP transaction.

Update for T/TCP transactions only

304-311 The metrics in the routing table entry are updated only if T/TCP was used on the connection (cc_recv is nonzero), a routing table entry exists, and the route is not the default. Also, the two RTT estimators are updated only if they are not locked (the RTV_RTT and RTV_RTTVAR bits).

Update RTT

312-324 t_srtt is stored as 500-ms clock ticks x 8 and rmx_rtt is stored as microseconds. Therefore t_srtt is multiplied by 1,000,000 (RTM_RTTUNIT) and divided by 2 (ticks/second) times 8. If a value for rmx_rtt already exists, the new value is three-quarters the old value plus one-quarter of the new value. This is a filter gain of 0.75, as discussed in the previous section. Otherwise the new value is stored in rmx_rtt.

Update mean deviation

325-334 The same algorithm is applied to the mean deviation estimator. It too is stored as microseconds, requiring a conversion from the t_rttvar units of ticks x 4.
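
The unit conversion and the 3/4 filter described above can be tried out in isolation with a small user-space sketch. The SK_ constants are local stand-ins whose values come from the text (microsecond metrics, two ticks per second, scale factors of 8 and 4); the function names scaled_ticks_to_usec and merge_metric are hypothetical and do not appear in the kernel.

```c
/* Local stand-ins for RTM_RTTUNIT, PR_SLOWHZ, TCP_RTT_SCALE and
 * TCP_RTTVAR_SCALE, using the values given in the text. */
#define SK_RTTUNIT      1000000UL  /* routing metrics are in microseconds */
#define SK_SLOWHZ       2UL        /* 500-ms clock ticks per second */
#define SK_RTT_SCALE    8UL        /* t_srtt is stored as ticks x 8 */
#define SK_RTTVAR_SCALE 4UL        /* t_rttvar is stored as ticks x 4 */

/* Convert a scaled tick count (t_srtt or t_rttvar) to microseconds. */
unsigned long scaled_ticks_to_usec(unsigned long scaled, unsigned long scale)
{
    return scaled * (SK_RTTUNIT / (SK_SLOWHZ * scale));
}

/* Merge a freshly converted value into the saved metric: 3/4 old plus
 * 1/4 new when both are nonzero, otherwise store the new value as is
 * (this mirrors the if/else in Figure 10.8, including storing 0). */
unsigned long merge_metric(unsigned long saved, unsigned long fresh)
{
    if (saved && fresh)
        return (3 * saved + fresh) / 4;
    return fresh;
}
```

A t_srtt of 16 (two ticks, scaled by 8) converts to 1,000,000 microseconds, and a t_rttvar of 4 (one tick, scaled by 4) converts to 500,000 microseconds. Merging a saved 800,000 with a new 1,000,000 yields 850,000.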
10.7 tcp_msssend Function

In Net/3 there is a single function, tcp_mss (Section 27.5 of Volume 2), which is called by tcp_input when an MSS option is processed, and by tcp_output when an MSS option is about to be sent. With T/TCP this function is renamed tcp_mssrcvd and it is called by tcp_input after a SYN is received (Figure 10.18, shown later), whether an MSS option is contained in the SYN or not, and by the PRU_SEND and PRU_SEND_EOF requests (Figure 12.4), when an implied connect is performed. A new function, tcp_msssend, which we show in Figure 10.9, is called only by tcp_output when an MSS option is sent.

------------------------------------------------------------ tcp_input.c
1911 int
1912 tcp_msssend(tp)
1913 struct tcpcb *tp;
1914 {
1915     struct rtentry *rt;
1916     extern int tcp_mssdflt;

1917     rt = tcp_rtlookup(tp->t_inpcb);
1918     if (rt == NULL)
1919         return (tcp_mssdflt);

1920     /*
1921      * If there's an mtu associated with the route, use it,
1922      * else use the outgoing interface mtu.
1923      */
1924     if (rt->rt_rmx.rmx_mtu)
1925         return (rt->rt_rmx.rmx_mtu - sizeof(struct tcpiphdr));
1926     return (rt->rt_ifp->if_mtu - sizeof(struct tcpiphdr));
1927 }
------------------------------------------------------------ tcp_input.c

Figure 10.9 tcp_msssend function: return MSS value to send in MSS option.

Fetch routing table entry

1917-1919 The routing table is searched for the peer host, and if an entry is not found, the default of 512 (tcp_mssdflt) is returned. A routing table entry should always be found, unless the peer is unreachable.

Return MSS

1920-1926 If the routing table has an associated MTU (the rmx_mtu member of the rt_metrics structure, which can be stored by the system administrator using the route program), that value is returned. Otherwise the value returned is the outgoing interface MTU minus 40 (e.g., 1460 for an Ethernet). The outgoing interface is known, since the route has been determined by tcp_rtlookup.

Another way for the MTU metric to be stored in the routing table entry is through path MTU discovery (Section 24.2 of Volume 1), although Net/3 does not yet support this.
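
The decision made by tcp_msssend can be exercised outside the kernel with a cut-down sketch. The sk_route structure and mss_to_send function below are hypothetical stand-ins for the rtentry lookup, and the 40-byte constant assumes the usual 20-byte IP header plus 20-byte TCP header that sizeof(struct tcpiphdr) represents.

```c
#include <stddef.h>

#define SK_TCPIP_HDR_LEN 40   /* assumed sizeof(struct tcpiphdr) */
#define SK_MSS_DEFAULT   512  /* tcp_mssdflt, per the text */

/* Hypothetical cut-down route: only the two fields this decision needs. */
struct sk_route {
    unsigned long rmx_mtu;  /* MTU metric stored in the route, 0 if none */
    unsigned long if_mtu;   /* MTU of the outgoing interface */
};

/* Return the MSS to announce in an outgoing MSS option. */
int mss_to_send(const struct sk_route *rt)
{
    if (rt == NULL)                      /* no route to the peer */
        return SK_MSS_DEFAULT;
    if (rt->rmx_mtu)                     /* route metric takes priority */
        return (int)(rt->rmx_mtu - SK_TCPIP_HDR_LEN);
    return (int)(rt->if_mtu - SK_TCPIP_HDR_LEN);
}
```

With no route metric and an Ethernet interface MTU of 1500 this returns 1460, and with a route metric of 576 it returns 536 regardless of the interface MTU.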
This function differs from the normal BSD behavior. The Net/3 code (p. 900 of Volume 2) always announces an MSS of 512 (tcp_mssdflt) if the peer is nonlocal (as determined by the in_localaddr function) and if the rmx_mtu metric is 0.

The intent of the MSS option is to tell the other end how large a segment the sender of the option is prepared to receive. RFC 793 states that the MSS option "communicates the maximum receive segment size at the TCP which sends this segment." On some implementations this could be limited by the maximum size IP datagram that the host is capable of reassembling. On most current systems, however, the reasonable limit is based on the outgoing interface MTU, since TCP performance can degrade if fragmentation occurs and fragments are lost.

The following comments are from Bob Braden's T/TCP source code changes: "Using TCP options unfortunately requires considerable changes to BSD, because its handling of MSS was incorrect. BSD always sent an MSS option, and for a nonlocal network this option contained 536. This is a misunderstanding of the intent of the MSS option, which is to tell the sender what the receiver is prepared to handle. The sending host should then decide what MSS to use, considering both the MSS option it received and the path. When we have MTU discovery, the path is likely to have an MTU larger than 536; then the BSD usage will kill throughput. Hence, this routine only determines what MSS option should be SENT: the local interface MTU minus 40." (The values 536 in these comments should be 512.)

We'll see in the next section (Figure 10.12) that the receiver of the MSS option is the one that reduces the MSS to 512 if the peer is nonlocal.
10.8 tcp_mssrcvd Function

tcp_mssrcvd is called by tcp_input after a SYN is received, and by the PRU_SEND and PRU_SEND_EOF requests, when an implied connect is performed. It is similar to the tcp_mss function from Volume 2, but with enough differences that we need to present the entire function. The main goal of this function is to set the two variables t_maxseg (the maximum amount of data that we send per segment) and t_maxopd (the maximum amount of data plus options that we send per segment). Figure 10.10 shows the first part.

------------------------------------------------------------ tcp_input.c
1755 void
1756 tcp_mssrcvd(tp, offer)
1757 struct tcpcb *tp;
1758 int     offer;
1759 {
1760     struct rtentry *rt;
1761     struct ifnet *ifp;
1762     int     rtt, mss;
1763     u_long  bufsize;
1764     struct inpcb *inp;
1765     struct socket *so;
1766     struct rmxp_tao *taop;
1767     int     origoffer = offer;
1768     extern int tcp_mssdflt;
1769     extern int tcp_do_rfc1323;
1770     extern int tcp_do_rfc1644;

1771     inp = tp->t_inpcb;
1772     if ((rt = tcp_rtlookup(inp)) == NULL) {
1773         tp->t_maxopd = tp->t_maxseg = tcp_mssdflt;
1774         return;
1775     }
1776     ifp = rt->rt_ifp;
1777     so = inp->inp_socket;

1778     taop = rmx_taop(rt->rt_rmx);
1779     /*
1780      * Offer == -1 means we haven't received a SYN yet;
1781      * use cached value in that case.
1782      */
1783     if (offer == -1)
1784         offer = taop->tao_mssopt;
1785     /*
1786      * Offer == 0 means that there was no MSS on the SYN segment,
1787      * or no value in the TAO Cache. We use tcp_mssdflt.
1788      */
1789     if (offer == 0)
1790         offer = tcp_mssdflt;
1791     else
1792         /*
1793          * Sanity check: make sure that maxopd will be large
1794          * enough to allow some data on segments even if all
1795          * the option space is used (40 bytes). Otherwise
1796          * funny things may happen in tcp_output.
1797          */
1798         offer = max(offer, 64);
1799     taop->tao_mssopt = offer;
------------------------------------------------------------ tcp_input.c

Figure 10.10 tcp_mssrcvd function: first part.

Get route to peer and its TAO cache

1771-1777 tcp_rtlookup finds the route to the peer. If for some reason this fails, t_maxseg and t_maxopd are both set to 512 (tcp_mssdflt).
1778-1799 taop points to the TAO cache for this peer that is contained in the routing table entry. If tcp_mssrcvd is called because the process has called sendto (an implied connect, as part of the PRU_SEND and PRU_SEND_EOF requests), offer is set to the value stored in the TAO cache. If this TAO value is 0, offer is set to 512. The value in the TAO cache is updated.

The next part of the function, shown in Figure 10.11, is identical to p. 899 of Volume 2.

------------------------------------------------------------ tcp_input.c
1800     /*
1801      * While we're here, check if there's an initial rtt
1802      * or rttvar. Convert from the route-table units
1803      * to scaled multiples of the slow timeout timer.
1804      */
1805     if (tp->t_srtt == 0 && (rtt = rt->rt_rmx.rmx_rtt)) {
1806         /*
1807          * XXX the lock bit for RTT indicates that the value
1808          * is also a minimum value; this is subject to time.
1809          */
1810         if (rt->rt_rmx.rmx_locks & RTV_RTT)
1811             tp->t_rttmin = rtt / (RTM_RTTUNIT / PR_SLOWHZ);
1812         tp->t_srtt = rtt / (RTM_RTTUNIT / (PR_SLOWHZ * TCP_RTT_SCALE));
1813         if (rt->rt_rmx.rmx_rttvar)
1814             tp->t_rttvar = rt->rt_rmx.rmx_rttvar /
1815                 (RTM_RTTUNIT / (PR_SLOWHZ * TCP_RTTVAR_SCALE));
1816         else
1817             /* default variation is +- 1 rtt */
1818             tp->t_rttvar =
1819                 tp->t_srtt * TCP_RTTVAR_SCALE / TCP_RTT_SCALE;
1820         TCPT_RANGESET(tp->t_rxtcur,
1821                       ((tp->t_srtt >> 2) + tp->t_rttvar) >> 1,
1822                       tp->t_rttmin, TCPTV_REXMTMAX);
1823     }
------------------------------------------------------------ tcp_input.c

Figure 10.11 tcp_mssrcvd function: initialize RTT variables from routing table metrics.

1800-1823 If there are no RTT measurements yet for the connection (t_srtt is 0) and the rmx_rtt metric is nonzero, then the variables t_srtt, t_rttvar, and t_rxtcur are initialized from the metrics stored in the routing table entry.

1806-1811 If the RTV_RTT bit in the routing metric lock flag is set, it indicates that rmx_rtt should also be used to initialize the minimum RTT for this connection (t_rttmin). By default t_rttmin is initialized to two ticks, so this provides a way for the system administrator to override this default.

The next part of tcp_mssrcvd, shown in Figure 10.12, sets the value of the automatic variable mss.
------------------------------------------------------------ tcp_input.c
1824     /*
1825      * If there's an mtu associated with the route, use it.
1826      */
1827     if (rt->rt_rmx.rmx_mtu)
1828         mss = rt->rt_rmx.rmx_mtu - sizeof(struct tcpiphdr);
1829     else {
1830         mss = ifp->if_mtu - sizeof(struct tcpiphdr);
1831         if (!in_localaddr(inp->inp_faddr))
1832             mss = min(mss, tcp_mssdflt);
1833     }
1834     mss = min(mss, offer);

1835     /*
1836      * t_maxopd contains the maximum length of data AND options
1837      * in a segment; t_maxseg is the amount of data in a normal
1838      * segment. We need to store this value apart
1839      * from t_maxseg, because now every segment can contain options
1840      * therefore we normally have somewhat less data in segments.
1841      */
1842     tp->t_maxopd = mss;
------------------------------------------------------------ tcp_input.c

Figure 10.12 tcp_mssrcvd function: calculate value of mss variable.

1824-1834 If there is an MTU associated with the route (the rmx_mtu metric) then that value is used. Otherwise mss is calculated as the outgoing interface MTU minus 40. Additionally, if the peer is on a different network or perhaps a different subnet (as determined by the in_localaddr function), then the maximum value of mss is 512 (tcp_mssdflt). When an MTU is stored in the routing table entry, the local-nonlocal test is not performed.

Set t_maxopd

1835-1842 t_maxopd is set to mss, the maximum segment size, including data and options.

The next piece of code, shown in Figure 10.13, reduces mss by the size of the options that will appear in every segment.

Decrease mss if timestamp option to be used

1843-1856 mss is decreased by the size of the timestamp option (TCPOLEN_TSTAMP_APPA, or 12 bytes) if either of the following is true:

1. our end will request the timestamp option (TF_REQ_TSTMP) and we have not received an MSS option from the other end yet (origoffer equals -1), or

2. we have received a timestamp option from the other end.

As the comment in the code notes, since tcp_mssrcvd is called at the end of tcp_dooptions (Figure 10.18), after all the options have been processed, the second test is OK.
------------------------------------------------------------ tcp_input.c
1843     /*
1844      * Adjust mss to leave space for the usual options. We're
1845      * called from the end of tcp_dooptions so we can use the
1846      * REQ/RCVD flags to see if options will be used.
1847      */
1848     /*
1849      * In case of T/TCP, origoffer == -1 indicates that no segments
1850      * were received yet (i.e., client has called sendto). In this
1851      * case we just guess, otherwise we do the same as before T/TCP.
1852      */
1853     if ((tp->t_flags & (TF_REQ_TSTMP | TF_NOOPT)) == TF_REQ_TSTMP &&
1854         (origoffer == -1 ||
1855          (tp->t_flags & TF_RCVD_TSTMP) == TF_RCVD_TSTMP))
1856         mss -= TCPOLEN_TSTAMP_APPA;
1857     if ((tp->t_flags & (TF_REQ_CC | TF_NOOPT)) == TF_REQ_CC &&
1858         (origoffer == -1 ||
1859          (tp->t_flags & TF_RCVD_CC) == TF_RCVD_CC))
1860         mss -= TCPOLEN_CC_APPA;

1861 #if (MCLBYTES & (MCLBYTES - 1)) == 0
1862     if (mss > MCLBYTES)
1863         mss &= ~(MCLBYTES - 1);
1864 #else
1865     if (mss > MCLBYTES)
1866         mss = mss / MCLBYTES * MCLBYTES;
1867 #endif
------------------------------------------------------------ tcp_input.c

Figure 10.13 tcp_mssrcvd function: decrease mss based on options.

Decrease mss if CC option to be used

1857-1860 Similar logic can reduce the value of mss by 8 bytes (TCPOLEN_CC_APPA).

The term APPA in the names of the two lengths is because Appendix A of RFC 1323 contained the suggestion that the timestamp option be preceded by two NOPs, to align the two 4-byte timestamp values on 4-byte boundaries. While there is an Appendix A to RFC 1644, it says nothing about alignment of the options. Nevertheless, it makes sense for the code to precede each of the three CC options with two NOPs, as is done in Figure 9.6.

Round MSS down to multiple of MCLBYTES

1861-1867 mss is rounded down to a multiple of MCLBYTES, the size in bytes of an mbuf cluster (often 1024 or 2048).

This code is an awful attempt to optimize by using logical operations, instead of a divide and multiply, if MCLBYTES is a power of 2. It has been around since Net/1 and should be cleaned up.
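
The option adjustment and the cluster rounding from Figure 10.13 can be sketched as ordinary functions. This is one possible reading, not a proposed replacement for the kernel code: the power-of-2 test is done at run time rather than with the preprocessor, the flag tests are reduced to two booleans, and the names mss_less_options, is_pow2, and round_mss_to_cluster are hypothetical. The lengths 12 and 8 are the TCPOLEN_TSTAMP_APPA and TCPOLEN_CC_APPA values given in the text.

```c
#define SK_TSTAMP_APPA 12u  /* timestamp option plus two NOPs */
#define SK_CC_APPA      8u  /* CC option plus two NOPs */

/* Reduce mss by the options expected in every segment. The caller decides
 * whether each option will be used (the t_flags tests in Figure 10.13). */
unsigned int mss_less_options(unsigned int mss, int use_tstamp, int use_cc)
{
    if (use_tstamp)
        mss -= SK_TSTAMP_APPA;
    if (use_cc)
        mss -= SK_CC_APPA;
    return mss;
}

/* Nonzero if n is a power of 2. */
int is_pow2(unsigned int n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* If mss exceeds the cluster size, round it down to a multiple of the
 * cluster size; otherwise leave it unchanged. */
unsigned int round_mss_to_cluster(unsigned int mss, unsigned int cluster)
{
    if (mss <= cluster)
        return mss;
    if (is_pow2(cluster))
        return mss & ~(cluster - 1);  /* mask off the low-order bits */
    return mss / cluster * cluster;   /* general case */
}
```

Starting from 1460 with both options in use gives 1440, and starting from 512 with only the CC option gives 504. An mss of 4312 with a 2048-byte cluster rounds down to 4096, while 1440 is below the cluster size and is left alone.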
Figure 10.14 shows the final part of tcp_mssrcvd, which sets the send buffer size and the receive buffer size.

------------------------------------------------------------ tcp_input.c
1868     /*
1869      * If there's a pipesize, change the socket buffer
1870      * to that size. Make the socket buffers an integral
1871      * number of mss units; if the mss is larger than
1872      * the socket buffer, decrease the mss.
1873      */
1874     if ((bufsize = rt->rt_rmx.rmx_sendpipe) == 0)
1875         bufsize = so->so_snd.sb_hiwat;
1876     if (bufsize < mss)
1877         mss = bufsize;
1878     else {
1879         bufsize = roundup(bufsize, mss);
1880         if (bufsize > sb_max)
1881             bufsize = sb_max;
1882         (void) sbreserve(&so->so_snd, bufsize);
1883     }
1884     tp->t_maxseg = mss;

1885     if ((bufsize = rt->rt_rmx.rmx_recvpipe) == 0)
1886         bufsize = so->so_rcv.sb_hiwat;
1887     if (bufsize > mss) {
1888         bufsize = roundup(bufsize, mss);
1889         if (bufsize > sb_max)
1890             bufsize = sb_max;
1891         (void) sbreserve(&so->so_rcv, bufsize);
1892     }

1893     /*
1894      * Don't force slow-start on local network.
1895      */
1896     if (!in_localaddr(inp->inp_faddr))
1897         tp->snd_cwnd = mss;

1898     if (rt->rt_rmx.rmx_ssthresh) {
1899         /*
1900          * There's some sort of gateway or interface
1901          * buffer limit on the path. Use this to set
1902          * the slow start threshhold, but set the
1903          * threshold to no less than 2*mss.
1904          */
1905         tp->snd_ssthresh = max(2 * mss, rt->rt_rmx.rmx_ssthresh);
1906     }
1907 }
------------------------------------------------------------ tcp_input.c

Figure 10.14 tcp_mssrcvd function: set send and receive buffer sizes.

Modify socket send buffer size

1868-1883 The rmx_sendpipe and rmx_recvpipe metrics can be set by the system administrator using the route program. bufsize is set to the value of the rmx_sendpipe metric (if defined) or the current value of the socket send buffer's high-water mark. If the value of bufsize is less than mss, then mss is reduced to the value of bufsize. (This is a way to force a smaller MSS than the default for a given destination.) Otherwise the value of bufsize is rounded up to the next multiple of mss. (The socket buffers should always be a multiple of the segment size.) The upper bound is sb_max, which is 262,144 in Net/3. The socket buffer's high-water mark is set by sbreserve.
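
The send-buffer sizing rule just described can be sketched without the socket layer. The function size_send_buffer and its return convention (0 meaning "leave the buffer alone") are hypothetical, chosen only to make the two branches visible; the real code calls sbreserve instead of returning a size. The bound 262,144 is the Net/3 sb_max value given in the text.

```c
#define SK_SB_MAX 262144UL  /* sb_max in Net/3, per the text */

/* Round x up to the next multiple of y (y must be nonzero). */
unsigned long round_up_to(unsigned long x, unsigned long y)
{
    return (x + y - 1) / y * y;
}

/* pipe  : the rmx_sendpipe metric, 0 if not set
 * hiwat : current high-water mark of the socket send buffer
 * mss   : in/out, lowered if the buffer is smaller than one segment
 * Returns the size to reserve, or 0 when mss was lowered instead. */
unsigned long size_send_buffer(unsigned long pipe, unsigned long hiwat,
                               unsigned long *mss)
{
    unsigned long bufsize = pipe ? pipe : hiwat;

    if (bufsize < *mss) {
        *mss = bufsize;         /* force a smaller MSS for this peer */
        return 0;
    }
    bufsize = round_up_to(bufsize, *mss);
    if (bufsize > SK_SB_MAX)
        bufsize = SK_SB_MAX;    /* upper bound on any socket buffer */
    return bufsize;
}
```

With a default buffer of 8192, an mss of 1440 rounds the buffer up to 8640 (6 segments) and an mss of 504 rounds it up to 8568 (17 segments). A pipe size of 1000 with an mss of 1460 lowers the mss to 1000 and leaves the buffer untouched.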
Set t_maxseg

1884 t_maxseg is set to the maximum amount of data (excluding normal options) that TCP will send to the peer.

Modify socket receive buffer size

1885-1892 Similar logic is applied to the socket receive buffer high-water mark. For a local connection on an Ethernet, for example, assuming both timestamps and the CC option are in use, t_maxopd will be 1460 and t_maxseg will be 1440 (Figure 2.4). The socket's send buffer size and receive buffer size will both be rounded up from their defaults of 8192 (Figure 16.4, p. 477 of Volume 2) to 8640 (1440 x 6).

Slow start for nonlocal peer only

1893-1897 If the peer is not on a local network (in_localaddr is false) slow start is initiated by setting the congestion window (snd_cwnd) to one segment.

Forcing slow start only if the peer is nonlocal is a change with T/TCP. This allows a T/TCP client or server to send multiple segments to a local peer, without incurring the additional RTT latencies required by slow start (Section 3.6). In Net/3, slow start was always performed (p. 902 of Volume 2).

Set slow start threshold

1898-1906 If the slow start threshold metric (rmx_ssthresh) is nonzero, snd_ssthresh is set to that value.

We can see the interaction of the receive buffer size with the MSS and the TAO cache in Figures 3.1 and 3.3. In the first figure the client performs an implied connect, the PRU_SEND_EOF request calls tcp_mssrcvd with an offer of -1, and the function finds a tao_mssopt of 0 for the server (since the client just rebooted). The default of 512 is used, and with only the CC option in use (we disabled timestamps for the examples in Chapter 2) this value is decreased by 8 bytes (the options) to become 504. Note that 8192 rounded up to a multiple of 504 becomes 8568, which is the window advertised by the client's SYN. When the server calls tcp