TCP(7P)                            Protocols                           TCP(7P)

NAME
     tcp, TCP - Internet Transmission Control Protocol

SYNOPSIS
     #include <sys/socket.h>
     #include <netinet/in.h>
     #include <netinet/tcp.h>

     s = socket(AF_INET, SOCK_STREAM, 0);
     s = socket(AF_INET6, SOCK_STREAM, 0);
     t = t_open("/dev/tcp", O_RDWR);
     t = t_open("/dev/tcp6", O_RDWR);

DESCRIPTION
     TCP is the virtual circuit protocol of the Internet protocol family.  It
     provides reliable, flow-controlled, in-order, two-way transmission of
     data.  It is a byte-stream protocol layered above the Internet Protocol
     (IP), or the Internet Protocol Version 6 (IPv6), the Internet protocol
     family's internetwork datagram delivery protocol.

     Programs can access TCP using the socket interface as a SOCK_STREAM
     socket type, or using the Transport Level Interface (TLI) where it
     supports the connection-oriented (T_COTS_ORD) service type.

     A checksum over all data helps TCP provide reliable communication.  Using
     a window-based flow control mechanism that makes use of positive
     acknowledgements, sequence numbers, and a retransmission strategy, TCP
     can usually recover when datagrams are damaged, delayed, duplicated or
     delivered out of order by the underlying medium.

     TCP provides several socket options, defined in <netinet/tcp.h> and
     described throughout this document, which may be set using
     setsockopt(3SOCKET) and read using getsockopt(3SOCKET).  The level
     argument for these calls is the protocol number for TCP, available from
     getprotobyname(3SOCKET).  IP level options may also be used with TCP.
     See ip(7P) and ip6(7P).

   Listening And Connecting
     TCP uses IP's host-level addressing and adds its own per-host collection
     of "port addresses".  The endpoints of a TCP connection are identified by
     the combination of an IPv4 or IPv6 address and a TCP port number.
     Although other protocols, such as the User Datagram Protocol (UDP), may
     use the same host and port address format, the port space of these
     protocols is distinct.  See inet(7P) and inet6(7P) for details on the
     common aspects of addressing in the Internet protocol family.

     Sockets utilizing TCP are either "active" or "passive".  Active sockets
     initiate connections to passive sockets.  Passive sockets must have their
     local IPv4 or IPv6 address and TCP port number bound with the
     bind(3SOCKET) system call after the socket is created.  If an active
     socket has not been bound by the time connect(3SOCKET) is called, then
     the operating system will choose a local address and port for the
     application.  By default, TCP sockets are active.  A passive socket is
     created by calling the listen(3SOCKET) system call after binding, which
     establishes a queueing parameter for the passive socket.  Connections to
     the passive socket can then be received using the accept(3SOCKET) system
     call.  Active sockets use the connect(3SOCKET) call after binding to
     initiate connections.

     If incoming connection requests include an IP source route option, then
     the reverse source route will be used when responding.

     By using the special value INADDR_ANY with IPv4, or the unspecified
     address (all zeroes) with IPv6, the local IP address can be left
     unspecified in the bind() call by either active or passive TCP sockets.
     This feature is usually used if the local address is either unknown or
     irrelevant.  If left unspecified, the local IP address will be bound at
     connection time to the address of the network interface used to service
     the connection.  For passive sockets, this is the destination address
     used by the connecting peer.  For active sockets, this is usually an
     address on the same subnet as the destination or default gateway address,
     although the rules can be more complex.
     See Source Address Selection in
     inet6(7P) for a detailed discussion of how this works in IPv6.

     Note that no two TCP sockets can be bound to the same port unless the
     bound IP addresses are different.  IPv4 INADDR_ANY and IPv6 unspecified
     addresses compare as equal to any IPv4 or IPv6 address.  For example, if
     a socket is bound to INADDR_ANY or the unspecified address and port N, no
     other socket can bind to port N, regardless of the binding address.  This
     special consideration of INADDR_ANY and the unspecified address can be
     changed using the socket option SO_REUSEADDR.  If SO_REUSEADDR is set on
     a socket doing a bind, IPv4 INADDR_ANY and the IPv6 unspecified address
     do not compare as equal to any IP address.  This means that as long as
     the two sockets are not both bound to INADDR_ANY, the unspecified
     address, or the same IP address, then the two sockets can be bound to the
     same port.

     If an application does not want to allow another socket using the
     SO_REUSEADDR option to bind to a port its socket is bound to, the
     application can set the socket-level (SOL_SOCKET) option SO_EXCLBIND on a
     socket.  The option values of 0 and 1 disable and enable the option,
     respectively.  Once this option is enabled on a socket, no other socket
     can be bound to the same port.

   Sending And Receiving Data
     Once a connection has been established, data can be exchanged using the
     read(2) and write(2) system calls.  If, after sending data, the local TCP
     receives no acknowledgements from its peer for a period of time (for
     example, if the remote machine crashes), the connection is closed and an
     error is returned.

     When a peer is sending data, it will only send up to the advertised
     "receive window", which is determined by how much more data the recipient
     can fit in its buffer.  Applications can use the socket-level option
     SO_RCVBUF to increase or decrease the receive buffer size.
     Similarly,
     the socket-level option SO_SNDBUF can be used to allow TCP to buffer more
     unacknowledged and unsent data locally.

     Under most circumstances, TCP will send data when it is written by the
     application.  When outstanding data has not yet been acknowledged,
     though, TCP will gather small amounts of output to be sent as a single
     packet once an acknowledgement has been received.  Usually referred to as
     Nagle's Algorithm (RFC 896), this behavior helps prevent flooding the
     network with many small packets.

     However, for some highly interactive clients (such as remote shells or
     windowing systems that send a stream of keypresses or mouse events), this
     batching may cause significant delays.  To disable this behavior, TCP
     provides a boolean socket option, TCP_NODELAY.

     Conversely, for other applications, it may be desirable for TCP not to
     send out any data until a full TCP segment can be sent.  To enable this
     behavior, an application can use the TCP-level socket option TCP_CORK.
     When set to a non-zero value, TCP will only send out a full TCP segment.
     When TCP_CORK is set to zero after it has been enabled, all currently
     buffered data is sent out (as permitted by the peer's receive window and
     the current congestion window).

     TCP provides an urgent data mechanism, which may be invoked using the
     out-of-band provisions of send(3SOCKET).  The caller may mark one byte as
     "urgent" with the MSG_OOB flag to send(3SOCKET).  This sets an "urgent
     pointer" pointing to this byte in the TCP stream.  The receiver on the
     other side of the stream is notified of the urgent data by a SIGURG
     signal.  The SIOCATMARK ioctl(2) request returns a value indicating
     whether the stream is at the urgent mark.
     Because the system never
     returns data across the urgent mark in a single read(2) call, it is
     possible to advance to the urgent data in a simple loop which reads data,
     testing the socket with the SIOCATMARK ioctl() request, until it reaches
     the mark.

   Congestion Control
     TCP follows the congestion control algorithm described in RFC 2581, and
     also supports the initial congestion window (cwnd) changes in RFC 3390.
     The initial cwnd calculation can be overridden by the socket option
     TCP_INIT_CWND.  An application can use this option to set the initial
     cwnd to a specified number of TCP segments.  This applies to the cases
     when the connection first starts and restarts after an idle period.  The
     process must have the PRIV_SYS_NET_CONFIG privilege if it wants to
     specify a number greater than that calculated by RFC 3390.

     The operating system also provides alternative algorithms that may be
     more appropriate for your application, including the CUBIC congestion
     control algorithm described in RFC 8312.  These can be configured
     system-wide using ipadm(1M), or on a per-connection basis with the
     TCP-level socket option TCP_CONGESTION, whose argument is the name of the
     algorithm to use (for example "cubic").  If the requested algorithm does
     not exist, then setsockopt() will fail, and errno will be set to ENOENT.

   TCP Keep-Alive
     Since TCP determines whether a remote peer is no longer reachable by
     timing out waiting for acknowledgements, a host that never sends any new
     data may never notice a peer that has gone away.  While consumers can
     avoid this problem by sending their own periodic heartbeat messages
     (Transport Layer Security does this, for example), TCP describes an
     optional keep-alive mechanism in RFC 1122.  Applications can enable it
     using the socket-level option SO_KEEPALIVE.
     When enabled, the first
     keep-alive probe is sent out after a TCP connection is idle for two
     hours.  If the peer does not respond to the probe within eight minutes,
     the TCP connection is aborted.  An application can alter the probe
     behavior using the following TCP-level socket options:

     TCP_KEEPALIVE_THRESHOLD
                             Determines the interval for sending the first
                             probe.  The option value is specified as an
                             unsigned integer in milliseconds.  The system
                             default is controlled by the TCP ndd parameter
                             tcp_keepalive_interval.  The minimum value is ten
                             seconds.  The maximum is ten days, while the
                             default is two hours.

     TCP_KEEPALIVE_ABORT_THRESHOLD
                             If TCP does not receive a response to the probe,
                             then this option determines how long to wait
                             before aborting a TCP connection.  The option
                             value is an unsigned integer in milliseconds.
                             The value zero indicates that TCP should never
                             time out and abort the connection when probing.
                             The system default is controlled by the TCP ndd
                             parameter tcp_keepalive_abort_interval.  The
                             default is eight minutes.

     TCP_KEEPIDLE            This option, like TCP_KEEPALIVE_THRESHOLD,
                             determines the interval for sending the first
                             probe, except that the option value is an
                             unsigned integer in seconds.  It is provided
                             primarily for compatibility with other Unix
                             flavors.

     TCP_KEEPCNT             This option specifies the number of keep-alive
                             probes that should be sent without any response
                             from the peer before aborting the connection.

     TCP_KEEPINTVL           This option specifies the interval in seconds
                             between successive, unacknowledged keep-alive
                             probes.

   Additional Configuration
     illumos supports TCP Extensions for High Performance (RFC 7323) which
     includes the window scale and timestamp options, and Protection Against
     Wrap Around Sequence Numbers (PAWS).
     Note that if timestamps are
     negotiated on a connection, received segments without timestamps on that
     connection are silently dropped per the suggestion in the RFC.  illumos
     also supports Selective Acknowledgment (SACK) capabilities (RFC 2018) and
     the Explicit Congestion Notification (ECN) mechanism (RFC 3168).

     Turn on the window scale option in one of the following ways:

     o   An application can set SO_SNDBUF or SO_RCVBUF size in the
         setsockopt() option to be larger than 64K.  This must be done
         before the program calls listen() or connect(), because the
         window scale option is negotiated when the connection is
         established.  Once the connection has been made, it is too
         late to increase the send or receive window beyond the
         default TCP limit of 64K.

     o   For all applications, use ndd(1M) to modify the configuration
         parameter tcp_wscale_always.  If tcp_wscale_always is set to
         1, the window scale option will always be set when connecting
         to a remote system.  If tcp_wscale_always is 0, the window
         scale option will be set only if the user has requested a
         send or receive window larger than 64K.  The default value of
         tcp_wscale_always is 1.

     o   Regardless of the value of tcp_wscale_always, the window
         scale option will always be included in a connect
         acknowledgement if the connecting system has used the option.

     Turn on SACK capabilities in the following way:

     o   Use ndd to modify the configuration parameter
         tcp_sack_permitted.  If tcp_sack_permitted is set to 0, TCP
         will not accept SACK or send out SACK information.  If
         tcp_sack_permitted is set to 1, TCP will not initiate a
         connection with SACK permitted option in the SYN segment, but
         will respond with SACK permitted option in the SYN|ACK
         segment if an incoming connection request has the SACK
         permitted option.
         This means that TCP will only accept SACK
         information if the other side of the connection also accepts
         SACK information.  If tcp_sack_permitted is set to 2, it will
         both initiate and accept connections with SACK information.
         The default for tcp_sack_permitted is 2 (active enabled).

     Turn on the TCP ECN mechanism in the following way:

     o   Use ndd to modify the configuration parameter
         tcp_ecn_permitted.  If tcp_ecn_permitted is set to 0, then
         TCP will not negotiate with a peer that supports the ECN
         mechanism.  If tcp_ecn_permitted is set to 1 when initiating
         a connection, TCP will not tell a peer that it supports the
         ECN mechanism.  However, it will tell a peer that it supports
         the ECN mechanism when accepting a new incoming connection
         request if the peer indicates that it supports the ECN
         mechanism in the SYN segment.  If tcp_ecn_permitted is set to
         2, in addition to negotiating with a peer on the ECN
         mechanism when accepting connections, TCP will indicate in
         the outgoing SYN segment that it supports the ECN mechanism
         when TCP makes active outgoing connections.  The default for
         tcp_ecn_permitted is 1.

     Turn on the timestamp option in the following way:

     o   Use ndd to modify the configuration parameter
         tcp_tstamp_always.  If tcp_tstamp_always is 1, the timestamp
         option will always be set when connecting to a remote
         machine.  If tcp_tstamp_always is 0, the timestamp option
         will not be set when connecting to a remote system.  The
         default for tcp_tstamp_always is 0.

     o   Regardless of the value of tcp_tstamp_always, the timestamp
         option will always be included in a connect acknowledgement
         (and all succeeding packets) if the connecting system has
         used the timestamp option.

     Use the following procedure to turn on the timestamp option only when the
     window scale option is in effect:

     o   Use ndd to modify the configuration parameter
         tcp_tstamp_if_wscale.
         Setting tcp_tstamp_if_wscale to 1 will
         cause the timestamp option to be set when connecting to a
         remote system, if the window scale option has been set.  If
         tcp_tstamp_if_wscale is 0, the timestamp option will not be
         set when connecting to a remote system.  The default for
         tcp_tstamp_if_wscale is 1.

     Protection Against Wrap Around Sequence Numbers (PAWS) is always used
     when the timestamp option is set.

     The operating system also supports multiple methods of generating initial
     sequence numbers.  One of these methods is the improved technique
     suggested in RFC 1948.  It is strongly recommended that sequence number
     generation parameters be set as close to boot time as possible.  This
     prevents sequence number problems on connections that use the same
     connection-ID as ones that used a different sequence number generation.
     The svc:/network/initial:default service configures the initial sequence
     number generation.  The service reads the value contained in the
     configuration file /etc/default/inetinit to determine which method to
     use.

     The /etc/default/inetinit file is an unstable interface, and may change
     in future releases.
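
     The application-side way of turning on the window scale option, described
     earlier in this section, can be sketched as follows.  This is a minimal
     illustration, not a definitive implementation: the helper name
     request_large_window() is hypothetical and not part of any system
     interface, and the buffer size is chosen by the caller.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Hypothetical helper (an assumption for illustration only): request
 * send and receive buffers of `bytes` bytes on a TCP socket that has
 * not yet been connected.  A request larger than 64K has to be made
 * before connect() or listen(), because the window scale option is
 * negotiated when the connection is established.
 *
 * Returns 0 on success, or -1 if either setsockopt() call fails, in
 * which case errno is left as set by setsockopt().
 */
int
request_large_window(int fd, int bytes)
{
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes,
	    sizeof (bytes)) == -1)
		return (-1);

	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes,
	    sizeof (bytes)) == -1)
		return (-1);

	return (0);
}
```

     A caller would typically invoke request_large_window(fd, 256 * 1024)
     directly after socket() and before connect() or listen().  Whether the
     requested size is granted depends on system-wide limits, so the value in
     effect can be checked afterwards with getsockopt(3SOCKET).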

EXAMPLES
   Example 1: Connecting to a server
     $ gcc -std=c99 -Wall -lsocket -o client client.c
     $ cat client.c
     #include <sys/socket.h>
     #include <netinet/in.h>
     #include <netinet/tcp.h>
     #include <netdb.h>
     #include <stdio.h>
     #include <string.h>
     #include <unistd.h>

     int
     main(int argc, char *argv[])
     {
             struct addrinfo hints, *gair, *p;
             int fd, rv, rlen;
             char buf[1024];
             int y = 1;

             if (argc != 3) {
                     fprintf(stderr, "%s <host> <port>\n", argv[0]);
                     return (1);
             }

             memset(&hints, 0, sizeof (hints));
             hints.ai_family = PF_UNSPEC;
             hints.ai_socktype = SOCK_STREAM;

             if ((rv = getaddrinfo(argv[1], argv[2], &hints, &gair)) != 0) {
                     fprintf(stderr, "getaddrinfo() failed: %s\n",
                         gai_strerror(rv));
                     return (1);
             }

             /* Try each returned address until one connects. */
             for (p = gair; p != NULL; p = p->ai_next) {
                     if ((fd = socket(
                         p->ai_family,
                         p->ai_socktype,
                         p->ai_protocol)) == -1) {
                             perror("socket() failed");
                             continue;
                     }

                     if (connect(fd, p->ai_addr, p->ai_addrlen) == -1) {
                             close(fd);
                             perror("connect() failed");
                             continue;
                     }

                     break;
             }

             if (p == NULL) {
                     fprintf(stderr, "failed to connect to server\n");
                     return (1);
             }

             freeaddrinfo(gair);

             if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &y,
                 sizeof (y)) == -1) {
                     perror("setsockopt(SO_KEEPALIVE) failed");
                     return (1);
             }

             /* Copy everything the server sends to standard output. */
             while ((rlen = read(fd, buf, sizeof (buf))) > 0) {
                     fwrite(buf, rlen, 1, stdout);
             }

             if (rlen == -1) {
                     perror("read() failed");
             }

             fflush(stdout);

             if (close(fd) == -1) {
                     perror("close() failed");
             }

             return (0);
     }
     $ ./client 127.0.0.1 8080
     hello
     $ ./client ::1 8080
     hello

   Example 2: Accepting client connections
     $ gcc -std=c99 -Wall -lsocket -o server server.c
     $ cat server.c
     #include <sys/socket.h>
     #include <netinet/in.h>
     #include <netinet/tcp.h>
     #include <netdb.h>
     #include <stdio.h>
     #include <string.h>
     #include <unistd.h>
     #include <arpa/inet.h>

     void
     logmsg(struct sockaddr *s, int bytes)
     {
             char dq[INET6_ADDRSTRLEN];

             switch (s->sa_family) {
             case AF_INET: {
                     struct sockaddr_in *s4 = (struct sockaddr_in *)s;
                     inet_ntop(AF_INET, &s4->sin_addr, dq, sizeof (dq));
                     fprintf(stdout, "sent %d bytes to %s:%d\n",
                         bytes, dq, ntohs(s4->sin_port));
                     break;
             }
             case AF_INET6: {
                     struct sockaddr_in6 *s6 = (struct sockaddr_in6 *)s;
                     inet_ntop(AF_INET6, &s6->sin6_addr, dq, sizeof (dq));
                     fprintf(stdout, "sent %d bytes to [%s]:%d\n",
                         bytes, dq, ntohs(s6->sin6_port));
                     break;
             }
             default:
                     fprintf(stdout, "sent %d bytes to unknown client\n",
                         bytes);
                     break;
             }
     }

     int
     main(int argc, char *argv[])
     {
             struct addrinfo hints, *gair, *p;
             int sfd, cfd;
             int slen, wlen, rv;

             if (argc != 3) {
                     fprintf(stderr, "%s <port> <message>\n", argv[0]);
                     return (1);
             }

             slen = strlen(argv[2]);

             memset(&hints, 0, sizeof (hints));
             hints.ai_family = PF_UNSPEC;
             hints.ai_socktype = SOCK_STREAM;
             hints.ai_flags = AI_PASSIVE;

             if ((rv = getaddrinfo(NULL, argv[1], &hints, &gair)) != 0) {
                     fprintf(stderr, "getaddrinfo() failed: %s\n",
                         gai_strerror(rv));
                     return (1);
             }

             /* Bind to the first usable local address. */
             for (p = gair; p != NULL; p = p->ai_next) {
                     if ((sfd = socket(
                         p->ai_family,
                         p->ai_socktype,
                         p->ai_protocol)) == -1) {
                             perror("socket() failed");
                             continue;
                     }

                     if (bind(sfd, p->ai_addr, p->ai_addrlen) == -1) {
                             close(sfd);
                             perror("bind() failed");
                             continue;
                     }

                     break;
             }

             if (p == NULL) {
                     fprintf(stderr, "server failed to bind()\n");
                     return (1);
             }

             freeaddrinfo(gair);

             if (listen(sfd, 1024) != 0) {
                     perror("listen() failed");
                     return (1);
             }

             fprintf(stdout, "waiting for clients...\n");

             for (int times = 0; times < 5; times++) {
                     struct sockaddr_storage stor;
                     socklen_t alen = sizeof (stor);
                     struct sockaddr *addr = (struct sockaddr *)&stor;

                     if ((cfd = accept(sfd, addr, &alen)) == -1) {
                             perror("accept() failed");
                             continue;
                     }

                     wlen = 0;

                     /* write() may be partial; stop if it fails. */
                     do {
                             ssize_t n = write(cfd, argv[2] + wlen,
                                 slen - wlen);
                             if (n == -1) {
                                     perror("write() failed");
                                     break;
                             }
                             wlen += n;
                     } while (wlen < slen);

                     logmsg(addr, wlen);

                     if (close(cfd) == -1) {
                             perror("close(cfd) failed");
                     }
             }

             if (close(sfd) == -1) {
                     perror("close(sfd) failed");
             }

             fprintf(stdout, "finished.\n");

             return (0);
     }
     $ ./server 8080 $'hello\n'
     waiting for clients...
     sent 6 bytes to [::ffff:127.0.0.1]:59059
     sent 6 bytes to [::ffff:127.0.0.1]:47448
     sent 6 bytes to [::ffff:127.0.0.1]:54949
     sent 6 bytes to [::ffff:127.0.0.1]:55186
     sent 6 bytes to [::1]:62256
     finished.

DIAGNOSTICS
     A socket operation may fail if:

     EISCONN            A connect() operation was attempted on a socket
                        on which a connect() operation had already been
                        performed.

     ETIMEDOUT          A connection was dropped due to excessive
                        retransmissions.

     ECONNRESET         The remote peer forced the connection to be
                        closed (usually because the remote machine has
                        lost state information about the connection due
                        to a crash).

     ECONNREFUSED       The remote peer actively refused connection
                        establishment (usually because no process is
                        listening to the port).

     EADDRINUSE         A bind() operation was attempted on a socket with
                        a network address/port pair that has already been
                        bound to another socket.

     EADDRNOTAVAIL      A bind() operation was attempted on a socket with
                        a network address for which no network interface
                        exists.

     EACCES             A bind() operation was attempted with a
                        "reserved" port number and the effective user ID
                        of the process was not the privileged user.

     ENOBUFS            The system ran out of memory for internal data
                        structures.

SEE ALSO
     svcs(1), ndd(1M), svcadm(1M), ioctl(2), read(2), write(2),
     accept(3SOCKET), bind(3SOCKET), connect(3SOCKET),
     getprotobyname(3SOCKET), getsockopt(3SOCKET), listen(3SOCKET),
     send(3SOCKET), smf(5), inet(7P), inet6(7P), ip(7P), ip6(7P)

     K. Ramakrishnan, S. Floyd, and D. Black, The Addition of Explicit
     Congestion Notification (ECN) to IP, RFC 3168, September 2001.

     M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow, TCP Selective
     Acknowledgment Options, RFC 2018, October 1996.

     S. Bellovin, Defending Against Sequence Number Attacks, RFC 1948, May
     1996.

     D. Borman, B. Braden, V. Jacobson, and R. Scheffenegger, Ed., TCP
     Extensions for High Performance, RFC 7323, September 2014.

     Jon Postel, Transmission Control Protocol - DARPA Internet Program
     Protocol Specification, RFC 793, Network Information Center, SRI
     International, Menlo Park, CA., September 1981.

NOTES
     The tcp service is managed by the service management facility, smf(5),
     under the service identifier svc:/network/initial:default.

     Administrative actions on this service, such as enabling, disabling, or
     requesting restart, can be performed using svcadm(1M).  The service's
     status can be queried using the svcs(1) command.

illumos                         January 7, 2019                        illumos