TCP(7P)                           Protocols                          TCP(7P)

NAME
     tcp, TCP - Internet Transmission Control Protocol

SYNOPSIS
     #include <sys/socket.h>
     #include <netinet/in.h>
     #include <netinet/tcp.h>

     s = socket(AF_INET, SOCK_STREAM, 0);
     s = socket(AF_INET6, SOCK_STREAM, 0);
     t = t_open("/dev/tcp", O_RDWR);
     t = t_open("/dev/tcp6", O_RDWR);

DESCRIPTION
     TCP is the virtual circuit protocol of the Internet protocol family. It
     provides reliable, flow-controlled, in-order, two-way transmission of
     data. It is a byte-stream protocol layered above the Internet Protocol
     (IP), or the Internet Protocol Version 6 (IPv6), the Internet protocol
     family's internetwork datagram delivery protocol.

     Programs can access TCP using the socket interface as a SOCK_STREAM
     socket type, or using the Transport Level Interface (TLI) where it
     supports the connection-oriented (T_COTS_ORD) service type.

     A checksum over all data helps TCP provide reliable communication. Using
     a window-based flow control mechanism that makes use of positive
     acknowledgements, sequence numbers, and a retransmission strategy, TCP
     can usually recover when datagrams are damaged, delayed, duplicated or
     delivered out of order by the underlying medium.

     TCP provides several socket options, defined in <netinet/tcp.h> and
     described throughout this document, which may be set using
     setsockopt(3SOCKET) and read using getsockopt(3SOCKET). The level
     argument for these calls is the protocol number for TCP, available from
     getprotobyname(3SOCKET). IP level options may also be used with TCP.
     See ip(7P) and ip6(7P).

   Listening And Connecting
     TCP uses IP's host-level addressing and adds its own per-host collection
     of "port addresses". The endpoints of a TCP connection are identified by
     the combination of an IPv4 or IPv6 address and a TCP port number.
     Although other protocols, such as the User Datagram Protocol (UDP), may
     use the same host and port address format, the port space of these
     protocols is distinct. See inet(7P) and inet6(7P) for details on the
     common aspects of addressing in the Internet protocol family.

     Sockets utilizing TCP are either "active" or "passive". Active sockets
     initiate connections to passive sockets. Passive sockets must have their
     local IPv4 or IPv6 address and TCP port number bound with the
     bind(3SOCKET) system call after the socket is created. If an active
     socket has not been bound by the time connect(3SOCKET) is called, then
     the operating system will choose a local address and port for the
     application. By default, TCP sockets are active. A passive socket is
     created by calling the listen(3SOCKET) system call after binding, which
     establishes a queueing parameter for the passive socket. Connections to
     the passive socket can then be received using the accept(3SOCKET) system
     call. Active sockets use the connect(3SOCKET) call after binding to
     initiate connections.

     If incoming connection requests include an IP source route option, then
     the reverse source route will be used when responding.

     By using the special value INADDR_ANY with IPv4, or the unspecified
     address (all zeroes) with IPv6, the local IP address can be left
     unspecified in the bind() call by either active or passive TCP sockets.
     This feature is usually used if the local address is either unknown or
     irrelevant. If left unspecified, the local IP address will be bound at
     connection time to the address of the network interface used to service
     the connection. For passive sockets, this is the destination address
     used by the connecting peer. For active sockets, this is usually an
     address on the same subnet as the destination or default gateway address,
     although the rules can be more complex. See Source Address Selection in
     inet6(7P) for a detailed discussion of how this works in IPv6.

     Note that no two TCP sockets can be bound to the same port unless the
     bound IP addresses are different. IPv4 INADDR_ANY and IPv6 unspecified
     addresses compare as equal to any IPv4 or IPv6 address. For example, if
     a socket is bound to INADDR_ANY or the unspecified address and port N, no
     other socket can bind to port N, regardless of the binding address. This
     special consideration of INADDR_ANY and the unspecified address can be
     changed using the socket option SO_REUSEADDR. If SO_REUSEADDR is set on
     a socket doing a bind, IPv4 INADDR_ANY and the IPv6 unspecified address
     do not compare as equal to any IP address. This means that as long as
     the two sockets are not both bound to INADDR_ANY, the unspecified
     address, or the same IP address, then the two sockets can be bound to the
     same port.

     If an application does not want to allow another socket using the
     SO_REUSEADDR option to bind to a port its socket is bound to, the
     application can set the socket-level (SOL_SOCKET) option SO_EXCLBIND on a
     socket. A non-zero option value (typically 1) enables the option and a
     value of 0 disables it. Once this option is enabled on a socket, no other
     socket can be bound to the same port.
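
     The binding rules above can be sketched as follows. This is an
     illustration only: bind_tcp_port() is a hypothetical helper, it handles
     IPv4 only, and SO_EXCLBIND is compiled in only where the platform
     headers define it.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>
#include <unistd.h>

/*
 * Create an IPv4 TCP socket bound to INADDR_ANY and the given port (host
 * byte order). If reuse is non-zero, SO_REUSEADDR is set before bind().
 * If exclusive is non-zero and SO_EXCLBIND is available, it is set too.
 * Returns the descriptor, or -1 on failure.
 */
int
bind_tcp_port(unsigned short port, int reuse, int exclusive)
{
    struct sockaddr_in sin;
    int on = 1;
    int fd;

    if ((fd = socket(AF_INET, SOCK_STREAM, 0)) == -1)
        return (-1);

    if (reuse && setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on,
        sizeof (on)) == -1)
        goto fail;

#ifdef SO_EXCLBIND
    if (exclusive && setsockopt(fd, SOL_SOCKET, SO_EXCLBIND, &on,
        sizeof (on)) == -1)
        goto fail;
#else
    (void) exclusive;   /* option not defined on this platform */
#endif

    memset(&sin, 0, sizeof (sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&sin, sizeof (sin)) == -1)
        goto fail;

    return (fd);

fail:
    close(fd);
    return (-1);
}
```

     Passing a port of 0 lets the system choose a free port, which is
     convenient when experimenting with these options.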

   Sending And Receiving Data
     Once a connection has been established, data can be exchanged using the
     read(2) and write(2) system calls. If, after sending data, the local TCP
     receives no acknowledgements from its peer for a period of time (for
     example, if the remote machine crashes), the connection is closed and an
     error is returned.

     When a peer is sending data, it will only send up to the advertised
     "receive window", which is determined by how much more data the recipient
     can fit in its buffer. Applications can use the socket-level option
     SO_RCVBUF to increase or decrease the receive buffer size. Similarly,
     the socket-level option SO_SNDBUF can be used to allow TCP to buffer more
     unacknowledged and unsent data locally.
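
     A short sketch of adjusting these buffers follows. The helper name
     set_sockbuf() is hypothetical. The value read back is whatever the
     system reports, which may differ from the request if the system rounds
     or clamps it.

```c
#include <sys/socket.h>

/*
 * Request a buffer size for SO_RCVBUF or SO_SNDBUF (passed as which) and
 * return the size reported by the system afterwards, or -1 on error.
 */
int
set_sockbuf(int fd, int which, int bytes)
{
    socklen_t len = sizeof (bytes);

    if (setsockopt(fd, SOL_SOCKET, which, &bytes, sizeof (bytes)) == -1)
        return (-1);

    /* Read the value back rather than assuming the request was honored. */
    if (getsockopt(fd, SOL_SOCKET, which, &bytes, &len) == -1)
        return (-1);

    return (bytes);
}
```

     For example, set_sockbuf(fd, SO_RCVBUF, 128 * 1024) asks for a 128K
     receive buffer and returns the size actually in effect.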

     Under most circumstances, TCP will send data when it is written by the
     application. When outstanding data has not yet been acknowledged,
     though, TCP will gather small amounts of output to be sent as a single
     packet once an acknowledgement has been received. Usually referred to as
     Nagle's Algorithm (RFC 896), this behavior helps prevent flooding the
     network with many small packets.

     However, for some highly interactive clients (such as remote shells or
     windowing systems that send a stream of keypresses or mouse events), this
     batching may cause significant delays. To disable this behavior, TCP
     provides a boolean socket option, TCP_NODELAY.

     Conversely, for other applications, it may be desirable for TCP not to
     send out any data until a full TCP segment can be sent. To enable this
     behavior, an application can use the TCP-level socket option TCP_CORK.
     When set to a non-zero value, TCP will only send out a full TCP segment.
     When TCP_CORK is set to zero after it has been enabled, all currently
     buffered data is sent out (as permitted by the peer's receive window and
     the current congestion window).
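
     The two options above might be wrapped as shown in this sketch. The
     helper names set_nodelay() and set_cork() are hypothetical, and
     IPPROTO_TCP is used directly as the level for brevity. TCP_CORK is
     guarded because not every platform's headers define it.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Disable (on != 0) or re-enable (on == 0) Nagle batching. */
int
set_nodelay(int fd, int on)
{
    return (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof (on)));
}

/*
 * Hold back partial segments (on != 0), or release buffered data
 * (on == 0). Returns -1 where TCP_CORK is not defined.
 */
int
set_cork(int fd, int on)
{
#ifdef TCP_CORK
    return (setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof (on)));
#else
    (void) fd;
    (void) on;
    return (-1);
#endif
}
```

     An application that builds a response from several small write(2) calls
     might call set_cork(fd, 1) first and set_cork(fd, 0) once the last piece
     has been written, so that the pieces leave in as few segments as
     possible.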

     TCP provides an urgent data mechanism, which may be invoked using the
     out-of-band provisions of send(3SOCKET). The caller may mark one byte as
     "urgent" with the MSG_OOB flag to send(3SOCKET). This sets an "urgent
     pointer" pointing to this byte in the TCP stream. The receiver on the
     other side of the stream is notified of the urgent data by a SIGURG
     signal. The SIOCATMARK ioctl(2) request returns a value indicating
     whether the stream is at the urgent mark. Because the system never
     returns data across the urgent mark in a single read(2) call, it is
     possible to advance to the urgent data in a simple loop which reads data,
     testing the socket with the SIOCATMARK ioctl() request, until it reaches
     the mark.
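
     The loop described above could look like the following sketch. Note
     the substitution: it calls the POSIX sockatmark() function instead of
     issuing the SIOCATMARK ioctl() directly, on the assumption that
     sockatmark() reports the same condition and avoids platform-specific
     header differences. The helper name skip_to_urgent_mark() is
     hypothetical, and the data read before the mark is simply discarded.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Read and discard normal data until the socket is at the urgent mark.
 * Returns 0 once the mark is reached, or -1 on error or end of stream.
 */
int
skip_to_urgent_mark(int fd)
{
    char buf[512];

    for (;;) {
        int atmark = sockatmark(fd);
        ssize_t n;

        if (atmark == -1)
            return (-1);
        if (atmark == 1)
            return (0);

        /* A single read() never crosses the mark, so keep reading. */
        n = read(fd, buf, sizeof (buf));
        if (n <= 0)
            return (-1);
    }
}
```

     After the function returns 0, the urgent byte itself can be fetched
     with recv() and the MSG_OOB flag.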

   Congestion Control
     TCP follows the congestion control algorithm described in RFC 2581, and
     also supports the initial congestion window (cwnd) changes in RFC 3390.
     The initial cwnd calculation can be overridden by the socket option
     TCP_INIT_CWND. An application can use this option to set the initial
     cwnd to a specified number of TCP segments. This applies to the cases
     when the connection first starts and restarts after an idle period. The
     process must have the PRIV_SYS_NET_CONFIG privilege if it wants to
     specify a number greater than that calculated by RFC 3390.

     The operating system also provides alternative algorithms that may be
     more appropriate for your application, including the CUBIC congestion
     control algorithm described in RFC 8312. These can be configured
     system-wide using ipadm(1M), or on a per-connection basis with the
     TCP-level socket option TCP_CONGESTION, whose argument is the name of
     the algorithm to use (for example "cubic"). If the requested algorithm
     does not exist, then setsockopt() will fail, and errno will be set to
     ENOENT.
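
     A per-connection selection might be written as in this sketch. The
     helper name set_congestion() is hypothetical. Passing the length
     including the terminating NUL is an assumption made here so that the
     name is always terminated.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Select a congestion control algorithm by name for one socket.
 * Returns 0 on success or -1 on failure, reporting an unknown name
 * separately from other errors.
 */
int
set_congestion(int fd, const char *name)
{
#ifdef TCP_CONGESTION
    if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name,
        strlen(name) + 1) == -1) {
        if (errno == ENOENT)
            fprintf(stderr, "no such algorithm: %s\n", name);
        else
            perror("setsockopt(TCP_CONGESTION) failed");
        return (-1);
    }
    return (0);
#else
    (void) fd;
    (void) name;
    return (-1);    /* option not defined on this platform */
#endif
}
```

     For example, set_congestion(fd, "cubic") requests CUBIC for the
     connection on fd. Whether that algorithm is available depends on the
     system configuration.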

   TCP Keep-Alive
     Since TCP determines whether a remote peer is no longer reachable by
     timing out waiting for acknowledgements, a host that never sends any new
     data may never notice a peer that has gone away. While consumers can
     avoid this problem by sending their own periodic heartbeat messages
     (Transport Layer Security does this, for example), TCP describes an
     optional keep-alive mechanism in RFC 1122. Applications can enable it
     using the socket-level option SO_KEEPALIVE. When enabled, the first
     keep-alive probe is sent out after a TCP connection is idle for two
     hours. If the peer does not respond to the probe within eight minutes,
     the TCP connection is aborted. An application can alter the probe
     behavior using the following TCP-level socket options:

     TCP_KEEPALIVE_THRESHOLD
                     Determines the interval for sending the first
                     probe. The option value is specified as an
                     unsigned integer in milliseconds. The system
                     default is controlled by the TCP ndd parameter
                     tcp_keepalive_interval. The minimum value is ten
                     seconds. The maximum is ten days, while the
                     default is two hours.

     TCP_KEEPALIVE_ABORT_THRESHOLD
                     If TCP does not receive a response to the probe,
                     then this option determines how long to wait
                     before aborting a TCP connection. The option
                     value is an unsigned integer in milliseconds.
                     The value zero indicates that TCP should never
                     time out and abort the connection when probing.
                     The system default is controlled by the TCP ndd
                     parameter tcp_keepalive_abort_interval. The
                     default is eight minutes.

     TCP_KEEPIDLE    This option, like TCP_KEEPALIVE_THRESHOLD,
                     determines the interval for sending the first
                     probe, except that the option value is an
                     unsigned integer in seconds. It is provided
                     primarily for compatibility with other Unix
                     flavors.

     TCP_KEEPCNT     This option specifies the number of keep-alive
                     probes that should be sent without any response
                     from the peer before aborting the connection.

     TCP_KEEPINTVL   This option specifies the interval in seconds
                     between successive, unacknowledged keep-alive
                     probes.
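
     The options listed above can be combined as in the following sketch,
     which uses the seconds-based options. The helper name
     enable_keepalive() and the example values are hypothetical. Each
     TCP-level option is guarded in case the platform headers do not define
     it.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Enable keep-alive on a socket and tune the probe schedule: first probe
 * after idle_secs of idleness, then one probe every intvl_secs, aborting
 * after count unanswered probes. Returns 0 on success, -1 on failure.
 */
int
enable_keepalive(int fd, int idle_secs, int intvl_secs, int count)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof (on)) == -1)
        return (-1);

#ifdef TCP_KEEPIDLE
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs,
        sizeof (idle_secs)) == -1)
        return (-1);
#endif
#ifdef TCP_KEEPINTVL
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_secs,
        sizeof (intvl_secs)) == -1)
        return (-1);
#endif
#ifdef TCP_KEEPCNT
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count,
        sizeof (count)) == -1)
        return (-1);
#endif

    return (0);
}
```

     With enable_keepalive(fd, 60, 10, 5), an idle connection would be
     probed after one minute and then every ten seconds, up to five times.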

   Additional Configuration
     illumos supports TCP Extensions for High Performance (RFC 7323) which
     includes the window scale and timestamp options, and Protection Against
     Wrap Around Sequence Numbers (PAWS). Note that if timestamps are
     negotiated on a connection, received segments without timestamps on that
     connection are silently dropped per the suggestion in the RFC. illumos
     also supports Selective Acknowledgment (SACK) capabilities (RFC 2018) and
     Explicit Congestion Notification (ECN) mechanism (RFC 3168).

     Turn on the window scale option in one of the following ways:

     o   An application can set SO_SNDBUF or SO_RCVBUF size in the
         setsockopt() option to be larger than 64K. This must be done
         before the program calls listen() or connect(), because the
         window scale option is negotiated when the connection is
         established. Once the connection has been made, it is too
         late to increase the send or receive window beyond the
         default TCP limit of 64K.

     o   For all applications, use ndd(1M) to modify the configuration
         parameter tcp_wscale_always. If tcp_wscale_always is set to
         1, the window scale option will always be set when connecting
         to a remote system. If tcp_wscale_always is 0, the window
         scale option will be set only if the user has requested a
         send or receive window larger than 64K. The default value of
         tcp_wscale_always is 1.

     o   Regardless of the value of tcp_wscale_always, the window
         scale option will always be included in a connect
         acknowledgement if the connecting system has used the option.
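
     The ordering requirement in the first item above is sketched below for
     a passive socket. The helper name listen_large_window() is
     hypothetical and only IPv4 is handled. The point of the sketch is that
     both buffer sizes are requested before listen() is called.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>
#include <unistd.h>

/*
 * Create a listening IPv4 TCP socket whose send and receive buffers are
 * requested before listen(), so that a window larger than 64K can be
 * negotiated. Returns the descriptor, or -1 on failure.
 */
int
listen_large_window(unsigned short port, int bufsize, int backlog)
{
    struct sockaddr_in sin;
    int fd;

    if ((fd = socket(AF_INET, SOCK_STREAM, 0)) == -1)
        return (-1);

    /* Buffer sizes must be in place before the connection is set up. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsize,
        sizeof (bufsize)) == -1 ||
        setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsize,
        sizeof (bufsize)) == -1)
        goto fail;

    memset(&sin, 0, sizeof (sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&sin, sizeof (sin)) == -1)
        goto fail;
    if (listen(fd, backlog) == -1)
        goto fail;

    return (fd);

fail:
    close(fd);
    return (-1);
}
```

     An active socket would follow the same order, setting the buffer sizes
     before calling connect().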

     Turn on SACK capabilities in the following way:

     o   Use ndd to modify the configuration parameter
         tcp_sack_permitted. If tcp_sack_permitted is set to 0, TCP
         will not accept SACK or send out SACK information. If
         tcp_sack_permitted is set to 1, TCP will not initiate a
         connection with SACK permitted option in the SYN segment, but
         will respond with SACK permitted option in the SYN|ACK
         segment if an incoming connection request has the SACK
         permitted option. This means that TCP will only accept SACK
         information if the other side of the connection also accepts
         SACK information. If tcp_sack_permitted is set to 2, it will
         both initiate and accept connections with SACK information.
         The default for tcp_sack_permitted is 2 (active enabled).

     Turn on the TCP ECN mechanism in the following way:

     o   Use ndd to modify the configuration parameter
         tcp_ecn_permitted. If tcp_ecn_permitted is set to 0, then
         TCP will not negotiate with a peer that supports ECN
         mechanism. If tcp_ecn_permitted is set to 1 when initiating
         a connection, TCP will not tell a peer that it supports ECN
         mechanism. However, it will tell a peer that it supports ECN
         mechanism when accepting a new incoming connection request if
         the peer indicates that it supports ECN mechanism in the SYN
         segment. If tcp_ecn_permitted is set to 2, in addition to
         negotiating with a peer on ECN mechanism when accepting
         connections, TCP will indicate in the outgoing SYN segment
         that it supports ECN mechanism when TCP makes active outgoing
         connections. The default for tcp_ecn_permitted is 1.

     Turn on the timestamp option in the following way:

     o   Use ndd to modify the configuration parameter
         tcp_tstamp_always. If tcp_tstamp_always is 1, the timestamp
         option will always be set when connecting to a remote
         machine. If tcp_tstamp_always is 0, the timestamp option
         will not be set when connecting to a remote system. The
         default for tcp_tstamp_always is 0.

     o   Regardless of the value of tcp_tstamp_always, the timestamp
         option will always be included in a connect acknowledgement
         (and all succeeding packets) if the connecting system has
         used the timestamp option.

     Use the following procedure to turn on the timestamp option only when the
     window scale option is in effect:

     o   Use ndd to modify the configuration parameter
         tcp_tstamp_if_wscale. Setting tcp_tstamp_if_wscale to 1 will
         cause the timestamp option to be set when connecting to a
         remote system, if the window scale option has been set. If
         tcp_tstamp_if_wscale is 0, the timestamp option will not be
         set when connecting to a remote system. The default for
         tcp_tstamp_if_wscale is 1.

     Protection Against Wrap Around Sequence Numbers (PAWS) is always used
     when the timestamp option is set.

     The operating system also supports multiple methods of generating initial
     sequence numbers. One of these methods is the improved technique
     suggested in RFC 1948. We HIGHLY recommend that you set sequence number
     generation parameters as close to boot time as possible. This prevents
     sequence number problems on connections that use the same connection-ID
     as ones that used a different sequence number generation. The
     svc:/network/initial:default service configures the initial sequence
     number generation. The service reads the value contained in the
     configuration file /etc/default/inetinit to determine which method to
     use.

     The /etc/default/inetinit file is an unstable interface, and may change
     in future releases.

EXAMPLES
   Example 1: Connecting to a server
     $ gcc -std=c99 -Wall -lsocket -o client client.c
     $ cat client.c
     #include <sys/socket.h>
     #include <netinet/in.h>
     #include <netinet/tcp.h>
     #include <netdb.h>
     #include <stdio.h>
     #include <string.h>
     #include <unistd.h>

     int
     main(int argc, char *argv[])
     {
         struct addrinfo hints, *gair, *p;
         int fd, rv, rlen;
         char buf[1024];
         int y = 1;

         if (argc != 3) {
             fprintf(stderr, "%s <host> <port>\n", argv[0]);
             return (1);
         }

         /* Accept either IPv4 or IPv6 results for a stream socket. */
         memset(&hints, 0, sizeof (hints));
         hints.ai_family = PF_UNSPEC;
         hints.ai_socktype = SOCK_STREAM;

         if ((rv = getaddrinfo(argv[1], argv[2], &hints, &gair)) != 0) {
             fprintf(stderr, "getaddrinfo() failed: %s\n",
                 gai_strerror(rv));
             return (1);
         }

         /* Try each returned address until one connects. */
         for (p = gair; p != NULL; p = p->ai_next) {
             if ((fd = socket(
                 p->ai_family,
                 p->ai_socktype,
                 p->ai_protocol)) == -1) {
                 perror("socket() failed");
                 continue;
             }

             if (connect(fd, p->ai_addr, p->ai_addrlen) == -1) {
                 close(fd);
                 perror("connect() failed");
                 continue;
             }

             break;
         }

         if (p == NULL) {
             fprintf(stderr, "failed to connect to server\n");
             return (1);
         }

         freeaddrinfo(gair);

         if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &y,
             sizeof (y)) == -1) {
             perror("setsockopt(SO_KEEPALIVE) failed");
             return (1);
         }

         /* Copy everything the server sends to standard output. */
         while ((rlen = read(fd, buf, sizeof (buf))) > 0) {
             fwrite(buf, rlen, 1, stdout);
         }

         if (rlen == -1) {
             perror("read() failed");
         }

         fflush(stdout);

         if (close(fd) == -1) {
             perror("close() failed");
         }

         return (0);
     }
     $ ./client 127.0.0.1 8080
     hello
     $ ./client ::1 8080
     hello

   Example 2: Accepting client connections
     $ gcc -std=c99 -Wall -lsocket -o server server.c
     $ cat server.c
     #include <sys/socket.h>
     #include <netinet/in.h>
     #include <netinet/tcp.h>
     #include <netdb.h>
     #include <stdio.h>
     #include <string.h>
     #include <unistd.h>
     #include <arpa/inet.h>

     /* Report how many bytes were sent and to which peer address. */
     void
     logmsg(struct sockaddr *s, int bytes)
     {
         char dq[INET6_ADDRSTRLEN];

         switch (s->sa_family) {
         case AF_INET: {
             struct sockaddr_in *s4 = (struct sockaddr_in *)s;
             inet_ntop(AF_INET, &s4->sin_addr, dq, sizeof (dq));
             fprintf(stdout, "sent %d bytes to %s:%d\n",
                 bytes, dq, ntohs(s4->sin_port));
             break;
         }
         case AF_INET6: {
             struct sockaddr_in6 *s6 = (struct sockaddr_in6 *)s;
             inet_ntop(AF_INET6, &s6->sin6_addr, dq, sizeof (dq));
             fprintf(stdout, "sent %d bytes to [%s]:%d\n",
                 bytes, dq, ntohs(s6->sin6_port));
             break;
         }
         default:
             fprintf(stdout, "sent %d bytes to unknown client\n",
                 bytes);
             break;
         }
     }

     int
     main(int argc, char *argv[])
     {
         struct addrinfo hints, *gair, *p;
         int sfd, cfd;
         int slen, wlen, rv;

         if (argc != 3) {
             fprintf(stderr, "%s <port> <message>\n", argv[0]);
             return (1);
         }

         slen = strlen(argv[2]);

         /* AI_PASSIVE with a NULL host yields wildcard addresses. */
         memset(&hints, 0, sizeof (hints));
         hints.ai_family = PF_UNSPEC;
         hints.ai_socktype = SOCK_STREAM;
         hints.ai_flags = AI_PASSIVE;

         if ((rv = getaddrinfo(NULL, argv[1], &hints, &gair)) != 0) {
             fprintf(stderr, "getaddrinfo() failed: %s\n",
                 gai_strerror(rv));
             return (1);
         }

         /* Bind to the first address that works. */
         for (p = gair; p != NULL; p = p->ai_next) {
             if ((sfd = socket(
                 p->ai_family,
                 p->ai_socktype,
                 p->ai_protocol)) == -1) {
                 perror("socket() failed");
                 continue;
             }

             if (bind(sfd, p->ai_addr, p->ai_addrlen) == -1) {
                 close(sfd);
                 perror("bind() failed");
                 continue;
             }

             break;
         }

         if (p == NULL) {
             fprintf(stderr, "server failed to bind()\n");
             return (1);
         }

         freeaddrinfo(gair);

         if (listen(sfd, 1024) != 0) {
             perror("listen() failed");
             return (1);
         }

         fprintf(stdout, "waiting for clients...\n");

         /* Serve five clients, then exit. */
         for (int times = 0; times < 5; times++) {
             struct sockaddr_storage stor;
             socklen_t alen = sizeof (stor);
             struct sockaddr *addr = (struct sockaddr *)&stor;

             if ((cfd = accept(sfd, addr, &alen)) == -1) {
                 perror("accept() failed");
                 continue;
             }

             wlen = 0;

             /* write() may be partial; stop on error instead of looping. */
             while (wlen < slen) {
                 ssize_t n = write(cfd, argv[2] + wlen, slen - wlen);

                 if (n == -1) {
                     perror("write() failed");
                     break;
                 }
                 wlen += n;
             }

             logmsg(addr, wlen);

             if (close(cfd) == -1) {
                 perror("close(cfd) failed");
             }
         }

         if (close(sfd) == -1) {
             perror("close(sfd) failed");
         }

         fprintf(stdout, "finished.\n");

         return (0);
     }
     $ ./server 8080 $'hello\n'
     waiting for clients...
     sent 6 bytes to [::ffff:127.0.0.1]:59059
     sent 6 bytes to [::ffff:127.0.0.1]:47448
     sent 6 bytes to [::ffff:127.0.0.1]:54949
     sent 6 bytes to [::ffff:127.0.0.1]:55186
     sent 6 bytes to [::1]:62256
     finished.

DIAGNOSTICS
     A socket operation may fail if:

     EISCONN         A connect() operation was attempted on a socket
                     on which a connect() operation had already been
                     performed.

     ETIMEDOUT       A connection was dropped due to excessive
                     retransmissions.

     ECONNRESET      The remote peer forced the connection to be
                     closed (usually because the remote machine has
                     lost state information about the connection due
                     to a crash).

     ECONNREFUSED    The remote peer actively refused connection
                     establishment (usually because no process is
                     listening to the port).

     EADDRINUSE      A bind() operation was attempted on a socket with
                     a network address/port pair that has already been
                     bound to another socket.

     EADDRNOTAVAIL   A bind() operation was attempted on a socket with
                     a network address for which no network interface
                     exists.

     EACCES          A bind() operation was attempted with a
                     "reserved" port number and the effective user ID
                     of the process was not the privileged user.

     ENOBUFS         The system ran out of memory for internal data
                     structures.

SEE ALSO
     svcs(1), ndd(1M), svcadm(1M), ioctl(2), read(2), write(2),
     accept(3SOCKET), bind(3SOCKET), connect(3SOCKET),
     getprotobyname(3SOCKET), getsockopt(3SOCKET), listen(3SOCKET),
     send(3SOCKET), smf(5), inet(7P), inet6(7P), ip(7P), ip6(7P)

     K. Ramakrishnan, S. Floyd, and D. Black, The Addition of Explicit
     Congestion Notification (ECN) to IP, RFC 3168, September 2001.

     M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow, TCP Selective
     Acknowledgment Options, RFC 2018, October 1996.

     S. Bellovin, Defending Against Sequence Number Attacks, RFC 1948, May
     1996.

     D. Borman, B. Braden, V. Jacobson, and R. Scheffenegger, Ed., TCP
     Extensions for High Performance, RFC 7323, September 2014.

     Jon Postel, Transmission Control Protocol - DARPA Internet Program
     Protocol Specification, RFC 793, Network Information Center, SRI
     International, Menlo Park, CA., September 1981.

NOTES
     The tcp service is managed by the service management facility, smf(5),
     under the service identifier svc:/network/initial:default.

     Administrative actions on this service, such as enabling, disabling, or
     requesting restart, can be performed using svcadm(1M). The service's
     status can be queried using the svcs(1) command.

illumos                        January 7, 2019                        illumos