tcp, TCP - Internet Transmission Control Protocol
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
s = socket(AF_INET, SOCK_STREAM, 0);
s = socket(AF_INET6, SOCK_STREAM, 0);
t = t_open("/dev/tcp", O_RDWR);
t = t_open("/dev/tcp6", O_RDWR);
TCP is the virtual circuit protocol of the Internet protocol family. It provides
reliable, flow-controlled, in-order, two-way transmission of data. It is a
byte-stream protocol layered above the Internet Protocol (IP), or the
Internet Protocol Version 6 (IPv6), the Internet protocol family's
internetwork datagram delivery protocol.
Programs can access TCP using the socket interface as a SOCK_STREAM socket
type, or using the Transport Level Interface (TLI), where it supports the
connection-oriented (T_COTS_ORD) service type.
A checksum over all data helps TCP provide reliable communication. Using a
window-based flow control mechanism that makes use of positive
acknowledgements, sequence numbers, and a retransmission strategy, TCP can
usually recover when datagrams are damaged, delayed, duplicated or delivered
out of order by the underlying medium.
TCP provides several socket options, defined in
<netinet/tcp.h>
and described throughout this document, which may be set using
setsockopt(3SOCKET) and read using
getsockopt(3SOCKET). The
level argument for these calls is the
protocol number for TCP, available from
getprotobyname(3SOCKET). IP level options may
also be used with TCP. See
ip(7P) and
ip6(7P).
TCP uses IP's host-level addressing and adds its own per-host collection of
“port addresses”. The endpoints of a TCP connection are
identified by the combination of an IPv4 or IPv6 address and a TCP port
number. Although other protocols, such as the User Datagram Protocol (UDP),
may use the same host and port address format, the port spaces of these
protocols are distinct. See
inet(7P) and
inet6(7P) for details on the common aspects of
addressing in the Internet protocol family.
Sockets utilizing TCP are either “active” or
“passive”. Active sockets initiate connections to passive
sockets. Passive sockets must have their local IPv4 or IPv6 address and TCP
port number bound with the
bind(3SOCKET) system
call after the socket is created. If an active socket has not been bound by
the time
connect(3SOCKET) is called, then the
operating system will choose a local address and port for the application. By
default, TCP sockets are active. A passive socket is created by calling the
listen(3SOCKET) system call after binding, which
establishes a queueing parameter for the passive socket. Connections to the
passive socket can then be received using the
accept(3SOCKET) system call. Active sockets use
the
connect(3SOCKET) call after binding to
initiate connections.
If incoming connection requests include an IP source route option, then the
reverse source route will be used when responding.
By using the special value INADDR_ANY with IPv4, or the unspecified address
(all zeroes) with IPv6, the local IP address can be left unspecified in the
bind() call by either active or passive TCP sockets. This feature is usually
used if the
local address is either unknown or irrelevant. If left unspecified, the local
IP address will be bound at connection time to the address of the network
interface used to service the connection. For passive sockets, this is the
destination address used by the connecting peer. For active sockets, this is
usually an address on the same subnet as the destination or default gateway
address, although the rules can be more complex. See
Source Address Selection in
inet6(7P) for a detailed discussion of how this
works in IPv6.
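As a minimal sketch of leaving the local address unspecified, the following hypothetical helper, bind_any_ipv4(), binds a new IPv4 socket to INADDR_ANY. The helper name and its error handling are assumptions made for illustration.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an IPv4 TCP socket and bind it to INADDR_ANY
 * and the given port (host byte order). Returns the socket, or -1 on error.
 */
int
bind_any_ipv4(unsigned short port)
{
    struct sockaddr_in sin;
    int fd;

    if ((fd = socket(AF_INET, SOCK_STREAM, 0)) == -1)
        return (-1);

    memset(&sin, 0, sizeof (sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);    /* local address unspecified */
    sin.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&sin, sizeof (sin)) == -1) {
        (void) close(fd);
        return (-1);
    }
    return (fd);
}
```

An IPv6 variant would use struct sockaddr_in6 with sin6_addr set to in6addr_any.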
Note that no two TCP sockets can be bound to the same port unless the bound IP
addresses are different. IPv4 INADDR_ANY and IPv6 unspecified addresses
compare as equal to any IPv4 or IPv6 address. For example, if a socket is
bound to INADDR_ANY or the unspecified address and port N, no other socket
can bind to port N, regardless of the binding address. This special
consideration of INADDR_ANY and the unspecified address can be changed using
the socket option SO_REUSEADDR. If SO_REUSEADDR is set on a socket doing a
bind, IPv4 INADDR_ANY and the IPv6 unspecified address do not compare as
equal to any IP address. This means that as long as the two sockets are not
both bound to INADDR_ANY, the unspecified address, or the same IP address,
then the two sockets can be bound to the same port.
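A minimal sketch of setting SO_REUSEADDR follows. The helper name enable_reuseaddr() is hypothetical; the option has to be set before bind() is called for it to affect that bind.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: enable SO_REUSEADDR on fd; call before bind(). */
int
enable_reuseaddr(int fd)
{
    int on = 1;

    return (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof (on)));
}
```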
If an application does not want to allow another socket using the
SO_REUSEADDR option to bind to a port its socket is bound to, the application
can set the socket-level (SOL_SOCKET) option SO_EXCLBIND on a socket. Option
values of 0 and 1 disable and enable the option, respectively. Once this
option is enabled on a socket, no other socket can be bound to the same port.
Once a connection has been established, data can be exchanged using the
read(2) and
write(2)
system calls. If, after sending data, the local TCP receives no
acknowledgements from its peer for a period of time (for example, if the
remote machine crashes), the connection is closed and an error is returned.
When a peer is sending data, it will only send up to the advertised
“receive window”, which is determined by how much more data the
recipient can fit in its buffer. Applications can use the socket-level option
SO_RCVBUF
to increase or decrease the
receive buffer size. Similarly, the socket-level option
SO_SNDBUF
can be used to allow TCP to
buffer more unacknowledged and unsent data locally.
Under most circumstances, TCP will send data when it is written by the
application. When outstanding data has not yet been acknowledged, though, TCP
will gather small amounts of output to be sent as a single packet once an
acknowledgement has been received. Usually referred to as Nagle's Algorithm
(RFC 896), this behavior helps prevent flooding the network with many small
packets.
However, for some highly interactive clients (such as remote shells or windowing
systems that send a stream of keypresses or mouse events), this batching may
cause significant delays. To disable this behavior, TCP provides a boolean
socket option, TCP_NODELAY.
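A minimal sketch of toggling this option is shown below. The helper name set_nodelay() is hypothetical, and IPPROTO_TCP is used as the level on the assumption that it matches the protocol number returned by getprotobyname(3SOCKET).

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Hypothetical helper: a non-zero "on" disables Nagle batching on fd. */
int
set_nodelay(int fd, int on)
{
    return (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof (on)));
}
```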
Conversely, for other applications, it may be desirable for TCP not to send out
any data until a full TCP segment can be sent. To enable this behavior, an
application can use the TCP-level socket option TCP_CORK. When set to a
non-zero value, TCP will only send out a full TCP segment. When TCP_CORK is
set to zero after it has been enabled, all currently buffered data is sent
out (as permitted by the peer's receive window and the current congestion
window).
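One way to use this option is sketched below: cork the socket, write several pieces, then uncork to flush. The helper names set_cork() and write_corked() are hypothetical, and the sketch ignores short writes for brevity.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>

/* Hypothetical helper: a non-zero "on" corks fd, zero uncorks it. */
int
set_cork(int fd, int on)
{
    return (setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof (on)));
}

/*
 * Hypothetical helper: write a header and a body on a connected socket
 * while corked, so that TCP may coalesce them into full segments.
 * Short writes are not retried in this sketch.
 */
int
write_corked(int fd, const void *hdr, size_t hlen,
    const void *body, size_t blen)
{
    int rv = 0;

    if (set_cork(fd, 1) == -1)
        return (-1);
    if (write(fd, hdr, hlen) == -1 || write(fd, body, blen) == -1)
        rv = -1;
    /* Always uncork so that any buffered data is sent. */
    if (set_cork(fd, 0) == -1)
        rv = -1;
    return (rv);
}
```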
TCP provides an urgent data mechanism, which may be invoked using the
out-of-band provisions of send(3SOCKET). The caller may mark one byte as
“urgent” with the MSG_OOB flag to send(3SOCKET). This sets an “urgent
pointer” pointing to this byte in the TCP stream. The receiver on the other
side of the stream is notified of the urgent data by a SIGURG signal. The
SIOCATMARK ioctl(2) request returns a value indicating whether the stream is
at the urgent mark. Because the system never returns data across the urgent
mark in a single read(2) call, it is possible to advance to the urgent data
in a simple loop which reads data, testing the socket with the SIOCATMARK
ioctl() request, until it reaches the mark.
TCP follows the congestion control algorithm described in RFC 2581, and also
supports the initial congestion window (cwnd) changes in RFC 3390. The initial
cwnd calculation can be overridden by the socket option TCP_INIT_CWND. An
application can use this option to set the initial cwnd to a specified number
of TCP segments. This applies to the cases when the connection first starts
and restarts after an idle period. The process must have the
PRIV_SYS_NET_CONFIG privilege if it wants to specify a number greater than
that calculated by RFC 3390.
The operating system also provides alternative algorithms that may be more
appropriate for your application, including the CUBIC congestion control
algorithm described in RFC 8312. These can be configured system-wide using
ipadm(1M), or on a per-connection basis with the TCP-level socket option
TCP_CONGESTION, whose argument is the name of the algorithm to use (for
example “cubic”). If the requested algorithm does not exist, then
setsockopt() will fail, and errno will be set to ENOENT.
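A per-connection selection could be sketched as follows. The helper name set_congestion() is hypothetical, and passing the string length without the terminating NUL as the option length is an assumption about the expected argument form.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>

/*
 * Hypothetical helper: select a congestion control algorithm by name,
 * for example "cubic". Returns -1 with errno set if setsockopt() fails.
 */
int
set_congestion(int fd, const char *name)
{
    return (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name,
        strlen(name)));
}
```

A caller can check errno for ENOENT after a failure to distinguish an unknown algorithm name from other errors.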
Since TCP determines whether a remote peer is no longer reachable by timing out
waiting for acknowledgements, a host that never sends any new data may never
notice a peer that has gone away. While consumers can avoid this problem by
sending their own periodic heartbeat messages (Transport Layer Security does
this, for example), TCP describes an optional keep-alive mechanism in RFC
1122. Applications can enable it using the socket-level option SO_KEEPALIVE.
When enabled, the first keep-alive probe is sent out after a TCP connection
is idle for two hours. If the peer does not respond to the probe within eight
minutes, the TCP connection is aborted. An application can alter the probe
behavior using the following TCP-level socket options:
TCP_KEEPALIVE_THRESHOLD
    Determines the interval for sending the first probe. The option value is
    specified as an unsigned integer in milliseconds. The system default is
    controlled by the TCP ndd parameter tcp_keepalive_interval. The minimum
    value is ten seconds. The maximum is ten days, while the default is two
    hours.
TCP_KEEPALIVE_ABORT_THRESHOLD
    If TCP does not receive a response to the probe, then this option
    determines how long to wait before aborting a TCP connection. The option
    value is an unsigned integer in milliseconds. The value zero indicates
    that TCP should never time out and abort the connection when probing. The
    system default is controlled by the TCP ndd parameter
    tcp_keepalive_abort_interval. The default is eight minutes.
TCP_KEEPIDLE
    This option, like TCP_KEEPALIVE_THRESHOLD, determines the interval for
    sending the first probe, except that the option value is an unsigned
    integer in seconds. It is provided primarily for compatibility with other
    Unix flavors.
TCP_KEEPCNT
    This option specifies the number of keep-alive probes that should be sent
    without any response from the peer before aborting the connection.
TCP_KEEPINTVL
    This option specifies the interval in seconds between successive,
    unacknowledged keep-alive probes.
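The options above can be combined as in the following sketch, which enables keep-alive and then tunes the probe timing with the seconds-based options. The helper name enable_keepalive() and the choice of the portable TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT options over the millisecond-based ones are assumptions made for illustration.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Hypothetical helper: enable keep-alive on fd, sending the first probe
 * after idle_secs of idle time, then one probe every intvl_secs, and
 * aborting the connection after count unanswered probes.
 */
int
enable_keepalive(int fd, int idle_secs, int intvl_secs, int count)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof (on)) == -1)
        return (-1);
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs,
        sizeof (idle_secs)) == -1)
        return (-1);
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_secs,
        sizeof (intvl_secs)) == -1)
        return (-1);
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count,
        sizeof (count)) == -1)
        return (-1);
    return (0);
}
```

For example, enable_keepalive(fd, 60, 10, 5) asks for a first probe after one minute of idle time, followed by up to five probes ten seconds apart.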
illumos supports TCP Extensions for High Performance (RFC 7323), which include
the window scale and timestamp options and Protection Against Wrap Around
Sequence Numbers (PAWS). Note that if timestamps are negotiated on a
connection, received segments without timestamps on that connection are
silently dropped per the suggestion in the RFC. illumos also supports
Selective Acknowledgment (SACK) capabilities (RFC 2018) and the Explicit
Congestion Notification (ECN) mechanism (RFC 3168).
Turn on the window scale option in one of the following ways:
- An application can use setsockopt() to set the SO_SNDBUF or SO_RCVBUF size
  to be larger than 64K. This must be done before the program calls listen()
  or connect(), because the window scale option is negotiated when the
  connection is established. Once the connection has been made, it is too
  late to increase the send or receive window beyond the default TCP limit of
  64K.
- For all applications, use ndd(1M) to modify the configuration parameter
  tcp_wscale_always. If tcp_wscale_always is set to 1, the window scale
  option will always be set when connecting to a remote system. If
  tcp_wscale_always is 0, the window scale option will be set only if the
  user has requested a send or receive window larger than 64K. The default
  value of tcp_wscale_always is 1.
- Regardless of the value of tcp_wscale_always, the window scale option will
  always be included in a connect acknowledgement if the connecting system
  has used the option.
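The first approach can be sketched as follows. The helper name request_large_buffers() is hypothetical; the caller is expected to invoke it after socket() but before listen() or connect().

```c
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: request send and receive buffers of the given size
 * in bytes. To influence window scaling, call this before listen() or
 * connect().
 */
int
request_large_buffers(int fd, int bytes)
{
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes,
        sizeof (bytes)) == -1)
        return (-1);
    return (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes,
        sizeof (bytes)));
}
```

For example, request_large_buffers(fd, 256 * 1024) asks for 256K buffers. The size actually granted can be read back with getsockopt(3SOCKET).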
Turn on SACK capabilities in the following way:
- Use ndd to modify the configuration parameter tcp_sack_permitted. If
  tcp_sack_permitted is set to 0, TCP will not accept SACK or send out SACK
  information. If tcp_sack_permitted is set to 1, TCP will not initiate a
  connection with the SACK permitted option in the SYN segment, but will
  respond with the SACK permitted option in the SYN|ACK segment if an
  incoming connection request has the SACK permitted option. This means that
  TCP will only accept SACK information if the other side of the connection
  also accepts SACK information. If tcp_sack_permitted is set to 2, it will
  both initiate and accept connections with SACK information. The default for
  tcp_sack_permitted is 2 (active enabled).
Turn on the TCP ECN mechanism in the following way:
- Use ndd to modify the configuration parameter tcp_ecn_permitted. If
  tcp_ecn_permitted is set to 0, then TCP will not negotiate with a peer that
  supports the ECN mechanism. If tcp_ecn_permitted is set to 1, then when
  initiating a connection, TCP will not tell a peer that it supports the ECN
  mechanism. However, it will tell a peer that it supports the ECN mechanism
  when accepting a new incoming connection request if the peer indicates that
  it supports the ECN mechanism in the SYN segment. If tcp_ecn_permitted is
  set to 2, in addition to negotiating with a peer on the ECN mechanism when
  accepting connections, TCP will indicate in the outgoing SYN segment that
  it supports the ECN mechanism when TCP makes active outgoing connections.
  The default for tcp_ecn_permitted is 1.
Turn on the timestamp option in the following way:
- Use ndd to modify the configuration parameter tcp_tstamp_always. If
  tcp_tstamp_always is 1, the timestamp option will always be set when
  connecting to a remote machine. If tcp_tstamp_always is 0, the timestamp
  option will not be set when connecting to a remote system. The default for
  tcp_tstamp_always is 0.
- Regardless of the value of tcp_tstamp_always, the timestamp option will
  always be included in a connect acknowledgement (and all succeeding
  packets) if the connecting system has used the timestamp option.
Use the following procedure to turn on the timestamp option only when the window
scale option is in effect:
- Use ndd to modify the configuration parameter tcp_tstamp_if_wscale. Setting
  tcp_tstamp_if_wscale to 1 will cause the timestamp option to be set when
  connecting to a remote system, if the window scale option has been set. If
  tcp_tstamp_if_wscale is 0, the timestamp option will not be set when
  connecting to a remote system. The default for tcp_tstamp_if_wscale is 1.
Protection Against Wrap Around Sequence Numbers (PAWS) is always used when the
timestamp option is set.
The operating system also supports multiple methods of generating initial
sequence numbers. One of these methods is the improved technique suggested in
RFC 1948. We HIGHLY recommend that you set sequence number generation
parameters as close to boot time as possible. This prevents sequence number
problems on connections that use the same connection-ID as ones that used a
different sequence number generation. The svc:/network/initial:default
service configures the initial sequence number generation. The service reads
the value contained in the configuration file /etc/default/inetinit to
determine which method to use.
The /etc/default/inetinit file is an unstable interface, and may change in
future releases.
$ gcc -std=c99 -Wall -lsocket -o client client.c
$ cat client.c
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    struct addrinfo hints, *gair, *p;
    int fd, rv, rlen;
    char buf[1024];
    int y = 1;

    if (argc != 3) {
        fprintf(stderr, "%s <host> <port>\n", argv[0]);
        return (1);
    }

    /* Resolve the host and port for either IPv4 or IPv6. */
    memset(&hints, 0, sizeof (hints));
    hints.ai_family = PF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    if ((rv = getaddrinfo(argv[1], argv[2], &hints, &gair)) != 0) {
        fprintf(stderr, "getaddrinfo() failed: %s\n",
            gai_strerror(rv));
        return (1);
    }

    /* Try each returned address until one connects. */
    for (p = gair; p != NULL; p = p->ai_next) {
        if ((fd = socket(
            p->ai_family,
            p->ai_socktype,
            p->ai_protocol)) == -1) {
            perror("socket() failed");
            continue;
        }

        if (connect(fd, p->ai_addr, p->ai_addrlen) == -1) {
            close(fd);
            perror("connect() failed");
            continue;
        }

        break;
    }

    if (p == NULL) {
        fprintf(stderr, "failed to connect to server\n");
        return (1);
    }

    freeaddrinfo(gair);

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &y,
        sizeof (y)) == -1) {
        perror("setsockopt(SO_KEEPALIVE) failed");
        return (1);
    }

    /* Copy everything the server sends to standard output. */
    while ((rlen = read(fd, buf, sizeof (buf))) > 0) {
        fwrite(buf, rlen, 1, stdout);
    }

    if (rlen == -1) {
        perror("read() failed");
    }

    fflush(stdout);

    if (close(fd) == -1) {
        perror("close() failed");
    }

    return (0);
}
$ ./client 127.0.0.1 8080
hello
$ ./client ::1 8080
hello
$ gcc -std=c99 -Wall -lsocket -o server server.c
$ cat server.c
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>

/* Report how many bytes were sent to the client at address s. */
void
logmsg(struct sockaddr *s, int bytes)
{
    char dq[INET6_ADDRSTRLEN];

    switch (s->sa_family) {
    case AF_INET: {
        struct sockaddr_in *s4 = (struct sockaddr_in *)s;
        inet_ntop(AF_INET, &s4->sin_addr, dq, sizeof (dq));
        fprintf(stdout, "sent %d bytes to %s:%d\n",
            bytes, dq, ntohs(s4->sin_port));
        break;
    }
    case AF_INET6: {
        struct sockaddr_in6 *s6 = (struct sockaddr_in6 *)s;
        inet_ntop(AF_INET6, &s6->sin6_addr, dq, sizeof (dq));
        fprintf(stdout, "sent %d bytes to [%s]:%d\n",
            bytes, dq, ntohs(s6->sin6_port));
        break;
    }
    default:
        fprintf(stdout, "sent %d bytes to unknown client\n",
            bytes);
        break;
    }
}

int
main(int argc, char *argv[])
{
    struct addrinfo hints, *gair, *p;
    int sfd, cfd;
    int slen, wlen, rv;

    if (argc != 3) {
        fprintf(stderr, "%s <port> <message>\n", argv[0]);
        return (1);
    }

    slen = strlen(argv[2]);

    /* AI_PASSIVE with a NULL host yields addresses suitable for bind(). */
    memset(&hints, 0, sizeof (hints));
    hints.ai_family = PF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_PASSIVE;

    if ((rv = getaddrinfo(NULL, argv[1], &hints, &gair)) != 0) {
        fprintf(stderr, "getaddrinfo() failed: %s\n",
            gai_strerror(rv));
        return (1);
    }

    /* Bind to the first address that works. */
    for (p = gair; p != NULL; p = p->ai_next) {
        if ((sfd = socket(
            p->ai_family,
            p->ai_socktype,
            p->ai_protocol)) == -1) {
            perror("socket() failed");
            continue;
        }

        if (bind(sfd, p->ai_addr, p->ai_addrlen) == -1) {
            close(sfd);
            perror("bind() failed");
            continue;
        }

        break;
    }

    if (p == NULL) {
        fprintf(stderr, "server failed to bind()\n");
        return (1);
    }

    freeaddrinfo(gair);

    if (listen(sfd, 1024) != 0) {
        perror("listen() failed");
        return (1);
    }

    fprintf(stdout, "waiting for clients...\n");

    /* Serve five clients, then exit. */
    for (int times = 0; times < 5; times++) {
        struct sockaddr_storage stor;
        socklen_t alen = sizeof (stor);
        struct sockaddr *addr = (struct sockaddr *)&stor;

        if ((cfd = accept(sfd, addr, &alen)) == -1) {
            perror("accept() failed");
            continue;
        }

        /* write() may send only part of the message, so loop. */
        wlen = 0;
        while (wlen < slen) {
            ssize_t n = write(cfd, argv[2] + wlen, slen - wlen);

            if (n == -1) {
                perror("write() failed");
                break;
            }
            wlen += n;
        }

        logmsg(addr, wlen);

        if (close(cfd) == -1) {
            perror("close(cfd) failed");
        }
    }

    if (close(sfd) == -1) {
        perror("close(sfd) failed");
    }

    fprintf(stdout, "finished.\n");
    return (0);
}
$ ./server 8080 $'hello\n'
waiting for clients...
sent 6 bytes to [::ffff:127.0.0.1]:59059
sent 6 bytes to [::ffff:127.0.0.1]:47448
sent 6 bytes to [::ffff:127.0.0.1]:54949
sent 6 bytes to [::ffff:127.0.0.1]:55186
sent 6 bytes to [::1]:62256
finished.
A socket operation may fail if:
EISCONN
    A connect() operation was attempted on a socket on which a connect()
    operation had already been performed.
ETIMEDOUT
    A connection was dropped due to excessive retransmissions.
ECONNRESET
    The remote peer forced the connection to be closed (usually because the
    remote machine has lost state information about the connection due to a
    crash).
ECONNREFUSED
    The remote peer actively refused connection establishment (usually because
    no process is listening to the port).
EADDRINUSE
    A bind() operation was attempted on a socket with a network address/port
    pair that has already been bound to another socket.
EADDRNOTAVAIL
    A bind() operation was attempted on a socket with a network address for
    which no network interface exists.
EACCES
    A bind() operation was attempted with a “reserved” port number and the
    effective user ID of the process was not the privileged user.
ENOBUFS
    The system ran out of memory for internal data structures.
svcs(1),
ndd(1M),
svcadm(1M),
ioctl(2),
read(2),
write(2),
accept(3SOCKET),
bind(3SOCKET),
connect(3SOCKET),
getprotobyname(3SOCKET),
getsockopt(3SOCKET),
listen(3SOCKET),
send(3SOCKET),
smf(5),
inet(7P),
inet6(7P),
ip(7P),
ip6(7P)
K. Ramakrishnan, S. Floyd, and D. Black, The Addition of Explicit Congestion
Notification (ECN) to IP, RFC 3168, September 2001.
M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow, TCP Selective Acknowledgment
Options, RFC 2018, October 1996.
S. Bellovin, Defending Against Sequence Number Attacks, RFC 1948, May 1996.
D. Borman, B. Braden, V. Jacobson, and R. Scheffenegger, Ed., TCP Extensions
for High Performance, RFC 7323, September 2014.
Jon Postel, Transmission Control Protocol - DARPA Internet Program Protocol
Specification, RFC 793, Network Information Center, SRI International, Menlo
Park, CA., September 1981.
The tcp service is managed by the service management facility, smf(5), under
the service identifier svc:/network/initial:default. Administrative actions
on this service, such as enabling, disabling, or requesting restart, can be
performed using svcadm(1M). The service's status can be queried using the
svcs(1) command.