.\"
.\" epoll by Davide Libenzi ( efficient event notification retrieval )
.\" Copyright (C) 2003 Davide Libenzi
.\"
.\" This program is free software; you can redistribute it and/or modify
.\" it under the terms of the GNU General Public License as published by
.\" the Free Software Foundation; either version 2 of the License, or
.\" (at your option) any later version.
.\"
.\" This program is distributed in the hope that it will be useful,
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
.\" GNU General Public License for more details.
.\"
.\" You should have received a copy of the GNU General Public License
.\" along with this program; if not, write to the Free Software
.\" Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
.\"
.\" Davide Libenzi <davidel@xmailserver.org>
.\"
.\"
.TH EPOLL 4 "23 October 2002" Linux "Linux Programmer's Manual"
.SH NAME
epoll \- I/O event notification facility
.SH SYNOPSIS
.B #include <sys/epoll.h>
.SH DESCRIPTION
.B epoll
is a variant of
.BR poll (2)
that can be used as either an Edge Triggered or a Level Triggered interface
and scales well to large numbers of watched file descriptors. Three system
calls are provided to set up and control an
.B epoll
set:
.BR epoll_create (2),
.BR epoll_ctl (2),
.BR epoll_wait (2).
.PP
An
.B epoll
set is connected to a file descriptor created by
.BR epoll_create (2).
Interest in certain file descriptors is then registered via
.BR epoll_ctl (2).
Finally, the actual wait is started by
.BR epoll_wait (2).
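.PP
The three calls described above can be sketched as follows. This is a
minimal illustration under stated assumptions, not a definitive
implementation: the helper name wait_on_pipe() is hypothetical, and a
pipe merely stands in for any pollable file descriptor.
.nf
```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: watch the read side of a fresh pipe, make it
 * readable, and return what epoll_wait(2) reports (-1 on setup error).
 */
static int wait_on_pipe(void)
{
    int pfd[2], epfd, nfds = -1;
    struct epoll_event ev, out;

    if (pipe(pfd) < 0)
        return -1;
    epfd = epoll_create(16);              /* the size is only a hint */
    if (epfd < 0) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    ev.events = EPOLLIN;                  /* Level Triggered unless EPOLLET is set */
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0 &&
        write(pfd[1], "x", 1) == 1)
        nfds = epoll_wait(epfd, &out, 1, 1000);   /* wait at most one second */

    if (nfds == 1)
        assert(out.data.fd == pfd[0]);    /* the only registered fd */

    close(epfd);
    close(pfd[0]);
    close(pfd[1]);
    return nfds;
}
```
.fi
.PP
Calling wait_on_pipe() from a test program should report one ready file
descriptor, because a byte is written to the pipe before the wait starts.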
.SH NOTES
The
.B epoll
event distribution interface is able to behave both as Edge Triggered
( ET ) and Level Triggered ( LT ). The difference between the ET and LT
event distribution mechanisms can be described as follows. Suppose that
this scenario happens:
.TP
.B 1
The file descriptor that represents the read side of a pipe (
.B RFD
) is added to the
.B epoll
set.
.TP
.B 2
The pipe writer writes 2 kB of data on the write side of the pipe.
.TP
.B 3
A call to
.BR epoll_wait (2)
is done that will return
.B RFD
as a ready file descriptor.
.TP
.B 4
The pipe reader reads 1 kB of data from
.BR RFD .
.TP
.B 5
A call to
.BR epoll_wait (2)
is done.
.PP
If the
.B RFD
file descriptor has been added to the
.B epoll
interface using the
.B EPOLLET
flag, the call to
.BR epoll_wait (2)
done in step
.B 5
will probably hang even though data is still available in the file's
input buffer, while the remote peer might be expecting a response based on
the data it already sent. The reason for this is that Edge Triggered event
distribution delivers events only when a change occurs on the monitored file.
So, in step
.B 5
the caller might end up waiting for some data that is already present inside
the input buffer. In the above example, an event on
.B RFD
will be generated because of the write done in
.BR 2 ,
and the event is consumed in
.BR 3 .
Since the read operation done in
.B 4
does not consume the whole buffer data, the call to
.BR epoll_wait (2)
done in step
.B 5
might block indefinitely. An application that uses the
.B epoll
interface with the
.B EPOLLET
flag ( Edge Triggered )
should use non-blocking file descriptors to avoid having a blocking
read or write starve the task that is handling multiple file descriptors.
The suggested way to use
.B epoll
as an Edge Triggered (
.B EPOLLET
) interface is as follows (possible pitfalls are covered later):
.RS
.TP
.B i
with non-blocking file descriptors
.TP
.B ii
by going to wait for an event only after
.BR read (2)
or
.BR write (2)
return EAGAIN
.RE
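.PP
The five-step scenario above can be replayed on a real pipe. The sketch
below is only an illustration: the helper name replay_scenario() is
hypothetical, and a zero timeout is used so that step 5 reports "nothing
ready" instead of hanging.
.nf
```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Replay steps 1-5 on a pipe. extra_flags is 0 for Level Triggered or
 * EPOLLET for Edge Triggered. Returns the result of the epoll_wait(2)
 * call made in step 5, or -1 if an earlier step failed.
 */
static int replay_scenario(unsigned int extra_flags)
{
    char buf[2048] = {0};
    int pfd[2], epfd, nfds = -1;
    struct epoll_event ev, out;

    if (pipe(pfd) < 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd < 0) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0 &&  /* step 1 */
        write(pfd[1], buf, 2048) == 2048 &&                  /* step 2 */
        epoll_wait(epfd, &out, 1, 0) == 1 &&                 /* step 3 */
        read(pfd[0], buf, 1024) == 1024)                     /* step 4 */
        nfds = epoll_wait(epfd, &out, 1, 0);                 /* step 5 */

    close(epfd);
    close(pfd[0]);
    close(pfd[1]);
    return nfds;
}
```
.fi
.PP
With extra_flags set to 0 the last wait should still report the pipe as
ready, since 1 kB remains unread. With EPOLLET it should report nothing,
which is the stall that draining until EAGAIN is meant to avoid.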
.PP
On the contrary, when used as a Level Triggered interface,
.B epoll
is by all means a faster
.BR poll (2),
and can be used wherever the latter is used since it shares the
same semantics. Since even with Edge Triggered
.B epoll
multiple events can be generated upon receipt of multiple chunks of data,
the caller has the option to specify the
.B EPOLLONESHOT
flag, to tell
.B epoll
to disable the associated file descriptor after the receipt of an event with
.BR epoll_wait (2).
When the
.B EPOLLONESHOT
flag is specified, it is the caller's responsibility to rearm the file
descriptor using
.BR epoll_ctl (2)
with
.BR EPOLL_CTL_MOD .
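.PP
The disable-and-rearm cycle of EPOLLONESHOT can be observed with a short
experiment. This is a sketch under stated assumptions: oneshot_demo() is a
hypothetical helper, a pipe stands in for the monitored file, and all
waits use a zero timeout.
.nf
```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Store the results of three successive epoll_wait(2) calls in results[]:
 * after the first event, after the fd has been disabled, and after it has
 * been rearmed with EPOLL_CTL_MOD. Returns 0 on success, -1 on error.
 */
static int oneshot_demo(int results[3])
{
    int pfd[2], epfd, rc = -1;
    struct epoll_event ev, out;

    if (pipe(pfd) < 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd < 0) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0 &&
        write(pfd[1], "x", 1) == 1) {
        results[0] = epoll_wait(epfd, &out, 1, 0);  /* event delivered */
        results[1] = epoll_wait(epfd, &out, 1, 0);  /* fd is now disabled */
        rc = epoll_ctl(epfd, EPOLL_CTL_MOD, pfd[0], &ev);  /* rearm */
        results[2] = epoll_wait(epfd, &out, 1, 0);  /* byte is still unread */
    }

    close(epfd);
    close(pfd[0]);
    close(pfd[1]);
    return rc;
}
```
.fi
.PP
The byte written to the pipe is never read, so the second wait returning
nothing shows that the descriptor was disabled rather than drained.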
.SH EXAMPLE FOR SUGGESTED USAGE
While the usage of
.B epoll
when employed as a Level Triggered interface has the same
semantics as
.BR poll (2),
Edge Triggered usage requires more clarification to avoid stalls
in the application event loop. In this example, listener is a
non-blocking socket on which
.BR listen (2)
has been called. The function do_use_fd() uses the new ready
file descriptor until EAGAIN is returned by either
.BR read (2)
or
.BR write (2).
An event driven state machine application should, after having received
EAGAIN, record its current state so that at the next call to do_use_fd()
it will continue to
.BR read (2)
or
.BR write (2)
from where it stopped before.
.PP
.nf
struct epoll_event ev, *events;

for (;;) {
    nfds = epoll_wait(kdpfd, events, maxevents, -1);

    for (n = 0; n < nfds; ++n) {
        if (events[n].data.fd == listener) {
            /* New connection: accept it and watch it, Edge Triggered */
            client = accept(listener, (struct sockaddr *) &local,
                            &addrlen);
            if (client < 0) {
                perror("accept");
                continue;
            }
            setnonblocking(client);
            ev.events = EPOLLIN | EPOLLET;
            ev.data.fd = client;
            if (epoll_ctl(kdpfd, EPOLL_CTL_ADD, client, &ev) < 0) {
                fprintf(stderr, "epoll set insertion error: fd=%d\en",
                        client);
                return -1;
            }
        }
        else
            do_use_fd(events[n].data.fd);   /* use fd until EAGAIN */
    }
}
.fi
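.PP
The example above leaves setnonblocking() and do_use_fd() undefined. One
possible shape for them is sketched below. Both bodies are assumptions
made for illustration: drain_fd() is a hypothetical stand-in for the read
half of do_use_fd(), and it only counts bytes where a real application
would process them.
.nf
```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put fd into non-blocking mode; returns 0 on success, -1 on error. */
static int setnonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Read from a non-blocking fd until read(2) reports EAGAIN or end of
 * file. Returns the number of bytes consumed, or -1 on a real error.
 */
static long drain_fd(int fd)
{
    char buf[512];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));

        if (n > 0) {
            total += n;          /* a real program would process buf here */
            continue;
        }
        if (n == 0)
            break;               /* end of file */
        if (errno == EINTR)
            continue;            /* interrupted by a signal: retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;               /* drained: safe to wait for the next event */
        return -1;
    }
    return total;
}
```
.fi
.PP
Only after drain_fd() has seen EAGAIN is it safe for an Edge Triggered
caller to go back to
.BR epoll_wait (2).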
.PP
When used as an Edge Triggered interface, for performance reasons, it is
possible to add the file descriptor to the epoll interface (
.B EPOLL_CTL_ADD
) once by specifying (
.BR EPOLLIN | EPOLLOUT
). This allows you to avoid
continuously switching between
.B EPOLLIN
and
.B EPOLLOUT
by calling
.BR epoll_ctl (2)
with
.BR EPOLL_CTL_MOD .
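.PP
Registering both directions once might look like the sketch below. It is
an illustration only: both_directions_demo() is a hypothetical helper and
a UNIX domain socket pair stands in for a network connection.
.nf
```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Register one end of a socket pair for input and output, Edge Triggered,
 * and record the event masks reported before and after the peer writes.
 * Returns 0 on success, -1 on error.
 */
static int both_directions_demo(unsigned int seen[2])
{
    int sv[2], epfd, rc = -1;
    struct epoll_event ev, out;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    ev.events = EPOLLIN | EPOLLOUT | EPOLLET;  /* set once, never modified */
    ev.data.fd = sv[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev) == 0 &&
        epoll_wait(epfd, &out, 1, 0) == 1) {
        seen[0] = out.events;                  /* before the peer writes */
        if (write(sv[1], "x", 1) == 1 &&
            epoll_wait(epfd, &out, 1, 0) == 1) {
            seen[1] = out.events;              /* after the peer writes */
            rc = 0;
        }
    }

    close(epfd);
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```
.fi
.PP
The first mask should contain EPOLLOUT only, and the second should also
contain EPOLLIN, without any EPOLL_CTL_MOD call in between.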
.SH QUESTIONS AND ANSWERS (from linux-kernel)
.RS
.TP
.B Q1
What happens if you add the same fd to an epoll set twice?
.TP
.B A1
You will probably get EEXIST. However, it is possible that two
threads may add the same fd twice. This is a harmless condition.
.TP
.B Q2
Can two
.B epoll
sets wait for the same fd? If so, are events reported
to both
.B epoll
sets?
.TP
.B A2
Yes, and events are reported to both. However, it is not recommended.
.TP
.B Q3
Is the
.B epoll
fd itself poll/epoll/selectable?
.TP
.B A3
Yes.
.TP
.B Q4
What happens if the
.B epoll
fd is put into its own fd set?
.TP
.B A4
It will fail. However, you can add an
.B epoll
fd inside another epoll fd set.
.TP
.B Q5
Can I send the
.B epoll
fd over a unix-socket to another process?
.TP
.B A5
No.
.TP
.B Q6
Will the close of an fd cause it to be removed from all
.B epoll
sets automatically?
.TP
.B A6
Yes.
.TP
.B Q7
If more than one event comes in between
.BR epoll_wait (2)
calls, are they combined or reported separately?
.TP
.B A7
They will be combined.
.TP
.B Q8
Does an operation on an fd affect the already collected but not yet reported
events?
.TP
.B A8
You can do two operations on an existing fd. Remove would be meaningless for
this case. Modify will re-read available I/O.
.TP
.B Q9
Do I need to continuously read/write an fd until EAGAIN when using the
.B EPOLLET
flag ( Edge Triggered behaviour ) ?
.TP
.B A9
No, you don't. Receiving an event from
.BR epoll_wait (2)
should suggest to you that the file descriptor is ready for the requested I/O
operation. You simply have to consider it ready until you receive the
next EAGAIN. When and how you use the file descriptor is entirely up
to you. Also, the condition that the read/write I/O space is exhausted can
be detected by checking the amount of data read from or written to the target
file descriptor. For example, if you call
.BR read (2)
asking for a certain amount of data and
.BR read (2)
returns a lower number of bytes, you can be sure to have exhausted the read
I/O space for that file descriptor. The same is true when writing using the
.BR write (2)
function.
.RE
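.PP
Answers A1 and A4 above are easy to check by experiment. The helper below
is a hypothetical sketch, not part of the epoll API: it returns the errno
value from a failed
.BR epoll_ctl (2)
so that the caller can see why an insertion was refused.
.nf
```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Try to add fd to the epoll set epfd for input events.
 * Returns 0 on success, or the errno value set by epoll_ctl(2).
 */
static int try_add(int epfd, int fd)
{
    struct epoll_event ev;

    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0 ? 0 : errno;
}
```
.fi
.PP
Adding the same pipe descriptor twice should make the second try_add()
return EEXIST. Passing the same epoll fd as both arguments should fail,
while adding one epoll fd to a different epoll set should succeed.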
.SH POSSIBLE PITFALLS AND WAYS TO AVOID THEM
.RS
.TP
.B o Starvation ( Edge Triggered )
.PP
If there is a large amount of I/O space, it is possible that by trying to drain
it the other files will not get processed, causing starvation. This
is not specific to
.BR epoll .
.PP
The solution is to maintain a ready list and mark the file descriptor as ready
in its associated data structure, thereby allowing the application to
remember which files need to be processed but still round robin amongst
all the ready files. This also supports ignoring subsequent events you
receive for fds that are already ready.
.TP
.B o If using an event cache...
.PP
If you use an event cache or store all the fds returned from
.BR epoll_wait (2),
then make sure to provide a way to mark its closure dynamically (i.e., caused by
a previous event's processing). Suppose you receive 100 events from
.BR epoll_wait (2),
and in event #47 a condition causes event #13 to be closed.
If you remove the structure and close() the fd for event #13, then your
event cache might still say there are events waiting for that fd, causing
confusion.
.PP
One solution for this is to call, during the processing of event 47,
.BR epoll_ctl ( EPOLL_CTL_DEL )
to delete fd 13 and close() it, then mark its associated
data structure as removed and link it to a cleanup list. If you find another
event for fd 13 in your batch processing, you will discover the fd had been
previously removed and there will be no confusion.
.RE
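.PP
The cleanup-list approach for an event cache could be sketched as
follows. Everything here is an assumption made for illustration: struct
conn, its victim field (a stand-in for application logic that decides to
close another descriptor), and the helper names are all hypothetical.
.nf
```c
#include <assert.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-descriptor record stored in epoll_event.data.ptr. */
struct conn {
    int fd;
    int removed;               /* set once the fd has been closed */
    struct conn *victim;       /* demo only: record to close when this fires */
    struct conn *next_cleanup; /* link in the deferred cleanup list */
};

static struct conn *cleanup_list;

/* Allocate a record for fd and register it; returns NULL on failure. */
static struct conn *conn_add(int epfd, int fd)
{
    struct epoll_event ev;
    struct conn *c = calloc(1, sizeof(*c));

    if (c == NULL)
        return NULL;
    c->fd = fd;
    ev.events = EPOLLIN;
    ev.data.ptr = c;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        free(c);
        return NULL;
    }
    return c;
}

/* Delete and close the fd now, but keep the record until the batch ends. */
static void conn_remove(int epfd, struct conn *c)
{
    struct epoll_event unused;  /* kernels before 2.6.9 need a non-NULL pointer */

    if (c->removed)
        return;
    epoll_ctl(epfd, EPOLL_CTL_DEL, c->fd, &unused);
    close(c->fd);
    c->removed = 1;
    c->next_cleanup = cleanup_list;
    cleanup_list = c;
}

/* Process one batch; returns how many events were actually handled. */
static int process_batch(int epfd, struct epoll_event *events, int nfds)
{
    int n, handled = 0;

    for (n = 0; n < nfds; ++n) {
        struct conn *c = events[n].data.ptr;

        if (c->removed)
            continue;           /* closed earlier in this same batch */
        if (c->victim != NULL) {
            conn_remove(epfd, c->victim);
            c->victim = NULL;   /* do not keep a pointer to a dying record */
        }
        ++handled;
    }
    while (cleanup_list != NULL) {  /* the batch is over: safe to free */
        struct conn *c = cleanup_list;

        cleanup_list = c->next_cleanup;
        free(c);
    }
    return handled;
}
```
.fi
.PP
Because records are freed only after the whole batch has been walked, a
later event in the same batch that points at a closed descriptor finds
the removed mark instead of freed memory.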
.SH CONFORMING TO
.BR epoll (4)
is a new API introduced in Linux kernel
.BR 2.5.44 .
Its interface should be finalized in Linux kernel
.BR 2.5.66 .
.SH "SEE ALSO"
.BR epoll_create (2),
.BR epoll_ctl (2),
.BR epoll_wait (2)