Source code analysis of poll and epoll in the Linux kernel

Time:2021-7-22

The usage of poll and epoll needs no further introduction here. When the number of fds is large, epoll is more efficient than poll. Let's go through the kernel source code and see why.

Analysis of poll

The poll system call:

int poll(struct pollfd *fds, nfds_t nfds, int timeout);
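
As a quick reminder of the call from user space, here is a minimal sketch (the monitored fd, the timeout and the event set are arbitrary illustration values):

#include <poll.h>
#include <stdio.h>

/* Minimal illustration: wait up to 5 seconds for fd 0 (stdin) to become readable. */
int wait_readable(void)
{
    struct pollfd pfd;
    int n;

    pfd.fd = 0;              /* the fd we monitor */
    pfd.events = POLLIN;     /* events we care about */
    pfd.revents = 0;

    n = poll(&pfd, 1, 5000); /* nfds = 1, timeout in milliseconds */
    if (n > 0 && (pfd.revents & POLLIN))
        printf("fd 0 is readable\n");
    return n;                /* 0 on timeout, -1 on error */
}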

The kernel implementation of this system call is:

[fs/select.c -->sys_poll]
asmlinkage long sys_poll(struct pollfd __user * ufds, unsigned int nfds, long timeout)
{
    struct poll_wqueues table;
    int fdcount, err;
    unsigned int i;
    struct poll_list *head;
    struct poll_list *walk;

    /* Do a sanity check on nfds ... */
    /* The nfds given by the user may not exceed the maximum number of fds
       supported by a struct file structure (default is 256) */
    if (nfds > current->files->max_fdset && nfds > OPEN_MAX)
        return -EINVAL;

    if (timeout) {
        /* Careful about overflow in the intermediate values */
        if ((unsigned long) timeout < MAX_SCHEDULE_TIMEOUT / HZ)
            timeout = (unsigned long)(timeout*HZ+999)/1000+1;
        else /* Negative or overflow */
            timeout = MAX_SCHEDULE_TIMEOUT;
    }

    poll_initwait(&table);

Here poll_initwait is the key step. Literally, it initializes the variable table; note that table is a key variable throughout the whole execution of poll. And struct poll_table actually contains only one function pointer:

[fs/poll.h]
/*
 * structures and helpers for f_op->poll implementations
 */
typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_table_struct *);

typedef struct poll_table_struct {
    poll_queue_proc qproc;
} poll_table;

Now let's see what poll_initwait actually does:

[fs/select.c]
void __pollwait(struct file *filp, wait_queue_head_t *wait_address, poll_table *p);

void poll_initwait(struct poll_wqueues *pwq)
{
    (&pwq->pt)->qproc = __pollwait; /* this line "translated" by me for readability */
    pwq->error = 0;
    pwq->table = NULL;
}


Obviously, the main action of poll_initwait is to set the callback function of the table variable's poll_table member to __pollwait. This __pollwait is used not only by the poll system call but also by select; to put it bluntly, __pollwait is the operating system's "official" callback for this kind of asynchronous waiting. epoll, of course, does not use it; it installs its own callback to achieve efficient operation, which we will discuss later. Let's not go into the implementation of __pollwait yet and continue with sys_poll:

[fs/select.c -->sys_poll]
    head = NULL;
    walk = NULL;
    i = nfds;
    err = -ENOMEM;
    while(i!=0) {
        struct poll_list *pp;
        pp = kmalloc(sizeof(struct poll_list)+
                sizeof(struct pollfd)*
                (i>POLLFD_PER_PAGE?POLLFD_PER_PAGE:i),
                GFP_KERNEL);
        if(pp==NULL)
            goto out_fds;
        pp->next=NULL;
        pp->len = (i>POLLFD_PER_PAGE?POLLFD_PER_PAGE:i);
        if (head == NULL)
            head = pp;
        else
            walk->next = pp;
        walk = pp;
        if (copy_from_user(pp->entries, ufds + nfds-i,
                sizeof(struct pollfd)*pp->len)) {
            err = -EFAULT;
            goto out_fds;
        }
        i -= pp->len;
    }
    fdcount = do_poll(nfds, head, &table, timeout);

This code builds a linked list. Each list node is one page in size (usually 4K), addressed through a struct poll_list pointer, and the many struct pollfd entries are reached via the entries member of struct poll_list. The loop above copies the user-space struct pollfd array into these entries. Usually a user program's poll call monitors only a few fds, so the list needs just one node, i.e. one page. But when the user passes in a large number of fds, every poll system call must copy all of the struct pollfd into the kernel, so parameter copying and page allocation become a performance bottleneck of poll (the sketch below gives a feel for the sizes involved). The last statement is do_poll; we will step into it right after the sketch.
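
To get a feel for the numbers, here is a small arithmetic sketch (assuming 4 KB pages; the 8-byte pollfd size and the 16-byte poll_list header are illustrative assumptions, the exact values depend on the architecture):

#include <stdio.h>

/* Rough illustration of how many pollfd entries fit in one page-sized
 * poll_list node, and how many pages 1000 fds would need. */
int main(void)
{
    const int page_size = 4096;
    const int pollfd_size = 8;        /* assumed sizeof(struct pollfd) */
    const int poll_list_header = 16;  /* assumed next pointer + len on 64-bit */
    int per_page = (page_size - poll_list_header) / pollfd_size;  /* ~510 */

    int nfds = 1000;
    int nodes = (nfds + per_page - 1) / per_page;                 /* 2 pages */
    printf("%d pollfds per page, %d page(s) for %d fds\n",
           per_page, nodes, nfds);
    return 0;
}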

[fs/select.c-->sys_poll()-->do_poll()]
static void do_pollfd(unsigned int num, struct pollfd * fdpage,
        poll_table ** pwait, int *count)
{
    int i;

    for (i = 0; i < num; i++) {
        int fd;
        unsigned int mask;
        struct pollfd *fdp;

        mask = 0;
        fdp = fdpage+i;
        fd = fdp->fd;
        if (fd >= 0) {
            struct file * file = fget(fd);
            mask = POLLNVAL;
            if (file != NULL) {
                mask = DEFAULT_POLLMASK;
                if (file->f_op && file->f_op->poll)
                    mask = file->f_op->poll(file, *pwait);
                mask &= fdp->events | POLLERR | POLLHUP;
                fput(file);
            }
            if (mask) {
                *pwait = NULL;
                (*count)++;
            }
        }
        fdp->revents = mask;
    }
}

static int do_poll(unsigned int nfds, struct poll_list *list,
        struct poll_wqueues *wait, long timeout)
{
    int count = 0;
    poll_table* pt = &wait->pt;

    if (!timeout)
        pt = NULL;

    for (;;) {
        struct poll_list *walk;
        set_current_state(TASK_INTERRUPTIBLE);
        walk = list;
        while(walk != NULL) {
            do_pollfd( walk->len, walk->entries, &pt, &count);
            walk = walk->next;
        }
        pt = NULL;
        if (count || !timeout || signal_pending(current))
            break;
        count = wait->error;
        if (count)
            break;
        timeout = schedule_timeout(timeout);
        /* Let current sleep; other processes run; current runs again
           when woken up or when the timeout arrives */
    }
    __set_current_state(TASK_RUNNING);
    return count;
}

Note set_current_state and signal_pending: together they ensure that when the user program is blocked inside poll, a signal can quickly make it exit the poll call, which an ordinary uninterruptible sleep would not allow.
Overall, do_poll mainly loops and waits until count becomes greater than 0 before breaking out of the loop, and count is produced mainly by do_pollfd. Note this code:

while(walk != NULL) {
    do_pollfd( walk->len, walk->entries, &pt, &count);
    walk = walk->next;
}

When the user passes in a large number of fds (for example 1000), do_pollfd is called many times; this is the other reason for poll's efficiency bottleneck. What do_pollfd does is call, for each fd passed in, that fd's own poll function. Simplified, the call boils down to:

struct file* file = fget(fd);
file->f_op->poll(file, &(table->pt));

If the fd corresponds to a socket, do_pollfd calls the poll implemented by the network device driver; if the fd corresponds to an open file on an ext3 file system, do_pollfd calls the poll implemented by the ext3 driver. In short, this file->f_op->poll is implemented by the device driver. How does a device driver usually implement poll? The standard implementation is to call poll_wait, that is, to invoke the struct poll_table callback with the device's own wait queue as the argument (a device usually has its own wait queue; otherwise, a device that cannot support asynchronous operation would be rather depressing). As a representative driver, let's look at the socket code used for TCP:

[net/ipv4/tcp.c-->tcp_poll]
unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
{
    unsigned int mask;
    struct sock *sk = sock->sk;
    struct tcp_opt *tp = tcp_sk(sk);

    poll_wait(file, sk->sk_sleep, wait);

That is as much of the code as we need to see; the rest just checks the state and returns an event mask. The core of tcp_poll is poll_wait, and poll_wait simply calls the callback stored in the struct poll_table. For the poll system call that callback is __pollwait, so here you can almost read tcp_poll as one statement:

__pollwait(file, sk->sk_sleep, wait);

It can also be seen that each socket has its own wait queue (sk_sleep), so the "device wait queue" mentioned above is not unique. Now let's look at the implementation of __pollwait:

[fs/select.c-->__pollwait()]
void __pollwait(struct file *filp, wait_queue_head_t *wait_address, poll_table *_p)
{
    struct poll_wqueues *p = container_of(_p, struct poll_wqueues, pt);
    struct poll_table_page *table = p->table;

    if (!table || POLL_TABLE_FULL(table)) {
        struct poll_table_page *new_table;

        new_table = (struct poll_table_page *) __get_free_page(GFP_KERNEL);
        if (!new_table) {
            p->error = -ENOMEM;
            __set_current_state(TASK_RUNNING);
            return;
        }
        new_table->entry = new_table->entries;
        new_table->next = table;
        p->table = new_table;
        table = new_table;
    }

    /* Add a new entry */
    {
        struct poll_table_entry * entry = table->entry;
        table->entry = entry+1;
        get_file(filp);
        entry->filp = filp;
        entry->wait_address = wait_address;
        init_waitqueue_entry(&entry->wait, current);
        add_wait_queue(wait_address,&entry->wait);
    }
}

[Figure: the data structures created by __pollwait]

What __pollwait does is build the data structure shown in the figure above (one call to __pollwait inside a device's poll creates exactly one poll_table_entry) and, through the wait member of struct poll_table_entry, hang current on the device's wait queue.
The wait queue here is wait_address, which in tcp_poll corresponds to sk->sk_sleep. Now we can review how the poll system call works: first register the callback function __pollwait and initialize the table variable (of type struct poll_wqueues), then copy in the struct pollfd array (mainly the fds), and call each fd's poll in turn (hanging current on the device wait queue corresponding to each fd). When a device receives a message (network device) or has filled in file data (disk device), it wakes up the processes on its wait queue, and so current wakes up. What current does after waking up and leaving sys_poll is relatively simple, so we won't analyze it line by line.
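
As a compact recap, the flow of a single poll() call, reconstructed from the code above, is roughly (a pseudo-code outline, not literal kernel code):

/* One poll() call, schematically:
 *
 *   sys_poll(ufds, nfds, timeout)
 *     poll_initwait(&table)          -- table.pt.qproc = __pollwait
 *     copy_from_user(...)            -- all nfds pollfds copied in, on every call
 *     do_poll()
 *       for each fd:
 *         file->f_op->poll(file, &table.pt)
 *           -> poll_wait() -> __pollwait()
 *                -> add current to that device's wait queue
 *       schedule_timeout()           -- sleep until a device wakes us or timeout
 *     on wakeup: gather revents, tear down the wait-queue entries and copy the
 *     results back to user space (code not shown above)
 */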

epoll

The above analysis has exposed poll's two efficiency bottlenecks; now the question is how to improve on them. First, copying, say, 1000 fds into the kernel on every poll call is unreasonable — why doesn't the kernel simply remember the fds it has already been given? That is exactly what epoll does: save the fds you have copied in. Its API already shows this — fds are not passed in at epoll_wait time; instead, epoll_ctl hands all fds to the kernel in advance and then you "wait" on them collectively, which avoids the unnecessary repeated copies. Second, at epoll_wait time current is not hung on every fd's device wait queue one by one; instead, a callback is invoked when a device wait queue is woken up (this of course requires a "wake-up callback" mechanism), which gathers the fds that produced events onto a linked list, and then the fds on that list are returned.
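
In user-space terms the difference looks roughly like this (a minimal sketch; listen_fd is assumed to be an already-created listening socket, and the buffer size and timeout are arbitrary):

#include <sys/epoll.h>
#include <unistd.h>

/* Register listen_fd once via epoll_ctl, then wait without re-passing fds. */
int run_epoll_once(int listen_fd)
{
    struct epoll_event ev, events[64];
    int i, n;
    int epfd = epoll_create(256);            /* size is only a hint on modern kernels */
    if (epfd < 0)
        return -1;

    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    /* the fd is copied into the kernel once, here, not on every wait */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev) < 0) {
        close(epfd);
        return -1;
    }

    /* only the ready fds are returned; nothing is copied in */
    n = epoll_wait(epfd, events, 64, 1000);
    for (i = 0; i < n; i++) {
        /* handle events[i].data.fd here */
    }
    close(epfd);
    return n;
}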
Analysis of epoll
epoll is a module, so let's first look at the module's entry point, eventpoll_init:

[fs/eventpoll.c-->eventpoll_init()]
static int __init eventpoll_init(void)
{
    int error;

    init_MUTEX(&epsem);

    /* Initialize the structure used to perform safe poll wait head wake ups */
    ep_poll_safewake_init(&psw);

    /* Allocates slab cache used to allocate "struct epitem" items */
    epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),
            0, SLAB_HWCACHE_ALIGN|EPI_SLAB_DEBUG|SLAB_PANIC, NULL, NULL);

    /* Allocates slab cache used to allocate "struct eppoll_entry" */
    pwq_cache = kmem_cache_create("eventpoll_pwq",
            sizeof(struct eppoll_entry), 0,
            EPI_SLAB_DEBUG|SLAB_PANIC, NULL, NULL);

    /*
     * Register the virtual file system that will be the source of inodes
     * for the eventpoll files
     */
    error = register_filesystem(&eventpoll_fs_type);
    if (error)
        goto epanic;

    /* Mount the above commented virtual file system */
    eventpoll_mnt = kern_mount(&eventpoll_fs_type);
    error = PTR_ERR(eventpoll_mnt);
    if (IS_ERR(eventpoll_mnt))
        goto epanic;

    DNPRINTK(3, (KERN_INFO "[%p] eventpoll: successfully initialized.\n",
            current));
    return 0;

epanic:
    panic("eventpoll_init() failed\n");
}

Interestingly, during initialization this module registers a new file system called "eventpollfs" (via the eventpoll_fs_type structure) and then mounts it. It also creates two kernel caches (in kernel programming, when small objects must be allocated frequently you create a kmem_cache as a "memory pool"), used to hold struct epitem and struct eppoll_entry respectively. If you want to develop a new file system someday, this code is a good reference. Now think about it: why does epoll_create return a new fd? Because it creates a new file in this "eventpollfs" file system! As follows:

[fs/eventpoll.c-->sys_epoll_create()]
asmlinkage long sys_epoll_create(int size)
{
    int error, fd;
    struct inode *inode;
    struct file *file;

    DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_create(%d)\n",
            current, size));

    /* Sanity check on the size parameter */
    error = -EINVAL;
    if (size <= 0)
        goto eexit_1;

    /*
     * Creates all the items needed to setup an eventpoll file. That is,
     * a file structure, and inode and a free file descriptor.
     */
    error = ep_getfd(&fd, &inode, &file);
    if (error)
        goto eexit_1;

    /* Setup the file internal data structure ( "struct eventpoll" ) */
    error = ep_file_init(file);
    if (error)
        goto eexit_2;

The function is simple. ep_getfd looks like a "get", but on this first call to epoll_create it actually creates a new inode, a new file, and a new fd. ep_file_init then creates a struct eventpoll and stores it in file->private_data; note this private_data, it will be used later. At this point some may ask: why didn't the epoll developers keep some kernel-wide super map of the epoll handles that users create, and have epoll_create return a pointer? That seems more intuitive. But look closely: how many Linux system calls return pointers? You'll find almost none. (Note that malloc is not a system call; brk, which malloc calls, is.) Linux, as the most outstanding successor of Unix, follows one of Unix's great strengths: everything is a file. Input and output are files, a socket is a file, everything is a file, which means programs using this operating system can be very simple, because everything is just a file operation. (Unix doesn't carry this all the way; Plan 9 does.) And using a file system here has another advantage: epoll_create returns an fd rather than a damned pointer. If a pointer is wrong there is no way to tell, whereas an fd can be checked for validity against current->files->fd_array[]. With epoll_create done, it's epoll_ctl's turn; we omit the less important code:

[fs/eventpoll.c-->sys_epoll_ctl()]
asmlinkage long sys_epoll_ctl(int epfd, int op, int fd,
        struct epoll_event __user *event)
{
    int error;
    struct file *file, *tfile;
    struct eventpoll *ep;
    struct epitem *epi;
    struct epoll_event epds;
    ....
    epi = ep_find(ep, tfile, fd);

    error = -EINVAL;
    switch (op) {
    case EPOLL_CTL_ADD:
        if (!epi) {
            epds.events |= POLLERR | POLLHUP;
            error = ep_insert(ep, &epds, tfile, fd);
        } else
            error = -EEXIST;
        break;
    case EPOLL_CTL_DEL:
        if (epi)
            error = ep_remove(ep, epi);
        else
            error = -ENOENT;
        break;
    case EPOLL_CTL_MOD:
        if (epi) {
            epds.events |= POLLERR | POLLHUP;
            error = ep_modify(ep, epi, &epds);
        } else
            error = -ENOENT;
        break;
    }

So the first step is ep_find, which searches in a large structure (never mind what it is for now). If a struct epitem is found and the user operation is ADD, -EEXIST is returned; if it is DEL, ep_remove is called. If no struct epitem is found and the operation is ADD, ep_insert creates and inserts one. Very straightforward. What is this "large structure"? From the way ep_find is called, the ep argument must be a pointer to it, and from ep = file->private_data we understand that this large structure is exactly the struct eventpoll created by epoll_create. Looking at the ep_find implementation in detail, we find that the search happens on the rbr member of struct eventpoll (of type struct rb_root) — this is the root of a red-black tree! And the nodes hanging on this tree are all struct epitem. Now it is clear: a newly created epoll file has a struct eventpoll, on which hangs a red-black tree, and this red-black tree is where all the fds passed in via epoll_ctl are stored! The sketch below illustrates the tree's lookup key; with the data structure clear, we then look at the core, sys_epoll_wait.
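
The lookup key of this red-black tree is the {struct file *, fd} pair stored by EP_SET_FFD. Conceptually the ordering used when walking the tree is something like the following sketch (not the literal kernel macro; the names are illustrative):

/* Conceptual ordering of epoll items on the red-black tree: compare the
 * file pointer first, then the fd.  This mirrors what EP_SET_FFD stores;
 * the exact kernel code may differ in form. */
struct file;

struct epoll_filefd_sketch {
    struct file *file;
    int fd;
};

static int ffd_cmp(const struct epoll_filefd_sketch *a,
                   const struct epoll_filefd_sketch *b)
{
    if (a->file != b->file)
        return a->file > b->file ? 1 : -1;
    return a->fd - b->fd;
}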

[fs/eventpoll.c-->sys_epoll_wait()]
asmlinkage long sys_epoll_wait(int epfd, struct epoll_event __user *events,
        int maxevents, int timeout)
{
    int error;
    struct file *file;
    struct eventpoll *ep;

    DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_wait(%d, %p, %d, %d)\n",
            current, epfd, events, maxevents, timeout));

    /* The maximum number of event must be greater than zero */
    if (maxevents <= 0)
        return -EINVAL;

    /* Verify that the area passed by the user is writeable */
    if ((error = verify_area(VERIFY_WRITE, events,
            maxevents * sizeof(struct epoll_event))))
        goto eexit_1;

    /* Get the "struct file *" for the eventpoll file */
    error = -EBADF;
    file = fget(epfd);
    if (!file)
        goto eexit_1;

    /*
     * We have to check that the file structure underneath the fd
     * the user passed to us _is_ an eventpoll file.
     */
    error = -EINVAL;
    if (!IS_FILE_EPOLL(file))
        goto eexit_2;

    /*
     * At this point it is safe to assume that the "private_data" contains
     * our own data structure.
     */
    ep = file->private_data;

    /* Time to fish for events ... */
    error = ep_poll(ep, events, maxevents, timeout);

eexit_2:
    fput(file);
eexit_1:
    DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_wait(%d, %p, %d, %d) =%d\n",
            current, epfd, events, maxevents, timeout, error));

    return error;
}

The same trick again: fetch the struct eventpoll from file->private_data, then call ep_poll:

[fs/eventpoll.c-->sys_epoll_wait()->ep_poll()]
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
        int maxevents, long timeout)
{
    int res, eavail;
    unsigned long flags;
    long jtimeout;
    wait_queue_t wait;

    /*
     * Calculate the timeout by checking for the "infinite" value ( -1 )
     * and the overflow condition. The passed timeout is in milliseconds,
     * that why (t * HZ) / 1000.
     */
    jtimeout = timeout == -1 || timeout > (MAX_SCHEDULE_TIMEOUT - 1000) / HZ ?
        MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;

retry:
    write_lock_irqsave(&ep->lock, flags);

    res = 0;
    if (list_empty(&ep->rdllist)) {
        /*
         * We don't have any available event to return to the caller.
         * We need to sleep here, and we will be wake up by
         * ep_poll_callback() when events will become available.
         */
        init_waitqueue_entry(&wait, current);
        add_wait_queue(&ep->wq, &wait);

        for (;;) {
            /*
             * We don't want to sleep if the ep_poll_callback() sends us
             * a wakeup in between. That's why we set the task state
             * to TASK_INTERRUPTIBLE before doing the checks.
             */
            set_current_state(TASK_INTERRUPTIBLE);
            if (!list_empty(&ep->rdllist) || !jtimeout)
                break;
            if (signal_pending(current)) {
                res = -EINTR;
                break;
            }

            write_unlock_irqrestore(&ep->lock, flags);
            jtimeout = schedule_timeout(jtimeout);
            write_lock_irqsave(&ep->lock, flags);
        }
        remove_wait_queue(&ep->wq, &wait);

        set_current_state(TASK_RUNNING);
    }

Another big loop, but this one is better than poll's: look at it carefully and you'll see it does nothing except sleep and check whether ep->rdllist is empty! Of course, doing nothing is efficient — but then who makes ep->rdllist non-empty? The answer is the callback function set up by ep_insert:

[fs/eventpoll.c-->sys_epoll_ctl()-->ep_insert()]
static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
        struct file *tfile, int fd)
{
    int error, revents, pwake = 0;
    unsigned long flags;
    struct epitem *epi;
    struct ep_pqueue epq;

    error = -ENOMEM;
    if (!(epi = EPI_MEM_ALLOC()))
        goto eexit_1;

    /* Item initialization follow here ... */
    EP_RB_INITNODE(&epi->rbn);
    INIT_LIST_HEAD(&epi->rdllink);
    INIT_LIST_HEAD(&epi->fllink);
    INIT_LIST_HEAD(&epi->txlink);
    INIT_LIST_HEAD(&epi->pwqlist);
    epi->ep = ep;
    EP_SET_FFD(&epi->ffd, tfile, fd);
    epi->event = *event;
    atomic_set(&epi->usecnt, 1);
    epi->nwait = 0;

    /* Initialize the poll table using the queue callback */
    epq.epi = epi;
    init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);

    /*
     * Attach the item to the poll hooks and get current event bits.
     * We can safely use the file* here because its usage count has
     * been increased by the caller of this function.
     */
    revents = tfile->f_op->poll(tfile, &epq.pt);

Note the line init_poll_funcptr(&epq.pt, ep_ptable_queue_proc); it effectively does (&epq.pt)->qproc = ep_ptable_queue_proc. Then tfile->f_op->poll(tfile, &epq.pt) calls the poll method of the monitored file (called the "target file" in epoll), and that poll in turn calls poll_wait (remember poll_wait? every device driver that supports poll calls it), which here invokes ep_ptable_queue_proc. This call chain is hard to follow because it is not a direct call at the language level. ep_insert also links the struct epitem into the f_ep_links list of the struct file for easy lookup; the fllink member of struct epitem carries that duty.
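
Spelled out, the indirect call chain at registration time is roughly (reconstructed from the code above; tcp_poll stands in for whatever ->poll the target file implements):

ep_insert()
  -> tfile->f_op->poll(tfile, &epq.pt)            /* e.g. tcp_poll() for a TCP socket */
       -> poll_wait(file, sk->sk_sleep, &epq.pt)
            -> (&epq.pt)->qproc(...)              /* == ep_ptable_queue_proc,
                                                     set by init_poll_funcptr above */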

[fs/eventpoll.c-->ep_ptable_queue_proc()]
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
        poll_table *pt)
{
    struct epitem *epi = EP_ITEM_FROM_EPQUEUE(pt);
    struct eppoll_entry *pwq;

    if (epi->nwait >= 0 && (pwq = PWQ_MEM_ALLOC())) {
        init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
        pwq->whead = whead;
        pwq->base = epi;
        add_wait_queue(whead, &pwq->wait);
        list_add_tail(&pwq->llink, &epi->pwqlist);
        epi->nwait++;
    } else {
        /* We have to signal that an error occurred */
        epi->nwait = -1;
    }
}

The code above is the most important thing ep_insert does: it creates a struct eppoll_entry, sets its wake-up callback to ep_poll_callback, and adds it to the device wait queue (note that whead here is exactly the per-device-driver wait queue mentioned in the previous section). Only this way, when the device becomes ready and wakes up the waiters on its wait queue, does ep_poll_callback get called. Every poll system call must hang current on the wait queues of all devices corresponding to the fds; you can imagine how cumbersome this "hanging" is when there are more than 1000 fds. epoll_wait is not so long-winded: epoll hangs things only once, inside epoll_ctl (that first time is unavoidable), and gives each fd the order "call the callback when you're ready". If a device has an event, its callback puts the fd onto rdllist, and each epoll_wait merely collects the fds already in rdllist — epoll cleverly uses callback functions to implement a more efficient event-driven model. Now we can guess what ep_poll_callback does: it must take the epitem (representing an fd) on the red-black tree that received the event and insert it into ep->rdllist, so that when epoll_wait returns, rdllist is already full of ready fds!
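
Putting the pieces together, the wake-up path for one ready fd is roughly (reconstructed from the code already shown):

data arrives / device interrupt
  -> wake_up(whead)                               /* the device's own wait queue */
       -> ep_poll_callback(&pwq->wait, ...)       /* set by ep_ptable_queue_proc */
            -> list_add_tail(&epi->rdllink, &ep->rdllist)
            -> wake_up(&ep->wq)                   /* wakes the task sleeping in ep_poll */
                 -> epoll_wait returns the fds on rdllist

The code of ep_poll_callback below confirms this: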

[fs/eventpoll.c-->ep_poll_callback()]
static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
    int pwake = 0;
    unsigned long flags;
    struct epitem *epi = EP_ITEM_FROM_WAIT(wait);
    struct eventpoll *ep = epi->ep;

    DNPRINTK(3, (KERN_INFO "[%p] eventpoll: poll_callback(%p) epi=%p ep=%p\n",
            current, epi->file, epi, ep));

    write_lock_irqsave(&ep->lock, flags);

    /*
     * If the event mask does not contain any poll(2) event, we consider the
     * descriptor to be disabled. This condition is likely the effect of the
     * EPOLLONESHOT bit that disables the descriptor when an event is received,
     * until the next EPOLL_CTL_MOD will be issued.
     */
    if (!(epi->event.events & ~EP_PRIVATE_BITS))
        goto is_disabled;

    /* If this file is already in the ready list we exit soon */
    if (EP_IS_LINKED(&epi->rdllink))
        goto is_linked;

    list_add_tail(&epi->rdllink, &ep->rdllist);

is_linked:
    /*
     * Wake up ( if active ) both the eventpoll wait list and the ->poll()
     * wait list.
     */
    if (waitqueue_active(&ep->wq))
        wake_up(&ep->wq);
    if (waitqueue_active(&ep->poll_wait))
        pwake++;

is_disabled:
    write_unlock_irqrestore(&ep->lock, flags);

    /* We have to call this outside the lock */
    if (pwake)
        ep_poll_safewake(&psw, &ep->poll_wait);

    return 1;
}

The truly important line is list_add_tail(&epi->rdllink, &ep->rdllist); it puts the struct epitem onto the rdllist of the struct eventpoll. Now we can draw epoll's core data structure diagram.

[Figure: core epoll data structures]

EPOLLET is unique to epoll

EPOLLET is a flag unique to the epoll system calls; ET means edge-triggered. Its exact meaning and usage can be found on Google. With EPOLLET, repeated events do not keep disturbing the program's logic, so it is often used. What is the principle behind EPOLLET? As we saw, epoll attaches a callback to each fd; when the device behind an fd has a message, the callback puts the fd onto the rdllist, and epoll_wait only has to check rdllist to know which fds have events. Let's look at the last few lines of ep_poll:

[fs/eventpoll.c->ep_poll()]
    /*
     * Try to transfer events to user space. In case we get 0 events and
     * there's still timeout left over, we go trying again in search of
     * more luck.
     */
    if (!res && eavail &&
        !(res = ep_events_transfer(ep, events, maxevents)) && jtimeout)
        goto retry;

    return res;
}

Copying the fds on rdllist to user space is the job of ep_events_transfer:

[fs/eventpoll.c->ep_events_transfer()]
static int ep_events_transfer(struct eventpoll *ep,
        struct epoll_event __user *events, int maxevents)
{
    int eventcnt = 0;
    struct list_head txlist;

    INIT_LIST_HEAD(&txlist);

    /*
     * We need to lock this because we could be hit by
     * eventpoll_release_file() and epoll_ctl(EPOLL_CTL_DEL).
     */
    down_read(&ep->sem);

    /* Collect/extract ready items */
    if (ep_collect_ready_items(ep, &txlist, maxevents) > 0) {
        /* Build result set in userspace */
        eventcnt = ep_send_events(ep, &txlist, events);

        /* Reinject ready items into the ready list */
        ep_reinject_items(ep, &txlist);
    }

    up_read(&ep->sem);

    return eventcnt;
}

Very little code: ep_collect_ready_items moves the fds on rdllist to txlist (leaving rdllist empty), then ep_send_events copies the fds on txlist to user space, and finally ep_reinject_items "returns" a portion of the fds from txlist back to rdllist, so that they can be found on rdllist next time. The implementation of ep_send_events is as follows:

[fs/eventpoll.c->ep_send_events()]
static int ep_send_events(struct eventpoll *ep, struct list_head *txlist,
        struct epoll_event __user *events)
{
    int eventcnt = 0;
    unsigned int revents;
    struct list_head *lnk;
    struct epitem *epi;

    /*
     * We can loop without lock because this is a task private list.
     * The test done during the collection loop will guarantee us that
     * another task will not try to collect this file. Also, items
     * cannot vanish during the loop because we are holding "sem".
     */
    list_for_each(lnk, txlist) {
        epi = list_entry(lnk, struct epitem, txlink);

        /*
         * Get the ready file event set. We can safely use the file
         * because we are holding the "sem" in read and this will
         * guarantee that both the file and the item will not vanish.
         */
        revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL);

        /*
         * Set the return event set for the current file descriptor.
         * Note that only the task task was successfully able to link
         * the item to its "txlist" will write this field.
         */
        epi->revents = revents & epi->event.events;

        if (epi->revents) {
            if (__put_user(epi->revents, &events[eventcnt].events) ||
                __put_user(epi->event.data, &events[eventcnt].data))
                return -EFAULT;
            if (epi->event.events & EPOLLONESHOT)
                epi->event.events &= EP_PRIVATE_BITS;
            eventcnt++;
        }
    }
    return eventcnt;
}

There is not much to see in this copying code, but note the line revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL); this poll call is cunning: it passes NULL as the second argument. Let's look at how a device driver typically implements poll:

static unsigned int scull_p_poll(struct file *filp, poll_table *wait)
{
    struct scull_pipe *dev = filp->private_data;
    unsigned int mask = 0;

    /*
     * The buffer is circular; it is considered full
     * if "wp" is right behind "rp" and empty if the
     * two are equal.
     */
    down(&dev->sem);
    poll_wait(filp, &dev->inq, wait);
    poll_wait(filp, &dev->outq, wait);
    if (dev->rp != dev->wp)
        mask |= POLLIN | POLLRDNORM;    /* readable */
    if (spacefree(dev))
        mask |= POLLOUT | POLLWRNORM;   /* writable */
    up(&dev->sem);
    return mask;
}

The code above is taken from "Linux Device Drivers (Third Edition)" and is absolutely classic: the device first hangs current on its inq and outq queues (the "hanging" is done by the callback function pointer carried in wait), then waits to be woken up; once woken, it computes the event mask and returns it (note the mask, which carries the events). But if wait is NULL, what does poll_wait do?

[include/linux/poll.h->poll_wait]
static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address,
        poll_table *p)
{
    if (p && wait_address)
        p->qproc(filp, wait_address, p);
}

If the poll_table is NULL, it does nothing. Back in ep_send_events, we do not want to sleep, we only want the current event mask — so poll is called with a NULL poll_table, and the resulting mask is then copied to user space. After ep_send_events finishes, it is ep_reinject_items' turn:

[fs/eventpoll.c->ep_reinject_items]
static void ep_reinject_items(struct eventpoll *ep, struct list_head *txlist)
{
    int ricnt = 0, pwake = 0;
    unsigned long flags;
    struct epitem *epi;

    write_lock_irqsave(&ep->lock, flags);

    while (!list_empty(txlist)) {
        epi = list_entry(txlist->next, struct epitem, txlink);

        /* Unlink the current item from the transfer list */
        EP_LIST_DEL(&epi->txlink);

        /*
         * If the item is no more linked to the interest set, we don't
         * have to push it inside the ready list because the following
         * ep_release_epitem() is going to drop it. Also, if the current
         * item is set to have an Edge Triggered behaviour, we don't have
         * to push it back either.
         */
        if (EP_RB_LINKED(&epi->rbn) && !(epi->event.events & EPOLLET) &&
            (epi->revents & epi->event.events) &&
            !EP_IS_LINKED(&epi->rdllink)) {
            list_add_tail(&epi->rdllink, &ep->rdllist);
            ricnt++;
        }
    }

    if (ricnt) {
        /*
         * Wake up ( if active ) both the eventpoll wait list and the ->poll()
         * wait list.
         */
        if (waitqueue_active(&ep->wq))
            wake_up(&ep->wq);
        if (waitqueue_active(&ep->poll_wait))
            pwake++;
    }

    write_unlock_irqrestore(&ep->lock, flags);

    /* We have to call this outside the lock */
    if (pwake)
        ep_poll_safewake(&psw, &ep->poll_wait);
}

ep_reinject_items puts a portion of the fds on txlist back onto rdllist. Which portion? Look at the condition: if (EP_RB_LINKED(&epi->rbn) && !(epi->event.events & EPOLLET) && (epi->revents & epi->event.events) && !EP_IS_LINKED(&epi->rdllink)). It puts back exactly those fds that are not marked EPOLLET (the !(epi->event.events & EPOLLET) part) and still have events the user cares about (the (epi->revents & epi->event.events) part). So next time, epoll_wait will of course copy those fds from rdllist to the user again. For instance: suppose a socket has only been connected and has neither sent nor received data; then its poll event mask is always POLLOUT (see the driver example above). Every epoll_wait will keep returning the POLLOUT event, because the fd keeps being put back onto rdllist. If someone now writes a large amount of data to that socket so that it blocks (cannot be written to), then the (epi->revents & epi->event.events) && !EP_IS_LINKED(&epi->rdllink) part no longer holds (there is no POLLOUT anymore), the fd is not put back onto rdllist, and epoll_wait stops returning POLLOUT to the user. Now add EPOLLET to this socket, connect it, and send or receive nothing: this time the !(epi->event.events & EPOLLET) part fails, so epoll_wait returns the POLLOUT notification to the user only once (because the fd never goes back to rdllist), and subsequent epoll_wait calls report no event for it at all.
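
This is why, in user space, edge-triggered fds are normally made non-blocking and drained until EAGAIN: a new notification only arrives on the next edge. A minimal sketch (the buffer size is arbitrary, and fd is assumed to be a non-blocking socket already returned by epoll_wait with EPOLLET set):

#include <errno.h>
#include <unistd.h>

/* With EPOLLET a readable notification is delivered once per edge, so we
 * must read until the fd is drained (EAGAIN) before waiting again. */
static int drain_fd(int fd)
{
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0)
            continue;                        /* process buf[0..n) here */
        if (n == 0)
            return 0;                        /* peer closed */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return 1;                        /* fully drained, wait for next edge */
        return -1;                           /* real error */
    }
}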
