How to process a request like nginx does
After having helped a handful of people with the project, I’ve decided to write this blog post about my webserv project I did 4 months ago at Codam.
Lots of example code is used to show why mimicking nginx its behavior accurately is hard, explaining stuff that was a mystery to me at the time along the way, like why you absolutely should use getaddrinfo()
.
At the bottom of the page is a nice Python description of the ugly C++ code, which can hopefully save a few people the weeks I had to spend figuring this stuff out back then. :-)
Understanding nginx
nginx its How nginx processes a request page is very technical, so let’s break it down step by step.
Setup
- Start by creating an
nginx_docker
directory, andcd
into it. - Create an
a.html
file in there containing just the lettera
, and ab.html
containing just the letterb
. - Create an
nginx.conf
file in there containing this:
server {
listen 8080;
root /code;
index a.html;
}
- Create a
Dockerfile
file in there containing this:
# Use a tiny Linux distribution
FROM alpine:3.13
# Install curl and nginx
RUN apk update && apk add curl nginx
# Fixes "nginx: [emerg] open() "/run/nginx/nginx.pid" failed
# (2: No such file or directory)"
RUN mkdir --parents /run/nginx
# nginx looks for default.conf, so link it to where our nginx.conf will be
# in our volume once we run the container
# -f stands for force, replacing the file that was there
# -s stands for symbolic link, which is like a Shortcut on Windows, or a C pointer
RUN ln -fs /code/nginx.conf /etc/nginx/conf.d/default.conf
Running
- Build and run the nginx container with
docker build --tag nginx . && docker run --rm --interactive --tty --name nginx --volume $(pwd):/code nginx
- We’re now inside of it, so run nginx as a background process with the command
nginx
, and check that it is running withps
- Run
curl localhost:8080
to request the contents ofa.html
, which is just the lettera
Explanation
I used the index
directive in the nginx.conf configuration file to let this virtual server (also called a server directive) default to returning a.html
:
server {
listen 8080;
root /code;
index a.html;
}
We can add another virtual server that listens on port 8081
by adding this in nginx.conf, so that curl localhost:8081
prints b
.
You don’t need to restart nginx or the container, but you do need to run nginx -s reload
in order to notify nginx to use the updated configuration file.
server {
listen 8081;
root /code;
index b.html;
}
If you let both virtual servers listen on port 8080, rather than having them listen on port 8080 and 8081 respectively, you’ll get this error:
nginx: [warn] conflicting server name “” on 127.0.0.1:8080, ignored
This error makes sense, as nginx is basically complaining that if a HTTP request were to come in, it wouldn’t know which virtual server it is meant to go to.
There are two ways to get rid of this error, which is what this blog post is about.
Solution 1: Using different addresses
So far we’ve only specified the port, and not the address in either of these listen
directives, since it’s optional, and listens to any address by default.
If we do explicitly give the second virtual server an address, we can have both virtual servers use the same port, where curl localhost:8080
will print a
and curl 127.0.0.2:8080
will print b
:
server {
listen 8080;
root /code;
index a.html;
}
server {
listen 127.0.0.2:8080;
root /code;
index b.html;
}
Solution 2: Using different server_names
nginx its documentation of the server directive explains that the Host
HTTP header can be used to decide which virtual server a request should go to:
Sets configuration for a virtual server. There is no clear separation between IP-based (based on the IP address) and name-based (based on the “Host” request header field) virtual servers. Instead, the listen directives describe all addresses and ports that should accept connections for the server, and the server_name directive lists all server names.
If you look at this quote with this error we mentioned earlier in mind, when you let both virtual servers listen on the same address and on the same port:
nginx: [warn] conflicting server name “” on 127.0.0.1:8080, ignored
The conflicting server name ""
part hints that both virtual servers have the default name, namely an empty string.
So if we give the second virtual server a different server_name
, we can get rid of the error:
server {
listen 8080;
root /code;
index a.html;
}
server {
listen 8080;
server_name bar;
root /code;
index b.html;
}
Okay, but what does this achieve?
Well, to be able to appreciate it, we have to look at what we’re actually sending with curl.
If we tell curl to be verbose with the -v
flag, we get a lot of useful information, so we run curl -v localhost:8080
:
/ # curl -v localhost:8080
* Trying 127.0.0.1:8080...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET / HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.79.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: nginx
< Date: Thu, 11 Apr 2024 09:21:42 GMT
< Content-Type: text/html
< Content-Length: 2
< Last-Modified: Wed, 10 Apr 2024 17:52:17 GMT
< Connection: keep-alive
< ETag: "6616d1d1-2"
< Accept-Ranges: bytes
<
a
* Connection #0 to host localhost left intact
/ #
This is a lot of text, but let’s start with some high-level observations:
- Every line starting with
*
is nginx’s log telling you what it’s currently doing - Every line starting with
>
is a HTTP header that we’re sending with curl - Every line starting with
<
is a HTTP header from the server’s response - The
a
at the bottom is the response’s body
What’s relevant right now is what curl is sending to the server, so we’ll just look at the >
lines:
- We’re connecting to address
127.0.0.1
on port8080
- We’re doing a GET request to the server’s
/
route, which is the root directory - The
Host
header has the valuelocalhost:8080
The Host
header value of localhost:8080
doesn’t match the second virtual server’s bar
server_name, nor the first virtual server’s default empty name.
So the reason the response is a
, and not b
, even though they both listen on localhost:8080
, is because nginx just defaults to the first virtual server if none of the candidate server names match the Host
value.
We can prove this by giving the Host
header a value of bar
by running curl -v localhost:8080 --header 'Host: bar'
:
/ # curl -v localhost:8080 --header 'Host: bar'
* Trying 127.0.0.1:8080...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET / HTTP/1.1
> Host: bar
> User-Agent: curl/7.79.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: nginx
< Date: Thu, 11 Apr 2024 09:22:36 GMT
< Content-Type: text/html
< Content-Length: 2
< Last-Modified: Wed, 10 Apr 2024 17:52:20 GMT
< Connection: keep-alive
< ETag: "6616d1d4-2"
< Accept-Ranges: bytes
<
b
* Connection #0 to host localhost left intact
/ #
So this time around we see > Host: bar
instead of > Host: localhost:8080
, and the response’s body being b
instead of a
.
The important takeaway here is that nginx can’t always know which virtual server a request belongs to before the Host
header has been read.
So if you’re trying to mimic nginx in your own web server, I recommend reading the entire header into a map, before deciding which Server class instance an incoming request belongs to. I personally even chose to read the entire body before deciding which Server instance any request belongs to.
My C++ implementation
If you’re starting out writing a web server in C++ right now, I highly recommend you make a small client.cpp
and server.cpp
to experiment in, to get yourself familiar with the fundamentals. This C network programming playlist by Jacob Sorber was an excellent introduction for me, and showcases how to write a simple client and server.
I also recommend creating a program for every single function you’re planning to use that you’re not 100% familiar with, like my experiments/getaddrinfo.cpp, experiments/readdir.cpp, experiments/stat.cpp, and so on.
Why getaddrinfo()
nginx accepts this config, where curl localhost:8080
prints a
and curl localhost:8080 --header 'Host: foo'
prints b
:
server {
listen localhost:8080;
root /code;
index a.html;
}
server {
listen 127.0.0.1:8080;
server_name foo;
root /code;
index b.html;
}
Mimicking this behavior requires getaddrinfo()
. This is because even though the addresses localhost
and 127.0.0.1
are different strings, they refer to the same address on your computer.
You could of course hardcode translating any occurrence of localhost
to 127.0.0.1
, but getaddrinfo()
its job is to translate it (and other aliases) for you!
So getaddrinfo()
tells you that both virtual servers have identical address:port
pairs, which allows you to only call socket()
and bind()
once.
If you were to call bind()
for both of them, you’d get the error Address already in use
, and there’d be no way for your program to know which virtual servers should and shouldn’t share a socket.
Python pseudocode
While working on my web server my main priority was accuracy, and not readability, but hopefully this Python pseudocode can help you achieve both. :-)
getaddrinfo()
From Config.cpp its _fillBindInfoToServerIndices()
here:
# Used to throw if a `server_name` is seen for a second time on the same `address:port`
names_of_bind_info = {}
for server_index, server in enumerate(servers):
bind_infos_in_server = set()
for listen in server.listens:
# getaddrinfo() returns a linked list, but my C++ code is lazy
# and just assumes the first node is the only one we need
# This seemed to work™, but is almost certainly the wrong way to do it
bind_info = getaddrinfo()
# Raise an error if a server has both `listen localhost:8080;` and `listen 127.0.0.1:8080;`
if bind_info in bind_infos_in_server:
raise DuplicateBindInfoInServer
bind_infos_in_server.add(bind_info)
# The [] operator in C++ creates the vector for us if it doesn't exist yet
bind_info_to_server_indices[bind_info].append(server_index)
for server_name in server.server_names:
# Raise an error if any server with the same bind_info already used this server_name
if server_name in names_of_bind_info[bind_info]:
raise ConflictingServerName
names_of_bind_info[bind_info].add(server_name)
socket() => bind() => listen()
From Server.cpp its constructor here:
for bind_info, server_indices in config.bind_info_to_server_indices:
socket_fd = socket()
socket_fd_to_server_indices[socket_fd] = server_indices
bind(socket_fd, bind_info)
listen(socket_fd)
# Adds socket_fd to poll() its array of fds,
# stores that this is a SERVER fd, and turns its POLLIN on
addFd(socket_fd, SERVER, POLLIN)
acceptClient()
From Server.cpp its _acceptClient()
here:
def acceptClient(socket_fd):
client_fd = accept(socket_fd)
addFd(client_fd, CLIENT, POLLIN);
clients.append(Client(client_fd, socket_fd))
getServerIndex()
From Server.cpp its _getServerIndexFromClientServerName()
here:
def getServerIndex(client):
server_indices = socket_fd_to_server_indices[client.socket_fd]
for server_index in server_indices:
server = config.servers[server_index]
for server_name in server.server_names:
if server_name == client.server_name:
return server_index
# If there was no server_name match, nginx defaults to the first server
# that has the same bind_info
return server_indices[0]
readFromClient()
From Server.cpp its _readFd()
here:
received = read(fd)
client.appendReceived(received)
if client.everythingReceived():
server_index = getServerIndex(client)
server = config.servers[server_index]
location = resolveToLocation(client.request_target, server.locations)
# We now know which server and location directive the request belongs to,
# so we have all the information we need!
if client.request_method == GET:
# TODO: Implement
elif client.request_method == POST:
# TODO: Implement
elif client.request_method == DELETE:
# TODO: Implement