HTTP Introduction and Debugging
"Hyper Text Transport Protocol" - HTTP is the single most important technology that drives the web and yet remains virtually transparent. Without this protocol HTML and XML via the web would not be able to perform the myriad of tasks that we put them to daily.
This article aims to cover the key concepts of HTTP, the tools needed for debugging and where to find the relevant Internet standards for more detail. Throughout later sections full graphical examples are given illustrating each concept with live web sites.
2 Intended Audience
Why bother learning about HTTP when, for the most part, developers manage to produce web sites without ever dealing with HTTP directly?
The simple answer is that every form that's posted and every cookie that you rely on is sent over HTTP. Wouldn't it be nice to see exactly what is being sent to and from the web server, and know for certain what is working and what is broken?
Here are situations where knowing how HTTP works will save you time and gain you credibility with the client.
- For developers this will provide an invaluable aid to diagnosing and testing your code, as you'll be able to construct your own HTTP messages with nothing more than notepad and telnet.
- Security audits become simpler for non-secure sites as you'll be able to determine where the weaknesses are.
- Troubleshooting problems such as "The site is not responding" when you know the server is up and responding to pings.
3 What is HTTP
HTTP is a protocol run over TCP with all the features necessary for interacting with web servers that hold text and binary resources.
TCP guarantees that packets travelling to and from the web server arrive error-free and in the right order. It does not, however, guarantee delivery regardless of network conditions: when the network is congested or unavailable, web page delivery is slow and can time out.
3.1 Asynchronous Protocol
Asynchronous means "not at the same time". This is the basis of the request-response architecture of HTTP.
A request is issued and the response returns some time later. The web browser does not actively wait for the response; instead it leaves the connection to the server open until the response arrives or a timeout occurs.
It is important to note that request-response is always initiated by the client web browser; a server cannot send unsolicited responses. There are, however, web-push technologies that work around this, but they are not discussed here.
A web browser is expected to keep no more than two outstanding requests open to the same server concurrently. This recommendation comes from the HTTP/1.1 specification (RFC 2616) and is designed to prevent any single user from overloading the server.
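The effect of that limit can be pictured with a short Python sketch. This is purely illustrative (it is not how a browser is implemented): a semaphore guarantees that no more than two simulated requests are ever in flight at once.

```python
import threading
import time

class ConnectionLimiter:
    """Illustrative sketch: cap simultaneous requests to one server at two."""

    def __init__(self, limit=2):
        self._slots = threading.Semaphore(limit)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest concurrency observed

    def fetch(self, path):
        with self._slots:  # blocks while two requests are outstanding
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            time.sleep(0.01)  # stand-in for the network round trip
            with self._lock:
                self._active -= 1

limiter = ConnectionLimiter()
threads = [threading.Thread(target=limiter.fetch, args=("/img%d.gif" % i,))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

However many requests are queued, `limiter.peak` never exceeds 2, which is exactly the behaviour the specification asks of browsers.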
3.2 The Request
The most frequently used requests are GET and POST. A GET fetches a resource from the web server using a path rather than a fully qualified URL, since the server is implicitly the second party in the conversation.
GET /index.html HTTP/1.1
Host: www.example.com

Servers that host several sites ("virtual hosting") need to know which site is intended, so the host name is carried in the message header with the "Host" directive. HTTP/1.1 makes this header mandatory; under HTTP/1.0 the first line alone would suffice.
That's it! That is all that's required to fetch a resource from the web server. The above example, followed by a blank line to end the headers, can be typed by hand into telnet (port 80) and the web page will be returned.
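The same message can be assembled programmatically. Here is a minimal Python sketch (the host name is a placeholder) that builds the raw request; the resulting bytes could be written to a TCP socket on port 80, exactly as with telnet:

```python
def build_get(path, host):
    """Assemble a raw HTTP/1.1 GET request.

    HTTP header lines end in CRLF, and a blank line terminates the headers.
    """
    return (
        "GET %s HTTP/1.1\r\n" % path
        + "Host: %s\r\n" % host
        + "Connection: close\r\n"
        + "\r\n"
    )

request = build_get("/index.html", "www.example.com")
```

Sending `request.encode("ascii")` over a socket opened with `socket.create_connection((host, 80))` would fetch the page just as telnet does.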
When writing interactive web pages we need to pass details to the web server using URL encoding. The example below illustrates this with AccountNo=276 and Amount=50 accessing the balance page:

GET /balance.html?AccountNo=276&Amount=50 HTTP/1.1
You'll notice the URL now has a ? to indicate where the page path ends and the variables begin. Each variable in turn is delimited by an & character. Special characters such as space and ampersand that must be part of a variable name, or its value, are escaped using the % sign followed by their two-digit hexadecimal ASCII code (a space becomes %20, for example).
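Python's standard library performs this escaping for us. A quick sketch using `urllib.parse` (the field names mirror the balance example above, and "Smith & Son" is an invented value chosen to show the escaping):

```python
from urllib.parse import urlencode, quote

# Build the query string for the balance example.
query = urlencode({"AccountNo": "276", "Amount": "50"})
url = "/balance.html?" + query          # variables follow the '?'

# Special characters become '%' plus their hex ASCII code.
escaped = quote("Smith & Son")           # space -> %20, ampersand -> %26
```

`urlencode` joins the variables with & and escapes each name and value; `quote` escapes a single string on its own.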
So why have POST at all when we can pass information with GET? The answer is twofold. First, there is a practical limit to the length of a GET request and therefore to the amount of information you can pass to the server. Second, a GET is visible in the web browser's address bar, making it readable by anyone watching, and it can be bookmarked, parameters and all.
A POST looks very similar to a GET, except the variables (everything right of the ?) move into the message body:

POST /balance.html HTTP/1.1
Host: www.example.com
Content-Type: application/x-www-form-urlencoded
Content-Length: 23

AccountNo=276&Amount=50
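Building a POST by hand means getting Content-Length right, since it must equal the byte length of the body. A hedged Python sketch (host name and field values are placeholders from the example above):

```python
from urllib.parse import urlencode

def build_post(path, host, fields):
    """Assemble a raw HTTP/1.1 POST with a URL-encoded body.

    Content-Length must equal the byte length of the body.
    """
    body = urlencode(fields)
    return (
        "POST %s HTTP/1.1\r\n" % path
        + "Host: %s\r\n" % host
        + "Content-Type: application/x-www-form-urlencoded\r\n"
        + "Content-Length: %d\r\n" % len(body.encode("ascii"))
        + "\r\n"
        + body
    )

request = build_post("/balance.html", "www.example.com",
                     {"AccountNo": "276", "Amount": "50"})
```

Note that the variables no longer appear in the request line; they sit in the body after the blank line that ends the headers.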
Now that we have the basics of a request, we'll move on to the response.
3.3 The Response
The response consists of a header and body, just like the request. The header starts with a status line carrying a status code, and the body carries the resource.
The response codes fall into several classes, keyed on the first digit:
- 1xx Informational
- 2xx Success
- 3xx Redirection
- 4xx Client Error
- 5xx Server Error
The most common are "200 OK" and "404 Not Found", with redirections taking "301 Moved Permanently" or "302 Found".
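Because the class is simply the first digit, a tiny Python helper makes the grouping explicit:

```python
def status_class(code):
    """Map an HTTP status code to its class, keyed on the first digit."""
    classes = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }
    return classes[code // 100]
```

So `status_class(404)` yields "Client Error" and `status_class(302)` yields "Redirection".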
The response can also set other header directives that we will see next.
3.4 Cookies for State Management
Since each request-response exchange lives only as long as the request is outstanding, how do we manage state information for each user?
The term "cookie" is very familiar but what is it? It's not part of the HTTP specification but rather an add-on that is described in other specifications.
Cookies are stored by the client browser and transmitted to the server in a request header directive marked "Cookie:".
The server can tell the client to set a cookie using the "Set-Cookie:" directive. The web browser is then responsible for transmitting the appropriate cookies back to the server that set them.
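Python's `http.cookies` module parses these directives. A brief sketch (the cookie name "Site" and its attributes are invented for illustration):

```python
from http.cookies import SimpleCookie

# Parse a Set-Cookie header value received from the server.
jar = SimpleCookie()
jar.load("Site=abc123; Path=/; Max-Age=3600")

morsel = jar["Site"]
# On the next request to that server the browser echoes the pair back:
cookie_header = "Cookie: %s=%s" % (morsel.key, morsel.value)
```

The attributes (Path, Max-Age and so on) steer when the browser sends the cookie; only the name=value pair travels back to the server.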
Three types of cookie exist, distinguished by their lifetime:
- Session, lasting as long as the web browser is open (no expiry attribute is set).
- Expires, persisting until the fixed date given in the "expires" attribute.
- Max-Age, persisting for the given number of seconds, as defined in RFC 2109.
In all cases a user can clear their cookies no matter what life expectancy they were set to.
Rather than give examples here we dive into this subject when tracing a live web site below.
3.5 Cache Control
The last directive controls caching. This is a huge subject, but in its simplest form you can stop a page being cached with one response header directive:

Cache-Control: no-cache
Once it is received by the web browser, all subsequent requests for that page will be issued with the same directive, preventing proxies and the client's local cache from serving stale data.
In fact, some web servers (Netscape) have caching components built into the server side that need the directive set to prevent sensitive personalised information from being sent to multiple users.
3.6 HTTP/1.1 Through Proxies
The subject of HTTP/1.1 will be covered in more detail in future updates to this article. For now it is important to know that when using a proxy server (regularly or when debugging), IE defaults to HTTP/1.0, which performs poorly compared with HTTP/1.1. So before using a tool such as Charles, it is important to enable HTTP/1.1 for proxy connections in IE under "Tools | Internet Options | Advanced".
4 Charles - Debugging Proxy
Enough with the theory! Below are live examples of how debugging HTTP can give a deep insight into web applications.
Every resource (images, HTML and so on) is fetched with its own GET request.
A response is returned below, with the response header (upper pane) and response data (lower pane).
Charles untangles all the communication and presents it in a site-structure format, making it easy to locate resources. IE uses two connections concurrently while accessing a site for better performance; Charles handles this concurrency automatically.
This allows more data to be sent to the web server, as GET has a finite limit. For completeness, the response to the POST is shown below; notice that the server is Apache/1.3.27 running on UNIX.
When there are problems, you can see exactly which resource failed with the dreaded 404. Notice how the web server returns an HTML fragment that the browser may optionally display.
And when a site is down, an exception is reported and the response remains empty.
Each subsequent request by the client now includes the cookie (cookie name = "Site").