HTTP Introduction and Debugging

Author: Gordon McKinney

History
TBA -- Will introduce details of HTTP/1.1 and anatomy of cookies
09 Mar 2004 -- Added several new links to Further Reading
30 Nov 2003 -- Added Charles screenshots and HTTP/1.1 through proxies
31 Oct 2003 -- Added links to HTTP/TCPIP books (see end)
04 Feb 2001 -- First Release

1 Intro

"Hypertext Transfer Protocol" - HTTP is the single most important technology that drives the web, and yet it remains virtually transparent. Without this protocol, HTML and XML could not perform the myriad tasks that we put them to daily on the web.

This article aims to cover the key concepts of HTTP, the tools needed for debugging and where to find the relevant Internet standards for more detail. Throughout later sections full graphical examples are given illustrating each concept with live web sites.

 

2 Intended Audience

Why bother learning about HTTP when most developers manage to produce web sites and never have to deal with HTTP directly?

The simple answer is that every form that's posted and every cookie that you rely on is sent over HTTP. Wouldn't it be nice to see exactly what is being sent to and from the web server and know for certain what is working and what is broken?

Here are situations where knowing how HTTP works will save you time and gain you credibility with the client.

- For developers this will provide an invaluable aid to diagnosing and testing your code, as you'll be able to construct your own HTTP messages with nothing more than notepad and telnet.

- Security audits become simpler for non-secure sites as you'll be able to determine where the weaknesses are.

- Troubleshooting such problems as "The site is not responding" when you know the server is running and pinging.

 

3 What is HTTP

HTTP is a protocol run over TCP with all the features necessary for interacting with web servers that hold text and binary resources.

TCP guarantees that data travelling to and from the web server arrives error free and in the right order. It does not, however, guarantee that the data arrives at all: when the network is congested or unavailable, web page delivery is slow and can time out.

The sections below outline the key features of HTTP and therefore how a browser and web server interact. Knowing this when writing client side JavaScript and server side ASP or Vignette will save many hours of head scratching when code doesn't work.

3.1 Asynchronous Protocol

Asynchronous means "not at the same time". This is the basis of the request-response architecture of HTTP.

A request is issued and the response will return some time later. The web browser does not actively wait for the response; instead it leaves the line of communication to the server open until the response arrives or a timeout occurs.

It is important to note that it is request-response from the client web browser. A server cannot send unsolicited responses. There are however web-push technologies that are not discussed here.

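A web browser normally keeps no more than two requests outstanding at once. This follows a recommendation in the HTTP/1.1 specification (RFC 2616, section 8.1.4), which says that a single-user client should not maintain more than two connections with any server or proxy. It is designed to prevent any single individual from overloading the server.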

3.2 The Request

The most frequently used requests are GET and POST. The GET fetches a resource from the web server using a path, not a fully qualified URL as the server is implicitly the second party in the communication.

GET /index.html HTTP/1.1
Host: www.aserver.org

Servers that host several sites on one address (virtual hosting) need to know which host the URL named, so it is included in the message header with the "Host" directive.

That's it! This is all that's required to fetch a resource from the web server. The above example can be typed by hand into telnet (port 80), followed by a blank line, and the web page will be returned. HTTP/1.1 requires the Host header on every request; only an HTTP/1.0 request to a server that has no virtual hosting can get by with the first line alone.
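The telnet session described above can also be sketched in Python. This is a minimal illustration under stated assumptions, not part of the original tooling: the helper names build_get_request and fetch are made up for the example, the "Connection: close" header is added only so that the read loop ends, and www.aserver.org is the article's placeholder host, so the network call is left commented out.

```python
import socket


def build_get_request(path: str, host: str) -> bytes:
    """Return the raw bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",  # assumption: ask the server to close when done
        "",                   # blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(path: str, host: str, port: int = 80, timeout: float = 10.0) -> bytes:
    """Send the request over a plain TCP socket and return the raw response."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_get_request(path, host))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)


print(build_get_request("/index.html", "www.aserver.org").decode("ascii"))
# raw = fetch("/index.html", "www.aserver.org")  # needs a reachable server
```

Printing the request shows exactly the text that would otherwise be typed into telnet, with each line ending in a carriage return and line feed.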

When writing interactive web pages we will need to pass details to the web server using URL encoding. An example below illustrates this with AccountNo=276 and Amount=50 accessing the balance page:

GET /balance.html?AccountNo=276&Amount=50 HTTP/1.1
Host: www.aserver.org

You'll notice the page now has a ? to indicate where the page path ends and the variables begin. Each variable in turn is delimited by an & character. Special characters like space and ampersand that have to be part of a variable name or value must be escaped with a % sign followed by the character's two-digit hexadecimal ASCII code, so a space becomes %20 and an ampersand becomes %26.
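As a rough sketch, Python's standard urllib.parse module performs this encoding. The variable names come from the example above; the value "Smith & Sons" is a hypothetical one chosen only to show the escaping.

```python
from urllib.parse import quote, urlencode

# The two variables from the balance page example.
query = urlencode({"AccountNo": 276, "Amount": 50})
request_line = f"GET /balance.html?{query} HTTP/1.1"

# A hypothetical value containing a space and an ampersand.
escaped = quote("Smith & Sons", safe="")

print(request_line)  # GET /balance.html?AccountNo=276&Amount=50 HTTP/1.1
print(escaped)       # Smith%20%26%20Sons
```

Note that urlencode writes a space as + rather than %20, which is the convention for form data; quote uses the % form throughout.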

So why have POST at all when we can pass information with GET? The answer is twofold. First, browsers and servers impose a practical limit on the length of a URL, and therefore on the amount of information a GET request can carry. Second, the GET variables are visible in the web browser address bar, so anyone watching can read them and they can end up in a bookmark.

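A POST looks very similar to a GET, except that the variables (right of the ?) are carried in the message body, and headers describe the body's type and length:

POST /balance.html HTTP/1.1
Host: www.aserver.org
Content-Type: application/x-www-form-urlencoded
Content-Length: 23

AccountNo=276&Amount=50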
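A server has to know where the body of a POST ends, which is normally conveyed by a Content-Length header. The following sketch assembles such a request in Python; build_post_request is a hypothetical helper written for this article, and the Content-Type shown is the usual one for HTML form data.

```python
from urllib.parse import urlencode


def build_post_request(path: str, host: str, fields: dict) -> bytes:
    """Return the raw bytes of a form POST with a correctly sized body."""
    body = urlencode(fields).encode("ascii")
    head = "\r\n".join([
        f"POST {path} HTTP/1.1",
        f"Host: {host}",
        "Content-Type: application/x-www-form-urlencoded",
        f"Content-Length: {len(body)}",  # byte count of the body only
        "",                              # blank line separates headers and body
        "",
    ]).encode("ascii")
    return head + body


request = build_post_request(
    "/balance.html", "www.aserver.org", {"AccountNo": 276, "Amount": 50}
)
print(request.decode("ascii"))
```

Computing the length from the encoded body, rather than typing it by hand, avoids the common mistake of a Content-Length that no longer matches after the form fields change.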

Now that we have the basics of a request, we'll move on to the response.

3.3 The Response

The response consists of a header and body just like the request. The header opens with a status line carrying the status code, and the body contains the resource.

The response codes fall into several classes:

1xx Informational
2xx Successful
3xx Redirection
4xx Client Error
5xx Server Error

The most common are code "200 OK" and "404 Not Found" with redirections taking code "301" or "302".
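The class of a response can be read straight off the first digit of its status code. The sketch below, with the hypothetical helpers parse_status_line and status_class, splits a status line and maps the code to the classes listed above.

```python
from typing import Tuple

STATUS_CLASSES = {
    1: "Informational",
    2: "Successful",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def parse_status_line(line: str) -> Tuple[str, int, str]:
    """Split 'HTTP/1.1 404 Not Found' into version, code and reason phrase."""
    version, code, *reason = line.strip().split(" ", 2)
    return version, int(code), reason[0] if reason else ""


def status_class(code: int) -> str:
    """Map a status code to its class using the first digit."""
    return STATUS_CLASSES.get(code // 100, "Unknown")


version, code, reason = parse_status_line("HTTP/1.1 404 Not Found\r\n")
print(version, code, reason, "->", status_class(code))
```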

The response can also set other header directives that we will see next.

3.4 Cookies for State Management

Since each request-response exchange lasts only as long as the request is outstanding, how do we manage state information for each user?

The term "cookie" is very familiar but what is it? It's not part of the HTTP specification but rather an add-on that is described in other specifications.

Cookies are stored by the client browser and are transmitted to the server in a request header directive marked: "Cookie:".

The server can tell the client to set a cookie by using the "Set-Cookie:" directive. The web browser is responsible for transmitting appropriate cookies to the server that set them.

Three types of cookie exist and they are based on the lifetime:

- Session, indicates as long as the web browser is open.
- Expiring, indicates that it has a fixed time to live.
- Permanent, indicates that it will live until it is deleted by the server.

In all cases a user can clear their cookies no matter what life expectancy they were set to.

Rather than give examples here we dive into this subject when tracing a live web site below.

3.5 Caching

The last directive controls caching. This is a huge subject, but in its simplest form you can set a page to not be cached by using one response header directive:

Cache-Control: no-cache

Once the web browser receives it, proxies and the client's local cache must check back with the origin server before reusing that page, which prevents them from serving stale data.
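As a small sketch of where the directive sits, the following Python builds a raw response with or without it and checks a response for its presence. Both helpers, build_response and has_no_cache, are hypothetical names used only for illustration; a real server or framework would set the header through its own API.

```python
def build_response(body: bytes, cacheable: bool) -> bytes:
    """Return a raw 200 response, adding Cache-Control: no-cache if requested."""
    headers = [
        "HTTP/1.1 200 OK",
        "Content-Type: text/html",
        f"Content-Length: {len(body)}",
    ]
    if not cacheable:
        headers.append("Cache-Control: no-cache")
    return "\r\n".join(headers).encode("ascii") + b"\r\n\r\n" + body


def has_no_cache(raw_response: bytes) -> bool:
    """Report whether the response headers carry the no-cache directive."""
    head = raw_response.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    for line in head.split("\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(":")
        if name.strip().lower() == "cache-control":
            directives = [d.strip().lower() for d in value.split(",")]
            if "no-cache" in directives:
                return True
    return False


page = build_response(b"<p>Balance: 50</p>", cacheable=False)
print(page.decode("ascii"))
print(has_no_cache(page))
```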

In fact some web servers (Netscape) have caching components built in to the server-side that need to have the directive set to prevent sensitive personalised information from being sent to multiple users.

 

3.6 HTTP/1.1 Through Proxies

The subject of HTTP/1.1 will be covered in more detail in future updates to this article. For now it is important to know that when using a proxy server (regularly or when debugging) IE will default to HTTP/1.0 which has poor performance compared to HTTP/1.1. So before using a tool such as Charles it is important to enable HTTP/1.1 in IE, "Tools | Internet Options | Advanced".


 

4 Charles - Debugging Proxy

Enough with the theory! Below are live examples of how debugging HTTP can give a deep insight into web applications.

A request is described by a GET for every resource: images, HTML and so on.

A response is returned below with the response header (upper pane) and response data (lower pane)

Charles untangles all the communication and represents it in site-structure format, making it easy to locate resources. IE will use two conversations concurrently whilst accessing a site for increased performance. Charles handles this concurrency automatically.


Now viewing HTTP becomes very useful when tracking information submitted in an HTML form…


No prizes for guessing the search string. That was a GET submission of a form. Below is a POST; notice how the data is carried in the body of the request. The query used here is "internet", and you'll also notice the "book" variable being set to "dictionary".

This allows more data to be sent to the web server, as GET has a finite limit. For completeness the response to the POST is below; notice that the server is Apache/1.3.27 running on UNIX.

When there are problems, you can see which resource failed with a dreaded 404. Notice how the web server returns an HTML fragment to be optionally displayed by the browser.

And when a site is down an exception is reported and the response stays empty.


Cookies are present in every request and response. Here is an example of two cookies being set by the server:

 

Each subsequent request by the client now includes the cookie (cookie name = "Site").
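The cookie round trip can be sketched with Python's standard http.cookies module. The cookie name "Site" is the one mentioned above; its value, the second cookie "Prefs" and both helper functions are assumptions made up for this illustration.

```python
from http.cookies import Morsel, SimpleCookie

# Hypothetical Set-Cookie header values, one per response header line.
set_cookie_headers = [
    "Site=example; Path=/",                 # no lifetime given: a session cookie
    "Prefs=compact; Path=/; Max-Age=3600",  # fixed time to live: an expiring cookie
]

jar = SimpleCookie()
for header_value in set_cookie_headers:
    jar.load(header_value)  # what the browser does on receiving Set-Cookie


def is_session_cookie(morsel: Morsel) -> bool:
    """A cookie with neither Expires nor Max-Age lasts only for the session."""
    return not (morsel["expires"] or morsel["max-age"])


def cookie_header(cookies: SimpleCookie) -> str:
    """Build the Cookie: line the browser sends on later requests."""
    pairs = "; ".join(f"{name}={morsel.value}" for name, morsel in cookies.items())
    return f"Cookie: {pairs}"


print(cookie_header(jar))
for name, morsel in jar.items():
    print(name, "session" if is_session_cookie(morsel) else "expiring")
```

Only the name and value travel back to the server; attributes such as Path and Max-Age stay in the browser and decide when the cookie is sent and how long it is kept.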

 


5 Troubleshooting

5.1 Performance and Networking Problems

We have seen that when a server is down the resources are simply never returned; it is up to the browser to time out and give up. This can take 60 seconds or 5 minutes depending on the configuration.

When your site and all its resources are located and managed by one provider, the number of potential problems is reduced. Your site is either up or down. This changes when your web pages rely on third parties such as advert providers.

Ad providers can supply simple GIF images or more complex dynamic HTML. Each failure scenario is covered below:

An Ad provider not serving GIF images will cause that request to fail after a timeout. This will cause the web page to load more slowly as one of the two concurrent download slots is busy waiting for the failed Ad server.

When an Ad provider is supplying dynamic HTML you will see JavaScript source being requested. A page can stall completely if the JavaScript loads but then requests two more resources, for example an image rollover where both images are unavailable. Remember that only two resources can be requested at a time, leaving both slots waiting for a dead Ad server.

Both cases should cause concern, as a third party can effectively disable a web site by having their servers fail. Charles can provide the evidence for this sort of failure within seconds, as it clearly shows the requests that have failed with an exception, or simply displays "Active Connections" that may be outstanding.
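The effect of the two download slots can be illustrated with a toy simulation. This is a deliberately simplified sketch: it assumes every resource shares the same two slots, is requested in page order, and that a dead Ad server costs the 60 second timeout mentioned above. The resource names and the function simulate_page_load are hypothetical.

```python
import heapq
from typing import Dict


def simulate_page_load(durations: Dict[str, float], slots: int = 2) -> Dict[str, float]:
    """Return the time (in seconds) at which each resource finishes."""
    free_at = [0.0] * slots  # time at which each download slot becomes free
    heapq.heapify(free_at)
    finished = {}
    for name, seconds in durations.items():
        start = heapq.heappop(free_at)  # take the slot that frees up first
        end = start + seconds
        finished[name] = end
        heapq.heappush(free_at, end)
    return finished


# One dead Ad image: the page still loads, but through a single slot.
one_dead = simulate_page_load(
    {"page.html": 1, "logo.gif": 1, "ad.gif": 60, "photo.jpg": 1, "footer.gif": 1}
)

# A rollover with two dead Ad images: both slots wait and the page stalls.
two_dead = simulate_page_load(
    {"page.html": 1, "ad.js": 1, "ad_over.gif": 60, "ad_out.gif": 60, "photo.jpg": 1}
)

print(one_dead)
print(two_dead)
```

In the first run the remaining images finish within a few seconds while one slot waits on the Ad server; in the second, photo.jpg cannot even start until a timeout frees a slot.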


5.2 Dynamic HTML

As discussed above Ad providers can include JavaScript instead of static images to produce a rich experience for the user and hopefully get that much sought after 'click-through'.

While having an Ad server fail can cause a headache and possible loss of a site there is another problem. This comes in the form of JavaScript errors and incompatibilities. All HTML developers have hit NS/IE problems and most times have had to use some tricks to make them co-exist after some heavy testing.

Ad providers can upset a page of working HTML by introducing JavaScript with DHTML content. The browser normally hides the provided code when it loads the page but Charles can list all requested resources and importantly all responses, including any suspect JavaScript.

5.3 Limitations of Charles

Remember that Charles is a non-caching proxy that does affect the communications between client and server (as does any proxy). Use it for diagnosing problems but never for testing.

 

6 Internet Standards

Below is a list of the Internet standards that define HTTP:

HTTP Related documents:
RFC2616 -- Hypertext Transfer Protocol -- HTTP/1.1
RFC2965 -- HTTP State Management Mechanism (Cookies)
RFC2964 -- Use of HTTP State Management
RFC2936 -- HTTP MIME Type Handler Detection
RFC2817 -- Upgrading to TLS Within HTTP/1.1
RFC2617 -- HTTP Authentication: Basic and Digest Access Authentication

Multipurpose Internet Mail Extensions (MIME):
RFC2045 -- Part 1: Format of Internet Message Bodies
RFC2046 -- Part 2: Media Types
RFC2047 -- Part 3: Message Header Extensions for Non-ASCII Text
RFC2048 -- Part 4: Registration Procedures
RFC2049 -- Part 5: Conformance Criteria and Examples

Read RFCs at: RFC Denmark or FAQs.org



7 Downloads and Tools

Charles - Web Debugging
Charles is an HTTP proxy / HTTP monitor / Reverse Proxy / WAN Simulator with full NTLM support that enables a developer to view all of the HTTP traffic between their machine and the Internet. This includes requests, responses and the HTTP headers (which contain the cookies and caching information). The WAN simulator allows the simulation of high latency and low bandwidth links and provides detailed page timing statistics for analysis. Charles is a great all-in-one tool for debugging and performance tuning.



Ethereal
Ethereal is a free network protocol analyzer for Unix and Windows. It allows you to examine data from a live network or from a capture file on disk. You can interactively browse the capture data, viewing summary and detail information for each packet. Ethereal has several powerful features, including a rich display filter language and the ability to view the reconstructed stream of a TCP session.

 



8 Further Reading


   

Understanding Application Layer Protocols
Extract from a chapter covering TCP-based services such as HTTP, UDP services like DNS, and applications that use a combination of TCP and UDP, such as the Real Time Streaming Protocol (RTSP). Finally, we'll look at how these types of applications can be secured using Secure Sockets Layer (SSL).




       

HTTP: The Definitive Guide
Web technology has become the foundation for all sorts of critical networked applications and far-reaching methods of data exchange, and beneath it all is a fundamental protocol: HyperText Transfer Protocol, or HTTP. HTTP: The Definitive Guide documents everything that technical people need for using HTTP efficiently. A reader can understand how web applications work, how the core Internet protocols and architectural building blocks interact, and how to correctly implement Internet clients and servers.




       

HTTP Pocket Reference
All web programmers, administrators, and application developers need to be familiar with HTTP in order to work effectively. The HTTP Pocket Reference provides a solid conceptual foundation of HTTP, and also serves as a quick reference to each of the headers and status codes that compose an HTTP transaction. For those who need to get "beyond the browser," this book is the place to start.




       

TCP/IP Illustrated, Volume 1: The Protocols
TCP/IP Illustrated, Volume 1 is a complete and detailed guide to the entire TCP/IP protocol suite - with an important difference from other books on the subject. Rather than just describing what the RFCs say the protocol suite should do, this unique book uses a popular diagnostic tool so you may actually watch the protocols in action.
By forcing various conditions to occur - such as connection establishment, timeout and retransmission, and fragmentation - and then displaying the results, TCP/IP Illustrated gives you a much greater understanding of these concepts than words alone could provide. Whether you are new to TCP/IP or you have read other books on the subject, you will come away with an increased understanding of how and why TCP/IP works the way it does, as well as enhanced skill at developing applications that run over TCP/IP.




       

Ethereal Packet Sniffing
Ethereal offers more protocol decoding and reassembly than any free sniffer out there and ranks well among the commercial tools. You’ve all used tools like tcpdump or windump to examine individual packets, but Ethereal makes it easier to make sense of a stream of ongoing network communications. Ethereal not only makes network troubleshooting work far easier, but also aids greatly in network forensics, the art of finding and examining an attack, by giving a better "big picture" view. Ethereal Packet Sniffing will show you how to make the most out of your use of Ethereal.