DBI
Representation and Management of Data on the Internet
HTTP
HyperText Transfer Protocol
In the Beginning… The Internet
FTP – File Transfer Protocol
SMTP – Simple Mail Transfer Protocol
NNTP – Network-News Transfer Protocol
HTTP – HyperText Transfer Protocol
Let there be a Web
Tim Berners-Lee
The Creation of the Web
• Tim Berners-Lee implemented the HTTP protocol in 1990-1 at CERN, the European Center for High-Energy Physics in Geneva, Switzerland.
• The World-Wide Web is based upon
– Information representation in HTML (HyperText Markup Language) documents
– Resource transmission in HTTP (HyperText Transfer Protocol)
Previous HTTP Versions
• HTTP/0.9 used by WWW since 1990
• HTTP/1.0 [RFC 1945]
– Supports MIME (Multipurpose Internet Mail Extension) messages [RFC 1341]
• MIME transmits non-textual files by encoding them
– Content negotiation
• HTTP/1.1 [RFC 2068]
– Persistent connections
– Caching
General Features
• Lightness and speed (response time of 100 ms in a hypertext jump)
• Client-Server protocol
• Stateless object-oriented protocol
• Open-ended set of methods and headers
• Typing and negotiation of data representation
Terminology
• User agent: client which initiates a request (browser, editor, web robot, …)
• Origin server: the server on which a given resource resides
• Proxy: acts as both a server and a client
• Gateway: server which acts as intermediary for other servers
• Tunnel: acts as a blind relay between two connections
Client-Server Protocol
• The browser is the client
• The client sends requests to an HTTP Server
Client-Server Sessions
• The HTTP protocol supports a short conversation between browser and server
• The request line, status line, and headers are sent as ASCII text; the message body may carry arbitrary 8-bit data
• The standard (and default) port for HTTP servers to listen on is 80, though they can use any port
HTTP Session
• A basic HTTP session has four phases:
1. Client opens the connection (a TCP connection)
2. Client makes the request
3. Server sends a response
4. Server closes the connection
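The four phases can be sketched with a raw TCP socket. This is only an illustration under stated assumptions: the function name `http_session` and its parameters are hypothetical, and the request carries no headers.

```python
import socket

def http_session(host: str, port: int = 80, path: str = "/") -> bytes:
    """Run one basic HTTP/1.0 session and return the raw response bytes."""
    # Phase 1: the client opens a TCP connection.
    with socket.create_connection((host, port), timeout=10) as conn:
        # Phase 2: the client makes the request (initial line + blank line).
        request = f"GET {path} HTTP/1.0\r\n\r\n"
        conn.sendall(request.encode("ascii"))
        # Phase 3: the server sends the response; read until end of stream.
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                # Phase 4: the server has closed the connection.
                break
            chunks.append(data)
    return b"".join(chunks)
```

A call such as `http_session("www.cs.huji.ac.il", 80, "/~dbi/index.html")` would return the status line, headers, and body as one byte string, provided the host is reachable.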
Nested Objects
• Suppose a client accesses a page containing 10 inline images; to display the page completely would require 11 HTTP sessions
• Some browsers/servers support a feature called keep-alive which can keep the connection open until it is explicitly closed
[Figure: Index.html with its nested objects: left frame, jumping fish, right frame, fairy icon, HUJI icon]
Stateless Protocol
• HTTP is a stateless protocol, which means that once a server has delivered the requested data to a client, the connection is broken, and the server retains no memory of what has just taken place
Resources
• A resource is a chunk of information that can be identified by a URL (Uniform Resource Locator)
– The most common kind of resource is a file, but a resource may also be
• A dynamically-generated query result
• The output of a CGI script, or
• An active server page
URL
• Uniform Resource Identifiers [RFC 2396] are used to specify the object of a method
– as an address (URL)
– as a name (URN)
URL = “http://” host [“:” port] [path]
IP addresses in URLs should be avoided [RFC 1900]
Different URLs
• There are different types of URLs
– http://<host>:<port>/<path>?<searchpart>
– mailto:<account@site>
– news:<newsgroup-name>
In a URL
• In the query part, spaces are represented by “+”
• Characters such as &,+,% are encoded in the form “%xx” where xx is the ASCII value in hexadecimal; for example, “&” = “%26”
• The inputs to the parameters are in a list of the following form
Var1=value1&var2=value2&var3=value3
War&peace Tolstoy
http://www.google.com/search?lr=&safe=off&q=war%26peace+Tolstoy
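The encoding rules above can be reproduced with Python's standard `urllib.parse` module. This is a minimal sketch; the helper name `build_query` is hypothetical, and the parameter names are taken from the example URL.

```python
from urllib.parse import urlencode, parse_qs

def build_query(params: dict) -> str:
    # urlencode joins name=value pairs with "&", writes spaces as "+",
    # and percent-encodes reserved characters such as "&" (as %26).
    return urlencode(params)

query = build_query({"lr": "", "safe": "off", "q": "war&peace Tolstoy"})
print("http://www.google.com/search?" + query)

# Decoding reverses the process and recovers the original search text.
decoded = parse_qs(query, keep_blank_values=True)
print(decoded["q"][0])
```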
Format of Request and Response
• An initial line
• Zero or more header lines
• A blank line (i.e., a CRLF by itself), and
• An optional message body (e.g., a file, query data, or query output)
Note: CRLF = “\r\n” (usually ASCII 13 followed by ASCII 10)
Request
• A request consists of:
– Initial line
– Headers
– Blank line
– Message body
Initial Line of a Request
• The initial line consists of
– Method
– Path
– HTTP Version
Request Format
Request Example
GET /courses/dbi/index.html HTTP/1.0
From: [email protected]
User-Agent: HTTPTool/1.0
[blank line here]
(The initial line holds the method, path, and version; the lines that follow it are headers.)
Do Not Forget CRLF
GET /courses/dbi/index.html HTTP/1.0 [CRLF]
From: [email protected] [CRLF]
User-Agent: HTTPTool/1.0 [CRLF]
[CRLF]
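A small sketch of how a client might assemble such a request so that no CRLF is forgotten. The helper `build_request` is hypothetical and handles only requests without a message body.

```python
CRLF = "\r\n"

def build_request(method: str, path: str, headers: dict,
                  version: str = "HTTP/1.0") -> bytes:
    # Initial line: method, path, and HTTP version separated by spaces.
    lines = [f"{method} {path} {version}"]
    # One "Name: value" line per header.
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # Every line ends with CRLF; one extra CRLF forms the blank line.
    return (CRLF.join(lines) + CRLF + CRLF).encode("ascii")

raw = build_request("GET", "/courses/dbi/index.html",
                    {"User-Agent": "HTTPTool/1.0"})
print(raw)
```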
Request Methods
• GET returns the contents of the indicated document
– The most frequently used command
• HEAD returns the header information for the indicated document
– Useful for finding out info about a resource without retrieving it
• POST treats the document as a script and sends some data to it
More Methods
• PUT replaces the contents of the document with some data
• DELETE deletes the indicated document
• TRACE invokes a remote loop-back of the request. The final recipient SHOULD reflect the message back to the client
• Usually these methods are not allowed
GET Method
• GET is the most common HTTP method
• It says “give me this resource”
GET Requests With a Proxy
[Figure: without a proxy, the client connects to the web server www.cs.huji.ac.il and requests ~/dbi/index.html; with a proxy, the client sends the full URL http://www.cs.huji.ac.il/~dbi/index.html to the proxy server, which contacts the web server]
HEAD Request
• A HEAD request asks the server to return the response headers only, and not the actual resource (i.e., no message body)
• Same as GET but without the message body
• This is useful for checking characteristics of a resource without actually downloading it, thus saving bandwidth
• Used for testing hypertext links for validity, accessibility and recent modification
Post
• A POST request can send data to the server
• POST is mostly used in form-filling
– The contents of a form are translated by the browser into some special format and sent to a script on the server using the POST command
Post (cont.)
• There is a block of data sent with the request, in the message body
• There are usually extra headers to describe this message body, like Content-Type: and Content-Length:
• The request URI is a program to handle the sent data, not a resource to retrieve
• The HTTP response is normally the output of a program, not a static file
Post Example
• Here's a typical form submission, using POST:
POST /path/script.cgi HTTP/1.0
From: [email protected]
User-Agent: HTTPTool/1.0
Content-Type: application/x-www-form-urlencoded
Content-Length: 35
[blank line here]
home=Ross+109&favorite+flavor=flies
(the message body is 35 characters long, matching Content-Length)
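The relation between the form fields, the encoded body, and Content-Length can be checked with a short sketch. The helper `build_post` is hypothetical; the field names and values are those of the example above.

```python
from urllib.parse import urlencode

def build_post(path: str, fields: dict) -> bytes:
    # Encode the form fields the way a browser does for
    # application/x-www-form-urlencoded submissions.
    body = urlencode(fields).encode("ascii")
    head = (
        f"POST {path} HTTP/1.0\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        # Content-Length counts the bytes of the body only.
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body

message = build_post("/path/script.cgi",
                     {"home": "Ross 109", "favorite flavor": "flies"})
print(message.decode("ascii"))
```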
Headers
• HTTP 1.0 defines 16 headers
– none are required
• HTTP 1.1 defines 46 headers
– one header (Host:) is required in requests
Headers
• From:
– gives the email address of whoever is making the request or running the program doing so
• User-Agent:
– identifies the program that's making the request, in the form "Program-name/x.xx"
• x.xx is the (mostly) alphanumeric version of the program
• For example, Netscape 3.0 sends the header "User-agent: Mozilla/3.0Gold"
Headers (cont.)
• Server:
– analogous to the User-Agent: header
– it identifies the server software in the form "Program-name/x.xx"
– For example, one beta version of Apache's server returns "Server: Apache/1.2b3-dev"
Headers (cont.)
• If an HTTP message includes a body, there are usually header lines in the message that describe the body. In particular,
• Content-Type:
– gives the MIME-type of the data in the body, such as text/html or image/gif
• Content-Length:
– gives the number of bytes in the body
Headers (cont.)
• Last-Modified:
– Gives the modification date of the resource that's being returned
– It's used in caching and other bandwidth-saving activities
– Greenwich Mean Time should be used and the format is
Last-Modified: Fri, 31 Dec 1999 23:59:59 GMT
Initial Line of a Response
• The initial line of a response is also called the status line.
• The initial line consists of
– HTTP version
– response status code
– reason phrase that describes the status code
Response Format
Response Example
HTTP/1.0 200 OK
Date: Fri, 31 Dec 1999 23:59:59 GMT
Content-Type: text/html
Content-Length: 1354
[blank line here]
<html>
<body>
<h1>Hello World</h1>
(more file contents)
. . .
</body>
</html>
(The initial line holds the version, status code, and reason phrase; it is followed by the headers, a blank line, and the message body.)
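The parts of a response can be separated with a few string operations. The following is a rough sketch, not a complete parser: `parse_response` is a hypothetical helper that assumes the whole response is already in memory, that the status line includes a reason phrase, and that no header is repeated or folded.

```python
def parse_response(raw: bytes):
    # The first blank line (CRLF CRLF) separates the head from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # Status line: version, status code, reason phrase.
    version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lower case.
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body

sample = (b"HTTP/1.0 200 OK\r\n"
          b"Content-Type: text/html\r\n"
          b"Content-Length: 5\r\n"
          b"\r\n"
          b"Hello")
version, code, reason, headers, body = parse_response(sample)
print(version, code, reason, headers["content-type"], body)
```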
Status Code
• The status code is a three-digit integer, and the first digit identifies the general category of response:
– 1xx indicates an informational message only
– 2xx indicates success of some kind
– 3xx redirects the client to another URL
– 4xx indicates an error on the client's part
• Yes, the system blames it on the client if a resource is not found (i.e., 404)
– 5xx indicates an error on the server's part
Status Code 1xx
• The 100 (Continue) Status
– Allows a client to determine if the server is willing to accept the request (based on the request headers) before the client sends the request body
– The client’s request must have the header Expect: 100-continue
• 101 Status -- Switching Protocols
Status Code 2xx
Status codes 2xx -- Success
• The action was successfully received, understood, and accepted
– 200 OK
– 201 Created (e.g., POST command successful)
– 202 Request accepted
– 203 Non-authoritative information
– 204 No content
Status Code 3xx
Status codes 3xx -- Redirection
• Further action must be taken in order to complete the request
– 300 Resource found at multiple locations
– 301 Resource moved permanently
– 302 Resource moved temporarily
– 304 Resource has not been modified (since date)
Status Code 4xx
Status codes 4xx -- Client error
• The request contains bad syntax or cannot be fulfilled
– 400 Bad request from client
– 401 Unauthorized request
– 402 Payment required for request
– 403 Resource access forbidden
– 404 Resource not found
– 405 Method not allowed for resource
– 406 Resource type not acceptable
Status Code 5xx
Status codes 5xx -- Server error
• The server failed to fulfill an apparently valid request
– 500 Internal server error
– 501 Method not implemented
– 502 Bad gateway
– 503 Service unavailable (e.g., server overload)
– 504 Gateway timeout
Response Information
• Header: description of information
– Server: type of server
– Date: date and time
– Content-Length: number of bytes
– Content-Type: MIME type
– Content-Language: English, for example
– Content-Encoding: data compression
– Last-Modified: date when last modified
– Expires: date when file becomes invalid
Manually Experimenting with HTTP
>host www
www.cs.huji.ac.il is a nickname for vafla.cs.huji.ac.il
vafla.cs.huji.ac.il has address 132.65.80.39
vafla.cs.huji.ac.il mail is handled (pri=10) by cs.huji.ac.il
>telnet www.cs.huji.ac.il 80
Trying 132.65.80.39…
Connected to vafla.cs.huji.ac.il.
Escape character is ‘^]’.
Sending a Request
>GET /~dbi/index.html HTTP/1.0
[blank line]
The Response
HTTP/1.1 200 OK
Date: Sun, 11 Mar 2001 21:42:15 GMT
Server: Apache/1.3.9 (Unix)
Last-Modified: Sun, 25 Feb 2001 21:42:15 GMT
Content-Length: 479
Content-Type: text/html
[blank line]
<html> (html code …)</html>
• GET /~dbi/index.html HTTP/1.0 is answered with HTTP/1.1 200 OK, followed by HTML code
• GET /~dbi/no-such-page.html HTTP/1.0 is answered with HTTP/1.1 404 Not Found, followed by HTML code
• GET /index.html HTTP/1.1 is answered with HTTP/1.1 400 Bad Request, followed by HTML code
Why is it a Bad Request?
The request uses HTTP/1.1 without a Host header
HTTP 1.1
HTTP/1.1 is replacing/has replaced HTTP/1.0 as the new Web protocol
Improvements
• Faster response
– allowing multiple transactions to take place over a single persistent connection
– adding cache support
• Faster response for dynamically-generated pages
– supporting chunked encoding, which allows a response to be sent before its total length is known
• Efficient use of IP addresses
– allowing multiple domains to be served from a single IP address
Improvements over HTTP 1.0
• HTTP/1.1 has a number of features/improvements over HTTP/1.0, including
– Persistent TCP connections
– Partial document transfers
– Conditional fetch
– Support for nonstandard HTTP/1.0 extensions
– Better support for alternative character sets
– More flexible authentication
– Faster response and great bandwidth savings
– Efficient use of IP addresses (virtual hosting)
Non-Persistent Connections
1 Browser opens TCP connection to port 80 of server (handshake)
2 Browser sends HTTP request message
3 Server receives request, locates object, sends response
4 Server closes TCP connection
5 Client receives response, parses object
6 Repeat 1-4 for each embedded object
Persistent Connection
1 Browser opens TCP connection to port 80 of server (handshake)
2 Browser sends HTTP request message
3 Server receives request, locates object, sends response
4 Client receives response, parses object
5 Repeat 2-4 for each embedded object
6 TCP connection closes on demand or timeout
Advantages of Persistent Connection
• CPU time saved in routers and hosts
• HTTP requests and responses can be pipelined on a connection
• network congestion is reduced
• latency on subsequent requests is reduced
Pipelines
• 2 types of persistent connections
– without pipelining
• the client issues a new request only after the previous response has arrived
– with pipelining
• client sends the request as soon as it encounters a reference
• multiple requests/responses
– on the same IP packet, or
– on back-to-back packets
Virtual Hosts
• With HTTP 1.1, one server at one IP address can be multi-homed:
– “www.cs.huji.ac.il” and “www.math.huji.ac.il” can live on the same server
– These are called virtual hosts
– Without this mechanism, we have to use 2 different IP addresses
• It is like several people sharing one phone
• An HTTP request must specify the host name (and possibly port) for which the request is intended
Example
• The request specifies the host:
GET /path/file.html HTTP/1.1
Host: www.host1.com:80
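On the server side, the Host header can drive the choice of site. The sketch below is illustrative only: the `VIRTUAL_HOSTS` table, the document-root paths, and the choice to answer 404 for an unknown host are assumptions (real servers often fall back to a default site instead).

```python
# Hypothetical mapping from host name to document root.
VIRTUAL_HOSTS = {
    "www.cs.huji.ac.il": "/var/www/cs",
    "www.math.huji.ac.il": "/var/www/math",
}

def select_docroot(headers: dict):
    """Return (status code, document root) for an HTTP/1.1 request."""
    host = headers.get("Host")
    if host is None:
        # HTTP/1.1 requests without a Host header get 400 Bad Request.
        return 400, None
    # Drop the optional ":port" suffix; host names are case-insensitive.
    hostname = host.split(":", 1)[0].lower()
    root = VIRTUAL_HOSTS.get(hostname)
    if root is None:
        return 404, None  # assumption: unknown hosts are not served
    return 200, root

print(select_docroot({"Host": "www.math.huji.ac.il:80"}))
print(select_docroot({}))
```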
Virtual Hosting (cont.)
• Virtual hosting
– reduces hardware expenditures
– extends the ability to support additional servers
– makes load balancing and capacity planning much easier
• Without it
– each host name requires a unique IP address, and we are quickly running out of IP addresses with the explosion of new domains
The Date Header
• In HTTP 1.1, servers must include the generation time of the response in the Date: header
• Time values use Greenwich Mean Time (GMT) and have the format
Date: Fri, 31 Dec 1999 23:59:59 GMT
• Date is omitted only in a few cases, e.g., status code 100 (continue) and some server errors
• Servers must synchronize their clocks with a reliable external standard
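The date format shown above can be produced and parsed with Python's standard `email.utils` module. A minimal sketch; the helper name `http_date` is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def http_date(moment: datetime) -> str:
    # Convert to UTC first; usegmt=True writes the zone as "GMT".
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)

stamp = http_date(datetime(1999, 12, 31, 23, 59, 59, tzinfo=timezone.utc))
print("Date: " + stamp)

# Parsing the header value gives back a timezone-aware datetime.
parsed = parsedate_to_datetime(stamp)
print(parsed.isoformat())
```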
Caching
Caching improves performance
• Eliminates the need to send requests in many cases (reduces network round-trips), using an expiration mechanism
• Eliminates the need to send full responses in other cases (reduces network bandwidth), using a validation mechanism
Client Caching
[Figure: a client with a local cache, and a server]
• Client sends GET /fruit/apple.gif
• Server responds with Last-Modified: ...
• Client caches the object and its last-modified date
• Client later sends GET /fruit/apple.gif … If-Modified-Since: …
• Server returns either 304 Not Modified or the object
Network Caches
[Figure: several clients send GET /fruit/apple.gif through a shared proxy server, which reaches the servers over the Internet]
Benefit of Caching
[Figure: clients on a 10 Mbps LAN share a proxy server with a 40% hit rate; routers (R) connect the LAN to the servers over a 1.5 Mbps link; the load is 15 req/sec at 100 Kbits/req]
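The numbers in the figure can be turned into a quick estimate of how busy the 1.5 Mbps link is. This is a back-of-the-envelope sketch that assumes 100 Kbits means 100,000 bits, that cache hits generate no traffic on the link, and that request sizes are negligible; `link_utilization` is a hypothetical helper.

```python
def link_utilization(req_per_sec: float, bits_per_req: float,
                     link_bps: float, hit_rate: float = 0.0) -> float:
    # Only cache misses cross the access link to the origin servers.
    miss_traffic_bps = req_per_sec * bits_per_req * (1.0 - hit_rate)
    return miss_traffic_bps / link_bps

no_cache = link_utilization(15, 100_000, 1_500_000)
with_cache = link_utilization(15, 100_000, 1_500_000, hit_rate=0.40)
print(f"without proxy cache: {no_cache:.0%}")
print(f"with 40% hit rate:   {with_cache:.0%}")
```

Under these assumptions the link is fully loaded without the proxy (1.5 Mbps of responses on a 1.5 Mbps link) and about 60% loaded with it.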
Expiration Model
• Servers may provide an expiration time using the Expires header
– By checking the expiration time, the cache can return a fresh response without contacting the server
• If the expiration time is not specified, the cache can heuristically estimate the expiration time (e.g., using header values, such as the Last-Modified time)
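One way a cache might apply this model is sketched below. The 10% fraction of the document's age is an assumed heuristic (a commonly suggested value), not something the slides prescribe, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def freshness_lifetime(date, expires=None, last_modified=None, fraction=0.1):
    """How long after `date` (response generation time) the copy stays fresh."""
    if expires is not None:
        # Explicit expiration time supplied by the server.
        return expires - date
    if last_modified is not None:
        # Heuristic: a document unchanged for a long time is likely
        # to stay unchanged for a fraction of that time.
        return (date - last_modified) * fraction
    return timedelta(0)  # nothing to go on: treat the copy as stale

def is_fresh(now, date, expires=None, last_modified=None):
    return now - date < freshness_lifetime(date, expires, last_modified)

date = datetime(2001, 3, 11, 21, 42, 15, tzinfo=timezone.utc)
modified = date - timedelta(days=10)   # heuristic lifetime: about 1 day
print(is_fresh(date + timedelta(hours=12), date, last_modified=modified))
print(is_fresh(date + timedelta(days=2), date, last_modified=modified))
```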
The Risk in Caching
• Response might not be “semantically transparent”
– the response is different from what would have been returned by the origin server
• The cache should verify that the copy is fresh (i.e., expiration time has not passed)
• The copy is stale if it is not fresh
Validators
• A validator is any mechanism that may help in determining whether a copy is fresh or stale
– A strong validator is, for example, a counter that is incremented whenever the resource is changed
– A weak validator is, for example, a counter that is incremented only when a significant change is made
Using the Cache
• To check whether a copy is fresh, the cache must either
– Use the expiration model, or
– Compare the Last-Modified time or some validator with the origin server
• In the second case, the origin server either
– Responds with the message 304 (Not Modified), or
– Sends a full response with the entity body
Cache-Control Header
• Cache-control headers specify directives to the cache
– Can be included in either requests or responses
• The server can specify “must-revalidate”
– Cache must revalidate with the origin server that the copy is still fresh
• The client can specify
– the max-age of an unvalidated response
– the max-stale time of a stale copy
Do Not Use a Cache
• The Pragma: no-cache request header indicates that the request should not be satisfied from a cache
• Same as the no-cache cache-directive
• Should include both if server is not HTTP/1.1 compliant
• Directive applies to any recipient along the request/response chain
If-Modified-Since Header
• The If-Modified-Since: header is used with a GET request
• If the requested resource has been modified since the given date, the server returns the resource as it normally would (i.e., header is ignored)
• Otherwise, the server returns a 304 Not Modified response, including the Date: header, but with no message body
HTTP/1.1 304 Not Modified
Date: Fri, 31 Dec 1999 23:59:59 GMT
[blank line here]
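A server's decision between a full response and 304 can be sketched as follows. The function `conditional_get` is hypothetical; it assumes `last_modified` is a timezone-aware UTC datetime and treats an unparsable header value as if the header were absent.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def conditional_get(last_modified: datetime, headers: dict) -> int:
    """Return the status code for a GET that may carry If-Modified-Since."""
    value = headers.get("If-Modified-Since")
    if value is None:
        return 200
    try:
        since = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return 200  # unparsable date: ignore the header
    if last_modified > since:
        return 200  # modified since the given date: send the resource
    return 304      # not modified: headers only, no message body

modified = datetime(2001, 2, 25, 21, 42, 15, tzinfo=timezone.utc)
print(conditional_get(modified, {"If-Modified-Since": "Sun, 11 Mar 2001 21:42:15 GMT"}))
print(conditional_get(modified, {"If-Modified-Since": "Fri, 31 Dec 1999 23:59:59 GMT"}))
```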
If-Unmodified-Since Header
• The If-Unmodified-Since: header can be used with any method
• If the requested resource has not been modified since the given date, the server returns the resource as it normally would
• Otherwise, the server returns a 412 Precondition Failed response
HTTP/1.1 412 Precondition Failed
[blank line here]
Cooperative Caching
Cooperative Caching (cont.)
• Higher level cache (e.g., national cache)
– larger user population
– higher hit rates
• Multiple Web caches which cooperate => improve overall performance
• Cooperative caches are usually built from clusters
– divide the traffic overhead
– improve storage capacity
Cooperative Caching (cont.)
• Which caches should be asked for a particular doc?
• Hash routing (of URLs) -- an object will not be present in more than one cache
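Hash routing can be illustrated with a few lines of code. This is a simplified sketch: the cache names are made up, SHA-256 is an arbitrary choice of hash, and plain modulo hashing reshuffles most URLs when the number of caches changes (real deployments may use more elaborate schemes).

```python
import hashlib

def route_to_cache(url: str, caches: list) -> str:
    # Hash the URL and map the digest onto one of the caches, so that
    # every client asks the same cache for the same object.
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(caches)
    return caches[index]

caches = ["cache-a", "cache-b", "cache-c"]  # hypothetical cache names
url = "http://www.cs.huji.ac.il/~dbi/index.html"
print(route_to_cache(url, caches))
```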
Hop by Hop
• HTTP/1.1 introduces the concept of hop-by-hop headers:
– Message headers that apply only to a given connection, and not to the entire path
– It enables much more power with the usage of proxies (caches)
Hop-by-Hop Headers
• Connection
– options that are desired for that particular connection (e.g., Connection: close)
• Public
– lists the set of methods supported by the server
• Proxy-Authenticate
– enables authentication methods between two hops
• Transfer-Encoding
– compression method between two hops
• Upgrade
– additional communication protocols
Chunked Encoding
• Chunked encoding
– Transmission of streaming multimedia
• One frame varies in size and composition from the next
– Streaming video
• Entire image transmitted in first chunk and differences from the previous image are transmitted in the next chunk
Wake up, we speak about movies in the Internet
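On the wire, a chunked body is a sequence of chunks, each preceded by its size in hexadecimal, and closed by a zero-size chunk. The sketch below shows the basic framing only; the helper names are hypothetical, and trailers and chunk extensions are ignored.

```python
def encode_chunked(pieces) -> bytes:
    out = bytearray()
    for piece in pieces:
        if not piece:
            continue  # a zero-length chunk would end the body early
        # Chunk = size in hex, CRLF, data, CRLF.
        out += f"{len(piece):X}".encode("ascii") + b"\r\n" + piece + b"\r\n"
    out += b"0\r\n\r\n"  # last chunk (size 0) and the final blank line
    return bytes(out)

def decode_chunked(raw: bytes) -> bytes:
    body, pos = bytearray(), 0
    while True:
        eol = raw.index(b"\r\n", pos)
        size = int(raw[pos:eol].split(b";")[0].decode("ascii"), 16)
        if size == 0:
            return bytes(body)
        start = eol + 2
        body += raw[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the data

wire = encode_chunked([b"Hello, ", b"World"])
print(wire)
print(decode_chunked(wire))
```

Because each chunk carries its own length, the sender does not need to know the total length of the response before it starts transmitting.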
Compression
• Most image formats (GIF, JPEG, MPEG) are precompressed
• Many other data types used in the Web are not precompressed
• Compression could save almost 40% of the bytes sent via HTTP
• There is a need for negotiating the type of encoding of the compressed resource
Compression (cont.)
• Client sends the header Accept-Encoding
– The header indicates the content-encodings that the client can handle and the ones that the client prefers
• Server sends
– Content-Encoding header – for end-to-end encoding indication
– Transfer-Encoding header – for hop-by-hop encoding indication (supported only in HTTP/1.1)
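The negotiation can be sketched from the server's point of view. This simplified example supports gzip only and ignores q-values in Accept-Encoding, which a complete implementation would have to honour; `encode_body` is a hypothetical helper.

```python
import gzip

def encode_body(body: bytes, accept_encoding: str):
    """Compress the body if the client's Accept-Encoding lists gzip."""
    offered = {token.split(";")[0].strip().lower()
               for token in accept_encoding.split(",")}
    if "gzip" in offered:
        # Tell the client which end-to-end encoding was applied.
        return gzip.compress(body), {"Content-Encoding": "gzip"}
    return body, {}  # no acceptable encoding: send the body unchanged

page = b"<html><body>" + b"Hello World " * 50 + b"</body></html>"
payload, extra_headers = encode_body(page, "gzip, deflate")
print(len(page), len(payload), extra_headers)
```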
Content Negotiation
• Content Negotiation:
– the process of selecting the best representation for a given response when there are multiple representations available
• HTTP supports two kinds of content negotiation:
– Server-driven negotiation
– Agent-driven negotiation
Server-Driven Negotiation
The selection is made by the server, based on:
– header fields in the request (client preferences): Accept-Language / Accept-Encoding
– available representations of the response
– other information (e.g., address of the client)
Disadvantages:
– Impossible for the server to determine what is best for the user
– Inefficiency (clients should describe their capabilities in every request)
– Complicates implementation of servers
Agent-Driven Negotiation
• Selection is made by the client after receiving an initial response from the server
– Based on available representations specified in the initial response
– Automatic or manual
• Disadvantages:
– needs a second request to obtain the best alternative representation
Protocol Switching
• Protocol switching
– Client can specify another protocol more suited to the data being transferred (e.g., real-time synchronous protocol)
“I hate HTTP/1.0. I want another protocol.”
Authentication
• Many sites require users to provide a username and password in order to access the documents housed on the server
• This requirement provides a mechanism for keeping track of users (more than just a security mechanism)
Authentication
[Figure: the client requests ~/dbi/index.html from the web server www.cs.huji.ac.il; the server answers “Who are you?”; the client repeats the request with “I am Donald, my password is Duck”; the server sends the response]
Authentication
• How does it work?
– Client sends
• ordinary request message
– Server responds with
• 401 Authorization Required status code
• WWW-Authenticate header which specifies how to perform authentication
– Client resends
• the requested message, but this time including the Authorization header (e.g., user-name & password)
– The client continues to add this header for each following request to that server
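The slides do not name an authentication scheme; assuming the common Basic scheme, the Authorization header carries the user name and password joined by a colon and Base64-encoded. The helpers below are a hypothetical sketch of both directions.

```python
import base64

def basic_authorization(user: str, password: str) -> str:
    # Basic credentials: base64("user:password"). This is an encoding,
    # not encryption, so the password is readable by anyone on the path.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")

def parse_basic(value: str):
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic credential")
    decoded = base64.b64decode(token).decode("utf-8")
    user, _, password = decoded.partition(":")
    return user, password

header_value = basic_authorization("Donald", "Duck")
print("Authorization: " + header_value)
print(parse_basic(header_value))
```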
Cookies
• Alternative way to identify browsers
• Server response includes the Set-cookie header that has the attributes
– name = VALUE
– expires = DATE STRING
– domain = DOMAIN NAME
– path = PATH
– secure
• Client returns cookie with matching URLs
Cookies
• Example:
– Client contacts a web site for the first time
– Server response includes the header:
Set-cookie : 1678453
– Client stores the cookie value and the server name in a special “cookie file”
– For each further request for that server, the client will add the header
Cookie : 1678453
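The exchange can be sketched with Python's standard `http.cookies` module. Since cookies are name=value pairs, the sketch gives the value from the example a hypothetical name, `id`; the helper names are also assumptions.

```python
from http.cookies import SimpleCookie

def make_set_cookie(name: str, value: str, path: str = "/") -> str:
    # Server side: build the Set-Cookie header line for a response.
    cookie = SimpleCookie()
    cookie[name] = value
    cookie[name]["path"] = path
    return cookie.output(header="Set-Cookie:")

def read_cookie_header(header_value: str) -> dict:
    # Server side, later requests: parse the Cookie header the client
    # sends back into a name -> value dictionary.
    cookie = SimpleCookie()
    cookie.load(header_value)
    return {key: morsel.value for key, morsel in cookie.items()}

print(make_set_cookie("id", "1678453"))
print(read_cookie_header("id=1678453"))
```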
Cookies (cont.)
• Usage:
– Server requires authentication, but doesn’t want to hassle a user with a user-name and password
– Remembering user’s preferences for advertising
– Cookies enable creating a virtual shopping cart
• Problems
– users who access the same site from different machines
Are you HTTP experts now?
• Not yet
• There are more headers, for example, that this talk did not cover
• To know more, go to the specifications
Additional Information
• For specifications and additional information:
– http://www.w3.org/Protocols/
– http://www.w3.org/Protocols/Specs.html
– http://www.jmarshall.com/easy/http/
– http://wdvl.com/Internet/Protocols/HTTP/article.html