23 January 2007 Kaiser: COMS E6125 1
COMS E6125 Web-enHanced Information Management (WHIM)
Prof. Gail Kaiser
Spring 2007

Reminders
• Class attendance required!
• Preliminary paper proposal January 29th
• Preliminary project proposal March 5th
• Paper must be individual, projects may be teams of 2-5 students
• See advice about team formation at http://york.cs.columbia.edu/classes/cs6125/team_advice.htm

Class Attendance is Required!
• Attendance will be taken at every class meeting, starting TODAY
• Final grade reduced one notch for first miss (e.g., A- -> B+)
• Final grade reduced full letter grade for second miss (e.g., A- -> B-)
• Fail (or drop) course for third miss

Today’s Topic: Basic Mechanics of the Web
• URI (~URL)
• HTTP
• Client/Server Intermediaries

What is a “URI”?
• Uniform Resource Identifier
• Compact string of characters that conforms to a certain syntax, for identifying an abstract or physical resource
• Simple and extensible format
• Example: http://york.cs.columbia.edu/classes/cs6125

What is a “Resource”?
• Some piece of information that can be identified by a URI
• The most common kind of resource is a file
• But may also be a dynamically-generated query result, the output of a script, a document available in several languages, etc.

Uniform Resource Identifier
• Uniform: aka Universal, same string can be used with the same semantic interpretation, even when mechanisms used to access the resource differ
• Resource: Conceptual mapping to an entity or set of entities - not necessarily the entity which corresponds to that mapping at any particular instance in time, not always network “retrievable”
• Identifier: An object that can act as a reference to something that has identity

Key requirement: Transcribability
• Sequence of characters
• May be transcribed from a non-network source
• Often needs to be remembered by people
• Should consist of characters that can most likely be typed into a computer, within the constraints imposed by keyboards (and related input devices) across languages and locales

Why do we usually say URL rather than URI?
• A Uniform Resource Locator (URL) belongs to the subset of URIs that identify resources via a representation of their primary access mechanism (e.g., their network “location”)
• Most popular form of URI

What’s a URI that’s not a URL?
• URN = Uniform Resource Name
• Subset of URIs that denote a resource independent of its current location or the name by which it is known or the mechanism by which it is accessed
• Required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable
• Thus not necessarily retrievable

URN vs. URL Example
• Assume a published book (the resource)
• ISBN assigned through an ISBN registration agency - this is the URN (expressed in the urn:isbn: namespace)
• Assume the entire contents of the book were placed on a Web server at http://www.xyz.com/book.gz and an FTP server at ftp://ftp.xyz.com/book.gz - both of these are URLs

URL Notation
• <scheme>://<authority><path>?<query>
• <authority>: typically, an Internet domain name
• <path>: specific to the authority, identifies the resource within the scope of the scheme and authority
• <query>: a string of information to be interpreted by the resource
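The notation above can be tried out with Python's standard urllib.parse module. This is only a sketch: the URL is a made-up example (www.example.com, port 8080, and the query fields are assumptions, not from the lecture), chosen so that every component is present.

```python
from urllib.parse import urlsplit

# Hypothetical URL, chosen only so that every component is non-empty.
url = "http://www.example.com:8080/classes/cs6125/index.html?term=spring&year=2007"
parts = urlsplit(url)

print("scheme:   ", parts.scheme)    # protocol used for retrieval
print("authority:", parts.netloc)    # domain name plus optional port
print("host:     ", parts.hostname)
print("port:     ", parts.port)
print("path:     ", parts.path)      # identifies the resource within the authority
print("query:    ", parts.query)     # interpreted by the resource itself
```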

What’s a “domain name”?
• Domain Name System (DNS)
– Maps domain names to IP addresses and vice versa
– Hierarchy of DNS servers for top level domains (.com, .edu, .uk, etc.), second level domains (columbia.edu, ibm.com, etc.), and so on
– Eventually finds IP address for individual host (e.g., www.cs.columbia.edu)
• Originated ~1982, for email (gk60@CMUA -> [email protected] -> [email protected])

What is a “scheme”?
• <scheme>:<scheme-specific-part>
• In a URL, the protocol employed for retrieval (http, ftp, file, mailto, etc.)
• More generally, a specification for defining the syntax and semantics of the rest of the URI
• Extensible because new schemes can be defined, with their own scheme-specific format after the colon (:)

Example URLs
• http://www.ietf.org/rfc/rfc3986.txt
• gopher://gopher.quux.org/1/Software/Gopher
• mailto:[email protected]
• news:news.newusers.questions
• telnet://cs.columbia.edu/

Example Absolute URIs
• http://somehost/absolute/URI/with/absolute/path/to/resource.txt
• ftp://somehost/resource.txt
• urn:a-rose-by-any-other-name

Example Relative URIs
• http://somehost/absolute/URI/with/absolute/path/to/resource.txt
• /relative/URI/with/absolute/path/to/resource.txt
• relative/path/to/resource.txt
• ../../../resource.txt
• resource.txt
• /resource.txt#frag01
• #frag01
• [empty string]

Relative Addresses
• Allow document trees to be (partially) independent of their location and scheme
• A single set of hypertext documents can be simultaneously traversable via each of the ftp, http and file schemes if the documents refer to each other using relative URIs
• Such document trees can be moved, as a whole, without changing any of the relative references
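To see how relative references are resolved, the sketch below feeds the relative forms from the earlier examples to urllib.parse.urljoin. It assumes the absolute http://somehost/... URI from those examples is the base document; that choice is an assumption made for illustration.

```python
from urllib.parse import urljoin

# Assumed base: the document that contains the relative references.
base = "http://somehost/absolute/URI/with/absolute/path/to/resource.txt"

references = [
    "/relative/URI/with/absolute/path/to/resource.txt",
    "relative/path/to/resource.txt",
    "../../../resource.txt",
    "resource.txt",
    "/resource.txt#frag01",
    "#frag01",
    "",  # the empty reference refers to the base document itself
]

# Resolve every reference against the base and show the absolute result.
resolved = {ref: urljoin(base, ref) for ref in references}
for ref, absolute in resolved.items():
    print(repr(ref), "->", absolute)
```

Changing only the scheme or host in base changes every resolved URI with it, which is why a document tree linked this way can be moved as a whole.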

URI “Standard”
• URI is an Internet protocol element defined currently in RFC 3986 (2005)
• Originally RFC 1630 (1994)

What is an “RFC”?
• Request for Comments
• One of a series, begun in 1969, of numbered Internet informational documents and standards widely followed by commercial software and freeware in the Internet and Unix communities
• All Internet standards are recorded in RFCs

Who keeps track of RFCs?
• IETF = Internet Engineering Task Force
• Open, all-volunteer organization, with no formal membership or membership requirements
• Organized into a large number of working groups, each dealing with a specific topic
• April 1st RFCs, e.g., http://www.apps.ietf.org/rfc/rfc3514.html

What is “W3C”?
• World Wide Web Consortium defines data formats and usage conventions as well as Internet protocols relevant to the Web
• Members pay fees depending on country, revenues and non-profit/for-profit status (e.g., $953 vs. $63,500)
• Otherwise organized similarly to the IETF, but writes “Recommendations” instead of “Requests for Comments”
• http://www.w3.org/

Back to URLs
• Most (?) Web documents use the “http” scheme
• What is “http” (HyperText Transfer Protocol)?

HTTP
• The default Internet protocol used to deliver data on the World Wide Web
• Usually through TCP/IP sockets on port 80, but can use any port and can be implemented on top of any reliable networking protocol
• A Web browser (HTTP client) sends requests to a Web server (HTTP server), which sends responses back to the client

What’s “TCP/IP”?
• IP = Internet Protocol
– Delivers individual packets from one host to another, based on their IP address (in IPv4, four 8-bit octets as in 128.59.16.20)
– Network routers direct traffic of IP packets
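As a small illustration of the "four 8-bit octets" remark, the standard ipaddress module can show the dotted form from the slide as both four octets and a single 32-bit integer. This is a sketch only; nothing beyond the example address comes from the lecture.

```python
import ipaddress

addr = ipaddress.IPv4Address("128.59.16.20")

octets = list(addr.packed)   # the four 8-bit values, most significant first
as_int = int(addr)           # the same 32 bits viewed as one integer

print(octets)
print(as_int)
print(ipaddress.IPv4Address(as_int))  # round-trips back to dotted notation
```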

What’s “TCP/IP”?
• TCP = Transmission Control Protocol
– Provides an abstraction of reliable, bidirectional connections for the delivery of IP packets to a particular port at a given IP address
– The so-called well known ports (< 1024) are reserved for specific protocols
– By default, HTTP uses port 80; a different port can be given in the URL
– Example: http://www.foo.com:2007/doc.html

HTTP History
• HTTP/0.9 (1990) - simple protocol for raw data transfer
• HTTP/1.0 (RFC 1945, 1996) - Allowed MIME-like messages, containing meta-information about the resources transferred and modifiers on the request/response semantics
• HTTP/1.1 (RFC 2616, 1999)
• HTTP Extension Framework (RFC 2774, 2000)

What is “MIME”?
• Multipurpose Internet Mail Extensions
• Standard representation for “complex” message bodies (numerous RFCs since 1993)
• Examples include messages with embedded graphics or audio clips, messages with file attachments, messages in Japanese or Russian, signed messages

MIME Header Fields
• MIME-Version, Content-Type, Content-Transfer-Encoding, Content-Description, Content-ID, Content-Location, Content-Disposition, followed by the part body
• Discrete (text, image, audio) and Multipart (mixed, digest) content types
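A minimal sketch of these header fields in practice, using Python's standard email.message.EmailMessage. The subject line, text, and the tiny fake GIF attachment are all hypothetical; the point is only to show a multipart/mixed message built from two discrete parts.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Happy New Year"            # hypothetical content
msg.set_content("Plain text part.")          # discrete type: text/plain
msg.add_attachment(                          # second part turns the message into multipart/mixed
    b"GIF89a",                               # placeholder bytes, not a real image
    maintype="image", subtype="gif", filename="dot.gif",
)

print("MIME-Version:", msg["MIME-Version"])
print("Top-level type:", msg.get_content_type())
parts = list(msg.iter_parts())
for part in parts:
    print(part.get_content_type(),
          "| encoding:", part["Content-Transfer-Encoding"],
          "| disposition:", part["Content-Disposition"])
```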

HTTP Request/Response
[Figure labels: HTTP Client; HTTP request; Port 80; Processing; Response; Other port]

HTTP Requests and Responses
• Consist of a start-line, zero or more headers (one per line), an empty line (CRLF) indicating the end of the header fields, and possibly a message-body
• Message body only allowed with certain request methods and response status codes (e.g., 200 OK may carry a body; 204 No Content and 304 Not Modified must not)
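The message layout just described can be sketched as a small parser. This is a simplified illustration, not a conforming implementation: the function name is made up, and it ignores repeated header names and other details a real HTTP library handles.

```python
def parse_http_message(raw: bytes):
    """Split a raw HTTP message into (start_line, headers, body)."""
    # The first empty line (CRLF CRLF) ends the header fields.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return start_line, headers, body


raw = (b"HTTP/1.0 200 OK\r\n"
       b"Content-Type: text/html\r\n"
       b"Content-Length: 13\r\n"
       b"\r\n"
       b"<html></html>")

start_line, headers, body = parse_http_message(raw)
print(start_line)
print(headers)
print(body)
```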

Sample HTTP Exchange
• To retrieve the file at the URL http://www.somehost.com/path/file.html
• First open a socket to the host www.somehost.com, port 80 (use the default port of 80 because none is specified in the URL)

Sample HTTP Exchange (continued)
• Then, send something like the following through the socket:
GET /path/file.html HTTP/1.0
From: [email protected]
User-Agent: HTTPTool/1.0
Accept: text/html, image/gif, image/jpeg
[blank line here]
• The server should respond with something like the following:
HTTP/1.0 200 OK
Server: Apache/1.3.0 (Linux)
Date: Sun, 31 Dec 2006 23:59:59 GMT
Last-Modified: Sun, 31 Dec 2006 23:59:58 GMT
Content-Type: text/html
Content-Length: 1354
[blank line here]
<html> <body> <h1>Happy New Year!</h1> (more file contents) . . . </body> </html>
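The client side of this exchange can be sketched as follows. The helper name build_get_request is hypothetical, and the code only constructs the request bytes; it does not open a connection, since www.somehost.com is a placeholder host.

```python
from urllib.parse import urlsplit


def build_get_request(url: str, user_agent: str = "HTTPTool/1.0"):
    """Return (host, port, request_bytes) for an HTTP/1.0 GET of url."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80          # default port when the URL names none
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    lines = [
        f"GET {target} HTTP/1.0",
        f"User-Agent: {user_agent}",
        "Accept: text/html, image/gif, image/jpeg",
        "",                          # the blank line that ends the headers
        "",
    ]
    return host, port, "\r\n".join(lines).encode("ascii")


host, port, request = build_get_request("http://www.somehost.com/path/file.html")
print(host, port)
print(request.decode("ascii"))
```

With a real host, the bytes could then be sent with socket.create_connection((host, port)) followed by sendall(request), and the response read from the same socket until the server closes it.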

Some Request Headers
• From: gives the email address of whoever's making the request, or running the program doing so (for bots)
• User-Agent: identifies the program that's making the request, in the form "Program-name/x.xx", where x.xx is the alphanumeric version of the program (e.g., browser)
– User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)

Some Response Headers
• Server: analogous to User-Agent:, identifies the server software in the form "Program-name/x.xx"
– Server: Apache/1.3.12 (Unix)
• Last-Modified: gives the modification date of the resource that's being returned, e.g., for use in caching
– Use Greenwich Mean Time, in the format Last-Modified: Tue, 23 Jan 2007 00:00:01 GMT

Start Line
• HTTP Version (0.9, 1.0, 1.1)
• URI
• Method (request) or Status Code (response)

HTTP URIs
• Up to some bounded length (often 255), or “unbounded”; a server that cannot handle a long URI returns status code 414 (Request-URI Too Long)
• Equivalence comparison: the following three URIs are equivalent
http://abc.com:80/~smith/home.html
http://ABC.com/%7Esmith/home.html
http://ABC.com:/%7esmith/home.html
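One way to check this kind of equivalence is to normalize each URI and compare the results. The sketch below is an assumption-laden illustration, not a full RFC 3986 normalizer: the function names are made up, and it ignores userinfo, IPv6 literals, and dot-segment removal.

```python
import re
from urllib.parse import urlsplit, urlunsplit

UNRESERVED = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              "abcdefghijklmnopqrstuvwxyz"
              "0123456789-._~")


def _fix_escape(match):
    char = chr(int(match.group(1), 16))
    # Decode escapes of unreserved characters; otherwise only uppercase the hex digits.
    return char if char in UNRESERVED else "%" + match.group(1).upper()


def normalize_http_uri(uri: str) -> str:
    parts = urlsplit(uri)
    host, _, port = parts.netloc.partition(":")
    host = host.lower()                                   # host names are case-insensitive
    netloc = host if port in ("", "80") else f"{host}:{port}"  # empty or default port is dropped
    path = re.sub(r"%([0-9A-Fa-f]{2})", _fix_escape, parts.path) or "/"
    return urlunsplit((parts.scheme.lower(), netloc, path, parts.query, parts.fragment))


uris = [
    "http://abc.com:80/~smith/home.html",
    "http://ABC.com/%7Esmith/home.html",
    "http://ABC.com:/%7esmith/home.html",
]
normalized = [normalize_http_uri(u) for u in uris]
print(normalized)
```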

Request Messages
• Method SP Request-URI SP HTTP-Version CRLF
• GET http://www.w3.org/pub/WWW/TheProject.html HTTP/1.1
• Equivalent to client making TCP connection to www.w3.org on port 80, then sending
GET /pub/WWW/TheProject.html HTTP/1.1
Host: www.w3.org
• Host field allows for virtual hosts

What is a “virtual host”?
• Enables the same machine to host multiple domain names, sometimes at the same IP address (name-based virtual hosting)
• Important for website hosting (e.g., www.foo.com maps to /www/foo/site1 and www.bar.com maps to /www/bar/site2), but usually there can be only one secure https website per IP address/port

GET
• Retrieve whatever information (in the form of an entity) is identified by the URI
• If the URI refers to a data-producing process, it is the produced data (given the input parameters after the “?”, if any) that is returned as the entity in the response - not the source text of the process (unless that text happens to be the output of the process)
• http://foo.com/run.cgi?name1=val1&name2=val2
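For illustration, this is roughly how a data-producing process might recover its input parameters from the example URL above, using the standard urllib.parse module. It is a sketch of the decoding step only, not of any particular CGI framework.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://foo.com/run.cgi?name1=val1&name2=val2"
parts = urlsplit(url)

# parse_qs maps each parameter name to a list, because a name may be repeated.
params = parse_qs(parts.query)

print(parts.path)   # the process to run
print(params)       # its input parameters
```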

Conditional and Partial GET
• Conditional if the request message includes an If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, or If-Range header field
• Partial if the request message includes a Range header field
• Avoids retrieving data the client doesn’t need (e.g., when part or all of the resource is already in its cache and still up to date)
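A server-side sketch of the If-Modified-Since case: compare the resource's modification time with the date the client sent and answer 304 Not Modified when the cached copy is still current. The function name and the sample dates are assumptions for illustration; the other conditional headers are not handled.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def conditional_get_status(last_modified: datetime,
                           if_modified_since: Optional[str]) -> int:
    """Return 304 if the client's cached copy is still current, else 200."""
    if if_modified_since is None:
        return 200                       # unconditional GET
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200                       # unparseable date: ignore the condition
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    return 304 if last_modified <= since else 200


# Hypothetical resource, last changed one second before midnight.
last_modified = datetime(2006, 12, 31, 23, 59, 58, tzinfo=timezone.utc)

print(conditional_get_status(last_modified, "Sun, 31 Dec 2006 23:59:59 GMT"))
print(conditional_get_status(last_modified, "Sat, 30 Dec 2006 00:00:00 GMT"))
print(conditional_get_status(last_modified, None))
```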

HEAD
• Identical to GET except that the server must not return a message-body in the response - only returns headers
• Often used for testing hypertext links for validity and modification
• Can mark cache entries as stale if certain header information changes (e.g., length, last-modified)

POST
• Used to request that the origin server accept the entity enclosed in the request as a new subordinate of the resource identified by the Request-URI in the Request-Line
• Actual function performed by the POST method is determined by the server, usually dependent on the Request-URI

POST supports several functions
• Annotation of an existing resource
• Posting a message to a bulletin board, newsgroup, mailing list, or similar group of articles
• Providing a block of data, such as the result of submitting a form, to a data-handling process
• Extending a database through an append operation

POST vs. GET
• GET can be used to send small amounts of data to a server, with the data following the ? character
• The rest of the request-URI (before the ?) refers to some kind of processing program
GET /path/script.cgi?field1=value1&field2=value2 HTTP/1.0
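The sketch below encodes the same two form fields both ways: in the request-URI for GET, and in the message body for POST. The POST variant, including its Content-Type header for form data, is an illustrative assumption; the lecture only shows the GET line.

```python
from urllib.parse import urlencode

fields = {"field1": "value1", "field2": "value2"}
encoded = urlencode(fields)   # name=value pairs joined by "&"

# GET: the data travels in the request-URI after the "?".
get_request = f"GET /path/script.cgi?{encoded} HTTP/1.0\r\n\r\n"

# POST: the same data travels in the message body, after the blank line.
post_request = (
    "POST /path/script.cgi HTTP/1.0\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    f"Content-Length: {len(encoded)}\r\n"
    "\r\n"
    f"{encoded}"
)

print(get_request)
print(post_request)
```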

PUT and DELETE
• Often unsupported (501 Not Implemented)
• PUT requests that the enclosed entity be stored under the supplied Request-URI
• May create a new resource at a new URI, or modify an existing resource already at that URI
• DELETE requests that the origin server delete the resource identified by the Request-URI
• May be overridden, e.g., by human intervention, even if the status code indicates successful completion

OPTIONS and TRACE
• OPTIONS allows the client to determine the requirements associated with a resource, or the capabilities of a server (OPTIONS *), without implying a resource action or initiating a resource retrieval
• TRACE is used to invoke application-layer loop-back of the request message, allowing the client to see what is being received at the other end of the request chain, for testing or diagnostic purposes

HTTP is “Stateless”
• Server doesn’t remember anything about client between connections
• Not even between requests during the same persistent connection, except TCP data
• But some state can be encoded in complex URLs or in forms
• Or saved on client in “cookies”

Cookies
• Opaque string associated with a website, stored at the browser
• Created in an HTTP response with “Set-Cookie: ”
• In all subsequent requests to this site, until the cookie’s expiration, the client sends the HTTP header “Cookie: ”
• Name-value pairs, separated by semicolons
– Cookie: user="alex"; lastvisit="20070123-11:00"
• Interpretation up to the Web application
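Both directions of the cookie exchange can be sketched with the standard http.cookies module. The cookie names follow the example above; the Path and Max-Age attributes are assumptions added only to show where expiration information goes.

```python
from http.cookies import SimpleCookie

# Server side, in a response: create the cookie.
jar = SimpleCookie()
jar["user"] = "alex"
jar["user"]["path"] = "/"          # assumed attribute
jar["user"]["max-age"] = 3600      # assumed lifetime of one hour
set_cookie_header = jar.output()   # a complete "Set-Cookie: ..." header line
print(set_cookie_header)

# Server side, on a later request: parse the value of the Cookie header.
received = SimpleCookie()
received.load('user=alex; lastvisit="20070123-11:00"')
for name, morsel in received.items():
    print(name, "=", morsel.value)
```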

Response Messages
• HTTP-Version SP Status-Code SP Reason-Phrase CRLF
• Example: HTTP/1.0 404 Not Found
• Status code: 3-digit integer result code of the attempt to understand and satisfy the request
• Reason phrase: short textual description of the Status-Code

Status Codes
• Applications need only understand the class indicated by the first digit, and may treat any unrecognized code as equivalent to the x00 code of that class
• 1xx: Informational - Request received, continuing process ("100" : Continue, relevant to persistent connections)
• 2xx: Success - The action was successfully received, understood and accepted ("200" : OK)

Status Codes
• 3xx: Redirection - Further action must be taken in order to complete the request ("300" : Multiple Choices)
• 4xx: Client Error - The request contains bad syntax or cannot be fulfilled ("400" : Bad Request)
• 5xx: Server Error - The server failed to fulfill an apparently valid request ("500" : Internal Server Error)
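The first-digit rule lends itself to a very small sketch: parse the status line, then classify the code by its hundreds digit. The function names are made up for illustration.

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def parse_status_line(line: str):
    """Split 'HTTP-Version SP Status-Code SP Reason-Phrase CRLF' into its parts."""
    version, code, reason = line.rstrip("\r\n").split(" ", 2)
    return version, int(code), reason


def status_class(code: int) -> str:
    """Classify a status code by its first digit alone."""
    if not 100 <= code <= 599:
        raise ValueError(f"not an HTTP status code: {code}")
    return STATUS_CLASSES[code // 100]


version, code, reason = parse_status_line("HTTP/1.0 404 Not Found\r\n")
print(version, code, reason, "->", status_class(code))
```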

HTTP Request/Response
• In HTTP 1.0, a connection is established by the client prior to each request and closed by the server after sending the response
• Either party may close the connection prematurely, due to user action, automated time-out, or program failure
• Closing of the connection by either or both parties always terminates the current request, regardless of its status
• But TCP connections are expensive

HTTP 1.1 “Persistent Connection”
• Many Web pages consist of several files on the same server
• If an HTTP 1.1 client sends multiple (pipelined) requests through a single connection, the server should send responses back in the same order
• Intermediate responses "100" : Continue

How does the connection finally get closed?
• If a request includes the "Connection: close" header, that request is the final one for the connection and the server should close the connection after sending the response
• The server should also close an idle connection after some timeout period

Advantages of Persistent Connections
• Requests and responses can be pipelined - a client makes multiple requests without waiting for each response
• Network congestion is reduced because fewer packets are spent on TCP opens, and because TCP gets sufficient time to determine the congestion state of the network
• Latency on subsequent requests is reduced since there is no time spent in TCP's connection opening handshake

Basic HTTP Architecture

Intermediary
• Program sitting in the path between HTTP clients and servers
• Acts as a server to clients and as a client to origin servers or other intermediaries

Proxy
• Forwarding agent
• Receives request, rewrites all or parts of the message, and forwards the reformatted request toward the server identified by the URI
• Used for load balancing, anonymizing clients

Gateway
• Receiving agent
• Acts as a layer above some other server(s) and, if necessary, translates the requests to the underlying server's protocol
• Example: Web mail accessing an IMAP server
– A URL identifies the mail server, mailbox, password
– Converts the HTTP request to an IMAP request, gets the IMAP response, converts it to an HTTP response

Tunnel
• Relay point between two connections without changing the message
• Looks at the first line of the HTTP message to locate the host to be contacted and accept the request
• Simply relays bits between the two connection points
• Does not parse or interpret messages
• Used when the communication needs to pass through a firewall

Transcoder
• Modifies data as it passes to clients, e.g., to filter ads
• Particularly useful for wireless and/or constrained devices
– Convert HTML to WML
– Modify content to fit small screen
– Convert modality of interaction, e.g., driving directions from displaying text to playing audio

Caching
• Request/response chain is shortened if one of the participants along the chain has a cached response applicable to the request
• Used to reduce latency and network traffic

HTTP 1.1 Caching Support
• Allows a server to determine caching policies in its response
– Expires: the date and time after which the response is considered stale, in the same format as Last-Modified
– Cache-Control: no-store – don’t cache at all
– Cache-Control: no-cache – validate with the server every time before reuse, or don’t cache
– Cache-Control: private – can’t keep in a public (shared) cache
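How a cache might act on these directives can be sketched as below. Both function names are hypothetical, and the parser is deliberately naive: it splits on commas and so would mishandle directives whose quoted arguments contain commas.

```python
def parse_cache_control(value: str) -> dict:
    """Parse a Cache-Control header value into {directive: argument or True}."""
    directives = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, arg = item.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') or True
    return directives


def may_store(directives: dict, shared_cache: bool) -> bool:
    """Decide whether a cache is allowed to keep the response at all."""
    if "no-store" in directives:
        return False
    if shared_cache and "private" in directives:
        return False
    return True   # "no-cache" responses may be stored but must be revalidated before reuse


policy = parse_cache_control("private, max-age=60")
print(policy)
print("browser cache may store:", may_store(policy, shared_cache=False))
print("proxy cache may store:  ", may_store(policy, shared_cache=True))
```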

HTTP 1.1 Chunked Encoding
• Faster response for dynamically-generated pages or very large pages
• Allows the beginning of a response to be sent before its total length is known
• Each chunk is prefixed by its size in bytes, written in hexadecimal
• A zero size chunk indicates the end of the response message
• If a server is using chunked encoding it must set the Transfer-Encoding header to "chunked".
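The framing can be sketched with a pair of helper functions. Their names are made up, and the decoder assumes well-formed input and ignores trailers, so this illustrates the format rather than serving as a robust implementation.

```python
def encode_chunked(pieces) -> bytes:
    """Frame an iterable of byte strings as a chunked message body."""
    out = bytearray()
    for piece in pieces:
        if piece:   # skip empty pieces: a zero-size chunk would end the body early
            out += f"{len(piece):X}\r\n".encode("ascii") + piece + b"\r\n"
    out += b"0\r\n\r\n"   # zero-size chunk marks the end of the message
    return bytes(out)


def decode_chunked(data: bytes) -> bytes:
    """Reassemble the original body from well-formed chunked data."""
    body = bytearray()
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        size = int(data[pos:line_end].split(b";")[0], 16)  # hex size; extensions ignored
        if size == 0:
            return bytes(body)
        start = line_end + 2
        body += data[start:start + size]
        pos = start + size + 2   # skip the chunk data and its trailing CRLF


# The server can emit each piece as soon as it has been generated.
encoded = encode_chunked([b"<h1>Happy", b" New Year!</h1>"])
print(encoded)
print(decode_chunked(encoded))
```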