URLs and Resources

1

URLs and Resources

Herng-Yow Chen

2

Outline

Navigating the Internet’s resources

URL syntax and what the various URLs mean and do

URL shortcuts that many web clients support: relative URLs and expanded URLs

URL encoding and character rules

Common URL schemes

The future of URLs, including URNs

3

Navigating a resource by URL, which tells a web client

Web page: http://english.csie.ncnu.edu.tw/demo/index.html

Scheme (how): http

Host (where): english.csie.ncnu.edu.tw

Path (what): /demo/index.html

1. URL scheme: how to access the resource

2. Server location: where the resource is hosted

3. Resource path: what particular local resource on the server is being requested

4

URLs

URLs can direct you to resources available through protocols other than HTTP.

An email account: mailto:hychen@csie.ncnu.edu.tw

A file residing on an FTP server: ftp://ftp.ncnu.edu.tw/a_file.txt

A video streamed by a video server: rtsp://www.cnn.com/headline.rm

Most URLs have the same “scheme://server location/path” structure.

6

URL Syntax

<scheme>://<user>:<password>@<host>:<port>/<path>;<params>?<query>#<frag>
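As a rough illustration, the general syntax can be taken apart with Python's standard urllib.parse module. The URL below is made up for this sketch: it reuses the joes-hardware.com example host and adds invented port, params, query, and fragment values. Note that urlparse only splits params off the last path segment.

```python
from urllib.parse import urlparse

# Hypothetical URL that exercises every component of the general syntax.
url = "http://joe:joespasswd@www.joes-hardware.com:80/sales_info.txt;type=a?item=12#top"

parts = urlparse(url)

# Map the slide's component names onto the attributes urlparse exposes.
components = {
    "scheme": parts.scheme,
    "user": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "params": parts.params,
    "query": parts.query,
    "frag": parts.fragment,
}

for name, value in components.items():
    print(f"{name:9} {value}")
```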

7

Scheme: what protocol to use

The scheme is the main identifier of how to access a given resource.

The scheme must start with an alphabetic character, and it is separated from the rest of the URL by the first “:” character.

Scheme names are case-insensitive.
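The scheme rules above can be sketched as a small check. The helper name split_scheme is hypothetical, and the set of characters allowed after the first letter (letters, digits, "+", "-", ".") comes from the generic URI syntax in RFC 3986 rather than from this slide.

```python
import re

# A scheme starts with a letter and ends at the first ":".
# The characters allowed after the first letter follow RFC 3986.
SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):")

def split_scheme(url):
    """Return (scheme, rest); scheme is None when the URL has no valid scheme."""
    match = SCHEME_RE.match(url)
    if match is None:
        return None, url
    # Scheme names are case-insensitive, so normalize to lowercase.
    return match.group(1).lower(), url[match.end():]

print(split_scheme("HTTP://www.joes-hardware.com/"))
print(split_scheme("mailto:hychen@csie.ncnu.edu.tw"))
print(split_scheme("./hammers.html"))
```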

8

Usernames and Passwords

Many servers require a username and password before you can access data through them.

For example:

ftp://ftp.prep.ai.mit.edu/pub/gnu

ftp://anonymous@ftp.prep.ai.mit.edu/pub/gnu

ftp://anonymous:my_passwd@ftp.prep.ai.mit.edu/pub/gnu

http://joe:joespasswd@www.joes-hardware.com/sales_info.txt

Default username and password: if none is supplied, “anonymous” is used as the username. For the password, Internet Explorer sends “IEUser”, while Netscape sends “mozilla”.

9

Hosts and Ports

The host component (IP address or domain name) identifies the host machine on the Internet that has access to the resource.

The port component identifies the network port on which the server is listening.

Different services use different default ports on a machine.

HTTP: 80

FTP: 21

Telnet: 23

SMTP: 25
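A minimal sketch of how a client might fill in a missing port from the defaults listed above. The helper name host_and_port, the lookup table keyed by lowercase scheme name, and the port 8080 in the usage line are assumptions made for this example.

```python
from urllib.parse import urlsplit

# Default ports from the slide, keyed by lowercase scheme name (an assumption
# of this sketch; the slide lists them by service).
DEFAULT_PORTS = {"http": 80, "ftp": 21, "telnet": 23, "smtp": 25}

def host_and_port(url):
    """Return (host, port), falling back to the scheme's default port."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return parts.hostname, port

print(host_and_port("ftp://ftp.ncnu.edu.tw/a_file.txt"))
print(host_and_port("http://www.csie.ncnu.edu.tw:8080/course/1998.html"))
```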

10

Paths

The path component of the URL specifies where on the server machine the resource lives.

The path often resembles a hierarchical filesystem path. For example:

http://www.csie.ncnu.edu.tw/course/1998.html

The path in the URL is “/course/1998.html”, which resembles a filesystem path on a UNIX filesystem.

The path component for HTTP URLs can be divided into path segments separated by “/”. Each path segment can have its own params component (described later).

11

Parameters

For many schemes, a simple host and path to the object just aren’t enough.

Aside from which port the server is listening on, and whether or not you have access to the resource with a username and password, many protocols require more information to work.

For example:

ftp://ftp.ncnu.edu.tw/image.gif;type=a

ftp://ftp.ncnu.edu.tw/program.exe;type=i
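Since each path segment can carry its own params, one way to read them is to split the path on “/” and then split each segment on “;”. This is only a sketch: the helper name path_segments is hypothetical, and the second URL in the usage lines is invented to show params on more than one segment.

```python
from urllib.parse import urlsplit

def path_segments(url):
    """Return a list of (segment, params) pairs for the URL's path."""
    path = urlsplit(url).path  # urlsplit leaves ";params" inside the path
    result = []
    for raw in path.split("/")[1:]:  # skip the empty string before the leading "/"
        segment, _, param_text = raw.partition(";")
        params = {}
        for item in param_text.split(";"):
            if item:
                name, _, value = item.partition("=")
                params[name] = value
        result.append((segment, params))
    return result

print(path_segments("ftp://ftp.ncnu.edu.tw/program.exe;type=i"))
# Hypothetical URL with params on two segments.
print(path_segments("http://www.joes-hardware.com/hammers;sale=false/index.html;graphics=true"))
```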

12

Query strings

Some resources, such as databases, can be queried according to input strings. For example:

http://www.xxx.tw/a.cgi?id=123&name=abc

There is no requirement for the format of the query component, except that some characters are illegal. By convention, many gateways expect the query to be formatted as a series of “name=value” pairs, separated by “&” characters.
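The “name=value” convention can be illustrated with Python's urllib.parse, using the example URL from this slide. This sketch only shows the conventional format; a gateway is free to interpret its query string differently.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

url = "http://www.xxx.tw/a.cgi?id=123&name=abc"

# Everything between "?" and any "#" is the query component.
query = urlsplit(url).query

# Split the query into (name, value) pairs on "&" and "=".
pairs = parse_qsl(query)

# Going the other way: build a query string from the pairs.
rebuilt = urlencode(pairs)

print(query)
print(pairs)
print(rebuilt)
```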

13

Query Strings

http://english.csie.ncnu.edu.tw/course/NWSMLViewer.php?lectureid=rctlee-20030909125212

[Figure: the client sends the URL above across the Internet to the server, where the “viewer” gateway receives the query lectureid=rctlee-20030909125212.]

14

Fragments

Finer pieces of a resource, such as sections in a large HTML document, can be addressed conveniently with a fragment. For example:

http://engquiz.csie.ncnu.edu.tw/e-book/html/B001.html#page10

Because HTTP servers generally deal only with entire objects, not with fragments of objects, clients don’t pass fragments along to servers. That is, the whole object is retrieved, but only part of its content is displayed.

Note that with the Range Request feature of HTTP/1.1, agents may request byte ranges of objects. (later lectures)

15

Fragments

[Figure: a client and the server www.csie.ncnu.edu.tw, connected over the Internet]

(a) User selects link to “http://www.csie.ncnu.edu.tw/~hychen/web_tech/#Resource”

(b) Browser makes request to http://www.csie.ncnu.edu.tw/~hychen/web_tech/ (the fragment is NOT sent to the server)

(c) Server returns entire HTML page

(d) Browser displays HTML page, scrolling down to start at the named “Resource” fragment
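The client-side split between what is requested and what is kept for display can be sketched with urldefrag from Python's urllib.parse, using the link from step (a). This shows only the URL handling, not the request or the scrolling.

```python
from urllib.parse import urldefrag

link = "http://www.csie.ncnu.edu.tw/~hychen/web_tech/#Resource"

# The part before "#" is what the browser requests from the server;
# the fragment stays on the client and is used after the page arrives.
request_url, fragment = urldefrag(link)

print(request_url)
print(fragment)
```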

16

URL shortcuts

Web clients understand and use a few URL shortcuts.

Many browsers also support automatic expansion of URLs, where the user can type in a key (memorable) part of a URL, and the browser fills in the rest.

Relative URLs

Base URLs

Resolving relative references

Expanded URLs

17

Relative URLs

URLs come in two flavors: absolute and relative.

So far, we have looked only at absolute URLs, which contain all the information you need to access a resource.

A relative URL, on the other hand, is incomplete. To get all the information needed to access a resource, a relative URL must be interpreted on the basis of another URL, called its base.

18

HTML snippet with relative URL

<HTML>
<HEAD><TITLE>Joe’s Tools</TITLE></HEAD>
<BODY>
<H1>Tools page</H1>
<H2>Hammers</H2>
<P>Joe’s HARDWARE online has the largest selection of <A HREF="./hammers.html">hammers</A> on earth.</P>
</BODY>
</HTML>

19

Using a base URL

Base URL:

http://www.joes-hardware.com/tools.html

Relative URL:

./hammers.html

New absolute URL:

http://www.joes-hardware.com/hammers.html
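The conversion above can be reproduced with urljoin from Python's urllib.parse, which resolves a relative URL against a base. The second relative URL, /sales/index.html, is invented for this sketch to show a reference that starts from the server root.

```python
from urllib.parse import urljoin

base = "http://www.joes-hardware.com/tools.html"

# "./hammers.html" is resolved relative to the directory of the base URL.
absolute = urljoin(base, "./hammers.html")

# Hypothetical second reference: a leading "/" replaces the whole base path.
from_root = urljoin(base, "/sales/index.html")

print(absolute)
print(from_root)
```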

20

Base URLs

The first step in the conversion process is to find a base URL, which can come from a few places.

Explicitly provided in the resource: the <BASE> tag defines the base URL.

Base URL of the encapsulating resource: if the document does not explicitly specify a base URL, use the URL of the resource in which the document is embedded as the base, as in the example on the preceding slide.

No base URL: in some instances, there is no base URL. This often means that you have an absolute URL; however, sometimes you just have an incomplete or broken URL.
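A minimal sketch of the first two sources of a base URL, using Python's standard html.parser: look for an explicit <BASE> tag, and otherwise fall back to the URL of the document itself. The names BaseFinder and find_base_url, and the HTML strings in the usage lines, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class BaseFinder(HTMLParser):
    """Remember the href of the first <BASE> tag seen."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")

def find_base_url(html_text, document_url):
    """Return the explicit <BASE> URL if present, else the document's own URL."""
    finder = BaseFinder()
    finder.feed(html_text)
    if finder.base_href:
        # The <BASE> href may itself be relative to the document URL.
        return urljoin(document_url, finder.base_href)
    return document_url

page_url = "http://www.joes-hardware.com/tools.html"
print(find_base_url('<HTML><HEAD><BASE HREF="http://www.joes-hardware.com/catalog/"></HEAD></HTML>', page_url))
print(find_base_url("<HTML><HEAD></HEAD></HTML>", page_url))
```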

21

Resolving relative references

22

Expanded URLs

Some browsers try to expand URLs automatically, either after you submit the URL or while you’re typing. This provides users with a shortcut: they don’t have to type in the complete URL.

Hostname expansion

Ex: yahoo -> www.yahoo.com

History expansion

Ex: http://www.ncnu -> http://www.ncnu.edu.tw
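The two kinds of expansion can be sketched as follows. Real browsers use their own, more elaborate heuristics; the function names and the simple rules here (wrap a bare word in “www.” and “.com”, and match typed text against the start of previously visited URLs) are assumptions made for illustration.

```python
def expand_hostname(typed):
    """Hostname expansion: turn a bare word into a full hostname."""
    # Anything that already looks like a hostname or URL is left alone.
    if "." in typed or "/" in typed or ":" in typed:
        return typed
    return f"www.{typed}.com"

def expand_from_history(typed, history):
    """History expansion: complete the typed prefix from visited URLs."""
    for url in history:
        if url.startswith(typed):
            return url
    return typed

history = ["http://www.ncnu.edu.tw"]
print(expand_hostname("yahoo"))
print(expand_from_history("http://www.ncnu", history))
```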

23

Shady characters in URLs

URLs were designed to be portable, to uniformly name all the resources on the Internet. This means that URLs will be transmitted through various protocols.

Because different protocols (schemes) use different transmission mechanisms, it is important that URLs can be transmitted safely, that is, without losing information, through any Internet protocol.

Some protocols, such as the Simple Mail Transfer Protocol (SMTP) for email, use a 7-bit encoding for messages; this can strip off certain characters if the source is encoded in 8 bits or more.

24

Shady characters in URLs

URLs are permitted to contain only characters from a relatively small, universally safe alphabet.

In addition to being transportable, URLs should be readable. Hence, invisible, nonprinting characters are also prohibited in URLs, even though these characters may pass through mailers.

To complicate matters further, URLs also need to be complete. Sooner or later, people will want URLs to contain binary data or characters outside the universally safe alphabet. So, an escape mechanism was added.

25

The URL Character Set

US-ASCII is very portable, due to its long legacy. It uses 7 bits to represent most keys available on an English typewriter and a few nonprinting control characters for text formatting and hardware signaling. But it doesn’t support the inflected characters common in European languages or the many non-Romanic languages.

URLs may also need to contain arbitrary binary data.

Escape sequences allow the encoding of arbitrary values using a restricted subset of the US-ASCII character set, yielding portability and completeness.

26

Encoding mechanism

An unsafe character is represented by an “escape” notation, consisting of a percent sign (%) followed by two hexadecimal digits giving the character’s ASCII code.

For example:

~ (0x7E): http://www.ncnu.edu.tw/%7Ehychen

Space (0x20): http://www.abc.com/web%20tools.html

% (0x25): http://www.abc.com/100%25satisfaction.html
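The escape notation can be tried out in Python. The helper escape_char below is a hypothetical one-character escaper written for this sketch; quote and unquote are the standard library functions. Recent Python versions treat “~” as safe and leave it unescaped in quote, so the tilde example is shown in the decoding direction.

```python
from urllib.parse import quote, unquote

def escape_char(ch):
    """Escape one US-ASCII character as "%" plus two hexadecimal digits."""
    return f"%{ord(ch):02X}"

# The three characters from the slide and their escape sequences.
escaped = {ch: escape_char(ch) for ch in ("~", " ", "%")}

# Decoding an escaped URL restores the original characters.
decoded = unquote("http://www.ncnu.edu.tw/%7Ehychen")

# Encoding path text escapes the space and the percent sign.
encoded_space = quote("web tools.html")
encoded_percent = quote("100%satisfaction.html")

print(escaped)
print(decoded)
print(encoded_space)
print(encoded_percent)
```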

27

Character Restrictions

Character          Reservation/Restriction
%                  Escape token
/                  Path delimiter
.                  Path component
..                 Path component
#                  Fragment delimiter
?                  Query-string delimiter
;                  Params delimiter
:                  Delimits the scheme, user/password, and host/port
$ , +              Reserved
@ & =              Reserved; special meaning in some schemes
{ } | \ ^ ~ [ ] ’  Restricted; unsafe handling by various transport agents, such as gateways
< > ”              Unsafe; should be encoded because they have meaning outside the scope of the URL
0x00-0x1F, 0x7F    Restricted; fall within the nonprintable range
>0x7F              Restricted; do not fall within the 7-bit range of US-ASCII

28

Common scheme format

http, https

mailto

ftp

rtsp, rtspu

file

news

telnet

29

The Future: URN?

[Figure: a client resolving a persistent URL through the resolver purl.oclc.org, then fetching the resource from www.joes-hardware.com]

STEP 1: Ask the resource resolver what the Joe’s Hardware URL is (Get http://purl.oclc.org/jhardware/). Receive from the resolver the current location of the resource (Actual: http://www.joes-hardware.com/).

STEP 2: Get the actual URL for the resource (Get http://www.joes-hardware.com).

30

URI: Universal Resource Identifier

URIs are defined in RFC 1630 (1994). URI is a superset of URL and URN.

Full URI: proto://hostname/path (identifies the server)

Ex: http://www.csie.ncnu.edu.tw:80/~hychen/

Partial URI: /path (no server mentioned)

Ex: /~hychen/

31

URLs information

http://www.w3.org/Addressing/
The W3C page about naming and addressing URIs and URLs.

http://www.ietf.org/rfc/rfc1738.txt
RFC 1738, “Uniform Resource Locators (URL),” by T. Berners-Lee, L. Masinter, and M. McCahill.

http://www.ietf.org/rfc/rfc2396.txt
RFC 2396, “Uniform Resource Identifiers (URI): Generic Syntax,” by T. Berners-Lee, R. Fielding, and L. Masinter.

http://www.ietf.org/rfc/rfc2141.txt
RFC 2141, “URN Syntax,” by R. Moats.

http://purl.oclc.org
The persistent uniform resource locator web site.

http://www.ietf.org/rfc/rfc1808.txt
RFC 1808, “Relative Uniform Resource Locators,” by R. Fielding.
