Cambia Research - Supporting the Microsoft .NET Developer Community

Updated: 03:01 AM CT Jan 10, 2007
Posted: 10:44 PM CT Jan 09, 2007

Parsing URLs with Regular Expressions and the Regex Object

And the Anatomy of a URI (Uniform Resource Identifier)

Author: Steve Lautenschlager


 Summary

The .NET Framework contains a class called Uri that provides a lot of functionality for dealing with URLs and URIs, but it's not always easy to get at the individual pieces of a URL. I present a technique here that uses regular expressions to extract the subparts of a URI in one fell swoop. But first let's talk about the parts of a URI.

For the code go straight to Page 2. Read on below about the anatomy of a URI.

 Anatomy of a URI

The following information is taken from the official Network Working Group document RFC 3986, "Uniform Resource Identifier (URI): Generic Syntax." For the whole kit and caboodle, read the RFC itself.

URI stands for Uniform Resource Identifier and is a "compact sequence of characters that identifies an abstract or physical resource."

URL stands for Uniform Resource Locator. URLs are the subset of URIs that "provide a means of locating the resource by describing its primary access mechanism (e.g., its network location)."

Now, let's talk about the syntax of the URI, or the various pieces that combine to form the whole. These consist of the scheme, authority, path, query and fragment.

 Components of a URI

  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/\__________/ \__/
   |           |             |           |        |
scheme     authority       path        query   fragment
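
As a rough sketch of how the five components can be pulled apart in one pass, the block below uses the generic pattern given in Appendix B of RFC 3986 with the .NET Regex class. It is not necessarily the pattern used on Page 2 of this article, and the class name UriSplitter and method name Split are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UriSplitter
{
    // Generic URI-splitting pattern from RFC 3986, Appendix B.
    // Groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
    private static readonly Regex Rfc3986 = new Regex(
        @"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?");

    // Returns { scheme, authority, path, query, fragment }.
    // A component that is absent from the input comes back as "".
    public static string[] Split(string uri)
    {
        Match m = Rfc3986.Match(uri);
        return new[]
        {
            m.Groups[2].Value,
            m.Groups[4].Value,
            m.Groups[5].Value,
            m.Groups[7].Value,
            m.Groups[9].Value
        };
    }
}
```

Calling UriSplitter.Split("foo://example.com:8042/over/there?name=ferret#nose") yields foo, example.com:8042, /over/there, name=ferret and nose. Note that this particular pattern leaves the "?" and "#" delimiters out of the query and fragment.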

 Scheme

The scheme is the first item in a URL, such as http, which indicates that the URI uses the Hypertext Transfer Protocol. The scheme ends at the first colon (:). Examples of schemes are:

http      World Wide Web server (http://www.w3.com)
https     Secure World Wide Web server (https://www.w3.com)
ftp       File Transfer Protocol (ftp://www.ftpx.com)
gopher    The Gopher protocol
mailto    Electronic mail address (mailto:info@cambiaresearch.com)
news      USENET newsgroups
nntp      USENET newsgroups using NNTP access
telnet    Interactive sessions on a server
wais      Wide Area Information Servers
file      File on the local system
prospero  Prospero Directory Services
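
To make the idea of a scheme concrete, here is a minimal sketch that reads the scheme off the front of a URI with a regular expression. The pattern follows the RFC 3986 rule that a scheme starts with a letter and continues with letters, digits, "+", "-" or "."; the names SchemeReader and GetScheme are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SchemeReader
{
    // A letter, then letters, digits, "+", "." or "-", ended by the first colon.
    private static readonly Regex SchemePattern =
        new Regex(@"^([A-Za-z][A-Za-z0-9+.\-]*):");

    // Returns the lower-cased scheme, or "" when the text has no scheme.
    public static string GetScheme(string uri)
    {
        Match m = SchemePattern.Match(uri);
        return m.Success ? m.Groups[1].Value.ToLowerInvariant() : "";
    }
}
```

Schemes are case-insensitive, which is why the sketch lower-cases the result: GetScheme("HTTP://www.w3.com") and GetScheme("http://www.w3.com") both give http, while a relative reference such as "/default.htm" gives an empty string.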

 Authority

In a URL the authority follows the double slash (//) after the scheme. It is usually the host, most often a domain name, and may end with a port number separated from the host by a colon.

In the following example, the authority is www.cambiaresearch.com

http://www.cambiaresearch.com

In the following example, the authority is www.cambiaresearch.com:81

https://www.cambiaresearch.com:81

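The following example has no authority component at all, because an authority is always introduced by a double slash (//). In a mailto URI the address info@cambiaresearch.com is the path.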

mailto:info@cambiaresearch.com
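
For http and https URLs, the .NET Uri class already exposes the authority and its parts through the Authority, Host and Port properties. The sketch below simply formats them into one string; the class name AuthorityReader and method name Describe are assumptions made for this example.

```csharp
using System;

public static class AuthorityReader
{
    // Parses the URL with System.Uri and reports the authority, host and port.
    // Uri.Authority omits the port when it is the default for the scheme,
    // while Uri.Port always returns a number (80 for http, 443 for https).
    public static string Describe(string url)
    {
        Uri uri = new Uri(url);
        return string.Format("authority={0}; host={1}; port={2}",
            uri.Authority, uri.Host, uri.Port);
    }
}
```

Describe("https://www.cambiaresearch.com:81") returns "authority=www.cambiaresearch.com:81; host=www.cambiaresearch.com; port=81", and Describe("http://www.cambiaresearch.com") returns "authority=www.cambiaresearch.com; host=www.cambiaresearch.com; port=80".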

 Path

The path component of the URL identifies a specific resource, such as a file or page, at a particular domain. The path is terminated by the end of the URL, by a question mark (?), which signifies the beginning of the query string, or by a number sign (#), which signifies the beginning of the fragment.

The path of the following URL is "/default.htm"

http://www.cambiaresearch.com/default.htm

The path of the following URL is "/snippets/csharp/regex/uri_regex.aspx"

http://www.cambiaresearch.com/snippets/csharp/regex/uri_regex.aspx
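
As a small illustration, the path can also be read with the Uri class: AbsolutePath returns the whole path, and Segments returns it split at each slash. The wrapper names PathReader, GetPath and GetSegments are hypothetical.

```csharp
using System;

public static class PathReader
{
    // The path only: everything after the authority, up to "?" or "#".
    public static string GetPath(string url)
    {
        return new Uri(url).AbsolutePath;
    }

    // The path broken into pieces; each piece keeps its trailing "/".
    public static string[] GetSegments(string url)
    {
        return new Uri(url).Segments;
    }
}
```

For http://www.cambiaresearch.com/snippets/csharp/regex/uri_regex.aspx, GetSegments returns five pieces: "/", "snippets/", "csharp/", "regex/" and "uri_regex.aspx". An http URL with no path at all, such as http://www.cambiaresearch.com, reports a path of "/".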

 Query

The query part of the URL is a way to send information to the page or program that will handle the web request. The query begins with a question mark (?) and is terminated by the end of the URL or by a number sign (#), which signifies the beginning of the fragment.

The query of the following URL is "?id=241"

http://www.cambiaresearch.com/default.htm?id=241

The query of the following URL is "?sourceid=navclient&ie=UTF-8&rls=GGLC,GGLC:1969-53,GGLC:en&q=uri+query"

http://www.google.com/search?sourceid=navclient&ie=UTF-8&rls=GGLC,GGLC:1969-53,GGLC:en&q=uri+query
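
Once the query has been isolated, it is usually broken into name/value pairs. The sketch below assumes the common HTML form encoding ("&" between pairs, "=" between name and value, "+" for a space, %XX escapes), which is a convention rather than part of the URI syntax itself. The names QueryReader and Parse are made up for the example.

```csharp
using System;
using System.Collections.Generic;

public static class QueryReader
{
    // Splits a query such as "?id=241&q=uri+query" into name/value pairs.
    public static Dictionary<string, string> Parse(string query)
    {
        var pairs = new Dictionary<string, string>();
        string trimmed = query.TrimStart('?');   // accept the query with or without "?"
        foreach (string pair in trimmed.Split(new[] { '&' },
                     StringSplitOptions.RemoveEmptyEntries))
        {
            int eq = pair.IndexOf('=');
            string name = eq < 0 ? pair : pair.Substring(0, eq);
            string value = eq < 0 ? "" : pair.Substring(eq + 1);
            pairs[Decode(name)] = Decode(value); // a repeated name keeps its last value
        }
        return pairs;
    }

    // Form encoding writes a space as "+"; the rest is ordinary %XX escaping.
    private static string Decode(string text)
    {
        return Uri.UnescapeDataString(text.Replace('+', ' '));
    }
}
```

Parsing the Google query above gives four pairs, with q decoded to "uri query" and rls left as "GGLC,GGLC:1969-53,GGLC:en".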

 Fragment

In a URL the fragment is used to specify a location within the current page. This is often used in a FAQ with a list of links at the top of the page linking to longer descriptions farther down in the page.

The fragment of the following URL is "contact"

http://www.cambiaresearch.com/default.htm#contact

The fragment of the following URL is "scheme"

http://www.cambiaresearch.com/snippets/csharp/regex/uri_regex.aspx#scheme
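
One detail worth knowing when reading the fragment with the Uri class: its Fragment property includes the leading "#", so the bare name has to be trimmed out. A minimal sketch, with the hypothetical names FragmentReader and GetFragment:

```csharp
using System;

public static class FragmentReader
{
    // Uri.Fragment returns "#contact" (or "" when there is no fragment);
    // strip any leading "#" to get the bare name.
    public static string GetFragment(string url)
    {
        return new Uri(url).Fragment.TrimStart('#');
    }
}
```

GetFragment("http://www.cambiaresearch.com/default.htm#contact") returns "contact", and a URL without a fragment returns an empty string.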
Steve Lautenschlager (steve)
Steve is the founder and creator of Cambia Research. Developing and maintaining the site combines his passions for technology, writing and education.
Steve holds a Ph.D. in particle physics from Duke University, has worked at CERN, the European center for particle physics (where the web was born), and in Microsoft's web division with microsoft.com, msnbc.com and other web properties. Steve is a web consultant specializing in Microsoft .NET technologies.


 
Copyright © Cambia Research 2002-2007. All Rights Reserved. steve [ at ] cambiaresearch.com