Parsing URLs with Regular Expressions and the Regex Object

And the Anatomy of a URI (Uniform Resource Identifier)

By steve on January 09, 2007.
Updated on January 22, 2012.

Summary

While the .NET Framework contains a class called Uri that provides a lot of functionality for dealing with URLs and URIs, it's not always easy to get at the individual pieces of a URL. I present a technique here that uses regular expressions to extract the subparts of a URI in one fell swoop. But first, let's talk about the parts of a URI.

For the code go straight to Page 2. Read on below about the anatomy of a URI.

Anatomy of a URI

The following information is taken from the IETF Network Working Group's URI specification (RFC 3986). For the whole kit and caboodle, see the full document.

URI stands for Uniform Resource Identifier and is a "compact sequence of characters which identifies an abstract or physical resource."

URL stands for Uniform Resource Locator. URLs are a subset of URIs that "provide a means of locating a resource by describing its primary access mechanism, for example, its network location."

Now, let's talk about the syntax of the URI, or the various pieces that combine to form the whole. These consist of the scheme, authority, path, query and fragment.

Components of a URI

  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/\__________/ \__/
   |           |             |           |        |
scheme     authority       path        query   fragment
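
As a rough sketch of how the five components can be pulled apart in one pass, the following adapts the generic regular expression given in RFC 3986, Appendix B, to the .NET Regex class. It is not the pattern from Page 2 of this article; the group names are chosen here for illustration, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Adapted from the generic pattern in RFC 3986, Appendix B.
// The group names are illustrative; they do not come from the article.
var uriPattern = new Regex(
    @"^(?:(?<scheme>[^:/?#]+):)?" +      // scheme, up to the first ':'
    @"(?://(?<authority>[^/?#]*))?" +    // authority, after "//"
    @"(?<path>[^?#]*)" +                 // path, up to '?' or '#'
    @"(?:\?(?<query>[^#]*))?" +          // query, between '?' and '#'
    @"(?:#(?<fragment>.*))?$");          // fragment, after '#'

Match m = uriPattern.Match("foo://example.com:8042/over/there?name=ferret#nose");

foreach (string name in new[] { "scheme", "authority", "path", "query", "fragment" })
{
    Console.WriteLine($"{name,-10} {m.Groups[name].Value}");
}
// scheme     foo
// authority  example.com:8042
// path       /over/there
// query      name=ferret
// fragment   nose
```

Note that this pattern captures the query without its leading "?", whereas the examples later in this article show the query with the "?" included.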

Scheme

The scheme is the first item of a URL, such as http, which indicates that the URI uses the Hypertext Transfer Protocol. Examples of schemes are:

http      World Wide Web Server. (http://www.w3.com)
https     Secure World Wide Web Server. (https://www.w3.com)
ftp       File Transfer Protocol. (ftp://www.ftpx.com)
gopher    The Gopher Protocol.
mailto    Electronic Mail Address. (mailto:info@cambiaresearch.com)
news      USENET newsgroups.
nntp      USENET newsgroups using NNTP access.
telnet    Interactive sessions on server.
wais      Wide Area Information Servers.
file      File on local system.
prospero  Prospero Directory Services.
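
To illustrate, here is a minimal sketch that reads just the scheme from the front of a URI. It assumes the RFC 3986 scheme syntax (a letter followed by letters, digits, "+", "-" or "."); GetScheme is a hypothetical helper name, not something from the .NET Framework or from this article.

```csharp
using System;
using System.Text.RegularExpressions;

// RFC 3986 scheme syntax: a letter, then letters, digits, '+', '-' or '.',
// terminated by the first ':'.
var schemePattern = new Regex(@"^(?<scheme>[A-Za-z][A-Za-z0-9+.\-]*):");

// Hypothetical helper: returns the lower-cased scheme, or "" if there is none.
string GetScheme(string uri)
{
    Match m = schemePattern.Match(uri);
    return m.Success ? m.Groups["scheme"].Value.ToLowerInvariant() : "";
}

Console.WriteLine(GetScheme("http://www.w3.com"));               // http
Console.WriteLine(GetScheme("FTP://www.ftpx.com"));              // ftp
Console.WriteLine(GetScheme("mailto:info@cambiaresearch.com"));  // mailto
```

Schemes are case-insensitive, which is why the sketch lower-cases the result before returning it.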

Authority

In a URL, the authority is commonly called the domain and may include a port number at the end, separated by a colon.

In the following example, the authority is www.cambiaresearch.com

http://www.cambiaresearch.com

In the following example, the authority is www.cambiaresearch.com:81

https://www.cambiaresearch.com:81

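In the following example, the address info@cambiaresearch.com fills the role of the authority. Strictly speaking, a mailto URI has no "//" and therefore no authority component in the generic URI syntax; the address is formally its path.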

mailto:info@cambiaresearch.com
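
A small sketch of splitting the authority of a "scheme://" URL into host and port follows. It assumes a plain host or host:port authority (no user name and no bracketed IPv6 address), and SplitAuthority is a hypothetical helper name used only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumes "scheme://host" or "scheme://host:port"; user info and IPv6
// literals are deliberately not handled in this sketch.
var authorityPattern = new Regex(
    @"^[^:/?#]+://(?<authority>(?<host>[^/?#:]*)(?::(?<port>\d+))?)");

// Hypothetical helper: Port is "" when the URL does not specify one.
(string Authority, string Host, string Port) SplitAuthority(string url)
{
    Match m = authorityPattern.Match(url);
    return (m.Groups["authority"].Value, m.Groups["host"].Value, m.Groups["port"].Value);
}

var plain = SplitAuthority("http://www.cambiaresearch.com");
var withPort = SplitAuthority("https://www.cambiaresearch.com:81");

Console.WriteLine(plain.Authority);     // www.cambiaresearch.com
Console.WriteLine(withPort.Authority);  // www.cambiaresearch.com:81
Console.WriteLine(withPort.Host);       // www.cambiaresearch.com
Console.WriteLine(withPort.Port);       // 81
```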

Path

The path component of the URL identifies a specific file (or page) at a particular domain. The path is terminated by the end of the URL, by a question mark (?), which signals the beginning of the query string, or by a number sign (#), which signals the beginning of the fragment.

The path of the following URL is "/default.htm"

http://www.cambiaresearch.com/default.htm

The path of the following URL is "/snippets/csharp/regex/uri_regex.aspx"

http://www.cambiaresearch.com/snippets/csharp/regex/uri_regex.aspx
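
The three ways a path can end can be checked with a short sketch like the one below. It assumes URLs of the form "scheme://authority/..." and uses a hypothetical helper named GetPath.

```csharp
using System;
using System.Text.RegularExpressions;

// Skips "scheme://authority", then captures everything up to the first
// '?' or '#', or to the end of the URL if neither is present.
var pathPattern = new Regex(@"^[^:/?#]+://[^/?#]*(?<path>[^?#]*)");

// Hypothetical helper for this example.
string GetPath(string url) => pathPattern.Match(url).Groups["path"].Value;

Console.WriteLine(GetPath("http://www.cambiaresearch.com/default.htm"));          // /default.htm
Console.WriteLine(GetPath("http://www.cambiaresearch.com/default.htm?id=241"));   // /default.htm
Console.WriteLine(GetPath("http://www.cambiaresearch.com/default.htm#contact"));  // /default.htm
Console.WriteLine(GetPath("http://www.cambiaresearch.com/snippets/csharp/regex/uri_regex.aspx"));
// /snippets/csharp/regex/uri_regex.aspx
```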

Query

The query part of the URL is a way to send information to the page or program that handles the web request. The query begins with a question mark (?) and is terminated by the end of the URL or by a number sign (#), which signals the beginning of the fragment.

The query of the following URL is "?id=241"

http://www.cambiaresearch.com/default.htm?id=241

The query of the following URL is "?sourceid=navclient&ie=UTF-8&rls=GGLC,GGLC:1969-53,GGLC:en&q=uri+query"

http://www.google.com/search?sourceid=navclient&ie=UTF-8&rls=GGLC,GGLC:1969-53,GGLC:en&q=uri+query
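
As one possible illustration, the sketch below extracts the query (keeping the leading "?", as in the examples above) and then splits it into name=value pairs. GetQuery and ParseQuery are hypothetical helper names; the values are left exactly as they appear in the URL, with no percent-decoding or "+" handling.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// The query runs from the first '?' to the first '#' or the end of the URL.
var queryPattern = new Regex(@"^[^?#]*(?<query>\?[^#]*)");
// One name=value pair, introduced by '?' or '&'.
var pairPattern = new Regex(@"[?&](?<name>[^=&]+)=(?<value>[^&]*)");

// Hypothetical helper: returns "" when the URL has no query.
string GetQuery(string url) => queryPattern.Match(url).Groups["query"].Value;

// Hypothetical helper: raw pairs; a repeated name keeps its last value.
Dictionary<string, string> ParseQuery(string queryString)
{
    var pairs = new Dictionary<string, string>();
    foreach (Match pair in pairPattern.Matches(queryString))
    {
        pairs[pair.Groups["name"].Value] = pair.Groups["value"].Value;
    }
    return pairs;
}

string googleQuery = GetQuery(
    "http://www.google.com/search?sourceid=navclient&ie=UTF-8&rls=GGLC,GGLC:1969-53,GGLC:en&q=uri+query");
Dictionary<string, string> googlePairs = ParseQuery(googleQuery);

Console.WriteLine(GetQuery("http://www.cambiaresearch.com/default.htm?id=241"));  // ?id=241
Console.WriteLine(googlePairs["ie"]);   // UTF-8
Console.WriteLine(googlePairs["rls"]);  // GGLC,GGLC:1969-53,GGLC:en
Console.WriteLine(googlePairs["q"]);    // uri+query
```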

Fragment

In a URL, the fragment specifies a location within the current page. It is often used in a FAQ, where a list of links at the top of the page points to longer descriptions farther down the page.

The fragment of the following URL is "contact"

http://www.cambiaresearch.com/default.htm#contact

The fragment of the following URL is "scheme"

http://www.cambiaresearch.com/snippets/csharp/regex/uri_regex.aspx#scheme
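
A brief sketch of pulling out the fragment is shown below, with the .NET Uri class alongside for comparison. GetFragment is a hypothetical helper name; the Uri.Fragment property is part of the framework and, unlike the examples above, includes the leading "#".

```csharp
using System;
using System.Text.RegularExpressions;

// The fragment is everything after the first '#'.
var fragmentPattern = new Regex(@"#(?<fragment>.*)$");

// Hypothetical helper: returns "" when the URL has no fragment.
string GetFragment(string url) => fragmentPattern.Match(url).Groups["fragment"].Value;

Console.WriteLine(GetFragment("http://www.cambiaresearch.com/default.htm#contact"));
// contact
Console.WriteLine(GetFragment("http://www.cambiaresearch.com/snippets/csharp/regex/uri_regex.aspx#scheme"));
// scheme

// For comparison, System.Uri keeps the '#' marker in its Fragment property.
string uriFragment = new Uri("http://www.cambiaresearch.com/default.htm#contact").Fragment;
Console.WriteLine(uriFragment);
// #contact
```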

User Comments (2)

Posted 2008 May 21 12:43 PM.
Wonderful post! Just what I was looking for - a practical approach to parsing URL's with regular expressions.

wmhogg
Posted 2008 May 29 03:43 AM.
Is it possible to add in something like the Segments property of the uri class into the regular expression, where the Path is split by "/". Or would you say its better to do this with a string.split?

Peter