Parsing URLs with Regular Expressions and the Regex Object

And the Anatomy of a URI (Uniform Resource Identifier)

Summary

While the .NET Framework contains a class called Uri which provides a lot of functionality for dealing with URLs and URIs, it's not always easy to get at the individual pieces of a URL. I present a technique here that uses a regular expression to extract the subparts of a URI in one fell swoop. But first let's talk about the parts of a URI.

Anatomy of a URI

The following information is taken from the official IETF Network Working Group documents. For the whole kit and caboodle see RFC 3986.

URI stands for Uniform Resource Identifier and is a "compact sequence of characters that identifies an abstract or physical resource."

URL stands for Uniform Resource Locator. URLs are a subset of URIs that "provide a means of locating a resource by describing its primary access mechanism, for example, its network location."

Now, let's talk about the syntax of the URI, or the various pieces that combine to form the whole. These consist of the scheme, authority, path, query and fragment.

Components of a URI

  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/ \_________/ \__/
   |           |             |           |        |
scheme     authority       path        query   fragment
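For comparison, here is a minimal sketch of how the built-in Uri class reports the same five pieces for the example above. The class name UriPartsDemo and its Split method are made up for illustration; Scheme, Authority, AbsolutePath, Query and Fragment are the Uri properties I am relying on.

```csharp
using System;

public static class UriPartsDemo
{
    // Returns scheme, authority, path, query and fragment as System.Uri sees them.
    // Assumes the input is an absolute URI; new Uri(...) throws otherwise.
    public static string[] Split(string text)
    {
        Uri uri = new Uri(text);
        return new[]
        {
            uri.Scheme,        // scheme without the colon
            uri.Authority,     // host, plus the port when it is not the scheme's default
            uri.AbsolutePath,  // path
            uri.Query,         // query, including the leading '?'
            uri.Fragment       // fragment, including the leading '#'
        };
    }
}
```

Note that Uri keeps the "?" and "#" delimiters on Query and Fragment, so you still have to trim them yourself if you want the bare values.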

Scheme

The scheme is the first item in a URL, such as http, which indicates that this URI uses the Hypertext Transfer Protocol. Examples of other schemes are:

http       World Wide Web server (http://www.w3.com)
https      Secure World Wide Web server (https://www.w3.com)
ftp        File Transfer Protocol (ftp://www.ftpx.com)
gopher     The Gopher protocol
mailto     Electronic mail address (mailto:info@cambiaresearch.com)
news       USENET newsgroups
nntp       USENET newsgroups using NNTP access
telnet     Interactive sessions on a server
wais       Wide Area Information Servers
file       File on the local system
prospero   Prospero Directory Services
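RFC 3986 defines a scheme name as a letter followed by any mix of letters, digits, "+", "-" and ".". If you want to check a scheme before using it, something like the following sketch should do. SchemeCheck and IsValid are names I made up for this example.

```csharp
using System.Text.RegularExpressions;

public static class SchemeCheck
{
    // Scheme syntax per RFC 3986: one letter, then letters, digits, '+', '-' or '.'.
    // \z anchors at the very end of the string, so a trailing newline is rejected.
    private static readonly Regex SchemeSyntax =
        new Regex(@"^[A-Za-z][A-Za-z0-9+.\-]*\z");

    // True when the text (without the trailing colon) is a well-formed scheme name.
    public static bool IsValid(string scheme)
    {
        return SchemeSyntax.IsMatch(scheme);
    }
}
```

This only checks the syntax. It says nothing about whether the scheme is registered or whether anything on your machine knows how to handle it.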

Authority

In a URL the authority is usually the domain (host) name. It may include a port number at the end, separated from the host by a colon.

In the following example, the authority is www.cambiaresearch.com

http://www.cambiaresearch.com

In the following example, the authority is www.cambiaresearch.com:81

https://www.cambiaresearch.com:81

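A mailto URI is a special case. It has no "//", so strictly speaking it has no authority component at all, and a generic URI parser reports info@cambiaresearch.com as the path. Informally, the part after the @ still plays the role of the domain.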

mailto:info@cambiaresearch.com
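If you need the host and the port separately, the authority of the http and https examples can be split with one more small regular expression. This is only a sketch: AuthorityParts and Split are hypothetical names, and I assume there is no "user@" prefix and no bracketed IPv6 address in the authority.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AuthorityParts
{
    // Splits "host" or "host:port" into its pieces.
    // Assumption: no userinfo ("user@") and no IPv6 literal such as "[::1]".
    public static (string Host, int? Port) Split(string authority)
    {
        Match m = Regex.Match(authority, @"^(?<host>[^:]*)(:(?<port>\d+))?\z");
        if (!m.Success)
            throw new FormatException("Unrecognized authority: " + authority);

        // The port group only succeeds when a ":digits" suffix was present.
        int? port = m.Groups["port"].Success
            ? int.Parse(m.Groups["port"].Value)
            : (int?)null;
        return (m.Groups["host"].Value, port);
    }
}
```

Calling AuthorityParts.Split("www.cambiaresearch.com:81") gives the host www.cambiaresearch.com and the port 81. Without a port, the Port value is null.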

Path

The path component of the URL identifies a particular file (or page) at a given domain. The path is terminated by the end of the URL, by a question mark (?), which signifies the beginning of the query string, or by a number sign (#), which signifies the beginning of the fragment.

The path of the following URL is "/default.htm"

http://www.cambiaresearch.com/default.htm

The path of the following URL is "/snippets/csharp/regex/uri_regex.aspx"

http://www.cambiaresearch.com/snippets/csharp/regex/uri_regex.aspx
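Once you have the path, breaking it into folder and file names is just a split on the slash. A minimal sketch, with PathParts and Segments as made-up names, and assuming you don't care about empty segments such as the one produced by a doubled slash:

```csharp
using System;

public static class PathParts
{
    // Splits a URL path into its segments, dropping empty ones
    // (the leading '/' and any "//" would otherwise yield empty strings).
    public static string[] Segments(string path)
    {
        return path.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

For the second path above this yields snippets, csharp, regex and uri_regex.aspx, so the last element is the page name.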

Query

The query part of the URL carries extra information to the page or handler that serves the web request. The query begins with a question mark (?) and is terminated by the end of the URL or by a number sign (#), which signifies the beginning of the fragment.

The query of the following URL is "?id=241"

http://www.cambiaresearch.com/default.htm?id=241

The query of the following URL is "?sourceid=navclient&ie=UTF-8&rls=GGLC,GGLC:1969-53,GGLC:en&q=uri+query"

http://www.google.com/search?sourceid=navclient&ie=UTF-8&rls=GGLC,GGLC:1969-53,GGLC:en&q=uri+query
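Query strings like the ones above are usually a list of name=value pairs. Here is a rough sketch of pulling them apart by hand. QueryParts and Parse are hypothetical names, and the rules I assume ("&" between pairs, "=" between name and value, "+" standing for a space) are web form conventions rather than part of the URI syntax itself.

```csharp
using System;
using System.Collections.Generic;

public static class QueryParts
{
    // Turns "?a=1&b=2" (or "a=1&b=2") into a list of name/value pairs.
    public static List<KeyValuePair<string, string>> Parse(string query)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (string part in query.TrimStart('?').Split('&'))
        {
            if (part.Length == 0) continue;   // skip empty pieces, e.g. from "a=1&&b=2"

            // A piece without '=' is treated as a name with an empty value.
            int eq = part.IndexOf('=');
            string name = eq < 0 ? part : part.Substring(0, eq);
            string value = eq < 0 ? "" : part.Substring(eq + 1);
            pairs.Add(new KeyValuePair<string, string>(Decode(name), Decode(value)));
        }
        return pairs;
    }

    // Assumed form encoding: '+' is a space, %XX sequences are percent-escapes.
    private static string Decode(string s)
    {
        return Uri.UnescapeDataString(s.Replace('+', ' '));
    }
}
```

Run against the Google query above, this gives four pairs, and the value of q comes back as "uri query" with the plus sign turned into a space.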

Fragment

In a URL the fragment is used to specify a location within the current page. This is often used in a FAQ with a list of links at the top of the page linking to longer descriptions farther down in the page.

The fragment of the following URL is "contact"

http://www.cambiaresearch.com/default.htm#contact

The fragment of the following URL is "scheme"

http://www.cambiaresearch.com/snippets/csharp/regex/uri_regex.aspx#scheme

Example: Regular Expressions for Parsing URIs and URLs

OK, we're finally here. The following method may be copied into the code-behind file of your aspx page. Make sure there is a Label named lblOutput on your aspx page, then call the TestParseURL method.

Example: Parse a URL with C# Regex

// Requires "using System.Text.RegularExpressions;" at the top of the file.
public void TestParseURL()
{
   string url = "http://www.cambiaresearch.com"
      + "/Cambia3/snippets/csharp/regex/uri_regex.aspx?id=17#authority";

   string regexPattern = @"^(?<s1>(?<s0>[^:/\?#]+):)?(?<a1>" 
      + @"//(?<a0>[^/\?#]*))?(?<p0>[^\?#]*)" 
      + @"(?<q1>\?(?<q0>[^#]*))?" 
      + @"(?<f1>#(?<f0>.*))?";

   Regex re = new Regex(regexPattern, RegexOptions.ExplicitCapture); 
   Match m = re.Match(url);

   lblOutput.Text = "<b>URL: " + url + "</b><p>";

   lblOutput.Text +=
      m.Groups["s0"].Value + "  (Scheme without colon)<br>"; 
   lblOutput.Text +=
      m.Groups["s1"].Value + "  (Scheme with colon)<br>"; 
   lblOutput.Text +=  
      m.Groups["a0"].Value + "  (Authority without //)<br>"; 
   lblOutput.Text +=  
      m.Groups["a1"].Value + "  (Authority with //)<br>"; 
   lblOutput.Text +=  
      m.Groups["p0"].Value + "  (Path)<br>"; 
   lblOutput.Text +=  
      m.Groups["q0"].Value + "  (Query without ?)<br>"; 
   lblOutput.Text +=  
      m.Groups["q1"].Value + "  (Query with ?)<br>"; 
   lblOutput.Text +=  
      m.Groups["f0"].Value + "  (Fragment without #)<br>"; 
   lblOutput.Text +=
      m.Groups["f1"].Value + "  (Fragment with #)<br>";
}

The following is the output you should see on your aspx page when you run the above method.

Example: Output

URL: http://www.cambiaresearch.com/Cambia3/snippets/csharp/
      regex/uri_regex.aspx?id=17#authority

http (Scheme without colon)
http: (Scheme with colon)
www.cambiaresearch.com (Authority without //)
//www.cambiaresearch.com (Authority with //)
/Cambia3/snippets/csharp/regex/uri_regex.aspx (Path)
id=17 (Query without ?)
?id=17 (Query with ?)
authority (Fragment without #)
#authority (Fragment with #)
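One thing the method above can't tell you is whether a piece was missing or merely empty, because Value is an empty string in both cases. The following sketch wraps the same kind of pattern in a helper that has no dependency on a Label and returns null for a component that isn't there. UriRegex, Part and the longer group names are my own inventions for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UriRegex
{
    // Same structure as the pattern in TestParseURL, naming only the inner groups.
    // With ExplicitCapture the unnamed parentheses do not capture.
    private static readonly Regex Pattern = new Regex(
        @"^((?<scheme>[^:/\?#]+):)?(//(?<authority>[^/\?#]*))?"
        + @"(?<path>[^\?#]*)(\?(?<query>[^#]*))?(#(?<fragment>.*))?",
        RegexOptions.ExplicitCapture);

    // Returns the named component ("scheme", "authority", "path", "query"
    // or "fragment"), or null when that component is absent from the input.
    public static string Part(string uri, string name)
    {
        Group g = Pattern.Match(uri).Groups[name];
        return g.Success ? g.Value : null;
    }
}
```

For example, UriRegex.Part("http://www.cambiaresearch.com/default.htm?", "query") returns an empty string because the question mark is present, while asking the same URL for its fragment returns null.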