Parsing URLs with Regular Expressions and the Regex Object
And the Anatomy of a URI (Uniform Resource Identifier)
By Steve on Tuesday, January 09, 2007
Updated Thursday, September 08, 2016
Summary
While the .NET Framework contains a class called Uri, which provides a lot of functionality for dealing with URLs and URIs, it's not always easy to get at the pieces of a URL. I present a technique here that uses regular expressions to extract the subparts of a URI in one fell swoop. But first, let's talk about the parts of a URI.
Anatomy of a URI
The following information is taken from the official IETF Network Working Group documents. For the whole kit and caboodle, see RFC 3986.
URI stands for Uniform Resource Identifier and is a "compact sequence of characters that identifies an abstract or physical resource."
URL stands for Uniform Resource Locator. URLs are a subset of URIs that "provide a means of locating the resource by describing its primary access mechanism (e.g., its network location)."
Now, let's talk about the syntax of the URI, or the various pieces that combine to form the whole. These consist of the scheme, authority, path, query and fragment.
Components of a URI
foo://example.com:8042/over/there?name=ferret#nose
\_/   \______________/\_________/ \_________/ \__/
 |           |             |           |        |
scheme   authority        path       query   fragment
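As a quick cross-check of the diagram, here is a minimal sketch that hands the same example URI to the .NET Uri class mentioned in the summary. The names UriParts and Describe are made up for this illustration. Note that Uri reports the query and fragment together with their leading ? and # delimiters.

```csharp
using System;

public static class UriParts
{
    // Returns scheme, authority, path, query and fragment of an
    // absolute URI as reported by System.Uri, in that order.
    public static string[] Describe(string text)
    {
        Uri uri = new Uri(text);
        return new[]
        {
            uri.Scheme,        // "foo"
            uri.Authority,     // "example.com:8042"
            uri.AbsolutePath,  // "/over/there"
            uri.Query,         // "?name=ferret" (includes the ?)
            uri.Fragment       // "#nose" (includes the #)
        };
    }
}
```

Calling UriParts.Describe("foo://example.com:8042/over/there?name=ferret#nose") should return the five pieces shown in the comments.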
Scheme
The scheme of a URL is the first item, such as http, which indicates that this URI uses the Hypertext Transfer Protocol. Examples of other schemes are:
http | World Wide Web Server. (http://www.w3.com) |
https | Secure World Wide Web Server. (https://www.w3.com) |
ftp | File Transfer Protocol (ftp://www.ftpx.com) |
gopher | The Gopher Protocol. |
mailto | Electronic Mail Address. (mailto:info@cambiaresearch.com) |
news | USENET newsgroups. |
nntp | USENET newsgroups using NNTP access. |
telnet | Interactive sessions on server. |
wais | Wide Area Information Servers. |
file | file on local system. |
prospero | Prospero Directory Services. |
Authority
In a URL the authority is usually the host (often loosely called the domain) and may include a port number at the end, separated from the host by a colon.
In the following example, the authority is www.cambiaresearch.com
http://www.cambiaresearch.com
In the following example, the authority is www.cambiaresearch.com:81
https://www.cambiaresearch.com:81
In the following example there is no authority at all: the URI has no // after the scheme, so info@cambiaresearch.com is the path
mailto:info@cambiaresearch.com
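To make the host and port split concrete, the following sketch reads the authority of an http or https URL through System.Uri. AuthorityDemo and Split are illustrative names, not part of the framework. One thing to be aware of: when the URL gives no port, Uri reports the scheme's default port (80 for http) and leaves it out of Authority.

```csharp
using System;

public static class AuthorityDemo
{
    // Returns { authority, host, port } for an absolute http/https URL.
    public static string[] Split(string url)
    {
        Uri uri = new Uri(url);
        return new[]
        {
            uri.Authority,          // host, plus ":port" when the port is not the default
            uri.Host,               // host name only
            uri.Port.ToString()     // explicit port, or the scheme's default port
        };
    }
}
```

For https://www.cambiaresearch.com:81 this should give the authority www.cambiaresearch.com:81, the host www.cambiaresearch.com and the port 81.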
Path
The path component of the URL identifies the specific file (or page) at a particular domain. The path is terminated by the end of the URL, by a question mark (?), which signifies the beginning of the query string, or by a number sign (#), which signifies the beginning of the fragment.
The path of the following URL is "/default.htm"
http://www.cambiaresearch.com/default.htm
The path of the following URL is "/snippets/csharp/regex/uri_regex.aspx"
http://www.cambiaresearch.com/snippets/csharp/regex/uri_regex.aspx
Query
The query part of the URL is a way to send some information to the page or program that will handle the web request. The query is introduced by a question mark (?) and is terminated by the end of the URL or by a number sign (#), which signifies the beginning of the fragment. Strictly speaking, the question mark is a delimiter and not part of the query itself.
The query of the following URL is "id=241"
http://www.cambiaresearch.com/default.htm?id=241
The query of the following URL is "sourceid=navclient&ie=UTF-8&rls=GGLC,GGLC:1969-53,GGLC:en&q=uri+query"
http://www.google.com/search?sourceid=navclient&ie=UTF-8&rls=GGLC,GGLC:1969-53,GGLC:en&q=uri+query
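Once the query has been isolated, it is usually split further into name/value pairs. The sketch below is one simple way to do that, assuming the common name=value&name=value convention and treating + as an encoded space (a form-encoding habit, not something the URI syntax itself requires). QueryDemo, Parse and Decode are hypothetical names introduced here.

```csharp
using System;
using System.Collections.Generic;

public static class QueryDemo
{
    // Splits a query string (with or without its leading '?') into
    // name/value pairs. A repeated name overwrites the earlier value.
    public static Dictionary<string, string> Parse(string query)
    {
        var result = new Dictionary<string, string>();
        string[] pairs = query.TrimStart('?')
            .Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string pair in pairs)
        {
            int eq = pair.IndexOf('=');
            string name = eq < 0 ? pair : pair.Substring(0, eq);
            string value = eq < 0 ? "" : pair.Substring(eq + 1);
            result[Decode(name)] = Decode(value);
        }
        return result;
    }

    // Assumption: '+' stands for a space; percent-escapes are decoded afterwards.
    private static string Decode(string s)
    {
        return Uri.UnescapeDataString(s.Replace('+', ' '));
    }
}
```

With the Google URL above, QueryDemo.Parse should yield four entries, with q holding "uri query" and rls holding "GGLC,GGLC:1969-53,GGLC:en".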
Fragment
In a URL the fragment is used to specify a location within the current page. This is often used in a FAQ with a list of links at the top of the page linking to longer descriptions farther down in the page.
The fragment of the following URL is "contact"
http://www.cambiaresearch.com/default.htm#contact
The fragment of the following URL is "scheme"
http://www.cambiaresearch.com/snippets/csharp/regex/uri_regex.aspx#scheme
Example: Regular Expressions for Parsing URIs and URLs
OK, we're finally here. The following method may be copied into the code-behind file of your aspx page. Ensure there is a Label named lblOutput on your aspx page, then call the TestParseURL method.
Example: Parse a URL with C# Regex
// Requires: using System.Text.RegularExpressions;
public void TestParseURL()
{
    string url = "http://www.cambiaresearch.com"
        + "/Cambia3/snippets/csharp/regex/uri_regex.aspx?id=17#authority";

    // Named groups: s = scheme, a = authority, p = path, q = query, f = fragment.
    // The "0" groups leave out the delimiter (":", "//", "?", "#");
    // the "1" groups include it.
    string regexPattern = @"^(?<s1>(?<s0>[^:/\?#]+):)?(?<a1>"
        + @"//(?<a0>[^/\?#]*))?(?<p0>[^\?#]*)"
        + @"(?<q1>\?(?<q0>[^#]*))?"
        + @"(?<f1>#(?<f0>.*))?";

    Regex re = new Regex(regexPattern, RegexOptions.ExplicitCapture);
    Match m = re.Match(url);

    // A group that did not take part in the match returns an empty string.
    lblOutput.Text = "<b>URL: " + url + "</b><p>";
    lblOutput.Text += m.Groups["s0"].Value + " (Scheme without colon)<br>";
    lblOutput.Text += m.Groups["s1"].Value + " (Scheme with colon)<br>";
    lblOutput.Text += m.Groups["a0"].Value + " (Authority without //)<br>";
    lblOutput.Text += m.Groups["a1"].Value + " (Authority with //)<br>";
    lblOutput.Text += m.Groups["p0"].Value + " (Path)<br>";
    lblOutput.Text += m.Groups["q0"].Value + " (Query without ?)<br>";
    lblOutput.Text += m.Groups["q1"].Value + " (Query with ?)<br>";
    lblOutput.Text += m.Groups["f0"].Value + " (Fragment without #)<br>";
    lblOutput.Text += m.Groups["f1"].Value + " (Fragment with #)<br>";
}
The following is the output you should see on your aspx page when you run the above method.
Example: Output
URL: http://www.cambiaresearch.com/Cambia3/snippets/csharp/regex/uri_regex.aspx?id=17#authority
http (Scheme without colon)
http: (Scheme with colon)
www.cambiaresearch.com (Authority without //)
//www.cambiaresearch.com (Authority with //)
/Cambia3/snippets/csharp/regex/uri_regex.aspx (Path)
id=17 (Query without ?)
?id=17 (Query with ?)
authority (Fragment without #)
#authority (Fragment with #)
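If you would rather not tie the pattern to a Label on a page, the same regular expression can be wrapped in a small helper that returns the pieces. The following is only a sketch: UriRegexDemo and Parse are hypothetical names, and the dictionary keys simply reuse the group names from the pattern above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UriRegexDemo
{
    // Same pattern as in TestParseURL, built once and reused.
    private static readonly Regex Pattern = new Regex(
        @"^(?<s1>(?<s0>[^:/\?#]+):)?(?<a1>//(?<a0>[^/\?#]*))?"
        + @"(?<p0>[^\?#]*)(?<q1>\?(?<q0>[^#]*))?(?<f1>#(?<f0>.*))?",
        RegexOptions.ExplicitCapture);

    // Returns scheme (s0), authority (a0), path (p0), query (q0)
    // and fragment (f0), each without its delimiter.
    public static Dictionary<string, string> Parse(string uri)
    {
        Match m = Pattern.Match(uri);
        var parts = new Dictionary<string, string>();
        foreach (string name in new[] { "s0", "a0", "p0", "q0", "f0" })
        {
            // A group that did not take part in the match yields "".
            parts[name] = m.Groups[name].Value;
        }
        return parts;
    }
}
```

Because every part of the pattern is optional, it also accepts relative references. For /default.htm#contact, for instance, the scheme, authority and query come back empty, the path is /default.htm and the fragment is contact.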