A few years ago, I needed a Content Management System (CMS) for my site Voor Beginners and its English counterpart For Beginners. One of the requirements was that the CMS should use "search engine friendly" URLs. This is fairly easy to accomplish with Linux and Apache; however, another requirement was that the CMS should run on the Windows platform... In this article, I will show how you can "simulate" the effects of .htaccess and mod_rewrite using Microsoft's Internet Information Server (IIS) and classic ASP.
A typical CMS stores its content in a database for easy maintenance. When an end user visits a web page, the content for that page must be retrieved from the database so that it can be displayed. So how does the system "know" which database record should be retrieved for a particular page? The answer is that the URL for that page contains a query string or "parameter" that uniquely identifies the content, e.g.:
http://www.example.com/showitem.php?id=12345
In this example, there is only a single parameter ("id"), but it is quite possible to have URLs with two or more parameters. For example, if you have a piece of clothing that is available in different colors and sizes, you could have something like:
http://www.example.com/showitem.php?id=12345&col=12&siz=34
Unfortunately, search engines have a problem indexing URLs with parameters (A.K.A. "dynamic URLs"); and that's especially true for URLs with multiple parameters. Therefore, if we want all of our pages to be indexed, we need a mechanism that hides the parameters from the search engines and turns them into "static" URLs that look something like:
http://www.example.com/showitem/12345/12/34/
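To make the mapping concrete, here is a minimal sketch of how a CMS could generate the "static" form of a link from its dynamic form. It is written in Python purely for illustration (the article itself uses PHP and ASP URLs), and the function name to_static_url and the param_order argument are hypothetical names introduced for this example.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl


def to_static_url(dynamic_url, param_order):
    """Turn e.g. /showitem.php?id=1&col=2 into /showitem/1/2/ (illustrative only)."""
    parts = urlsplit(dynamic_url)
    params = dict(parse_qsl(parts.query))
    # Drop the file extension: "/showitem.php" -> "/showitem"
    page = parts.path.rsplit(".", 1)[0]
    # Emit the values in a fixed order so that every page has exactly one URL
    segments = [params[name] for name in param_order]
    path = page + "/" + "/".join(segments) + "/"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


static_url = to_static_url(
    "http://www.example.com/showitem.php?id=12345&col=12&siz=34",
    ["id", "col", "siz"],
)
print(static_url)
```

Passing the parameter order explicitly means the same item always produces the same URL, whatever order the parameters had in the dynamic link.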
Two earlier articles by garrett and bheerssen show how you can achieve the desired effect with Apache using .htaccess and mod_rewrite. However, .htaccess and mod_rewrite are not available if you use IIS, so we need a "trick" to simulate their effect on the Windows platform...
Let's use the ASP version of the earlier example with three parameters. In other words, when end users request the page:
http://www.example.com/showitem/12345/12/34/
we will act as if they had requested the page:
http://www.example.com/showitem.asp?id=12345&col=12&siz=34
The first thing to notice is that the "parameterless page" does not really exist, so when end users request it, they will actually trigger an error ("404 File Not Found").
This leads us to the idea that we can use a custom error handler to deal with the problem.
To specify a custom error handler for our web site in IIS, we go to the "Custom Errors" tab of our site's Properties. The default handler for the 404 error will be of type "File" and will point to a file called "404b.htm" somewhere in the Windows directory. We click on "Edit" to specify a new error handler. First, we change "Message type" from "File" to "URL". Next, we enter the absolute URL of the ASP file that will act as our 404 error handler, i.e., a URL that starts with a slash: we enter "/my404.asp" rather than "my404.asp". Finally, we click "OK" to confirm.
We have now stated that there will be a file called "my404.asp" in the root directory of our site that will deal with "file not found" errors, so our next step is to create one.
How do we know which (non-existent) file has been requested by an end user? Fortunately, that is something that we can easily find out by looking at "Request.QueryString".
If someone requests "http://www.example.com/showitem/12345/12/34", Request.QueryString will contain "404;http://www.example.com:80/showitem/12345/12/34", i.e., the error code "404" followed by a semicolon and the requested URL. (By the way, notice that the URL includes the port number ":80"!)
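The format of this string is simple enough to take apart in a few lines. The following is a minimal sketch in Python (not ASP), assuming the query string has exactly the "404;URL" shape described above; the function name requested_path is a hypothetical name chosen for this example.

```python
from urllib.parse import urlsplit


def requested_path(query_string):
    """Return the path of the URL that triggered the 404, or None."""
    status, separator, url = query_string.partition(";")
    if status != "404" or not separator:
        return None  # not in the "404;URL" format
    # urlsplit copes with the ":80" port number; we only keep the path
    return urlsplit(url).path


path = requested_path("404;http://www.example.com:80/showitem/12345/12/34")
print(path)
```

Working with the path alone means the host name and the port number no longer matter for the rest of the parsing.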
Now all we have to do is "parse" the URL to find the three "hidden" parameters, and then we can "translate" the requested URL into the actual URL that we will send back to the browser.
A first, extremely "naive" version of our code could be something like:
Dim RQ, P, ID, Color, Size
RQ = Request.QueryString
P = Instr(RQ,"showitem/")
If P > 0 Then
    RQ = Mid(RQ,P+9) ' The string "showitem/" contains 9 characters!
    P = Instr(RQ, "/")
    ID = Left(RQ,P-1)
    RQ = Mid(RQ,P+1)
    P = Instr(RQ, "/")
    Color = Left(RQ,P-1)
    RQ = Mid(RQ,P+1)
    P = Instr(RQ, "/")
    Size = Left(RQ,P-1)
    Response.Write "ID: " & ID & ", Color: " & Color & ", Size: " & Size
End If
In reality, we would need much better error handling; what, for example, if the URL does not contain the required number of parameters, or if it does not contain a trailing slash?
For the sake of simplicity, we will respond to these cases by sending a status code of 404 to the browser and stop further processing; we'll do the same when someone requests a completely unrelated (non-existent) page (e.g., http://www.example.com/nosuchpage.htm). This can be done with the following code:
Dim RQ, P, ID, Color, Size, ErrorFound
RQ = Request.QueryString
ErrorFound = False
P = Instr(RQ,"showitem/")
If P > 0 Then
    RQ = Mid(RQ,P+9) ' The string "showitem/" contains 9 characters!
    P = Instr(RQ, "/")
    If P > 0 Then
        ID = Left(RQ,P-1)
        RQ = Mid(RQ,P+1)
        P = Instr(RQ, "/")
        If P > 0 Then
            Color = Left(RQ,P-1)
            RQ = Mid(RQ,P+1)
            P = Instr(RQ, "/")
            If P > 0 Then
                Size = Left(RQ,P-1)
            Else
                ErrorFound = True
            End If
        Else
            ErrorFound = True
        End If
    Else
        ErrorFound = True
    End If
Else
    ErrorFound = True
End If
If Not ErrorFound Then
    Response.Write "ID: " & ID & ", Color: " & Color & ", Size: " & Size
Else
    Response.Status = "404 File Not Found"
    Response.End
End If
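The nested checks can also be expressed by splitting the path on slashes and counting the pieces. The sketch below shows that alternative in Python rather than ASP, so it is an illustration of the idea and not a drop-in replacement. It is deliberately stricter than the code above: it assumes that the path must start with "/showitem/" and that all three parameters are numeric, which the original code does not require. The name parse_showitem is hypothetical.

```python
def parse_showitem(path):
    """Return {'id', 'col', 'siz'} for /showitem/<id>/<col>/<siz>/, else None."""
    prefix = "/showitem/"
    # Reject unrelated pages and URLs without a trailing slash
    if not path.startswith(prefix) or not path.endswith("/"):
        return None
    parts = path[len(prefix):-1].split("/")
    # Exactly three parameters, each made of ASCII digits (an added assumption)
    if len(parts) != 3 or not all(p.isascii() and p.isdigit() for p in parts):
        return None
    item_id, color, size = parts
    return {"id": item_id, "col": color, "siz": size}


print(parse_showitem("/showitem/12345/12/34/"))
print(parse_showitem("/nosuchpage.htm"))
```

Every case in which the function returns None corresponds to a branch that sets ErrorFound in the ASP version, so the caller would answer with a 404 status.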
So far, we have responded to a (well-formed) URL request by displaying the three parameters ID, Color, and Size. In reality, however, we want to return the page:
http://www.example.com/showitem.asp?id=12345&col=12&siz=34
This can easily be accomplished using Server.Transfer:
Server.Transfer "/showitem.asp?id=" & ID & "&col=" & Color & "&siz=" & Size
(We have to make sure, however, that the file "showitem.asp" itself uses absolute, rather than relative, URLs for graphics, style sheets, etc., otherwise it will point to items in a non-existent directory!)
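The reason is that the browser resolves relative URLs against the address it requested, not against the file that actually produced the page. A quick way to see the effect is Python's urljoin, which follows the same resolution rules as a browser; the image path used here is a made-up example.

```python
from urllib.parse import urljoin

# The address the browser believes it is looking at
page_url = "http://www.example.com/showitem/12345/12/34/"

# A relative reference is resolved inside the non-existent "directory"
relative = urljoin(page_url, "images/shirt.gif")

# A reference that starts with a slash is resolved from the site root
absolute = urljoin(page_url, "/images/shirt.gif")

print(relative)
print(absolute)
```

Only the second form points at a location that exists on the server.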
The example above deals with a single type of page (clothing items) with three parameters (ID, Color, and Size). Of course, we could expand the code so that it can handle different page types and (perhaps variable) numbers of parameters. As a result, we would be able to use URLs like:
http://www.example.com/showbook/9876/
to display information on books (that have no color or size, just an ID), or:
http://www.example.com/showitem/12345/12/34/5/
for clothing items that have a fourth parameter (e.g., material). However, as you can imagine, the required code could quickly get very messy and hard to debug...
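One way to keep such code manageable is to describe each page type in a table and let a single routine do the parsing. The following Python sketch illustrates the idea under several assumptions that are not in the article: the file "showbook.asp", the parameter name "mat" for the material, the rule that all parameters are numeric, and the names ROUTES and resolve are all hypothetical.

```python
from urllib.parse import urlencode

# Page type -> target file, parameter names in order, number of required ones
ROUTES = {
    "showbook": {"target": "/showbook.asp", "names": ["id"], "required": 1},
    "showitem": {"target": "/showitem.asp",
                 "names": ["id", "col", "siz", "mat"], "required": 3},
}


def resolve(path):
    """Translate a 'static' path into target file plus query string, or None."""
    if not path.startswith("/") or not path.endswith("/"):
        return None
    segments = path[1:-1].split("/")
    route = ROUTES.get(segments[0])
    if route is None:
        return None  # unknown page type
    values = segments[1:]
    # Accept anything from the required number up to all known parameters
    if not route["required"] <= len(values) <= len(route["names"]):
        return None
    if not all(v.isascii() and v.isdigit() for v in values):
        return None
    return route["target"] + "?" + urlencode(list(zip(route["names"], values)))


print(resolve("/showbook/9876/"))
print(resolve("/showitem/12345/12/34/5/"))
```

Adding a new page type then only requires a new entry in the table, and the parsing code itself stays unchanged.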
As I was thinking about a way to improve upon this idea, the following thought struck me. What if we were to use the entire query string (after some "basic cleaning", perhaps, like converting it to lower case and removing extraneous characters) to retrieve the associated content from a database; something like:
SQL = "SELECT * FROM MyContent WHERE MyTitle = '" & CleanQueryString & "'" ' ...
This would provide us with a very flexible way to display content from our database! While I haven't (yet) implemented this idea myself, it may well be worth exploring... Happy coding!
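Anyone who explores this idea should keep in mind that the lookup key comes straight from the visitor, so concatenating it into the SQL text invites SQL injection. A minimal sketch of a safer variant, written in Python with an in-memory SQLite database standing in for the real one, could look like this. The cleaning rules, the "Body" column, and the function names are assumptions made for the example; only the table name MyContent and the column MyTitle come from the query above.

```python
import re
import sqlite3


def clean_query_string(raw):
    """Lower-case the requested path and reduce it to letters, digits, hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", raw.lower()).strip("-")


def find_content(conn, raw):
    """Look up the content for a requested path; None means 'send a 404'."""
    key = clean_query_string(raw)
    # The "?" placeholder keeps the key out of the SQL text itself
    row = conn.execute(
        "SELECT Body FROM MyContent WHERE MyTitle = ?", (key,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyContent (MyTitle TEXT, Body TEXT)")
conn.execute("INSERT INTO MyContent VALUES (?, ?)", ("happy-coding", "Hello"))

print(find_content(conn, "/Happy-Coding/"))
print(find_content(conn, "/nosuchpage/"))
```

Classic ASP offers the same protection through parameterized ADO commands, so the principle carries over even though the syntax differs.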
Comments
Useful little hack, however
Useful little hack; however, if you have administrative access to the server, a better solution might be to install an ISAPI filter that performs the rewrite. For a site that I run, I use ISAPI_Rewrite version 2, which uses its own configuration file format. (They have released version 3, which uses mod_rewrite's .htaccess format for compatibility, but I have not switched, so I cannot comment on that version.) I have found it very reliable and not too expensive (depending on your configuration, it can even be free). Once it is installed, it is as simple as creating a file in the root of the website containing a list of rewrite rules. For the example given, the line would be:
RewriteRule /showitem/([0-9]{4})/([0-9]{2})/([0-9]{2})/ showitem.asp?id=$1&col=$2&siz=$3 [I]
You may find this an easier approach than including rules as code blocks in your custom 404 page, which can rapidly become unwieldy on sites whose structure is more complex than the example given.
Dynamic is not always a Problem.
You say that "search engines have trouble indexing dynamic URLs". That isn't strictly true. You promote the "dynamic URLs are bad" mantra, when in fact the problem needs some extra qualification.
Consider these issues:
Multiple Parameters
Search engines do have problems indexing URLs that have more than three parameters. That's because there can be so many combinations that the bot gets stuck on the site and has to abort. In general, URLs with one or two parameters have no real issues to deal with (except see comments below about Parameter Ordering). Those are usually spidered and indexed just fine. For sites with fewer than three parameters in URLs, there is generally no issue to fix.
Session IDs
Search engines do have problems indexing URLs that contain session IDs. A session ID is an extra parameter, and with it every "page" of the site appears to have a new URL every time it is revisited. That particular Duplicate Content issue causes untold problems. It stifles the proper flow of PageRank around the site, and causes many pages to be briefly listed multiple times, and then dropped in favour of an alternative URL (same URL but with a different session ID) every few days. You MUST hide session IDs from search engine bots.
Parameter Ordering
Sites that use multiple parameters are often inconsistent in the ordering of the parameters in the URLs:
Those are all the same page, but search engines treat it as three identical pages competing with each other. They pick one to list and drop the others. They might not pick the one you wanted. They might choose a different one to show after a few weeks or months. You MUST be consistent to use the same parameter order everywhere on the site, and take steps to block or redirect access when an incorrect (as in, wrong order) format is requested.
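As a rough illustration of "block or redirect", the Python sketch below rebuilds a URL with its parameters in one fixed order and reports where a request should be redirected if it arrived in any other form. The parameter names are borrowed from the article's showitem example; CANONICAL_ORDER, canonical_url, and redirect_target are hypothetical names, and dropping unknown parameters is an assumption that may not suit every site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# The one parameter order that the site uses everywhere
CANONICAL_ORDER = ["id", "col", "siz"]


def canonical_url(url):
    """Rebuild the URL with known parameters in canonical order."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    # Parameters not listed in CANONICAL_ORDER are dropped
    ordered = [(name, params[name]) for name in CANONICAL_ORDER if name in params]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(ordered), "")
    )


def redirect_target(url):
    """Return the URL to 301-redirect to, or None if the URL is already canonical."""
    target = canonical_url(url)
    return None if target == url else target


print(redirect_target("http://www.example.com/showitem.asp?siz=34&id=12345&col=12"))
print(redirect_target("http://www.example.com/showitem.asp?id=12345&col=12&siz=34"))
```

A request for which redirect_target returns a URL would be answered with a 301 pointing to that URL; all other requests are served normally.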
Print Friendly URLs
Pages with a "print friendly" version usually just deliver the same content at a slightly different URL (often with an additional &pf=1 parameter). The print friendly version should be blocked from being indexed, as it is Duplicate Content. Additionally, no one wants to arrive at a "print friendly" page directly from the SERPs, as that page will, by its very nature, be very unlikely to contain any navigation back to anywhere else on the site.
Those are the real issues.
I don't like the idea of using a 404 error handler to deliver the "core" content of a website. Every core URL should directly return a "200 OK" status code and the required content. Disallowed URL variations should return a 301 redirect pointing to the correct URL. URLs that have no content should return a "404 Not Found" status code, and a Custom Error Page with links to the relevant site sections to help visitors on their way.
Using "static" looking URLs can often be a good idea, but there are many ways to badly implement that.
If you do implement such a scheme you MUST make sure that all of the parameter-driven dynamic URLs are no longer web accessible. They MUST return either a "404 Not Found" error, or a 301 redirect to the "foldered" URL format.
Please clarify this...
*** Now all we have to do is "parse" the URL to find the three "hidden" parameters, and then we can "translate" the requested URL into the actual URL that we will send back to the browser.***
Can you explain a little more about this?
In particular, what do you mean by "the actual URL that we will send back to the browser"?
Does the user see the URL in their browser change? If so, then this is a very bad method to implement. If every on-site URL request results in a redirect to another URL, then search engines would index those "target" URLs, not the "friendly" ones used in the internal links on the site itself. That would not be a good idea.
Or, did you mean "translate the requested URL into the actual server filepath that the content will be fetched from"?
If the latter, then there is probably no real issue. That's a perfectly normal rewrite (as opposed to a redirect).
Be aware that a URL and a server filepath are two very different things.
When a URL is requested, the server can do one of several things. It can fetch content from an internal server path and filename that exactly matches those in the URL, or it can externally redirect the user to a new URL (the browser will then make a new request), or it can silently translate the requested URL into a different internal server filepath, and then fetch the content from there without exposing what that filepath actually is. That last option is a rewrite. Only if there is nothing to fetch should the final option, sending the "404 Not Found" error page, be invoked.
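The four outcomes can be summed up in a few lines of code. This is only a toy model in Python, assuming that the server's files, redirects, and rewrites are given as simple lookup tables; real servers decide these things through their configuration, and all names below are hypothetical.

```python
def handle_request(path, files, redirects, rewrites):
    """Return (status code, detail) for a requested URL path."""
    # 1. A file whose server path matches the URL exactly is served directly
    if path in files:
        return (200, path)
    # 2. External redirect: the browser is told to request a new URL
    if path in redirects:
        return (301, redirects[path])
    # 3. Rewrite: content is fetched from another internal file,
    #    and the browser never sees that file's path
    internal = rewrites.get(path)
    if internal in files:
        return (200, internal)
    # 4. Nothing to fetch
    return (404, None)


files = {"/index.htm", "/showitem.asp"}
redirects = {"/old.htm": "/index.htm"}
rewrites = {"/showitem/12345/12/34/": "/showitem.asp"}

print(handle_request("/showitem/12345/12/34/", files, redirects, rewrites))
print(handle_request("/old.htm", files, redirects, rewrites))
```

In the rewrite case the status is 200 and the address in the browser stays the same, whereas the redirect case makes the browser issue a second request for a different URL.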
Clarification
g1smd: I agree with some of the points you make, although it was my impression that URLs with more than one parameter can be "problematic" (so we only have different opinions regarding the case N=2); and I'd say that the best way to implement a "print-friendly version" would be to use an appropriate style sheet.
As for your question regarding "the actual URL that we will send back to the browser", let me clarify.
The user will not be "redirected", and the URL in the browser will not change.
Instead, as you say, "the requested URL" will be silently translated "into a different internal server filepath" --my example uses Server.Transfer, which does exactly that, as opposed to Response.Redirect, which would cause a redirect!
So yes, I'm talking about a "rewrite" rather than a "redirect".
Thanks
Thanks for the clarification that you are using a rewrite and not a redirect.
Many people muddle these two things up and then end up doing major harm to the indexing of their sites.
The differences between URLs and server filepaths, and between redirects and rewrites, are very important.
In general I see very few issues with parameter-driven URLs with up to four parameters. I am quite happy with N >= 1 ... N <= 3. I am not so confident with N = 4. However, most of the problems I see happening with sites seem to stem from inconsistent ordering of the parameters, and session ID inclusion, leading to Duplicate Content issues.
If you specifically avoid those particular problems, then you'll likely have no other issues to deal with.
Clarification is good
Big Problems
It's a great idea, but it has at least two big problems:
1) We can't use Server.Transfer with a URL that contains parameters (a query string).
2) We can't use Unicode characters in the parameters; Request.QueryString returns an empty string for such parameters.