I have been trying to figure out how to block some dynamic URLs from Googlebot. (Yahoo!’s Slurp and MSN’s msnbot understand the same or very similar syntax, so most of what follows applies to them as well.) I use a rewrite rule in my .htaccess file to serve static-looking pages instead of dynamic ones, but I found that Googlebot sometimes keeps crawling the dynamic versions anyway. That leaves me with duplicate content, which none of the major search engines looks kindly on.

I am trying to clean up my personals site, which currently ranks well with Yahoo but not with Google. My impression is that MSN Live uses algorithms similar to Google’s; that isn’t scientifically proven in any way, just what I’ve observed from my own SEO work and my clients’ sites. I think I’ve found some answers about ranking well with Google, MSN and possibly Yahoo, and I’m in the middle of testing them right now; I’ve already managed to rank a client’s site well in Google for its relevant keywords. Anyway, here’s how to block Googlebot from your dynamic pages using your robots.txt file. First, the rewrite rule that started all this, an extract from my .htaccess file:

RewriteRule ^personals-dating-(.*)\.html$ /index.php?page=view_profile&id=$1 [L]

This rule, in case you were wondering, lets me publish static-looking pages such as personals-dating-4525.html in place of the dynamic link index.php?page=view_profile&id=4525. However, it has also caused problems, because every profile is now reachable at both addresses, and Googlebot has “loaded” me with duplicate content. Duplicate content is frowned upon: it makes Googlebot crawl extra pages for nothing, and the algorithm can read it as spam. The bottom line is that duplicate content should be avoided at all costs.
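
One way to shut off the dynamic versions at the source, before robots.txt even enters the picture, is to 301-redirect direct requests for the dynamic URL to its static counterpart. The following is only a sketch, assuming profile IDs are numeric: it matches against %{THE_REQUEST} (the original request line), so it will not loop with the internal rewrite above.

# Sketch: permanently redirect direct hits on the dynamic URL
# to the static form, so only one version stays crawlable.
RewriteCond %{THE_REQUEST} \?page=view_profile&id=([0-9]+)
RewriteRule ^index\.php$ /personals-dating-%1.html? [R=301,L]

The trailing “?” in the substitution drops the old query string from the redirect target.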

The following is an excerpt from my robots.txt file:

User-agent: Googlebot
Disallow: /index.php?page=view_profile&id=*
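
Since Slurp and msnbot understand the same or very similar syntax, the equivalent records should look like the following. I have only tested the Googlebot rule myself, so treat these as a sketch and confirm the wildcard support against each engine’s documentation:

User-agent: Slurp
Disallow: /index.php?page=view_profile&id=*

User-agent: msnbot
Disallow: /index.php?page=view_profile&id=*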

Notice the “*” (asterisk) at the end of each Disallow line. It tells the crawler to match any run of characters in its place: index.php?page=view_profile&id=4525, or the same URL with any other ID, is covered, so these dynamic pages will not be indexed. You can check whether the rules in your robots.txt file will work properly from your Google Webmaster Tools account. If you don’t have one, any Google account (Gmail, AdWords, or AdSense) gets you in; setting one up is simple and free, and if you want to achieve higher rankings you really should have one. Once signed in, click on the “Diagnostics” tab and then on the “robots.txt analysis” link under the Tools section in the left column.

By the way, your robots.txt file must live in your webroot folder (for this site, that would be http://www.personals1001.com/robots.txt). Googlebot checks your site’s robots.txt file about once a day, and the copy shown in Google Webmaster Tools under the “robots.txt analysis” section is updated accordingly.

To test your robots.txt file and validate that your rules will work correctly with Googlebot, simply type the URL you want to test into the “Test URLs with this robots.txt file” field. I added the following line to this field:

http://www.personals1001.com/index.php?page=view_profile&id=4235

I then clicked the “Check” button at the bottom of the page, and the tool confirmed that Googlebot would block this URL under the rules above. I think this is a better way to keep Googlebot out than the “URL Removal” tool, which sits in the left column of the Google Webmaster Tools dashboard; I’ve read several threads in Google Groups from people who have had problems with the “URL Removal” tool.
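
If you want to sanity-check a pattern without the web tool, the logic behind the asterisk is easy to approximate. Here is a minimal Python sketch of that wildcard matching; it is my own simplification for illustration, not Google’s actual implementation, and it only handles the “*” wildcard used above:

import re

def is_blocked(disallow_pattern, url_path):
    # Translate the robots.txt pattern into a regex: "*" matches any
    # run of characters, everything else is literal. re.match anchors
    # the pattern at the start of the path, as robots.txt rules are.
    parts = [re.escape(p) for p in disallow_pattern.split("*")]
    return re.match(".*".join(parts), url_path) is not None

pattern = "/index.php?page=view_profile&id=*"
print(is_blocked(pattern, "/index.php?page=view_profile&id=4235"))  # True
print(is_blocked(pattern, "/personals-dating-4235.html"))           # False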
