Optimising your secure content for Google
Wednesday 22nd October 2008

An important consideration when securing your content is how this will affect your search engine optimisation and PageRank. After all, you work on great content, so you want people to be able to find it, even if they have to pay to view it. If Googlebot (the tool Google uses to crawl and index web pages) can't view your secure content, it can't index that content, and your web site may not be ranked as highly as it should be.
This article discusses how to enable Googlebot to access your secure content. It also covers how to enable Googlebot to log in to the 'PHP Login Interface' of your OpenCrypt membership software, which lets you control what secure content Google can see and track Googlebot's usage of your website.
Google and the .htaccess pop-up login prompt

If you use a .htaccess pop-up prompt to secure your content, it is not possible to reliably detect that the visitor is Googlebot. However, the .htaccess file can check the visitor's IP address or hostname against a stored list and, if it matches, allow Googlebot to access the secure content. The issue with doing this is that a user could falsify their own IP information in order to trick your system into thinking they are Googlebot. Google recommends using the visitor's IP address to check its authenticity: look up the hostname for the IP address, then look up the IP address for that hostname, and compare the two IP addresses to double check that the visitor isn't providing false headers.
Reference:
http://www.google.com/support/webmasters/bin/answer.py?answer=80553
To allow Googlebot to access your .htaccess protected secure content, place the following at the end of your .htaccess file:
    order deny,allow
    deny from all
    allow from googlebot.com google.com
    satisfy any
Using this is not recommended on sites where security is of importance!
One issue with allowing Googlebot to access your secure content is the 'Cached' page feature. When searching on Google, you may have noticed a small 'Cached' link next to the search results, which takes you to a version of the web page stored on Google's servers. Google rarely stores many pages for a site, or all of the images, but the feature can be useful if a web site goes offline or is very slow. If you allow Googlebot to access your secure content, the 'Cached' link may provide a method for visitors to view your secure content without even visiting your web site!
To stop Google from caching your pages, place the following HTML tag in your page header between your <head> and </head> tags:
<meta name="googlebot" content="noarchive">
This tag only removes the 'Cached' link from the search results; Google will continue to index the page and display a snippet.
Reference: http://www.google.com/support/webmasters/bin/answer.py?answer=35306
Optimising OpenCrypt's 'PHP Login Interface' for Google

If you use OpenCrypt's 'PHP Login Interface' to secure your content, we can optimise your secure content for Googlebot. We are focusing on OpenCrypt's login interface, but anyone who has a custom PHP login system should be able to adjust this code to suit their needs.
OpenCrypt users, simply place the following code after your require "login.php"; statement (this is usually in your /oc/header.php file):
    if (($login_successful!="1") && ($dbusername=="")) {
        if (stristr($envbrow,"googlebot")) {
            if ($envip!="") {
                $envaddr = gethostbyaddr($envip);
                if (stristr($envaddr,"googlebot.com")) {
                    $envaddrip = gethostbyname($envaddr);
                    if ($envip==$envaddrip) {
                        $login_successful = "1";
                        $result = "4";
                        $dbusername = "googlebot";
                        $input_username = $dbusername;
                        $header_html = "<meta name=\"googlebot\" content=\"noarchive\">";
                    }
                }
            }
        }
    }
Note that you will need to include the $header_html variable in your HTML headers to display the 'noarchive' tag, so that Google doesn't offer a cached version of the page. This repeats what is written above, but it is important not to miss: if you allow Googlebot to access your secure content, the 'Cached' link may provide a method for visitors to view that content without even visiting your web site!
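As a minimal sketch of how $header_html might be included, assume your page template builds its <head> section in PHP. The render_head() helper and the page title below are hypothetical and not part of OpenCrypt; the sketch is written for PHP 7 or later.

```php
<?php
// Hypothetical helper: builds the <head> section of a page.
// $header_html is the variable set by the Googlebot check above. It is
// empty for ordinary visitors, so nothing extra is printed for them.
function render_head(string $title, string $header_html = ""): string
{
    $head  = "<head>\n";
    $head .= "<title>" . htmlspecialchars($title) . "</title>\n";
    if ($header_html !== "") {
        $head .= $header_html . "\n";
    }
    $head .= "</head>\n";
    return $head;
}

// Usage inside a page template: fall back to an empty string when the
// Googlebot check did not set $header_html.
$header_html = $header_html ?? "";
echo render_head("Members Area", $header_html);
```

If your template is plain HTML with embedded PHP instead, printing $header_html between the <head> and </head> tags should have the same effect.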
Here is the same code with line by line explanations:
if (($login_successful!="1") && ($dbusername=="")) {
Avoid unnecessary checks for logged in users.
if (stristr($envbrow,"googlebot")) {
The user agent includes the text 'googlebot'.
if ($envip!="") {
Check IP address is present.
$envaddr = gethostbyaddr($envip);
Get hostname for IP address.
if (stristr($envaddr,"googlebot.com")) {
The hostname includes the text 'googlebot.com'.
$envaddrip = gethostbyname($envaddr);
Use the hostname to detect the IP address.
if ($envip==$envaddrip) {
Check the user's IP address matches the IP address of the googlebot.com hostname to verify authenticity.
References:
http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html
http://www.google.com/support/webmasters/bin/answer.py?answer=80553
$login_successful = "1";
$result = "4";
Success, let googlebot access your secure content.
$dbusername = "googlebot";
$input_username = $dbusername;
Set a username for googlebot to track usage and control what content can be viewed.
$header_html = "<meta name=\"googlebot\" content=\"noarchive\">";
Prevent Google from caching your page so users can't view the secure content via the 'Cached' link on Google. This is very important! The tag only removes the 'Cached' link from the search results; Google will continue to index the page and display a snippet.
Reference: http://www.google.com/support/webmasters/bin/answer.py?answer=35306
}
}
}
}
}
Close the if statements.
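The checks explained above could also be wrapped in a reusable function. The following is a sketch only, not OpenCrypt code: the function name is_verified_googlebot() and its parameters are hypothetical, and it is written for PHP 7.1 or later. It differs from the code above in one respect: it requires the hostname to end in '.googlebot.com' rather than merely contain that text, because a substring match would also accept a hostname such as 'googlebot.com.example.net'. The lookup functions are passed in as optional parameters so the logic can be tried without real DNS queries.

```php
<?php
// Hypothetical helper, not part of OpenCrypt: reverse DNS lookup followed
// by a confirming forward lookup. The two callables default to PHP's
// built-in gethostbyaddr() and gethostbyname() and can be replaced by
// stubs for testing.
function is_verified_googlebot(
    string $ip,
    string $userAgent,
    ?callable $reverseLookup = null,
    ?callable $forwardLookup = null
): bool {
    $reverseLookup = $reverseLookup ?? 'gethostbyaddr';
    $forwardLookup = $forwardLookup ?? 'gethostbyname';

    // Cheap checks first: an IP address must be present and the user
    // agent must mention Googlebot.
    if ($ip === "" || stripos($userAgent, "googlebot") === false) {
        return false;
    }

    // Reverse lookup: IP address -> hostname. gethostbyaddr() returns
    // false or the unmodified IP address when the lookup fails.
    $host = $reverseLookup($ip);
    if (!is_string($host) || $host === $ip) {
        return false;
    }

    // The hostname must END in .googlebot.com, not just contain it.
    if (!preg_match('/\.googlebot\.com$/i', $host)) {
        return false;
    }

    // Forward lookup: hostname -> IP address, which must match the
    // visitor's IP address.
    return $forwardLookup($host) === $ip;
}

// Usage with stubbed lookups and illustrative values (no DNS queries):
$reverse = function ($ip) { return "crawl-66-249-66-1.googlebot.com"; };
$forward = function ($host) { return "66.249.66.1"; };
var_dump(is_verified_googlebot("66.249.66.1", "Googlebot/2.1", $reverse, $forward)); // bool(true)
```

In the OpenCrypt code above, the equivalent call would presumably be is_verified_googlebot($envip, $envbrow). Note that gethostbyname() only returns IPv4 addresses, so this sketch does not handle visitors connecting over IPv6.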
Tracking Googlebot's Usage of Your Website

Once you've set up the above code to work with your 'PHP Login Interface', simply create an account in your OpenCrypt system with the username 'googlebot'. When Googlebot visits your web site, you will be able to see which pages it has accessed via the 'Statistics' system. You can of course set up a subscription specifically for the 'googlebot' account to control what content Google can see, or you can use if statements to detect the 'googlebot' username and display different content based on that.
For example:
    if ($dbusername=="googlebot") {
        print "Some content just for Google";
    }
Of course, be very careful about displaying content just to Google, because your ranking can be penalised, for example if you were to detect Googlebot and display blocks of keywords which weren't visible to general website users.
Suggestions for Limiting Content Displayed to Google

If you are concerned about Google viewing all of your secure content, you could consider limiting what content is displayed. For example, the OpenCrypt version 1.7 Article Manager add-on provides a method for securing articles for specific subscription groups. This facility can be customised to display article snippets to non-registered users, and could easily be extended to display, for example, the first 500 words of an article to Googlebot to enhance your rankings. Another suggestion would be to display five words out of every ten. This would give Googlebot half the text of the article to rank, but the text wouldn't make much sense to a human visitor. Of course, this may cause Google to penalise you.
Reference: http://www.google.com/support/webmasters/bin/answer.py?answer=66355
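The 'first 500 words' suggestion could be sketched as follows. This is an illustration only, not the Article Manager's own code: the first_words() helper and the $article_text variable are hypothetical, the text is assumed to be plain text rather than HTML, and the sketch is written for PHP 7 or later.

```php
<?php
// Hypothetical helper: returns the first $limit words of $text,
// followed by "..." when the text was shortened.
function first_words(string $text, int $limit = 500): string
{
    // Split on runs of whitespace, dropping empty pieces.
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) <= $limit) {
        return implode(" ", $words);
    }
    return implode(" ", array_slice($words, 0, $limit)) . "...";
}

// Usage with a short sample text and a limit of three words:
$article_text = "one two three four five six";
echo first_words($article_text, 3); // prints "one two three..."
```

In a page protected by the login code above, the call could sit inside a check such as if ($dbusername=="googlebot"), with the limit left at its default of 500.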