  1. #1

    Yahoo Ignoring 'ROBOTS_PAGES_TO_SKIP'

    Ill-behaved bots are nothing new for Yahoo (I block some of them via robots.txt and .htaccess), but I'd like to at least control the activities of .crawl.yahoo.net, which is otherwise a very useful crawler.

    Unfortunately, many of the pages below (listed in meta_tags.php) are still being indexed:

    define('ROBOTS_PAGES_TO_SKIP','login,logoff,create_account,account,
    account_edit,account_history,account_history_info,account_newsletters,
    account_notifications,account_password,address_book,advanced_search,
    advanced_search_result,checkout_success,checkout_process,
    checkout_shipping,checkout_payment,checkout_confirmation,conditions,
    cookie_usage,create_account_success,contact_us,download,
    download_timeout,customers_authorization,down_for_maintenance,
    password_forgotten,time_out,unsubscribe,info_shopping_cart,
    popup_image,popup_image_additional,product_reviews_write,page_2,
    page_3,page_4,privacy,shippinginfo,ssl_check,tell_a_friend');

    I'm not fond of having my Conditions, Privacy, Shipping Info, FAQs, Ordering and other such pages come up whenever someone searches for my store... Any tips or suggestions would be appreciated!

  2. #2

    Re: Yahoo Ignoring 'ROBOTS_PAGES_TO_SKIP'

    Can you not block the folders from them in your robots.txt, like this?

    User-agent: *
    Disallow: /tmp/
    Disallow: /mail/
    Disallow: /etc/
    Last edited by Get Em Fast; 19 Apr 2008 at 01:48 PM.
    Teach them to shop and they will shop today;
    Teach them to Zen and they will OWN a shop tomorrow!

  3. #3

    Re: Yahoo Ignoring 'ROBOTS_PAGES_TO_SKIP'

    Thanks for the reply, Get Em.

    While robots.txt works fine for excluding directories (e.g. Disallow: /cgi-bin/), it bombs when it comes to PHP pages:

    Disallow: /cookie_usage.php
    Disallow: /conditions.php
    Disallow: /contact_us.php
    Disallow: /page_2.php
    Disallow: /page_3.php
    Disallow: /page_4.php
    Disallow: /shippinginfo.php
    Disallow: /privacy.php

    Over the past couple years, I've tried various methods to prevent these (and other) ZC pages from being indexed, but nothing seems to work. I was glad to hear that v1.3.* was going to have a noindex-type define but, as stated in my initial post, this hasn't done the trick either... Am I the only one who's experiencing this? Server stat hawks, please chime in!

  4. #4

    Re: Yahoo Ignoring 'ROBOTS_PAGES_TO_SKIP'

    Well, I'm still having problems with certain bots ignoring robots.txt and ROBOTS_PAGES_TO_SKIP in meta_tags.php. Yahoo absolves itself of responsibility for any and all misbehaving bots, so there's no solution to be found on their end. From an SEO standpoint, there's nothing on my end that would contribute to this (and nothing relevant in .htaccess). Since we're all using ZC, I can't be the only one who's experiencing this - or the only one who checks server stats regularly enough to notice it. Yahoo Slurp isn't the only offender, but it is the biggest and worst.

    If I can assume that certain bots are just ignoring the page-skip define, then I'm back to tweaking my robots.txt file...

    When researching this topic, I found many instances where "Disallow: /cookie_usage.php" was offered as a solution. Well, in the past several years of using open source carts, that has NEVER worked for me. Besides, I'm not convinced that robots.txt works for disallowing individual pages. I have also tried disallowing the directory (e.g. /cookie_usage/), but that doesn't seem to work either.

    I hate to provide too much information (in a viewable text file), but would adding the path prove more effective: "Disallow: /includes/modules/pages/cookie_usage/"? Or would "Disallow: /includes/modules/pages/cookie_usage.php" be correct?

    On another note, I manually added <meta name="robots" content="noindex, nofollow"> to the headers of select pages in v1.2.7, and it seemed to work. And while I do see that in the source view of certain pages in 1.3.8a, I'm wondering if the page-skip definition is somehow being rendered ineffective. At this point, I'm just grasping at straws...

  5. #5

    Re: Yahoo Ignoring 'ROBOTS_PAGES_TO_SKIP'

    Quote Originally Posted by oldschoolrocker:

    When researching this topic, I found many instances where "Disallow: /cookie_usage.php" was offered as a solution. Well, in the past several years of using open source carts, that has NEVER worked for me. Besides, I'm not convinced that robots.txt works for disallowing individual pages. I have also tried disallowing the directory (e.g. /cookie_usage/), but that doesn't seem to work either.

    I hate to provide too much information (in a viewable text file), but would adding the path prove more effective: "Disallow: /includes/modules/pages/cookie_usage/"? Or would "Disallow: /includes/modules/pages/cookie_usage.php" be correct?
    Those probably don't work for you because the Zen Cart URLs don't look like that.
    Try this instead:
    Disallow: /index.php?main_page=cookie_usage

    Then again, I never use robots.txt except for site-map stuff for Google. The rest I leave alone because Zen Cart seems to handle it fine for me.
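    A fuller version of that approach, covering the pages mentioned earlier in the thread, might look like the sketch below. It assumes the stock dynamic URL format (index.php?main_page=...) with no URL-rewriting plugin installed:

    User-agent: *
    Disallow: /index.php?main_page=cookie_usage
    Disallow: /index.php?main_page=conditions
    Disallow: /index.php?main_page=contact_us
    Disallow: /index.php?main_page=privacy
    Disallow: /index.php?main_page=shippinginfo
    Disallow: /index.php?main_page=page_2
    Disallow: /index.php?main_page=page_3
    Disallow: /index.php?main_page=page_4

    Each Disallow value is matched as a prefix against the path plus query string, so these lines also catch the same pages with extra parameters (a zenid session ID, for example) appended.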

  6. #6

    Re: Yahoo Ignoring 'ROBOTS_PAGES_TO_SKIP'

    This is probably not a problem with Zen Cart, unless the ROBOTS_PAGES_TO_SKIP list is not generating a noindex meta tag. My main business is SEO and I've been doing it for over 10 years.

    Several spiders do not respond quickly to changes in either meta tags or the robots.txt file. The "noindex" meta tag is more effective than the robots.txt file. Regardless of what Google and Yahoo say in their documentation, they do not re-read the robots.txt file on every visit; they only fetch it intermittently.

    I've always found that Yahoo is the most stubborn and it can sometimes take a year for them to recognize changes to the robots.txt file.

    The robots.txt file works best when it is in place when the site first goes up. Once a spider indexes a site, changes to the file take longer to have an effect.

    The ROBOTS_PAGES_TO_SKIP list should generate a noindex meta tag on each of the pages in the list. Have you checked to see if that is working correctly?
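    As a quick check, view the source of one of the listed pages and look for <meta name="robots" content="noindex, nofollow" /> in the <head>. The check Zen Cart's header template performs is roughly equivalent to this sketch (a paraphrase, not the verbatim source):

    <?php
    // Paraphrased sketch of the skip-list check in the page header template.
    // Assumes ROBOTS_PAGES_TO_SKIP is defined as in post #1 and that
    // $current_page_base holds the current page's name, e.g. 'cookie_usage'.
    $skip_list = array_map('trim', explode(',', ROBOTS_PAGES_TO_SKIP));
    if (in_array($current_page_base, $skip_list)) {
        echo '<meta name="robots" content="noindex, nofollow" />' . "\n";
    }
    ?>

    If the tag is missing from the rendered source, the define isn't being applied (an overriding template may be bypassing it); if it is present, the problem is the spider ignoring the tag, not Zen Cart.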
    Last edited by Tech-E; 3 May 2008 at 03:22 PM.

  7. #7

    Re: Yahoo Ignoring 'ROBOTS_PAGES_TO_SKIP'

    Quote Originally Posted by oldschoolrocker:
    Thanks for the reply, Get Em.

    While robots.txt works fine for excluding directories (e.g. Disallow: /cgi-bin/), it bombs when it comes to PHP pages:
    I don't find this true at all. I've created hundreds of sites and always use robots.txt to block certain files. MSN, Yahoo Slurp, and Google all follow the rules.

    The best way to set this up is to sign up for a Google Webmaster Tools account. Among the tools you will find a robots.txt testing tool; you can try your rules out in real time and make sure they work.

  8. #8

    Re: Yahoo Ignoring 'ROBOTS_PAGES_TO_SKIP'

    I agree with madk. While some search engines are very slow to recognize changes in the robots.txt file, you can block almost any type of script. Just don't expect changes to take effect right away. Spiders only scan this file intermittently; they no longer fetch it on every visit like they used to.

    There are a couple of problems I see with the way people use the robots.txt file.

    First, if you need to block a PHP page that lives in a subdirectory, don't block it like this:

    Disallow: /filename.php

    This will only block the file if it is in the root directory, because a Disallow value is a simple prefix match against the URL path. You need to include the subdirectory path in the rule (the major engines also accept wildcard patterns such as /*/filename.php as an extension to the original standard).
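    For example (the /shop/ subdirectory here is hypothetical; substitute the actual path):

    # Matches only the copy at the site root:
    Disallow: /filename.php
    # Matches the copy inside a specific subdirectory:
    Disallow: /shop/filename.php
    # Wildcard form, honored by Google, Yahoo and MSN as an extension:
    Disallow: /*/filename.php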

    The other issue is when people do things like this:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /images/

    Disallow: /data/

    If you look at the specification for the robots.txt file, a blank line is a delimiter that ends the current record, so the Disallow: /data/ line above will never be recognized; it belongs to no record. The robots.txt format does not follow the rules of HTML, where blank lines and runs of spaces are ignored.
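    Corrected, the record keeps all of its Disallow lines together, with no blank line until the record ends:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /images/
    Disallow: /data/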

    It's always a good idea to run the robots.txt file through a robots.txt validator; there are many free ones on the web, and the one in Google's Webmaster Tools also works well.
