  1. #1

    Yahoo Ignoring 'ROBOTS_PAGES_TO_SKIP'

    Ill-behaved bots are nothing new for Yahoo (I block some of them via robots.txt and .htaccess), but I'd like to at least control the activities of .crawl.yahoo.net, which is otherwise a very useful crawler.

    Unfortunately, many of the pages below (listed in meta_tags.php) are still being indexed:

    define('ROBOTS_PAGES_TO_SKIP','login,logoff,create_account,account,
    account_edit,account_history,account_history_info,account_newsletters,
    account_notifications,account_password,address_book,advanced_search,
    advanced_search_result,checkout_success,checkout_process,
    checkout_shipping,checkout_payment,checkout_confirmation,conditions,
    cookie_usage,create_account_success,contact_us,download,
    download_timeout,customers_authorization,down_for_maintenance,
    password_forgotten,time_out,unsubscribe,info_shopping_cart,
    popup_image,popup_image_additional,product_reviews_write,page_2,
    page_3,page_4,privacy,shippinginfo,ssl_check,tell_a_friend');

    I'm not fond of having my Conditions, Privacy, Shipping Info, FAQs, Ordering and other such pages come up whenever someone searches for my store... Any tips or suggestions would be appreciated!

  2. #2

    Re: Yahoo Ignoring 'ROBOTS_PAGES_TO_SKIP'

    Can you not block the folders from them in your robots.txt, like this?

    User-agent: *
    Disallow: /tmp/
    Disallow: /mail/
    Disallow: /etc/
    Last edited by Get Em Fast; 19 Apr 2008 at 01:48 PM.
    Teach them to shop and they will shop today;
    Teach them to Zen and they will OWN a shop tomorrow!

  3. #3

    Re: Yahoo Ignoring 'ROBOTS_PAGES_TO_SKIP'

    Thanks for the reply, Get Em.

    While robots.txt works fine for excluding directories (e.g. Disallow: /cgi-bin/), it bombs when it comes to PHP pages:

    Disallow: /cookie_usage.php
    Disallow: /conditions.php
    Disallow: /contact_us.php
    Disallow: /page_2.php
    Disallow: /page_3.php
    Disallow: /page_4.php
    Disallow: /shippinginfo.php
    Disallow: /privacy.php

    Over the past couple years, I've tried various methods to prevent these (and other) ZC pages from being indexed, but nothing seems to work. I was glad to hear that v1.3.* was going to have a noindex-type define but, as stated in my initial post, this hasn't done the trick either... Am I the only one who's experiencing this? Server stat hawks, please chime in!

  4. #4

    Re: Yahoo Ignoring 'ROBOTS_PAGES_TO_SKIP'

    Well, I'm still having problems with certain bots ignoring robots.txt and ROBOTS_PAGES_TO_SKIP in meta_tags.php. Yahoo absolves itself of responsibility for any and all misbehaving bots, so there's no solution to be found on their end. From an SEO standpoint, there's nothing on my end that would contribute to this (and nothing relevant in .htaccess). Since we're all using ZC, I can't be the only one who's experiencing this - or the only one who checks server stats regularly enough to notice it. Yahoo Slurp isn't the only offender, but it is the biggest and worst.

    If I can assume that certain bots are just ignoring the page-skip define, then I'm back to tweaking my robots.txt file...

    When researching this topic, I found many instances where "Disallow: /cookie_usage.php" was offered as a solution. Well, in the past several years of using open source carts, that has NEVER worked for me. Besides, I'm not convinced that robots.txt works for disallowing individual pages. I have also tried disallowing the directory (e.g. /cookie_usage/), but that doesn't seem to work either.

    I hate to provide too much information (in a viewable text file), but would adding the path prove more effective: "Disallow: /includes/modules/pages/cookie_usage/"? Or would "Disallow: /includes/modules/pages/cookie_usage.php" be correct?

    On another note, I manually added <meta name="robots" content="noindex, nofollow"> to the headers of select pages in v1.2.7, and it seemed to work. And while I do see that in the source view of certain pages in 1.3.8a, I'm wondering if the page-skip definition is somehow being rendered ineffective. At this point, I'm just grasping at straws...

  5. #5

    Re: Yahoo Ignoring 'ROBOTS_PAGES_TO_SKIP'

    Quote Originally Posted by oldschoolrocker:

    When researching this topic, I found many instances where "Disallow: /cookie_usage.php" was offered as a solution. Well, in the past several years of using open source carts, that has NEVER worked for me. Besides, I'm not convinced that robots.txt works for disallowing individual pages. I have also tried disallowing the directory (e.g. /cookie_usage/), but that doesn't seem to work either.

    I hate to provide too much information (in a viewable text file), but would adding the path prove more effective: "Disallow: /includes/modules/pages/cookie_usage/"? Or would "Disallow: /includes/modules/pages/cookie_usage.php" be correct?
    Those probably don't work for you because the Zen Cart URLs don't look like that.
    Try this instead:
    Disallow: /index.php?main_page=cookie_usage

    Then again, I never use robots.txt except for site-map stuff for Google. The rest I leave alone because Zen Cart seems to handle it fine for me.
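    A fuller version of that approach, covering the pages mentioned earlier in the thread, might look like the sketch below. It assumes the stock dynamic URL format (index.php?main_page=...) with no URL-rewriting plugin installed:

    User-agent: *
    Disallow: /index.php?main_page=cookie_usage
    Disallow: /index.php?main_page=conditions
    Disallow: /index.php?main_page=contact_us
    Disallow: /index.php?main_page=privacy
    Disallow: /index.php?main_page=shippinginfo
    Disallow: /index.php?main_page=page_2
    Disallow: /index.php?main_page=page_3
    Disallow: /index.php?main_page=page_4

    Each Disallow value is matched as a prefix against the path plus query string, so these lines also catch the same pages with extra parameters (a zenid session ID, for example) appended.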

  6. #6

    Re: Yahoo Ignoring 'ROBOTS_PAGES_TO_SKIP'

    This is probably not a problem with Zen Cart, unless the ROBOTS_PAGES_TO_SKIP list is not generating a noindex meta tag. My main business is SEO and I've been doing it for over 10 years.

    Several spiders do not respond quickly to changes in either meta tags or the robots.txt file. The "noindex" meta tag is more effective than the robots.txt file. Regardless of what Google and Yahoo say in their documentation, they do not re-read the robots.txt file on every visit; they only fetch it intermittently.

    I've always found that Yahoo is the most stubborn and it can sometimes take a year for them to recognize changes to the robots.txt file.

    The robots.txt file works best when it is in place when the site first goes up. Once a spider indexes a site, changes to the file take longer to have an effect.

    The ROBOTS_PAGES_TO_SKIP list should generate a noindex meta tag on each of the pages in the list. Have you checked to see if that is working correctly?
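    As a quick check, view the source of one of the listed pages and look for <meta name="robots" content="noindex, nofollow" /> in the <head>. The check Zen Cart's header template performs is roughly equivalent to this sketch (a paraphrase, not the verbatim source):

    <?php
    // Paraphrased sketch of the skip-list check in the page header template.
    // Assumes ROBOTS_PAGES_TO_SKIP is defined as in post #1 and that
    // $current_page_base holds the current page's name, e.g. 'cookie_usage'.
    $skip_list = array_map('trim', explode(',', ROBOTS_PAGES_TO_SKIP));
    if (in_array($current_page_base, $skip_list)) {
        echo '<meta name="robots" content="noindex, nofollow" />' . "\n";
    }
    ?>

    If the tag is missing from the rendered source, the define isn't being applied (an overriding template may be bypassing it); if it is present, the problem is the spider ignoring the tag, not Zen Cart.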
    Last edited by Tech-E; 3 May 2008 at 03:22 PM.

  7. #7

    Re: Yahoo Ignoring 'ROBOTS_PAGES_TO_SKIP'

    Quote Originally Posted by oldschoolrocker:
    Thanks for the reply, Get Em.

    While robots.txt works fine for excluding directories (e.g. Disallow: /cgi-bin/), it bombs when it comes to PHP pages:
    I don't find this true at all. I've created hundreds of sites and always use robots.txt to block certain files. MSN, Yahoo Slurp, and Google all follow the rules.

    The best way to set this up is to sign up for a Google Webmaster Tools account. Among the tools you will find a robots.txt testing tool; you can try your rules out in real time and make sure they work.

  8. #8

    Re: Yahoo Ignoring 'ROBOTS_PAGES_TO_SKIP'

    I agree with madk. While some search engines are very slow to recognize changes in the robots.txt file, you can block almost any type of script. Just don't expect changes to take effect right away. Spiders only scan this file intermittently; they no longer fetch it on every visit like they used to.

    There are a couple of problems I see with the way people use the robots.txt file.

    First, if you need to block a PHP page that lives in a subdirectory, don't block it like this:

    Disallow: /filename.php

    This will only block the file if it is in the root directory, because a Disallow value is a simple prefix match against the URL path. You need to include the subdirectory path in the rule (the major engines also accept wildcard patterns such as /*/filename.php as an extension to the original standard).
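    For example (the /shop/ subdirectory here is hypothetical; substitute the actual path):

    # Matches only the copy at the site root:
    Disallow: /filename.php
    # Matches the copy inside a specific subdirectory:
    Disallow: /shop/filename.php
    # Wildcard form, honored by Google, Yahoo and MSN as an extension:
    Disallow: /*/filename.php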

    The other issue is when people do things like this:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /images/

    Disallow: /data/

    If you look at the specification for the robots.txt file, a blank line is a delimiter that ends the current record, so the Disallow: /data/ line above will never be recognized; it belongs to no record. The robots.txt format does not follow the rules of HTML, where blank lines and runs of spaces are ignored.
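    Corrected, the record keeps all of its Disallow lines together, with no blank line until the record ends:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /images/
    Disallow: /data/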

    It's always a good idea to run the robots.txt file through a robots.txt validator; there are many free ones on the web, and the one in Google's Webmaster Tools also works well.
