phpBB robots.txt tutorial

 
 
Craven de Kere
Reply Fri 12 Nov, 2004 02:28 am
Odd, robots.txt wildcards are something I think Google came up with (or rather popularized) even if it's non-standard.

But that alone shouldn't be it.

I see two main possibilities:

1) The meta tag (which should be removed regardless of any problem as it's useless)

2) The use of specific dynamic variables in your robots.txt file (not the wildcards but the unnecessary dynamic variables).

Try the following:

1) Remove the meta tag.

2) Remove the wildcards and make all robots.txt listings list the exact file names to prohibit.

Then try the Google tool I posted.

Note, if the robots.txt file is recent then maybe the spiders just haven't seen it yet.

They don't fetch the robots.txt with every single visit.

AdamStone
Reply Sat 13 Nov, 2004 11:24 am
Quote:
Remove the wildcards and make all robots.txt listings list the exact file names to prohibit.

Hmm... I must be confused about what you're telling me to do. If I were to list them all individually there would be hundreds (ptopic1, ptopic2, post-144, post-145, ...).

Also, I've got more meta tags in overall_header. Should I remove them all? I know basically nothing about meta tags.

Code:<meta http-equiv="Content-Type" content="text/html; charset={S_CONTENT_ENCODING}">
<meta http-equiv="Content-Style-Type" content="text/css">
<meta name="keywords" content="social political society politics ...">
<meta name="description" content="Open forum ...">
<meta name="robots" content="index,follow">
{META}

Craven de Kere
Reply Sat 13 Nov, 2004 03:00 pm
AdamStone wrote:

Hmm... I must be confused about what you're telling me to do. If I were to list them all individually there would be hundreds (ptopic1, ptopic2, post-144, post-145, ...).


I mean these:

Disallow: /search.php
Disallow: /search.php?search_id=unanswered

Just use the search.php one.

Quote:
Also, I've got more meta tags in overall_header. Should I remove them all? I know basically nothing about meta tags.


Just remove the robots tag. Unless you are excluding pages, it is usually pointless.
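
The second search.php line is redundant because robots.txt rules are matched as URL prefixes, so the shorter rule already covers every query-string variant. A minimal sketch using Python's standard urllib.robotparser; the example.com host and the two-line rule set are assumptions for illustration, not the file discussed in this thread:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: only the shorter search.php line is kept.
rules = [
    "User-agent: *",
    "Disallow: /search.php",
]

parser = RobotFileParser()
parser.parse(rules)

# The query-string variant is covered because rules match by prefix.
blocked_plain = not parser.can_fetch("*", "http://example.com/search.php")
blocked_query = not parser.can_fetch(
    "*", "http://example.com/search.php?search_id=unanswered"
)
# A topic page is not touched by the rule.
topic_allowed = parser.can_fetch("*", "http://example.com/viewtopic.php?t=22587")

print(blocked_plain, blocked_query, topic_allowed)
```

This only shows how a standards-style parser reads the rule; individual crawlers may behave differently.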

AdamStone
Reply Fri 19 Nov, 2004 02:50 pm
Okay, I removed the meta tag and the dynamic variables (not the wildcards from mod_rewrite), and then I had Google remove the URLs I didn't want using robots.txt. However, it wouldn't accept my robots.txt until I removed the wildcards (which I did for only one minute to allow Google to do its thing). Everything worked (no more search.php, etc.).

The next day, Googlebot came around and grabbed about 45 links, and indexed several post-*.html URLs.

I checked the robots.txt file, and Disallow: /post-*.html$ is perfectly intact. Any ideas?

bugscout
Reply Fri 19 Nov, 2004 05:30 pm
hi,

i've the same problem and will test the following

in viewtopic.php

after

Code://
// Output page header
//

$page_title = $topic_title;


added

Code:// show the noindex block only when a single post is requested (viewtopic.php?p=x)
if ( !empty($post_id) )
{
	$template->assign_block_vars('switch_meta_noindex', array());
}


in overall_header.tpl

after

Code:<meta http-equiv="Content-Style-Type" content="text/css">


added

Code:<!-- BEGIN switch_meta_noindex -->
<META CONTENT="noindex,follow" NAME="robots">
<!-- END switch_meta_noindex -->


this shows noindex every time viewtopic.php?p=x is called, and the tag disappears if p isn't set.

but it's an early beta, i will test it right now.
it's 00:30 my time, google will come tonight
and i will see the result in 24 - 48 hours.

but this will only prevent you from getting double content in the index.
search engine traffic will be the same, because noindex pages will be spidered
nevertheless.
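
The decision the mod makes can be sketched outside phpBB. The Python below is only an illustration of the logic described above, not the mod itself: robots_meta_for and NOINDEX_TAG are hypothetical names, and checking the p query parameter is an approximation of phpBB's $post_id test.

```python
from urllib.parse import parse_qs, urlparse

NOINDEX_TAG = '<meta name="robots" content="noindex,follow">'


def robots_meta_for(url: str) -> str:
    """Return the noindex tag for single-post views, else an empty string."""
    query = parse_qs(urlparse(url).query)
    # parse_qs drops blank values, so "?p=" counts as "no post id".
    if query.get("p"):
        return NOINDEX_TAG
    return ""


print(repr(robots_meta_for("/viewtopic.php?p=1033728#1033728")))
print(repr(robots_meta_for("/viewtopic.php?t=22587")))
```

The topic view (t=...) stays indexable, while the per-post view of the same thread carries noindex, which is what keeps the duplicate out of the index.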

regards

bugscout
Reply Tue 23 Nov, 2004 07:25 am
hi,

switch_meta_noindex works.

regards

AdamStone
Reply Tue 23 Nov, 2004 03:28 pm
bugscout,

I'm less proficient than many others. What exactly does your mod do? Does it work in the context of Craven's SEO mod?

Thanks.

bugscout
Reply Wed 24 Nov, 2004 12:01 pm
hi,

robots.txt exclusions

Code:Disallow: forums/post-*.html$
Disallow: forums/updates-topic.html*$
Disallow: forums/stop-updates-topic.html*$
Disallow: forums/ptopic*.html$
Disallow: forums/ntopic*.html$




Disallow: forums/post-*.html$ is there because post-*.html and about*.html show the same thread, so excluding one of them should prevent you from getting double indexed content.

example page

http://www.able2know.com/forums/ask-about36.html

our thread

left link in thread -> phpBB robots.txt tutorial

Code:
http://www.able2know.com/forums/viewtopic.php?t=22587


works with

Code:RewriteRule ^about([0-9]*).html&highlight=([a-zA-Z0-9]*) viewtopic.php?t=$1&highlight=$2 [L,NC]


right link in thread -> Tue Nov 23, 2004 2:28 pm

Code:
http://www.able2know.com/forums/viewtopic.php?p=1033728#1033728


works with

Code:RewriteRule ^post-([0-9]*).html&highlight=([a-zA-Z0-9]*) viewtopic.php?p=$1&highlight=$2 [L,NC]
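
To make the mapping concrete, here is a rough Python sketch that applies the two rules quoted above to a request path. It is not how Apache works internally: the rewrite function and the RULES list are hypothetical, the patterns are copied as-is, and [NC] is approximated with re.IGNORECASE.

```python
import re
from typing import Optional

# The two patterns quoted above, copied unchanged into Python regexes.
RULES = [
    (re.compile(r"^about([0-9]*).html&highlight=([a-zA-Z0-9]*)", re.IGNORECASE),
     r"viewtopic.php?t=\1&highlight=\2"),
    (re.compile(r"^post-([0-9]*).html&highlight=([a-zA-Z0-9]*)", re.IGNORECASE),
     r"viewtopic.php?p=\1&highlight=\2"),
]


def rewrite(path: str) -> Optional[str]:
    """Return the internal target for the first matching rule, or None."""
    for pattern, target in RULES:
        match = pattern.match(path)
        if match:
            # As with mod_rewrite, the whole path is replaced by the target.
            return match.expand(target)
    return None


print(rewrite("about22587.html&highlight=robots"))
print(rewrite("post-1033728.html&highlight=robots"))
```

Both rules shown here require the &highlight= part, so a plain post-1033728.html returns None in this sketch; presumably other rules in the mod, not quoted in this post, handle the plain URLs.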


this exclusion (Disallow: forums/post-*.html$) does not work in my case, so i always add "noindex" if google comes through the right link.

the urls behind Disallow: forums/ptopic*.html$ and Disallow: forums/ntopic*.html$
should also produce double content if Disallow: forums/post-*.html$
does not work, but i have not seen that up to now.

so you only need to do this if you have post-xx.html and aboutxx.html
in your results (allinurl:www.domain.tld)

all my changes are in addition to Craven's SEO mod, which seems to be the best seo-mod for phpBB.

other changes
http://www.able2know.com/forums/viewtopic.php?t=38321

regards

bugscout
Reply Thu 25 Nov, 2004 08:12 pm
hi,

i think the wildcards will work, but google will still have results like this



this is what i see in germany.

all these nice results, coming from

posting.php?mode=quote&p=xx
posting.php?mode=reply&t=xxx
profile.php?mode=viewprofile&u=x
search.php?search_author=xxx
privmsg.php?mode=post&u=xxx


i have 80 topics and about 400 results in the serps,
in a few days there will be 1000.

a long way to a search engine friendly phpBB.
don't know if this is good for the search engine index

http://www.google.de/search?q=site:www.phpbb.com&hl=de&lr=&c2coff=1&start=100&sa=N

regards

AdamStone
Reply Tue 7 Dec, 2004 05:05 pm
As promised, bugscout's switch_meta_noindex mod works to prevent indexing of duplicate content, but I'd still like to figure out why the robots.txt entry (Disallow: /post-*.html$) doesn't work.

I suspect my robots.txt file might be faulty, because Google occasionally indexes even straightforward restrictions.

Robots.txt is chmodded to 644 in my root directory. Does that make a difference?

Code:User-agent: *
Disallow: /post-*.html$
Disallow: /updates-topic.html*$
Disallow: /stop-updates-topic.html*$
Disallow: /ptopic*.html$
Disallow: /ntopic*.html$
Disallow: /admin/
Disallow: /db/
Disallow: /images/
Disallow: /includes/
Disallow: /language/
Disallow: /templates/
Disallow: /siteinfo.php
Disallow: /common.php
Disallow: /groupcp.php
Disallow: /memberlist.php
Disallow: /modcp.php
Disallow: /posting.php
Disallow: /profile.php
Disallow: /privmsg.php
Disallow: /viewonline.php
Disallow: /faq.php
Disallow: /search.php
Disallow: /login.php

AdamStone
Reply Wed 8 Dec, 2004 11:29 pm
Craven, what would be the effect of adding this wildcard:

Code:Disallow: /post-*.html*$

Craven de Kere
Reply Thu 9 Dec, 2004 02:20 am
It would have the effect of duplicating the first line in your robot exclusion file.

AdamStone
Reply Thu 9 Dec, 2004 11:59 am
Is it your opinion, then, that mod_rewrite is probably applied incorrectly in my case? I can't figure out why Google is indexing post-* URLs, and it's driving me mad.

AdamStone
Reply Thu 6 Jan, 2005 10:33 am
I just received a reply from Google concerning the wildcard issue, and here's what they said:

Quote:
Thank you for contacting us. Please note that the Disallow line you have provided is not in line with the robots.txt standard. Disallow lines cannot have wildcards in the middle of the filepath. Also, as you have observed, our automatic removal tool will not process robots.txt files that include wildcards in their Disallow lines.

Also, please note that although a robots.txt file prevents our robots from crawling your pages, it will not prevent our robots from adding a link to your page without crawling it. Sometimes our robots add links to the Google index without crawling them. When this happens the URLs appear in our search results without a title or cache.

Although a robots.txt file usually prevents pages from appearing in our search results, the only fool-proof ways to keep them out of our index are to make sure that no sites link to them, password protect them, or remove the robots.txt file and use a NOINDEX meta tag instead.


So I'm wondering how it is that this mod works for Able2Know. Craven?

jmueller0823
Reply Fri 14 Jan, 2005 07:20 pm
(re above post)

Does this mean the robots.txt example in the first post is invalid?

Thanks.

Jetlag
Reply Sun 16 Jan, 2005 08:48 pm
Quote:
Does this mean the robots.txt example in the first post is invalid?

I would think so, because you cannot use wildcards or regex symbols in the specifications. Also there is no " / "

Edit (Jetlag) removed... cant show where i got the robots.txt info and "how to"

Edit (moderator): URL removed

Craven de Kere
Reply Wed 19 Jan, 2005 10:51 pm
Since Jetlag is trying to help with the precondition that he be able to post a link I'll summarize the issue and solutions.

1) wildcard support in robots.txt hasn't taken hold.

Too bad, as it's useful. But we can work around that:

2) An easy solution using mod_rewrite is to spoof a disallowed subdirectory in your urls.

Thing is, most here won't know how to do that, and I don't currently have time to show you, so use this:

3) /foo matches /foobar so just use something like:

Disallow: /post-

to cover everything this was meant to match:

Disallow: /post-*.html
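
The difference between the two forms can be shown with a small Python sketch of plain prefix matching, where `*` and `$` get no special meaning. is_disallowed is a hypothetical helper written for this illustration, not part of any crawler or library:

```python
def is_disallowed(path: str, disallow_rules: list[str]) -> bool:
    """Plain prefix matching: a rule blocks any path that starts with it."""
    # Empty rules are skipped because an empty Disallow blocks nothing.
    return any(rule and path.startswith(rule) for rule in disallow_rules)


wildcard_rules = ["/post-*.html$"]  # '*' and '$' are treated as literal characters
prefix_rules = ["/post-"]           # plain prefix, as suggested above

for path in ["/post-144.html", "/post-145.html", "/about36.html"]:
    print(path, is_disallowed(path, wildcard_rules), is_disallowed(path, prefix_rules))
```

Under this reading the wildcard line matches nothing real, while the prefix line blocks every post-NNN.html URL and leaves about36.html alone. The trade-off is that the prefix also blocks any other URL that happens to start with /post-.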

blackhawk12
Reply Fri 18 Feb, 2005 03:17 pm
Quote:
I have Google Adsense on each page of my forum. The Adsense Bot needs to visit each page to deliver relevant ads. Do you know if your example robots.txt file causes problems with Adsense? Thanks.

Craven wrote this reply:
Quote:
It shouldn't but you may want to allow post urls as the bot will want to monetize those as well.


Can this be elaborated on? Is this what should be removed from my robots.txt file?

Code:Disallow: /phpBB2/posting.php
Disallow: /phpBB2/ptopic*.html$

Any other useful tips about robots.txt in conjunction with the Google AdSense program would be much appreciated.

best regards,
Bernie

Craven de Kere
Reply Fri 18 Feb, 2005 05:17 pm
blackhawk12,

Any page that you serve AdSense on and also block bots from will serve default or less relevant ads.

So if you want targeted ads, then you need to let the bot spider the page.

Which pages you should allow depends on which pages you use AdSense on and how you weigh the importance of targeted ads against the detriment of having bots spider those pages.
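
One possible way to get both, sketched here as an assumption rather than a tested recommendation, is to give the AdSense crawler its own group in robots.txt. The user-agent name Mediapartners-Google is Google's published name for that crawler and is not mentioned elsewhere in this thread; the rule set is hypothetical. The Python below uses the standard urllib.robotparser to check how such a file would be read:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the AdSense crawler gets its own group with an
# empty Disallow (nothing blocked); every other bot gets the restrictions.
rules = [
    "User-agent: Mediapartners-Google",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /posting.php",
    "Disallow: /post-",
]

parser = RobotFileParser()
parser.parse(rules)

adsense_ok = parser.can_fetch("Mediapartners-Google", "/post-144.html")
others_ok = parser.can_fetch("Googlebot", "/post-144.html")
print(adsense_ok, others_ok)
```

With a file like this the ad bot can read the post pages for targeting, while general spiders are still kept off them.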

BoZaR
Reply Wed 30 Mar, 2005 07:24 am
Re: phpBB robots.txt tutorial
Quote:
Code:
Disallow: /forums/updates-topic
Disallow: /forums/stop-updates-topic
Disallow: /forums/ptopic
Disallow: /forums/ntopic
Disallow: /post-


The last few pertain to the Able2Know.com SEO MOD. You might also consider preventing search.php from being spidered.


Hello, in the SEO MOD, what do those lines of code refer to?