Questions

What should we disallow in robots.txt for an HTML or PHP website?

Asked by Nityanand Tripathi, in Education
Generally, on a WordPress website we disallow /wp-admin/ and /wp-includes/, but I want to know what we should disallow in robots.txt for an HTML or PHP website.

Answers

Joaquin F. Advanced  Telco CEO
All the directories and files you don't want indexed by search engines: your /wp-admin, config.php, and any file that should not be available for direct browsing, etc.
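As a rough sketch, a robots.txt along those lines for a plain PHP site might look like this (the paths are purely illustrative; use whatever directories and files your own site actually has):

User-agent: *
Disallow: /includes/
Disallow: /config.php
Disallow: /tmp/

Keep in mind, as other answers below point out, that listing a file here only asks crawlers not to fetch it; it does not protect it from direct access.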
Jul 29th 2018 13:29   
Bikes X. Freshman  Where cycling isn’t just a sport. It's a lifestyle
You can disallow any files and directories you don't want search engines to index.
Jul 29th 2018 23:49   
Varsha Akki Freshman  QuickBooks Customer Support USA +1 866-662-5999
Robots.txt is a text (not HTML) file you put on your site to tell search robots which pages you would like them not to visit. Robots.txt is by no means mandatory for search engines, but generally search engines obey what they are asked not to do. It is important to clarify that robots.txt is not a way of preventing search engines from crawling your site (i.e. it is not a firewall or a kind of password protection); putting up a robots.txt file is something like putting a note "Please, do not enter" on an unlocked door: you cannot prevent thieves from coming in, but the good guys will not open the door and enter. That is why, if you have really sensitive data, it is too naïve to rely on robots.txt to protect it from being indexed and displayed in search results.
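As a minimal illustration of such a "note on the door" (the directory name is only an example):

User-agent: *
Disallow: /private/

Note that this file is itself publicly readable at /robots.txt, so every rule you add also tells anyone who looks exactly where that content lives.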
Jul 30th 2018 01:23   
Chris E. Innovator  Business Brand Executive
The robots.txt file is the first page that Googlebot visits to determine which pages it should crawl and which pages the webmaster does not want the bots to crawl.

So, it should list the pages you think should not be visible in the Google index and Google searches.

Example: an admin page.
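In robots.txt syntax, that example would be something along these lines (the path is illustrative; use whatever your admin area is actually called):

User-agent: *
Disallow: /admin/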
Jul 30th 2018 02:09   
Ordius IT Solutions Committed  Website Design & Digital Marketing
What is a site's robots.txt?
The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned.
Jul 31st 2018 00:12   
Nishant Kumar Advanced  Education Blogger
PHP: Parsing robots.txt

If you're writing any kind of script that involves fetching HTML pages or files from another server, you really need to make sure that you follow netiquette: the "unofficial rules defining proper behaviour on the Internet".

This means that your script needs to:

identify itself using the User Agent string including a URL;
check the site's robots.txt file to see if they want you to have access to the pages in question; and
not flood their server with too-frequent, repetitive or otherwise unnecessary requests.
If you don't meet these requirements then don't be surprised if they retaliate by blocking your IP address and/or filing a complaint. This article presents methods for achieving the first two goals, but the third is up to you.
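Since the article's code is not reproduced here, the following is a minimal PHP sketch of the first two rules. Everything in it is illustrative: the bot name and URLs are placeholders, and the parser is deliberately naive (it only handles "User-agent: *" groups with simple prefix matching), so a real crawler should use a proper robots.txt library and also throttle its requests per rule three.

<?php
// Rule 1: identify yourself with a User-Agent string that includes a URL.
// The agent name and URL below are hypothetical placeholders.
$agent = 'ExampleBot/1.0 (+https://example.com/bot-info.html)';
$context = stream_context_create(['http' => ['user_agent' => $agent]]);

// Rule 2: check the site's robots.txt before fetching a page.
// Naive check: only "User-agent: *" groups, simple prefix matching.
function isAllowed(string $robotsTxt, string $path): bool {
    $applies = false;
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if (stripos($line, 'User-agent:') === 0) {
            $applies = (trim(substr($line, 11)) === '*');
        } elseif ($applies && stripos($line, 'Disallow:') === 0) {
            $rule = trim(substr($line, 9));
            if ($rule !== '' && strpos($path, $rule) === 0) {
                return false; // the path falls under a Disallow rule
            }
        }
    }
    return true; // no matching rule: crawling is permitted
}

$robots = @file_get_contents('https://example.com/robots.txt', false, $context);
// A missing robots.txt means no restrictions apply.
if ($robots !== false && !isAllowed($robots, '/some/page.html')) {
    exit("robots.txt asks us not to fetch this page.\n");
}
$page = file_get_contents('https://example.com/some/page.html', false, $context);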
Aug 5th 2018 00:08   
Tom Harris Advanced  web master
The robots.txt file helps with indexing and gives instructions about the site to web robots.
Aug 9th 2018 01:53   
Ordius IT Solutions Committed  Website Design & Digital Marketing
Useful robots.txt rules
Here are some common useful robots.txt rules:

Disallow crawling of the entire website. Keep in mind that in some situations URLs from the website may still be indexed, even if they haven't been crawled. Note: this does not match the various AdsBot crawlers, which must be named explicitly.
User-agent: *
Disallow: /
Disallow crawling of a directory and its contents by following the directory name with a forward slash. Remember that you shouldn't use robots.txt to block access to private content: use proper authentication instead. URLs disallowed by the robots.txt file might still be indexed without being crawled, and the robots.txt file can be viewed by anyone, potentially disclosing the location of your private content.
User-agent: *
Disallow: /calendar/
Disallow: /junk/
Allow access to a single crawler:
User-agent: Googlebot-news
Allow: /

User-agent: *
Disallow: /
Allow access to all but a single crawler:
User-agent: Unnecessarybot
Disallow: /

User-agent: *
Allow: /

Disallow crawling of a single webpage by listing the page after the slash:

Disallow: /private_file.html
Block a specific image from Google Images:

User-agent: Googlebot-Image
Disallow: /images/dogs.jpg
Block all images on your site from Google Images:

User-agent: Googlebot-Image
Disallow: /
Disallow crawling of files of a specific file type (for example, .gif):

User-agent: Googlebot
Disallow: /*.gif$
Disallow crawling of the entire site, but show AdSense ads on those pages, by disallowing all web crawlers other than Mediapartners-Google. This implementation hides your pages from search results, but the Mediapartners-Google web crawler can still analyze them to decide what ads to show visitors to your site.

User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /
To match URLs that end with a specific string, use $. For instance, the sample code blocks any URLs that end with .xls:
User-agent: Googlebot
Disallow: /*.xls$
Aug 9th 2018 04:15   
Sunil Upreti Advanced  Digital Marketing Executive (SEO)
It is great when search engines frequently visit your site and index your content, but there are often cases where indexing parts of your online content is not what you want. For instance, if you have two versions of a page, you'd rather have the printing version excluded from crawling; otherwise, you risk incurring a duplicate content penalty.
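For example, if the printable versions live under their own directory, a rule along these lines keeps crawlers out of them (the directory name is illustrative):

User-agent: *
Disallow: /print/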
Aug 20th 2018 23:48   
Amy Willor Freshman  ABAssignmenthelp
Nice post, thank you for sharing with us.
Sep 21st 2018 00:32   
sla consultants india Innovator  SLA Consultants India - Training Center Delhi
It's a set of permissions telling crawlers which files they can and cannot access; it helps control which files on your site get indexed and which do not.
Oct 16th 2018 05:20   
Essay Corp Advanced  Online Writing Services
I'd say just post it in the Google forums and they'll surely help you out with this.
May 3rd 2019 04:16   