What is a robots.txt file?
Robots.txt is a text file webmasters create to instruct search engine robots how to crawl pages on a website. The robots.txt file is part of the robots exclusion protocol (REP), a group of web standards that regulate how robots crawl the web and how they access and index content.
In practice, robots.txt files indicate whether certain user agents can or cannot crawl parts of a website. These crawl instructions are specified by “disallowing” or “allowing” the behavior of certain (or all) user agents.
Basic format:
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
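For example, a minimal robots.txt that keeps every crawler out of a hypothetical /private/ directory while leaving the rest of the site crawlable would look like this:
User-agent: *
Disallow: /private/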
Where does robots.txt go on a site?
In order to be found, a robots.txt file must be placed in a website’s top-level directory (/). Robots.txt is case sensitive: the file must be named “robots.txt” (not Robots.txt, robots.TXT, or otherwise). The /robots.txt file is publicly available.
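For example, for a site reachable at https://www.example.com (a placeholder domain), crawlers will only look for the file at:
https://www.example.com/robots.txt
A robots.txt placed in a subdirectory, such as https://www.example.com/pages/robots.txt, is not honored by crawlers.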
Sample robots.txt file
How to add it to webMethods API Portal?
As part of this tutorial, we will add the following sample robots.txt content to API Portal:
User-agent: Google
Disallow:
User-agent: *
Disallow: /
The robots.txt configuration above allows a single bot named "Google" and disallows all other bots from the complete API Portal content.
Step I
Add the below snippet of configuration to httpd-custom.conf / httpd-ssl-custom.conf:
# Load mod_alias so the Alias directive below is available
LoadModule alias_module modules/mod_alias.so
# Map the URL path /docs to the local directory that holds robots.txt
Alias "/docs" "<directoryContainingRobotsFile>"
# Serve requests for /robots.txt from the aliased directory
RewriteCond %{REQUEST_URI} ^/robots.txt
RewriteRule "^/robots\.txt$" "/docs/robots.txt" [PT,L]
The Alias directive in httpd allows documents to be served from parts of the local filesystem other than the DocumentRoot (usually $SoftwareAG/API_Portal/server/bin/agentLocalRepo/.unpacked/httpd-run-prod-*-runnable.zip/httpd/htdocs). Here we are creating an alias "/docs" which points to a directory containing robots.txt. Since that directory becomes publicly reachable, please make sure it does not contain any other important files.
We then create a rewrite rule so that requests whose URI matches /robots.txt are served from the created alias.
The changes have to be applied in both httpd-custom.conf and httpd-ssl-custom.conf so that robots.txt is served in both SSL and non-SSL contexts. A filled-in example is shown below.
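As an illustration only, assume robots.txt has been copied to a hypothetical directory /opt/softwareag/robots (any local directory readable by the httpd process will do). The Step I snippet would then read:
LoadModule alias_module modules/mod_alias.so
Alias "/docs" "/opt/softwareag/robots"
RewriteCond %{REQUEST_URI} ^/robots.txt
RewriteRule "^/robots\.txt$" "/docs/robots.txt" [PT,L]
Depending on the access policy of the bundled httpd, you may additionally need to grant access to the aliased directory, for example:
<Directory "/opt/softwareag/robots">
    Require all granted
</Directory>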
Step II
Open the ACC console and reconfigure the loadbalancer runnable as below:
reconfigure <loadbalancerInstanceId> HTTPD.modjk.exclude.cop="apidocs","docs"
loadbalancerInstanceId could be either loadbalancer_s, loadbalancer_m, or loadbalancer_l depending on your install.
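For example, on a medium-sized installation the command would be:
reconfigure loadbalancer_m HTTPD.modjk.exclude.cop="apidocs","docs"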
Step III
Now restart the loadbalancer runnable, and you should see the robots.txt content when you access https://api.xyz.com/robots.txt.
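You can also verify it from the command line, for example (replace api.xyz.com with your portal's host name, and add -k if the portal uses a self-signed certificate):
curl https://api.xyz.com/robots.txt
The response should be the sample content added above:
User-agent: Google
Disallow:
User-agent: *
Disallow: /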