
robots.txt

WBOY
2016-06-24 11:53:35

In China, website administrators do not seem to pay much attention to robots.txt, but some functions cannot be achieved without it, so today Shijiazhuang SEO would like to briefly explain how to write robots.txt.

Basic introduction to robots.txt

robots.txt is a plain text file. In this file, website administrators can declare the parts of the website that they do not want to be accessed by robots, or specify that search engines only include specified content.

When a search robot (sometimes called a search spider) visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the robot determines the scope of its access from the contents of the file; if it does not, the robot simply crawls along the links.

In addition, robots.txt must be placed in the root directory of a site, and the file name must be all lowercase.
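
Because the file always sits in the root directory under a lowercase name, its address can be derived from any page URL on the site. The following is a minimal sketch in Python; the function name robots_txt_url and the page path /seo/article.html are hypothetical and used only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path is always the lowercase
    # "/robots.txt" in the root directory of the site.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page path, used only for illustration.
print(robots_txt_url("http://www.shijiazhuangseo.com.cn/seo/article.html"))
```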

Robots.txt writing syntax

First, let's look at an example of robots.txt: http://www.shijiazhuangseo.com.cn/robots.txt

Visiting the address above shows the contents of this robots.txt:

# Robots.txt file from http://www.shijiazhuangseo.com.cn

# All robots will spider the domain

User-agent: *

Disallow:

The text above means that all search robots are allowed to access all files under the www.shijiazhuangseo.com.cn site.

Syntax breakdown: the text after # is a comment; User-agent: is followed by the name of a search robot, and * stands for all search robots; Disallow: is followed by the files or directories that may not be accessed.
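
To make the three syntax elements concrete, here is a rough Python sketch that strips comments and splits each remaining line into a field and a value. The helper name parse_robots_line is an assumption for illustration, and real crawlers apply more rules than this.

```python
def parse_robots_line(line: str):
    """Split one robots.txt line into (field, value), or return None."""
    # Everything after "#" is a comment and is ignored.
    content = line.split("#", 1)[0].strip()
    if not content or ":" not in content:
        return None
    field, value = content.split(":", 1)
    # Field names are matched case-insensitively, so normalise them.
    return field.strip().lower(), value.strip()


sample = """# Robots.txt file from http://www.shijiazhuangseo.com.cn
# All robots will spider the domain
User-agent: *
Disallow:
"""

records = [r for r in map(parse_robots_line, sample.splitlines()) if r]
print(records)  # [('user-agent', '*'), ('disallow', '')]
```

An empty Disallow value, as in the last record, means that nothing is blocked.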

Below, I will list some specific uses of robots.txt:

Allow all robots to access

User-agent: *

Disallow:

Alternatively, create an empty "/robots.txt" file.

Block all search engines from accessing any part of the site

User-agent: *

Disallow: /

Block all search engines from accessing several parts of the site (the 01, 02, and 03 directories in the example below)

User-agent: *

Disallow: /01/

Disallow: /02/

Disallow: /03/

Block a specific search engine (BadBot in the example below)

User-agent: BadBot

Disallow: /

Allow only a specific search engine (Crawler in the example below)

User-agent: Crawler

Disallow:

User-agent: *

Disallow: /
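
The rules listed above can be checked without a live site. The sketch below feeds two of the examples to RobotFileParser from Python's standard urllib.robotparser module; the robot names AnyBot and OtherBot and the page paths are hypothetical, and the helper build_parser is my own naming.

```python
from urllib.robotparser import RobotFileParser


def build_parser(rules: str) -> RobotFileParser:
    """Load robots.txt rules from a string instead of fetching a URL."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


# Block the 01, 02 and 03 directories for every robot.
block_dirs = build_parser("""User-agent: *
Disallow: /01/
Disallow: /02/
Disallow: /03/
""")

# Allow only the robot named Crawler; block everyone else.
only_crawler = build_parser("""User-agent: Crawler
Disallow:

User-agent: *
Disallow: /
""")

print(block_dirs.can_fetch("AnyBot", "/01/page.html"))     # False
print(block_dirs.can_fetch("AnyBot", "/04/page.html"))     # True
print(only_crawler.can_fetch("Crawler", "/index.html"))    # True
print(only_crawler.can_fetch("OtherBot", "/index.html"))   # False
```

Note that the blank line in the second rule set separates the two records, which is how this parser tells the Crawler rules apart from the catch-all rules.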

In addition, I think it is worth expanding on this with a brief introduction to the Robots META tag:

The Robots META tag applies to individual pages. Like other META tags (such as the language used, page description, and keywords), the Robots META tag is placed in the <head> of the page, where it tells search engine robots how to crawl the page's content.

How to write the Robots META tag:

The Robots META tag is case-insensitive. name="Robots" addresses all search engines, while a specific search engine can be addressed with, for example, name="BaiduSpider". The content attribute takes four directives: index, noindex, follow, and nofollow, separated by ",".

The INDEX directive tells the search robot to crawl the page;

The FOLLOW directive means the search robot can continue crawling along the links on the page;

The default values of the Robots META tag are INDEX and FOLLOW, except for Inktomi, for which the defaults are INDEX, NOFOLLOW.

In this way, there are four combinations:

<META NAME="ROBOTS" CONTENT="INDEX,FOLLOW">

<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">

<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">

<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">

Of these, <META NAME="ROBOTS" CONTENT="INDEX,FOLLOW"> can be written as <META NAME="ROBOTS" CONTENT="ALL">, and <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW"> can be written as <META NAME="ROBOTS" CONTENT="NONE">.

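As an illustration of how the combinations and the ALL and NONE shorthands resolve, here is a small Python sketch. The function name parse_robots_content is hypothetical, and the sketch assumes the common INDEX, FOLLOW defaults rather than the Inktomi exception.

```python
def parse_robots_content(content: str) -> dict:
    """Resolve a Robots META content string into index/follow flags."""
    # Assumed defaults: INDEX and FOLLOW.
    flags = {"index": True, "follow": True}
    # Directives are case-insensitive and separated by commas.
    for token in content.lower().split(","):
        token = token.strip()
        if token == "all":
            flags.update(index=True, follow=True)
        elif token == "none":
            flags.update(index=False, follow=False)
        elif token in ("index", "noindex"):
            flags["index"] = token == "index"
        elif token in ("follow", "nofollow"):
            flags["follow"] = token == "follow"
    return flags


print(parse_robots_content("NOINDEX,FOLLOW"))  # {'index': False, 'follow': True}
print(parse_robots_content("NONE"))            # {'index': False, 'follow': False}
```
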
At present, the vast majority of search engine robots appear to comply with the rules of robots.txt, but not many support the Robots META tag yet, although the number is gradually increasing. For example, the well-known search engine Google fully supports it, and Google has also added an "archive" directive that controls whether Google keeps a snapshot of the page. For example:

<META NAME="googlebot" CONTENT="index,follow,noarchive">

This means: crawl the pages of this site and follow the links in them, but do not keep a snapshot of the page on Google.
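
From the robot's side, reading such a tag could look roughly like the Python sketch below, which uses the standard html.parser module. The class name RobotsMetaParser, the list of robot names it watches for, and the sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of Robots META tags found in a page."""

    def __init__(self, robot_names=("robots", "googlebot")):
        super().__init__()
        self.robot_names = robot_names
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").strip().lower()
        if name in self.robot_names:
            tokens = attr.get("content", "").lower().split(",")
            self.directives[name] = [t.strip() for t in tokens if t.strip()]


# Hypothetical page containing the tag from the example above.
page = (
    "<html><head>"
    '<META NAME="googlebot" CONTENT="index,follow,noarchive">'
    "</head><body></body></html>"
)

parser = RobotsMetaParser()
parser.feed(page)
print(parser.directives)  # {'googlebot': ['index', 'follow', 'noarchive']}
```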
