Note: This section is still under construction and may be lacking content or contain some inaccuracies; it may be better to check back at a later date. Thank you.

Getting Listed on Search Engines

This page will give you some tips on how to get your pages listed on search engines.

meta Tags

meta tags are small chunks of code that go into the <head> section of your HTML pages. They are used as indicators to web browsers, search engines, etc. to give information about your page and its contents.

meta elements are usually specified something like this:

<meta name="xxx" content="xxx">

…although http-equiv is sometimes used in place of name.
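
For example, one common http-equiv use is declaring the page's content type (shown here purely to illustrate the syntax; it is not something search engines require):

<meta http-equiv="content-type" content="text/html; charset=utf-8">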

There are several meta tags you can put in your pages to help search engines, which are listed below:

Description

The description meta tag allows you to add a description for your page, like so:

<meta name="description" content="A little text to describe the page">

Keyword

The keywords meta tag lets you list keywords or phrases for search engines to match against the words the user enters. They are separated by commas:

<meta name="keywords" content="rick,bull,rick bull">

So that example would match the individual keywords rick and bull, or the phrase rick bull.

Author

Some search engines require that you add an e-mail address for the author of the page:

<meta name="author" content="rickbull@rickmusic.fsnet.co.uk">

Language

Specifying the document's language via predefined language codes can help search engines to bring up results that are in the user's desired language:

<meta name="language" content="en-gb">

Robots

Robots are programs (search engine crawlers, for example) that visit your page, analyse it and then add it to their databases, so that users can search those databases and possibly bring up your site as a result. There are two ways to specify which pages robots should or should not visit - the robots.txt file, or the robots <meta> tag.

robots.txt

The robots.txt file is a file that robots check for on your server to find out which directories and files they should and should not read. It is more flexible than the <meta> tag and is the preferred method, but some servers do not allow you to create or modify this file (usually for security reasons).

The robots.txt file must be in lower-case, and there should be only one in the root of the website (e.g. http://www.rickbull.co.uk/ is my root address, so my robots.txt file's URI would be http://www.rickbull.co.uk/robots.txt).

There are two parts to the robots.txt file: User-Agent and Disallow:

User-Agent

This keyword allows you to specify which robots the following rules apply to. For example, if we wanted the rules to apply only to the WebCrawler robot, we would use this line of code:

User-agent: WebCrawler

You then add any Disallow rules that you want to apply to this robot on the following lines. There must be a blank line between each group of user-agents and rules, as is shown in the example robots.txt file below.

You may also use an asterisk (*) to denote that the following rules apply to all robots. Note: robots.txt is not regular-expression aware; the * has no special meaning other than "any robot". You cannot write, for example, web* - that would just mean a robot literally named web*, not web followed by any other characters as it would in a regular expression.

Disallow

Once you have identified your target robots you need to add the directories or files that they should ignore. You do this with the Disallow directive:

Disallow: /my_secret_files/

In this case the robot would ignore the contents of /my_secret_files/. You can also apply this to individual files. To allow access to all files on your server, you can use this code:

User-agent: *
Disallow:

Alternatively, you can just create an empty robots.txt file. Also note that robots treat these values as partial URIs. For example, this code:

Disallow: /rick

…would not only disallow the directory /rick, but also any other files in the root directory whose names start with rick, e.g. /rick_bull.htm, /rick.php, /rickamous.xyz, etc.

One last thing to note is that you can add comments with the hash (#) character, and all text until the end of that line will be ignored by the robot. Below is an example robots.txt file:

User-agent: WebCrawler
Disallow: /temp/
Disallow: /some_personal_files/personal.txt
Disallow: /some_personal_files/personal2.htm

User-agent: SneakBot #Don't let SneakBot see anything
Disallow: /

User-agent: BadBoy
Disallow: /~ #Stop BadBoy from seeing anything with ~ at the start
Disallow: /private.htm

User-agent: GoodRobot #Allow GoodRobot to go anywhere
Disallow:

User-agent: * #All robots
Disallow: /my_stuff/

Robots <meta> Tag

If you do not have permission to edit the robots.txt file on your server you can use the robots <meta> tag instead. There are six directives/keywords you can use:

index
Specifies that the current page should be indexed
noindex
Specifies that the current page should not be indexed
follow
Specifies that the robot should follow links in this page, and index them too
nofollow
Specifies that the robot should not follow links in this page
all
Combines index and follow to mean that this page should be indexed, and all links should be followed
none
Combines noindex and nofollow to mean that this page should not be indexed, and no links should be followed

You can use these keywords like any other <meta> tag:

<meta name="robots" content="noindex">

You can also specify more than one directive by separating them with commas:

<meta name="robots" content="index,nofollow">

Obviously though you should not specify conflicting directives, such as "index,noindex" or "nofollow,follow".
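
Putting it all together, the <head> section of a page using the tags covered above might look something like this (the values are taken from the earlier examples; the <title> is just a placeholder):

<head>
  <title>My Home Page</title>
  <meta name="description" content="A little text to describe the page">
  <meta name="keywords" content="rick,bull,rick bull">
  <meta name="author" content="rickbull@rickmusic.fsnet.co.uk">
  <meta name="language" content="en-gb">
  <meta name="robots" content="index,follow">
</head>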

Headings

Using headings in the correct manner (i.e. to emphasise the structure of the document, as the short example below shows) can also help some search engines to get an idea of what your page is all about. For more information on using headings please visit the basics of HTML section.
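
For instance, a page with one main heading and a sub-heading for each section gives robots a clear outline of the content (a sketch based on this page's own sections):

<h1>Getting Listed on Search Engines</h1>
<p>This page will give you some tips on how to get your pages listed on search engines.</p>

<h2>meta Tags</h2>
<p>meta tags are small chunks of code that go into the head section of your HTML pages.</p>

<h2>Robots</h2>
<p>Robots are programs that visit your page, analyse it and add it to their databases.</p>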


All tutorials and content written by Rick Bull unless otherwise stated
Page's last update: Friday, 15th January 2010; 12:54:10 UTC +0000
Official web robots page: http://www.robotstxt.org/