How to Get Dynamic Pages Searched
by Todd Elliott, 9 Jan 1997

Todd Elliott is the production manager of HotBot. He makes a kick-ass sangria.

Q:  How can I get dynamically created pages into the search engines? Most of our stuff is basically static, but because it's coming out of a database, it uses a CGI program.

I believe search engines ignore any links that contain a question mark, /cgi-bin, or /bin. So how do I get these pages indexed? Perhaps we need a standard way of building a text index, similar to robots.txt, that puts all the data in a static page and links to where the engine should point people?

Another way it could be fixed, I guess, is by using server-side includes (SSI) to fool the engine into thinking the pages are static.
- Simon

A:  So you've got a swanky Web site with oodles of content, but there's a problem. You've put it all together with CGI and can't get the search engines to register all the pages. You are not alone.

For those of you who don't know, search engines use programs known as spiders or crawlers to index content on Web sites. Crawlers usually avoid pages with URLs that contain /cgi-bin or /bin (and other variants), as well as CGI escape characters like &, ?, and =. Such URLs often lead into massive databases with recursive links that can trap a crawler in a maze of data: picture a script whose output always includes a link to another query of the same script, and you can see how a spider might follow it forever. Sometimes this poses a threat to the crawler itself, but more often than not, a trapped spider will simply bring a server to its knees within minutes. Although database creators usually don't intend to trap search-engine crawlers, Bad People sometimes create recursive crawlspaces for no other reason than pure mischief. Either way, it's just a bad idea for a spider to follow such links.
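For example, links like these will usually be passed over (the host, script, and parameter names here are invented for illustration):

    http://www.yoursite.com/cgi-bin/catalog.pl?dept=shoes&page=2
    http://www.yoursite.com/bin/search?q=widgets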

Unfortunately, there's not much you can do right now to get the crawlers to index your site correctly. It may be tempting to try to trick them into believing the pages are static HTML, but I don't recommend this, since you still run the risk of endangering your server if the crawler becomes trapped.

There has been some discussion about letting administrators advertise the content they want crawled, instead of simply limiting what shouldn't be crawled on their sites. Some even wanted to incorporate this idea into the proposed robots.txt standard (robots.txt is a text file that tells visiting robots which parts of a site are off-limits to them), but it doesn't look like this will come to fruition anytime soon.
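To give you an idea, here's what a minimal robots.txt looks like today. The file sits at the root of your server, and the paths below are placeholders for whatever you want kept out:

    # Rules for all robots (the * wildcard).
    User-agent: *
    # Keep spiders out of the CGI directories.
    Disallow: /cgi-bin/
    Disallow: /bin/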

Another common problem site developers have with search engines is caused by frames. Fortunately, this one's solvable. Crawlers tend to pick up only parent frames or subframes, which prevents them from registering entire sites correctly. The solution here is fairly simple: you should already be using the <noframes> tag to make sure anyone without a frames-capable browser can access your content. If you aren't, you may want to start, because this'll take care of crawlers as well.
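Here's a rough sketch of the idea. The file names and frame layout are invented for illustration, but the structure, with the <noframes> content inside the <frameset>, is the standard one:

    <HTML>
    <HEAD><TITLE>My Swanky Site</TITLE></HEAD>
    <FRAMESET COLS="25%,75%">
      <FRAME SRC="nav.html" NAME="nav">
      <FRAME SRC="main.html" NAME="main">
      <NOFRAMES>
      <BODY>
      <!-- Frames-less browsers and crawlers see this instead. -->
      <P>Welcome. <A HREF="contents.html">Browse the full table of
      contents</A> to reach every page on the site.</P>
      </BODY>
      </NOFRAMES>
    </FRAMESET>
    </HTML>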

Search engines are useful tools, but they still have limitations. As more features become available on new browsers, it'll be up to the designers of spiders to incorporate the technology into their engines. In the meantime, stick to the current robots.txt standard.

