Network Programming: Web databases

What is a Web Database?

A web database is an organized listing of web pages, much like the card catalog you might find in a library. The database holds a "surrogate" for each web page: a record made up of selected pieces such as the title and the headings. Creating these surrogates is called "indexing", and each web database does it in a different way. Web databases hold surrogates for anywhere from 1 million to 30 million web pages. Each web database also has a search interface, which is either the box you type words into (as in Alta Vista or Lycos) or the list of directories you pick from (as in Yahoo). Thus, each web database has a different indexing method and a different search interface.
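To make the idea of a surrogate concrete, here is a minimal sketch in Python that pulls the title and headings out of one page using the standard library's html.parser module. The names SurrogateParser and build_surrogate, and the choice to keep only h1 to h3 headings, are assumptions made for illustration; real web databases each choose their own pieces to keep.

```python
from html.parser import HTMLParser


class SurrogateParser(HTMLParser):
    """Collects the title and heading text of one HTML page."""

    HEADING_TAGS = {"h1", "h2", "h3"}  # assumption: only these count as headings

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self._current = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in self.HEADING_TAGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        # Join the collected text and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if tag == "title":
            self.title = text
        else:
            self.headings.append(text)
        self._current = None


def build_surrogate(url, html):
    """Return a small record (the surrogate) that stands in for the page."""
    parser = SurrogateParser()
    parser.feed(html)
    parser.close()
    return {"url": url, "title": parser.title, "headings": parser.headings}
```

The database would then store the returned record rather than the whole page, and searches would run against those records.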

Data Organization

Web databases enable collected data to be organized and cataloged thoroughly across hundreds of parameters. Using a Web database does not require advanced computer skills, and many database software programs offer an easy "click-and-create" style with no complicated coding: fill in the fields and save each record. You can then organize the data however you choose, such as chronologically, alphabetically or by a specific set of parameters.
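For readers who do want to see what happens behind a "click-and-create" form, the following is a minimal sketch using Python's built-in sqlite3 module. The contacts table, its columns and the helper names_ordered_by are hypothetical and exist only to illustrate saving records and then ordering them alphabetically or chronologically.

```python
import sqlite3

# An in-memory database stands in for the Web database (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, city TEXT, added TEXT)")

# "Fill in the fields and save each record."
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        ("Rivera", "Austin", "2021-03-14"),
        ("Chen", "Boston", "2020-11-02"),
        ("Okafor", "Denver", "2022-01-20"),
    ],
)

ALLOWED_COLUMNS = {"name", "city", "added"}


def names_ordered_by(column):
    """Return contact names sorted by one of the known columns."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown column: {column}")
    rows = conn.execute(f"SELECT name FROM contacts ORDER BY {column}")
    return [row[0] for row in rows]


alphabetical = names_ordered_by("name")    # sorted by name
chronological = names_ordered_by("added")  # ISO dates sort correctly as text
```

Checking the column name against a fixed set before building the query keeps the sort option from being used to inject SQL.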

Web Database Software

Web database software is included in office software suites, for example Microsoft Office Access and OpenOffice Base. Other programs include the Webex WebOffice database and FormLogix Web database. The most advanced software applications can set up data collection forms, polls and feedback forms, and can present data analysis in real time.

Applicable Uses

Businesses both large and small can use Web databases to create website polls, feedback forms, and client, customer and inventory lists. Personal Web database use can range from storing personal email accounts to a home inventory to personal website analytics. The Web database is entirely customizable to an individual's or business's needs.

Methods of Indexing

There are three methods of indexing used in web database creation: full-text, keyword, and human.

Full-Text Indexing

As its name implies, full-text indexing is where every word on the page is put into a database for searching. Alta Vista and Open Text are examples of full-text databases. Full-text indexing will help you find every instance of a reference to a specific name or term. However, a general topic search will not be very useful in these databases, and you will have to dig through a lot of "false drops" (returned pages that have nothing to do with your search).
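One common way to implement this kind of indexing is an inverted index, which maps each word to the set of pages that contain it. The sketch below is an illustration under that assumption, not a description of how Alta Vista or Open Text work internally; the page names and texts are invented.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_full_text_index(pages):
    """Map every word on every page to the set of pages containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Invented sample pages (illustration only).
pages = {
    "a.html": "The jaguar is a large cat native to the Americas.",
    "b.html": "Our dealer sold a Jaguar last week; the buyer also owns a cat.",
    "c.html": "Cats and dogs make good pets.",
}
index = build_full_text_index(pages)
hits = search(index, "jaguar cat")  # matches a.html and b.html
```

Here b.html is a false drop for someone researching the animal: it contains both query words but is about a car sale.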

Keyword Indexing

In keyword indexing, only the "important" words and phrases are put into the database. Lycos and Excite are keyword indexed. This allows a searcher to search on more general subjects and get more accurate results. However, if a name is only mentioned once or twice on a page, it won't be included in the database.
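How a service decides which words are "important" varies and is not specified here. As a rough sketch, assume a word counts as a keyword when it is not a common stop word and appears at least three times on the page; the stop-word list, the threshold and the function name extract_keywords are all assumptions for illustration.

```python
import re
from collections import Counter

# A tiny stop-word list; a real one would be much longer (assumption).
STOP_WORDS = {"a", "an", "and", "are", "at", "in", "is", "of", "the", "to"}


def extract_keywords(text, min_count=3):
    """Keep only words that occur at least min_count times on the page."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return {word for word, count in counts.items() if count >= min_count}


page = (
    "Owls hunt at night. Owls eat mice. Owls nest in barns. "
    "Dr. Smith studies owls."
)
keywords = extract_keywords(page)
```

With this sample page only "owls" is kept. The name "Smith" appears once, so a search for it would not find this page, which is the trade-off described above.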

Human Indexing

Yahoo and parts of Magellan are two of the few examples of human indexing. In the two methods above, all of the work is done by a computer program called a "spider" or a "robot". In human indexing, a person examines the page and chooses a very small number of key phrases that describe it. This gives the user a good starting set of works on a topic, assuming that the human indexer picked that topic as something that describes the page. This is how the directory-based web databases are developed.

Spiders, Robots, or People

How do the web databases select which pages are indexed? As there is no centralized Internet computer, there is no one place where these services can learn about new pages. Thus, many services use automated programs called "spiders" or "robots" that travel from site to site, looking for new WWW pages. Some spiders only go to the "What's New" or the "What's Hot" pages and use those for indexing the "popular" sites. Others methodically examine every link leading from a page, then every link leading from each of those pages, and so on. In some cases, people examine the pages brought back by these programs and do not index the pages that fail to meet certain criteria. So, these tools create three classes of web databases: those that look at all WWW pages, those that examine popular WWW pages, and those that examine quality web pages.
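The methodical kind of spider, the one that follows every link from every page, can be sketched as a breadth-first traversal. To keep the example self-contained it walks a small invented link table (TOY_WEB) instead of fetching real pages; the function name crawl, the get_links callback and the max_pages limit are assumptions for illustration.

```python
from collections import deque


def crawl(start_url, get_links, max_pages=100):
    """Visit pages breadth-first, following each link only once."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)  # a real spider would index the page here
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# An invented four-page web: each page maps to the pages it links to.
TOY_WEB = {
    "home": ["news", "about"],
    "news": ["home", "story"],
    "about": [],
    "story": ["home"],
}

visited = crawl("home", lambda url: TOY_WEB.get(url, []))
```

The seen set is what stops the spider from looping forever when pages link back to each other, as "news" and "story" both do here. A spider that only indexes popular sites would use the same loop but start from a "What's New" page and stop after its direct links.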