Greg R. Notess

Reference Librarian

Montana State University

On The Net

Internet Search Techniques

and Strategies

ONLINE, July 1997

Copyright © Online Inc.

 

Tracking down relevant information quickly on the Internet remains an elusive quest. While it has been refreshing to see the last few years' development of significant and substantial databases of Internet resources, these databases and search engines still have a long way to go before they mature. The lack of comprehensive coverage, the inability to distinguish popular materials from authoritative scholarship, and most of all the absence of controlled vocabulary mean that current Web databases are only one tool in the professional searcher's toolkit.

 

The constantly changing nature of the Net and its overall lack of organization make it more difficult to search than commercial online services, although the absence of a per minute charge (for most of us) relieves one pressure. However, the Internet requires different search strategies than commercial online services. The pearl building approach of finding one relevant citation in a controlled vocabulary index and then finding more based on its subject headings does not work well in Internet databases that lack controlled vocabulary. Instead, successful searching of the Internet depends on techniques such as going straight to the information source, guessing URLs, and developing strategies for when to use subject directories and search engines.

 

STRAIGHT TO THE SOURCE

 

Understanding how information content is produced and disseminated is a useful skill for identifying likely sources for specific kinds of information. This is true for print resources, commercial online systems, and the Internet. Knowing which handbooks, online databases, organizations, or other resources provide specific types of information has been an essential skill for the information professional for many years. Yet on the Net, with full-text search capabilities, the general lack of organization, and the unusual locations for useful information content, it can be more difficult to get a sense of the information landscape. Even so, identifying a likely information producer can be an important first step in Internet research.

 

A medical searcher helping a patron find basic information on a specific type of cancer knows that the National Cancer Institute is an authoritative source. Looking at the National Cancer Institute's Web site is a natural place to begin, and its link to CancerNet should get the patron quickly to relevant and reliable information. A student writing a paper on both sides of the abortion controversy can be led to the National Right to Life and the National Abortion and Reproductive Rights Action League Web sites. For an information request for basic financial information on Dow Chemical, point the user to both Dow's Web site for its annual report and to the Securities and Exchange Commission's site for 10K reports.

 

Understanding how the Internet fits into the current publication cycle is an important piece of this strategy. What kind of information do organizations typically make available on the Web? Product information, public relations material, collaborative scientific project reports, staff directories, mission statements, library catalogs, current news, government information, selected article reprints, and press releases are just some of what is commonly accessible on the Web. Trade secrets, strategic plans, commercial databases, and most copyrighted published material are not readily accessible. If the information needed is of the type likely to be made available on the Web, then it is just a matter of determining who would make it available.

 

A BIT OF GUESSWORK

 

Identifying a possible source organization for specific information is a starting point, but then the searcher needs to locate the organization's Internet presence, i.e., to find the URL. The subject directories and larger search engines can certainly be used for this task, but before spending time with those, try some simple URL guesswork. For many Web sites, the unofficial standard of the www.company.com address may easily take a searcher directly to the top-level Web page for the organization's site. A look at the top-level page is useful for understanding the organization of information on the site as well as providing an overview of what information is available.

 

In guessing the URL, remember that both Netscape Navigator and Microsoft's Internet Explorer automatically take a host address and add the common http:// at the beginning. So to save typing a few strokes, just leave that part of the URL off. Since www is the most common way to begin a host address, start with that for the URL guess. After the www, try the organization's name, acronym, or abbreviated name, and then add the appropriate top-level domain. While most commercial sites now have the .com domain, do not forget the other common U.S. endings: .edu for educational institutions, .gov for U.S. government, .mil for U.S. military, and .org for other organizations. These will be expanded in the near future, but for now they are the most common.

 

In many cases, this basic strategy for guessing a URL works quite well. Jumping to the Army's main Web site is as easy as entering www.army.mil. Guessing that www.sec.gov goes to the Securities and Exchange Commission is a straightforward connection. It should be easy to identify the URLs for the National Rifle Association, Chrysler Corporation, and the University of Indiana as www.nra.org, www.chrysler.com, and www.indiana.edu.
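The guessing procedure can be sketched in a few lines of Python. This is only an illustration: the function name guess_hosts and the way it strips spaces and punctuation are assumptions made for the example, and the list of endings is simply the five named above. It produces candidate addresses to try in a browser; it does not check whether any of them exists.

```python
# Top-level domains named in the article (the common U.S. endings in 1997).
COMMON_TLDS = ("com", "edu", "gov", "mil", "org")


def guess_hosts(names, tlds=COMMON_TLDS):
    """Return candidate host addresses of the form www.<name>.<tld>.

    `names` holds the full name, acronym, or abbreviation to try.
    Spaces and punctuation are dropped (an assumption of this sketch),
    since host names cannot contain them.
    """
    guesses = []
    for name in names:
        label = "".join(ch for ch in name.lower() if ch.isalnum() or ch == "-")
        if not label:
            continue
        for tld in tlds:
            host = f"www.{label}.{tld}"
            if host not in guesses:
                guesses.append(host)
    return guesses


print(guess_hosts(["Army"], tlds=("mil",)))
print(guess_hosts(["NRA", "National Rifle Association"], tlds=("org",)))
```

Trying the acronym before the full name keeps the list short, which matches the advice to start with the simplest guess.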

 

SUBJECT DIRECTORY STRATEGIES

 

Unfortunately, more than a basic guess is required for some organizations, especially those with multiple word names or ones that use common acronyms. The American Marketing Association has claimed www.ama.org, so the American Medical Association resides at www.ama-assn.org. Company names may combine various pieces. The Houghton Mifflin Company claims www.hmco.com while Federated Department Stores combines a part of their name and their acronym, for www.federated-fds.com. For cases such as these, and there are many, the next step is to try a subject directory.

 

Subject directories such as Yahoo! or LookSmart (http://www.looksmart.com) are primarily identified by their databases of selected resources classified by subject. Hierarchical in nature, the subject categories and subcategories present one of the best starting points for Internet exploration. However, it is their emphasis on including the main page for particular Web sites rather than all pages that makes them the next step in Net search strategies. For URLs that are difficult to guess, a quick search in a subject directory can often provide the main page for a specific organization. Yahoo! works especially well in this context, since its database includes a large number of top-level pages for businesses, organizations, government agencies, and educational institutions. For any well-known organization, a quick search in Yahoo! is likely to turn up its Web site.

 

Product Searches. Besides locating organizations, subject directories should also be the first step in a product information search. While the direct strategy of going to the company's Web site may work, searching a directory for product information is especially helpful when searching for a group of products or for products where the company's name is not known. Yahoo!'s extensive commercial section includes many links to product information, especially computer hardware and software pages.

 

Even beyond the obvious computer products area, under Business and Economy: Products and Services, Yahoo! has well over 7000 entries. While small manufactured products are not readily accessible by name within Yahoo!, the categorization can help identify manufacturers, distributors, and vendors of products ranging from beads to cars to real estate.

 

Broad Topics. Another subject directory search strategy is to use the directories for broad topics. The category hierarchies and the ability to search within the hierarchy to identify a relevant subject category make the directories easy to browse. The fact that they do not try to index the entire Internet, but rather include selected sites and usually just the main page for those sites, makes the directories an excellent starting point for a search. Patrons who have general topics or who have not yet narrowed their topic can be put into a directory to explore at least some of what is available on the Net.

 

Current Events. Similar to general topics, current events searches benefit when you look in a directory. Yahoo! sets up categories for hot news topics of general interest under News and Media: Current Events. Links point to related sections of Yahoo! and recent news stories and sites dedicated to the topic. Besides checking a directory, daily news is available from many sources on the Net, so a current events search should not be limited to just a subject directory. Also try one of the major national newspapers on the Net or take a broader approach with Excite's NewsTracker. Its coverage of the latest issues of over 300 online magazines and newspapers offers a known pool of content providers and categorized or keyword search access to the most current articles.

 

SEARCH ENGINE STRATEGIES

 

In this series of search strategies, the large automatically generated Web databases known as search engines are saved for very specific searches. Because the likes of HotBot and Excite include over 50 million pages in their databases, each indexed in full text, searches on these tools usually result in a very large number of hits. While the search engines are the first place to look for many novices, the other strategies and the directories provide quicker access to a relevant source than the search engines. Since these Web databases attempt to find and index as many pages as possible, and their automated spiders index pages far below the top-level page, these are extremely large databases that have millions of hits on common words.

 

For that reason, these large Web indexes are most effective for searches with very unusual keywords, for combining keywords, for using advanced features such as field searching and limiting, and for finding pages buried inside a Web site. The search engines can be used for tracking down top-level pages for organizations when neither guessing nor the subject directories help, but they require a different approach. Search for the organization's name as a phrase, but expect to see many subsidiary pages from the organization's Web site or links to it from sites other than the top-level page. That top-level page may be in the retrieval set, but it is often ranked low. However, simply trim a subsidiary page's URL back to the host name to determine the root URL for the main site.
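Working back from a subsidiary page to the main site is mechanical enough to show in code. The sketch below uses Python's standard urllib.parse module; the function name root_url and the example.com address are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def root_url(page_url):
    """Keep only the scheme and host of a URL, which points at the site's top-level page."""
    parts = urlsplit(page_url)
    # Replace the path with "/" and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/", "", ""))


# A hypothetical subsidiary page found in a search result list.
print(root_url("http://www.example.com/products/widgets/specs.html"))
```

The result is only a starting point: some organizations keep their main page in a subdirectory of a shared host, in which case the host's top-level page will belong to someone else.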

 

Unique keywords. Unique keywords, especially ones that uniquely identify a topic, are much better suited to large search engine queries. While even unique keywords may bring up many hits, certain kinds of keywords can very effectively narrow the search. Drug names, unique product names, CAS registry numbers, geographic locations, and people's names can all be effectively searched in these large databases when the unique keywords specifically identify the topic needed.

 

For example, biology searches that use a specific taxonomic name can be quite effective if the name is not very common. A search on drosophila finds thousands of hits due to the vast amount of information on the fly, so you may want to try that one in a subject directory. But a search on tetraodon brings up a much more manageable list. (Truncating it to include the family name in AltaVista results in about 125 hits.) The scientific name search is also effective, narrowing the search results to more relevant records, because only scientific pages are likely to use the taxonomic rather than the common name. In addition, since the puffer fish (tetraodon) has a common name consisting of common words, the taxonomic name more uniquely identifies the topic.

 

Phrase searching. The puffer example also demonstrates another important strategy when using the search engines: always use a phrase search if it is available. A search on the single word puffer finds thousands of pages, most irrelevant to tetraodon. Searching the phrase "puffer fish" results in about a tenth as many, with a higher precision rate. These results may be less scientifically oriented than those from the taxonomic name search, but the phrase search provides a way to broaden the search while still excluding most of the irrelevant sites. In AltaVista, Excite, and Infoseek, phrase searching is designated by surrounding the phrase with double quotes. In HotBot and OpenText, choose phrase searching from drop-down menus. Combining a phrase with other terms can narrow a search even further.

 

Field searching. Field searching is also an effective strategy for targeting search results in the large Web databases. AltaVista, Infoseek, and OpenText all support field searching. The most useful of the fields available are title and URL. For example, if you or a patron remember seeing a recall notice about a product from Defective Company, structure a search that searches for pages with the keyword recall and the URL www.defective.com. In the AltaVista simple search and Infoseek, the search statement would read +recall +url:www.defective.com. An easy way to narrow a search is to require the most important word to appear in the title.
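The phrase and field syntax described above can be assembled programmatically. The following Python sketch is illustrative only: the function build_query is hypothetical, the +term and url: forms come from the example statement above, and two details are assumptions, namely that a title: prefix works the same way as url: and that a quoted phrase can be marked as required with a leading plus sign.

```python
def build_query(required_terms=(), phrases=(), url=None, title=None):
    """Assemble a search statement in the +term / field:value style.

    required_terms: single words that must appear on the page.
    phrases: multi-word phrases, wrapped in double quotes.
    url, title: optional field restrictions (title: is assumed to mirror url:).
    """
    parts = [f"+{term}" for term in required_terms]
    # Assumption: a leading + may also be applied to a quoted phrase.
    parts += [f'+"{phrase}"' for phrase in phrases]
    if title:
        parts.append(f"+title:{title}")
    if url:
        parts.append(f"+url:{url}")
    return " ".join(parts)


print(build_query(["recall"], url="www.defective.com"))
print(build_query(phrases=["puffer fish"], title="tetraodon"))
```

The first call reproduces the recall search statement given above; the second shows a phrase combined with a title restriction.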

 

Limits. A similar strategy is to use the various limits available on the search engines. Both HotBot and AltaVista (in the advanced search) can limit search results by the date that specific pages were last updated. This is especially useful for current events searches. A query on "TWA flight 800" pulls up thousands of hits, even when entered as a phrase search. Limit to sites updated in the past few weeks for a much more current and much smaller set. (But do not forget to try this search on Excite's NewsTracker as well.) The geographic limits available on HotBot (and possible using AltaVista's domain field search) can also be useful for narrowing search results.

 

MULTIPLE-STEP STRATEGY

 

For the impatient, when the first few links seem irrelevant it becomes easy to give up on finding the answer on the Net. Even with the best search strategies, successful information retrieval on the Web usually requires multiple attempts. With the hypertext nature of the Web, the first set of results may not provide the answer, but they may link to other sites that do. Even the strangest page may provide a link to a reliable and authoritative information source.

 

The relevancy ranking algorithms employed by all the major Internet databases are intended to help bring the most relevant results to the top of a search set. With this kind of an approach (if it works), it would not matter if a search results in thousands of hits, since the most relevant show up first. Unfortunately, with the diversity and variable quality of Internet resources, the relevancy ranking algorithms often fail to lift the most relevant hits to the top. Try a multiple-step strategy: look for the best of the top ten or so hits, then see if those pages link to more relevant sites. Sometimes it takes going four or five levels deep to find a relevant site; continue digging deeper only if the pages are getting more relevant.
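The multiple-step strategy amounts to a depth-limited walk through links that continues only while relevance improves. The Python sketch below models that idea on made-up data. Everything in it is hypothetical: the function name dig, the small link table, and the numeric scores, which stand in for the searcher's own judgment of each page. No real pages are fetched.

```python
def dig(start_pages, links, score, max_depth=5):
    """Follow links from the starting hits, going deeper only while pages score higher.

    links: dict mapping a page to the pages it links to (hypothetical data).
    score: function returning a relevance number for a page (a stand-in
           for the searcher's judgment).
    Returns the pages kept, sorted from most to least relevant.
    """
    seen = set(start_pages)
    frontier = [(page, 1) for page in start_pages]
    while frontier:
        page, depth = frontier.pop()
        if depth >= max_depth:
            continue  # stop after four or five levels
        for target in links.get(page, []):
            # Keep digging only if the linked page is more relevant than its parent.
            if target not in seen and score(target) > score(page):
                seen.add(target)
                frontier.append((target, depth + 1))
    return sorted(seen, key=score, reverse=True)


# Made-up link structure and relevance judgments for illustration.
links = {"hit1": ["guide", "junk"], "guide": ["authority"], "junk": ["deeper-junk"]}
scores = {"hit1": 1, "guide": 3, "junk": 0, "authority": 5, "deeper-junk": 0}
print(dig(["hit1"], links, scores.get))
```

In this toy data the walk abandons the irrelevant branch immediately and reaches the authoritative page two links away from the original hit.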

 

One variation on this strategy, and a very important one, is the search for subject-specific Internet resource guides. A current, well-maintained, targeted subject guide can be one of the best approaches to finding a variety of sources in that subject area. While they are not available for all subject areas, and many of those that are available are out-of-date, it is still well worth it to look for them. Many of the best secondary (or tertiary, or...) links that can be found in the multiple-step strategy are these Internet resource guides. Two other places to look for these guides are the Argus Clearinghouse (www.clearinghouse.net) and Yahoo!, where they are listed under each subject area as an Index or in an Indices subcategory.

URL SURGERY: SLICING, DICING, AND CHOPPING

In all of the above strategies, a commonly encountered error message is "file not found." With the frequency of site reorganization and the movement of entire sites to new hosts, many links available on the Internet point to dead ends. In a well-managed site, the link will still exist and provide a pointer to the new site. For those cases where that does not apply, look closely at the URL and try the "surgical strategy."

 

The basics of this strategy are to try chopping off parts of the URL, starting on the right-hand side and stopping at every slash. A page formerly at http://www.stateu.edu/~jsmith/courses/smartstuff.html may have been renamed or moved to another directory, or J. Smith may have graduated and moved all HTML files to a commercial server. First try http://www.stateu.edu/~jsmith/courses/ to see if any files are still available in that directory. If that gets the error message, try http://www.stateu.edu/~jsmith/ to see if there is a main page for that individual or at least a pointer to where the files have moved.
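The chopping procedure is easy to express in code. The following Python sketch lists the successively shorter URLs to try, using the standard urllib.parse module; the function name chop_url is hypothetical. It only generates the candidates, leaving the searcher to try each one in a browser.

```python
from urllib.parse import urlsplit, urlunsplit


def chop_url(url):
    """List progressively shorter URLs, cutting at each slash from the right."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    candidates = []
    # Drop one trailing path segment at a time, ending at the top-level page.
    for keep in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:keep])
        if keep:
            path += "/"
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates


for candidate in chop_url("http://www.stateu.edu/~jsmith/courses/smartstuff.html"):
    print(candidate)
```

For the example address this yields the courses directory, then the individual's directory, then the host's top-level page, in that order.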

 

At a commercial URL such as http://www.towntimes.com/today/headline.html, you may need to try reconstructive surgery. From the URL, it looks like it pointed to a daily story that might now be archived at a different URL. From the top-level page http://www.towntimes.com/ look for a site search option or an archive section. In other cases, dropping out an extra directory name may find the missing file. The Web changes, but few pages are deleted entirely. Try some creative surgery on the URL to see if you can find a fugitive page.

 

These are just a few of the many possible Internet search strategies that can make Internet searching more efficient and accurate. They all fail on some queries, but at least they provide a theoretical starting point for conceptualizing the search process. New search tools or improvements to the current crop will bring new strategies. Meanwhile, try some of these strategies. Compare them with your own. Develop new approaches. Otherwise, strategies may need to change as quickly as the Internet itself.

 

Communications to the author should be addressed to Greg R. Notess, Montana State University Libraries, Bozeman, MT 59717-0332; 406/994-6563; align@montana.edu or http://imt.net/~notess/.

Copyright © 1997, Online Inc. All rights reserved.