Reading Summary and Analysis Discussion Post: Exploring Information Systems

Total pages: two or more


There are three readings for this assignment. Please read through each of them and write a summary and analysis according to the questions below for each reading. Please answer each question separately, numbered 1), 2), 3), and begin each reading's section with its title. Thank you so much!

  1. A brief summary of the key argument, problem, or issue
  2. Suggesting the significance of the piece (how it contributes to our understanding of this topic within our class’s broad study of human information interaction)
  3. Posing one or more questions that you would like to probe about this reading or any other combination of strategies to get the group discussion going

Introduction

Inherent in the concept of interactive information retrieval is the notion that we
interact with some search user interface (SUI) beyond the submission of an
initial query. Perhaps the most familiar SUI to many is the streamlined
experience provided by Google, but many more exist in online retail, digital
archives, within-website (vertical) search, legal records and elsewhere. Amazon,
for example, provides a multitude of different features that together make a
flexible, interactive and highly suitable gateway between users and products.

The aim of this chapter is to provide a framework for thinking about the
elements that make up different SUI designs, taking into account when and
where they are typically used.


Search: the way we usually see it
The SUI that many people now see daily is Google, and Figure 8.1 overleaf
shows the 14 notable SUI features it provides for users on its search engine
results page (SERP). The most common feature searchers expect to see is the
query box (#1 in Figure 8.1), which in Google provides a maintained context so
that the query can easily be edited or changed without going to the previous
page. Searchers are free to enter whatever they like, including special operators
that imply specific phrases or make sure certain words are not included. The
second most obvious feature is the display of results (#2), which is usually
ordered by how relevant they are to the search terms. Results typically highlight
how they relate to the search terms by showing parts in bold font. Users are
typically able to view additional results using the pagination control (#3).
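The special operators mentioned above, which match exact phrases or exclude certain words, can be illustrated with a toy parser. This is a sketch of the general idea (quoted phrases and '-' exclusions), not Google's actual query grammar:

```python
import re

def parse_query(query):
    """Split a query into required terms, quoted phrases and excluded
    terms, illustrating the '-word' and '"exact phrase"' operators."""
    phrases = re.findall(r'"([^"]+)"', query)      # text inside double quotes
    rest = re.sub(r'"[^"]+"', ' ', query)          # strip the phrases out
    excluded = [t[1:] for t in rest.split() if t.startswith('-') and len(t) > 1]
    required = [t for t in rest.split() if not t.startswith('-')]
    return {"terms": required, "phrases": phrases, "excluded": excluded}
```

For example, `parse_query('stone age -movie "queens of the stone age"')` separates the two plain terms, the one exclusion and the one phrase.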

Interfaces for information retrieval
Max Wilson

INTERACTIVE INFORMATION SEEKING, BEHAVIOUR AND RETRIEVAL

Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.29085/9781856049740.010
Downloaded from https://www.cambridge.org/core. The University of British Columbia Library, on 04 Feb 2022 at 23:34:23, subject to the Cambridge

We also see many control and modifier SUI features. Google provides fixed
options across the top (#4) and relevant options down the left (#5) for
specializing the search towards certain types of results. Further, Google allows
users to restrict their results (#6), or change how they are shown (#7). It is typical
for search engines to provide an advanced search to help define searches more
specifically (#8). Finally, most search engines provide recommendations for
related queries (#9).

Google also provides extra information, such as an indicator on the number
of results found (#10), and information about when you may have made an
error (#11). Finally, Google also provides personalizable features that are
accessible when signed in (#14), such as settings (#13) and information about
your prior searches (#12).

A starting framework for thinking about SUI designs
Broadly, we can break the elements of a SUI, like those discussed in the Google
example above, into four main groups:

• input features – which allow the user to express what they are looking for
• control features – which help users to modify or restrict their input
• informational features – which provide results or information about results
• personalizable features – which relate specifically to searchers and their
  previous interactions.

Figure 8.1 Fourteen notable features in the Google search user interface


These groups are highlighted in zones in Figure 8.2 (input as 1 and 8, control as
4, 5, 6, 7 and 9, informational as 2, 3, 10 and 11, personalizable as 12, 13 and 14),
and will be revisited throughout the chapter as other search interfaces provide
different features in these groups. Often new SUIs or SUI features innovate in
one of these groups. Finally, it is important to note that these groups can
overlap. Informational features are often modified by personalizable features,
for example, and some features can act as input, control and informational
features.
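As a concrete restatement of this zoning, the feature-to-group mapping from Figure 8.2 can be written as a simple lookup. The sketch below is illustrative only; the numbers are the feature labels from Figure 8.1:

```python
# Feature groups from Figure 8.2: each of Google's 14 numbered SUI
# features (Figure 8.1) assigned to one of the four framework groups.
FEATURE_GROUPS = {
    "input":          {1, 8},
    "control":        {4, 5, 6, 7, 9},
    "informational":  {2, 3, 10, 11},
    "personalizable": {12, 13, 14},
}

def group_of(feature_number):
    """Return the framework group for a numbered SUI feature."""
    for group, features in FEATURE_GROUPS.items():
        if feature_number in features:
            return group
    raise ValueError(f"unknown feature: {feature_number}")
```

Note that, as the text says, the groups can overlap in practice; a real classification might map one feature to several groups.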

Early search user interfaces
A brief early history

The roots of information retrieval systems are in library and information science.
In libraries, books are indexed by a subject-oriented classification scheme and
to find books we interact with the physical spaces, signposting, and librarians
within them. Yet the study of information retrieval was motivated by the
development of computers in the 1960s, which could automatically perform one
of the tasks that librarians do: retrieve a document (or book). The interface with
computers, however, was with punch cards at first, and then command lines
sometime after. Immediately, we can see the model kind of support we wanted
to provide to users (a librarian) but were so far limited by technology.


Figure 8.2 The Google SUI zoned by the different types of feature categories


Conversation and dialogue

Given the user interface limitations, and the influence of librarianship, some of
the initial SUIs were modelled around conversations or ‘dialogues’. In
analysing, for example, the roles, questions and answers that took place in
conversations between visitors and librarians (Winograd and Flores, 1986), early
researchers developed question and answer style SUIs. Figure 8.3 shows an
early command-line dialogue-style system introduced in the 1970s (Slonim,
Maryanski and Fisher, 1978), which tried to help users describe what they were
searching for. These SUIs typically asked the searchers for any information they
already had about what they wanted, so that when it came to performing the
search (which could take minutes, or even hours) it was more likely to return
the correct result.

This conversational style was analysed for some time, and was also influenced
by those interested in artificial intelligence and natural language processing. As
technology improved and results were returned faster, the emphasis of the
conversational perspective moved towards modelling a continued dialogue
over multiple searches within interactive information retrieval. The MERIT
system (Belkin et al., 1995), for example, was designed based on a much more
flexible, continuing, conversation model.

Browsing

Another early type of system, still using command-line interaction, supported
‘browsing’. Similar to the initial dialogue-based systems, browsing systems
like the 1979 BROWSE-NET (Palay and Fox, 1980), shown in Figure 8.4, presented
presented different modes to scan through databases and provided options for
different ways of accessing the documents. Again, we see these browsing style
systems appear over the course of interactive information retrieval design,
although in 1983 research identified that people ‘browsed’ less on the early
online newsgroups. Geller and Lesk (1983) hypothesized that this may have
been because people often knew more about what was in a fixed dataset than
in the oft-changing web collection we have now. Despite this hypothesis, we
later saw the rise of website directories, like the Yahoo! Directory. Directories,
while still available, were never as successful as web search engines, perhaps
providing evidence for Geller and Lesk’s hypothesis. More recently, we see
browsing interfaces appear within individual websites, as discussed further in
the discussion of faceted browsing in the section ‘Faceted metadata’.


Form filling

As SUIs became more directly interactive, with the onset of commercially
available graphical user interfaces in the early 1980s, the common paradigm we
see today of ‘form filling’ became more popular. This advanced the conversational
response SUIs, which took input over time from a series of questions, by providing
all the data entry fields spatially. Although ‘form filling’ includes normal
keyword searching, this technique allowed systems to present all the fields that
could be individually searched, in a way that we now commonly call an advanced
search. The EUROMATH system (McAlpine and Ingwersen, 1989), shown in
Figure 8.5, has a custom form highlighting all the fields that can be searched
individually or in combination.

Figure 8.3 An early command-line dialogue-style system (Slonim, Maryanski and
Fisher, 1978).
Copyright © 1978 ACM, Inc. doi>10.1145/800096.803134. Reprinted by permission.

Boolean searching

One advance in the algorithmic technologies was to process Boolean queries, so
that we could ask for information about ‘Kings OR Queens’, and get a more
comprehensive set about, in this case, monarchs. This technological advance
was made before the majority of SUI developments, as can be seen in Figure 8.5.
The advent of GUIs, however, provided an opportunity to help people construct
Boolean queries more easily and visually. The STARS system (Anick et al., 1990),
shown in Figure 8.6, allowed users to organize their query in a 2D space, where
horizontal space represented ‘AND’ joins, and anything aligned vertically was an
‘OR’ join. Like all these early ideas, Boolean searching is still prevalent in our
modern interactive information retrieval SUIs, including Google (see Figure 8.1);
the ‘-’ before a word is equivalent to a Boolean NOT, in this case.
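The set semantics behind Boolean operators can be sketched over a toy inverted index (the terms and document IDs below are invented for illustration): OR is set union, AND is set intersection and NOT is set difference:

```python
# Toy inverted index: term -> set of document IDs containing it.
index = {
    "kings":  {1, 2, 5},
    "queens": {2, 3, 5},
    "movie":  {3, 4},
}

kings_or_queens = index["kings"] | index["queens"]    # OR:  union
kings_and_queens = index["kings"] & index["queens"]   # AND: intersection
queens_not_movie = index["queens"] - index["movie"]   # NOT: difference
```

The OR query returns the more comprehensive set, exactly as in the ‘Kings OR Queens’ example above.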

Summary

The initial advances in information retrieval were typically made in
technological improvements. Consequently, these SUI advances in the early
days related mainly to the input SUI features, with the exception of some
advances (like browsing and form filling) which provided information about
the structure of the data, making them also contribute to the informational SUI
features. Other informational advances included simple highlighting in a result
where it matched the query, as shown in Figure 8.7, where the horizontal bar at
the bottom indicated where in a book any search terms appear. The onset of
GUIs meant that SUIs became more interactive, with Pejtersen’s fiction browser
(Pejtersen, 1989) presenting an explorable-world view of a bookshop, as shown
in Figure 8.8. Pejtersen’s fiction bookshop allowed users to browse the
bookshop using different strategies, where the figures shown are engaging in each
strategy. We were not yet, however, engaging in what we now call interactive
information retrieval, where we consider interactive information retrieval to be
the ongoing interaction over multiple searches to reach a goal, rather than the
single search that is still often considered in information retrieval.

Figure 8.4 An early browsing interface for databases (Palay and Fox, 1980).
Copyright © 1980 ACM, Inc. Reprinted by permission.

Figure 8.5 The EUROMATH interface (McAlpine and Ingwersen, 1989).
Copyright © 1989 ACM, Inc. doi>10.1145/75334.75341. Reprinted by permission.

Figure 8.6 The STARS system (Anick et al., 1990).
Copyright © 1990 ACM, Inc. doi>10.1145/96749.98015. Reprinted by permission.

Figure 8.7 Use of highlighting for terms that match a query (Teskey, 1988).
Copyright © 1988 ACM, Inc. doi>10.1145/62437.62481. Reprinted by permission.

The onset of modern interactive information retrieval SUIs

The onset of modern interactive information retrieval SUIs began around the
time that we first saw web search engines like AltaVista, but before Google
was launched. One of the first studies to demonstrate that there were
significant and specific benefits to interactive information retrieval, where users
actively engage in refining and submitting subsequent queries, was provided
by Koenemann and Belkin (1996). Using a query engine that was popular at
the time called INQUERY, Koenemann and Belkin built the RU-INQUERY SUI,
shown in Figure 8.9 (b). Searchers could submit a query in the search box at
the top left, and see a scrollable list of results on the right hand side. The current
query was then displayed in the box underneath the search box. The full text
of any selected result was displayed beneath the results on the right. The
RU-INQUERY interface had hidden, visible, and interactive relevance feedback
terms; the interactive terms provided the most effective support for users.

Figure 8.8 Pejtersen’s fiction bookshop (Pejtersen, 1989).
Copyright © 1989 ACM, Inc. doi>10.1145/75334.75340.
Reprinted by permission.


The experiment was built to leverage relevance feedback (discussed in Chapter
6, ‘Access Models’), which used key terms from the results marked as ‘relevant’,
using the check boxes, and added them to the search to get more precise results.
To demonstrate the benefits of interaction in information retrieval, three
alternative versions were developed:

• opaque – provided the typical relevance feedback experience that was
common at the time, where terms from the selected relevant documents
were added, but there was nothing in the SUI to display what those
additional terms were (Figure 8.9(b))

• transparent – provided a similar experience to the opaque version, except
that the added terms were made visible in the ‘current query’ box

• penetrable – allowed the users to choose additional terms from the relevant
documents; the keywords associated with the relevant documents were
listed in a separate box below the ‘current query’ box (Figure 8.9 (a)), and
could be added to the current query box manually.

While all three experimental versions provided improved support within a task-
based user study, the most interactive penetrable version provided statistically
significant improvements and did not significantly increase the time involved
in searching. When analysed according to the framework described in the
section ‘A starting framework for thinking about SUI designs’ above, this study
showed the initial value of having control SUI features that help people modify
and manipulate a search.
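The penetrable condition's flow, surfacing candidate terms from documents marked relevant so the searcher can add them manually, can be sketched as below. The stopword list and frequency scoring are simplistic stand-ins for whatever term selection INQUERY actually used:

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "in", "is", "to"}

def candidate_terms(relevant_docs, query, top_n=5):
    """List candidate expansion terms from documents marked relevant.
    In the 'penetrable' condition the searcher sees these terms and
    chooses which to add, rather than having them added silently."""
    counts = Counter()
    query_words = set(query.lower().split())
    for doc in relevant_docs:
        for word in doc.lower().split():
            if word not in STOPWORDS and word not in query_words:
                counts[word] += 1
    return [term for term, _ in counts.most_common(top_n)]

def expand_query(query, chosen_terms):
    """Add the user-chosen terms to the current query."""
    return query + " " + " ".join(chosen_terms)
```

The opaque and transparent conditions would instead call `expand_query` with the top-ranked terms directly, differing only in whether the added terms are shown to the user.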


Figure 8.9 The RU-INQUERY interface (Koenemann and Belkin, 1996): (a) penetrable
condition; (b) opaque condition.
Copyright © 1996 ACM, Inc. doi>10.1145/238386.238487. Reprinted by permission.



Modern search user interfaces and features

This section covers many of the more modern advances in SUI designs, and is
structured according to the framework described in the section ‘A starting
framework for thinking about SUI designs’. It begins by discussing input
features, before moving on to control, informational and personalizable features.

Input features

While there have been many technical advances in the processing of user
queries and matching them against documents, the plain white search box has
remained pleasingly simple. This section begins by examining the design of the
search box, before moving on to other input methods.

The search box

The search box pervades SUIs and searchers can feel at a loss when they do not
have a small white text field to spill their search terms into. The search box has
many advantages:

• Flexibility – It is extremely flexible (assuming the technology behind it is
well made), uses the searcher’s language and the searcher can be as
generic or specific as they like.

• An informational feature – As well as being primarily used as an input
feature, the search box can – and should – be used as an informational
feature. When not being used to enter keywords, the search box should be
informing the user of what is currently being searched for.

• The auto-complete function – This can help people avoid entering unproductive
search queries. By providing information to the user as they query, auto-
complete helps make the search box a better informational feature as well as
an input feature. Auto-complete can be rich with context, with the Apple
website providing images, short descriptions and even prices, as can be seen
in Figure 8.10(a). Furthermore, auto-complete can be personalizable, as with
Google in Figure 8.10(b), which shows queries the searcher has used before.

• Operators and advanced search – The keyword search box itself has only
really had minor visual changes, with some suggesting this may affect the
number of words people put in their query. Regardless, studies indicate
that searchers submit between two- and three-word queries (Jansen,
Spink and Saracevic, 2000; Kamvar et al., 2009), and around 10% of
searchers use special operators to block certain words or match explicit
phrases. Advanced search boxes, when implemented well, can help guide
people towards providing more explicit queries in the search box.
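The auto-complete behaviour described above, including ranking a searcher's own previous queries first as in the Google example, can be sketched with simple prefix matching. Real engines use far richer ranking signals; this is illustrative only:

```python
def autocomplete(prefix, history, suggestions, limit=5):
    """Suggest completions for a prefix, ranking the searcher's own
    previous queries (history) above general suggestions."""
    prefix = prefix.lower()
    personal = [q for q in history if q.lower().startswith(prefix)]
    general = [q for q in suggestions
               if q.lower().startswith(prefix) and q not in personal]
    return (personal + general)[:limit]
```

A richer informational variant, like the Apple example, would return objects carrying an image, description and price alongside each completion, rather than bare strings.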



(a) Apple – shows lots of contextual data; (b) Google – prioritizing previous searches

Query by example

There is a range of searching systems that take example results as the input.
One example commonly seen in SERPs is a ‘More Like This’ button, which
returns pages that are related to a specific page. While such buttons could also
be seen as control features, an example demonstrator called Retrievr lets
searchers sketch a picture and returns similar pictures. Similarly, services like
Shazam use recorded audio as a query to find music. Shazam and Retrievr are
explicitly query-by-example input features, while others can be seen as input
and/or control.

Adding metadata

While there have been some variations in how we enter information into a
search box, the alternative is typically to present useful and usable metadata to
the users as an input feature. The presentation and use of metadata in SUIs,
however, can be very hard to delineate in its contribution between input,
control, informational and personalizable features. Indeed, well designed use
of metadata can serve as a feature in each of these feature types. Presented on
the front page of a SUI, categories can, for example, allow the searcher to input
their query by browsing. If a searcher can filter their keyword search, or make
sub-category choices, then metadata can quickly become a control feature.
Further, if results are accompanied by how they are categorized, then metadata
can become an informational feature too; research has shown this to be popular
with searchers (Drori and Alon, 2003). Finally, it’s not beyond the realm of
possibility to highlight popular or previously used category options to make
them personalizable too.

Figure 8.10 Examples of auto-complete

Categories

Websites, including the Yahoo! Directory, often present high-level categories to
help users externalize what they are looking for. Several studies (Egan et al., 1989;
Dumais, Cutrell and Chen, 2001) have shown that categorizing results in SUIs
can help users to find results more quickly and more accurately. One key early
system called SuperBook, which automatically created a categorized index over
full-text documents, was shown to help people learn, as measured by quality of
short open-book essays (Egan et al., 1989). More recently, eBay and Amazon
provide searchers with higher level categories so that they can first define what
type of object they are looking for before browsing with richer metadata.

Clusters

One challenge for categories, especially for the whole web, is to categorize all
the data. Another approach, using clustering algorithms in the backend, is to
cluster results by key topics in their content. One early clustering system, called
Scatter/Gather, divided results into clusters of similar topics to highlight the
range of topics covered in a SERP. Evaluation of the Scatter/Gather approach
showed that searchers were easily and quickly able to identify groups of more
relevant documents compared with a standard SERP (Hearst and Pedersen,
1996).

A more recent system, Clusty (Figure 8.11), embodies a clustering method
that creates automatic hierarchical clusters based on the results that are
returned, but is primarily used as a control feature. Despite some studies
showing evidence that clusters help searchers to search (e.g. Turetken and
Sharda, 2005), research has suggested that well designed, carefully planned
metadata is better for SUIs than automatically generated annotations (Hearst,
2006a).
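A crude sketch of the clustering idea, far simpler than Scatter/Gather or Clusty (which cluster on full document content), groups results under a label drawn from their titles:

```python
from collections import defaultdict

def cluster_by_top_term(results, stopwords=frozenset({"the", "a", "of", "in"})):
    """Toy stand-in for result clustering: label each result by its
    first non-stopword title term and group results by that label."""
    clusters = defaultdict(list)
    for title in results:
        label = next((w for w in title.lower().split() if w not in stopwords),
                     "misc")
        clusters[label].append(title)
    return dict(clusters)
```

Even this toy grouping shows the benefit the Scatter/Gather evaluation found: an ambiguous result list is split into topical groups a searcher can scan at once.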

Faceted metadata

It has been popular to categorize results in multiple different ways, so that
searchers can express several constraints. Research has shown that, compared
with keyword search, faceted systems can improve search experiences in more
open-ended or subjective tasks, where no single right answer is available
(Stoica and Hearst, 2004). The popular Epicurious website, for example, allows
users to describe recipes that they would like by several types of categories
(called facets), including cuisine, course, ingredient and preparation method.
While the first selection in a facet acts as an input, subsequent selections in facets
act as refinements, and can thus be considered as control.

Figure 8.11 The Clusty system

Figure 8.12 The Flamenco interface

The Flamenco interface (Yee et al., 2003), shown in Figure 8.12, provides

several different categories (called facets), which can be used in combination to
define a query. It was used to demonstrate the value of faceted browsing and
represents the standard faceted SUI design. Many variations have been
designed since. Typically, a range of hierarchical or linear facets are provided,
and users can make selections in one or more of them. In Flamenco, used facets
are removed from view, so that remaining facets can receive more attention,
and selections are placed in a breadcrumb list of choices. Removing used facets
provides an effective approach for quickly narrowing results. Other systems
like mSpace (schraefel et al., 2006) leave facets in place to encourage exploration
by quickly changing and comparing decisions. mSpace provides an advanced
faceted SUI where the order of facets implies importance and gaps from left to
right are highlighted. Figure 8.13 shows that the two clips in the far right column
are from 1975 and 1974, which would not normally be conveyed in faceted SUIs.
mSpace (and iTunes) facets are only filtered in a left to right direction, and
highlights have been shown to help searchers learn and discover related items
in the remaining unused facets (Wilson, André and schraefel, 2010). Other
systems, including mSpace and eBay, permit multiple selections within single
facets, so, for example, searchers can see results that relate to two price brackets.

Faceted categories are typically used within fixed collections of results, such as
within one website (typically called vertical search), as there must be common
attributes across all the data to categorize them effectively. Although researchers
have tried to apply facets to general web search (Kules, Kustanowitz and
Shneiderman, 2006), Google does not typically provide faceted search, except in
Google Shopping. In the narrower space of searching for products, there are
common factors like price and shop that can be easily applied to all of the results.
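The combination logic of faceted search, with selections within one facet ORed together (as in the two-price-brackets example) and selections across facets ANDed, can be sketched as a filter. The recipe facets below echo the Epicurious example, but the data itself is invented:

```python
def filter_by_facets(items, selections):
    """Faceted filtering sketch: 'selections' maps facet name -> set of
    accepted values. Values within one facet are ORed (membership in
    the set); different facets are ANDed (all facets must match)."""
    return [item for item in items
            if all(item.get(facet) in values
                   for facet, values in selections.items())]
```

For example, selecting the cuisine ‘spanish’ plus the courses ‘main’ and ‘starter’ keeps every Spanish recipe that is either a main or a starter.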

Figure 8.13 The mSpace interface


More detailed literature is available on the design of faceted metadata
interactive information retrieval systems (Hearst, 2006b; Tunkelang, 2009).

Social metadata

The rise of social media websites has led to the popular use of socially generated
metadata, such as tags. Tag clouds are now familiar in many SUIs, with services
like Flickr allowing you to explore by popular tags. Further, the research
prototype called MrTaggy (Figure 8.14) allows users to search the web using
different types of tags, separating out adjectives and nouns, collected from
Delicious. Studies of MrTaggy suggest that searchers can explore and learn
more when tag clouds are available (Kammerer et al., 2009). Research has also
shown that building tags into the display of a SERP can help users to (control or)
change their search at later stages (Gwizdka, 2010). Other social metadata can
be used, but is often informational, such as what other searchers have found or
how results have been rated.
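The familiar tag-cloud presentation typically scales each tag's display size by its relative frequency. A minimal sketch (the pixel range is an arbitrary choice):

```python
from collections import Counter

def tag_cloud_sizes(tags, min_px=12, max_px=32):
    """Map each tag to a font size scaled linearly between min_px and
    max_px according to how often the tag occurs."""
    counts = Counter(tags)
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid dividing by zero when all counts tie
    return {tag: min_px + (n - lo) * (max_px - min_px) // span
            for tag, n in counts.items()}
```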

Control features

Control features can be considered particularly important from an interactive
information retrieval perspective, as they facilitate and control the ways in
which the search continues and thus make it interactive. Very recent research
has emphasized the importance of good control features, showing that the right
support can significantly enhance an interactive information retrieval task, but
the wrong support can distract or slow down searchers (Diriye, Blandford and
Tombros, 2009).


Figure 8.14 The MrTaggy prototype


Interactive query changes

One of the key ways that we continue to interact with a query is to alter and refine a search, known as interactive query expansion (IQE). IQE is discussed more extensively in Chapter 9, ‘Interactive Techniques’, but the aim is to suggest additional, or replacement, words to the searcher that might help the system return more precise results. If, for example, a user searches for ‘Queens of the stone age’, Google (in Figure 8.15(b)) returns a series of extensions to that query that might further define what the searcher is looking for. IQEs may vary in location and type. Continuing their philosophy on facets, Google presents IQEs at the bottom of the page, assuming that the user will only wish to refine their query if they don’t find what they want in the first ten results. Amazon provides fewer IQEs, positioned above the search results (Figure 8.15(c)). Bing12 defines itself as a ‘discovery engine’ and provides a collection of refinements (to narrow) and alternative searches (to change focus) permanently on the left side of the screen, as shown in Figure 8.15(a).

Figure 8.15 Interactive query suggestions (refinements and alternatives)
(a) Bing provides IQEs for every search on the left
(b) Google provides IQEs for some searches at the bottom of the page
(c) Amazon provides fewer IQEs above the search results
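The suggestion behaviour described above can be sketched as a lookup over a query log. This is a minimal illustration, not how Google or Bing actually generate suggestions; `QUERY_LOG` and `suggest_refinements` are hypothetical names invented for the example.

```python
# Sketch of interactive query expansion (IQE): suggest refinements for a
# query by matching it against a (hypothetical) log of past searches.

QUERY_LOG = [
    "queens of the stone age songs",
    "queens of the stone age tour",
    "queens of the stone age albums",
    "stone age tools",
]

def suggest_refinements(query, log=QUERY_LOG, limit=3):
    """Return logged queries that extend the current query."""
    q = query.lower().strip()
    # A refinement here is any logged query that starts with the user's
    # query but adds more terms, mirroring the extensions in Figure 8.15.
    return [entry for entry in log if entry.startswith(q) and entry != q][:limit]

print(suggest_refinements("queens of the stone age"))
```

Real engines mine such suggestions from millions of query sessions; the prefix match stands in for that ranking step.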

Corrections

One good rule for interactive information retrieval is to never let the user reach dead-ends where the interaction has to stop, and so another common form of control feature helps correct and notify users when they are heading down a bad path. In making two errors and searching for ‘Quens of the Stonage’, for example, services often suggest corrections or auto-correct the search. Bing (shown in Figure 8.16(a)) and Google often auto-correct errors, but provide a link to results for the exact query if they are confident that the corrected version is more likely. Amazon (shown in Figure 8.16(b)) instead states clearly that no results were found for the incorrect query, provides the correction, and shows example results for that spelling.
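A ‘did you mean?’ correction of the kind described above can be sketched with approximate string matching. This is a toy sketch using Python's standard `difflib`, not the statistical models production engines use; `KNOWN_QUERIES` is an assumed stand-in for a vocabulary mined from real query logs.

```python
# Sketch of a "did you mean?" correction: match a misspelled query
# against known queries using similarity from the standard library.
import difflib

KNOWN_QUERIES = ["queens of the stone age", "stone age", "queen"]

def correct(query):
    """Return the closest known query, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(query.lower(), KNOWN_QUERIES,
                                        n=1, cutoff=0.6)
    return matches[0] if matches else None

print(correct("quens of the stonage"))
```

The `cutoff` threshold is what lets the system stay quiet rather than suggest a bad correction, which matters for the no-dead-ends principle.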

Figure 8.16 Correcting errors in user queries to avoid dead ends
(a) Correcting in Bing
(b) Correcting in Amazon

Sorting

One method of helping searchers stay in control of what they are looking for is to allow them to decide how they want results to be ordered. Web search engines typically order results by how relevant they are to a query. It is common, in online retail, for example, to be able to change results to be ordered by price, either from most expensive to least expensive, or vice versa. Figure 8.17 (a)–(c) shows a range of ways in which products can be reordered. Scan.co.uk (Figure 8.17(d)) takes an approach that is more commonly seen in tabular views, where any column in the result set can be used to order results. Choosing a column orders the results by the metadata in that column. Rechoosing a column typically reverse-orders them.

Figure 8.17 Features for sorting search results
(a) Sorting options in Amazon
(b) Sorting options in Walmart
(c) Sorting options in Yahoo
(d) Tabular sorting in Scan.co.uk

Filters
Although not too dissimilar from using metadata-based input methods as a
control feature, as discussed above, SUIs often provide ways of filtering results.
Aside from entering special sub-domains (images, shopping and so on), Google
allows searchers to filter results to: recent results, results with pictures in them,
previously visited results, new results, and many more. These types of filters
are single web links that can be selected one at a time. It is becoming
increasingly popular to have dynamic filters like sliders and checkboxes (for
multiple selections). Originally studied in the early 1990s (Ahlberg and
Shneiderman, 1994), as shown in Figure 8.18, dynamic query filters that
immediately affect the display of results have been shown to help searchers
express their needs more quickly and effectively. We now see examples of
responsive dynamic filters, and other interactive features, on many modern
systems, including Globrix13 and Volkswagen.14
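The essence of dynamic query filters is that each widget contributes a predicate and the result list is recomputed the moment a widget changes. A minimal sketch, with invented product data and an assumed pair of widgets (a price slider and a minimum-rating control):

```python
# Sketch of dynamic query filters: each widget value narrows the results,
# and the display is recomputed immediately on every widget change.

results = [
    {"title": "A", "price": 20, "rating": 4},
    {"title": "B", "price": 55, "rating": 3},
    {"title": "C", "price": 80, "rating": 5},
]

def apply_filters(rows, price_max=None, rating_min=None):
    """Filter results; None means the corresponding widget is untouched."""
    out = rows
    if price_max is not None:                      # slider widget
        out = [r for r in out if r["price"] <= price_max]
    if rating_min is not None:                     # checkbox/stars widget
        out = [r for r in out if r["rating"] >= rating_min]
    return out

print([r["title"] for r in apply_filters(results, price_max=60, rating_min=4)])
```

In a live interface this function would be re-run on every slider drag, which is what gives searchers the immediate feedback Ahlberg and Shneiderman studied.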

Figure 8.18 Dynamic query filters (Ahlberg and Shneiderman, 1994). Copyright © 1994 ACM, Inc. doi>10.1145/191666.191775. Reprinted by permission.

Grouping

Another approach to reordering results is to group the results by their type.
Flamenco (Figure 8.12) can group results by any facet of metadata. Similarly,
the iTunes store organizes results by whether they are matched by artist, album,
or song title and so on. Searchers are then able to focus in on the results that
most closely match their intentions. The notions of grouping results and tabular
views, as a form of control feature discussed above, begin to branch into how
results are presented and so can also be considered as informational features.
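Grouping by a metadata facet, as the iTunes store does with artist/album/song matches, amounts to bucketing results by one field. The sample results and the `match` facet are invented for illustration:

```python
# Sketch of grouping results by a metadata facet, so searchers can focus
# on the group that best matches their intention.
from collections import defaultdict

results = [
    {"title": "Go With the Flow", "match": "song"},
    {"title": "Songs for the Deaf", "match": "album"},
    {"title": "Queens of the Stone Age", "match": "artist"},
    {"title": "No One Knows", "match": "song"},
]

def group_by_facet(rows, facet):
    """Return a dict mapping each facet value to its matching results."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[facet]].append(row)
    return dict(groups)

grouped = group_by_facet(results, "match")
print({k: len(v) for k, v in grouped.items()})
```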

Informational features
Although they can appear fairly simple, the way we present individual results
in a SERP has been very well researched, but not all of the ideas have stuck.
Web search results, as shown in Figure 8.19 typically include the title of the
result, a snippet of text from the result, and the URL for the result.

• Text snippets – These have been well studied. Informally, two lines are
typically chosen as the optimal balance of communicating useful
information and including as many results as possible above the first-scroll
point. More formal research (White, Jose and Ruthven, 2003) has
shown that snippets are most effective when they include the search
terms, which are usually also highlighted. The size of snippets rarely
varies, with early research showing that lengthening them had a
significant impact on their benefit (Paek, Dumais and Logan, 2004). Some
systems provide a ‘more’ button to extend a snippet, while Google
provides an option to change the default length for all snippets.

• Usable information and deep links – These provide users with shortcuts to make their search more efficient. It is common, for example, for online retailers to let searchers buy products directly from the SERP, or to jump directly to parts of individual pages, such as reviews. Such shortcuts are typically called deep links, and can be seen clearly on Google (Figure 8.20). They help searchers jump directly to key pages within more popular websites.

Figure 8.19 A typical example of a single result in a Google results page

• Images and thumbnails15 – These have been used at various times by most interactive information retrieval systems, but their benefit remains inconclusive. While critical for image search, and beneficial for visual content like products, they are rarely used in web search. One conclusion is that they aid revisitation and refinding (Teevan et al., 2009), but other research suggests that users can make accurate snap judgements about the quality and professionalism of websites (Zheng et al., 2009). When thumbnails are used, research recommends that they be approximately 200 pixels wide (Kaasten, Greenberg and Edwards, 2002). Further, the principle of thumbnail previews can stretch to other multimedia (schraefel et al., 2006).

• Immediate feedback – Another strong principle of SUI design is to provide
results as soon as possible, as the user may see their result and never need
to refine it. While this typically means returning results and control features
together after the first search, Google Instant16 takes this principle to the
extreme by providing results after the very first character is entered into the
search box.
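The snippet construction described in the first bullet above — a short extract containing the query terms, with the terms highlighted — can be sketched as follows. The `make_snippet` helper and its window/markup choices are invented for illustration, not any engine's actual algorithm:

```python
# Sketch of snippet generation: take a window of the document around the
# first query-term hit and highlight the terms, as SERPs typically do.
import re

def make_snippet(text, terms, width=80):
    """Return a short extract with query terms wrapped in <b>...</b>."""
    lower = text.lower()
    # Centre the window on the earliest query-term occurrence.
    first = min((lower.find(t.lower()) for t in terms if t.lower() in lower),
                default=0)
    start = max(0, first - width // 2)
    window = text[start:start + width]
    for t in terms:
        window = re.sub(re.escape(t),
                        lambda m: f"<b>{m.group(0)}</b>",
                        window, flags=re.IGNORECASE)
    return ("…" if start else "") + window

doc = "Queens of the Stone Age are an American rock band formed in 1996."
print(make_snippet(doc, ["rock", "band"]))
```

Choosing the window so that it contains the search terms is the property White, Jose and Ruthven (2003) found most effective.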

Visualizing relevance

Search systems rarely display explicit relevance values, as everyday searchers
will not know what they mean or how they are calculated, but interactive
information retrieval systems can communicate relevance in other ways. When
sorting by specific criteria, such as average rating or number of downloads, we
can clearly demonstrate these meaningful values in the way we display results.

We frequently see examples of bar-chart style indicators to communicate the
relevance of a result (Veerasamy and Belkin, 1996). Figure 8.21 overleaf shows
examples of bar-chart style relevance indicators found with Apple’s email
software and PDF reader. Such relevance judgements can be more specifically
helpful when results are sorted by a different dimension, as in Figure 8.21(a).
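A bar-chart relevance indicator of the kind shown in Figure 8.21 is just a mapping from a score to a fixed-width bar. A minimal text-mode sketch, assuming scores normalized to [0, 1]:

```python
# Sketch of a bar-chart relevance indicator: map a normalized score to a
# fixed-width bar next to each result, as in Apple Mail's relevance column.

def relevance_bar(score, width=10):
    """Render a relevance score in [0, 1] as a text bar."""
    filled = round(max(0.0, min(1.0, score)) * width)
    return "█" * filled + "░" * (width - filled)

for title, score in [("Result A", 0.9), ("Result B", 0.35)]:
    print(f"{relevance_bar(score)}  {title}")
```

As the text notes, such bars are most useful when results are sorted by some other dimension, so relevance remains visible even when it no longer drives the ordering.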

Figure 8.20 Deep links in Google

One early example of abstracted relevance, called TileBars (Hearst, 1995) (Figure 8.22 on page 161), tried to provide detailed information about the relevance of documents, and the parts within the document. Each row of a TileBar represents the elements of the query (three terms in this case). The length
of a TileBar, instead of representing an overall relevance like a bar chart,
represents the length of a document. Each horizontally aligned square represents
a section of the document. The colour intensity of each square indicates the
relevance of that section of the document to that term in the query. Consequently,
users get a sense of the size of the document, and the amount of it that is relevant
to the different elements of the search criteria. Although we do not see TileBars,
as originally designed, frequently in search systems, we see examples of the
TileBar principles in more modern systems. The experimental HotMap (Hoeber
and Yang, 2009) interface, shown in Figure 8.23, provides a rotated TileBar-style
view over the results of a search. Each darker square indicates the relevance of
a result to one individual term in the keyword search.
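The TileBar idea — one row per query term, one cell per document section, with intensity showing how strongly that section matches that term — reduces to computing a small matrix. A sketch using term frequency as the intensity, with invented document sections:

```python
# Sketch of the TileBar computation: for each query term and each document
# section, an intensity value (here, the term's frequency in that section).

def tilebar(sections, terms):
    """Return a row per term; each cell counts the term's hits in a section."""
    return [[section.lower().count(term.lower()) for section in sections]
            for term in terms]

doc_sections = [
    "osteoporosis affects bone density in older patients",
    "treatment options include calcium and exercise",
    "exercise programmes improve bone strength",
]
print(tilebar(doc_sections, ["bone", "exercise"]))
```

Rendering each count as a square of proportional colour intensity gives the original TileBar; rotating the matrix gives a HotMap-style view over many results.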

Figure 8.21 Examples of indicating relevance with a bar chart in Apple OS X
(a) Searching for email in Apple Mail
(b) Searching for terms in a PDF using Apple Preview

Figure 8.22 TileBars (Hearst, 1995). Copyright © 1995 ACM, Inc. doi>10.1145/223904.223912. Reprinted by permission.

Figure 8.23 The HotMap interface (Hoeber and Yang, 2009). Copyright © 2009 Wiley. Reprinted by permission.

The InfoCrystal system (Spoerri, 1993) (Figure 8.24 on page 162) provided a unique visualization for conveying relevance to different search terms. InfoCrystal creates a shape with as many corners as search terms. Additional smaller shapes, placed within
that shape, indicate how many results
are relevant to combinations of these
terms, where the number of sides and
their colours indicate which search
terms the documents are relevant to.
Using this system, which can be
hierarchically nested, searchers can
determine the set of results that are
relevant to different combinations of
search terms.

2D displays of results
Branching into the realms of information visualization, many systems have explored the use of horizontal and vertical
dimensions to present results. Most commonly we see search results displayed
as a grid (e.g. image search), where we expect (at least in the western world)
relevance to go left-to-right across rows, and to repeat for all subsequent rows.
This does not so much use two dimensions to present results as present one
dimension of relevance in a different layout.

Loosely organized 2D spaces
Embodying the idea of clustering, many 2D displays have been used to present
results spatially according to how relevant they are to each other. ‘Self-organizing
maps’ (SOMs) (Kohonen, 1990), for example, cluster results around key terms, as
shown in Figure 8.25. The layout is determined by the content of the documents,
and the positioning tries to find the optimal distance from the relevance scores
between documents. SOMs are based on neural network algorithms, and they
can create a layout that can look like a surface with mounds where there are many
similar results. Colour intensity, as with TileBars, is often used to create this effect.
SOMs can visualize large numbers of results, where the webSOM in Figure 8.25
is showing 1 million documents. The webSOM organizes newsgroups by how
they relate to each other. Searchers can zoom in on areas of the map and select
individual documents that are represented as nodes.

Similarly, there are several variations of graph-based visualizations, where the
emphasis in the visualization is on the connections between clusters, rather than
the clusters themselves.

Figure 8.24 Spoerri’s (1993) InfoCrystal interface. Copyright © 1993 IEEE. Reprinted by permission.

The ClusterMap visualization, shown in Figure 8.26 (on page 164), shows the cluster connections by highlighting the topics in the documents, and documents that share multiple topics. Figure 8.26 (overleaf) shows a ClusterMap generated by Aduna software.17 The graph shows results that are related to four categories, where 24 documents, for example, are about both Microsoft Word Documents and RDF.18

There are many alternatives to these examples but one limitation they all share
is that it can be hard to know where the results are. They are ordered by how they
relate to certain topics and each other, and this order may not be clear to the
person searching. Similarly, research into the layout of tag clouds showed that
predictable orders, such as alphabetical order, were important for searchers. One
alternative is to let users control and manipulate these 2D spaces.

Figure 8.25 The webSOM interface
(a) Overview level
(b) First zoom level
(c) Second zoom level

Searcher organized 2D spaces

In order to make 2D spaces more meaningful for users, research has investigated
search methods that are similar to the way we can view folders in Windows
and the Mac OS X operating systems. Research into the TopicShop SUI (Amento
et al., 2000), shown in Figure 8.27, found that users were better able to find useful
websites when they were able to manipulate documents in a 2D space. Further,
users were able to use spatial memory to remember where they had left
documents. Giving users control over such spaces allows them to make the
spaces personally meaningful.

Structured 2D spaces

When results are classified by categories and facets, these metadata can be used
to give a 2D space a structure. The TreeMap visualization (Shneiderman, 1992),
shown in Figure 8.28, shows results based on the hierarchy of categorization.
This TreeMap shows 850 files in a four-level hierarchy, where colours represent
different file types. Each level of the hierarchy is broken down by changing

INTERACTIVE INFORMATION SEEKING, BEHAVIOUR AND RETRIEVAL

164

Figure 8.26 The ClusterMap interface

Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.29085/9781856049740.010
Downloaded from https://www.cambridge.org/core. The University of British Columbia Library, on 04 Feb 2022 at 23:34:23, subject to the Cambridge

https://www.cambridge.org/core/terms

https://doi.org/10.29085/9781856049740.010

https://www.cambridge.org/core

165

WILSON • INTERFACES FOR INFORMATION RETRIEVAL

Figure 8.27 The TopicShop interface (Amento et al., 2000).
Copyright © 2000 ACM, Inc. doi>10.1145/354401.354771.
Reprinted by permission.

Figure 8.28 The TreeMap interface (Schneiderman,1992).
Copyright © 1992 ACM Inc. doi>10.1145/102377.115768.
Reprinted by permission.

Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.29085/9781856049740.010
Downloaded from https://www.cambridge.org/core. The University of British Columbia Library, on 04 Feb 2022 at 23:34:23, subject to the Cambridge

https://www.cambridge.org/core/terms

https://doi.org/10.29085/9781856049740.010

https://www.cambridge.org/core

divisions between columns and rows. The 2D space is divided among the top-
level categories, in proportion to the volume of results they include. Each of
these top-level groups is, in turn, divided among its sub-categories, where the
size of those is determined by the number of results they include. Where the
space is first divided into columns, each column is then divided into rows. The
process continues recursively. This process provides a top-down view of a
hierarchy, and the results within them. Colour can often be used to highlight
results within this layout, according to a different dimension, such as price.
Although circular versions of TreeMaps have been explored, they are generally
considered to be less efficient and less clear about volume than the original
design.19
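The recursive slice-and-dice layout described above — divide the space among categories in proportion to their counts, alternating between column and row splits — can be sketched directly. The nested-dict tree format and function names are invented for the example:

```python
# Sketch of the slice-and-dice TreeMap layout: split a rectangle among
# categories in proportion to their sizes, alternating the split direction
# at each level of the hierarchy.

def child_size(node):
    """Total result count under a node (leaf sizes summed recursively)."""
    if "children" in node:
        return sum(child_size(c) for c in node["children"])
    return node["size"]

def treemap(node, x, y, w, h, vertical=True):
    """Return (name, x, y, w, h) rectangles for the leaves of the tree."""
    if "children" not in node:
        return [(node["name"], x, y, w, h)]
    total = sum(child_size(c) for c in node["children"])
    rects, offset = [], 0.0
    for child in node["children"]:
        share = child_size(child) / total
        if vertical:   # split this level into columns...
            rects += treemap(child, x + offset * w, y, w * share, h, False)
        else:          # ...and the next level into rows
            rects += treemap(child, x, y + offset * h, w, h * share, True)
        offset += share
    return rects

tree = {"name": "root", "children": [
    {"name": "docs", "children": [
        {"name": "pdf", "size": 3},
        {"name": "txt", "size": 1}]},
    {"name": "images", "size": 4}]}
print(treemap(tree, 0.0, 0.0, 1.0, 1.0))
```

Colouring each leaf rectangle by a second dimension, such as file type or price, reproduces the highlighting described in the text.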

While TreeMaps provide a more structured layout, it can still be hard to
visually search through the results. To provide an even clearer view of a 2D
space, some visualizations allocate specific facets of results to each of the two
dimensions. If one dimension shows price, and another shows quality, then
users can easily identify the cheapest results that match a certain quality. The
GRIDL system (Shneiderman et al., 2000) (Figure 8.29) is an example that breaks
each dimension into groups, and lists results in the marked-out space that
match. Each dimension is browsable and/or filterable, so users can explore the
data at different granularities. The visualization within each square can be changed.

Figure 8.29 The GRIDL interface (Shneiderman et al., 2000). Copyright © 2000 ACM, Inc. doi>10.1145/336597.336637. Reprinted by permission.

Similarly, the Envision system (Nowell et al., 1996) allowed users to
set the horizontal axis, vertical axis, icon, shape, colour coding and labels to
represent one of many different dimensions in the data. Timelines and maps
also use dimensions that are common and meaningful to searchers.
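The GRIDL-style structured space amounts to binning results along two facets and listing what falls in each cell. A minimal sketch with invented product data and bin boundaries:

```python
# Sketch of a GRIDL-style structured 2D space: bin results by two facets
# and collect the results that land in each cell of the grid.
from collections import defaultdict

results = [
    {"title": "A", "price": 15, "rating": 5},
    {"title": "B", "price": 70, "rating": 5},
    {"title": "C", "price": 70, "rating": 2},
]

def grid(rows, x_facet, x_bins, y_facet, y_bins):
    """Map (x_bin, y_bin) labels to the result titles in that cell."""
    def bin_label(value, bins):
        for top, label in bins:   # bins: list of (upper_bound, label)
            if value <= top:
                return label
        return bins[-1][1]
    cells = defaultdict(list)
    for r in rows:
        key = (bin_label(r[x_facet], x_bins), bin_label(r[y_facet], y_bins))
        cells[key].append(r["title"])
    return dict(cells)

price_bins = [(50, "cheap"), (200, "pricey")]
rating_bins = [(3, "low"), (5, "high")]
print(grid(results, "price", price_bins, "rating", rating_bins))
```

With price on one axis and quality on the other, the cheapest results of a given quality sit in one easily scanned cell, which is exactly the benefit claimed in the text.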

3D displays of results

Naturally, in line with the progression in technology, many visualizations have
been extended into 3D versions in order to increase the number of results that
can be displayed. 3D perspectives have been added to SOMs, and the
DataMountain (G. Robertson et al., 1998), shown in Figure 8.30, adds a third
dimension to the user-controlled spatial layouts. The BumpTop software,20 now
acquired by Google, provides a 3D layout and mock-physics to allow users to
organize files on the desktop. Further, the exploration of a hierarchy was extended
in a 3D space in a design called ConeTrees (G. Robertson, Mackinlay and Card,
1991), as shown in Figure 8.31 (overleaf). ConeTrees were later used in the Cat-a-Cone prototype to enrich the 3D display with multiple simultaneous categorizations, where any found results would be highlighted in multiple places across the full context of the hierarchy (Hearst and Karadi, 1997).

Figure 8.30 The DataMountain interface (G. Robertson et al., 1998). Copyright © 1998 ACM, Inc. doi>10.1145/288392.288596. Reprinted by permission.
There have also been examples of stars-in-space style visualizations, which users
can explore in 3D. While many of these ideas are exciting, they have rarely been
shown to provide significant improvements to searchers (e.g. Cockburn and
McKenzie, 2000). Essentially, the overhead of manipulating and navigating through a 3D space currently outweighs the benefits of adding the third dimension. 3D visualizations also typically involve overlap and occlusion.
Further, some research has shown that around 25% of the population finds it
hard to perceive 3D graphics on a 2D display (Modjeska, 2000). Finally, the
technology to deliver 3D environments on the web is still somewhat limited.
Research, however, continues into 3D environments and 3D controls, and so it
is not beyond the realm of possibility that we will see more common use of 3D
result spaces in future systems.

Additional informational controls

So far this section on informational features has mainly discussed the way in
which we can display results. There are many other secondary informational
features that can convey additional information, guide and support searchers
in a SUI:

• Guiding numbers – These can help searchers maintain awareness of how
they are narrowing down results; the pagination and number of results in
Google convey whether the user is going in the right or wrong direction.
Similarly, numbers can inform searchers as to how many results are in a
category, and thus help them make better browsing decisions.

Figure 8.31 The ConeTrees interface (G. Robertson, Mackinlay and Card, 1991). Copyright © 1991 ACM, Inc. doi>10.1145/108844.108883. Reprinted by permission.

• Zero-click information – Web search engines provide zero-click information, from converting currency to displaying local movie times, which guides searchers and can help them be more efficient.

• Signposting – Signposting such as breadcrumbs (e.g. Flamenco in Figure
8.12) makes the status of a current search clear to users. Further,
interactive breadcrumbs allow users to control their search by jumping
back to previous searches.

• Simple animations21 – These can help guide users towards change
(Baudisch et al., 2006). When hovering over a new notification in
Facebook, for example, its colour highlights and fades to show that it has
been addressed. Animation is especially important in conveying change
in 3D SUIs (G. Robertson et al., 2002).

Personalizable features
Personalizable features of SUIs tailor the search experience to the searcher,
either by their actions or by those of other searchers who are personally related
by some social network. Although these features have so far typically affected
the content and the display of informational and control features, there are
many specific features designed for personalization. For example, shopping
carts provide searchers with a space to collect search results proactively, while
many systems automatically track and display recent searches.

Much more could be said about individual and social personalization, but these
two topics are covered in much more detail in Chapter 10, ‘Web Retrieval, Ranking
and Personalization’, and Chapter 11, ‘Recommendation, Collaboration and Social
Search’. It is important to remember that such influences can pervasively manifest
themselves in SUIs as dedicated functionalities and within existing features.

Summary
This chapter has focused on describing SUIs and SUI features. It began by
breaking down the search features involved in the current Google SUI, and
categorizing the features into four groups: input, control, informational and
personalizable features. It then described examples of the four categories of SUI
features. Input features enable searchers in describing what they are searching
for. Beyond searching with keywords on the web, most approaches strive to
provide metadata to searchers that can be recognized and selected. Control
features help us to manipulate our searches, either by refining what we are
looking for, or by filtering the results to be more specific. Informational features
provide results to searchers in many different forms, but can also help the
searcher be aware of what they have done and can do next. Finally, personalizable features are designed to impact the preceding types of features so they mean more to the individual searcher.
The aim of this chapter has been to introduce SUIs and SUI features, and discuss how they are designed. The goal was to provide a framework for thinking about
designs. Readers should think about the users first, and then about the technology that will help them design the right interface. When assessing or evaluating existing SUIs, readers
should think about the different features and the one or more categories they fall
into. When designing new SUIs, readers should think about the range of support
they are providing with the combination of features being included. Further,
readers should consider whether features can be extended or improved so that
they fall into more than one category of feature. More detailed surveys of SUIs are
available in dedicated and more detailed books (e.g. Hearst, 2009; Morville and
Callender, 2010; M. L. Wilson et al., 2010). The next chapter continues to help us
think about interactive information retrieval systems by focusing more on the
theory and models we have about searchers and searching behaviours.

Notes
1 Alternatives to ranking results by relevance are discussed later in this chapter.
2 Graphical user interfaces are typically defined as involving icons and windows and as allowing mouse-based input, which differentiated them from earlier systems that involved command lines or keyboard-based navigation of text user interfaces.
3 See www.altavista.com.
4 See http://labs.systemone.at/retrievr/.
5 See www.shazam.com/.
6 See www.clusty.com.
7 See http://flamenco.berkeley.edu/.
8 See http://mspace.fm/.
9 See www.google.com/products.
10 See www.flickr.com/photos/tags/.
11 See http://mrtaggy.com/.
12 See www.bing.com.
13 See www.globrix.com/.
14 See www.volkswagen.co.uk/new/.
15 Thumbnails are small screenshots of the website, as in a picture of the website, rather than a picture from it.
16 See www.google.com/instant/.
17 See www.aduna-software.com/.
18 RDF is a data markup language for the Semantic Web that adds relationships to XML.
19 See http://lip.sourceforge.net/ctreemap.html.
20 See www.bumptop.com.
21 Animation should be used carefully and purposefully in SUI design.


1 Introduction to Library Databases

As they said in the Sound of Music, “let’s start at the very beginning.” In
this case, a definition and some history, leading up to the current state of
the database industry and its major players. Last, we’ll go over the recent
development known as “Discovery Services” and explain why those systems
are outside the scope of this book.

Electronic access to information by means of the Web is so pervasive
that we take it for granted. You have undoubtedly already used library data-
bases somewhere in your academic life, and either heard or tossed the word
“database” around yourself. But where did these “databases” come from?
Why are they important? What is a database, anyway? Let’s address that
last question first, and then find out where they came from.

What Is a Database?

The Oxford English Dictionary defines “database” as: “A structured set
of data held in computer storage and typically accessed or manipulated by
means of specialized software.” (So much for not using any part of a word
in its definition.) For “data,” let us substitute “information.” A database is
a way to structure, store, and rapidly access huge amounts of information
electronically. That “information” can be numerical or textual, even visual.
And as the Encyclopedia of Computer Science (2003) notes: “An important
feature of a good database is that unnecessary redundancy of stored data
is avoided.” The key concepts are structure (an organized way to store the
information, accomplished by tables, records, and fields, which are discussed
later in this chapter), efficiency (no redundancy), and rapid access (the abil-
ity to search and retrieve material from the database as quickly as possible).

Copyright 2015. Libraries Unlimited. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.

EBSCO Publishing: eBook Collection (EBSCOhost), printed on 2/4/2022 6:34 PM via UNIVERSITY OF BRITISH COLUMBIA. AN: 1197818; Bell, Suzanne S. Librarian’s Guide to Online Searching: Cultivating Database Skills for Research and Instruction, 4th Edition.

As my husband the computer scientist puts it, “a database isn’t magic but it
is pretty smart.”
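As a concrete illustration of those three key concepts (structure, efficiency, and rapid access), here is a minimal sketch using Python’s built-in sqlite3 module. The table name, fields, and sample articles are all invented for illustration, not drawn from any real database product.

```python
import sqlite3

# Structure: a table with named fields; each row is one record.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE articles (
        id      INTEGER PRIMARY KEY,  -- unique identifier for each record
        title   TEXT,
        journal TEXT,
        year    INTEGER
    )
""")

# Efficiency: each article is stored exactly once, with no duplication.
conn.executemany(
    "INSERT INTO articles (title, journal, year) VALUES (?, ?, ?)",
    [("Polar Expedition Logistics", "Arctic Studies", 1902),
     ("Sled Dogs of the North", "Arctic Studies", 1903)],
)

# Rapid access: an index on the journal field lets the database jump
# straight to matching records instead of scanning every row.
conn.execute("CREATE INDEX idx_journal ON articles (journal)")
rows = conn.execute(
    "SELECT title FROM articles WHERE journal = ? ORDER BY year",
    ("Arctic Studies",),
).fetchall()
print(rows)  # [('Polar Expedition Logistics',), ('Sled Dogs of the North',)]
```

The index is exactly the kind of "specialized software" shortcut the OED definition alludes to: the data is organized so that retrieval does not require reading everything.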

Is Google a Database?

Absolutely, in the sense that Google is a vast collection of data (the con-
tents of web pages and material linked to web pages) that is searchable and
provides rapid access to results. But Google and similar web search engines
are not the focus of this textbook. These search tools build their databases
automatically, from material that is freely accessible on the web, and their
structure, scope, size, and many other aspects are not obvious. As far as one
can tell, there is no quality control, no human intervention involved in build-
ing the database.1

The commercial and governmental databases considered in this text are
products specifically crafted to achieve the goal of providing users access to
formally published information (e.g. articles, conference papers, books, dis-
sertations, reports), in a very organized and efficient fashion. (In the case of
commercial databases, part of that crafting is a mechanism for limiting access
to paid subscribers.) Let us refer to these as “library databases,” since that is
where you usually encounter them. Library databases tend to be targeted to
specific audiences, and to offer customized features accordingly. Their struc-
ture, scope, size, date coverage, publication list and many other details are
either obvious from their search interfaces, or explicitly provided. (I would
like to say that library databases are more structured than Google, but I can’t.
Because who knows how Google is structured or how its search algorithm re-
ally works? The “black boxness” of Google is another thing that distinguishes
it from the databases that are the focus of this text.) Library databases are
much less well known and ubiquitous than Google, and usually not free, but
there are good reasons for that. Finding out where these databases came from
should help explain why (as the adage goes, “you get what you pay for”).

Historical Background

Indexing and Abstracting Services

In the Beginning . . .

There was hard copy. Writers wrote, and their works were published
in (physical) magazines, journals, newspapers, or conference proceedings.
Months or years afterward, other writers, researchers, and other alert read-
ers wanted to know what was written on a topic. Wouldn’t it be useful if
there were a way to find everything that had been published on a topic, with-
out having to page through every likely journal, newspaper, and so forth?
It certainly would, as various publishing interests demonstrated: as early
as 1848, the Poole’s Index to Periodical Literature provided “An alphabeti-
cal index to subjects, treated in the reviews, and other periodicals, to which
no indexes have been published; prepared for the library of the Brothers in
Unity, Yale college” (Figure 1.1).


Not to be caught napping, the New York Times started publishing their
Index in 1851, and in 1896, taking a page from Poole’s, the Cumulative In-
dex to a Selected List of Periodicals appeared, which soon (1904) became
the canonical Readers’ Guide to Periodical Literature. Thus, in the mid-19th
century, the hard copy Index is born: an alphabetical list of words, represent-
ing subjects, and under each word a list of articles deemed to be about that
subject. The index is typeset, printed, bound, and sold, and all of this effort
is done, slowly and laboriously, by humans.

Given the amount of work involved, and the costs of paper, printing, etc.,
how many subjects do you think an article would have been listed under?
Every time the article entry is repeated under another subject, it costs the
publisher just a little more. Suppose there was an article about a polar expedi-
tion, which described the role the sled dogs played, the help provided by the
native Inuit, the incompetence on the part of the provisions master, and the
fund-raising efforts carried on by the leader’s wife back home in England. The
publisher really can’t afford to list this article in more than one or two places.
Under which subject(s) will people interested in this topic be most likely to
look? The indexer’s career was a continuous series of such difficult choices.

Figure 1.1. Index listing and title page (inset) from Poole’s Index to Periodical Literature. Courtesy of the Department of Rare Books and Special Collections, University of Rochester Libraries, Rochester, NY.

An index, recording that an article exists and where it would be found,
was a good start, but one could go a step further. The addition of a couple of
sentences (e.g., to give the user an idea of what the article is about) increases
the usefulness of the finding tool enormously—although the added informa-
tion, of course, costs more in terms of space, paper, effort, etc. But some index
publishers started adding abstracts, gambling that their customers would
pay the higher price (which they did). Thus, we have the advent of Abstract-
ing and Indexing services or “A & I,” terminology that you may still see in
the library literature.

The abstracts were all laboriously written by humans. They needed to
be skilled, literate humans, and skilled humans are very expensive (even
when they’re underpaid, they are expensive in commercial terms). Humans
are also slow, compared with technology. Paper and publishing are expensive,
too. Given all this, how many times do you think an entry for an article would
be duplicated (appear under multiple subjects) in this situation? The an-
swers are obvious; the point is that the electronic situation we have today is
all grounded in a physical reality. Once it was nothing but people and paper.

From Printed Volumes to Databases

Enter the Computer

The very first machines that can really be called digital computers were
built in the period from 1939 to 1944, culminating in the construction of the
ENIAC in 1946, “the first general-purpose, electronic computer” (Encyclopæ-
dia Britannica Online 2014). These machines were all part of a long pro-
gression of innovations to speed up the task of mathematical calculations.
After that, inventions and improvements came thick and fast: the 1950s and
1960s were an incredibly innovative time in computing, although probably
not in a way that the ordinary person would have noticed. The first ma-
chine to be able to store a database, RCA’s Bizmac, was developed in 1952
(Lexikon’s History of Computing 2002). The first instance of an online trans-
action-processing system, using telephone lines to connect remote users to
central mainframe computers, was the airline reservation system known
as SABRE, set up by IBM for American Airlines in 1964 (Computer His-
tory Museum 2004). Meanwhile, at Lockheed Missile and Space Company, a
man named Roger Summit was engaged in projects involving search and re-
trieval, and management of massive data files. His group’s first interactive
search-and-retrieval service was demonstrated to the company in 1965; by
1972, it had developed into a new, commercially viable product: Dialog—the
“first publicly available online research service” (Dialog 2005).

Thus, in the 1960s and 1970s, when articles were still being produced
on typewriters, indexes and abstracts were being produced in hard copy, and
very disparate industries were developing information technologies for their
own specialized purposes, Summit can be credited with having incredible
vision. He asked the right questions:


1. What do people want? Information.

2. Who produces information, and in what form? The government
and commercial publishers, in the form of papers, articles, news-
papers, etc.

3. What if you could put information about all that published mate-
rial into a machine-readable file: a database—something you could
search?

Summit also had the vision to see how the technological elements could
be used. The database needed to be made only once, at his firm’s headquar-
ters, and trained agents (librarians) could then access it over telephone lines
with just some simple, basic equipment. The firm could track usage exactly
and charge accordingly. Think of the advantages!

The advantages of an electronic version of an indexing/abstracting sys-
tem are really revolutionary. In a system no longer bound by the confines of
paper, space, and quite so many expensive skilled personnel:

• Articles could be associated with a greater number of terms describing
their content, not just one or two (some skilled labor is still required).

• Although material has to be rekeyed (i.e., typed into the database),
this doesn’t require subject specialists, simply typists (cheap labor).

• Turnaround time is faster: most of your labor force isn’t thinking
and composing, just typing continuously—the process of adding to
the information in the database goes on all the time, making the
online product much more current.

• If you choose to provide your index “online only,” thus avoiding the
time delays and costs of physical publishing, why, you might be able
to redirect the funds to expanding your business: offering other in-
dexes (databases) in new subject areas.

As time goes on, this process of “from article to index” gets even faster.
When articles are created electronically (e.g., word processing), no rekeying
is needed to get the information into your database, just software to convert
and rearrange the material to fit your database fields. So, rather than typ-
ists, you must pay programmers to write the software, and you still need
some humans to analyze the content and assign the subject terms.

In the end, the electronic database is not necessarily cheaper to create;
it very likely costs more! The costs have simply shifted. But customers buy
it because . . . it is so much more powerful and efficient. It is irresistible, and
printed indexes have vanished like the dodo. Online library databases are
an integral part of the research process.

The Library Database Industry Today

For a line of business and a product you probably weren’t very aware
of until you were in high school or college, the library database business is,
for the moment, surprisingly robust. The juggernaut of Google and especially
Google Scholar has not put the commercial database vendors out of
business (yet—I’m sure there is a constant undercurrent of fear throughout
the business). Probably the largest commercial vendors, and ones you might
have heard of before reading this book, are EBSCO, ProQuest, and Gale
(Gale Cengage). Other major names to add to your repertoire are Thomson-
Reuters (creators of the Web of Science and many other databases), JSTOR,
LexisNexis, OCLC FirstSearch, ABC-CLIO, Alexander Street Press, Project
MUSE, and OVID. Most of the databases produced by these vendors have
content drawn from many sources, many publishers: they aggregate content,
bringing it together so you can search across all of it in one database. Thus
the term “aggregators” is often used to describe the multidisciplinary
article databases from the vendors listed above. In contrast, major
publishers such as Elsevier, Oxford University Press, and Sage Publications
are big enough to create databases just of the materials they publish, for ex-
ample: Elsevier’s ScienceDirect database, Oxford Music Online, Sage Jour-
nals and the CQ databases (an imprint of Sage).

In addition to the commercial entities mentioned above, some profes-
sional associations create and manage the subscriptions to databases of
their materials. Examples include the Association for Computing Machin-
ery (ACM), the Institute of Electrical and Electronics Engineers (IEEE), the
American Society of Mechanical Engineers (ASME), the American Math-
ematical Society (AMS), and the American Chemical Society (ACS). Govern-
ment and international agencies also produce databases. US government
agencies such as the National Library of Medicine, the Department of Edu-
cation, the Census Bureau, and the Bureau of Labor Statistics are the au-
thors of key databases in their respective topical areas, which we will cover
in subsequent chapters. At the international level, the World Bank,2 the In-
ternational Monetary Fund, and the Organization for Economic Cooperation
and Development (OECD) all offer databases of their information.

The names of library database vendors listed above represent only the
largest and/or better-known entities. As in any line of business there are,
of course, many more companies, either smaller or focused on a particular
audience (the number of vendors that create databases specifically for the
business community, both corporate and academic, is remarkably exten-
sive). The database vendor industry is also a business like any other: it
is subject to consolidation and occasionally to expansion. Companies come
and go through mergers and acquisitions, start-ups, and occasional deaths.
Changes may not happen as rapidly as in some industries, but when they
do, they can be significant. Three of the notable changes in the current
decade were EBSCO’s acquisition of the H.W. Wilson databases, and two
major moves by ProQuest: the acquisition of the CSA databases and taking
over publication of the Statistical Abstract of the United States from the US
Census Bureau, including putting all the Statistical Abstract content into
a new database.

At the beginning of this section, I made a reference to the database
vendors’ (not to mention librarians’) fears about Google and Google Scholar:
that these free, ubiquitous, embedded-in-daily-life resources might spell
the end of the library database business. The vendors have been fighting
back for many years, however, first with something called “federated search”
(about which the less said the better; the title of Jody Fagan’s 2011 editorial
on the topic says it all: “Federated Search Is Dead—and Good Riddance!”).
The latest counter-attack by the database vendors, dubbed “Discovery Ser-
vices” or “Web Scale Discovery Services,” is far superior. Reports on usage
statistics from institutions that have adopted a discovery service indicate
that these products may have a strong chance of winning ground back from
the all-mighty Googleplex (Way 2010, Kemp 2012, Daniels, Robinson, and
Wishnetsky 2013, Calvert 2014).

The following section will provide a brief overview of discovery services,
concluding with why they will not be considered further in this text.

Discovery Services

Discovery Services are systems that harvest and pre-index a wide vari-
ety of library content from separate sources (records from library databases,
the online catalog, perhaps the local institutional repository or other locally
developed databases), build one giant index of all that content, and provide
near-instant, relevancy ranked results through one search box (Vaughan
2011, Adams et al. 2013). Sound familiar? It is exactly the Google model, but
instead of web pages it draws on all the vetted and expensive resources for
which the library has already paid, making them “discoverable.” These sys-
tems are frequently referred to as Web-Scale Discovery Services, “meaning
they search library collections the way Google searches the web: by search-
ing the entire breadth of content available in the library’s collection” (Fry
2013). (The “entire breadth of content” is at least the goal if not the reality
right now.) The essential key is the pre-indexing, getting the data from all
the disparate resources ahead of time, as it were, to build that one giant
index that can provide the speedy response time that users expect. Where
the discovery systems start to part ways with Google is on the results page,
which is loaded with options for refining and outputting results, and where
library-owned full text is instantly accessible.
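The harvesting and pre-indexing the author describes can be sketched in a few lines of Python. All of the source names and records below are invented for illustration; real discovery services do this at enormous scale and add relevance ranking on top.

```python
# A toy sketch of the pre-indexing idea behind discovery services:
# records are harvested from separate silos ahead of time and merged
# into one combined index, so a single search box can query everything.
catalog = [{"id": "cat1", "title": "History of Polar Exploration"}]
article_db = [{"id": "art1", "title": "Polar Sled Dogs"},
              {"id": "art2", "title": "Inuit Guides on Polar Expeditions"}]
repository = [{"id": "rep1", "title": "Thesis on Arctic Provisioning"}]

# Harvest and pre-index: build one inverted index over all sources.
combined_index = {}  # word -> set of record ids
records = {}
for source in (catalog, article_db, repository):
    for rec in source:
        records[rec["id"]] = rec
        for word in rec["title"].lower().split():
            combined_index.setdefault(word, set()).add(rec["id"])

# One search box: the query hits the pre-built index, not each silo in turn,
# which is what makes the near-instant response time possible.
hits = combined_index.get("polar", set())
print(sorted(hits))  # ['art1', 'art2', 'cat1']
```

Because all the indexing work happens before any user types a query, the search itself is a fast lookup, which is the "essential key" the text identifies.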

The vendors and products in the discovery service market at the time of
this writing are EBSCO Discovery Service (EDS), Serials Solutions’ Summon
(note that Serials Solutions is owned by ProQuest), Ex Libris’ PrimoCentral,
OCLC’s WorldCat Discovery Services (WDS), and, though it works differently
from the others, Innovative Interfaces’ Encore Synergy. AquaBrowser
is ProQuest’s discovery product aimed at the public library market.

The tricky part is that these companies are competitors both in the
individual database and now in the discovery service market. Their major
customers (large academic libraries) have resources from a wide variety of
vendors. Achieving the goal of providing “one search” access to all that con-
tent means that each discovery service company (A) must persuade the com-
peting discovery service companies (B), and all the other database vendors
(C), to give A access to their databases in order to harvest and pre-index
the data therein. This is a delicate dance, as you might imagine, but again,
the threat of Google is actually helping, and agreements are (carefully) be-
ing negotiated. From a customer’s point of view, it’s obvious: “You’ve got to
be in,” says Michael Kucsak, Director of Library Systems & Technology at
the University of North Florida, talking about inter-vendor discoverability.
“You’re in—you win. You’re out—you’re not long for the world” (Fry 2013).

Discovery services hold immense promise for breaking down the silos in
library content, especially the one between the library catalog (OPAC) and
the article databases (each one of which is in its own silo). While students
may eventually understand that the catalog and the databases are separate,
and that the routes for accessing each one are different, for the casual user
who needs 15 good articles for tomorrow’s paper—it is simply too much ef-
fort. Discovery services meet that need fairly efficiently and painlessly. And
as a librarian who has made many, many purchase decisions and is painfully
aware of what quality database resources cost, a “tool [that] holds the poten-
tial to significantly increase the discovery and use of such content” (Vaughan
2011) does indeed get my notice and my vote.

So how can a textbook on (individual) database searching still be justi-
fied? Why master all sorts of esoteric knowledge and get comfortable with
interfaces having three search boxes (with attendant options and settings)
when there is a simple, one-box option that searches the same material? The
discovery services tools are a wonderful way to woo undergraduates back to
library resources. But you have this textbook in hand, presumably, because
you are studying to become a librarian or an information professional or
technologist. For you, a higher order of knowledge and familiarity with more
sophisticated tools and approaches is one of the essential points—otherwise
anyone could set up shop and call herself an expert searcher. Google and
the discovery services will take care of the lower order questions. Someone
still needs to be there to deal with the harder, higher order research queries.
When the discovery service search isn’t providing the answer, someone needs
to know how to go to the next level: how to choose, access, and skillfully interact
with highly crafted, subject-specific databases on an individual basis. Jody
Fagan (2011) points out that “scholars working on more substantial research
projects . . . have already found—or will need to find—the native interface
to the subject-specific resources they need.” You need to be the person who
can point those scholars to subject-specific resources, and help them get the
most out of the “native interface” (which usually provides subject-specific fea-
tures) of those resources.3 This book is designed to do precisely that. Let’s get
started—because searching really can be just as rewarding as finding.

Notes

1. According to the Google Guide at http://www.googleguide.com/google_works.html,
the Google database is built by the GoogleBot, indexed by the Google Indexer, and
searches handled by the three parts of the Query Processor. The utterly massive
scale simply precludes any kind of human involvement.

2. Worth noting, the World Bank databases, formerly subscription-based, are now avail-
able to the world for free. Kudos to the World Bank for this daring and generous move!

3. Besides, it’s just ever so much more interesting. What fun is plunking words in a
box? Trust me, database skills make research much more efficient and satisfying.

2
Database Structure for Everyone: Records, Fields, and Indexes

Whether you are using this book in an upper-level database searching
course or an entry-level intro to reference course, it’s likely that you have or
will have the opportunity to take a true “database” course. This means that
some of you may already be familiar with the concepts in this chapter. My
goal is to focus on helping you learn and develop strategies to search and
interact more effectively with library databases rather than getting into the
real technology of how databases are built. This chapter provides a brief and
simple introduction to how databases are conceptually put together. In my
experience, this is as much as you need to know to apply appropriate search
techniques and use the database effectively. There’s no point in piling on
technical detail if it doesn’t further your ultimate goal, which in this case is
searching.

Database Building Blocks

Fields, Records, and Tables

In essence, databases are made up of fields and records. Fields are like
one cell in an Excel spreadsheet: a bit of computer memory dedicated to hold-
ing one particular type of information, one value. For example, an age field
might hold the value 28. The type of information could be text, numbers, or
an image. A set of fields makes up a record, the idea being that the informa-
tion in all the fields of one record relate to one thing: a person, a company, a
journal, a purchase order, etc. An analogy would be a row in Excel: one row
equals one record. But while you could have an Excel file with 5000 rows
(records), and 30 columns (all the different fields), such a file wouldn’t
ultimately be very efficient to search, and definitely isn’t scalable (it would not,
actually, be a database, but only a “flat file”). Enter the idea of relational da-
tabases, which are structured with tables. It’s like having many Excel work-
sheets that can have indefinitely many rows, but only a few columns (fields).
One of the fields in every table is dedicated to a unique identifier, which ties together all the material relating to the same person, company, etc. All of that material now represents a record. The table structure (and some additional features we will touch on presently) makes possible the desired storage efficiency and speed of access even for huge amounts of information.

Think about driver’s licenses. They all have an ID number, the owner’s name, address, date of birth, eye color, a bad photo, etc. All of that information undoubtedly resides in a database administered by the state agency that cares about driver’s licenses. It’s easy to imagine the Department of Motor Vehicles’ database having fields with names such as ID #, Name, Addr, DOB, Eyes, BadPic, etc. The fields are probably located in several tables: one for address information, one for driving history, one for the photo, etc. Pulled together by the ID number field, those fields make up records, each one of which represents a person (Figure 2.1).

Figure 2.1. DMV relational database example: tables, fields, and complete record.

Personal Data Table
ID Number | Last Name | First Name | MI | DOB      | Gender
12345678  | Smith     | John       | Q  | 19451121 | M
23456789  | Jones     | Martha     | A  | 19950401 | F
98765432  | Kepler    | John       | T  | 19620714 | M
[etc.]

Eye Table
ID Number | Eye Color | Corrective Lenses
12345678  | Blue      | Y
23456789  | Brown     | N
98765432  | Grey      | Y
[etc.]

Address Table
ID Number | Street         | City      | State | Zip
12345678  | 123 Main St    | Clyde     | NY    | 14433
23456789  | 60 Merriman St | Rochester | NY    | 14607
98765432  | 238 Bayview Dr | Greece    | NY    | 14612
[etc.]

Photo Table
ID Number | Photo
12345678  | [BadPic]
23456789  | [BadPic]
98765432  | [BadPic]
[etc.]

Driving History Table
ID Number | Years Driving | Accidents
12345678  | 53            | 2
23456789  | 3             | 1
98765432  | 34            | 0
[etc.]

“Show me the complete record for John T. Kepler”:
ID # 98765432 | Last Name: Kepler | First Name: John | MI: T | DOB: 19620714 | Gender: M | Street: 238 Bayview Dr | City: Greece | State: NY | Zip: 14612 | Eyes: Grey | Lenses: Y | Years Driving: 34 | Accidents: 0 | [Pic]
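To make the DMV example concrete, here is a sketch using Python’s built-in sqlite3 module of how a shared ID number pulls rows from separate tables back into one logical record. The table and field names loosely follow the example and are not meant as a real agency schema.

```python
import sqlite3

# Separate tables each hold a few fields; they share an ID Number field.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE personal (id INTEGER PRIMARY KEY, last TEXT, first TEXT);
    CREATE TABLE eyes     (id INTEGER, color TEXT, lenses TEXT);
    CREATE TABLE history  (id INTEGER, years_driving INTEGER, accidents INTEGER);

    INSERT INTO personal VALUES (98765432, 'Kepler', 'John');
    INSERT INTO eyes     VALUES (98765432, 'Grey', 'Y');
    INSERT INTO history  VALUES (98765432, 34, 0);
""")

# "Show me the complete record for John T. Kepler": a join on the shared
# ID ties the rows in every table back into one complete record.
row = conn.execute("""
    SELECT p.last, p.first, e.color, h.years_driving, h.accidents
    FROM personal p
    JOIN eyes    e ON e.id = p.id
    JOIN history h ON h.id = p.id
    WHERE p.id = 98765432
""").fetchone()
print(row)  # ('Kepler', 'John', 'Grey', 34, 0)
```

No single table holds the whole record; the record exists only as the set of rows the unique identifier ties together.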

The fields in the complete record represent every bit of information that
appears on your license, and probably some that isn’t actually printed on
the license as well. When you send in the paperwork and the check to renew
your license, they look you up in the database by your ID number, make any
changes that you might have indicated in your paperwork (e.g., change the
values in your fields), and hit print. Presto, you’ve gone from being a data-
base entry to being a small card with an unflattering photo.

Decisions, Decisions: Designing the Database

From here on, I’m going to discuss databases only in terms of fields and
records, leaving the “tables” aspect out. In the real world, yes, what is behind
the interface you are looking at is almost undoubtedly a relational database,
built on tables. But those tables are simply fields that make up mini-records.
In essence, what matters are the fields, and how many of them you need to
create a complete record.

And indeed, the crucial task in developing a database is deciding what
fields the records in your database are going to have, and how big they are
going to be, that is, how many characters or numbers they will be able to
hold. This “size” represents the computer memory allocated every time a
new record is added. (Although memory is cheap now, in a huge project,
how much memory will be allocated is still something to consider.) In the
best of all possible worlds, a whole design team, including software engi-
neers, subject experts, people from marketing and sales, and potential us-
ers, would wrestle with this problem. Nothing might ever get done in such
a large and varied group, however, and so probably a more limited team of
software engineers and content experts is the norm. The problem is that the
design team had better make good choices initially, because it can be dif-
ficult, if not impossible, to make significant changes to the record structure
later.1 This is good and bad. It means there’s a certain inherent stability, or
at least pressure on these database products not to change too much, but
when you wish that they would fix something, it can take a long time for
change to happen. You can take a certain amount of comfort, though, in the
knowledge that however much the interface to the database—the way it
looks—changes, behind the scenes the same types of information (fields)
are probably still there.
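The cost of those up-front sizing decisions can be illustrated with Python’s struct module, which reserves a fixed number of bytes per field for every record, whether or not a value fills them. The field names and sizes here are invented for illustration.

```python
import struct

# A fixed-size record layout: ID (8 chars), last name (20), first name (20),
# DOB (8). These sizes are design decisions made once, up front.
RECORD_FORMAT = "8s 20s 20s 8s"
record_size = struct.calcsize(RECORD_FORMAT)
print(record_size)  # 56 bytes reserved for every record added

packed = struct.pack(RECORD_FORMAT, b"98765432", b"Kepler", b"John", b"19620714")
# 'Kepler' occupies only 6 of its 20 bytes; the rest are zero padding,
# space the designers committed to when they fixed the field size.
print(len(packed))  # 56
```

A last name longer than 20 characters simply would not fit, which is why changing field sizes after millions of records exist is so painful.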


Food for Thought

For an article database, you’d probably have a field for the article title,
the name of the journal it appeared in . . . and what else? Think about the
other information you would want to capture. Again, the process is some-
thing like this:

Define your fields (Figure 2.2):
. . . that make up records (Figure 2.3) . . .
. . . and form the basis of your database.

Quick Recap

In this section we have described the structure of databases in very
simple terms and compared it to the structure of an Excel spreadsheet. The
most basic elements of a database are fields and records. (Technically, the
fields are usually structured in the form of tables, with one field in each
table acting as the “unique key” to pull all the information relating to one
record together.) A full set of fields makes up a record. Every record in the
database has the same set of fields (even if, in some records, some fields are
blank). All of the records together make up the database.

Beyond Fields and Records

Field Indexes

Fields and records are the basis, the “data” of a database. What makes a
database fast, powerful, and efficient are the indexes of the fields. It would be
very slow if every time you queried the database, it started at field1, record1,
and searched sequentially through each field of each record—you might as
well go back to hard copy at that rate.

An index, in the sense that we’re discussing now, is a list of all the
values from a particular field, with some kind of identifier indicating from
which record each value came (a pointer if you will). This is much like the

Figure 2.2. Database fields.

Figure 2.3. Database records.

EBSCOhost – printed on 2/4/2022 6:35 PM via UNIVERSITY OF BRITISH COLUMBIA. All use subject to https://www.ebsco.com/terms-of-use

Database Structure for Everyone: Records, Fields, and Indexes 13

way the index at the end of a book indicates on which pages a word appears.
In one sense, creating indexes of fields breaks one cardinal rule of databases: not to duplicate any data. But this one kind of duplication is worth the redundancy and extra storage space, because combined with sophisticated algorithms indexes make it possible to locate and retrieve the records associated with specified values in nanoseconds. The field indexes become part of the database but have a separate existence from the records. (You could think of them as really minimal table structures: just two columns, one with the values for field X, and the other containing a pointer back to the record that each field X value came from.) Again, the power of an index is that it can be sorted and optimized for searching in other sophisticated ways.

Let’s return to the driver’s license example. It has a field for Last Name. You’d definitely want to create an index to that field, so you’d have your computer program harvest all the values from the Last Name field, along with the associated ID Number value for each one. Given that the data is textual (a name), you’d probably want to sort the index alphabetically.2 Then if you wanted to find the record for Smith, John, your computer program could zip to the Ss in the Last Name index list (and then to the Js in the First Name index), find a set where the ID Numbers matched, and based on that pull up the full record for Mr. Smith, all in much less time than it takes to write about it. Using indexes to find records also means that the order of the rows in your database, that is, the order in which you enter your records, doesn’t matter at all. You simply build an index and search that when you want to find something in your database. Or you can build several indexes; you can make an index of any field you want. However, as always, there are costs and reasons why you might not index every field.3
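Under the hood, this might look something like the following sketch. The record layout and ID values are invented for illustration, and, as note 2 admits, a real system would use something far more sophisticated than a sorted list: harvest every Last Name value with its ID Number pointer, sort, and binary-search instead of scanning every record.

```python
import bisect

# Assumed sample records: the ID Number acts as the unique key.
records = {
    "D100": {"last_name": "Smith", "first_name": "John"},
    "D101": {"last_name": "Jones", "first_name": "Ann"},
    "D102": {"last_name": "Smith", "first_name": "Mary"},
}

# Build the Last Name index: (value, pointer) pairs, sorted alphabetically.
last_name_index = sorted(
    (rec["last_name"], id_num) for id_num, rec in records.items()
)

def find_ids(index, value):
    """Zip straight to the matching entries with a binary search,
    instead of scanning record by record."""
    lo = bisect.bisect_left(index, (value, ""))
    hi = bisect.bisect_right(index, (value, "\uffff"))
    return [pointer for _, pointer in index[lo:hi]]

smith_ids = find_ids(last_name_index, "Smith")  # IDs for every Smith
```

Intersecting this result with a matching lookup in a First Name index yields the single ID whose record is Smith, John.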

A Very Simple Example

Say we have three articles:

Milky Way’s Last Major Merger.

Science News. v. 162 no. 24 p. 376

It’s a Dog’s Life.

The Economist. December 21, 2002. p. 61

Manhattan Mayhem.

Smithsonian. v. 33 no. 9 p. 44

Let’s enhance these just a little by adding a one-line description to each record (so that we have a few more words to search on):

Record 1:

Milky Way’s Last Major Merger.

Science News. v. 162 no. 24 p. 376.

New clues about galaxy formation indicate early collision affected Milky
Way’s shape.


Record 2:

It’s a Dog’s Life.

The Economist. December 21, 2002. p. 61.

From hard labour to a beauty contest, a history of the work and whims of
dog breeding.

Record 3:

Manhattan Mayhem.
Smithsonian. v. 33 no. 9 p. 44

Martin Scorsese’s realistic portrayal of pre–Civil War strife—Gangs of New
York—re-creates the brutal street warfare waged between immigrant
groups.

My database will have just four fields (Figure 2.4):

1. Record number (four digits; i.e., my database will never grow to more than 9,999 articles)

2. Article title (50 characters allocated)

3. Journal name (50 characters allocated)

4. Abstract (200 characters allocated)

Now let’s index the fields.
The initial list of words from the Article Title field looks like this:

Milky

Way’s

Last

Major

Merger

It’s

a

Dog’s

Life

Manhattan

Mayhem

Figure 2.4. A very simple database record.


More Database Decisions

There are various things about this list that one might question. What
will our indexing program do with those possessives and contractions? Do
we want to clog it up with little words like a? There are many decisions for
database designers to make:

• How will the indexing program handle apostrophes and other punctuation? We take it for granted now that the system will simply preserve it, and users can search for contractions or possessives, but you may still encounter systems that insert a space instead of the apostrophe (dog s), or ignore it and treat the letters as a string (ending up with “dogs” for “dog’s”).

• What will the indexing program do with the “little words”? That is, words such as a, an, by, for, from, of, the, to, with, and so forth, which are usually referred to as stop words. These are words so common that database designers usually decide they don’t want to expend time and space to index them. Indexing programs are programmed with a list of such words and will “stop” indexing when they hit a word on the list. A more descriptive term would be skip words, because that is what really happens: the indexing program skips any stop-list word and continues to the next word. Almost all databases employ a stop word list, and it can vary greatly from one vendor to the next. (Even Google has stop words, words it doesn’t index.)

• Should the system be designed to preserve information about capitalization, or to ignore the case of the words? We are so used to systems that do not distinguish upper- and lowercase (so that you don’t have to worry about how you type your query), but there are times when you would really like the system to know the difference between, say, AIDS (the disease) and aids (the common noun or verb).

Because this is a modern system, we’ll decide to preserve the apostrophes and to make a one of our stop words, so it won’t be included in the index. We can then sort the list alphabetically:

Dog’s
It’s
Last
Life
Major
Manhattan
Mayhem
Merger
Milky
Way’s
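These decisions can be mimicked in a short Python sketch (the stop word list here is an assumed one): split on whitespace so apostrophes survive, skip stop words, then sort, reproducing the list above.

```python
STOP_WORDS = {"a", "an", "the", "of", "to"}  # assumed stop list

def index_words(text):
    """Split on whitespace (apostrophes survive intact) and
    skip any stop words."""
    return [w for w in text.split() if w.lower() not in STOP_WORDS]

titles = ["Milky Way's Last Major Merger",
          "It's a Dog's Life",
          "Manhattan Mayhem"]

word_list = sorted(w for title in titles for w in index_words(title))
```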

Can you see the problem here? We have neglected to include an identifier to show which record a word came from. Let’s start over.

Better Field Indexing

Let’s make sure that our index list includes the record number and
which field the word came from:

0001 Milky TI

0001 Way’s TI

0001 Last TI

0001 Major TI

0001 Merger TI

0002 It’s TI

0002 Dog’s TI

0002 Life TI

0003 Manhattan TI

0003 Mayhem TI

One more thing: we can include a number representing the order of the
word within the field (why might this be useful?). We now have something
like this:

0001 Milky TI 01

0001 Way’s TI 02

0001 Last TI 03

Now we’ll sort again.

0002 Dog’s TI 03

0002 It’s TI 01

0001 Last TI 03

0002 Life TI 04

0001 Major TI 04

0003 Manhattan TI 01

0003 Mayhem TI 02

0001 Merger TI 05

0001 Milky TI 01
0001 Way’s TI 02

Note how even though we deleted the stop word a in the title “It’s a Dog’s Life,” the numerical position of “Dog’s” reflects that there was an intervening word there: its position is recorded as 3, not 2.
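A sketch of building such positional entries (the stop list is again an assumption): the position counter advances even over skipped stop words, which is what records “Dog’s” at position 3.

```python
STOP_WORDS = {"a", "an", "the", "of", "to"}  # assumed stop list

def index_field(record_num, field_code, text):
    """Yield (record, word, field, position) entries; stop words are
    skipped but still consume a position number."""
    entries = []
    for position, word in enumerate(text.split(), start=1):
        if word.lower() not in STOP_WORDS:
            entries.append((record_num, word, field_code, position))
    return entries

title_index = (index_field("0001", "TI", "Milky Way's Last Major Merger")
               + index_field("0002", "TI", "It's a Dog's Life")
               + index_field("0003", "TI", "Manhattan Mayhem"))
title_index.sort(key=lambda entry: entry[1])  # alphabetical by word
```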


Because people might want to search on the name of the publication,
it would be good to index that as well. Our index of the Journal Name field
looks something like this:

0002 Economist JN 02

0001 News JN 02

0001 Science JN 01

0001 Science News JN 01, 02

0003 Smithsonian JN 01

Note the multiple indexing of Science News. The technical term for this
is double posting.

To make things even faster and more efficient, after indexing each field,
combine the indexes so that you have only one list to search:

0002 Dog’s TI 03
0002 Economist JN 02
0002 It’s TI 01
0001 Last TI 03
0002 Life TI 04
0001 Major TI 04
0003 Manhattan TI 01
0003 Mayhem TI 02
0001 Merger TI 05
0001 Milky TI 01
0001 News JN 02
0001 Science JN 01
0001 Science News JN 01, 02
0003 Smithsonian JN 01
0001 Way’s TI 02
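Once combined and sorted, a search becomes a single lookup rather than a scan. A sketch using a dictionary keyed on the word (entries abridged to the journal-name portion of the list above; the double-posted “Science News” phrase entry is omitted for simplicity):

```python
from collections import defaultdict

# Combined index entries: (record, word, field, position).
combined = [
    ("0002", "Economist", "JN", 2),
    ("0001", "News", "JN", 2),
    ("0001", "Science", "JN", 1),
    ("0003", "Smithsonian", "JN", 1),
]

# One dictionary lookup replaces a sequential scan of every record.
lookup = defaultdict(list)
for record, word, field, position in combined:
    lookup[word.lower()].append(record)

def search(word):
    """Return the records containing the word, in any indexed field."""
    return lookup.get(word.lower(), [])
```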

We undoubtedly want to index the content of the one-sentence “abstracts” as well. Here is a list of the words in raw form:

new clues about galaxy formation indicate early collision affected Milky Way’s Shape

From hard labour to a beauty contest a history of the work and whims of dog breeding

Martin Scorsese’s realistic portrayal of Pre-Civil War Strife Gangs Of New York Re-creates The brutal street warfare waged between immigrant groups

Decisions and cleanup are needed on this list of words:

• Stop words—what will they be?

• Hyphenated words—how will they be recorded?

• Proper names—“double post” to include the phrase too?

• Alternative spellings—do we do anything about them or not? (What
might you do?)

Luckily, software does almost all of this work for us. You probably will never see any indexes in their raw state. What we’ve been going over here is, in real life, very under-the-hood, often proprietary material for the database vendors. You don’t need to know exactly how any particular database works; you simply need to grasp some of the basic principles that govern how databases in general are put together and how they are indexed. This determines how you search them—and what you can expect to get out of them.

Quick Recap

This section discussed the idea of field indexes and the importance
of good planning in the design of huge databases. Field indexes refer to
the idea that the values in a database’s fields can be extracted and put
into their own lists that consist of just the value and a pointer back to
the record it came from. These indexes exist separately from the records
in the database, and make rapid, efficient searching of huge databases
possible. Much thought goes into the initial database design (i.e., what
fields to include, what they are called, how much space to allocate for each
one), because the design cannot be easily changed later. Many decisions
go into the design of indexes as well, for example, which fields will be indexed, how contractions and possessives will be handled, which words will
be treated as stop words, and if and how identification of phrases will be
supported.


Examples of Indexes in Common Databases

In the examples that follow, see if you can relate what we’ve just gone over to how the field indexes are presented to you as a user in these common databases. We’ll start with two multidisciplinary databases: Academic OneFile from Gale, which has a single Subject list, and EBSCO’s
MasterFILE Premier, which offers separate Subjects, Places, and People
indexes. Last, we’ll consider the very elaborate indexing used by OCLC’s
WorldCat.

Gale’s Academic OneFile: One Subject List

Academic OneFile, one of the Gale Company’s “Infotrac” suite of databases, prominently offers a Subject Guide Search. If you choose the Subject
Guide Search from the navigation bar, useful Search tips are displayed on
that interface page (Figure 2.5). The tips text suggests using this search
mode “when you want to browse a dynamic list of topics, people, products,
locations, organizations and more.”

Once you have searched for a term, you can browse forward through the list, as long as there are headings containing your search term somewhere within them. But the Gale system doesn’t offer unlimited, free-form browsing capability, unlike MasterFILE’s true browse access (i.e., MasterFILE presents you with the very beginning of whichever index list you
choose, and you could, if you wanted, simply page through—browse—the
whole thing without any searching at all). An advantage to the Gale subject
list over MasterFILE’s is that it is all-in-one: you don’t have to think about
the nature of what you’re looking for (Is it a subject? a person? a place?).
You can look once and know for sure whether the topic you’re looking for
is there or not. For example, Academic OneFile at the time of this writing

Figure 2.5. Representation of the Academic OneFile Subject search interface.


does not appear to have any articles on the singing group Chanticleer. The only Subject Guide entry is for Chanticleer and the Fox, which is helpfully glossed “(Novel).” In an all-in-one list, these parenthetical notes are very useful. Other examples include “(Planet),” “(Medication),” “(Motion Picture),” the names of sports (to distinguish the different “World Cup” events), and various others. In the Subject Terms results list, links to Subdivisions and Related Subjects are provided if applicable, as well as “See” entries to get you to the term Gale has decided to use (e.g., “Coffee addiction See Coffee habit”). If one is willing to slow down enough to look through the list of Subdivisions, it is well worth it, as examining the list can make finding articles on exactly the aspect of [topic x] you want very easy and efficient. Looking at the Subdivisions for Coffee (Beverage) provides an excellent example: just looking for how much coffee is consumed? Try the Subdivision Consumption data. Environmental aspects, Market share, Prices, Research, Risk factors, Statistics—the Gale indexers have done an excellent job identifying the kinds of things people look for most often, which can be hard to find without the human intervention of applying intelligent subject headings. The Gale subject list also includes the number of results for every heading and subdivision, which is extremely helpful. Being able to see the count lets you know that Academic OneFile is probably a good place to find articles about the “Health aspects” of coffee (945 results in May 2014), but perhaps not for learning more about “Diseases and pests” of coffee (only 4 results in May 2014).

Figure 2.6. Initial Subjects Index interface in EBSCO’s MasterFILE Premier. ©
2014 EBSCO Industries, Inc. All rights reserved.


EBSCO MasterFILE Premier: Subjects, Places,
& People Indexes

Even in a fairly simple display, there is a lot to look at and look for. In
this view of the subject index interface in EBSCO’s MasterFILE Premier
(Figure 2.6), A is the area identifying where we are: who is providing the
database, which database it is, and a search box to collect the results of our
choices from the Subjects, Places, and People indexes.

The section marked B tells us we are accessing the Subjects lists: there
are separate indexes for the Subjects, Places, and People fields. The interface
to these index lists allows us to simply start at the very beginning of the list
and browse forward, page by page, or to jump to any point in the index by
searching on a word or phrase, with the option of having our search term at
the beginning of the Subject entry (“Term begins with”), or anywhere within
it (“Term contains”). The third option, “Relevancy ranked,” will return all of the Subject headings containing your search term arranged by relevance rather than alphabetically (although it is hard to tell how “relevance” is being determined).

Figure 2.7. The Subjects Index in MasterFILE Premier, showing the
beginning of the “coffee” entries. © 2014 EBSCO Industries, Inc. All rights
reserved.


In Figure 2.7 we see the results of searching the subject list for the word coffee. The entries in all caps are values from the field designated as subjects in this database: terms the EBSCO indexers have chosen from the predetermined list of subject headings for this database, terms they feel capture the essence of the article’s content. We will talk more about this idea of the “predetermined list” of subjects in chapter 3, but for now, just tuck away the idea that entries in the Subjects list are not random: the indexers have deliberately compiled this list of terms.4 Thus every article about the history of coffee is assigned the subject COFFEE—History. But, as indicated by the helpful “Use” note, if you are looking for articles about the cultivation of coffee, rather than “COFFEE—Cultivation” you should “Use COFFEE growing.” The number of results for each entry is not provided, which is a bit annoying.

The “Places” list contains, obviously, names of places that have been the subject of articles in this database, helpfully glossed with the name of the country or state where they are located to disambiguate them (e.g., “abbeville (ala.),” “abbeville (france),” and “abbeville (la.)”). In the “People” list you would find, obviously, names of people, but also of orchestras, musical groups, and musical events, all glossed with the parenthetical note “(performer)” (e.g., “boston early music festival (performer)”). The entire content of the Places and People lists is lowercase, which seems a little odd, but in both of these lists the number of records for each entry is provided, which is very helpful.

Field Indexes for the WorldCat Database

Moving on to our third example, OCLC’s WorldCat database (a union catalog of library holdings from around the world) provides even more examples of the use of separate indexes for many fields. As in EBSCO’s Subjects, Places, and People lists, the WorldCat Browse Index interface provides the opportunity to roam around in the indexes, discovering what is there (and thus, what is possible), before committing to a search. Some fields (such as Author) are even indexed twice, in separate lists, creating one index for single words only and another for phrases. Figure 2.8 provides a drawing of the initial view of the Browse Index interface (A), and an example of a single-word and a phrase index for the same field. In the part of the drawing marked B, the drop-down menu has been changed to Author, and in C to Author Phrase. In the Author (single-word) index, you could browse only for an author’s last name, for example, Austen. In the Author Phrase index, you could browse specifically for Austen, Jane.

You access the Browse Index screen via an icon in the WorldCat Advanced Search interface, discussed in greater detail in chapter 7. For now, simply observe how it works.

Using the Subject Indexes

Figure 2.9 provides a stylized representation of looking up the word librarians in the Subject index of the WorldCat database. (Note that the dropdown for choosing which index you want to browse has been changed to Subject, a single-word index.)

The lower part of Figure 2.9 represents the results of searching this index. In this stylized drawing, I have only written out the terms of most interest and highest counts. The other entries, represented by “[term],” are almost always odd spellings or outright typos,5 and have counts in the single digits. The Count column indicates how many records in the database have been assigned that Subject. Since the Count numbers change steadily, I have represented them with hash marks indicating the size of the number. When you are actually using WorldCat, these count numbers provide a rough indication of the content of the database, and how useful it might prove for the topic you’re working on. In this case, WorldCat appears to have a wealth of material on librarians (plural) and librarianship, but far fewer entries for “librarian” in the singular. Last, observe that the term we searched for appears in the middle of the list, and is in bold. Why do you think the database designers have chosen to display the results this way?

In Figure 2.10, we see the results of a search for information retrieval in the Subject Phrase index. This drawing uses the same conventions as Figure 2.9. Were you to get online and browse forward in this list, you would find literally hundreds of entries beginning with the words “information retrieval.”

Figure 2.8. Representation of the initial Browse Index screen for the WorldCat®
database and examples of single-word and phrase indexes offered for some
fields.


Record Structure Reflected in Fields Displayed

As a reminder, indexes are built from the fields included in a database
record. The fields can be called the record structure, and you can get a sense
of how simple, or elaborate, a database’s record structure is by studying the
fields displayed when viewing a record from the database.

The WorldCat database has quite an elaborate record structure; these database designers were making sure that they didn’t leave anything out, and that the most complete set of bibliographic information they could assemble would be available to users. The OCLC interface designers have the task of conveying a large amount of information as clearly as possible.

Get online, and look up the record for your favorite book in WorldCat.
Take time to study the full record display, noticing how the designers have
used different fonts, colors, and alignments to convey meaning. Notice how
the field names are lined up on the left, followed by colons, and the contents of
the fields appear to the right. Find the section labeled “Subject(s),” and notice
that the terms below are labeled “Descriptor.” (We will encounter some odd
terms for subject headings in the course of this book.) Some of the fields may
seem quite mysterious, but think about the purpose of the others, and why the
database designers might have decided to include them. WorldCat has been

Figure 2.9. Representation of results from the single-word
Subject index.


around since 1967, and has probably struggled to adjust its record structure
ever since to stay abreast of developments. If the WorldCat database had been
invented today, the database designers might have made different choices.

Exercises and Points to Consider

1. What would your ideal database record for a journal article look like?
Choose any article that interests you, and design a database record
for it, keeping in mind that what you do for this one article you will
do for every other article (how much do you expect your database to
grow?). What fields will you use? How big will each field be? What
will you call the fields? Sketch out what the overall database would
be like (and why this article would be included), and justify your
choices.

2. Why do you think WorldCat has separate one-word and phrase indexes for the same fields?

3. What is a useful piece of information that is provided when you browse the indexes (in Academic OneFile and WorldCat)? How might this affect your search strategy?

Figure 2.10. Representation of results from the multiple word Subject Phrase
index.


4. Using the MasterFILE Places index, look up: mars. What is EBSCO’s preferred term for the “place” Mars? Switch to the Subject index, and look up: self acceptance (no hyphen). Notice how EBSCO clearly and easily gets you to the right form of the term, for this and so many other “self-” entries.

5. Can you do a field search with Google?

6. People generally think of Google as indexing all the words of all the web pages that it visits.6 With a few exceptions (it recognizes the markup codes for the page Title, for example), it offers only one huge index labeled “all the text.” Why do you think the commercial database vendors go to so much trouble to provide an elaborate record structure with indexed fields?

7. In the early days of online searching in libraries, only librarians performed searches, after a detailed and careful interview with the patron requesting the search. The librarian would plan the search carefully, and then “dial up” to connect to the database, using a password and employing a very terse, arcane set of commands to perform the search. Access fees were charged by the minute, with additional charges for records viewed. Try to picture this scenario, and then compare it with the situation today. How do you think the totally open, end-user access has affected databases and their interfaces? Can you think of anything about the current situation that is not an improvement?

Notes

1. If you’re wondering why it would be so hard to change the record structure, remember to think of these databases as huge things: true, adding a new field to a database of just five records would be trivial. But a database of 500,000 records? How are you possibly going to retrospectively fill in the new field for all the existing records?

2. Computer scientists can cringe here. I’m sure it would actually be something much
more sophisticated than a simple alphabetic sort.

3. For one thing, the process of initially building the index can take hours. Although this does not mean that it can’t be done, remember that every index has to be updated frequently to reflect any changes in your database. It just adds to the complexity of the whole operation.

4. And each vendor’s list is different. EBSCO has decided that rather than “Coffee (Beverage),” the subject should just be “Coffee” in the MasterFILE database. Meanwhile, the folks over at Gale decided that “Coffee (Beverage)” was preferable to simply “Coffee,” and that’s the preferred subject heading in Academic OneFile.

5. Go online and take a look: these are odd and sometimes amusing entries. What
this shows is that the indexing process, that is, the harvesting of terms from the
Subject field, is done by a computer program: it simply picks up whatever is there,
typos and all. (The errors come from the humans who typed the values into the field.
For fallible humans, they are impressively accurate.)

6. Does Google have stop words, words that it ignores? And does it really index every
page all the way to the end? Does this matter?

