CHARACTERIZING BROWSING STRATEGIES IN THE WORLD-WIDE WEB
This paper presents the results of a study conducted at Georgia Institute of Technology that
captured client-side user events of NCSA's XMosaic. Actual user behavior, as determined
from client-side log file analysis, supplemented our understanding of user navigation
strategies and provided real interface usage data. Log file analysis also yielded
design and usability suggestions for WWW pages, sites and browsers. The methodology of the
study and findings are discussed along with future research directions.
Keywords
Hypertext Navigation, Log Files, User Modeling
Introduction
With the prolific growth of the World-Wide Web (WWW) [Berners-Lee et al., 1992] in the past year,
there has been an increased demand for an understanding of the WWW audience. Several
studies exist that determine demographics and some behavioral characteristics of WWW users
via self-selection [Pitkow and Recker 1994a & 1994b]. Though highly informative, such
studies only provide high level trends in Web use (e.g. frequency of Web browser usage to
access research reports, weather information, etc.). Other areas of audience analysis, such
as navigation strategies and interface usage, remain unstudied. Thus, the surveys provide
estimations of who is using the WWW, but fail to provide detailed information on exactly
how the Web is being used. Actual user behavior, as determined from client-side log file
analysis, can supplement the understanding of Web users with more concrete data. Log file
analysis also yields design and usability guidelines for WWW pages, sites and browsers.
This paper presents the results of a three week study conducted at Georgia Institute of
Technology that captured client-side user events of NCSA's XMosaic. Specifically, the
paper will first present a review of related hypertext browsing and searching literature
and how it relates to the Web, followed by a description of the study's methodology. An
analysis of user navigation patterns ensues. Lastly, a discussion and recommendations for
document design are presented.
Literature Review
Many studies have addressed user strategies and usability of closed hypermedia systems,
databases and library information systems [Caramel et al., 1992]. Most distinguish
between browsing and searching. Cove and Walsh [Cove et al., 1988] include a third
browsing strategy:
- Search browsing: directed search, where the goal is known
- General purpose browsing: consulting sources that have a high likelihood of items of interest
- Serendipitous browsing: purely random
This continuum provides a nice middle ground to distinguish between browsing as a method of
completing a task and open ended browsing with no particular goal in mind. Marchionini
[Marchionini, 1989] further develops this distinction in designating open and closed
tasks. Closed tasks have a specific answer and often integrate subgoals. Open tasks are
much more subject oriented and less specific. Browsing can be used as a method of
fulfilling either open or closed tasks.
Intuitively, it would seem that browsing and searching are not mutually exclusive activities. In
Bates's [Bates, 1989] work on berrypicking, a user's search strategy is constantly
evolving through browsing. Users often move back and forth between strategies. Similarly,
Bieber and Wan [Bieber & Wan, 1994] discuss the use of backtracking within a
multi-windowed hypertext environment. They introduce the concept of "task-based
backtracking," in which a user backtracks to compare information from different
sources for the same task or to operate two tasks simultaneously. A similar technique, in
a Web environment, would be backtracking to review previously retrieved pages.
All of these studies were performed on closed, single-author systems. The WWW, however, is an
open, collaborative and exceedingly dynamic hypermedia system. These previous findings
provide the basis and structure for describing the ways a user population behaves in a
dynamic information ecology like the WWW.
Given that we expect to find the same kinds of strategies used in the WWW, supporting both the
browser and the searcher in designing WWW pages and servers is necessary, although
difficult. Furthermore, supporting the kind of task switching described by Bates and by
Bieber and Wan adds another level of complexity because the work implies that a user
should be able to switch strategies at any time.
It has long been recognized that methods for supporting directed searching are needed. As a
response to this, certain WWW servers are completely searchable and there are World-Wide
Web search engines available.
Supporting browsing, though, may be a more difficult task. Both Laurel [Laurel, 1991] and Bernstein
approach the topic of how to assess and design hypertexts for the browsing user. Laurel
considers interactivity to be the primary goal. She defines a continuum for interactivity
along three variables: frequency (frequency of choices), range (number of possible
choices) and significance (implication of choices). Laurel contends that users will pay
the price "often enthusiastically -- in order to gain a kind of lifelikeness,
including the possibility of surprise and delight." Bernstein takes a slightly
different approach with his "volatile hypertexts" [Bernstein, 1991]. He argues
that the value of hypertext lies in its ability to create serendipitous connections
between unexpected ideas.
There is a tension between designing for a browser and designing for a searcher. The logical
hierarchy of a file structure or a searchable database may work fine for a closed-task,
goal oriented user. But a user looking for the unexpected element or a serendipitous
connection may be frustrated by the precision required by these methods. The first step in
balancing this problem is to determine what strategies are being used by the population.
In order to do this, we collected log files of users interacting with the Web.
Methodology
We sought to capture all events generated by consenting members of the Georgia
Institute of Technology College of Computing staff, faculty and student populations who
ran NCSA's XMosaic on Sun OS 4.1.3. Toward this end, a version of XMosaic was
coded to trap all user interface level events. The computing environment of the study
consisted of over 250 Sun OS 4.1.3 machines connected via a 100 Megabit/sec CDDI LAN. To
minimize the potential for data loss resulting from network and/or system failures, all
captured events were processed and forwarded to a secure disk via the syslog daemon.
Equally important was infusing a meaningful representation into the data of user events. This
allowed not only a clear understanding of the extent and functionality of the interface,
but also clear extraction of task specific data during analysis. Accordingly,
we recorded events according to the User Interface Design Environment (UIDE) [Sukaviriya
et al., 1993] guidelines for task representation. This permits all actions to be viewed on
three levels: an Application Action (high-level task, e.g. Open File), an Interface Action
(mid-level task, e.g. select item from pull-down menu), and an Interface Technique (low
level action, e.g. Mouse Click). In the example below, a user clicked on a hyperlink in
the document window that pointed to http://www.somewhere/. The user is identified as
participant number 123, and the event was generated from machine foo.cc.gatech.edu on August
3rd, 1994 at 12:21:10 a.m.
Aug 3 00:21:10 foo.cc.gatech.edu uel: 775887872 123 1 Mouse Navigate Anchor:: http://www.somewhere/
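A record of this shape can be split into its fields with a short parser. The sketch below is only an illustration: the field order is inferred from the single example record above, the names `Event` and `parse_event` are hypothetical, and the meaning of the integer that follows the participant number is not stated in the text, so it is kept as an opaque `unknown` field.

```python
import re
from typing import NamedTuple, Optional

# Layout inferred from the one example record in the text (an assumption):
# syslog date, host, "uel:" tag, epoch seconds, participant number, an
# undocumented integer, Interface Technique, Category, Application Action,
# then "::" followed by the target of the action.
LINE_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<clock>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+uel:\s+"
    r"(?P<epoch>\d+)\s+(?P<participant>\d+)\s+(?P<unknown>\d+)\s+"
    r"(?P<technique>\S+)\s+(?P<category>\S+)\s+(?P<action>[^:]+)::\s*(?P<target>.*)$"
)


class Event(NamedTuple):
    host: str
    epoch: int          # seconds since 1970-01-01 UTC
    participant: int
    unknown: int        # meaning not documented in the text
    technique: str      # e.g. "Mouse"
    category: str       # e.g. "Navigate"
    action: str         # e.g. "Anchor"
    target: str         # e.g. the URL that was requested


def parse_event(line: str) -> Optional[Event]:
    """Parse one log record; return None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return Event(
        host=match.group("host"),
        epoch=int(match.group("epoch")),
        participant=int(match.group("participant")),
        unknown=int(match.group("unknown")),
        technique=match.group("technique"),
        category=match.group("category"),
        action=match.group("action").strip(),
        target=match.group("target").strip(),
    )
```

Returning `None` for non-matching lines, rather than raising, lets a caller skip unrelated syslog entries that may share the same file.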
The study was conducted for a three week period that commenced August 3, 1994. Participation
was solicited through a consent window that informed users of the experimental procedures
employed as well as of their rights as human subjects. The consent window was intended
both to inform users and to minimize the "Big Brother" effect [Nielsen, 1993].
This window appeared the first time XMosaic was executed by each user during the sampling
period. One hundred and seven users, or sixty-three percent, chose to participate in the
study.
The selection of XMosaic was made for several reasons. According to some estimates at the time
[Kostner, 1994], XMosaic accounted for roughly 53% of all WWW related accesses to HTTP
servers. Furthermore, XMosaic was one of the few UNIX based GUI browsers available.
Still, since the computing environment studied also included several other platforms that
supported non-logging WWW browsers, certain portions of the computing population were not
able to participate. Another confound of the experimental design exists in that it was
possible for users to compute on multiple platforms during the sampling period, which may
have resulted in the users running the specialized Sun OS version of XMosaic in tandem
with other non-logging versions of WWW browsers.
Table 1. Occurrence of XMosaic user events mapped to a UIDE-like representation, where
M = mouse click and K = keyboard entry (after Sukaviriya et al., 1993).
-------------------------------------------------------------------------------------------------------------------------------
| Application Action | Interface Technique | Instances | Percentage | Category of Action | Description of Action              |
-------------------------------------------------------------------------------------------------------------------------------
| Anchor             | M                   | 16140     | 51.9       | Navigate           | Selection of Hyperlink in Document |
| Back               | M K                 | 12633     | 40.6       | Navigate           | Go Back One Document               |
| Open URL           | M K                 | 707       | 2          | Navigate           | Open File via a URL                |
| Hotlist - Go To    | M                   | 636       | 2          | Navigate           | Go to Document via Hotlist         |
| Forward            | M K                 | 537       | 2          | Navigate           | Go Forward One Document            |
| Open Local         | M K                 | 221       | 0.7        | Navigate           | Open Local File                    |
| Home Document      | M K                 | 179       | 0.5        | Navigate           | Go to the Home Document            |
| Window History     | M K                 | 39        | 0.1        | Navigate           | Go to Document via Window History  |
-------------------------------------------------------------------------------------------------------------------------------
Analysis and Results
The original log file comprised over 43,000 events, with each record uniquely identifiable by
user id and time of occurrence. The file was sorted by user id and secondarily by event
time. This file includes all user interface events.
Since users will often leave XMosaic running for extended periods of time without interacting
with it, determining session boundaries artificially was necessary. To identify these
boundaries, the time between successive events was calculated for all events across users.
The mean time between user interface events was 9.3 minutes. To determine session
boundaries, a lapse of more than 25.5 minutes between events was treated as the start of
a new session. In other words, most events occurred within 1-1/2 standard deviations
(25.5 minutes) of the mean. Thus, a new log file was derived that indicated sessions for
each user. Interestingly, a consistent third quartile was observed across all users,
though we have no clear explanation for this effect.
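The session-splitting rule described above can be sketched in a few lines. This is a minimal illustration, not the program used in the study: the function name `split_sessions` and the `(user_id, timestamp)` input shape are assumptions, and only the 25.5 minute cutoff comes from the text.

```python
from typing import Dict, Iterable, List, Tuple

# Cutoff reported in the text: events more than 25.5 minutes apart
# start a new session.
SESSION_GAP_SECONDS = 25.5 * 60


def split_sessions(
    events: Iterable[Tuple[int, float]],
    gap: float = SESSION_GAP_SECONDS,
) -> Dict[int, List[List[float]]]:
    """Group (user_id, timestamp_in_seconds) pairs into per-user sessions.

    Events are sorted by user id and then by time, mirroring the sort
    order described for the log file.
    """
    sessions: Dict[int, List[List[float]]] = {}
    for user, timestamp in sorted(events):
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and timestamp - user_sessions[-1][-1] <= gap:
            # Close enough to the previous event: same session.
            user_sessions[-1].append(timestamp)
        else:
            # First event for this user, or the lapse exceeded the cutoff.
            user_sessions.append([timestamp])
    return sessions
```

Dividing the total number of sessions by the number of users then gives the kind of per-user session average reported below.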
Users averaged 9.4 sessions each, or approximately one session every other day. For subsequent
analyses, navigation-related events were extracted, which brought the total number of
events to 31,134, representing 73% of all generated events.
Document requests were distinguished by protocol. Eighty percent of the document requests were of
type http (i.e. requests for a document from a WWW server). Four percent of these were
generated by "cgi" scripts. Files accounted for 8%, followed by ftp and gopher
both at 4%. All other accesses combined (including news, wais, telnet, etc.) totalled 4%.
Methods of Interaction
Hyperlinks were by far the preferred method of traversal, accounting for 52% of all document
requests. Second, accounting for about 41%, was the "Back" command. Following in
order of popularity were "Open URL," "Hotlist," "Forward,"
"Open Local," "Home Document" and "Window History" (see
Table 1). This indicates that users typically did not know the location of documents a
priori, or relied on other heuristics to navigate to a specific document. Furthermore,
most users did not select items in the hotlist and window history. It seems that they
either preferred using "Go_To" or did not know how to employ this interface
technique.
While all menu items have corresponding keyboard equivalents, only 4272 events were initiated
via the keyboard, though this may be due to the lack of display of keyboard equivalents
next to menu items, as is done in Macintosh applications. Finally, 486 interrupts/asynchronous
aborts (hitting the spinning globe), or 1%, occurred during file transfer.
This suggests that the population as a whole was insensitive to retrieval latency,
although there may be a difference for users using modems or slower connections.
Within Site Navigation
The average number of successive document requests within a single site across all users was
12.64. Outlier removal resulted in a mean of 10.31 (min=1, max=403) with a standard
deviation of 28.56.
Popularity of Sites
The five most popular sites were:
- file://localhost
- http://www.gatech.edu
- http://w3.eeb.ele.tue.nl
- http://www.ncsa.uiuc.edu
- http://info.cern.ch
The sites map to user document testing, Georgia Tech's home page server, a digital archive in
the Netherlands, NCSA, and CERN. Users accessed a total of 1222 unique sites outside of Georgia
Tech. Thus, given that SG-Scout estimated the number of Web servers during the observation
period at 7,300, roughly 16% of all available sites were accessed during the study.
Interestingly, items put on people's hotlists did not match the most popular sites. The
sites most accessed through the hotlist were:
- http://www.secapl.com
- file://localhost
- http://info.cern.ch
- http://akebono.stanford.edu
- http://www.cs.ubc.edu
Site Analysis
1222 sites outside of Georgia Tech were accessed by College of Computing users. A modified
version of the Pattern Detection Module (PDM) algorithm [Crow & Smith, 1991]
identified the frequency of repeating sequences of site and document accesses.
Specifically, the program tallied the number of occurrences of sequences of accesses, or paths.
Paths of length two through fifty were computed.
For example, if a user went from www.gatech.edu to www.ici.edu to www.ncsa.uiuc.edu a
total of seven times throughout the study, the PDM would identify a path of length three
(three sites) with a frequency of seven (repeated seven times). Stated differently, the
length of a path is the number of successive document requests, which are to be viewed as
user navigation.
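The tallying of paths can be illustrated with a simple sliding-window counter. This is not the PDM algorithm of Crow and Smith, whose details are not given here, and not the modified version used in the study; it is a naive stand-in that counts every contiguous run of accesses of length two through fifty. The name `count_paths` is hypothetical.

```python
from collections import Counter
from typing import Sequence


def count_paths(visits: Sequence[str], min_len: int = 2, max_len: int = 50) -> Counter:
    """Count how often each contiguous sequence of accesses (a path) occurs.

    `visits` is the ordered list of sites or documents requested by one
    user (or within one session). Keys of the result are tuples of
    length min_len..max_len; values are occurrence counts.
    """
    counts: Counter = Counter()
    n = len(visits)
    for length in range(min_len, min(max_len, n) + 1):
        # Slide a window of the current length over the access list.
        for start in range(n - length + 1):
            counts[tuple(visits[start:start + length])] += 1
    return counts
```

Running the counter over one session's accesses corresponds to the per-session view, and running it over all of a user's accesses concatenated corresponds to the per-user view. For a trail made of the three sites in the example above repeated seven times, the three-site path receives a count of seven.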
The PDM analysis revealed long sequences of between-site access patterns on a per-session and a
per-user basis. By "per-session" we refer to patterns within a session by a
single user. Likewise, by "per-user" we refer to all sessions by a user, thus
allowing for the identification of between-session patterns. For the per-session analysis,
paths including seven different sites occurred with a frequency of five times. On a
per-user basis, the PDM algorithm identified sequences of length eight with a frequency of
nine. Furthermore, numerous shorter sequences were discovered with higher frequencies, up to
a maximum frequency of seventeen.
----------------------------------------------------------------------
|                   | High Frequency            | Low Frequency      |
----------------------------------------------------------------------
| Short Path Length | home pages                | sporadic visits    |
|                   | orientation pages         | dead ends          |
|                   | meta indexes              | un-useful pages    |
----------------------------------------------------------------------
| Long Path Length  | source of reference sites | one shot resources |
|                   | like NCSA or CERN         | directed searching |
----------------------------------------------------------------------
Table 2. Characterization of sites based on frequency and path length relations.
In addition, an analysis of the length of paths within each site visited per user was
performed. Figure 1 shows the average frequency per path length, that is, the mean
frequency of paths of length x, for all x between 2 and 50. Exploratory data analysis
revealed a slightly negative linear relationship between frequency and path length, with
the slope across all users equalling -0.24. Thus,
frequency = -0.24 (path length)
This equation was derived from all sites except http://www.cnam.fr and the Georgia
Tech servers (http://www.cc.gatech.edu and http://www.gatech.edu), which were excluded
due to abnormal access patterns.
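The text does not say how the slope was estimated. As one plausible reading, the sketch below fits an ordinary least-squares line to (path length, frequency) pairs. The function name `fit_line` is hypothetical, and the fit returns an intercept as well, even though the equation above reports only the slope.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of ys = slope * xs + intercept.

    Here xs would be path lengths and ys the mean frequency observed
    for each path length. Returns (slope, intercept).
    """
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

Fitting one line per user yields the individual slopes and intercepts, while fitting the pooled data gives a single population-wide slope of the kind reported above.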
Discussion
Given the above relationship between frequency and depth, one can begin to characterize
navigation strategies based on users' average slope. Using Cove and Walsh's
characterizations, the following classifications can be made:
- "Serendipitous Browser" (slope < -0.24): These users avoid the repetition of long
invocation sequences. This shallow browsing may reflect the structure of the WWW
repository, in that the databases visited by these users may be weakly connected.
- "General Purpose Browser" (slope = -0.24): Here users perform as expected.
Probabilistically, they have roughly a one in four chance of repeating a more complex
navigation sequence. This is the average inertia for all users sampled.
- "Searcher" (slope > -0.24): A user performs the same short navigation sequences
relatively infrequently, but does perform long navigational sequences often.
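The three classes above amount to comparing a user's slope against the population slope. A minimal sketch follows, with one loudly stated assumption: an exact match with -0.24 is unlikely for measured data, so a hypothetical `tolerance` band around the reference is treated as "equal". That band and the function name `classify_user` are not part of the study.

```python
POPULATION_SLOPE = -0.24  # slope across all users, as reported in the text


def classify_user(
    slope: float,
    reference: float = POPULATION_SLOPE,
    tolerance: float = 0.02,  # assumed width of the "equal" band, not from the study
) -> str:
    """Map a user's frequency-versus-path-length slope to a browsing class."""
    if slope < reference - tolerance:
        # Steeper decline: long sequences are rarely repeated.
        return "Serendipitous Browser"
    if slope > reference + tolerance:
        # Flatter line: long sequences are repeated comparatively often.
        return "Searcher"
    return "General Purpose Browser"
```

Setting `tolerance` to zero reproduces the strict comparisons in the list above.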
Furthermore, the slope can be used to classify sets of documents according to their usage patterns.
Table 2 displays the classification of several types of site visits by frequency and
length as supported by the data.
Within Site Navigation
Overall, users tended to operate in one small area within a particular site. This structure
resembles a spoke and hub structure due to the frequent use of backtracking. Backtracking
occurs when a user issues the "Back" command to exit a server via the path used
for entry. This "leave as you've entered" strategy was heavily used by all
users. In contrast, the looping back strategy occurs when users return to the original
point of entry after a path traversal by utilizing the history feature or by selecting a
"Return to Home/Entry Page" link. Both navigation strategies can be visualized
as a kind of spoke and hub structure. In the example below, the user oriented with
http://www.cc.gatech.edu/people and http://www.cc.gatech.edu/people/People.Faculty.html as
hubs.
- http://www.cc.gatech.edu/people/
- http://www.cc.gatech.edu/people/People.Faculty.html
- http://www.cc.gatech.edu/gvu/people/Faculty/Neff.Walker.html
- http://www.cc.gatech.edu/people/People.Faculty.html
- http://www.cc.gatech.edu/gvu/people/Faculty/Piyawadee.Sukaviriya.html
- http://www.cc.gatech.edu/people/People.Faculty.html
- http://www.cc.gatech.edu/gvu/people/Faculty/Michael.J.Sinclair.html
- http://www.cc.gatech.edu/people/People.Faculty.html
- http://www.cc.gatech.edu/people/
The example above is very typical in that users rarely traverse more than two layers in the
hypertext structure before returning to an entry point. Initial evidence suggests that
this pattern occurs independent of hyperlinks-per-page ratios.
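One simple way to surface hubs in a trail like the one listed above is to look for pages the user returns to. The sketch below is an illustrative heuristic rather than the analysis used in the study: the name `find_hubs`, the `min_visits` threshold, and the choice to ignore a trailing slash when comparing URLs are all assumptions.

```python
from collections import Counter
from typing import List, Sequence


def find_hubs(trail: Sequence[str], min_visits: int = 2) -> List[str]:
    """Return pages visited at least `min_visits` times, in order of first visit.

    A trailing "/" is ignored so that ".../people" and ".../people/"
    count as the same page (an assumption made for this sketch).
    """
    normalized = [url.rstrip("/") for url in trail]
    visits = Counter(normalized)
    hubs: List[str] = []
    for url in normalized:
        if visits[url] >= min_visits and url not in hubs:
            hubs.append(url)
    return hubs
```

Applied to the nine requests in the example, this returns the people index and the faculty listing, the two pages described as hubs; the three individual faculty pages are each visited once and act as spokes.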
Other Navigation Techniques
One supplemental navigation method often observed was use of home pages as indexes to
interesting places. For instance, a typical session begins with the "College of
Computing Home Page" followed by a traversal to a user's personal home page. Once
there, jumps to other sites, or other parts of the local database ensue. While providing
similar functionality to "Hotlist" commands, the use of personal home pages as
indexes allows for better layout control and customization and therefore is a natural, yet
crafty adaptation to an impaired interface.
What's Worth Saving?
Surprisingly, only 2% of retrieved documents were either saved to file or printed. Furthermore,
"Window History" and "Hotlist" based document accesses accounted for
less than 3% of all accesses. The minimal use of such archival interface commands may be
indicative of one or more of the following: the quality of Web documents, the temporal nature
of certain documents, the design of these archival interfaces, or reliance on other
navigation techniques like personal home pages.
This also implies that there is minimal potential for copyright infringement by this population.
If material retrieved by users was printed or saved to disk, unauthorized local copies of
information could potentially violate certain copyright restrictions, although legal
precedent remains to be set.
Directions for Design
Since users accessed on average 10 pages per server, this would indicate that "must
see" information must be accessible within two to three jumps of the initial home
page (two/three navigations in, two/three out, performed three/two times). However, the
placement of numerous links on one page can lead to increased search time by users to find
relevant information as well as a cluttered screen layout. As such, information dense
interface tactics that preserve screen space, such as using image maps, may be a more
successful strategy for page design.
For rich information ecologies, the use of indexes throughout the document space supports the
observed hub and spoke usage patterns. Additionally, these pages help orient users, minimizing
the "lost in hypertext" phenomenon. Since most users explored small regions at a
time, this design recommendation can increase the exploration of clusters of related
information.
Document designers need to be cognizant of the classification of expected visitors as serendipitous
browser, general browser, or searcher. Granted, within a server, collections of documents
need to be targeted toward different users. Just the same, authors aware of the three
classes of users can tailor documents to suit the intended use of the documents. When more
than one class of visitor is expected, a separate document can be created for each class,
thus providing customized, alternative views of the information. Note that this already
occurs with the stratification of users based upon graphics-based and text users as well
as forms and nonforms-compliant Web clients.
In designing for all strategies and behaviors, there exists a tension between "volatile
hypertexts" and efficiency (between the browser and the searcher) in all of these
recommendations. However, as Sproull and Kiesler [Sproull & Kiesler, 1993] found in
their study of the uses of electronic mail, efficiency may not always be the appropriate
metric for system evaluation. User satisfaction may provide a more accurate measure of the
success of an interface.
In the future, servers may use the user classification to offer a "usual" view of a
database. Additionally, servers could also offer a guided tour of a server based on the
paths most travelled, or more excitingly, alter page design on the fly based on accesses
by users.
Future Analysis
Recent studies that correlate reading time with document relevancy for USENET news articles
suggest that a similar correlation may exist with Web information spaces as well. That is,
we hypothesize that browsers spend less time on pages and within sites than searchers.
Users who access a large number of documents in a fixed period of time will have higher
y-intercepts in their individual frequency to path length plots. These users may well be
prime candidates for macro suggestion. Furthermore, it would be interesting to run a
correlation analysis on the y-intercepts and the total number of sites visited.
Finally, a cost function for browsing can be developed based on analysis of expected value to the
user of particular information and the expected time to retrieve that information.
