I've recently been doing some research on exactly how the Web generates
network effects, and in the process I finally read the following excellent
paper:
"Summary of Web Characterization", James E. Pitkow, In Proc. WWW7, pages
551-558.
http://decweb.ethz.ch/WWW7/1877/com1877.htm
This is a survey of existing research on Web characterization which collects
together 8 invariant characteristics of Web behavior.
Quoted directly from table 1 of the paper, these invariants are:
Invariant: Requested file popularity
Sources:   [Glassman 1994] [Cunha et al 1995] [Almeida et al 1996]
Metric:    Zipf Distribution

Invariant: File sizes (requested and from entire Web)
Sources:   [Cunha et al 1995] [Bray 1996] [Woodruff et al 1996]
           [Arlitt and Williamson 1996]
Metric:    Heavy tailed (Pareto) with average HTML size of 4-6 KB and
           median of 2 KB, images have an average size of 14 KB

Invariant: Traffic properties
Sources:   [Sedayao 1994] [Cunha et al 1995] [Arlitt and Williamson 1996]
Metric:    Small images account for the majority of the traffic and
           document size is inversely related to request frequency

Invariant: Self-similarity of HTTP traffic
Sources:   [Crovella and Bestavros 1995] [Gribble and Brewer 1997]
Metric:    Bursty, self similar traffic between the microsecond and
           minute time range

Invariant: Periodic nature of HTTP traffic
Sources:   [Bolot and Hoschka 1996] [Abdulla et al 1997a]
           [Gribble and Brewer 1997]
Metric:    Periodic traffic patterns able to be modeled by time series
           analysis at the hour to weekly time range

Invariant: Site popularity
Sources:   [Arlitt and Williamson 1996] [Abdulla et al 1997b]
Metric:    Roughly 25% of the servers account for over 85% of the traffic

Invariant: Life span of documents
Sources:   [Worrell 1994] [Gwertzman and Seltzer 1996]
Metric:    Around 50 days, with HTML files being modified and deleted more
           frequently than images and other media

Invariant: Occurrence rate of broken links while surfing
Sources:   [WCG 1997-Xerox PARC, Virginia Tech]
Metric:    Between 5-8% of all requested files

Invariant: Occurrence rate of redirects
Sources:   [WCG 1997-Xerox PARC, Virginia Tech]
Metric:    Between 13-19% of all requested files

Invariant: Number of page requests per site
Sources:   [Huberman et al 1997] [Catledge and Pitkow 1995]
           [Cunha et al 1995]
Metric:    Heavy tailed (Inverse Gaussian) distribution with typical mean
           of 3, standard deviation of 9, and mode of 1 page request per site

Invariant: Reading time per page
Sources:   [Catledge and Pitkow 1995] [Cunha et al 1995]
Metric:    Heavy tailed distribution with an average 30 seconds, median of
           7 seconds, and standard deviation of 100 seconds

Invariant: Session time outs
Sources:   [Catledge and Pitkow 1995] [Cunha et al 1995]
Metric:    25 minutes, with mean time of 9 minutes
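
For anyone who wants a feel for what the Zipf entry implies, here is a rough sketch of my own, not from the paper. Under Zipf's law the share of requests going to the file at popularity rank r is proportional to 1/r^s. The exponent s = 1 and the catalog of 10,000 files below are assumptions picked for illustration, and the function names are hypothetical.

```python
def zipf_shares(n_files, s=1.0):
    """Fraction of requests going to each popularity rank 1..n_files under Zipf's law."""
    # Weight of rank r is 1 / r**s; normalize so the shares sum to 1.
    weights = [1.0 / rank ** s for rank in range(1, n_files + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def top_share(shares, fraction):
    """Fraction of all requests captured by the most popular `fraction` of files."""
    k = max(1, round(len(shares) * fraction))
    return sum(shares[:k])


# Assumed values: 10,000 files and exponent 1 are illustrative, not measured.
shares = zipf_shares(10_000)
for fraction in (0.01, 0.10, 0.25):
    print(f"top {fraction:.0%} of files -> {top_share(shares, fraction):.1%} of requests")
```

Changing s or the catalog size shows how strongly requests concentrate on a small set of popular files, which is the property that makes caching attractive.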
- Jim