Gibson et al. (1998). Inferring web communities from link topology

Gibson, D., Kleinberg, J., & Raghavan, P. (1998). Inferring web communities from link topology. … Of the Ninth ACM Conference on ….

Notes:

p.2: Rather, our experimentation with HITS shows that such communities of hubs and authorities are a recurring consequence of the way in which creators of pages on the WWW link to one another in the context of topics of widespread interest. — Highlighted Nov 29, 2015

p.2: HITS hypermedia sources for broad-topic information discovery. — Highlighted Nov 29, 2015

p.2: The technique underlying [17]: first, that the implicit annotation provided by human creators of hyperlinks contains sufficient information to infer a notion of “authority”; and second, building on this, that sufficiently broad topics contain embedded communities of hyperlinked pages. We view such communities as containing two distinct, but interrelated, types of pages: authorities (highly-referenced pages) on the topic, as well as numerous pages that “point” to many of the authorities, and thus serve to “pull” them together. We refer to pages of the latter type as “hubs,” since they serve as strong central points from which authority is “conferred” on relevant pages. Hubs and authorities exhibit what could be called a mutually reinforcing relationship: a good hub points to many good authorities; a good authority is pointed to by many good hubs. — Highlighted Nov 29, 2015
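
The mutually reinforcing relationship in the highlight above can be sketched as an iterative score update. This is a minimal illustration, not the authors' implementation: the function names `hits` and `normalize`, the fixed iteration count, the L2 normalization, and the toy graph `example_links` are all assumptions made for this note.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a link graph.

    links maps each page to the pages it points to. The iteration
    count is an arbitrary illustrative choice.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # A good authority is pointed to by many good hubs.
        new_auth = {page: 0.0 for page in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # A good hub points to many good authorities.
        new_hub = {
            page: sum(new_auth[dst] for dst in links.get(page, ()))
            for page in pages
        }

        auth = normalize(new_auth)
        hub = normalize(new_hub)

    return hub, auth


# Hypothetical toy graph: three hub-like pages, two authority-like pages.
example_links = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2"],
    "h3": ["a2"],
}

hub, auth = hits(example_links)
for page in sorted(auth, key=auth.get, reverse=True):
    print(page, round(auth[page], 3), round(hub[page], 3))
```

In the toy graph, `a2` is pointed to by all three hub-like pages and `a1` by only two, so `a2` should end up with the higher authority score.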

p.3: Note that the method is extremely simple and mathematically clean: one can analyze its convergence properties in a rigorous fashion, and the only tunable parameter is the procedure for fixing the root set. We feel that this makes the technique an appealing framework in which to search for inherent structure in Web communities. The fact that the method is designed to run on an arbitrary link structure, without fine-tuning or the incorporation of expert knowledge about the WWW, suggests that the structural observations that emerge are largely intrinsic to the Web itself, rather than an artifact of an “over-trained” algorithm. — Highlighted Nov 29, 2015
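
The convergence remark can be made concrete with the usual matrix form of the hub and authority updates. The notation is introduced for this note and does not come from the highlighted passage: $A$ is the adjacency matrix of the base set, with $A_{ij} = 1$ when page $i$ links to page $j$, and $a^{(k)}$, $h^{(k)}$ are the authority and hub vectors after $k$ rounds, each rescaled to unit length.

```latex
\begin{aligned}
a^{(k)} &\propto A^{\top} h^{(k-1)}, &
h^{(k)} &\propto A\, a^{(k)}, \\
a^{(k)} &\propto (A^{\top} A)^{k-1} A^{\top} h^{(0)}, &
h^{(k)} &\propto (A A^{\top})^{k}\, h^{(0)}.
\end{aligned}
```

This is power iteration on $A^{\top} A$ and $A A^{\top}$. Under the standard assumptions (the starting vector is not orthogonal to the principal eigenvector, and the largest eigenvalue is not repeated), $a^{(k)}$ and $h^{(k)}$ converge to the principal eigenvectors of those two matrices.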

p.3: Multiple communities can also form in the base set because a query term has several meanings in different contexts; for example, the topic “geometry” produces communities on computational geometry, differential geometry, and seismic geometry. — Highlighted Nov 29, 2015

p.3: The use of eigenvectors for the purposes of partitioning a graph was introduced by Donath and Hoffman in [10] and has been studied extensively since — Highlighted Nov 29, 2015

p.4: Link structures have been studied in hypertext research that predates the WWW; in particular, Botafogo et al. [4] introduce graph-theoretic measures based on link density and node-to-node distances for clustering and searching in hypermedia — Highlighted Nov 29, 2015

p.4: The field of bibliometrics studies the patterns of citation, an implicit type of “linkage,” among scientific papers. See [27] for a review. A number of their measures have meaning in the context of hypermedia; some of these connections are studied in [18]. One can also interpret the behavior of HITS as relying on a type of community memory, as studied by Marshall et al. [20]. — Highlighted Nov 29, 2015

p.5: Note that a fairly counter-intuitive point has been emerging from the development above. Specifically: The greatest degree of orderly structure, as extracted by HITS, is found in communities for which the number of relevant pages, and the density of hyperlinking, is the largest. We have seen this phenomenon with “cryptography” and “English literature”. This is in contrast to the standard point of view that the WWW is becoming increasingly “chaotic” and difficult to model; it suggests that the technique underlying HITS is actually becoming more effective as the size of the Web continues to increase. — Highlighted Nov 29, 2015

p.5: Robustness. For broad topics, robust communities despite starting from a very small sample of relevant pages in the initial root set. We have explored this by several direct methods, providing [HITS] with a variety of different root sets relevant to the same topic — Highlighted Nov 29, 2015

p.5: Topic Generalization. There is no sharp boundary between those topics that are “broad” and those that aren’t; but one of the primary themes that emerges from our experience is that HITS tends to “generalize” topics that are not sufficiently broad. By this we mean that the principal community of hubs and authorities will be relevant to a topic which includes, but is larger than, the initial topic provided to HITS — Highlighted Nov 29, 2015

p.6: Convergent Generalization and a Tree of Topics. The implicit topic hierarchy just discussed can be explored more fully through further experimentation with HITS. It is natural to picture the process of abstraction and generalization as occurring on a tree of topics: the most general topics (e.g. Science, Art, Recreation) are closest to the root, and their descendants represent sub-topics. Such an idea has been realized on the WWW, by human ontologists, through the construction of searchable hierarchies such as Yahoo. A searchable hierarchy includes hand-annotation of the various topics — Highlighted Nov 29, 2015

p.6: We have found that, given a broad topic on which the technique does not generalize, one often discovers very similar communities by applying HITS to a range of more specific sub-topics — Highlighted Nov 29, 2015

p.7: The process of page creation on the WWW has many simultaneous participants, and some of these can bring many resources to bear on engineering the link structure in a way that favors them. Thus, for topics with both commercial and individual involvement, the authorities in the principal community will overwhelmingly tend to be highly-commercialized pages. — Highlighted Nov 29, 2015

p.8: There can often be substantial short-term factors that temporarily influence the set of principal authorities for a topic; in the above example, the Harvard Conference on the Internet and Society was still a prominent feature of the topic “Harvard” in January 1997, but not in August 1997. — Highlighted Nov 29, 2015

p.8: Short-term influences die out as pages and links are removed over time; though they can be artificially kept “current” by their inclusion in the indices of search engines. — Highlighted Nov 29, 2015

p.8: superimpose the results of HITS on the same topic, spaced out over several-month periods — Highlighted Nov 29, 2015

p.8: This type of dissection can also be an effective way to separate out the “Web-centric” influences on a topic. — Highlighted Nov 29, 2015