Since its inception, the World Wide Web has changed the ways scientists communicate, collaborate, and educate. There is, however, a growing realization among many researchers that a clear research agenda aimed at understanding the current, evolving, and potential Web is needed. If we want to model the Web; if we want to understand the architectural principles that have provided for its growth; and if we want to be sure that it supports the basic social values of trustworthiness, privacy, and respect for social boundaries, then we must chart out a research agenda that targets the Web as a primary focus of attention.
When we discuss an agenda for a science of the Web, we use the term “science” in two ways. Physical and biological science analyzes the natural world and tries to find microscopic laws that, extrapolated to the macroscopic realm, would generate the behavior observed. Computer science, by contrast, though partly analytic, is principally synthetic: It is concerned with the construction of new languages and algorithms in order to produce novel desired computer behaviors. Web science combines these two modes. The Web is an engineered space created through formally specified languages and protocols. However, because humans are the creators of Web pages and of the links between them, their interactions form emergent patterns in the Web at a macroscopic scale. These human interactions are, in turn, governed by social conventions and laws. Web science, therefore, must be inherently interdisciplinary; its goal is both to understand the growth of the Web and to create approaches that allow powerful new and more beneficial patterns to occur.
Unfortunately, such a research area does not yet exist in a coherent form. Within computer science, Web-related research has largely focused on information-retrieval algorithms and on algorithms for routing information through the underlying Internet. Outside of computing, researchers grow ever more dependent on the Web, but they have no coherent agenda for exploring its emerging trends, nor are they fully engaged with the emerging Web research community in focusing its work more specifically on scientists' needs.
Leading Web researchers discussed the scientific and engineering problems that form the core of Web science at a workshop of the British Computer Society in London in September 2005 (1). The participants considered emerging trends on the Web and debated the specific types of research needed to exploit the opportunities as new media types, data sources, and knowledge bases become “Webized,” as Web access becomes increasingly mobile and ubiquitous, and as the need increases for privacy guarantees and control of information on the Web.
The workshop covered a wide range of technical and legal topics. For example, there has been research done on the structure and topology of the Web (2, 3) and the laws of connectivity and scaling to which it appears to conform (4–6). This work leads some to argue that the development of the Web has followed an evolutionary path, suggesting a view of the Web in ecological terms. These analyses also showed the Web to have scale-free and small-world network structures, which have largely been studied by physicists and mathematicians using the tools of complex dynamical systems analysis.
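The scale-free property mentioned above shows up even in toy examples: a few “hub” pages attract most of the inbound links, while most pages attract very few. The following sketch computes in-degrees for a small hypothetical link graph (the graph and its numbers are illustrative, not measured Web data):

```python
from collections import Counter

# A toy directed link graph: page -> pages it links to (hypothetical data).
links = {
    "hub": ["a", "b", "c", "d"],
    "a":   ["hub"],
    "b":   ["hub"],
    "c":   ["hub", "a"],
    "d":   ["hub"],
    "e":   ["hub"],
}

# In-degree: how many pages link to each page.
in_degree = Counter(target for targets in links.values() for target in targets)
print(in_degree.most_common())
```

In a scale-free network, the fraction of pages with k inbound links falls off roughly as a power law in k, so skew of this kind persists at every scale; here it appears in miniature as one hub collecting most of the links.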
The need for better mathematical modeling of the Web is clear. Take the simple problem of finding an authoritative page on a given topic. Conventional information-retrieval techniques are insufficient at the scale of the Web. However, it turns out that human topics of conversation on the Web can be analyzed by looking at a matrix of links (7, 8). The mathematics of information retrieval and structure-based search will certainly continue to be a fertile area of research as the Web itself grows. However, approaches to developing a mathematical framework for modeling the Web vary widely, and any substantive impact will again require new methods. The process-oriented methodologies of the formal systems community, the symbolic modeling methodologies of the artificial intelligence and semantics researchers, and the mathematical methods used in network analyses are all relevant, but no current mathematical model can unify all of these.
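One concrete instance of “analyzing a matrix of links” is the power-iteration computation behind PageRank-style authority scores: a page is authoritative if authoritative pages link to it. The sketch below runs that iteration on a tiny hypothetical three-page graph; it is a minimal illustration of the idea, not the production algorithm:

```python
# Power iteration over a tiny link graph (illustrative data, not the real Web).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pages = sorted(links)
n = len(pages)
damping = 0.85  # probability of following a link vs. jumping to a random page

rank = {p: 1.0 / n for p in pages}
for _ in range(50):  # iterate until the scores settle
    new_rank = {p: (1 - damping) / n for p in pages}
    for page, targets in links.items():
        share = damping * rank[page] / len(targets)  # split rank among outlinks
        for t in targets:
            new_rank[t] += share
    rank = new_rank

print({p: round(r, 3) for p, r in rank.items()})
```

Because page "c" receives links from both "a" and "b", it ends up with the highest score; the total rank mass stays at 1, since every page's score is fully redistributed each round.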
One particular ongoing extension of the Web is in the direction of moving from text documents to data resources (see the figure). In the Web of human-readable documents, natural-language processing techniques can extract some meaning from the human-readable text of the pages. These approaches are based on “latent” semantics, that is, on the computer using heuristic techniques to recapitulate the intended meanings used in human communication. By contrast, in the “Semantic Web” of relational data and logical assertions, computer logic is in its element, and can do much more.
Researchers are exploring the use of new, logically based languages for question answering, hypothesis checking, and data modeling. Imagine being able to query the Web for a chemical in a specific cell biology pathway that has a certain regulatory status as a drug and is available at a certain price. The engineering challenge is to allow independently developed data systems to be connected together without requiring global agreement as to terms and concepts. The statistical methods that serve for the scaling of language resources in search tasks and the data calculi that are used in scaling database queries are largely based on incompatible assumptions, and unifying these will be a major challenge.
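The cross-source query imagined above can be sketched as pattern matching over merged subject-predicate-object triples, the data model underlying the Semantic Web. Everything in this sketch is hypothetical: the predicate names, the prefixed identifiers, and the data are invented for illustration, and real systems would use RDF stores and a query language such as SPARQL:

```python
# Two independently published (hypothetical) data sets, merged as triples.
pathway_db = [
    ("chem:aspirin", "bio:participatesIn", "pathway:prostaglandin-synthesis"),
]
drug_db = [
    ("chem:aspirin", "reg:status", "approved"),
    ("chem:aspirin", "mkt:priceUSD", 0.05),
]
triples = pathway_db + drug_db

def query(triples, pattern):
    """Return the subjects satisfying every (predicate, test) constraint."""
    subjects = {s for s, _, _ in triples}
    for pred, test in pattern:
        subjects &= {s for s, p, o in triples if p == pred and test(o)}
    return subjects

# "A chemical in a given pathway, with a given regulatory status, under a price."
hits = query(triples, [
    ("bio:participatesIn", lambda o: o == "pathway:prostaglandin-synthesis"),
    ("reg:status",         lambda o: o == "approved"),
    ("mkt:priceUSD",       lambda o: o <= 0.10),
])
print(hits)
```

The point of the sketch is the engineering challenge named in the text: the two source databases were merged without any global agreement beyond the shared identifier for the chemical itself.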
Despite excitement about the Semantic Web, most of the world's data are locked in large data stores and are not published as an open Web of inter-referring resources. As a result, the reuse of information has been limited. Substantial research challenges arise in changing this situation: how to effectively query an unbounded Web of linked information repositories, how to align and map between different data models, and how to visualize and navigate the huge connected graph of information that results. In addition, a policy question arises as to how to control the access to data resources being shared on the Web. This latter question has implications both for the underlying technologies that could provide greater protections and for issues of ownership in, for example, scientific data-sharing and grid computing.
The scale, topology, and power of decentralized information systems such as the Web also pose a unique set of social and public-policy challenges. Although computer and information science have generally concentrated on the representation and analysis of information, attention also needs to be given to the social and legal relationships behind this information (9). Transparency and control over these complex social and legal relationships are vital, but require a much better-developed set of models and tools that can represent these relationships. Early efforts at modeling in the area of privacy and intellectual property have begun to establish the scientific and legal challenges associated with representing and providing users with control over their own information. Our aim is to be able to design “policy aware” systems that provide reasoning over these policies, enable agents to act on a user's behalf, make compliance easier, and provide accountability where rules are broken.
Web science is about more than modeling the current Web. It is about engineering new infrastructure protocols and understanding the society that uses them, and it is about the creation of beneficial new systems. It has its own ethos: decentralization to avoid social and technical bottlenecks, openness to the reuse of information in unexpected ways, and fairness. It uses powerful scientific and mathematical techniques from many disciplines to consider at once microscopic Web properties, macroscopic Web phenomena, and the relationships between them. Web science is about making powerful new tools for humanity, and doing it with our eyes open.
HyperNotes: Related Resources on the World Wide Web
World Wide Web
International World Wide Web Conference Committee. Sponsor of annual conferences about the World Wide Web; some papers from the 2006 conference and proceedings of the 2005 and 2004 conferences are made available.
History of Internet and WWW: The Roads and Crossroads of Internet History. An overview by G. R. Gromov.
Web Science Workshop
Structure and Topology of the Web
“The Structure of the Web” A Perspective by J. Kleinberg and S. Lawrence in the 30 November 2001 issue of Science.
“Modeling the Internet's Large-Scale Topology” A 15 October 2002 article by S.-H. Yook, H. Jeong, and A.-L. Barabási in the Proceedings of the National Academy of Sciences.
Scale-Free and Small-World Phenomena
“Scale-Free Characteristics of Random Networks: The Topology of the World Wide Web” 2000 Physica A article by A.-L. Barabási, R. Albert, and H. Jeong, made available by A.-L. Barabási.
“Science and the Semantic Web” Policy Forum by J. Hendler in the 24 January 2003 issue of Science.
“The Semantic Web Revisited” May-June 2006 article by N. Shadbolt, T. Berners-Lee, and W. Hall, made available in the EPrints Repository, School of Electronics and Computer Science, University of Southampton, UK.
“SW@5: Current Status and Future Promise of the Semantic Web” Keynote address by J. Hendler and O. Lassila at the 2006 Semantic Technologies Conference.
“Can Grid Computing Help Us Work Together?” A News Focus article by D. Clery in the 28 July 2006 issue of Science.
“Seven Questions: Battling for Control of the Internet” An interview with Lawrence Lessig made available on the Web 8 November 2005 by Foreign Policy.
Tim Berners-Lee and Daniel J. Weitzner are at the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA.