Dynamically Linked Web Data – The Next Internet Revolution

Reconfiguring the World Wide Web into a giant relational database that can learn.

Science & Technology

It’s been almost 20 years since Tim Berners-Lee first created the hypertext markup language (HTML) and the hypertext transfer protocol (HTTP) between clients and servers that together form the structural backbone of the World Wide Web (WWW) we all use today. Web data – text, graphics, video, audio – are connected via simple hypertext links, and not much more. As such, the power of the web is structurally limited to a static system: the type of the data behind a link is often invisible, and no further information about the data is carried in a dynamic link variable. Data context – a semantic interpretation of the data – is likewise unexploited. In fact, the dynamically linked variables common in object-oriented programming languages remain absent from the web, yet they point to what it could evolve into: a giant relational database, where the power of data relationships can be leveraged to give web users just the data they have been looking for, rather than a mountain of data that is irrelevant or unneeded.

The person leading this new charge for a massive change in the web as we know it? Tim Berners-Lee. And yes, there are others, notably those associated with the World Wide Web Consortium (W3C) that Berners-Lee started. The new web paradigm is often referred to as the “Semantic Web,” as it denotes defining information and services on the web semantically, so that web tools can more intelligently interpret or “comprehend” user requests and machines can process web data more efficiently [1]. I prefer to use the term “dynamically linked web data,” or DLWD, in conjunction with the Semantic Web, as together they more directly name the proposed transformation (and what lies beyond it).

What are the features of DLWD that would make the Semantic Web (SW) a powerful experience for web users? How would they practically be implemented? We explore those nontrivial issues next.

Dynamically Linked Web Data (DLWD) and Implications

The power of DLWD is best understood through an example of how a web search for information could yield vastly more meaningful and targeted data or query returns.

In a typical keyword-based web search, we get back all HTML links to text or data files containing some or all of the keywords we specified. This mountain of query returns, which can run to the tens of thousands, is often unproductive. However, if each piece of web data (text, graphics, video, audio) we searched had an associated link variable, and that variable were populated with information that is also searchable, we would have a better chance of getting back the data we need. Furthermore, the link variables themselves can be dynamically linked to each other, much like pointer variables in object-oriented programming, yielding a correlation factor among search data and thereby potentially increasing the value of search query returns enormously.

As an example, let’s say you’re looking for a collection of surveys and reviews of a prescription drug on the web, but you want the results targeted to include only surveys that women of a certain age range have responded to. If you enter the keywords “<drug name> women age 40-45” into a typical search engine, you might get back 1,000 or more returns. On the first page alone you might see links to information that has nothing to do with the drug you listed – the link and associated text just happened to contain the keywords “women age 40-45.” Many of the returns may not reflect what you meant by “40-45”: some at the top of the list may use 40-45 to mean completely different things, like “40-45”% of something, or pages “40-45” of some periodical. The primary issue is that the context, or semantics, of our query is not used in a meaningful way, because the data links we search carry no encoded context, and the reader of those links (the search engine) could not interpret such context even if it existed.

Enter DLWD and the Semantic Web. In this realm, websites tag information entered by users with links that are variables, which in turn are populated with useful, searchable information and may even point to other links (DLWD). How this is done is not trivial: the survey data we seek may often come from users who wish to remain anonymous and who provide only casual feedback on any number of informal forums, such as a chat room. The information entered must be encoded into a machine-processable DLWD variable. That feedback might then be aggregated with feedback from an entirely different website by connecting data links that have meaningful relationships (fields of the DLWD variables).
A search tool designed to read and interpret DLWD would be able to limit the query returns to those with relationships most closely matching what we specify in keyword and context. In our example, if age and gender variables were associated with the searchable data that also contained the drug keyword, we would get back all the DLWD links whose gender field is “women” and whose age field contains numbers in the range 40-45. We might still get spurious results if matches contained only the numbers 40 and 45, or other fields with a number range “40-45,” so context is still an issue. To solve that problem, there must be a way for the search tool to interpret what we meant by “age 40-45” – that we want the ages 40, 41, 42, 43, 44, and 45. A human knows what the number range means, but a machine might not unless it is told to translate “40-45” into 40, 41, 42, 43, 44, 45. One way to ensure this happens is to include in the DLWD variable a pointer to a translation document that defines “40-45” in exactly this way. In the lingo of the Semantic Web [1], this kind of document is referred to as an “ontology.” Ontologies can interpret the meaning of a particular data set, and more generally they can provide an extensive taxonomy (definitions of all kinds of data objects and the relationships between them, even across different platforms) and inference rules (if-then-else relationships between variables). In our example, an applicable inference rule might be: “if age is associated with the number range 40-45, then return all related links with the age numbers 40, 41, 42, 43, 44, 45.” The ontology may also interpret the gender field values “women” and “female” as equivalent, so that we get all of those DLWD links in our search query results.
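The ontology-assisted filtering described above can be sketched in a few lines of Python. Everything here is an illustrative assumption – the record fields, the synonym table, and the sample reviews are invented for the example, not part of any real system:

```python
# Hypothetical sketch: how a DLWD-aware search tool might use a tiny
# "ontology" to expand an age range and normalize gender synonyms.
# All field names and records below are illustrative assumptions.

def expand_age_range(spec):
    """Interpret '40-45' as the individual ages 40, 41, 42, 43, 44, 45."""
    lo, hi = (int(x) for x in spec.split("-"))
    return set(range(lo, hi + 1))

# Inference rule: these gender labels are treated as equivalent.
GENDER_SYNONYMS = {"women": "female", "female": "female"}

def matches(record, drug, gender, age_spec):
    """Return True if a tagged review record satisfies the query context."""
    ages = expand_age_range(age_spec)
    return (record["drug"] == drug
            and GENDER_SYNONYMS.get(record["gender"]) == GENDER_SYNONYMS.get(gender)
            and record["age"] in ages)

reviews = [
    {"drug": "drugX", "gender": "female", "age": 42, "text": "helped"},
    {"drug": "drugX", "gender": "male",   "age": 41, "text": "no effect"},
    {"drug": "drugX", "gender": "women",  "age": 44, "text": "mild side effects"},
]

hits = [r for r in reviews if matches(r, "drugX", "women", "40-45")]
print(len(hits))  # 2 -- only the female reviewers aged 40-45 match
```

The point of the sketch is the division of labor: the ontology (here, `expand_age_range` and `GENDER_SYNONYMS`) carries the interpretation, so the search logic itself stays generic.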

The Semantic Web Implementation

In practice, the implementation of DLWD and the Semantic Web are quite nontrivial. After years of exponential growth of web data that is highly disorganized from a lack of inherent structure or logic, we face an uphill climb to reorganize the web into a system with structure and logic, and yes, maybe even the learning and comprehension ability found in artificial intelligence (AI) systems.

As a first approximation, Berners-Lee and the W3C have come up with a set of standards and specifications [2] for how data might be encoded, ontologies built, and tools constructed.

For data encoding, a resource description framework (RDF) structure is proposed, whereby uniform resource identifiers (URIs), such as typical HTTP uniform resource locator (URL) addresses we all use, contain syntax statements that have or point to content descriptions and also point to other URIs. In a data model form, the syntax is “subject-predicate-object,” where the subject is a resource (e.g. a URI), and the predicate represents features of the resource and also relates the subject to the object. This is referred to as a “triple.” For our example, an RDF encoding might be:

subject:   http://www.<drug name>.com

predicate: http://www.<drug reviews>.com/drugReviews

object:    http://www.<health blog>.com/women/40-45/#1234

to represent “<drug name> reviewed by women of ages 40-45.” In graphical form, the URIs are nodes with properties, linked to other nodes with related properties and values. As another example, a website about the sun could contain temperature data that links to other websites with solar temperature data. Formal RDF would codify this into a knowledge representation that is machine-processable and meaningful. RDF has the potential to go beyond traditional relational database models, particularly if ontologies are included and, as I point out, data links are made dynamic.
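The triple model is simple enough to sketch in plain Python: a store is just a collection of (subject, predicate, object) tuples, and a pattern matcher with wildcards lets us follow links between resources. The URIs and property names below are placeholders, not real addresses or formal RDF:

```python
# Illustrative sketch (not formal RDF): triples as (subject, predicate,
# object) tuples in a tiny in-memory store. None acts as a wildcard
# in the pattern matcher. All URIs and names are invented placeholders.

triples = [
    ("http://example.org/drugX", "ex:drugReviews", "http://example.org/reviews/1234"),
    ("http://example.org/reviews/1234", "ex:gender", "female"),
    ("http://example.org/reviews/1234", "ex:age", 42),
]

def match(store, s=None, p=None, o=None):
    """Return all triples matching the pattern; None matches anything."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Follow the link from the drug to its review, then read a property
# of the review -- two hops across the "graph" of triples:
for _, _, review in match(triples, p="ex:drugReviews"):
    print(match(triples, s=review, p="ex:age"))
```

Because the object of one triple can be the subject of another, the store is a graph rather than a flat table, which is what gives RDF its link-following character.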

For ontology construction, the web ontology language (OWL) has been developed. OWL is actually a family of semantics-based languages, many of which use RDF-style statement structures expressed in eXtensible Markup Language (XML) syntax. XML is a prolific open-standard web development framework: RSS feeds, for example, are written in XML, which allows content to be defined separately from formatting, making it easy to parse and streamlined for use across different platforms and applications.

Like the generic ontology described above, a typical OWL ontology includes definitions of data objects and their relationships, plus inference rules. In OWL, data objects are organized into individuals and classes with general and specific properties, all to exploit the semantic nature of the information. As an example, “women” and “female” are individuals of the class “gender.” There are also datatype properties and object properties: the “age” class may have individual members defined as whole numbers between 1 and 150, while the object property “demographics” relates instances of class “gender” to instances of class “age,” and the property “ageRange” relates “40-45” to “40, 41, 42, 43, 44, 45.” From these examples, one begins to see the power of an ontology when it is included in the realm of a search query.

The challenge with ontologies is maximizing their value: they should be applicable to a wide range of uses, re-usable, and mergeable with other ontologies to further increase their value. Ontology mapping is key to the concept of association in intelligent systems. Think of a learning system: we start with one way to learn (one ontology), merge it with another way to learn (another ontology), and end up with a more powerful learning capability (the collection of ontologies). This is a simplification – learning systems are more complex than a typical ontology – but the point is that ontologies are necessary components of a Semantic Web in which learning capability might exist.
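The merging idea can be made concrete with a toy model: treat each ontology as a mapping from a term to its known equivalents, and merge two of them so the combined system can associate terms neither could alone. The ontologies and terms below are invented for illustration:

```python
# Hypothetical sketch of "ontology mapping": two small ontologies, each a
# dict from a term to a set of equivalent terms, merged into one.
# The terms and ontologies below are illustrative assumptions.

def merge(onto_a, onto_b):
    """Union two term->equivalents ontologies, combining overlapping keys."""
    merged = {term: set(equivs) for term, equivs in onto_a.items()}
    for term, equivs in onto_b.items():
        merged.setdefault(term, set()).update(equivs)
    return merged

medical = {"female": {"women"}, "age": {"years"}}
survey  = {"female": {"F", "woman"}, "respondent": {"participant"}}

combined = merge(medical, survey)
print(sorted(combined["female"]))  # ['F', 'woman', 'women']
```

After the merge, a query for “female” can reach records tagged “women” (from the medical ontology) as well as “F” or “woman” (from the survey ontology): the whole is more capable than either part, which is the sense in which ontology mapping grows a collective capability.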

Tools have been and continue to be developed to take advantage of the W3C Semantic Web paradigm. One of the more useful is a query tool optimized for RDF and other semantic data encodings: the SPARQL Protocol and RDF Query Language (SPARQL). SPARQL has been compared to relational database query languages like SQL, but it has the potential to be much more powerful, with the ability to query data web-wide to find high-value returns. For example, the following might be a SPARQL query for the data we sought earlier with the search keywords “<drug name> women age 40-45”:

PREFIX abc: <http://example.com/exampleOntology#>
SELECT ?drug ?gender ?age
WHERE {
  ?drug abc:drugReviews ?review .
  ?review abc:demographics ?gender ;
          abc:demographics ?age .
  FILTER (?gender = abc:women || ?gender = abc:female)
  FILTER (?age >= 40 && ?age <= 45)
}

The expressions “?drug,” “?gender,” and “?age” are variables that the query engine binds to matching resources, and “drugReviews” and “demographics” are properties relating them; an ontology rule such as “ageRange” is what tells the engine that “40-45” means the individual ages 40 through 45. This query specifies exactly what we want and how each of the keyword terms is related semantically. The problem, of course, is that the web data we search may not be marked up with RDF syntax and interrelated links. Unless that standard is adopted and followed, SPARQL queries would be of little use. The point is that by establishing the standards for RDF, OWL, SPARQL, etc., web developers and users are encouraged to follow them, enabling the Semantic Web and its powerful properties.
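The essence of such a query engine – matching a set of triple patterns while threading variable bindings through each one – can be sketched in plain Python. The triples, property names, and variables below are illustrative assumptions, not a real SPARQL implementation:

```python
# Sketch of the idea behind a SPARQL-style query: match triple patterns
# against a store, carrying variable bindings (strings starting with '?')
# from one pattern to the next. All data and names are illustrative.

triples = [
    ("drugX", "drugReviews", "rev1"),
    ("rev1",  "gender",      "female"),
    ("rev1",  "age",         42),
    ("drugY", "drugReviews", "rev2"),
    ("rev2",  "gender",      "male"),
    ("rev2",  "age",         44),
]

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def solve(patterns, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    pat, rest = patterns[0], patterns[1:]
    for triple in triples:
        b = dict(binding)
        ok = True
        for p, t in zip(pat, triple):
            if is_var(p):
                if p in b and b[p] != t:  # variable already bound differently
                    ok = False
                    break
                b[p] = t
            elif p != t:                  # constant term must match exactly
                ok = False
                break
        if ok:
            yield from solve(rest, b)

query = [("?drug", "drugReviews", "?rev"),
         ("?rev",  "gender",      "female"),
         ("?rev",  "age",         "?age")]

for b in solve(query):
    print(b["?drug"], b["?age"])  # prints: drugX 42
```

The shared variable “?rev” is what joins the three patterns together, exactly the role played by shared variables across triple patterns in a SPARQL WHERE clause.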

So then – how are web users and developers encouraged to adopt and follow these standards? Providing RDF information for data links and linking open data sets (i.e., including RDF statements that point to other URIs so that related properties can be discovered) is an ongoing mission of the W3C Semantic Web Education and Outreach (SWEO) Community Project. Linking open data (LOD) datasets on the web have grown from 500 million RDF syntax triples and 120,000 RDF links between data sources in May 2007 to more than 2 billion RDF triples and 3 million RDF links in April 2008 [3]. Many organizations, mainly academic, are deploying linked data browsers that feature SPARQL and other SW tools; popular RDF browsers in current use are Disco and Tabulator. This era for the SW is not unlike the early era of web development tools, including those that led to Mosaic, the first widely used web browser, developed by Marc Andreessen and Eric Bina at the NCSA/University of Illinois at Urbana-Champaign.

Another important project is DBpedia, essentially a tool to extract structured information from Wikipedia and make that information available in RDF format for further SW use. As of November 2008, the DBpedia dataset consists of around 274 million RDF triples extracted from Wikipedia. These triples are further linked with other open data datasets – a graphical view of the current interlinks can be found in [4] – including the CIA World Factbook, FOAF, and Project Gutenberg. SPARQL queries of the DBpedia dataset can be entered through DBpedia’s public SPARQL endpoint.

Future Concepts Based on DLWD

With all the organic development going on to implement the SW, it is not too early to ask where it could evolve – or revolve (as in revolution) – to. RDF is a convenient way to express web data in a semantic or knowledge representation using interlinked URIs. But to really grow toward a system with AI properties – a seamless, intelligent learning system for the casual user – the SW must become more dynamic. One built-in way this happens is through ontology mapping, where ontologies are strengthened, re-used, and linked with other ontologies to grow a knowledge base, or as I like to put it, a “collective memory.”

To become truly powerful, however, the links themselves should be dynamic – changeable based on any number of forces: learning models, revised or improved information, new information, environmental effects, and so on. Hence the term “DLWD.” How one goes from the SW and RDF in its semi-static form to the SW with DLWD is, I think, a grand challenge. It will involve making links truly dynamic without incurring information loss (a real downside risk). The motivation is to more closely match a system like the human brain, which is a neural network of synaptic circuits forming a collective memory that can learn and think (to put it simply!). Synapses have strengths that are dynamic, and the brain is a highly interconnected system (of ultra-high integration density) of these local memory circuits; a single neuron can carry several thousand synapses. Though there are cells, layers, and regions, the brain also has built-in redundancy – another feature to consider for the SW’s future, and one safeguard against information loss. Association, too, is an important global component of human learning, and it is a feature of SW ontologies.

Cloud computing incorporates dynamically scalable architecture concepts (scale-out databases, autonomic computing, reliability, etc.) that will undoubtedly be important for the growth of the SW and DLWD.

The immediate focus of the SW is to (or ought to be to) provide seamless tools for the casual user (read: the consumer), so that he/she may be able to easily extract intelligent, relevant information and even add to the learning cycle. But for those of us who dream, we’d like to eventually see a web that can pass a Turing Test, and may even be a companion on lonely days.

References and Endnotes:

[1] “The Semantic Web,” T. Berners-Lee, J. Hendler and O. Lassila, Scientific American, May 17, 2001.

[2] See W3C Semantic Web Activity and working group links contained therein. This site is updated regularly with standards, specifications, publications, presentations, case studies, and an activity weblog.

[3] “Linked Data: Principles and State of the Art,” C. Bizer, T. Heath, T. Berners-Lee, April 2008.

[4] “Interlinking DBpedia with other Data Sets,” DBpedia.org. See also [3]. 

All written content is copyright owned by individual authors and/or Eidolonspeak.com. All rights reserved.