Big Data and the Semantic Web

Friday, 29 October 2010, 1:00 pm

Data Storage

Data centers today require high-performance, highly scalable designs. The options available for re-architecting networks have grown dramatically in the last three years. Spanning Tree Protocol (STP) is a link-layer network protocol that allows only a single active path between any two nodes, blocking any redundant links. This helps ensure a loop-free topology but limits the total bandwidth of the network. Switch architectures were designed with limited bandwidth to support these over-subscribed configurations. Spanning Tree has been enhanced in several ways over the years; see Wikipedia for details. Cisco has been a leader in developing and introducing new protocols, along with the software and hardware required to implement them.
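To make the bandwidth trade-off concrete, here is a minimal sketch of the idea behind a spanning tree: pick a root, keep just enough links to reach every switch, and block the rest. The switch names and topology are invented for illustration, and the breadth-first walk is a stand-in; real STP elects the root by bridge ID and chooses links by path cost.

```python
from collections import deque

def spanning_tree(links, root):
    """Return (active, blocked) link sets for a loop-free topology rooted at `root`.

    Simplification: plain breadth-first search, not the real STP election.
    """
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    visited = {root}
    active = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        # Sorted so the result is deterministic for a given topology.
        for peer in sorted(neighbours[node]):
            if peer not in visited:
                visited.add(peer)
                active.add(frozenset((node, peer)))
                queue.append(peer)

    # Every link that is not part of the tree would form a loop, so it is blocked.
    blocked = {frozenset(link) for link in links} - active
    return active, blocked

# Hypothetical four-switch network with redundant links.
links = [("sw1", "sw2"), ("sw1", "sw3"), ("sw2", "sw3"),
         ("sw2", "sw4"), ("sw3", "sw4")]
active, blocked = spanning_tree(links, root="sw1")
print(len(active), "active links,", len(blocked), "blocked links")
```

In this toy network two of the five links carry no traffic at all, which is the bandwidth cost the paragraph above describes.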

Internal IT organizations must increase their own efficiency in the face of growing competition from cloud offerings. CIOs should pilot new network technologies to determine the impact of new architectures on their stack and on change control and management practices. Look for solutions that support interoperability and show a commitment to standards.

Linked Data

If the World Wide Web is a global repository for documents, then the semantic web is a repository for data. The difference is that the end users of documents are humans, while data is generally thought of as needing to be processed and manipulated by machine first. For information to be really useful, we need to understand the relationships between data, and this is basically what is meant by Linked Data. More specifically, Linked Data is hypermedia-based structured data.

Tim Berners-Lee has come up with the following principles for Linked Data:

1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
4. Include links to other URIs, so that they can discover more things.

In short, RDF provides a structured way of making assertions about resources, and SPARQL is a language for querying RDF. RDF comprises triples, specifically subject-predicate-object expressions (e.g. "a zebra", "is", "stripy"). In RDF, the subject and the predicate are named by URIs, while the object may be a URI or a literal string. A predicate and object together are known as a property of the subject. A collection of RDF statements intrinsically represents a labeled, directed multi-graph. The idea is that every element of data is accessible independently via a URI.
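As a rough illustration of the triple model, the sketch below holds a few statements as plain Python tuples and answers a pattern query over them, loosely in the spirit of a SPARQL triple pattern. The example.org URIs and the predicate names (hasCoat, eats) are made up for this example, and a real RDF library would distinguish URIs from literals rather than using bare strings for both.

```python
# Hypothetical namespace; none of these URIs are real vocabulary terms.
EX = "http://example.org/"

# Each statement is a (subject, predicate, object) tuple.
# Subjects and predicates are URIs; objects are URIs or literal strings.
triples = {
    (EX + "zebra", EX + "hasCoat", "stripy"),
    (EX + "zebra", EX + "eats", EX + "grass"),
    (EX + "tiger", EX + "hasCoat", "stripy"),
}

def match(graph, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )

# "Which things have a stripy coat?"
stripy_things = [s for s, _, _ in match(triples, predicate=EX + "hasCoat", obj="stripy")]
print(stripy_things)
```

Because every subject is a URI, the same identifier can appear in statements published by different sources, which is what lets the statements join up into one graph.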

The Notation 3 form of RDF (N3) uses terse whitespace-separated lists to write down triples, with each triple terminated by a full stop. URIs are written inside angle brackets, strings within double quotes. You can specify multiple properties for a single subject in one N3 statement by separating each property with a semicolon. Similarly, you can list multiple objects for a single subject/predicate pair by separating the objects with a comma.
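The punctuation rules are easier to see in output than in description. The following sketch writes a handful of tuples out in N3 style, grouping by subject (semicolons) and by subject/predicate pair (commas). It is a toy serializer under stated assumptions: the example.org URIs are invented, anything starting with "http://" is treated as a URI, and string escaping, prefixes, and datatypes are all ignored.

```python
from collections import defaultdict

EX = "http://example.org/"  # hypothetical namespace

def term(value):
    """Wrap URIs in angle brackets and everything else in double quotes.

    Assumption: a value is a URI if and only if it starts with "http://".
    """
    return f"<{value}>" if value.startswith("http://") else f'"{value}"'

def to_n3(triples):
    """Serialize (subject, predicate, object) tuples as N3-style statements."""
    grouped = defaultdict(lambda: defaultdict(list))
    for s, p, o in sorted(triples):
        grouped[s][p].append(o)

    statements = []
    for s, props in grouped.items():
        # Comma-separate the objects that share a subject/predicate pair.
        parts = [f"{term(p)} " + ", ".join(term(o) for o in objs)
                 for p, objs in props.items()]
        # Semicolon-separate the properties of one subject; end with a full stop.
        statements.append(f"{term(s)} " + " ;\n    ".join(parts) + " .")
    return "\n".join(statements)

n3_text = to_n3([
    (EX + "zebra", EX + "hasCoat", "stripy"),
    (EX + "zebra", EX + "eats", EX + "grass"),
    (EX + "zebra", EX + "eats", EX + "leaves"),
])
print(n3_text)
```

Three triples about the same subject collapse into a single statement, with one semicolon between the two properties and one comma between the two things the zebra eats.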

The W3C's N3 primer provides a very accessible intro.

Data Modelling

Data Modelling is “a method used to define and analyze data requirements needed to support the business processes of an organization”. The problem is that the real world is messy, and describing it in a way that can be manipulated by computers is always problematic.

Basically, data modelling is difficult. That is probably true of any sector, but anyone working in libraries who has looked at how we represent bibliographic and related data, and library processes, in our systems will know it gets complicated extremely quickly. With library data you can easily get bogged down in philosophical questions (what is a book? how do you represent an ‘idea’?).

This is not a problem unique to Linked Data – modelling is hard however you approach it, but my suspicion is that using a Linked Data approach brings these questions to the fore. I’m not entirely sure about this, but my guess is that if you store your data in a relational database, the model is much more in the software that you build on top of this than in the database structure. With Linked Data I think there is a tendency to try to build better models in the inherent data structure (because you can?), leaving less of the modelling decisions to the software implementation.
