Semantic Web

Implementation Bootcamp
Journal of Biomedical Semantics encompasses all aspects of semantic resources and their use in data integration, mining, modeling, interpretation and exploitation in biomedical research.

LinkedData:

Using DOIs for linking the data

RDF

Resource Description Framework
- RDF Schema describes how to use RDF to describe RDF vocabularies. It provides mechanisms for describing groups of related resources and the relationships between these resources.
The RDFizers project is a directory of tools for converting various data formats (JPEG, EML, TEX, DEB, JAVA, ICAL) into RDF.

Using datatype-aware inferences with RDF by Graham Klyne.

To read:

RDF Primer
http://www.w3.org/TR/rdf-syntax-grammar/
http://www.w3.org/TR/REC-rdf-syntax/
http://www.w3.org/TR/rdf-concepts/
http://www.w3.org/TR/rdf-mt/
Embedding RDF in XHTML: RDF/A Task Force, RDFa Wiki, RDF in HTML: Approaches

http://gearon.blogspot.com/

Where not to use RDF:

Highly granular data (like absolute expression-level changes for microarrays) might not be appropriate for conversion into RDF, because conversion inflates the size of the dataset in circumstances where:

the dataset is generally going to be used as a whole anyway

there are completely adequate parsers for existing file-formats

the benefit of being able to reason over an RDF representation of the data is limited, or absent
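
The size growth can be illustrated with a minimal sketch that serializes the same made-up expression matrix once as CSV and once as N-Triples. The base URI, the predicate names and the one-node-per-cell modelling are assumptions chosen for illustration, not an established microarray vocabulary.

```python
import csv
import io

# Hypothetical base URI and predicate names, used only for illustration.
BASE = "http://example.org/microarray/"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def matrix_to_csv(genes, samples, values):
    """Serialize the matrix as CSV: one row per gene, one column per sample."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["gene"] + samples)
    for gene, row in zip(genes, values):
        writer.writerow([gene] + row)
    return buf.getvalue()


def matrix_to_ntriples(genes, samples, values):
    """Serialize the same matrix as N-Triples.

    Each cell becomes a measurement node described by three triples
    (its gene, its sample and its value).
    """
    lines = []
    for i, gene in enumerate(genes):
        for j, sample in enumerate(samples):
            node = f"<{BASE}measurement/{i}_{j}>"
            lines.append(f"{node} <{BASE}gene> <{BASE}gene/{gene}> .")
            lines.append(f"{node} <{BASE}sample> <{BASE}sample/{sample}> .")
            lines.append(f'{node} <{BASE}value> "{values[i][j]}"^^<{XSD_DOUBLE}> .')
    return "\n".join(lines) + "\n"


# Made-up data: 100 genes measured in 10 samples.
genes = [f"g{i}" for i in range(100)]
samples = [f"s{j}" for j in range(10)]
values = [[round(0.1 * (i + j), 2) for j in range(10)] for i in range(100)]

csv_size = len(matrix_to_csv(genes, samples, values).encode("utf-8"))
nt_size = len(matrix_to_ntriples(genes, samples, values).encode("utf-8"))
print(f"CSV: {csv_size} bytes, N-Triples: {nt_size} bytes, ratio: {nt_size / csv_size:.1f}x")
```

The exact ratio depends on URI length and on how a measurement is modelled, so the printed figure is only indicative.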

Implementation Bootcamp

Mapping data that has a natural horizontal representation (records in a table) into a vertical representation (triples) makes sense only if all of the following are true:

Many heterogeneous objects of similar classes need to be stored in the database.
These classes may share some properties, but the proportion of shared properties is low. That is, if objects of all these classes were put into one table, a large proportion of the table cells would be NULL.
It is not known which classes and properties will appear in the future (but it is certain that some will).
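
A minimal sketch of the horizontal-to-vertical mapping described above. The records, the `id` key column and the `http://example.org/` URIs are made up for illustration. NULL cells (`None`) simply produce no triple, which is where the vertical form pays off for sparse, heterogeneous data.

```python
def records_to_triples(records, key="id", base="http://example.org/"):
    """Turn table-like records (dicts) into (subject, predicate, object) triples.

    Cells holding None (SQL NULL) produce no triple at all.
    """
    triples = []
    for record in records:
        subject = f"{base}item/{record[key]}"
        for column, value in record.items():
            if column == key or value is None:
                continue
            triples.append((subject, f"{base}prop/{column}", value))
    return triples


# One wide table holding three different kinds of objects (illustrative values).
records = [
    {"id": 1, "type": "Gene", "symbol": "ABC1", "chromosome": "17", "mass": None, "title": None},
    {"id": 2, "type": "Protein", "symbol": None, "chromosome": None, "mass": 12345, "title": None},
    {"id": 3, "type": "Article", "symbol": None, "chromosome": None, "mass": None, "title": "A study"},
]

triples = records_to_triples(records)
cells = sum(len(r) - 1 for r in records)  # non-key cells in the wide table
print(f"{cells} table cells, {len(triples)} triples")
```

A new property appearing later only means new triples with a new predicate; no schema change is needed.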

Other Triple Formats

N3 (Notation 3) – a compact and readable alternative to RDF's XML syntax.
N-Triples – a line-based, plain text format for encoding an RDF graph. It was designed to be a fixed subset of N3.
Turtle (Terse RDF Triple Language) – an extension of N-Triples carefully taking the most useful and appropriate things added from N3. Turtle is intended to be compatible with, and a subset of, N3.
TriG – plain text format for serializing Named Graphs and RDF Datasets (extension of Turtle).
TriX – an experimental alternative serialization for expressing RDF triples in XML, which aims to provide a highly normalized, consistent XML representation for RDF graphs.
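
To make the line-based nature of N-Triples concrete, here is a minimal serializer sketch. The `IRI` and `Literal` classes are hypothetical helpers written for this example, not part of any library. The escaping covers only backslash, double quote, newline and carriage return, and blank nodes are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class Literal:
    value: str
    lang: Optional[str] = None
    datatype: Optional[IRI] = None


def escape(text):
    """Escape the characters that may not appear raw inside an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def term_to_nt(term):
    """Format a single term: <iri>, "literal", "literal"@lang or "literal"^^<datatype>."""
    if isinstance(term, IRI):
        return f"<{term.value}>"
    quoted = f'"{escape(term.value)}"'
    if term.lang:
        return f"{quoted}@{term.lang}"
    if term.datatype:
        return f"{quoted}^^<{term.datatype.value}>"
    return quoted


def to_ntriples(triples):
    """One triple per line, terms separated by spaces, each line ending in ' .'."""
    return "".join(
        f"{term_to_nt(s)} {term_to_nt(p)} {term_to_nt(o)} .\n" for s, p, o in triples
    )


example = [
    (
        IRI("http://example.org/a"),
        IRI("http://example.org/label"),
        Literal('say "hi"', lang="en"),
    ),
]
print(to_ntriples(example), end="")
```

Turtle and N3 express the same triples more compactly by adding prefixes and by grouping statements that share a subject.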

RDF Storage Engines / Libraries

Sesame Triple Store
Jena¹⁾ is a Java framework that provides a programmatic environment for RDF, RDFS, OWL and SPARQL, and includes a rule-based inference engine. It supports reading and writing RDF in RDF/XML, N3 and N-Triples and provides in-memory and persistent storage implementations.
Mulgara is a scalable RDF database (triplestore) written entirely in Java, and a fork of the original Kowari project. It can be queried via the iTQL and SPARQL query languages.
JRDF²⁾ is an attempt to create a standard set of APIs and base implementations for RDF. It includes a graph API (e.g. graph comparison, manipulating graph objects), IoC support, RDF datatypes and query handling (SPARQL support). It does not currently provide support for OWL.
4store is an efficient, scalable and stable RDF database³⁾.
Redland RDF Libraries is a set of free software C libraries that provide support for RDF.
OpenLink Virtuoso is a universal server that implements Web, file, and database server functionality alongside native XML storage and universal data access middleware in a single server solution. It includes support for key Internet, Web, and data access standards such as XML, XPATH, XSLT, SOAP, WSDL, UDDI, WebDAV, SMTP, SQL, ODBC, JDBC, and OLE-DB. It has native connectors to the following frameworks: Jena, Sesame and Redland.
Aperture is an open source Java framework for extracting full-text content and metadata from various information systems (e.g. file systems, web sites, mail boxes) and the file formats (e.g. documents, images) occurring in these systems. Data exchange is based on Semantic Web standards (e.g. RDF).
AllegroGraph is a database and application framework for building Semantic Web applications. It provides RDFS reasoning as well as SPARQL and Sesame 2.0 HTTP interfaces.
Bigdata is a high-performance RDF store supporting RDFS and OWL Lite inference. The Bigdata Cluster Setup Guide contains notes on how to optimize Linux nodes to build a cluster.

RDF Mapping

RDF/XML → XSL/Fresnel lens/JSON.
Sesame Elmo POJO mapping tool module.
Java Annotations & the Semantic Web by Henry Story describes the possibilities of RDF-to-Java mapping.

Extending Relational Databases to Support Semantic Web Queries (Zhengxiang Pan) (online)
Storage and Querying of E-Commerce Data (Rakesh Agrawal) [2001] – compares horizontal, vertical, and binary (one table per attribute, i.e. per predicate) representations of the data (online)
A comparison of different approaches to mapping RDF to SQL can be found in Mapping Semantic Web Data with RDBMSes.
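
A minimal sketch of two of these layouts, using SQLite from the Python standard library. The table names, the prefixed names (`ex:`, `rdf:`) and the data are made up for illustration and are not taken from the papers listed above. It runs the same two-hop query against a vertical triples table and against binary per-predicate tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Vertical layout: one three-column table holds every statement.
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:p53", "rdf:type", "ex:Protein"),
        ("ex:p53", "ex:encodedBy", "ex:TP53"),
        ("ex:TP53", "rdf:type", "ex:Gene"),
        ("ex:TP53", "ex:chromosome", "17"),
        ("ex:mdm2", "rdf:type", "ex:Protein"),
    ],
)
conn.execute("CREATE INDEX idx_pos ON triples (p, o, s)")

# "On which chromosome is each protein's gene?" Every extra triple pattern
# in the query costs one more self-join on the triples table.
rows = conn.execute(
    """
    SELECT t1.s, t3.o
    FROM triples t1
    JOIN triples t2 ON t2.s = t1.s AND t2.p = 'ex:encodedBy'
    JOIN triples t3 ON t3.s = t2.o AND t3.p = 'ex:chromosome'
    WHERE t1.p = 'rdf:type' AND t1.o = 'ex:Protein'
    """
).fetchall()

# Binary layout: one two-column table per predicate.
predicates = [r[0] for r in conn.execute("SELECT DISTINCT p FROM triples ORDER BY p")]
table_for = {p: f"pred_{i}" for i, p in enumerate(predicates)}
for predicate, table in table_for.items():
    conn.execute(f"CREATE TABLE {table} (s TEXT, o TEXT)")
    conn.execute(f"INSERT INTO {table} SELECT s, o FROM triples WHERE p = ?", (predicate,))

type_table = table_for["rdf:type"]
encoded_table = table_for["ex:encodedBy"]
chromosome_table = table_for["ex:chromosome"]
binary_rows = conn.execute(
    f"""
    SELECT ty.s, ch.o
    FROM {type_table} ty
    JOIN {encoded_table} en ON en.s = ty.s
    JOIN {chromosome_table} ch ON ch.s = en.o
    WHERE ty.o = 'ex:Protein'
    """
).fetchall()

print(rows, binary_rows)
```

The binary layout joins smaller tables, but every new predicate needs a new table. The vertical layout absorbs new predicates without any schema change.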

Hadoop MapReduce

RDFS/OWL reasoning using the MapReduce framework (Jacopo Urbani) [2009] (online, short article) – The introduction describes very well the basic principles of the Semantic Web, the relation between RDF, RDFS and OWL, as well as the different OWL fragments (Full, DL, Lite and Horst) and the reasoning problems for them. It also gives a very good background on the Hadoop programming model.
Parallel Inferencing for OWL Knowledge Bases (Ramakrishna Soma) [2008] – provides the algorithms for data and rule partitioning approaches.
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce (Mohammad Farhan Husain) [2009]
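
To give a feel for how an RDFS rule fits the MapReduce model, here is a toy, in-process simulation. It is neither Hadoop code nor the algorithm from any of the papers above. It applies the RDFS subclass rule (if x has type C and C is a subclass of D, then x has type D) as a join on the class name, and repeats the job until nothing new is derived. All function names are made up for this sketch.

```python
from collections import defaultdict

TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def map_phase(triples):
    """Emit (class, tagged value) pairs so that the instances and the
    superclasses of one class end up in the same reducer."""
    for s, p, o in triples:
        if p == TYPE:
            yield o, ("instance", s)
        elif p == SUBCLASS:
            yield s, ("super", o)


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Join instances of the key class with its direct superclasses."""
    instances = [v for tag, v in values if tag == "instance"]
    supers = [v for tag, v in values if tag == "super"]
    for instance in instances:
        for superclass in supers:
            yield (instance, TYPE, superclass)


def run_job(triples):
    derived = set()
    for key, values in shuffle(map_phase(triples)).items():
        derived.update(reduce_phase(key, values))
    return derived


def rdfs_subclass_closure(triples):
    """Re-run the job until a pass derives no new triples (fixpoint)."""
    triples = set(triples)
    while True:
        new = run_job(triples) - triples
        if not new:
            return triples
        triples |= new


data = [
    ("ex:p53", TYPE, "ex:Protein"),
    ("ex:Protein", SUBCLASS, "ex:Macromolecule"),
    ("ex:Macromolecule", SUBCLASS, "ex:Molecule"),
]
closure = rdfs_subclass_closure(data)
for triple in sorted(closure - set(data)):
    print(triple)
```

Each level of the class hierarchy costs one more pass in this naive form, which is the kind of overhead that dedicated approaches aim to reduce.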

Benchmarks

RDF Store Benchmarking provides some testing material, and Large Triple Stores lists the stores that scale well.
Berlin SPARQL Benchmark Results - 09/17/2008
RDF Triple Stores Evaluations (online)
An Evaluation of Triple-Store Technologies for Large Data Stores (Kurt Rohloff) [2007] (online)
An Evaluation of Knowledge Base Systems for Large OWL Datasets (Yuanbo Guo) [2004] (online)
Scalability Report on Triple Store Applications
Triple store load testing
Triple store scalability
Large triple stores
RDF store benchmarking
(outdated) Scalability and Storage: Survey of Free Software / Open Source RDF storage systems
(outdated) Survey of RDF/Triple Data Stores

SPARQL

To read:

Available endpoints

Biogateway – an integrated system offering an interface (via SPARQL) to the entire set of the OBO foundry candidate ontologies, the whole set of GOA files, SwissProt, the NCBI taxonomy as well as in-house ontologies.
Cell Cycle Ontology (CCO) extends existing ontologies for cell cycle knowledge. CCO integrates and manages knowledge about the cell cycle components and regulatory aspects in OBO, OWL, RDF and other commonly used ontology representations. This knowledge is assembled from a diverse set of already existing resources (GO, UniProt, IntAct, GOA, NCBI taxonomy, and so forth): the combination of the knowledge gives an overall picture of the cell division process.
Linked Life Data – search and explore over 5 billion triples from various sources including UniProt, PubMed, EntrezGene and more.
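
Endpoints like these are typically queried over HTTP. The sketch below builds, but does not send, a GET request following the SPARQL protocol: the query text goes into the `query` parameter, and the `Accept` header asks for JSON results. The endpoint URL is a placeholder, not the address of any of the services listed above.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint URL; substitute the address of a real SPARQL endpoint.
ENDPOINT = "http://example.org/sparql"

QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def build_sparql_request(endpoint, query):
    """Build (without sending) an HTTP GET request for a SPARQL SELECT query."""
    params = urlencode({"query": query})
    return Request(
        f"{endpoint}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)

# To run the query against a live endpoint:
#   import json, urllib.request
#   with urllib.request.urlopen(request) as response:
#       results = json.load(response)
#   for row in results["results"]["bindings"]:
#       print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```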

Federation:

OWL

OWL is based on the description logic (DL) formalism. It provides a rich set of data modeling constructs such as classes, class hierarchies and property hierarchies. Such features are used to define data schemata or ontologies for a domain, which describe entities in the domain, their properties and relationships, and constraints between them.

OWLIM is a scalable semantic repository which has full RDFS and limited OWL Lite support. It is available as SAIL⁴⁾ for Sesame.

Pellet is an open source reasoner for OWL 2 DL in Java which provides standard and cutting-edge reasoning services for OWL ontologies. Free for non-commercial use.
RACER stands for Renamed ABox and Concept Expression Reasoner. RacerPro can process OWL Lite as well as OWL DL documents (knowledge bases) with some restrictions. An implementation of SWRL is provided. Commercial.
OntoBroker is scalable Semantic Web middleware that supports OWL, RDF, RDFS, SPARQL and F-logic. It provides a Java API for programmatic management of OWL DL and SWRL ontologies, an inference engine, and conjunctive query answering using SPARQL. Commercial.
Oracle RDF management platform. Features of Oracle Spatial 11g Option for Oracle Database 11g Enterprise Edition (requires Partitioning and Advanced Compression options):
- An RDF Data Model with inferencing (RDFS, OWL DL and user-defined rules)
- Performs SQL-based access to triples and inferred data, combines SQL query of relational data with RDF graphs and ontologies
- SPARQL-like queries⁵⁾
- Jena plug-in for Oracle can be used which includes a full SPARQL API
- SKOS inference support
- See also: A Scalable RDBMS-Based Inference Engine for RDFS/OWL, Oracle Database 11g Semantics Technical Talk, Oracle Semantic Technologies Inference Best Practices with RDFS/OWL

Converting Natural Language to RDF:

AquaLog is a portable question-answering system which takes queries expressed in natural language and an ontology as input and returns answers drawn from one or more knowledge bases, which instantiate the input ontology with domain-specific information.
ThoughtTreasure is a commonsense knowledge base and architecture for natural language processing.
ACE View is an ontology and rule editor that uses Attempto Controlled English (ACE) in order to create, view and edit OWL 2 ontologies and SWRL rulesets (project page, Protege page).
A SKOS analyzer module for Apache Lucene and Solr

Online tools:

WonderWeb OWL Ontology Validator

To read:

SKOS

SKOS on W3C ⁶⁾
- SKOS Primer provides introductory examples and guidance in the use of the SKOS vocabulary.
- SKOS Core Guide
JAVA SKOS API
SKOS tools
- Protege
SKOS Core - Simple Knowledge Organisation for the Web (Alistair Miles) [2005] (online)
Concept Web – a dynamic, interactive fabric of concepts and their relationships. The Concept Web is constructed from research literature, Internet databases and other web sites together with off-line resources. The aim of creating the Concept Web is to remove both redundancy and ambiguity from available knowledge in order to help deal with information overload, to semantically “connect” concepts, and so to maximize the potential for knowledge discovery.

To read:

http://www.w3.org/TR/skos-reference/

Semantic desktop

Gnome 3.0: Get rid of the file hierarchy

Vocabulary

reification – a form of RDF in which any RDF statement can itself be the subject or object of a triple. This means graphs can be nested as well as chained. On the Web this allows us, for example, to express doubt or support for statements created by other people. A description of an RDF statement using the RDF reification vocabulary is called a reification of the statement. Examples are given here and here. A SeRQL example for Sesame is here.
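
A minimal sketch of the reification vocabulary in use. The statement being described, the statement identifier and the `doubts` predicate are made up for illustration. Only `rdf:Statement`, `rdf:subject`, `rdf:predicate` and `rdf:object` come from the RDF vocabulary.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/"  # hypothetical namespace for the example


def reify(statement, statement_id):
    """Describe a triple with the four triples of the RDF reification vocabulary."""
    s, p, o = statement
    return [
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", s),
        (statement_id, RDF + "predicate", p),
        (statement_id, RDF + "object", o),
    ]


claim = (EX + "geneX", EX + "regulates", EX + "geneY")
statement_id = EX + "statements/1"

# The reification describes the claim; it does not assert the claim itself.
graph = reify(claim, statement_id)

# The statement now has an identifier, so it can be the object of another triple.
graph.append((EX + "people/alice", EX + "doubts", statement_id))

for triple in graph:
    print(triple)
```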

query statements inferencer – the ability of the query processor to intercept and preprocess new statements as needed to enable support for data semantics (e.g. RDF), which in Sesame is implemented in the SAIL⁷⁾.

ontology reasoner – basically checks that an ontology makes sense (consistency checking, concept satisfiability). A reasoner creates an entailment of the RDF graph.

The terms 'class' and 'subclass' also appear in the context of XML Schemas, and more generally in object-oriented programming. There is an analogy between the use of the terms in those contexts and in this one, but it is a loose analogy: the use of types in the XSchema and O-O contexts is broadly to constrain behaviour and help identify errors, whereas the corresponding assertions in the context of RDF allow a reasoner to deduce a larger volume of implicit information. In particular, RDF schemata do not function as constraints, and mistakes made when defining concepts in an ontology, or when asserting information about resources, do not manifest themselves as 'schema violations', but instead more indirectly, when a reasoner finds it is able to deduce contradictory information, for example being able to prove that some resource urn:example#X is simultaneously a Person and not a Person⁸⁾.

Reasoning can be performed either when the data is loaded into the knowledge base or when a query is issued. Knowledge bases of the former class, which perform reasoning when data is loaded, are called materialized knowledge bases. Materialized knowledge bases trade off space and increased loading time for shorter query times. This approach is suited to application domains where data is added much less frequently than queries are issued. Examples of such applications are data warehouses and (for the most part) web search. Moreover, since the worst case for OWL reasoning is exponential in time and memory, this approach is often considered a good way to store and query OWL knowledge bases. Most reasoning engines for OWL are implemented using either tableau algorithms or rule-based/logic-programming-based engines. OWL reasoners implemented using rule-based engines have been suggested as a practical alternative to the more correct and complete tableau algorithms. In rule-based reasoners, the OWL ontology definitions are first compiled into a set of rules, which are then applied to the presented data set to create the new inferred triples. The main advantages of this class of reasoners are that they are well studied and many robust implementations exist. The disadvantage is that only a subset of the OWL specification can be implemented using them. Many popular open source (Jena) and commercial (OWLIM, Oracle) OWL toolkits are implemented using rule-based reasoners.⁹⁾
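
The load-time versus query-time trade-off can be sketched with two toy stores that answer the same question. Both class names and the single subclass rule are assumptions for illustration and do not reflect how any of the toolkits mentioned above is implemented.

```python
def ancestors(cls, superclasses):
    """All direct and indirect superclasses of cls."""
    seen, stack = set(), [cls]
    while stack:
        for parent in superclasses.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


class MaterializedStore:
    """Reasons at load time: inferred types are stored next to asserted ones."""

    def __init__(self, superclasses):
        self.superclasses = superclasses
        self.types = set()  # (instance, class) pairs, asserted and inferred

    def load(self, instance, cls):
        self.types.add((instance, cls))
        for ancestor in ancestors(cls, self.superclasses):
            self.types.add((instance, ancestor))

    def instances_of(self, cls):
        # A plain lookup; no reasoning happens here.
        return {i for i, c in self.types if c == cls}


class QueryTimeStore:
    """Stores only asserted types and reasons whenever a query is issued."""

    def __init__(self, superclasses):
        self.superclasses = superclasses
        self.types = set()  # (instance, class) pairs, asserted only

    def load(self, instance, cls):
        self.types.add((instance, cls))

    def instances_of(self, cls):
        return {
            i
            for i, c in self.types
            if c == cls or cls in ancestors(c, self.superclasses)
        }


# Made-up class hierarchy: class -> direct superclasses.
hierarchy = {"Protein": ["Macromolecule"], "Macromolecule": ["Molecule"]}
materialized = MaterializedStore(hierarchy)
lazy = QueryTimeStore(hierarchy)
for store in (materialized, lazy):
    store.load("p53", "Protein")
    store.load("water", "Molecule")

print(sorted(materialized.instances_of("Molecule")), len(materialized.types))
print(sorted(lazy.instances_of("Molecule")), len(lazy.types))
```

Both stores give the same answer. The materialized one holds more pairs and does more work per load, while the other repeats the hierarchy walk on every query.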

entailment – the process of extending the RDF graph to its transitive closure by applying DL rules through unification/resolution. The process of deriving new information is sometimes called reasoning.


¹⁾ See also Jena (framework)

²⁾ See also JRDF (framework)

³⁾ See release of this triple store under GNU GPL

⁴⁾, ⁷⁾ SAIL stands for Storage And Inference Layer

⁵⁾ SPARQL-like capability is not full SPARQL because the standard wasn't finalized at the time of Oracle Database 11g release. SPARQL support in the database is planned for the next major release.

⁶⁾ See also Simple Knowledge Organization System

⁸⁾ This text was taken from here

⁹⁾ Quoted from Parallel Inferencing for OWL Knowledge Bases