Tutorial objectives and description

Topic

Data science is in the spotlight today because data analysis plays a fundamental role in scientific studies and business decisions. However, it has been estimated that about 80% of the effort in a data science project goes into data preparation. One data preparation task common to many industry-driven and science-driven application domains is data enrichment: extending a main dataset, which describes a phenomenon of interest (usually by means of an initial set of variables), with information from third-party data sources that contain additional variables to feed an analytical model. With the enriched data, a data scientist can analyze the phenomenon under observation along a larger number of dimensions or train a predictive model on a richer set of features.
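As a minimal sketch of what enrichment means in practice, the Python snippet below extends a toy main dataset with variables from a second source by joining on shared keys. The records, the field names (`temp_c`, `rain_mm`) and the `enrich` helper are all hypothetical and chosen only for illustration; they are not taken from the tutorial material or its tools.

```python
from typing import Dict, List, Tuple

# Hypothetical main dataset: ad impressions per city and day.
impressions: List[Dict] = [
    {"city": "Milan", "date": "2019-05-01", "impressions": 1200},
    {"city": "Oslo", "date": "2019-05-01", "impressions": 800},
    {"city": "Ljubljana", "date": "2019-05-01", "impressions": 450},
]

# Hypothetical third-party source, keyed by (city, date).
weather: Dict[Tuple[str, str], Dict] = {
    ("Milan", "2019-05-01"): {"temp_c": 21.0, "rain_mm": 0.0},
    ("Oslo", "2019-05-01"): {"temp_c": 9.5, "rain_mm": 4.2},
}


def enrich(rows, source, key_fields, new_fields):
    """Left-join rows with source; unmatched rows get None for the new fields."""
    enriched = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        extra = source.get(key, {})
        enriched.append({**row, **{f: extra.get(f) for f in new_fields}})
    return enriched


enriched = enrich(impressions, weather, ["city", "date"], ["temp_c", "rain_mm"])
for row in enriched:
    print(row)
```

Every original row is kept, and rows with no match in the external source carry explicit missing values. In real enrichment tasks the hard part is usually that the two sources do not share clean keys, which is where reconciliation comes in.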

On the one hand, data enrichment for data science opens new high-impact application scenarios for exploiting the vast amount of information available in the Linked Open Data (LOD) cloud, either as a direct source of additional information or as a bridge for fetching data from sources outside the LOD cloud. On the other hand, data enrichment poses several challenges for semantic data integration methodologies: it recasts known problems from a different perspective, e.g., taxonomy matching, and raises genuinely new ones, e.g., interactive data reconciliation at scale.

Objectives

This tutorial has three main learning objectives:

1. To provide an in-depth understanding of the role that semantics play in data enrichment for data science.
2. To review advantages and limitations of tools for semantic enrichment available today and of research work relevant to this specific task.
3. To provide a practical dive into the creation of semantic data enrichment transformations with Grafterizer and ASIA, two integrated tools that support the interactive specification of these transformations and their scalable execution on large datasets, as well as into the usage of the enriched data to train predictive models with the QMiner tool.

For the first objective, we will use examples and use cases from data science projects developed in two industry-driven H2020 innovation projects, EW-Shopp and euBusinessGraph, which target the development of innovative data-driven services in domains such as digital marketing, eCommerce, and business reporting.

For the second objective, we will discuss why approaches based on ad-hoc coding and other semantics-agnostic data transformation technologies fall short of delivering data-scientist-friendly solutions. In addition, we will discuss research work and prototypes from the Semantic Web community that address some of the challenges a data-scientist-friendly solution should solve, together with their current limitations (e.g., scalability and support for interaction).

For the third objective, we will run a hands-on session to enrich (anonymized) data from a digital marketing agency that wants to estimate the impact of weather on the impressions of the ads it manages in its campaigns. We will show how to enrich the source datasets with spatial coordinates extracted from GeoNames, our "bridge", and then use those coordinates to fetch data from a weather API. Finally, we will show how to use the enriched data to train a predictive model that estimates the impact of weather on campaign performance.
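The shape of the hands-on pipeline (reconcile place names, attach coordinates, fetch weather, fit a model) can be sketched in plain Python. This is only an illustration under loud assumptions: it does not use Grafterizer, ASIA, or QMiner; the `GAZETTEER` dictionary stands in for GeoNames, the `WEATHER` dictionary stands in for a weather API, reconciliation is approximated with fuzzy string matching from the standard `difflib` module, the predictive model is a one-variable least-squares line, and all data values are made up.

```python
import difflib

# Hypothetical gazetteer standing in for GeoNames: canonical name -> (lat, lon).
GAZETTEER = {
    "Milan": (45.46, 9.19),
    "Oslo": (59.91, 10.75),
    "Ljubljana": (46.05, 14.51),
}

# Hypothetical table standing in for a weather API: (coords, date) -> temperature in Celsius.
WEATHER = {
    ((45.46, 9.19), "2019-05-01"): 21.0,
    ((45.46, 9.19), "2019-05-02"): 14.0,
    ((59.91, 10.75), "2019-05-01"): 9.0,
    ((59.91, 10.75), "2019-05-02"): 12.0,
}

_BY_LOWER = {name.lower(): name for name in GAZETTEER}


def reconcile(raw_name, cutoff=0.8):
    """Map a messy place name to a gazetteer entry, or None if nothing is close enough."""
    hits = difflib.get_close_matches(
        raw_name.strip().lower(), list(_BY_LOWER), n=1, cutoff=cutoff
    )
    return _BY_LOWER[hits[0]] if hits else None


def enrich_with_weather(rows):
    """Attach coordinates and temperature; drop rows that cannot be resolved."""
    out = []
    for row in rows:
        place = reconcile(row["city"])
        if place is None:
            continue
        coords = GAZETTEER[place]
        temp = WEATHER.get((coords, row["date"]))
        if temp is not None:
            out.append({**row, "place": place, "coords": coords, "temp_c": temp})
    return out


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Toy campaign data with inconsistent city spellings (all values are invented).
rows = [
    {"city": "Milano", "date": "2019-05-01", "impressions": 1200},
    {"city": "Milano", "date": "2019-05-02", "impressions": 850},
    {"city": "oslo", "date": "2019-05-01", "impressions": 600},
    {"city": "oslo", "date": "2019-05-02", "impressions": 750},
    {"city": "Atlantis", "date": "2019-05-01", "impressions": 10},
]

enriched = enrich_with_weather(rows)
slope, intercept = fit_line(
    [r["temp_c"] for r in enriched], [r["impressions"] for r in enriched]
)
print(f"impressions ~ {slope:.1f} * temp_c + {intercept:.1f}")
```

The design point this sketch tries to convey is the "bridge" role of the gazetteer: the campaign data and the weather source share no key, so place names are first reconciled to canonical entities, whose coordinates then serve as the key for the weather lookup. A real setting would need interactive review of uncertain matches and execution at a scale that an in-memory loop like this does not address.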

Semantic Data Enrichment for Data Scientists

Program

Data enrichment for data science (45 min)

State-of-the-art: semantic data enrichment for data science (45 min)

A toolkit for user-driven data enrichment at scale endowed with robust analytical modeling (30 min)

Hands-on session: digital marketing data enrichment with weather data and analytical modelling (60 min)

Intended audience

Speakers

Matteo Palmonari PhD

Dumitru Roman PhD

Vincenzo Cutrona

Nikolay Nikolov

Aljaž Košmerlj PhD

Media

Resources (Slides & Hands-On Session)