Semantic Data Enrichment for Data Scientists

Tutorial website for the ESWC2019 tutorial on semantic data enrichment for data scientists.

Topic

Data science is in the spotlight today because data analyses play a fundamental role in scientific studies and business decisions. However, it has been estimated that about 80% of the effort in a data science project is spent on data preparation. A data preparation task common to many industry-driven and science-driven application domains is data enrichment: the task of extending one main dataset, which describes a phenomenon of interest (usually by means of an initial set of variables), with information from third-party data sources that contain additional variables to feed an analytical model. With the enriched data, a data scientist can analyze the phenomenon under observation along a larger number of dimensions or train a predictive model on a richer set of features.
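
As a minimal sketch of what enrichment means at the data level, the Python snippet below extends a main dataset with one variable taken from a third-party source, matching rows on a shared key. The function `enrich`, the column names, and all values are hypothetical and chosen only for illustration; they are not taken from the tutorial material.

```python
from typing import Any, Dict, List


def enrich(
    main_rows: List[Dict[str, Any]],
    extra_by_key: Dict[Any, Dict[str, Any]],
    key: str,
    extra_columns: List[str],
) -> List[Dict[str, Any]]:
    """Left-join style enrichment: every main row is kept, and each
    requested extra column is added (None when no match is found)."""
    enriched = []
    for row in main_rows:
        extra = extra_by_key.get(row[key], {})
        new_row = dict(row)  # copy, so the main dataset is not modified
        for col in extra_columns:
            new_row[col] = extra.get(col)
        enriched.append(new_row)
    return enriched


# Hypothetical main dataset: the phenomenon of interest.
campaigns = [
    {"city": "Milan", "impressions": 1200},
    {"city": "Oslo", "impressions": 800},
]

# Hypothetical third-party source with one additional variable.
population = {"Milan": {"population": 1350000}}

enriched = enrich(campaigns, population, "city", ["population"])
for row in enriched:
    print(row)
```

Rows without a match keep a `None` value instead of being dropped, so the enriched dataset always has the same number of rows as the main one.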

On the one hand, data enrichment for data science provides new high-impact application scenarios for exploiting the vast amount of information available in the Linked Open Data (LOD) cloud, either as a direct source of additional information or as a bridge for fetching data from information sources outside the LOD cloud. On the other hand, data enrichment poses several challenges for semantic data integration methodologies, casting known problems in a new light (e.g., taxonomy matching) and raising genuinely new ones (e.g., interactive data reconciliation at scale).

Objectives

This tutorial has three main learning objectives:

  1. To provide an in-depth understanding of the role that semantics play in data enrichment for data science.
  2. To review advantages and limitations of tools for semantic enrichment available today and of research work relevant to this specific task.
  3. To provide a practical dive into the creation of semantic data enrichment transformations with Grafterizer and ASIA, two integrated tools that support the interactive specification of these transformations and their scalable execution on large datasets, as well as into the usage of the enriched data to train predictive models with the QMiner tool.

For objective 1, we will use examples and use cases from data science projects developed in two industry-driven H2020 innovation projects, EW-Shopp and euBusinessGraph, which target the development of innovative data-driven services in domains such as digital marketing, eCommerce, and business reporting. For objective 2, we will discuss the limitations of approaches based on ad-hoc coding and other semantics-agnostic data transformation technologies in delivering data-scientist-friendly solutions. In addition, we will discuss research work and prototypes developed in the Semantic Web community that address some of the challenges a data-scientist-friendly solution should solve, along with their current limitations (e.g., scalability and support for interaction). For objective 3, we will run a hands-on session to enrich (anonymized) data from a digital marketing agency that wants to estimate the impact of weather on the impressions of the ads it manages in its campaigns. We will show how to enrich the source datasets with spatial coordinates extracted from Geonames, our "bridge", and then use those coordinates to fetch data from a weather API. Finally, we will show how to use the enriched data to train a predictive model that estimates the impact of weather on campaign performance.
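
The hands-on scenario can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the dictionaries `GAZETTEER` and `WEATHER` are hypothetical in-memory stand-ins for Geonames and for a weather API, approximate string matching from the standard library stands in for the reconciliation done in the tutorial tools, and a one-variable least-squares fit stands in for the model trained with QMiner. All names and values are made up.

```python
import difflib

# Hypothetical stand-in for Geonames: place name -> (latitude, longitude).
GAZETTEER = {
    "Milano": (45.46, 9.19),
    "Torino": (45.07, 7.69),
    "Roma": (41.89, 12.51),
}

# Hypothetical stand-in for a weather API: (lat, lon, date) -> temperature.
WEATHER = {
    (45.46, 9.19, "2019-03-01"): 10.0,
    (45.07, 7.69, "2019-03-01"): 8.0,
    (41.89, 12.51, "2019-03-01"): 16.0,
}


def reconcile(name):
    """Map a free-text place name to coordinates via approximate matching."""
    matches = difflib.get_close_matches(name, list(GAZETTEER), n=1, cutoff=0.6)
    return GAZETTEER[matches[0]] if matches else None


def enrich_with_weather(rows):
    """Add coordinates (the 'bridge') and then the temperature to each row."""
    out = []
    for row in rows:
        coords = reconcile(row["city"])
        temp = WEATHER.get((coords[0], coords[1], row["date"])) if coords else None
        out.append({**row, "coords": coords, "temperature": temp})
    return out


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical source data with place names as they might appear in a log.
ads = [
    {"city": "Milan", "date": "2019-03-01", "impressions": 1000},
    {"city": "Torino", "date": "2019-03-01", "impressions": 800},
    {"city": "Roma", "date": "2019-03-01", "impressions": 1600},
]

enriched_ads = enrich_with_weather(ads)
usable = [r for r in enriched_ads if r["temperature"] is not None]
slope, intercept = fit_line(
    [r["temperature"] for r in usable],
    [r["impressions"] for r in usable],
)
print(slope, intercept)
```

The point of the sketch is the two-hop structure: the source data carries no key that the weather source understands, so the coordinates obtained through reconciliation act as the join key for the second lookup.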

Program

This is a half-day tutorial of approximately three hours, including presentations and a hands-on session. The proposed draft schedule is as follows:

  • Data enrichment for data science 45 min

    Topics: challenges; semantic technologies as key enablers; analytics with enriched data: use cases from industry-driven data science projects.

  • State-of-the-art: semantic data enrichment for data science 45 min

    Topics: semantic table annotation tools and techniques; doing it with OpenRefine: lessons learned and limitations; data enrichment with batch pipelines: lessons learned and limitations; user-driven data enrichment at scale: the Grafterizer/ASIA approach; QMiner and data analytics on top of enriched data.

  • A toolkit for user-driven data enrichment at scale endowed with robust analytical modeling 30 min

    Topics: the Grafterizer/ASIA data transformation tool: cleaning, annotating and enriching data; QMiner: data analytics on top of enriched data.

  • Hands-on session: digital marketing data enrichment with weather data and analytical modeling 60 min

    Environment set up; design of the data transformation workflows with Grafterizer/ASIA; batch execution of the transformations on a larger data set; training the predictive model with QMiner; deployment of the predictive model.

Intended audience

You should attend the tutorial if you:

  • are a researcher working with Semantic Web Technology (e.g., ontology matching, semantic reconciliation, table annotation) looking to discover new industry-driven, real-world application scenarios for your work,
  • have recently joined the Semantic Web community and want to learn not only about new application scenarios, but also about challenging problems whose solution may have a significant impact both in industry and in other scientific fields, where analyses on top of semantically enriched data may bring added value,
  • are an industry practitioner and want to learn about concrete use cases for using Semantic Web Technology to support data-driven innovation at scale.

Speakers

Matteo Palmonari PhD

Dumitru Roman PhD

Vincenzo Cutrona

Nikolay Nikolov

Aljaž Košmerlj PhD

Media

See our tools in action!

Resources (Slides & Hands-On Session)

Get all the datasets, slides and instructions you will need!
