About the S-PIC4CHU Project
The S-PIC4CHU project is promoted by the Free Univ. of Bozen-Bolzano (UNIBZ, Project Leader), Univ. Roma Tre (ROMA3), Politecnico di Milano (POLIMI), Univ. di Bologna (UNIBO), and Univ. della Calabria (UNICAL). These Research Units (RUs) are characterized by top expertise in complementary research areas and will work in tight collaboration.
Project Supervisors
Diego Calvanese (UNIBZ)
Sergio Greco (UNICAL)
Riccardo Torlone (ROMA3)
Ilaria Bartolini (UNIBO)
Davide Martinenghi (POLIMI)
All the RUs boast a long history of research collaborations and joint project participation, e.g., in the fields of personalization and ranking for structured and multimedia data (UNIBO, POLIMI, ROMA3), semantics-based data integration (UNIBZ, POLIMI, ROMA3), ethics and data quality (ROMA3, POLIMI), graph-based data processing (UNICAL, POLIMI), logic-based query languages (POLIMI, UNICAL, UNIBZ, ROMA3), query relaxation (POLIMI, ROMA3, UNIBO), and privacy-aware data management (ROMA3, UNICAL).
The project is organized in 4 Work Packages (WPs) as follows:
WP1 Models and Architectures for DPPs and STPs
The goal of WP1 is the definition of the basic models, as well as the semantic and system architectures, supporting the development of end-to-end Data Preparation Pipelines (DPPs) in Data Science (DS). This WP is led by ROMA3.
T1.1 Semantic architecture and quality measures (UNIBZ, All)
We will define the models, ontologies, and mapping formalisms used to semantically describe, through a Semantic Transformation Pipeline (STP), the datasets in a DPP. This task also aims at identifying the most appropriate quality measures for both the data and the process.
T1.2 Building blocks for data preparation (ROMA3, All)
We will define a suite of building blocks that implement the activities most used in practice for curation, integration, and validation of data, and that, suitably composed, realize the DPP. Since many of the data preparation steps can hardly be made fully automatic, we will consider the possibility of human intervention.
T1.3 System architecture design (POLIMI, All)
We will design the architecture that implements the pipelines according to T1.1 and T1.2 and provides features enabling quality checks along the two pipelines.
Deliverables
D1.1 [M6] Semantic architecture and quality measures (Report)
D1.2 [M12] Building blocks for data preparation (Report)
D1.3 [M15] System architecture (Report)
WP2 Improving the Quality of Data
The aim of WP2 is to investigate techniques to improve data quality by handling inconsistency and incompleteness, and by performing data reduction, while ensuring fairness of data. This WP is led by UNICAL.
T2.1 Managing data inconsistency and incompleteness
We will develop novel techniques for managing inconsistent and incomplete data in the context of Ontology-Based Data Access (OBDA). These techniques include (a) the development of new methods for missing data imputation, (b) the development of inconsistency-tolerant query answering over inconsistent knowledge, and (c) the specification and analysis of inconsistency and incompleteness management policies. Particular attention will be paid to exploiting knowledge in the form of ontologies, as well as user preferences, to improve missing value replacement and to refine query answers.
T2.2 Managing biased data
We will design a framework that focuses on mining various kinds of data dependencies to find bias in datasets, and we will propose new metrics for evaluating this bias and dealing with it. Special attention shall be paid to the final objective of the analysis since, as already mentioned, modifying a dataset may compromise its original semantics; therefore, the bias mitigation operations should only be applied when the final quality objective is compatible with a (possibly slight) alteration of the original data. For multimedia (MM) data, to which such methodologies are not directly applicable, we will investigate specific methodologies for managing bias. Bias can also arise when using top-k queries to select the best options in a dataset, due to possible data distributions that hide relevant objects (typically, good trade-offs that do not excel in any of the analyzed dimensions). To this end, after characterizing the relationship between top-k and skyline queries, we will design a novel query paradigm aiming to remove such bias by empowering top-k queries with the intrinsic ability of skyline queries to offer an unbiased view of all the best options.
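The bias of top-k queries mentioned in T2.2 can be illustrated with a minimal Python sketch. It is not the query paradigm the project will design: it only contrasts a standard linear-scoring top-k query with a standard skyline query. The dataset `options`, the weights, and all function names are hypothetical and chosen for illustration; higher values are assumed to be better in every dimension.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates (quadratic-time sketch)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def top_k(points: List[Point], weights: Sequence[float], k: int) -> List[Point]:
    """Return the k points with the highest weighted-sum score."""
    def score(p: Point) -> float:
        return sum(w * x for w, x in zip(weights, p))
    return sorted(points, key=score, reverse=True)[:k]


# Hypothetical options rated on two dimensions (assumed data).
options: List[Point] = [
    (0.9, 0.1),    # excels in the first dimension only
    (0.1, 0.9),    # excels in the second dimension only
    (0.45, 0.45),  # balanced trade-off, excels in neither dimension
    (0.3, 0.3),    # dominated by the balanced trade-off
]

best_two = top_k(options, weights=(0.5, 0.5), k=2)
non_dominated = skyline(options)
```

With equal weights, the balanced option `(0.45, 0.45)` scores below both extreme options and is left out of `best_two`, while it does appear in `non_dominated` because no other option dominates it. This is the kind of hidden trade-off the paragraph above refers to.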
T2.3 Data reduction
We will investigate object and feature selection for structured and MM data. We will develop efficient methods, leveraging semantic information, to determine how the parameters of a family of ranking functions affect desired properties of the output. We will develop a framework for semi-automatic selection of the most appropriate features to characterize MM data for our use-cases, also considering real-time analysis of massive streams.
Deliverables
D2.1 [M15] Managing data inconsistency and incompleteness (Report)
D2.2 [M15] Managing biased data (Report)
D2.3 [M18] Data reduction (Report)
WP3 Semantic Enrichment, Provenance, and Explanation
The aim of WP3 is to investigate mechanisms for semantic enrichment and for explanation of both provided and missing query answers, also aiming at better understanding bias in the data and improving its quality. This WP is led by UNIBZ.
T3.1 Semantic enrichment
We will develop techniques for semantic enrichment of data through (semi-)automatic extraction of mappings between data at the sources or in DPP stages and the ontologies in STP stages. We will also study how the building blocks of T1.2 can be abstracted into the semantic transformation steps of the STP.
T3.2 Management of provenance data
We will define methods and techniques for capturing, storing, and querying, within a DS process, fine-grained provenance information collected from the execution of each pipeline operator.
T3.3 Provenance and explanation in the context of OBDA
We will extend the framework of provenance semirings for relational DBs to consider mappings and ontologies, additional types of data sources, and annotations for non-functional requirements. Abstraction mechanisms will be adopted to increase understandability for different types of users. Inconsistent knowledge will also be considered, under different minimality criteria. We will also study metrics for the explanation of ranking functions and how they are affected by F-dominance (see T2.3).
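As background for T3.3, the following minimal Python sketch shows the basic idea of provenance semirings for relational data: each tuple carries an annotation, joins combine annotations with the semiring product, and projections or unions combine them with the semiring sum. It covers only this relational baseline, not the extensions to mappings and ontologies that the task will study. The `Semiring` class, the two example semirings, the helper `tok`, and the relations `r` and `s` are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Sequence, Tuple

Relation = Dict[Tuple, Any]  # maps each tuple to its semiring annotation


@dataclass(frozen=True)
class Semiring:
    zero: Any
    one: Any
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]


# Counting semiring: annotations are multiplicities (bag semantics).
COUNTING = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)

# Why-provenance: annotations are sets of witness sets of source-tuple ids.
WHY = Semiring(
    frozenset(),
    frozenset({frozenset()}),
    lambda a, b: a | b,
    lambda a, b: frozenset(x | y for x in a for y in b),
)


def tok(name: str) -> frozenset:
    """Why-provenance annotation of a single source tuple (hypothetical helper)."""
    return frozenset({frozenset({name})})


def join(r: Relation, s: Relation, sr: Semiring) -> Relation:
    """Join r and s on (last column of r) = (first column of s)."""
    out: Relation = {}
    for t1, a1 in r.items():
        for t2, a2 in s.items():
            if t1[-1] == t2[0]:
                t = t1 + t2[1:]
                out[t] = sr.plus(out.get(t, sr.zero), sr.times(a1, a2))
    return out


def project(r: Relation, cols: Sequence[int], sr: Semiring) -> Relation:
    """Keep the given column positions, summing annotations of merged tuples."""
    out: Relation = {}
    for t, a in r.items():
        key = tuple(t[c] for c in cols)
        out[key] = sr.plus(out.get(key, sr.zero), a)
    return out


# Hypothetical annotated relations: two paths lead from "a" to "d".
r = {("a", "b"): tok("r1"), ("a", "c"): tok("r2")}
s = {("b", "d"): tok("s1"), ("c", "d"): tok("s2")}
why_result = project(join(r, s, WHY), (0, 2), WHY)

# Same query under the counting semiring, with every source tuple counted once.
r_count = {t: 1 for t in r}
s_count = {t: 1 for t in s}
count_result = project(join(r_count, s_count, COUNTING), (0, 2), COUNTING)
```

In this sketch, `why_result` annotates the answer `("a", "d")` with the two witness sets `{r1, s1}` and `{r2, s2}`, one per derivation, while `count_result` records that the same answer has two derivations. The query code is identical in both cases; only the semiring changes.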
Deliverables
D3.1 [M12] Semantic enrichment (Report)
D3.2 [M15] Management of provenance data (Report)
D3.3 [M18] Provenance and explanation in OBDA (Report)
WP4 Coordination, Experimentation and Impact Enactment
The purpose of WP4 is to coalesce all of the prior work into two main use cases drawn from different domains, with the aim of showing the generality and effectiveness of the developed solutions. This WP is led by UNIBO.
T4.1 Project coordination
The coordinator will lead this task, and will organize two plenary project meetings.
T4.2 Development, experimentation, and validation on selected use-cases
The S-PIC4CHU approach to data preparation will be validated on two use cases, drawn from different domains, with the aim of showing the generality and effectiveness of the developed solutions (see attached letters of intent).
1) Health: Policlinico Universitario Agostino Gemelli (IRCCS) is the second-largest hospital in Italy and the largest hospital in Rome. IRCCS, which has an established partnership with ROMA3, will provide data, problems, and feedback from domain experts regarding the DSA of a wide range of different health sources, including ambulatories, hospitalization, and drugs.
2) Architecture and sustainable development: S-PIC4CHU addresses research problems related to the support of policy-makers in urban environments. These problems are studied at the IMM Design Lab of POLIMI (http://www.immdesignlab.com/people/), whose aim is the optimization of urban systems according to the European Sustainable Development Goals (SDGs).
In particular, we will experiment with all the techniques studied in WP2 and WP3, and will develop open-source software tools for the above use-cases, compatible with the S-PIC4CHU reference architecture.
T4.3 Dissemination and impact enactment
The project results will be promoted by all RUs to academia, companies, and public administrations through publications at top international venues, seminars, and courses.
Deliverables
D4.1 [M12, M24] Yearly progress report and final report
D4.2 [M18] Prototype tools for selected techniques (Software)
D4.3 [M24] Validation results on the project use-cases (Report)