2018/2019 |
Techniques for Big Data Integration in Distributed Computing Environments |
Data sources that provide a huge amount of semi-structured data are available on the Web as tables, annotated contents (e.g. RDF) and Linked Open Data. These sources can constitute a valuable source of information for companies, researchers and government agencies, if properly manipulated and integrated with each other or with proprietary data. One of the main problems is that these sources are typically heterogeneous and do not come with keys for performing join operations and effortlessly linking their records. Thus, finding a way to join data sources without keys is a fundamental and critical step of data integration. Moreover, for many applications the execution time is a critical component (e.g., in finance or national security contexts), and distributed computing can be employed to significantly reduce it. In this dissertation, I present distributed data integration techniques that scale to large volumes of data (i.e., Big Data), in particular SparkER and GraphJoin. SparkER is an Entity Resolution tool that exploits distributed computing to identify records in data sources that refer to the same real-world entity, thus enabling the integration of the records. This tool introduces a novel algorithm to parallelize the indexing techniques that are currently the state of the art. SparkER is a working software prototype that I developed and employed to perform experiments over real data sets; the results show that the parallelization techniques that I have developed are more efficient, in terms of execution time and memory usage, than those in the literature. GraphJoin is a novel technique that finds similar records by applying joining rules on one or more attributes. It combines similarity join techniques designed to work with a single rule, optimizing their execution with multiple joining rules and combining different similarity measures, both token- and character-based (e.g., Jaccard similarity and edit distance). For GraphJoin I developed a working software prototype and employed it to experimentally demonstrate that the proposed technique is effective and outperforms the existing ones in terms of execution time. |
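As an illustration of the kind of joining rule GraphJoin deals with, the sketch below combines a token-based measure (Jaccard similarity) and a character-based one (edit distance) into a single multi-rule match predicate. It is only a minimal, hypothetical example: the attribute names, thresholds and the `match` helper are not taken from GraphJoin.

```python
# Minimal sketch of a multi-rule similarity join predicate combining a
# token-based measure (Jaccard) and a character-based one (edit distance).
# Attribute names and thresholds are illustrative, not GraphJoin's own.

def jaccard(a: str, b: str) -> float:
    """Token-based similarity: overlap of word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def edit_distance(a: str, b: str) -> int:
    """Character-based Levenshtein distance (simple dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def match(r1: dict, r2: dict) -> bool:
    """A record pair matches only if every joining rule is satisfied."""
    return (jaccard(r1["title"], r2["title"]) >= 0.8
            and edit_distance(r1["venue"], r2["venue"]) <= 2)

if __name__ == "__main__":
    a = {"title": "Techniques for Big Data Integration", "venue": "VLDB"}
    b = {"title": "techniques for big data integration", "venue": "VLDB."}
    print(match(a, b))  # True: both rules hold for this pair
```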
2017/2018 |
Analysis and development of advanced data integration solutions for data analytics tools |
This thesis shows my research and development activities performed on the MOMIS Dashboard, an interactive data analytics tool to explore and visualize the content of data sources through different types of dynamic views. The software is very versatile and supports connections to the main relational DBMSs and Big Data sources; for the data connection, MOMIS Dashboard uses MOMIS, an Open Source data integration system that can integrate heterogeneous data sources. The research activity focused on the development of new tools in MOMIS that enhance its ability to generate integrated schemas: the framework was indeed integrated with NORMS, a tool for the standardization of schema labels, and with SparkER, a tool for Entity Resolution. Thanks to NORMS, MOMIS can find the semantic relationships existing between sources whose schema labels (i.e. the names of classes or attributes of a schema) contain acronyms, abbreviations and compound terms. SparkER, on the other hand, is a tool for Entity Resolution created by the DBGroup laboratory of the University of Modena and Reggio Emilia (Italy). It employs advanced Meta-Blocking techniques and thus outperforms other Entity Resolution tools based on Hadoop MapReduce. The SparkER tool in MOMIS enables schema matching based on the content of the data sources rather than on the schema labels, thus determining semantic relationships that would otherwise be difficult to identify even for domain experts. Finally, this thesis shows how MOMIS was used as a data integration engine to implement the MOMIS Dashboard, a data analytics tool that has been applied both in industrial contexts within the framework of the Italian Industry 4.0 plan and in the medical scientific domain. |
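To give a concrete flavour of the schema-label standardization that a tool like NORMS performs, the sketch below expands abbreviations and splits compound terms. The abbreviation dictionary and the `normalize_label` helper are illustrative assumptions, not the actual NORMS implementation.

```python
import re

# Illustrative sketch of schema-label normalization (NOT the actual NORMS code):
# expand known abbreviations/acronyms and split compound terms such as
# camelCase or underscore-separated labels. The dictionary below is a toy example.
ABBREVIATIONS = {"cust": "customer", "addr": "address", "qty": "quantity", "no": "number"}

def normalize_label(label: str) -> str:
    # Split compound terms at underscores, hyphens and camelCase boundaries.
    parts = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", label).replace("-", "_").split("_")
    # Expand abbreviations when a token is found in the dictionary.
    expanded = [ABBREVIATIONS.get(p.lower(), p.lower()) for p in parts if p]
    return " ".join(expanded)

if __name__ == "__main__":
    for raw in ["CustAddr", "order_qty", "invoiceNo"]:
        print(raw, "->", normalize_label(raw))
    # CustAddr -> customer address
    # order_qty -> order quantity
    # invoiceNo -> invoice number
```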
2017/2018 |
A distributed HPC infrastructure to process very large scientific data sets |
The goal of this thesis work is to develop a distributed HPC infrastructure to support the processing of very large scientific data sets, federating different compute and data resources across Europe. A set of common technical specifications has been derived to provide a high-level specification of the overall architecture and give details of the key architectural elements that are essential for realizing the infrastructure, including scientific cases, emerging technologies and new processing methodologies. The work has been mainly fueled by the need to provide a scalable solution, to handle new memory technologies (such as those based on non-volatile chips), to provide easy access to data, and to improve the user experience by fostering the convergence between traditional High Performance Computing and Cloud Computing utilization models. Nowadays the main access model for large-scale HPC systems is based on the scheduling of batch jobs. This approach does not stem from a requirement of the computational science community, but reflects the predominant issue in the management of HPC systems: the maximization of resource utilisation. Conversely, the situation differs when taking into consideration personal workstations or shared-memory servers, where time-sharing interactive executions are the norm. Our design proposes a new paradigm, called “Interactive Computing”, which refers to the capability of a system to support massive computing workloads while permitting on-the-fly interruption by the user. The real-time interaction of a user with a program runtime is motivated by various factors, such as the need to estimate the state of a program or its future tendency, to access intermediate results, and to steer the computation by modifying input parameters or boundary conditions. Within the neuroscience community, one of the scientific cases taken into consideration in this work, the most-used applications (e.g. brain activity simulation, large image volume rendering and visualization, connectomics experiments) require that the runtime can be modified interactively, so that the user can gain insight into parameters, algorithmic behaviour, and optimization potentials. The commonly agreed central components of interactive computing are, on the front end, a sophisticated user interface to interact with the program runtime and, on the back end, a separate steerable, often CPU- and memory-consuming application running on an HPC system. A typical usage scenario for interactive computing regards the visualization, processing, and reduction of large amounts of data, especially where the processing cannot be standardized or implemented in a static workflow. The data can be generated by simulation or harvested from experiments or observations. In both cases, during the analysis the scientist performs an interactive process of successive reductions and productions of data views that may include complex processing like convolution, filtering, clustering, etc. This kind of processing could easily be parallelized to take advantage of HPC resources, but it would clearly become counterproductive to break down a user session into separate interactive steps interspersed with batch jobs, as their scheduling would delay the entire execution, degrading the user experience. Besides that, in many application fields computational scientists are starting to use interactive frameworks and scripting languages to integrate the more traditional compute and data-processing applications running in batch, e.g. the use of R, Stata, Matlab/Octave or Jupyter Notebooks, just to name a few. The work has been supported by the Human Brain Project (www.humanbrainproject.eu) and Cineca (www.hpc.cineca.it), the largest supercomputing centre in Italy. |
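The steering loop described above can be pictured with a minimal, purely conceptual sketch: a long-running computation periodically exposes intermediate results and applies parameter updates pushed by the user while it runs. It is not part of the infrastructure developed in the thesis, and all names are hypothetical.

```python
import queue
import threading
import time

# Conceptual sketch of a steerable computation: the worker exposes intermediate
# results and applies parameter updates pushed by the user at runtime.
# Everything here (names and the toy "simulation") is purely illustrative.

def simulation(params: dict, updates: "queue.Queue", results: "queue.Queue") -> None:
    state = 0.0
    for step in range(10):
        # Apply any parameter change the user sent while the job was running.
        while not updates.empty():
            params.update(updates.get())
        state += params["increment"]          # the "heavy" computation step
        results.put((step, state))            # expose an intermediate result
        time.sleep(0.1)

if __name__ == "__main__":
    updates, results = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=simulation,
                              args=({"increment": 1.0}, updates, results))
    worker.start()
    time.sleep(0.35)
    updates.put({"increment": 10.0})          # steer the run on the fly
    worker.join()
    while not results.empty():
        print(results.get())                  # inspect intermediate states
```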
2017/2018 |
Scalable Joins Methods and their applications in Data Integration Systems |
Every second we produce a large amount of data; as a consequence, the ability to transform these data into useful information is crucial to efficiently manage the society we live in. Such data can be stored in different and heterogeneous systems; therefore, integration is an important task for viewing and processing them. In this context, existing Data Integration techniques have to be upgraded to support large and complex data, and this dissertation has the goal of studying and improving the performance of critical operations in Data Integration. The main topic of this work is the join operator in Big Data Integration. Join is a key operator in Data Integration, and two types of join are most used in this area. The first is the equi-join used in the Merge Join step, which merges two or more data sources: the join used in this context has an equality predicate and is usually an outer join, since the data present in one data source may not be present in the others. Another issue is the number of data sources: if there are many data sources with common attributes, using only binary joins can be inefficient. In this perspective, this dissertation proposes a new join algorithm, i.e. SOPJ. This join algorithm is created specifically to make the Merge Join step more efficient, parallelizable and scalable. These features allow managing efficiently not only large data sources but also a huge number of data sources. The second type of join is the similarity join; this operator is used for many purposes in Data Integration, especially in data cleansing and normalization operations such as duplicate detection and entity resolution. The similarity join has been widely studied in the literature, and with the MapReduce paradigm the study of how to make this operation efficient and scalable has become a hot topic. In this dissertation, we studied one of the most famous similarity join algorithms, PPJoin. We implemented this algorithm with Apache Spark and introduced improvements to make it more efficient. The experimental data show the effectiveness of the proposed solutions. Finally, we present an alternative to the similarity join for the entity resolution operation, called meta-blocking, and our contribution is to implement this method with Apache Spark to make meta-blocking scalable and usable for large datasets. The goal of this work is to study and improve the scalability and efficiency of operations in Data Integration Systems, like MOMIS, in order to be able to manage the huge amount of available data. |
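As a reference point for the Merge Join step, the sketch below merges three hypothetical sources on a shared key by chaining binary full outer equi-joins in PySpark. This is the naive baseline that a dedicated multi-way algorithm such as SOPJ is designed to improve on, not SOPJ itself.

```python
from pyspark.sql import SparkSession

# Baseline Merge Join step: chain binary full outer equi-joins on a shared key.
# This is the naive plan that a multi-way algorithm such as SOPJ improves on;
# source contents and column names are illustrative only.
spark = SparkSession.builder.appName("merge-join-baseline").getOrCreate()

s1 = spark.createDataFrame([(1, "Rossi"), (2, "Bianchi")], ["id", "name"])
s2 = spark.createDataFrame([(1, "Modena"), (3, "Bologna")], ["id", "city"])
s3 = spark.createDataFrame([(2, 34), (3, 51)], ["id", "age"])

# A record missing from one source must still survive, hence the outer joins.
merged = (s1.join(s2, on="id", how="full_outer")
            .join(s3, on="id", how="full_outer"))
merged.show()
spark.stop()
```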
2015/2016 |
Loosely Schema-aware Techniques for Big Data Integration |
A huge amount of semi-structured data is available on the Web in the form of web tables, marked-up contents (e.g. RDFa, Microdata), and Linked Open Data. For enterprises, government agencies, and researchers of large scientific projects, this data can be even more valuable if integrated with the data that they already own, which is typically the subject of traditional Data Integration processes. Being able to identify records that refer to the same entity is a fundamental step to make sense of this data. Generally, to perform Entity Resolution (ER), traditional techniques require a schema alignment between data sources. Unfortunately, the semi-structured data of the Web is usually characterized by high heterogeneity, high levels of noise (missing/inconsistent data), and very large volume, making traditional schema alignment techniques no longer applicable. Therefore, techniques that deal with this kind of data typically renounce exploiting schema information and rely on redundancy to limit the chance of missing matches. This dissertation tackles two fundamental problems related to ER in the context of highly heterogeneous, noisy and voluminous data: (i) how to extract schema information useful for ER from the data sources, without performing a traditional schema alignment; (ii) how this information can be fully exploited to reduce the complexity of ER, in particular to support indexing techniques that aim to group similar records in blocks and limit the comparisons to only those records appearing in the same block. We address these open issues by introducing: a set of novel methodologies to induce loose schema information directly from the data, without exploiting the semantics of the schemas; and BLAST (Blocking with Loosely Aware Schema Techniques), a novel unsupervised blocking approach able to exploit that information to produce high-quality block collections. We experimentally demonstrate, on real-world datasets, how BLAST can outperform the state-of-the-art blocking approaches and, in many cases, also the supervised ones. |
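The blocking idea mentioned above can be pictured with a minimal schema-agnostic token-blocking sketch: every token of every attribute value becomes a blocking key, and only records sharing a block are candidates for comparison. This is a generic baseline, not BLAST; the records are made up.

```python
from collections import defaultdict
from itertools import combinations

# Schema-agnostic token blocking: each token appearing in any attribute value
# becomes a blocking key, and candidate comparisons are generated only inside
# blocks. This is a generic baseline, not the BLAST approach; data is made up.
records = {
    "r1": {"name": "iPhone 7 Apple", "memory": "32GB"},
    "r2": {"title": "apple iphone 7", "storage": "32gb"},
    "r3": {"name": "Galaxy S7 Samsung"},
}

blocks = defaultdict(set)
for rid, rec in records.items():
    for value in rec.values():           # ignore attribute names entirely
        for token in value.lower().split():
            blocks[token].add(rid)

candidates = set()
for key, rids in blocks.items():
    for pair in combinations(sorted(rids), 2):
        candidates.add(pair)             # compare only records sharing a block

print(candidates)   # {('r1', 'r2')}: the two iPhone records end up together
```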
2015/2016 |
Revealing the underlying |
The Linked Data Principles ratified by Tim Berners-Lee promise that a large portion of Web data will be usable as one big interlinked RDF (i.e. Resource Description Framework) database. Today, with more than one thousand Linked Open Data (LOD) sources available on the Web, we are witnessing an emerging trend in the publication and consumption of LOD datasets. However, the pervasive use of external resources, together with a deficiency in the definition of the internal structure of a dataset, means that many LOD sources are extremely complex to understand. The goal of this thesis is to propose tools and techniques able to reveal the underlying structure of a generic LOD dataset, thus promoting the consumption of this new format of data. In particular, I propose an approach for the automatic extraction of statistical and structural information from a LOD source and the creation of a set of indexes (i.e. Statistical Indexes) that enhance the description of the dataset. By using this structural information, I defined two models able to effectively describe the structure of a generic RDF dataset: the Schema Summary and the Clustered Schema Summary. The Schema Summary contains all the main classes and properties used within the dataset, whether they are taken from external vocabularies or not. The Clustered Schema Summary, suitable for large LOD datasets, provides a higher-level view of the classes and the properties used, gathering together classes that are the object of multiple instantiations. All these efforts allowed the development of a tool called LODeX, able to provide a high-level summarization of a LOD dataset and a powerful visual query interface to support users in querying/analyzing an unknown dataset. All the techniques proposed in this thesis have been extensively evaluated and compared with the state of the art in their field: a performance evaluation of the LODeX module in charge of extracting the indexes is proposed; the schema summarization technique has been evaluated according to ontology summarization metrics; finally, LODeX itself has been evaluated by inspecting its portability and usability. In the second part of the thesis, I present a novel technique called CSA (Context Semantic Analysis) that exploits the information contained in a knowledge graph to estimate the similarity between documents. This technique has been compared with other state-of-the-art measures by using a benchmark containing documents and measures of similarity provided by human judges. |
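The kind of statistics a Schema Summary builds on can be gathered with a simple SPARQL query over an RDF graph. The sketch below uses rdflib on a toy Turtle snippet and is only a generic illustration of class/instance counting, not the LODeX extraction module.

```python
from rdflib import Graph

# Toy RDF data; in practice the statistics would be extracted from a LOD dump.
ttl = """
@prefix ex: <http://example.org/> .
ex:alice a ex:Person ; ex:worksFor ex:acme .
ex:bob   a ex:Person .
ex:acme  a ex:Company .
"""

g = Graph()
g.parse(data=ttl, format="turtle")

# Count how many instances each class has: the kind of statistical index a
# schema summary can be built from (generic sketch, not the LODeX code).
query = """
SELECT ?class (COUNT(?s) AS ?instances)
WHERE { ?s a ?class }
GROUP BY ?class
"""
for row in g.query(query):
    print(row["class"], int(row["instances"]))
# e.g. (order may vary):
# http://example.org/Person 2
# http://example.org/Company 1
```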
2012/2013 |
Heterogeneous DataWarehouse Analysis and Dimensional Integration |
The Data Warehouse (DW) is the main Business Intelligence instrument for the analysis of large banks of operational data and for extracting strategic information in support of the decision-making process. It is usually focused on a specific area of an organization. Data Warehouse integration is the process of combining multidimensional information from two or more heterogeneous DWs and presenting users with a unified global overview of the combined strategic information. The problem is becoming more and more frequent as the dynamic economic context sees many company mergers/acquisitions and the formation of new business networks, like co-opetition, where managers need to analyze all the involved parties and to be able to take strategic decisions concerning all the participants. The contribution of the thesis is to analyze heterogeneous DW environments and to present a dimension integration methodology that allows users to combine, access and query data from heterogeneous multidimensional sources. The integration methodology relies on graph theory and the Combined WordSense Disambiguation technique for generating semantic mappings between multidimensional schemas. Subsequently, schema heterogeneity is analyzed and handled, and compatible dimensions are uniformed by importing dimension categories from one dimension to another. This allows users from different sources to keep the same overview of their local data and increases local schema compatibility for drill-across queries. The dimensional attributes are populated with instance values by using a chase algorithm variant based on the RELEVANT clustering approach. Finally, several quality properties are discussed and analyzed. Dimension homogeneity/heterogeneity is presented from the integration perspective; the thesis also presents the theoretical fundamentals under which mapping quality properties (like coherency, soundness and consistency) are preserved. Furthermore, the integration methodology is analyzed when slowly changing dimensions are encountered. |
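A minimal sketch of what importing a dimension category means in practice (with made-up data and level names, not the thesis methodology): a Date dimension with levels day, month and year gains the quarter level used by another source, so that both dimensions roll up the same way for drill-across queries.

```python
# Illustrative sketch of importing a dimension category: source A's Date
# dimension (day -> month -> year) gains the "quarter" level that source B
# already has, making the two dimensions compatible for drill-across queries.
# Data and level names are made up.
dim_a = [
    {"day": "2013-02-14", "month": 2, "year": 2013},
    {"day": "2013-07-01", "month": 7, "year": 2013},
]

def add_quarter(member: dict) -> dict:
    """Derive the imported 'quarter' category from the existing 'month' level."""
    member = dict(member)
    member["quarter"] = (member["month"] - 1) // 3 + 1
    return member

dim_a_uniformed = [add_quarter(m) for m in dim_a]
print(dim_a_uniformed)
# [{'day': '2013-02-14', 'month': 2, 'year': 2013, 'quarter': 1},
#  {'day': '2013-07-01', 'month': 7, 'year': 2013, 'quarter': 3}]
```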
2012/2013 |
On Declarative Data-Parallel Computation: Models, Languages and Semantics |
If we put under analysis the plethora of large-scale data-processing tools available nowadays, we can recognize two main approaches: a declarative approach pursued by parallel DBMS systems and firmly grounded on relational model theory; and an imperative approach followed by modern "MapReduce-like" data-processing systems, which are highly scalable, fault-tolerant, and mainly driven by industrial needs. Although there has been some work trying to bring together the two worlds, these works focus mainly on exporting languages and interfaces (i.e., declarative languages on top of imperative systems, or MapReduce-like functions over parallel DBMSs) or on a systematic merging of the features of the two approaches. We advocate that, instead, a declarative-imperative approach should be attempted: that is, the development of a new computational model with a related language, based on relational theory and following the same patterns commonly present in modern data-processing systems, while maintaining a declarative flavor. |
2011/2012 |
Information Integration for biological data sources |
This thesis focuses on data integration and data provenance in the context of the MOMIS data integration system, which was used to create the CEREALAB database. Its main contribution is the creation of the CEREALAB database V2.0, with new functionalities derived from the needs of the end users, and the study of different data provenance models, leading to a new component for the MOMIS system that offers data provenance support to the CEREALAB users. |
2010/2011 |
Label Normalization and Lexical Annotation for Schema and Ontology Matching | The goal of this thesis is to propose, and experimentally evaluate, automatic and semi-automatic methods performing label normalization and lexical annotation of schema labels. In this way, we may add sharable semantics to legacy data sources. Moreover, annotated labels are a powerful means to discover lexical relationships among structured and semi-structured data sources. Original methods to automatically normalize schema labels and extract lexical relationships have been developed, and their effectiveness for automatic schema matching is shown. |
2010/2011 |
Query Optimization and Quality-Driven Query Processing for Integration Systems | This thesis focused on some core aspects of data integration, i.e. Query Processing and Data Quality. First, it proposed new techniques for the optimization of the full outer join operation, which is used in data integration systems for data fusion. Then it demonstrated how to achieve Quality-Driven Query Processing, where quality constraints specified in Data Quality Aware Queries are used to perform query optimization. |
2009/2010 |
Data and Service Integration: Architectures and Applications to Real Domains | This thesis focuses on Semantic Data Integration Systems, with particular attention to mediator system approaches, to perform data and service integration. One of the topics of this thesis is the application of MOMIS to the bioinformatics domain, integrating different public databases to create an ontology of molecular and phenotypic cereal data. However, the main contribution of this thesis is a semantic approach to perform aggregated search of data and services. In particular, I describe a technique that, on the basis of an ontological representation of the data and services related to a domain, supports the translation of a data query into a service discovery process; this technique has also been implemented as a MOMIS extension. The approach can be described as a Service as Data approach, as opposed to Data as a Service approaches: informative services are considered as a kind of source to be integrated with other data sources, to enhance the domain knowledge provided by a Global Schema of data. Finally, new technologies and approaches for data integration have been investigated, in particular distributed architectures, with the objective of providing a scalable architecture for data integration; an integration framework in a distributed environment is presented that allows a data integration process to be realized on the cloud. |
2008/2009 |
Automatic Lexical Annotation: an effective technique for dynamic data integration. | The thesis illustrates how lexical annotation is a crucial element in the field of data integration. Thanks to lexical annotation, new relationships are discovered between the elements of a schema or between elements of different schemas. Several methods to automatically perform the annotation of data sources are described and evaluated in different scenarios. Lexical annotation can also improve systems for the discovery of matchings between ontologies; some experiments on applying lexical annotation to the results of a matcher are presented. Finally, the probabilistic annotation approach is introduced and its application in dynamic integration processes is illustrated. |
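To make the idea of lexical annotation concrete, the sketch below annotates two schema labels with WordNet synsets (via NLTK, which needs the WordNet corpus downloaded) and derives a broader/narrower relationship from hypernymy. The labels and helper functions are illustrative and do not reproduce the annotation methods developed in the thesis.

```python
import nltk
from nltk.corpus import wordnet as wn

# Illustrative sketch of lexical annotation: attach WordNet synsets to schema
# labels, then derive a relationship between them (here, hypernymy, i.e. a
# broader/narrower-term relationship). Labels and helpers are made up; this is
# not the thesis' actual method. Requires the WordNet corpus (nltk.download).

def annotate(label: str):
    """Return the most common noun synset for an (already normalized) label."""
    synsets = wn.synsets(label, pos=wn.NOUN)
    return synsets[0] if synsets else None

def is_broader(general, specific) -> bool:
    """True if `general` is a hypernym (broader term) of `specific`."""
    return general in specific.closure(lambda s: s.hypernyms())

if __name__ == "__main__":
    nltk.download("wordnet", quiet=True)
    s_vehicle, s_car = annotate("vehicle"), annotate("car")
    print(s_vehicle, s_car)
    print(is_broader(s_vehicle, s_car))   # True: "vehicle" is broader than "car"
```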
2008/2009 |
Query Management in Data Integration Systems: the MOMIS approach. |
This thesis investigates the issue of Query Management in Data Integration Systems, taking into account several problems that have to be faced during the query processing phase. The achieved goals of the thesis have been the study, analysis and proposal of techniques for effectively querying Data Integration Systems. The proposed techniques have been developed in the MOMIS Query Manager prototype to enable users to query an integrated schema and to provide them with a consistent and concise unified answer. The effectiveness of the MOMIS Query Manager prototype has been demonstrated by means of the THALIA testbed for Data Integration Systems; experimental results show how the MOMIS Query Manager can deal with all the queries of the benchmark. A new kind of metadata that offers a synthesized view of an attribute's values, the relevant values, has been defined, and the effectiveness of such metadata for creating or refining a search query in a knowledge base is demonstrated by means of experimental results. The security issues in data integration/interoperation systems have been investigated, and an innovative method to preserve data confidentiality and availability when querying integrated data has been proposed. Finally, a security framework for collaborative applications, in which the actions that users can perform are dynamically determined on the basis of their attribute values, has been presented, and its effectiveness has been demonstrated by an implemented prototype. |
2002/2003 |
Agent Technology Applied to Information Systems |
The thesis is thus divided into three parts. In the first one, software agents are presented and critically compared to other mainstream technologies. We also discuss modeling issues. In the second part, some example systems where we applied agent technology are presented and the solution is discussed. The realistic scenarios and requirements for the systems were provided by the WINK and SEWASIE projects. The third part presents a logical framework for characterizing the interaction of software agents in virtual societies where they may act as representatives of humans. |
2002/2003 |
Dai Dati all'Informazione: il sistema MOMIS |
The thesis introduces the methodology for the construction of a Global Virtual View of structured data sources implemented in the MOMIS system. In particular, the thesis focuses on the problem of the management and update of multi-language sources. Moreover, the thesis proposes a comparison between MOMIS and the main mediators available in the literature. Finally, some applications of the MOMIS system in the fields of the Semantic Web and e-commerce (developed within national and European projects) are presented. |
2001/2002 |
Knowledge Management for Electronic Commerce applications | This work summarizes the activities developed during the Ph.D. studies in Information Engineering. It is organized in two parts. The first part describes Knowledge Management Systems and their applications to Electronic Commerce; in particular, a technical and organizational overview of the most critical issues concerning Electronic Commerce applications is presented. This part is the result of a two-year research effort carried out in cooperation with Professor Enrico Scarso within the interdisciplinary (ICT and business organization) MIUR project "Il Commercio Elettronico: nuove opportunità e nuovi mercati per le PMI". The second part introduces the Intelligent Integration of Information (I3) research topic and presents the MOMIS system approach for I3. It outlines the theory underlying the MOMIS prototype and focuses on the generation of virtual catalogs in the electronic commerce environment, exploiting the SIDesigner component. A new MOMIS architecture, based on XML Web Services, is finally proposed. The new architecture not only aims at addressing specific virtual catalog issues, but also leads to a general improvement of the MOMIS system. |
1999/2000 |
Intelligent Information Integration: The MOMIS Project | This thesis describes the work done during my Ph.D. studies in Computer Engineering. It is organized in two parts. The first and main part describes the MOMIS research project for the Intelligent Integration of heterogeneous information: it outlines the theory for Intelligent Integration and the design and implementation of the prototype that implements the theoretical techniques. During my Ph.D. studies I stayed at Northeastern University in Boston, Mass. (USA); the subject of the second part of this document is the work I did with Professor Ken Baclawski in information retrieval, on the annotation of documents using ontologies and on the retrieval of the annotated documents. |
1997/98 |
Utilizzo di tecniche di Intelligenza Artificiale nell'Integrazione di Sorgenti Informative Eterogenee | The doctoral thesis presents the MOMIS system (Mediator envirOnment for Multiple Information Sources) for the integration of structured and semi-structured data sources according to the source-federation approach. The system provides for the semi-automatic definition of a single integrated schema that exploits the semantic information of each source schema (where "schema" means the set of metadata describing a data repository). |
1992/93 |
Uno Strumento di Inferenza nelle Basi di Dati ad Oggetti (Subsumption inference for Object-Oriented Data Models) |
Object-oriented data models are being extended with recursion to gain expressive power. This complicates both the incoherence detection problem, which has to deal with recursive class descriptions, and the optimization problem, which has to deal with recursive queries on complex objects. In this Ph.D. thesis, we propose a theoretical framework able to face the above problems. In particular, it is able to validate and automatically classify, in a database schema, (recursive) classes, views and queries organized in an inheritance taxonomy. The framework adopts the ODL formalism (an extension of the Description Logics developed in the area of Artificial Intelligence), which is able to express the semantics of complex object data models and to deal with cyclic references at the schema and instance level. It includes subsumption algorithms, which perform automatic placement in a specialization hierarchy of (recursive) views and queries, and incoherence algorithms, which detect incoherent (i.e., always empty) (recursive) classes, views and queries. As different styles of semantics (greatest fixed-point, least fixed-point and descriptive) can be adopted to interpret recursive views and queries, we first analyze and discuss the choice of one or another of the semantics, and then give the subsumption and incoherence algorithms for the three different semantics. We show that subsumption computation and incoherence detection appear to be feasible, since in almost all practical cases they can be solved by polynomial-time algorithms. Finally, we show how subsumption computation is useful to perform semantic query optimization, which uses semantic knowledge (i.e., integrity constraints) to transform a query into an equivalent one that may be answered more efficiently. The Ph.D. thesis is in Italian. The content of this Ph.D. thesis can be found in the following two papers:
|
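As a small worked example of the automatic classification that subsumption enables (with made-up, non-recursive class descriptions, so that the three semantics coincide):

```latex
% Hypothetical class and view descriptions in Description Logic style;
% not taken from the thesis.
\begin{align*}
\mathit{Employee}   &\doteq \mathit{Person} \sqcap \exists\,\mathit{worksFor}.\mathit{Company}\\
\mathit{ITEmployee} &\doteq \mathit{Person} \sqcap \exists\,\mathit{worksFor}.(\mathit{Company} \sqcap \mathit{SoftwareFirm})
\end{align*}
% Every instance of ITEmployee works for something that is (in particular) a
% Company, so it also satisfies the description of Employee; hence
% ITEmployee \sqsubseteq Employee, and the subsumption algorithm automatically
% places the view ITEmployee below Employee in the specialization hierarchy,
% without the designer stating the inheritance link explicitly.
```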