Wisdom: Unit of Roma

Research Unit: University of Roma Tre

Department of Informatica e Automazione

Research Program of the Unit (model B)

Research Program Coordinator of the Unit

Prof. Merialdo Paolo

Department of INFORMATICA E AUTOMAZIONE
Faculty of ENGINEERING
University of ROMA TRE

Via della Vasca Navale, 79 - 00149 Roma, Italy
Tel: +39 06 55173218
Fax: +39 06 5573030

E-mail: merialdo@dia.uniroma3.it
Home page: www.dia.uniroma3.it/~merialdo

Participants in this Research Unit

Participant	Department	Qualification
MERIALDO PAOLO	Dep. INFORMATICA E AUTOMAZIONE	Researcher
ATZENI PAOLO	Dep. INFORMATICA E AUTOMAZIONE	Full Professor
TORLONE RICCARDO	Dep. INFORMATICA E AUTOMAZIONE	Associate Professor
CABIBBO LUCA	Dep. INFORMATICA E AUTOMAZIONE	Associate Professor

Specific Title of the Research Program of this Unit

Automatic extraction of data and schemas from data-intensive web sources

Description of the Research Program of this Unit

The research unit of Roma Tre is mainly involved in the activities of the first Theme of the project, but we also participate in the activities of the second Theme.
Our unit has studied issues related to the extraction of data from data-intensive web sites. In particular, we have developed a system, called RoadRunner (Crescenzi et al. 2001, Crescenzi et al. 2002), to automatically generate wrappers for pages from data-intensive web sites (see scientific basis). Given a small set of pages similar in structure, the system generates a wrapper. The wrapper can then be applied to extract data from pages that share the same structure as the input pages. Several experiments on real-life web sites have demonstrated the effectiveness and efficiency of the approach.
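The idea of generating a wrapper from a few structurally similar pages, and then applying it to new pages, can be illustrated with a deliberately simplified sketch. This is not the RoadRunner algorithm: it assumes that all sample pages have exactly the same sequence of tags and differ only in their text, and the function names (tokenize, infer_wrapper, extract) and the sample pages are hypothetical.

```python
import re

# A token is either a tag or a run of text between tags.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split a page into tags and non-empty text chunks."""
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]


def infer_wrapper(pages):
    """Generalise sample pages into a template (toy assumption: same tag skeleton).

    Tokens that are identical in every page are kept as constants;
    positions where the text differs become data fields (None).
    """
    token_lists = [tokenize(p) for p in pages]
    if len({len(tokens) for tokens in token_lists}) != 1:
        raise ValueError("toy sketch: pages must have the same number of tokens")
    template = []
    for column in zip(*token_lists):
        if len(set(column)) == 1:
            template.append(column[0])
        elif any(tok.startswith("<") for tok in column):
            raise ValueError("toy sketch: mismatching tags are not handled")
        else:
            template.append(None)
    return template


def extract(template, page):
    """Apply the template to a page and return the values found in the fields."""
    tokens = tokenize(page)
    if len(tokens) != len(template):
        raise ValueError("page does not match the template")
    values = []
    for slot, tok in zip(template, tokens):
        if slot is None:
            if tok.startswith("<"):
                raise ValueError("page does not match the template")
            values.append(tok)
        elif slot != tok:
            raise ValueError("page does not match the template")
    return values


samples = [
    "<html><h1>Player</h1><b>Rossi</b><i>Roma</i></html>",
    "<html><h1>Player</h1><b>Bianchi</b><i>Milan</i></html>",
]
wrapper = infer_wrapper(samples)
new_page = "<html><h1>Player</h1><b>Verdi</b><i>Torino</i></html>"
print(extract(wrapper, new_page))
```

Here the constant text "Player" stays in the template, while the two positions whose content varies across the samples are treated as fields. Real pages also exhibit optional and repeated elements, which this sketch does not attempt to handle.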
The experience we have gained in developing the RoadRunner system (and the system itself) represents the basis of the activities of the research unit. Our main contributions concentrate on theme 1.2 (Adding a new information source to the domain ontology), in collaboration with the research unit of Modena. As discussed in Model A of the proposal, extending a domain ontology corresponds to adding a new information source. In the case of a data-intensive web source, this process involves the following tasks: (i) inferring a schema that describes the organization of the data offered by the source, (ii) defining wrappers to extract the data from the source, (iii) providing semantics to the extracted data and schema, and (iv) extending the Global Virtual View.
The techniques developed for the automatic generation of wrappers represent a partial and limited solution to the above tasks. With respect to the first two tasks, approaches in the literature can infer a schema, and its associated wrapper, for a set of structurally homogeneous pages. Modern web sites, however, organize their pages in several classes (each class containing similar pages), in a complex and articulated hypertextual structure. In order to generate wrappers to extract data from a whole site (or from more than one site), we need to understand and describe the organization of pages in the site. At present, such a description can only be produced manually, which drastically limits the scalability of the approach.
The following example illustrates the issue. Consider the web site of a sports event of worldwide interest. It contains thousands of pages with information about teams, players, matches, and news. The site content is organized in a regular way; for example, we may find one page for each player, one page for each team, and so on. These pages are themselves well structured. For instance, all the player pages share the same structure and, at the intensional level, they present similar information (the name of the player, his current club, a short biography, etc.). Similarly, all team pages share a common structure and common intensional information, which differ from those of the player pages. Also, pages contain links to one another, in order to provide effective navigation paths that reflect semantic relationships; for example, every team page contains links to the pages of its players.
In order to extract data from this site, we need to generate a wrapper for each class of pages (one for player pages, one for team pages, etc.). Then, once the wrappers are generated, in order to continuously extract data from the site, we need a description of the hypertext paths connecting the various classes of pages. Observe that the extension of every class of pages may evolve. Continuing with our example, every day one or more new match pages can be added to the site. Only if we know the paths that lead to the extensions of the classes of pages offered by the site can we reach the instances.
Therefore, in order to extract data from a data-intensive web site, we have to generate a description (a schema) of the site structure. Such a description should emphasize the classes of pages offered by the site and the hypertextual connections among them.
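A site description of this kind can be pictured as a small graph whose nodes are classes of pages and whose edges are the links among them. The following is only an illustrative sketch under our own assumptions: the SiteSchema class, its methods, and the class names ("home", "team_list", "match_list") are hypothetical and are not part of any deliverable. It shows how such a description would give the navigation path needed to reach the instances of a class.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SiteSchema:
    """Hypothetical site description: page classes and links among them."""

    classes: set = field(default_factory=set)
    # links[source] is the set of page classes that pages of class `source` link to
    links: dict = field(default_factory=dict)

    def add_link(self, source, target):
        """Record that pages of class `source` contain links to class `target`."""
        self.classes.update((source, target))
        self.links.setdefault(source, set()).add(target)

    def path_to(self, start, goal):
        """Return a shortest chain of page classes from start to goal, or None."""
        queue = deque([[start]])
        seen = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            # sorted() only makes the traversal order deterministic
            for nxt in sorted(self.links.get(path[-1], ())):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None


schema = SiteSchema()
schema.add_link("home", "team_list")
schema.add_link("home", "match_list")
schema.add_link("team_list", "team")
schema.add_link("team", "player")
schema.add_link("match_list", "match")

print(schema.path_to("home", "player"))
print(schema.path_to("home", "match"))
```

With such a structure, newly added match pages can be reached by following the path from the home page to the match class again, without re-analysing the site.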
The main goal of our study is to define and develop techniques to automatically generate the description of a data-intensive web site.
An important requirement is the efficiency of the proposed technique. In this context, efficiency is related to the number of pages that must be visited in order to generate the description. We aim at inferring a schema for the site by exploring a small yet representative portion of its pages.
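To make the notion of "classes of structurally similar pages" and of a bounded number of visited pages concrete, the sketch below groups pages by a simple structural fingerprint (the set of tag paths occurring in the page) and stops after a given budget of pages. It is an assumption-laden illustration, not the technique the unit intends to develop: the fingerprint, the Jaccard similarity threshold of 0.8, the budget parameter, and all names are hypothetical choices.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they are recorded but never pushed on the stack.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class _TagPaths(HTMLParser):
    """Collect the set of root-to-element tag paths in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop up to and including the matching open tag
            while self.stack.pop() != tag:
                pass


def fingerprint(html):
    """Structural fingerprint of a page: the set of its tag paths."""
    parser = _TagPaths()
    parser.feed(html)
    parser.close()
    return frozenset(parser.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_pages(pages, threshold=0.8, budget=None):
    """Group pages (a dict url -> html) into classes, visiting at most `budget` pages."""
    groups = []  # list of (fingerprint of first member, [urls])
    for visited, (url, html) in enumerate(pages.items()):
        if budget is not None and visited >= budget:
            break
        fp = fingerprint(html)
        for representative, members in groups:
            if jaccard(representative, fp) >= threshold:
                members.append(url)
                break
        else:
            groups.append((fp, [url]))
    return [members for _, members in groups]


pages = {
    "p1": "<html><body><h1>Rossi</h1><p>bio</p></body></html>",
    "t1": "<html><body><h1>Roma</h1><ul><li>Rossi</li><li>Bianchi</li></ul></body></html>",
    "p2": "<html><body><h1>Bianchi</h1><p>bio</p></body></html>",
}
print(group_pages(pages))
print(group_pages(pages, budget=2))
```

In this toy setting the two player pages fall into one class and the team page into another; with a budget of two pages, only the first two pages are examined.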
It is worth observing that, by reasoning on the site schema, it is also possible to address the issue of associating semantics with the extracted data (task (iii)). The site schema describes classes of pages with similar structure. It is likely that pages in the same class carry the same intensional information, and that links among classes represent conceptual associations. Consider again our running example: the class of player pages is connected to the class of team pages. We aim at studying techniques to associate semantics with classes and associations by analyzing the contents of the pages of each class and the links to pages of other classes. Our unit has proposed techniques for annotating the schemas associated with automatically generated wrappers (Arlotta et al. 2003). The direction we will follow is to complement and extend these techniques with the recent studies on lexical chains proposed by the research unit of Modena.
We now describe the various phases in which the project will be divided.

PHASE 1
During the first phase the research unit will work with all the other units involved in the project in order to define the methodological and functional architecture for the whole project (deliverable D0.R1). We will also collaborate with the other research units to develop a critical analysis of the emerging standards and languages for ontologies (deliverable D1.R1).

DELIVERABLES
D0.R1 Technical Report describing the methodological and functional architecture of the project (in collaboration with Modena e Reggio Emilia - MO, Bologna - BO, Trento - TN)
D1.R1 Technical Report describing a critical analysis of ontology languages and standards (in collaboration with BO, MO, TN)

PHASE 2
During the second phase the research unit will concentrate on the development of techniques to automatically infer the schema of a data-intensive web site. The proposed techniques will be described in a technical report (deliverable D1.R5). In addition, the unit will work together with the other units on the definition of the interfaces of the components for the integrated prototype (deliverable D0.R2).

DELIVERABLES
D0.R2 Definitions of the interfaces of the components of the integrated prototype (in collaboration with MO, BO, TN)
D1.R5 Technical Report describing techniques to automatically infer the schema of a data-intensive web site

PHASE 3
During the third phase of the project the research unit will develop and experiment with the prototype for automatically inferring the schema of a data-intensive web site (deliverable D1.P4). Also, the research unit, together with the unit of Modena, will develop techniques for associating semantics with the schema of a web site (deliverable D1.R6). Finally, the unit will collaborate with the other units on the integration of the prototypes developed in the project (deliverable D0.P1).

DELIVERABLES
D0.P1 Integrated system prototype (in collaboration with MO, BO, TN)
D1.R6 Technical Report describing techniques for associating semantics with the schema of a web site (in collaboration with MO)
D1.P4 Prototype for the automatic inference of the schema of a data-intensive web site