| The research unit of Roma Tre
is mainly involved in the activities of the first Theme of the
project, but we also participate in the activities of the second
Theme.
Our unit has studied issues related to the extraction of data
from data-intensive web sites. In particular, we have developed
a system, called RoadRunner (Crescenzi et al. 2001, Crescenzi
et al. 2002), to automatically generate wrappers for pages from
data-intensive web sites (see scientific basis). Given a small set
of pages similar in structure, the system generates a wrapper. The wrapper
can then be applied to extract data from pages that share the
same structure as the input pages. Several experiments on real-life
web sites have demonstrated the effectiveness and the efficiency
of the approach.
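To make the idea of generating a wrapper from a few similar pages concrete, the following is a minimal sketch. It is not the RoadRunner algorithm: it simply aligns the tokens of two sample pages with Python's standard difflib, treats the matching tokens as the template and the mismatching ones as data fields. The function names and the sample pages are illustrative assumptions.

```python
import re
from difflib import SequenceMatcher

# A token is either a whole tag or a run of text between tags.
TOKEN = re.compile(r"<[^>]+>|[^<]+")

def tokenize(html):
    """Split a page into tag and text tokens, dropping whitespace-only text."""
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]

def infer_wrapper(page_a, page_b):
    """Align two sample pages: equal tokens form the template,
    mismatching regions become capture groups (data fields)."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    parts = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            parts.extend(re.escape(t) for t in a[i1:i2])
        else:
            parts.append("(.*?)")
    return re.compile(r"\s*".join(parts), re.DOTALL)

def extract(wrapper, page):
    """Apply the wrapper to a page; return the field values, or None if the
    page does not share the structure of the sample pages."""
    m = wrapper.search(page)
    return [g.strip() for g in m.groups()] if m else None

# Hypothetical sample pages sharing one structure.
page1 = "<html><h1>Rossi</h1><p>Club: <b>Roma</b></p></html>"
page2 = "<html><h1>Bianchi</h1><p>Club: <b>Lazio</b></p></html>"
page3 = "<html><h1>Verdi</h1><p>Club: <b>Milan</b></p></html>"

wrapper = infer_wrapper(page1, page2)
print(extract(wrapper, page3))
```

A sketch of this kind handles only flat templates; repeated or optional substructures, which real pages contain, need a richer alignment than a plain token diff.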
The experience we have gained in developing the RoadRunner
system (and the system itself) forms the basis of the activities
of the research unit. Our main contributions concentrate on
Theme 1.2 (Adding a new information source to the domain
ontology),
in collaboration with the research unit of Modena. As discussed
in Model A of the proposal, the extension of a domain ontology
corresponds to the addition of a new information source. In the
case of a data-intensive web source, this process involves the
following tasks: (i) inferring a schema that describes the organization
of data offered by the source, (ii) defining wrappers to
extract the data from the source, (iii) providing semantics
to the extracted data and schema, and (iv) extending the
Global Virtual View.
The techniques developed for the automatic generation of wrappers
represent a partial and limited solution to the above tasks.
With respect to the first two tasks, approaches in the literature
can infer a schema, and its associated wrapper, for a set of
structurally homogeneous pages. Modern web sites organize their
pages in several classes (each class containing similar pages),
in a complex and articulated hypertextual structure. In order
to generate wrappers to extract data from a whole site (or from
more than one site),
we need to understand and describe the organization of pages
in the site. At present, such a description can only be produced
manually, which drastically limits the scalability of the approach.
The following example illustrates the issue. Consider the web
site of a sporting event of worldwide interest. It comprises thousands
of pages with information about teams, players, matches,
and news. The site content is organized in a regular way; for
example, we may find one page for each player, one page for each team,
and so on. These pages are themselves well-structured. For instance,
all the player pages share the same structure and, at the intensional
level, they present similar information (the name of the player,
his current club, a short biography, etc.). Similarly, all team
pages share a common structure and common intensional information,
which differ from those of the player pages. Also, pages contain
links to one another, in order to provide effective navigation
paths that reflect semantic relationships; for example, every
team page contains links to the pages of its players.
In order to extract data from this site, we need to generate
a wrapper for each class of pages (one for player pages, one
for team pages, etc.). Then, once the wrappers are generated,
in order to continuously extract data from the site, we need
a description of the
hypertext paths connecting the various classes of pages. Observe
that the extension of every class of pages
may evolve. Continuing with our example, every day one or more
new match pages can be added to the site. Only if we know the
paths that lead to the extensions of the classes of pages offered
by the site can we reach the instances.
Thus, in order to extract data from a data-intensive web site,
we have to generate a description (a schema) of the site structure.
Such a description should emphasize the classes of pages offered
by the site and the hypertextual connections among them.
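A description of this kind can be pictured as a directed graph whose nodes are classes of pages and whose edges are the links between them. The sketch below is only an illustration under that assumption: the class names and links are invented for the sport event example, and navigation_path is a hypothetical helper that finds, by breadth-first search, a shortest chain of classes leading from an entry point to the class whose instances we want to reach.

```python
from collections import deque

# Hypothetical site schema for the sport event example:
# each page class maps to the page classes it links to.
SITE_SCHEMA = {
    "home": ["team_list", "news_list"],
    "team_list": ["team"],
    "team": ["player", "match"],
    "news_list": ["news"],
    "player": ["team"],
    "match": ["team"],
    "news": [],
}

def navigation_path(schema, start, target):
    """Return a shortest chain of page classes from start to target,
    or None if the target class cannot be reached."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in schema.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(navigation_path(SITE_SCHEMA, "home", "match"))
```

Following such a path at the level of pages (home page, list of teams, each team page, its match links) reaches the current extension of the match class, including pages added after the wrappers were generated.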
The main goal of our study is to define and develop techniques
to automatically generate the description of a data-intensive
web site.
An important requirement is the efficiency of the proposed technique.
In this context, efficiency is related to the number of
pages that must be visited in order to generate the description. We aim
to infer a schema for the site by exploring a small yet representative
portion of its pages.
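As a rough illustration of exploring only part of a site, the sketch below visits at most a fixed number of pages and groups them by a crude structural signature (the set of root-to-tag paths occurring in the page). Both the signature and the breadth-first visiting order are simplifying assumptions made for this example, not the techniques the unit intends to develop; fetch, the toy SITE and all names are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser

class Signature(HTMLParser):
    """Collects the root-to-tag paths and the outgoing links of a page.
    Void tags such as <br> are not special-cased in this sketch."""
    def __init__(self):
        super().__init__()
        self.stack, self.paths, self.links = [], set(), []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray close tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

def explore(fetch, start_url, budget):
    """Visit at most `budget` pages breadth-first and group the visited
    URLs by structural signature (one group per candidate page class)."""
    classes = {}
    queue, seen = deque([start_url]), {start_url}
    visited = 0
    while queue and visited < budget:
        url = queue.popleft()
        parser = Signature()
        parser.feed(fetch(url))
        visited += 1
        classes.setdefault(frozenset(parser.paths), []).append(url)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return list(classes.values())

# Toy in-memory site standing in for real HTTP requests.
SITE = {
    "/": '<html><body><ul><li><a href="/team/1">A</a></li>'
         '<li><a href="/team/2">B</a></li></ul></body></html>',
    "/team/1": '<html><body><h1>A</h1><p><a href="/player/1">P1</a></p></body></html>',
    "/team/2": '<html><body><h1>B</h1><p><a href="/player/2">P2</a></p></body></html>',
    "/player/1": '<html><body><h2>P1</h2><table><tr><td>x</td></tr></table></body></html>',
    "/player/2": '<html><body><h2>P2</h2><table><tr><td>y</td></tr></table></body></html>',
}

groups = explore(SITE.__getitem__, "/", budget=4)
print(groups)
```

With a budget of four pages the toy site already yields three candidate classes, although one player page is never fetched. Exact signature equality is too strict for real pages, where pages of the same class differ in optional and repeated elements.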
It is worth observing that, by reasoning on the site schema, it is
possible to address the issue of associating semantics to the
extracted data as well (task (iii)). The site schema describes
classes of pages with similar structure. It is likely that pages
in the same class carry the same intensional information, and
that links among classes represent conceptual associations.
Consider again our running example: the class of player pages
is connected to the class of team pages. We aim to study
techniques to associate semantics to classes and associations
by analyzing the contents of pages of each class and the links
to pages of other classes. Our unit has proposed techniques
for annotating the schemas associated with automatically generated
wrappers (Arlotta et al. 2003). The direction we will follow
is to complement and extend these techniques with the recent
studies on lexical chains proposed by the research unit of
Modena.
We now describe the various phases into which the project will
be divided.
PHASE 1
During the first phase the research unit will work with all
the other units involved in the project in order to define the
methodological and functional architecture for the whole project
(deliverable D0.R1). Also, we will collaborate with the other
research units to develop a critical analysis of the emerging
standards and languages for ontologies (deliverable D1.R1).
DELIVERABLES
D0.R1 Technical Report describing the methodological and functional
architecture of the project (in collaboration with Modena e
Reggio Emilia - MO, Bologna - BO, Trento - TN)
D1.R1 Technical Report describing a critical analysis of ontology
languages and standards (in collaboration with BO, MO, TN)
PHASE 2
During the second phase the research unit will concentrate on
the development of techniques to automatically infer the schema
of a data-intensive web site. The proposed techniques will be described
in a technical report (deliverable D1.R5). In addition, the
unit will work together with the other units on the definition
of the interfaces of the components for the integrated prototype
(deliverable D0.R2).
DELIVERABLES
D0.R2 Definitions of the interfaces of the components of the
integrated prototype (in collaboration with MO, BO, TN)
D1.R5 Technical Report describing techniques to automatically
infer the schema of a data-intensive web site
PHASE 3
During the third phase of the project the research unit will
develop and experiment with the prototype for automatically inferring
a schema of a data-intensive web site (deliverable D1.P4).
Also, the research unit, together with the unit of Modena, will
develop techniques for associating semantics to the schema of a
web site (deliverable D1.R6). Finally, the unit will collaborate
with the other units on the integration of the prototypes developed
in the project (deliverable D0.P1).
DELIVERABLES
D0.P1 Integrated system prototype (in collaboration with MO,
BO, TN)
D1.R6 Technical Report describing techniques for associating
semantics to the schema of a web site (in collaboration with MO)
D1.P4 Prototype for the automatic inference of the schema of
a data-intensive web site
|