Abstracts

Large-Scale Data Management for the Sciences

Professor Tanu Malik, Purdue University

Traditional enterprises and novel scientific applications are accumulating petabyte-scale datasets, which makes the need for large-scale data management more pressing than ever. Geographic distribution of the datasets accompanied by complex demands on data makes large-scale data management challenging. This is especially true for sciences that model complex physical and biological phenomena using data from multiple sources.

In this talk I will address two critical problems in scientific data management: combining a large number of diverse data sources for the execution of scientific queries, and executing data-intensive scientific queries efficiently, in terms of both network and I/O, on these data sources. I will present SkyQuery--a system that federates data from several petabyte-size, autonomous, and heterogeneous astronomy databases scattered worldwide. Using SkyQuery, scientists can write declarative queries that compare and merge multiple astronomical datasets. For efficient query execution and scalability, I will present Bypass-Yield Caching--a novel caching framework for database systems that dramatically reduces the network bandwidth requirements of data-intensive federations such as SkyQuery, making them good network citizens. Distributed applications such as the Bypass-Yield Cache often rely on a priori knowledge of query cardinalities to make optimization decisions. In this context, I will present a black-box approach to selectivity estimation that is suitable for distributed applications.

The success of SkyQuery and its adoption by the National Virtual Observatory is an example of data management systems enabling scientific endeavors.


On the Expressiveness of Implicit Provenance in Query and Update Languages

Stijn Vansummeren, PhD
Postdoctoral fellow, Research Foundation -- Flanders

Many contemporary scientific databases, sometimes referred to as curated databases, are constructed by a labor-intensive process of copying, correcting, and annotating data from other sources. The value of curated databases lies in their organization and in the trustworthiness of their data. To assess the latter, knowing the origin of data (especially where it was copied or created from) -- its "provenance" -- is particularly important. In practice, provenance, if it is recorded at all, is recorded manually. This process is both time-consuming and error-prone. Automated provenance recording support is therefore clearly desirable. In order to provide such support, however, it is important to obtain a clear understanding of the meaning and expressiveness of existing database "operations" -- both queries and updates -- with respect to provenance.

Our first aim in this talk is to discuss one possible provenance semantics for queries and updates on complex objects that is particularly attractive for its expressive completeness. That is, for every query or update O that manually records provenance, it can be shown that there exists a normal query or update (without any reference to provenance) whose provenance semantics is equivalent to O. We feel that this strongly argues in favor of the proposed provenance semantics as the "right" basis for recording provenance automatically. Our second aim in this talk is to discuss ongoing research with respect to the proposed provenance semantics, namely (1) conservative extension properties for provenance-aware nested update languages, and in particular how the above expressive completeness results transfer to SQL updates; and (2) the relationship between recording provenance (to which our expressive completeness results apply) and querying provenance (to which they do not).

Stijn Vansummeren is a postdoc at the University of Hasselt, Belgium. He is supported by the Belgium National Foundation of Science.


Protocols for Business Service Engagements

Nirmit Desai, PhD
Postdoctoral fellow, NCSU

An increasing portion of our economic activities is supported via business service engagements. Managing such service engagements is challenging: not only are the participating organizations autonomous and their information systems heterogeneous, but also the underlying requirements evolve continually. Current approaches fall short in two respects: (1) they employ activity-based abstractions for modeling inherently interactive engagements, and (2) they employ either specification-level or data-level semantics to capture inherently business-level interactions.

I propose a modular abstraction of business protocols to capture the business interactions in service engagements. Protocols are understood in terms of the business-level notion of commitments of the parties involved. Also, protocols can be reused and composed to yield composite protocols. For evaluation, we captured the TWIST foreign exchange standard processes after soliciting requirements from financial domain experts and TWIST committee members. The protocol-based specifications are then compared with those found in the TWIST standards documents with respect to ambiguity and redundancy. We find that not only does protocol-based engineering of TWIST result in compact, unambiguous, and verifiable specifications, but also new foreign exchange scenarios of serious business significance are discovered in the process.


Workflow Mining: A Framework and Algorithms

Aubrey Rembert
University of Colorado at Boulder

Workflow systems are model-driven software systems that semi-automate the coordination of resources, activities, agents, and goals in business processes. However, there tends to be a gap between the organizational dependencies captured in the workflow models that drive workflow systems and the actual organizational dependencies that exist. This gap exists primarily because the people who develop the workflow models are not the same people who are involved in executing the business processes these workflow models are designed to support. To mitigate this gap, I have developed workflow mining algorithms that learn workflow models from a log of activity instances captured during the execution of a business process. My approach is founded on a workflow mining framework that considers business processes to be multidimensional and multi-perspective.

In this talk, I will present my workflow mining framework and a workflow mining algorithm that learns the control-flow of a business process from a log of activity instances. I will then present what I think is the future of workflow mining research.


Top Problems of the Internet and How to Help Solve Them

KC Claffy, PhD
San Diego Supercomputer Center

Drawing on 15 years of investment in analyzing Internet data (workload, topology, routing, performance), Dr. Claffy describes her vision of the current state of the Internet and the most acute problems it faces now and in the future. She will report on an ongoing project to taxonomize what we have learned, and what we have failed to learn, from empirical study of the Internet, and how to apply these lessons to future Internet research and development. She will cover the historical background and context of Internet R&D, and list the most pervasive weaknesses in the current infrastructure, including unsupported assumptions and neglected research directions. She will also argue that technological and economic forces will inevitably demand an interdisciplinary re-evaluation of the fundamental aspects of Internet architecture, engineering, and governance, and that academics should play a key role in shaping the future of the Internet. Audience participation will be encouraged!

 
