Abstracts - Fall 2008
Using Metadata to Find Relevant Data in the e-Science Haystack
Indiana University, Ph.D. Candidate
As scientists increasingly have access to powerful computational grids through e-science portals, huge volumes of valuable scientific data are being generated. Being able to reuse this data for validation of experimental results and for further research is crucial to the advancement of science, which creates a growing need for accurate and detailed metadata. Scientific communities use detailed XML schemas to describe their data. Our research examines the characteristics of these metadata schemas: how they differ from general XML, and how these differences can be exploited to address the particular requirements of scientists and enable scientific communities to easily catalog and search their data using the schema of their domain.
The Web is Smaller than it Seems
Indiana University, Ph.D. Candidate
The Web has grown beyond anyone's imagination. While significant research has been devoted to understanding aspects of the Web from the perspective of the documents that comprise it, we have little data on the relationships among the servers that comprise the Web. In this talk, we explore the extent to which Web servers are co-located with other Web servers in the Internet. In terms of the location of servers, we find that the Web is surprisingly smaller than it seems. This has important implications for the availability of Web servers in case of DoS attacks and blocklisting.
Beyond Reproducibility: Using Provenance to Streamline Data Exploration through Workflows
University of Utah
To analyze and understand the growing wealth of scientific data, complex computational processes need to be assembled, often requiring the combination of loosely-coupled resources, specialized libraries, distributed computing infrastructure, and Web services. Workflow (and workflow-based) systems have recently emerged as an alternative to the ad-hoc approaches to constructing computational tasks that are widely used in the scientific community. But although the benefits of using workflow systems are well known, the fact that workflows are hard to create and maintain has been a major barrier to wider adoption of this technology in the scientific domain. This is especially true for exploratory analysis tasks, where the path from data to insight requires a laborious, trial-and-error process in which users successively assemble, modify, and execute multiple workflows. We advocate a data-centric view of workflow-based computational processes, where provenance of exploratory processes is captured through the workflow specifications and through information about their evolution and their impact on the data they manipulate. In this talk, we discuss how this detailed provenance information can be used to provide intuitive interfaces and tools that support collaborative analysis of scientific data. In particular, we will present a query-by-example interface for querying workflows, whereby users query workflows through the same familiar interface they use to create them; a mechanism for semi-automatically creating and refining workflows by analogy, without requiring users to directly manipulate or edit the workflow specifications; and a recommendation system that guides users through the workflow design process by automatically suggesting completions based on a database of previously created workflows. We will also demonstrate how these tools have been implemented and can be used in VisTrails (http://www.vistrails.org), an open-source provenance management system. Joint work with Claudio T. Silva, Erik Anderson, Steven P. Callahan, Tommy Ellkvist, David Koop, Lauro Lins, Emanuele Santos, Carlos E. Scheidegger and Huy T. Vo.
Databases 'R 'Us: A Turning Point in Database Research?
University of Louisville
There is considerable talk in the database research community that we are at a turning point, and that a new agenda should refocus efforts on non-traditional areas. But what are these new areas? How should database research play a role in them? And, more importantly, if there are areas out there in which databases should play an important role, how did we get to this point, where databases are not a player? In this talk we present our viewpoint on how we got to this situation and some of the things we should be doing to get out of it. We argue that database research has taken too narrow a view of the phenomenon of information flow. We make the ideas concrete by presenting some new research projects. All such projects involve the basic idea of collaboration. The first kind is collaboration among the users of a database, who form, implicitly or explicitly, a community and can therefore be analyzed with tools developed lately in Social Network Theory and related fields. The second kind is collaboration between users and the database: the users are no longer passive recipients of whatever data the database offers to them, but should have the ability to annotate the database (influencing not only content but also structure) or to direct the way the data is treated (creating workflows that control information processing). Clearly, some research in these areas already exists, but it is now taking center stage and reaching further, as applications like e-science come to dominate the landscape and demand that databases relinquish the absolute control of the data that they have enjoyed so far.
A Generative Model of Text Documents Capturing Bursts and Similarity
Indiana University
Various universal regularities characterize text from different domains and languages. Most notable are Zipf's law on the distribution of word frequencies, Heaps' law on vocabulary size, and the bursty nature of topical words. However, no single model of text generation explains how these properties emerge. Furthermore, no model exists to interpret the empirical distribution of similarity between documents. Here we present and validate a generative model that simultaneously produces all of these statistical features of textual corpora. Our results point to frequency ranking as a key mechanism for understanding language generation. Understanding the emergence of structure and topicality in written text can shed light on the collective cognitive processes we use to organize and store information, and can find broad applications in literature analysis, Web mining, and social media. Joint work with Mariangeles Serrano and Alessandro Flammini.
Leveraging Provenance for Case-Based Support of e-Science Experimentation
Indiana University, Ph.D. Candidate
The emerging popularity of in silico experimentation within the scientific community has brought with it not only an abundance of resultant data, but also provenance: metadata describing the pedigree of the results. As a knowledge source, provenance can be leveraged to provide automated assistance to scientists who need technical help, or who need to plan which grid resources to employ in their experiments. Through several experiments examining a large collection of existing experimental workflows, we have found that case-based methods of generating suggestions are effective in providing quality assistance.
The Avalanche Dynamics of Popularity
Indiana University, Ph.D. Candidate
Traditionally, information and opinions were filtered and amplified by two classes of trusted intermediaries: institutional media and our social networks of friends and family. The advent of social media is disrupting these mechanisms by fostering Web-mediated brokers such as blogs, wikis, folksonomies, and search engines, through which anyone can easily publish and promote content online. This "second age of information" is driven more than ever before by the economy of attention. Popularity (the accumulation of attention) is its measure of success; popular sources have formidable power to impact opinions, culture, and policy, as well as to profit through online advertising. Yet the dynamical processes that drive popularity in our online world are still unclear and largely unexplored. Here we provide, for the first time, a quantitative, large-scale, longitudinal analysis of the dynamics of different popularity measures for online content. We analyze the dynamical processes underlying the evolution of two massive model systems, Wikipedia and an entire country's Web space, finding that the temporal and magnitude evolution of popularity measures follows statistical laws typical of critical avalanche processes such as earthquakes and depinning phenomena. Such statistical features hold across measures, systems, and their histories. To make sense of these empirical results, we offer a model that mimics, with a simple random mechanism, the exogenous shift of user attention and the ensuing non-linear perturbations in the popularity ranking of online resources. Remarkably, this stylized model recovers the key features observed in the empirical analysis of the model systems analyzed here.

