Abstracts - Fall 2008


Using Metadata to Find Relevant Data in the e-Science Haystack

Scott Jensen

Indiana University, Ph.D. Candidate

As scientists increasingly have access to powerful computational grids through e-science portals, huge volumes of valuable scientific data are being generated. Being able to reuse this data for validation of experimental results and further research is crucial to the advancement of science - resulting in an increasing need for accurate and detailed metadata. Scientific communities use detailed XML schemas to describe their data, and our research looks at the characteristics of these metadata schemas - how they differ from general XML and how these differences can be exploited to address the particular requirements of scientists and enable scientific communities to easily catalog and search their data using the schema of their domain.


The Web is Smaller than it Seems

Craig Shue

Indiana University, Ph.D. Candidate

The Web has grown beyond anyone's imagination. While significant research has been devoted to understanding aspects of the Web from the perspective of the documents that comprise it, we have little data on the relationships among the servers that comprise the Web. In this talk, we explore the extent to which Web servers are co-located with other Web servers in the Internet. In terms of server location, we find that the Web is, surprisingly, smaller than it seems. This has important implications for the availability of Web servers in case of DoS attacks and blocklisting.


Beyond Reproducibility: Using Provenance to Streamline Data Exploration through Workflows

Dr. Juliana Freire

University of Utah

To analyze and understand the growing wealth of scientific data, complex computational processes need to be assembled, often requiring the combination of loosely-coupled resources, specialized libraries, distributed computing infrastructure, and Web services. Workflow (and workflow-based) systems have recently emerged as an alternative to ad-hoc approaches to constructing computational tasks widely used in the scientific community. But although the benefits of using workflow systems are well known, the fact that workflows are hard to create and maintain has been a major barrier to wider adoption of this technology in the scientific domain. This is especially true for exploratory analysis tasks, where the path from data to insight requires a laborious, trial-and-error process, where users successively assemble, modify, and execute multiple workflows.

We advocate a data-centric view of workflow-based computational processes, where provenance of exploratory processes is captured through the workflow specifications, information about their evolution and impact on the data they manipulate. In this talk, we discuss how this detailed provenance information can be used to provide intuitive interfaces and tools that support collaborative analysis of scientific data. In particular, we will present a query-by-example interface for querying workflows whereby users query workflows through the same familiar interface they use to create them; a mechanism for semi-automatically creating and refining workflows by analogy, without requiring users to directly manipulate or edit the workflow specifications; and a recommendation system that guides users through the workflow design process by automatically suggesting completions based on a database of previously created workflows. We will also demonstrate how these tools have been implemented and can be used in VisTrails (http://www.vistrails.org), an open-source provenance management system.

Joint work with Claudio T. Silva, Erik Anderson, Steven P. Callahan, Tommy Ellkvist, David Koop, Lauro Lins, Emanuele Santos, Carlos E. Scheidegger and Huy T. Vo.


Databases 'R 'Us: A Turning Point in Database Research?

Dr. Antonio Badia

University of Louisville

There is considerable talk in the database research community that we are at a turning point, and that a new agenda should refocus efforts on non-traditional areas. But what are these new areas? What role should database research play in them? And, more importantly, if there are areas out there in which databases should play an important role, how did we get to this point, where databases are not a player? In this talk we present our viewpoint on how we got into this situation and some of the things we should be doing to get out of it. We argue that database research has taken too narrow a view of the phenomenon of information flow. We make the ideas concrete by presenting some new research projects. All such projects involve the basic idea of collaboration: collaboration among the users of a database (who form, implicitly or explicitly, a community, and can therefore be analyzed with tools recently developed in Social Network Theory and related fields), and collaboration between users and the database. Users are no longer passive recipients of whatever data the database offers them; they should be able to annotate the database (influencing not only content but also structure) or to direct the way the data is treated (creating workflows that control information processing). Clearly, some research in these areas already exists, but it is now taking center stage and reaching further, as applications like e-science come to dominate the landscape and demand that databases relinquish the absolute control of the data that they have enjoyed so far.


A Generative Model of Text Documents Capturing Bursts and Similarity

Dr. Filippo Menczer

Indiana University

Various universal regularities characterize text from different domains and languages. Most notable are Zipf's law on the distribution of word frequencies, Heaps' law on vocabulary size, and the bursty nature of topical words. However, no single model of text generation explains how these properties emerge. Furthermore, no model exists to interpret the empirical distribution of similarity between documents. Here we present and validate a generative model that simultaneously produces all of the statistical features of textual corpora. Our results point to frequency ranking as a key mechanism for understanding language generation. Understanding the emergence of structure and topicality in written text can shed light on the collective cognitive processes we use to organize and store information, and find broad applications in literature analysis, Web mining, and social media. Joint work with Mariangeles Serrano and Alessandro Flammini.
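
As a rough illustration of the two laws named above (and not of the generative model presented in the talk), the sketch below measures the empirical curves on a toy token list. Zipf's law says that the frequency of the word at rank r falls off roughly as a power of r, and Heaps' law says that the number of distinct words grows sublinearly with the number of tokens read. The function names and the sample sentence are assumptions made for this example.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return word counts ordered from most to least frequent (rank 1 first)."""
    return [count for _, count in Counter(tokens).most_common()]


def vocabulary_growth(tokens):
    """Return the number of distinct words seen after reading each token."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


# Hypothetical toy corpus; a real study would use a large document collection.
tokens = "the cat saw the dog and the dog saw the cat".split()
freqs = rank_frequency(tokens)      # Zipf: frequency as a function of rank
growth = vocabulary_growth(tokens)  # Heaps: vocabulary size vs. tokens read
```

On a real corpus, plotting freqs against rank and growth against token position on log-log axes is the usual way to check whether the two power-law regularities hold.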


Leveraging Provenance for Case-Based Support of e-Science Experimentation

Joey Morwick

Indiana University, Ph.D. Candidate

The emerging popularity of in silico experimentation within the scientific community has brought with it not only an abundance of resultant data, but also provenance -- metadata describing the pedigree of the results. As a knowledge source, provenance can be leveraged to provide automated assistance to scientists who need technical help, or to serve as a useful information source for planning which grid resources to employ in their experiments. Through several experiments examining a large collection of existing experimental workflows, we have found that case-based methods of generating suggestions are effective in providing quality assistance.


The Avalanche Dynamics of Popularity

Jacob Ratkiewicz

Indiana University, Ph.D. Candidate

Traditionally, information and opinions were filtered and amplified by two classes of trusted intermediaries: institutional media and our social networks of friends and family. The advent of social media is disrupting these mechanisms by fostering Web-mediated brokers such as blogs, wikis, folksonomies, and search engines, through which anyone can easily publish and promote content online. This "second age of information" is driven more than ever before by the economy of attention. Popularity (the accumulation of attention) is its measure of success; popular sources have formidable power to impact opinions, culture, and policy, as well as profit through online advertising. Yet the dynamical processes that drive popularity in our online world are still unclear and largely unexplored.

Here we provide, for the first time, a quantitative, large-scale, longitudinal analysis of the dynamics of different popularity measures for online content. We analyze the dynamical processes underlying the evolution of two massive model systems, Wikipedia and an entire country's Web space, finding that the temporal and magnitude evolution of popularity measures follows statistical laws typical of critical avalanche processes such as earthquakes and depinning phenomena. Such statistical features hold across measures, systems, and their histories. To make sense of these empirical results, we offer a model that mimics, with a simple random mechanism, the exogenous shift of user attention and the ensuing non-linear perturbations in the popularity ranking of online resources. Remarkably, this stylized model recovers the key features observed in the empirical analysis of the two model systems.


Graph Kernels for Predicting Functionally Important Residues in Proteins

Dr. Predrag Radivojac

Indiana University

In this talk I will present our machine learning methodology for prediction of functionally important residues in protein structures. Protein structures are first converted into graphs and a kernel-based method is proposed for functional inference. I will show that our inference method from protein structures is superior to inference from protein sequences only. It also generalizes some previous bioinformatics methods and, more importantly, the framework is not limited to bioinformatics. Finally, I will discuss the biological importance of this type of inference. In particular, I will show that mutations in cancer are frequently characterized by both gain and loss of functional residues and how computational methods in general can be used to create hypotheses on the molecular basis of disease.


The View-constraint Duality in Database Systems, Software Engineering, and Systems Engineering

Dr. Edward Robertson

Indiana University

In database systems, software engineering, and systems engineering, the concepts of constraints and views are commonly and effectively used. Considered distinct, they stand as well-established notions in each domain’s body of knowledge. The focus of this paper is to explore the duality between views and constraints in these domains and investigate the efficacy of this duality in enabling more effective model interoperability. We provide empirical evidence for the duality and demonstrate cases where the duality is useful for constraint specification across modeling paradigms as commonly occurs across multiple organizations.


BayeShield: An Integrated Approach to Anti-Phishing

Dr. Eunjin Jung

University of Iowa

Identity theft is one of the fastest growing crimes in the nation, and phishing has become a primary vector for identity theft. In this talk, we present BayeShield, a Bayesian Anti-Phishing Toolbar designed to help users identify phishing websites. We describe the development process of the anti-phishing engine as well as the iterative, user-centered design principles behind our novel, conversational anti-phishing UI. Experimental results show that our toolbar effectively detects phishing sites without incurring a noticeable page delay, and an empirical study finds that BayeShield outperforms Firefox 2.0. When combined with a blacklist, BayeShield can detect more than 98% of phishing websites with a very low false positive rate of 3%, and it prevented all participants in our study from falling for a phishing attack delivered via email. We evaluate BayeShield’s usability and obtain positive results, including high user satisfaction ratings and a high level of engagement, as demonstrated by perceived task durations being lower than actual durations. In addition, we learned which user characteristics affect the likelihood that users will enter information on phishing websites.


Semantic Grounding of Tag Relatedness in Social Bookmarking Systems

Dr. Ciro Cattuto

ISI Foundation, Turin, Italy

The popularity of collaborative tagging systems is prompting research on mining large bodies of social annotations to improve navigation and search and to populate semantic web applications. Several measures of tag similarity have been introduced to support tasks such as synonym detection and discovery of concept hierarchies. These measures of tag similarity, however, often appear to be rather ad hoc, and the underlying assumptions on the notion of similarity are seldom made explicit. In this talk we discuss a systematic characterization and validation of tag similarity in terms of formal representations of knowledge. Using data from the social bookmarking system delicious.com, we investigate several measures of tag similarity. We provide a semantic grounding by mapping pairs of similar tags in the folksonomy to pairs of synsets in WordNet, where we use validated measures of semantic distance to characterize the semantic relation between the mapped terms. This exposes important features of the investigated measures of tag similarity, and indicates which of them are better suited in the context of a given application.
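
The abstract does not say which similarity measures are studied. As one plausible example of the kind of measure in question, the sketch below computes cosine similarity between tags, representing each tag by the counts of the other tags it co-occurs with. The choice of measure, the function names, and the sample posts are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def tag_context_vectors(posts):
    """Map each tag to a Counter of the tags it co-occurs with across posts."""
    vectors = {}
    for tags in posts:
        unique = set(tags)
        for tag in unique:
            vec = vectors.setdefault(tag, Counter())
            for other in unique:
                if other != tag:
                    vec[other] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical bookmarks, each reduced to its set of tags.
posts = [
    {"python", "programming", "tutorial"},
    {"java", "programming", "tutorial"},
    {"python", "programming"},
    {"recipes", "cooking"},
]
vectors = tag_context_vectors(posts)
sim_python_java = cosine(vectors["python"], vectors["java"])
sim_python_cooking = cosine(vectors["python"], vectors["cooking"])
```

In this toy data, "python" and "java" share the contexts "programming" and "tutorial" and so score high, while "python" and "cooking" share none. Whether a high score reflects synonymy, a sibling relation, or a broader/narrower relation is the kind of question the semantic grounding described above is meant to answer.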


Logical Properties of Stable Conditional Independence

Mathias Niepert

Indiana University, Ph.D. Candidate

Probabilistic graphical models are successfully employed to model and reason under uncertainty in data mining and other areas of artificial intelligence. However, they have their limitations as they can only model certain special cases. In an attempt to generalize graphical models, we investigate the logical and algorithmic properties of stable conditional independence (CI) as an alternative structural representation of conditional independence information. We utilize recent results concerning a complete axiomatization of stable conditional independence relative to discrete probability measures to derive perfect model properties of stable CI structures. We show that stable CI can be interpreted as a generalization of undirected graphical models and establish a connection between sets of stable CI statements and propositional formulae in conjunctive normal form. Consequently, we derive that the implication problem for stable CI is coNP-complete. Finally, we show that SAT solvers can be employed to efficiently decide the implication problem and to compute concise, non-redundant representations of stable CI, even for instances involving hundreds of variables.


Trie Indexes for Efficient XML Query Evaluation

Sofia Brenes

Indiana University, Ph.D. Candidate

As the number of applications that rely on XML data increases, so does the need for performing efficient XML query evaluation. A critical part of the solution involves providing new techniques for designing XML indexes and lookup algorithms. We leverage the results of our research on coupling the partitions induced by fragments of XPath algebra and those induced by the structural properties of an XML document to lead the design of the Trie family of XML indexes. We present the rationale behind our approach and detail the structure and algorithms for the Nk, Pk, and Wk Trie indexes; we also present an evaluation, performance results, and plans for future work.
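
The abstract does not describe the internals of the Nk, Pk, and Wk indexes, so the sketch below is not an implementation of any of them. It only illustrates the general idea of a trie over an XML document: each trie node corresponds to a sequence of element labels on a root-to-node path and stores the elements reached by that path, so a simple absolute path query becomes a walk down the trie. The class name, function names, and sample document are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


class PathTrie:
    """Trie keyed on element labels along root-to-node paths."""

    def __init__(self):
        self.children = {}  # label -> child PathTrie
        self.nodes = []     # elements whose label path ends at this trie node

    def insert(self, labels, element):
        node = self
        for label in labels:
            node = node.children.setdefault(label, PathTrie())
        node.nodes.append(element)

    def lookup(self, labels):
        """Return the elements reached by the given label path, or []."""
        node = self
        for label in labels:
            node = node.children.get(label)
            if node is None:
                return []
        return node.nodes


def build_index(root):
    """Index every element of the document under its root-to-node label path."""
    trie = PathTrie()

    def visit(element, path):
        path = path + [element.tag]
        trie.insert(path, element)
        for child in element:
            visit(child, path)

    visit(root, [])
    return trie


# Hypothetical sample document.
doc = ET.fromstring(
    "<lib><book><title>A</title></book>"
    "<book><title>B</title><year>2008</year></book></lib>"
)
index = build_index(doc)
titles = [e.text for e in index.lookup(["lib", "book", "title"])]
```

Here the lookup for the path lib/book/title touches three trie nodes regardless of how many books the document contains, which is the basic appeal of answering path queries from an index rather than by traversing the document.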