Talk
Depth First Algorithms and Inferencing for AFD Mining
Jeremy Engle, Ph.D. Candidate
Center for Data and Search Informatics,
School of Informatics and Computing,
Indiana University
Abstract: Approximate Functional Dependencies (AFDs) are rules that are "almost" Functional Dependencies, where "almost" is determined by an approximation measure. The AFD mining problem lies in the overlap of the database and data mining fields, and its search space consists of powerset lattices. The work we are presenting focuses first on MoLS, a customizable framework we developed that modularizes an algorithm at every level and provides functionality to improve the performance of algorithms. The second focus of our work is the prototyping and evaluation of algorithms and their ability to improve performance. Finally, we develop statistics which demonstrate precisely how algorithms accomplish improved performance.
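The abstract does not say which approximation measure is used. As a minimal sketch only, the following assumes the commonly used g3 error: the smallest fraction of rows that would have to be removed for the dependency to hold exactly. The function name, the row format, and the sample data are hypothetical and are not taken from MoLS.

```python
from collections import Counter, defaultdict


def g3_error(rows, lhs, rhs):
    """Fraction of rows that must be removed so that lhs -> rhs holds exactly."""
    if not rows:
        return 0.0
    # For each distinct left-hand-side value, count the right-hand-side values seen.
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        groups[key][row[rhs]] += 1
    # Within each group, only the rows carrying the most frequent value can be kept.
    kept = sum(max(counter.values()) for counter in groups.values())
    return 1.0 - kept / len(rows)


# Hypothetical data: one misspelled city breaks the exact dependency zip -> city.
rows = [
    {"zip": "47401", "city": "Bloomington"},
    {"zip": "47401", "city": "Bloomington"},
    {"zip": "47401", "city": "Bloomingtn"},
    {"zip": "46202", "city": "Indianapolis"},
]
print(g3_error(rows, ["zip"], "city"))  # 0.25: removing one row of four repairs it
```

Under this assumed measure, a rule would count as an AFD when its error falls at or below a chosen threshold. A miner would evaluate such a measure for candidate left-hand sides drawn from the powerset lattice of attributes.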
Biography: Jeremy Engle is a 7th year PhD student in the DB lab of the CS department. His advisor is Dr. Ed Robertson. He has publications on this topic in BNCOD 2008 and IDEAS 2009. His research interests include databases, algorithms, data mining, machine learning, and data integration. He expects to graduate in Spring of 2010.
Metadata and Preservation in Geosciences: Issues at Scale
Dr. Beth Plale
Director, Data to Insight Center, Pervasive Technologies Institute,
Director, Center for Data and Search Informatics,
Associate Professor of Computer Science, School of Informatics and Computing,
Indiana University
Abstract: As the environment and climate have increasing impact on the economic sustainability of our country, scientists are being compelled through their own interest or through directives from funding agencies to share the results of their research, which often take the form of collections of data. Sharing collections, particularly at scale where the volumes are large, introduces numerous challenges that we discuss in the context of our research and additional challenges that we point out as unaddressed problems. We discuss in particular provenance collection with Karma, a system-independent collection tool; XMC Cat, a metadata catalog that is friendly to application schemas; and the integration of data streams into a workflow composer, XBaya. We conclude with a discussion of the goals of the Data to Insight Center within the Pervasive Technologies Institute, in which the Center for Data and Search Informatics and the Digital Library Program have a role.
Biography: Beth Plale is Director of the Center for Data and Search Informatics within the School of Informatics and Computing, Bloomington, and Director of the Data to Insight Center in the Pervasive Technologies Institute. Plale is a Professor of Computer Science at Indiana University. Prior to joining Indiana University, she was a Postdoc Fellow in the Center for Experimental Research and Computer Systems at Georgia Institute of Technology. Plale's Ph.D. is in computer science from the State University of New York at Binghamton.
Plale is an experimental computer scientist working in the areas of data management and data-driven cyberinfrastructure in an interdisciplinary research setting. In particular, her research interests are in data provenance, metadata catalogs, automated digital curation, workflow systems in e-Science, and complex event processing. Plale is a recipient of the DOE Early Career award and is an ACM Senior Member and IEEE Member.
Resource-limited Computing in Virtual Worlds
Mitja Hmeljak, Ph.D. Candidate (ABD)
School of Informatics and Computing,
Indiana University
Abstract: Virtual worlds, with their accepted conventions and supporting infrastructures, provide useful platforms for social behavioral research: recent experimental techniques have been successfully used in quantitative studies and to observe population trends under predefined conditions. There is also tremendous scope for controlled experiments in virtual worlds, to study how groups of individuals behave under well-defined conditions when undertaking a specified task. This talk presents the major challenges, from a computer science standpoint, in developing applications for such environments, whose computing infrastructure is accessible only indirectly. This recreates resource-limited computing conditions on a distributed grid, with old computational limits presenting themselves in a new guise: processing capabilities, multi-layered communications and real-time user-data tracking are all constrained by the virtual world's scripting API and its related networking protocols.
Biography: Mitja Hmeljak is an ABD PhD candidate in Computer Science. His advisor is Dr. Andrew J. Hanson. His recent work with Dr. Robert L. Goldstone in the Department of Psychological and Brain Sciences includes the design and implementation of a virtual environment infrastructure to support experiments in social behavior in Second Life. His previous experience includes collaborations at the School of Fine Arts on virtual reality art installations, and numerous visualization projects at the UITS Advanced Visualization Lab.
Statistical Considerations in Analyzing Massive Data Sets
Dr. Karen Kafadar
Department of Statistics,
Indiana University
Abstract: The analysis of massive, high-volume data sets stresses usual statistical software systems and requires new ways of drawing inferences beyond the conventional paradigm (optimal estimation of parameters from a hypothesized distribution), since the entire data set often cannot be read into the software system. Internet traffic data and data from high-energy particle physics experiments raise additional challenges: nearly continuous streams of observations from multiple systems or channels that interact and exchange information in nondeterministic ways. Internet data in particular invite cyber attacks, which can spread very rapidly and thus require methods that can rapidly detect potential departures from "typical" behavior. This talk discusses analyses of Internet traffic data and data from high-energy physics experiments. Some open issues in analyzing high-volume data in general are mentioned. (The Internet portion of this talk involved E.J. Wegman, George Mason University; the physics part involved R.L. Jacobsen, UC-Berkeley.)
Biography: Dr. Kafadar is Rudy Professor of Statistics and Physics at Indiana University. She received her B.S. and M.S. degrees from Stanford and her Ph.D. in Statistics from Princeton under John Tukey. Her research focuses on exploratory data analysis, robust methods, characterization of uncertainty in quantitative studies, and analysis of experimental data in the physical, chemical, biological, and engineering sciences. Prior to Indiana University, she was Professor and Chancellor's Scholar in the Departments of Mathematical Sciences and Preventive Medicine & Biometrics at the University of Colorado-Denver; Fellow at the National Cancer Institute (Cancer screening section); and Mathematical Statistician at Hewlett Packard Company (R&D laboratory for RF/Microwave test equipment) and at National Institute of Standards and Technology (where she continues as Guest Faculty Visitor on problems of measurement accuracy, experimental design, and data analysis). Previous engagements include consultancies in industry and government as well as visiting appointments at University of Bath, Virginia Tech, and Iowa State University. She has served on previous NRC committees and also on the editorial review boards for several professional journals as Editor or Associate Editor and on the governing boards for the American Statistical Association, the Institute of Mathematical Statistics, and the International Statistical Institute. She is an Elected Fellow of the American Statistical Association and the International Statistical Institute, and has authored over 80 journal articles and book chapters, and has advised numerous M.S. and Ph.D. students.
Using Dataflow Models to Validate Enterprise Distributed Real-time and Embedded System Quality-of-Service Properties
Dr. James H. Hill
Department of Computer and Information Science,
Indiana University/Purdue University at Indianapolis
Abstract: Enterprise distributed real-time and embedded (DRE) systems, such as large-scale traffic management systems, command and control systems, and shipboard computing environments, are steadily increasing in size (e.g., lines of source code and number of hosts in the target environment) and complexity (e.g., application scenarios). Because of the steady increase in size and complexity of such systems, it is becoming more critical to validate their quality-of-service (QoS) properties (e.g., latency, throughput, and scalability) continuously throughout the software lifecycle via a process called continuous system integration testing. This is due in part to the serialized-phasing problem, where infrastructure- and application-level services are developed and validated in different phases of the software lifecycle, but fail to meet QoS requirements when integrated and deployed together on the target architecture. System execution modeling (SEM) tools help distributed system developers and testers overcome the serialized-phasing development problem by enabling them to conduct system integration tests on the target architecture continuously throughout the software lifecycle. QoS validation techniques, however, are usually limited to the SEM tool's capabilities, which can hinder evaluation capabilities.
This talk presents and evaluates techniques for validating enterprise DRE system QoS properties irrespective of the SEM tool of choice. First, this talk discusses the existing state of SEM tools and challenges associated with validating QoS properties in enterprise DRE systems. Second, this talk describes how dataflow models, which are models that show how data moves throughout the system, and system execution traces offer an independent approach to validating enterprise DRE QoS properties. This talk concludes by discussing future research directions in applying dataflow models to validate QoS properties. The techniques presented in this talk have been realized in an open-source tool called CUTS and validated in the context of representative DRE systems from production projects in several mission-critical domains.
Biography: Dr. James H. Hill is an Assistant Professor in the Department of Computer and Information Science at Indiana University/Purdue University at Indianapolis. He received his Ph.D. and M.S. in Computer Science from Vanderbilt University, and B.S. in Computer Science from Morehouse College. Dr. Hill’s research focuses on techniques for validating enterprise distributed real-time and embedded system quality-of-service properties continuously throughout the software lifecycle on the target architecture, as opposed to waiting until complete system integration time. His research in this area has led to the development of an open-source research-based system execution tool called the Component Workload Emulator (CoWorkEr) Utilization Test Suite (CUTS), which has been used in academic- and industry-related projects/settings throughout the world, including mission-critical systems at the Australian Defense Science and Technology Organization, DARPA, General Electric Research, Northrop Grumman, Raytheon, and Lockheed Martin.
Making Metadata Happen: Engaging Data Producers in Archiving and Reuse of Scientific Data
Dr. Margaret Hedstrom
School of Information,
University of Michigan
Abstract: Long-term preservation of data is predicated on assumptions of cooperation between data producers and the repositories that assume responsibility for long-term data management. New policies that mandate public access to publicly-funded data also assume that researchers will make their data available for dissemination and reuse. This presentation will explore barriers to cooperation between data producers and repositories, assess new approaches to engaging data producers in metadata production and management, and discuss the implications of these approaches for archiving strategies.
Biography: Margaret Hedstrom is an Associate Professor at the School of Information, University of Michigan where she teaches in the areas of archives, electronic records management, and digital preservation. Her current research investigates incentives for producers to create “archive-ready” data. She was project director for the CAMiLEON Project, an international research project that investigated the feasibility of emulation as a digital preservation strategy. Her current research interests include digital preservation strategies, sharing and reuse of scientific data, and the role of archives in shaping collective memory. She has served on the National Digital Strategy Advisory Board to the Library of Congress, and the Advisory Committee on Historical Diplomatic Documentation, U.S. Department of State, and on the ACLS Commission on Cyber-Infrastructure for the Humanities and Social Sciences. Hedstrom is a fellow of the Society of American Archivists and recipient of a Distinguished Scholarly Achievement Award from the University of Michigan for her work with archives and cultural heritage preservation in South Africa.
Policy-based Data Management
Dr. Reagan Moore
School of Information and Library Science,
University of North Carolina at Chapel Hill
Abstract: Scientific data proceed through a data life cycle. Science researchers typically generate data that are managed within a local project (collection). They may then share data with other researchers (data grid), publish their data for use by the discipline (digital library), and create reference collections against which future research is compared (persistent archive). Each stage of the data life cycle is governed by a social consensus that determines the arrangement, retention, access, description, and manipulation mechanisms that are applied to the collection. The social consensus can be characterized as the management policies and procedures that enforce the desired collection properties. The iRODS (integrated Rule-Oriented Data System) supports all stages of the data life cycle by mapping management policies to computer-actionable rules, by mapping management procedures to computer-executable workflows assembled from well-defined micro-services, and by verifying assessment criteria through queries on persistent state information. The iRODS data grid will be presented, along with current use of the iRODS technology in multiple data management applications.
Biography: Reagan Moore is a Professor in the School of Information and Library Science at the University of North Carolina at Chapel Hill, Chief Scientist for Data Intensive Cyber Environments at the Renaissance Computing Institute, and Director of the Data Intensive Cyber Environments Center at UNC. He coordinates research efforts in development of data grids, digital libraries, and preservation environments. Developed software systems include the Storage Resource Broker data grid and the integrated Rule-Oriented Data System. Supported projects include the National Archives and Records Administration Transcontinental Persistent Archive Prototype, and science data grids for seismology, oceanography, climate, high-energy physics, astronomy, and bio-informatics. An ongoing research interest is use of data grid technology to automate execution of management policies and validate trustworthiness of repositories.
Moore’s previous roles include Director of the DICE group at the San Diego Supercomputer Center and Manager of production services at SDSC. He previously worked as a computational plasma physicist at General Atomics on equilibrium and stability of toroidal fusion devices. He has a Ph.D. in plasma physics from the University of California, San Diego (1978) and a B.S. in physics from the California Institute of Technology (1967).
Semantic Web @ KEG
Dr. Juanzi Li, Tsinghua University
Abstract: KEG, the Knowledge Engineering Group in the Department of Computer Science and Technology at Tsinghua University, was founded in 1996. Its original research direction was knowledge engineering on the Internet. Currently, the research areas at KEG are classified into Semantic Web and Semantic Web Services, and Text and Social Network Mining. In this talk, I will introduce our research related to the Semantic Web. The talk consists of three parts: a brief introduction to KEG, key technologies in the Semantic Web, and their applications and our future research. This talk will focus on the methods of semantic annotation, ontology matching and semantic search.
Biography: Dr. Juanzi Li is a professor at Tsinghua University. She obtained her Ph.D. degree from Tsinghua University in 2000 and finished her research work as a postdoctoral researcher in the Department of Electronic Engineering at Tsinghua in 2001. Her main research interests include Semantic Web and Web Services, and Text and Social Network Mining. She leads and takes part in many projects supported by the Natural Science Foundation of China, the national basic science research program and international cooperation projects (60443002, 90604025, 60703059, 2007CB31080). She has published about 90 papers in many international journals and conferences such as WWW, SIGIR, SIGMOD, SIGKDD, ISWC, CIKM, JoDS and JoWS. She was the local organization chair of the 2006 Asian Semantic Web Conference and serves as a PC member of many important international conferences such as WWW, ISWC, and ICSW. For more information, please visit her homepage: http://keg.cs.tsinghua.edu.cn/persons/ljz
A mobile health application for a chronically ill, low-literacy population
Kay Connelly
School of Informatics and Computing,
Indiana University
Abstract: In this presentation, we describe the design of the Dietary Intake Monitoring Application (DIMA[1]), a mobile, electronic food diary for low-literacy patients with stage 5 Chronic Kidney Disease (CKD). CKD patients do not have functioning kidneys, requiring them to undergo hemodialysis three times a week. Because excess fluids and toxins normally removed continuously by the kidneys are only removed every other day with dialysis, CKD patients have an extremely restricted prescribed diet. For example, a typical patient must limit their fluid to 1 liter a day, and their nutrients to 2 g of sodium. Failure to adhere to the diet can lead to a host of complications, including exacerbated hypertension, pulmonary edema, and even death. However, this population often lacks the computational and memory skills necessary to track their fluid and nutrient intake on their own, with as many as 80% of patients not restricting their fluid and 67% not limiting their nutrients. Further, this patient group is particularly difficult to design for as they have varying literacy skills, prohibiting text-based input and output. In this presentation, we describe our approach to designing for a chronically ill patient population that is not tech-savvy and has educational barriers to using technology.
[1] Funded by the National Institute of Biomedical Imaging and Bioengineering (NIBIB): Award #1 R21 EB007083-01A1, titled Self-Monitoring of Dietary and Fluid Intake Using a PDA.
Biography: Dr. Kay Connelly is an Associate Professor in the School of Informatics at Indiana University. Her research interests are in the intersection of mobile and pervasive computing and healthcare. In particular, she is interested in issues that influence user acceptance of health technologies, such as privacy, integration into one's lifestyle, convenience, and utility. Dr. Connelly works with a variety of patient groups, including very sick populations who need help in managing their disease, healthy populations interested in preventative care, and senior citizens looking to remain in their homes for as long as possible. Dr. Connelly is the Senior Associate Director for the Center for Applied Cybersecurity Research, and has recently taken the challenge to start a new Health Informatics program at Indiana University. Dr. Connelly received a BS in Computer Science and Mathematics from Indiana University (1995), and an MS (1999) and Ph.D. (2003) in Computer Science from the University of Illinois.
InPhO @ Work
Dr. Colin Allen
Department of History and Philosophy of Science
and Program in Cognitive Science
and Center for the Integrative Study of Animal Behavior
College of Arts and Sciences,
Indiana University
Abstract: A wealth of humanities resources is available on the World Wide Web. Access to these resources remains hampered, however, by the absence of sophisticated tools for aggregating, searching, and navigating the various digital collections. As the resources grow, we must also improve our ability to represent their contents in meaningful ways accessible to novices, experts, and machines. Due to the increased scale and dynamic nature of digital humanities resources, traditional methods of gathering and organizing metacontent are too resource-intensive and inefficient to be practicable. More sophisticated techniques of generating metacontent from large, asynchronously updated corpora are required. The NEH-funded Indiana Philosophy Ontology (InPhO) project combines human expertise and software analysis to generate a “dynamic ontology” for the domain of philosophy. What can one do with the InPhO? I will describe and demo some current and future applications that we are developing over the next two years with funding from the National Endowment for the Humanities.
Biography: Colin Allen holds a B.A. in philosophy from University College London and a Ph.D. in philosophy from the University of California at Los Angeles, where he also did graduate work in computer science (artificial intelligence). He is Professor of History & Philosophy of Science and Professor of Cognitive Science in the College of Arts and Sciences at Indiana University, Bloomington, where he has been a faculty member since 2004. He also holds an adjunct appointment in the Department of Philosophy, and is a faculty member of IU's Center for the Integrative Study of Animal Behavior. His main area of research is on the philosophical foundations of cognitive science, particularly with respect to nonhuman animals, but he also pursues topics in artificial intelligence, and his most recent book is Moral Machines: Teaching Robots Right from Wrong (Oxford University Press 2009), coauthored with Wendell Wallach. Since 1998 he has been consulting and programming for The Stanford Encyclopedia of Philosophy and is Associate Editor of the encyclopedia. Allen is currently director of the Indiana Philosophy Ontology project (InPhO), which in 2007 was awarded a Digital Humanities startup grant from the National Endowment for the Humanities and in 2009 received a $400,000 grant from the NEH Division of Preservation and Access. Allen was President of the Society for Philosophy and Psychology in 2008-2009. In 2008 he was awarded the Faculty Mentor of the Year award by the Indiana University Graduate and Professional Students Organization.
Data-Intensive Computing: From Clouds to GPGPUs
Dr. Gagan Agrawal
Department of Computer Science and Engineering,
Ohio State University
Abstract: While the high-productivity aspect of map-reduce has been well accepted, it is not clear if the API results in efficient implementations for different sub-classes of data-intensive applications. We will describe a system, MATE (Map-reduce with an AlternaTE API), that provides a high-level, but distinct API. Particularly, our API includes a programmer-managed reduction object, which results in lower memory requirements at runtime for many data-intensive applications. MATE implements this API on top of the Phoenix system, a multi-core map-reduce implementation from Stanford. Our results show the performance advantage of using this new API.
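To illustrate why a reduction object can lower memory requirements, the following is a minimal single-process sketch in Python. It is not the MATE or Phoenix API: both function names are hypothetical, and word counting stands in for a data-intensive application. The first function buffers every intermediate (key, value) pair before reducing, while the second folds each element directly into one accumulator.

```python
from collections import defaultdict


def mapreduce_count(chunks):
    """Map-reduce style: emit (word, 1) pairs, group them, then reduce."""
    intermediate = []  # grows with the total number of words processed
    for chunk in chunks:
        for word in chunk.split():
            intermediate.append((word, 1))
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}


def reduction_object_count(chunks):
    """Reduction-object style: update a single accumulator as data is read."""
    reduction_object = defaultdict(int)  # grows only with the number of distinct words
    for chunk in chunks:
        for word in chunk.split():
            reduction_object[word] += 1
    return dict(reduction_object)
```

Both functions return the same counts. The difference is that the second never materializes the intermediate pairs, so its memory use depends on the number of distinct keys and not on the input size.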
Besides programmability and ease of parallelization, fault-tolerance has also been an important attribute of map-reduce in its Hadoop implementation, where it has been implemented by replicating data in the file system. In this talk, we show how more efficient fault-tolerance support can be developed using the alternate API in the MATE system. Particularly, as the reduction object represents the state of the computation on a node, we can periodically cache the reduction object from every node at another location and use it to support failure-recovery. Our results show that the overheads of our scheme are extremely low, and we outperform Hadoop in both the absence and presence of failures.
Based on our work on APIs for data-intensive computing, we have also developed a system for mapping this class of applications onto GPGPUs. We will show that we can scale several data mining applications using our code generation tool.
Biography: Gagan Agrawal is a professor of computer science at the Ohio State University. He received his B. Tech degree from IIT Kanpur, and MS and PhD degrees from University of Maryland, College Park. His research interests include high-performance and data-intensive computing, data mining, and cloud computing.
Visualizing the Digital Trail: Privacy, Design, and the Adoption of Technologies for Encouraging Healthy Behaviors
Dr. Kalpana Shankar
Center for Data and Search Informatics,
School of Informatics and Computing,
Indiana University
Abstract: Research has found that most individuals do not fully understand the privacy implications of technologies that store and manipulate personal data. To some extent, this is because data and security mechanisms are invisible to the average user. The goal of this project, which I am conducting with Kay Connelly in the School of Informatics and Computing, is to explore two interrelated themes: how visualizations of everyday health data can impact user attitudes about technology adoption and the privacy implications of such applications.
Our previous research suggests that users in different age cohorts have different understandings of privacy and security, think about health behaviors differently, and use technology in different ways. As such, we are targeting three distinct groups: college-aged students (18-25), middle-aged adults (30-45), and senior citizens (65-80). In this talk, I will present our preliminary results and implications for future research.
Biography: Kalpana Shankar is an assistant professor in the School of Informatics and Computing at Indiana University-Bloomington and an adjunct professor in the School of Library and Information Science. Her research projects focus on the uses of data and information (digital and otherwise) in diverse communities of practice. In addition to work on data sharing and use in the natural sciences, she is a co-PI on ETHOS, an NSF-sponsored project to investigate aging and home-based technology.



