Digital data produced by computational science experiments and discovery is growing rapidly and extending to new frontiers as experiment and discovery frameworks gain acceptance and as computing power and storage become cheaper.

As research data collections become more accessible, it becomes increasingly important to address data validity and quality: to record and manage information about where each data object originated, which processes were applied to it, and by whom.

The ability to routinely collect provenance information about the data products generated during the scientific discovery process can have a transformational impact on that process.

Provenance collection is, in essence, a form of automatic metadata generation. When metadata collection is automated and performed at the point where a data product is generated, the resulting information is more accurate and complete, largely because users no longer need to annotate data after the fact.
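
As a rough illustration of capturing metadata at the point of data product generation, the following Python sketch wraps a processing step so that every call records what was run, on which inputs, by whom, and when. It is not the tooling described here; the names `PROVENANCE_LOG`, `record_provenance`, `current_user`, and `scale` are hypothetical, and an in-memory list stands in for a provenance database.

```python
import functools
import getpass
import hashlib
from datetime import datetime, timezone

# Hypothetical in-memory store; a real system would write to a provenance database.
PROVENANCE_LOG = []


def current_user():
    """Return the login name, or 'unknown' if it cannot be determined."""
    try:
        return getpass.getuser()
    except Exception:
        return "unknown"


def record_provenance(func):
    """Wrap a processing step so that each call appends a provenance record."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            # Which process was applied to the data.
            "process": func.__name__,
            # Where the data product came from (textual form of the inputs).
            "inputs": [repr(a) for a in args]
                      + [f"{k}={v!r}" for k, v in sorted(kwargs.items())],
            # Fingerprint of the product, computed from its textual form.
            "output_sha256": hashlib.sha256(repr(result).encode("utf-8")).hexdigest(),
            # By whom, and when.
            "user": current_user(),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        return result
    return wrapper


@record_provenance
def scale(values, factor=1.0):
    """Example processing step: multiply every value by a constant."""
    return [v * factor for v in values]


scaled = scale([1.0, 2.0, 3.0], factor=2.0)
```

Because the record is written by the wrapper rather than by the person running the step, no after-the-fact annotation is needed; the trade-off in this sketch is that inputs are stored only as their textual representation, which would be too coarse for large data sets.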

As digital library solutions for scientific data collections become more common, as current trends indicate, it will be important for archival collections to draw on specialized metadata catalogs built around e-Science discovery, such as the provenance database, for the rich contextual metadata they contain.

We are developing tools for provenance generation and collection and for case-based reasoning. The tools and the collected data are available for download for use by the wider community.