*********
* DRAFT *  10/11/01
*********

REPORT OF THE SHARE COMMITTEE:
Study of Hubble Archive & Reprocessing Enhancements (SHARE)

G. Kriss, M. Dickinson, D. Fraquelli, A. Koekemoer, H. Bushouse, D. Swade, M. Donahue, M. Giavalisco, T. Keyes, G. Meurer, P. Padovani, W. Sparks

==============================================================================

INTRODUCTION
------------

The calibration of HST data is one of the primary areas where the STScI adds value to the Hubble science program. The Hubble Archive has become, in itself, a major resource for the astronomical community. The operational debut of the On-the-fly Reprocessing (OTFR) system will radically change the paradigm that we have applied to the calibration and storage of HST data. This system will open up avenues for the STScI to further enhance the scientific value and impact of the HST data sets stored in the archive. Progress in the field of astronomical surveys and catalogs has been great in the past few years and suggests possibilities for increasing the scientific scope of our archive beyond that originally envisioned when the system was designed. This is an appropriate time for us to evaluate the scientific potential of such enhancements and develop a road map for implementing the most promising of them.

In the past, the requirement to store uniformly calibrated data has generally prevented adoption of algorithms requiring user input, or knowledge of the astronomical scene. The OTFR concept removes those restrictions and could allow selection of algorithms or processing paths by the user.

Our initial pipelines dealt with only one data set at a time. Later, pre-defined associations of data sets were developed to allow processing of related data sets, such as wavelength calibration of astronomical spectra via internal calibration exposures. The OTFR concept allows for post facto definition of associations of data, either permanent or on-the-fly.
This could allow application of simple associations to SIs for which this was not originally available (e.g., WFPC2). This could also allow processing of larger groups of data sets to provide more scientifically valuable products, such as summed data sets or mosaics.

The archive catalog was originally conceived as simply an index into individual observations. Already, its scientific value has been increased by incorporation of preview images, and we are working in the direction of seamless access to data across missions. At this point, it would be technically feasible to extend the Hubble archive to include scientific services such as object catalogs, their generation from HST data sets, and direct cross-references to other catalogs and databases.

Many of these ideas are also under contemplation for the NGST era, and it is worth considering how the archive might smoothly evolve to provide similar services for HST, NGST, and MAST holdings.

On this basis, Rodger Doxsey chartered the SHARE committee to

1. Evaluate and recommend general capabilities, services and enhancements to these systems.

2. Evaluate and recommend specific new scientific services and enhancements to these systems. These should be augmentations with clear and substantial added benefit for the research community.

3. Provide a rough road map or order for the implementation of the recommendations made in items 1 and 2.

4. Recommend a process for encouraging astronomical community participation in the development of such enhancements.

5. Recommend a process for regularly assessing and prioritizing enhancements of this type in the future.
The SHARE study group consists of

    Gerard Kriss (chair)    HST/STIS & NGST
    Daryl Swade             ESS
    Howard Bushouse         ESS/SSG
    Megan Donahue           ACDSD (& FASST Chair)
    Paolo Padovani          ACDSD
    Dorothy Fraquelli       ACDSD
    Tony Keyes              HST/COS
    Mauro Giavalisco        HST/WFC3
    Mark Dickinson          HST/NICMOS
    Anton Koekemoer         HST/WFPC2
    Bill Sparks             HST/ACS
    Gerhard Meurer          JHU/ACS

Our group met in a series of meetings from May through September 2001 to develop ideas for reprocessing enhancements and rank them in order of scientific priority. After next examining the timescales and resources needed to implement each of these ideas, the group intends to develop a suggested road map for implementing several of these ideas, with the order of implementation based on these rankings, the resource requirements, and the fraction of the user community that is likely to benefit from an implementation of a given idea. External input from advisory bodies such as the Space Telescope Users' Committee (STUC) should play an important role in making final decisions on implementation.

POTENTIAL REPROCESSING ENHANCEMENTS
-----------------------------------

We have identified a number of possible enhancements that could be implemented via the OTFR mechanism. We prioritized the full list we developed on the basis of their potential scientific value, with additional weight given to the unique capability that the facilities and knowledge resident at STScI could bring to bear. In this initial study, we cast a broad net to capture as many potential ideas as possible. In some cases our suggestions might be characterized more as "data analysis" than as "data processing". If considered for implementation, we note that such "analysis" enhancements would benefit from a higher level of testing, documentation, and scientific peer review than normally accorded pipeline data processing routines. In order, the current possibilities are

1. Improvements to the accuracy of the World Coordinate System in the headers of calibrated data products.

2.
Combining images to produce a wide-field mosaic or deep stacked image.

3. Identifying and classifying objects in a single image or a combined set of images.

4. Producing a catalog of objects in a single image or a combined set of images.

5. Determining photometric redshifts for objects in a region of sky which is imaged in a number of different bands.

6. Combining spectra from several different observations or from dithered STIS observations.

7. Using the time history of data and calibrations to enhance the reprocessing.

8. Providing better quality information for data sets in the archive.

9. Allowing users to specify customized parameters to be used when running OTFR to tailor the pipeline calibration to their scientific needs.

10. Creating "data cubes" from dithered long-slit spectroscopic observations.

We note that there was a wide dispersion in our rankings of the priority to be given to each of these tasks. Broadly, the items within the top, middle, and bottom thirds of the above list have similar priorities. Below we give more detailed descriptions of each of these topics.

WCS Improvements
================

Description: The primary purpose of this enhancement is to improve the astrometry that is encoded in the WCS keyword information in image headers, by making use of updated measurements and astrometric information about guide stars from data taken with HST.

Science Case: The current astrometric accuracy of HST images is limited by the inherent astrometric uncertainties in the Guide Star Catalog system, and is generally likely to be accurate to no better than ~0.5 - 1 arcsecond. However, there is often a need to obtain much higher astrometric accuracy, in particular when comparing images at different wavebands from different telescopes (radio, optical, X-ray) as well as different images obtained with HST, generally at different times and in unrelated programs (e.g., narrow-band and broad-band images of the same object).
This is required when carrying out source identifications based on multi-wavelength information, when examining color gradients across a given object, or when determining the relative location of morphological features seen in different bands. Ideally it would be desirable to achieve astrometric accuracy to a level comparable to the measurement error of unresolved sources on HST (i.e., << 0.1 arcseconds).

This issue involves two separate, but related, concepts: relative astrometric precision between different HST images, and absolute astrometric accuracy with respect to some other well-defined, high-resolution system (the global VLBI reference frame is one such example). Currently, relative astrometric accuracy between different HST images is generally only achievable in cases where stars or other bright unresolved objects are common to both images. For images that have few or no stars (e.g., narrow-band or UV observations), relative astrometry can still be attempted but becomes less certain, thereby directly impacting the science. Absolute astrometry, for example between HST images and radio data, is generally only possible in images that contain sources unresolved in both radio and optical, or that otherwise display precise one-to-one correspondence in the two bands. Otherwise, if sources are resolved in one or both and display different morphologies, absolute registration becomes uncertain.

Unique STScI Capability: STScI has the ability to update the GSC information, as well as the resources to carry out studies of all the guide stars for which astrometric information may be updated. Furthermore, only STScI has the ability to automatically incorporate the updated information into the archive data processing pipelines, thereby potentially eliminating the need for users to carry out any further refinements on the astrometry.
Drawbacks: The only minor drawback is that some guide stars will have better astrometric information than others, so this capability will produce improvements on a non-uniform basis for different images.

Combining Images
================

Description: This archival capability would allow users to combine images that are offset with respect to one another, creating a single output image (i.e., doing drizzle on-the-fly). It includes relative registration between images, combining images that are offset by relatively small amounts, and also potentially the creation of much larger mosaiced images from pointings that are offset by scales comparable to that of the detector itself.

Science Case: The large majority of long-exposure HST images are split into two or more exposures in order to facilitate the removal of cosmic rays, and furthermore many programs make use of dithering to move the objects around on the detectors, not only alleviating the effect of hot pixels but also in some cases improving the PSF sampling through the use of non-integral pixel offsets. Such techniques are already commonplace with WFPC2 and STIS, and are expected to be the norm for NICMOS and ACS observations. However, the archive currently offers only limited capability for combining separate CR-SPLIT images, and no current capability for combining dithered exposures. Yet many of the steps involved in combining dithered images are repetitive and time-intensive and can potentially be automated. Furthermore, steps that still currently involve human iteration (such as checking image registration) may also be amenable to automation using different parameters for different classes of images (some examples of image categories may include sparse extragalactic fields, dense stellar regions, or extended bright diffuse emission across the field).
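
As a rough illustration of why splitting an exposure helps with cosmic-ray removal, the sketch below combines two equal-length, registered exposures pixel by pixel and rejects a value that is implausibly higher than its counterpart. This is not the algorithm used by any STScI pipeline; the function name combine_crsplit, the simple Poisson-plus-read-noise model, and the default read_noise and nsigma values are all assumptions made for this example.

```python
import math


def combine_crsplit(exp_a, exp_b, read_noise=5.0, nsigma=4.0):
    """Combine two registered, equal-exposure images (lists of rows, in electrons).

    Hypothetical sketch: for each pixel the lower of the two values is taken
    as the reference. If the higher value exceeds it by more than nsigma times
    the expected noise of the difference, the higher value is treated as a
    cosmic-ray hit and the lower value is doubled to keep the total-exposure
    scale. Otherwise the two values are simply summed.
    """
    combined = []
    for row_a, row_b in zip(exp_a, exp_b):
        out_row = []
        for a, b in zip(row_a, row_b):
            low, high = min(a, b), max(a, b)
            # Assumed noise model: Poisson variance of the signal plus
            # read-noise variance, for each of the two exposures.
            sigma = math.sqrt(2.0 * (max(low, 0.0) + read_noise ** 2))
            if high - low > nsigma * sigma:
                out_row.append(2.0 * low)   # reject the suspected cosmic ray
            else:
                out_row.append(a + b)       # both values are consistent
        combined.append(out_row)
    return combined
```

A scheme of this kind only works when the exposures are already aligned; dithered exposures would first need the registration and resampling steps described above, which is where most of the difficulty lies.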
Unique STScI Capability: STScI has the ability to maintain up-to-date information on geometric distortion and image pointing (in the form of jitter files and other telemetry information).

Drawbacks: Although many steps in combining images using "drizzle" can now be automated, there still remains a need for manual intervention/iteration in some cases, and at a minimum there is a need for observers to check that the images have been combined correctly. Furthermore, the parameters for cross-correlation and combination must often be fine-tuned according to the nature of the images themselves (for example, depending upon whether there are many bright stars or only a few faint diffuse objects) and care would need to be taken in generalizing the algorithm to deal with such different types of images in a way that will still provide useful results.

Object Detection and Classification
===================================

Description: This archival capability would allow for the detection and classification of objects on an image, based on a set of pre-defined criteria (possibly selected by the user from one of several alternatives, depending upon the nature of the image).

Science Case: Although several object detection routines exist, their behavior often needs to be fine-tuned by the user. Incorporating this capability into the archive/pipeline would allow standardized behavior according to some set of parameters (perhaps several different sets of parameters, optimized for different classes of images - sparse vs. crowded fields, for example). The advantage of standardized behavior is that it allows well-defined completeness and object detection thresholds to be specified.

Unique STScI Capability: The ability to standardize the behavior of the photometric routine, particularly optimizing it for several different classes of images, is something that is unique to STScI.
Drawbacks: If the object detection parameters are set up such that this technique will yield useful results for a large fraction of images, then it is also likely to extract less information than would be the case if it were optimized for a specific dataset. Furthermore, the same dataset can sometimes be used to create different kinds of catalogs - some catalogs may be of bright point sources, another catalog may be of faint diffuse galaxies in the same image - and the same set of parameters is unlikely to work in both cases, thus the observers would likely need to fine-tune the parameters themselves in such cases and re-run the task manually.

Automated Catalog Generation
============================

Description: This capability would build upon the object-detection technique, by creating catalogs for individual images or datasets that are linked together in some way (for example by their filter selection, or their exposure time, or their location on the sky), and potentially from a number of different observing programs.

Science Case: The aim of this capability would be to allow users to specify parameters such as filter selection, exposure depth, possibly limiting magnitude or range of magnitudes, and thereby create a custom-made catalog containing objects from all images (not necessarily from the same program) that satisfy these criteria. Thus for example one could create magnitude-limited samples of all objects in all images that have some specified minimum exposure depth, from a large number of different programs, thereby creating a well-defined sample that may cover a much larger area than any of the individual programs.

Unique STScI Capability: The ability to store and associate image parameters with the catalogs generated from each image is something that can only be done internally in the STScI database.
Drawbacks: In order for the catalogs to be useful, the behavior of the object detection routines would need to be reasonably well quantified, and furthermore would need to be automatically applicable to a relatively large fraction of datasets. It is not clear how practical this will be to carry out.

Determining Photometric Redshifts from Multicolor Imaging Data
==============================================================

Description: Estimating photometric redshifts from multicolor imaging data.

Science Case: The use of multicolor photometry for estimating redshifts of galaxies has become an increasingly common tool for extragalactic observers. This was spurred in large part by the availability of the WFPC2 Hubble Deep Field images, which provided high quality multicolor optical photometry for thousands of galaxies in a field where extensive spectroscopy was also available to calibrate the photometric redshift methods. Even after an orbit or two, WFPC2 images detect galaxies faint enough that spectroscopy becomes impractically difficult, motivating the desire for color-based redshift estimates. A variety of methods have been employed, and some stand-alone software packages have already been created to compute photometric redshifts from multicolor photometry catalogs. If object catalogs for HST images were to be generated as a data product, then one might imagine feeding these to a photometric redshift estimation code and providing the results as another data product, potentially searchable via the archive.

In practice, however, reliable, general purpose photometric redshift estimates require images through at least three filters, preferably at least four. Even then, especially with optical photometry alone (e.g., from WFPC2 or ACS), the range of redshift over which photometric redshifts can reliably be estimated is limited (e.g., galaxies at 1 < z < 2 generally require both optical and infrared photometry for quality photo-z estimates).
Only a small subset of HST imaging data would be suitable for photometric redshift estimation, and it would be difficult to implement the sort of "quality control" that would result in easy-to-use and reliable results for the general user.

In principle, even with measurements in only two or three bands, a photometric redshift code could be used to provide a likelihood function L(z) which could restrict the range of possible redshifts for a galaxy without necessarily specifying one "preferred" redshift estimate. However, such a product would be more complex to use, interpret, or to search than a straightforward catalog with single data values for each object, and there would be a risk that naive users might use these redshift estimates uncritically without considering the very substantial uncertainties involved, especially for non-HDF-like data.

Unique STScI capability: If object catalogs from HST images were being automatically generated as an STScI data product, STScI would be in a convenient position to then automatically feed multicolor data meeting some particular criteria to a photometric redshift estimator, and to standardize the performance and output product format.

Drawbacks: Relatively few data sets would be suitable for providing good photometric redshift estimates. Therefore, only a small fraction of the archived data would benefit from this effort, or, perhaps worse, inadequate and misleading photometric redshift estimates could be calculated and distributed for a larger body of unsuitable data sets.

Combining Spectral Data
=======================

Description: This enhancement would permit the user to specify data sets from multiple observations of a target and request that they be optimally combined into a single, summed spectrum. Many STIS CCD spectral observations are obtained in a dither pattern to optimize CR rejection, to avoid hot pixels, and to completely sample the spatial domain. These dithered observations could be combined in an OTFR process.
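
One common reading of "optimally combined" is an inverse-variance weighted mean, which the sketch below illustrates for spectra that have already been resampled onto a common wavelength grid. The report does not specify a weighting scheme, so this choice, the function name combine_spectra, and the input layout (one list of fluxes and one list of 1-sigma errors per observation) are assumptions for illustration only.

```python
def combine_spectra(fluxes, errors):
    """Inverse-variance weighted mean of several spectra.

    Hypothetical sketch. `fluxes` and `errors` are lists with one entry per
    observation; each entry is a list of values sampled on the SAME
    wavelength grid. Errors are 1-sigma uncertainties and must be positive.
    Returns (combined_flux, combined_error).
    """
    n_pix = len(fluxes[0])
    combined_flux = []
    combined_err = []
    for i in range(n_pix):
        # Weight each observation by 1 / sigma^2 at this wavelength bin.
        weights = [1.0 / err[i] ** 2 for err in errors]
        wsum = sum(weights)
        weighted = sum(w * flux[i] for w, flux in zip(weights, fluxes))
        combined_flux.append(weighted / wsum)
        # Uncertainty of a weighted mean with inverse-variance weights.
        combined_err.append(wsum ** -0.5)
    return combined_flux, combined_err
```

A sketch like this makes no check for source variability between epochs and assumes the registration and resampling have already been done, so it covers only the final arithmetic step.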
Science Case: Many HST archival spectral observations consist of multiple data sets. In some cases these are merely repeated, unrelated observations, and in others they are the result of a deliberate observing strategy. The combinations produced via OTFR could be multiple observations in a single wavelength region whose combination increases the signal-to-noise ratio, or several observations at different grating settings that could be combined to obtain broader wavelength coverage in the product. Automated combination of dithered STIS CCD spectral observations would automatically produce a better total product for the archive user.

Unique STScI capability: The instrument groups at STScI have the specific knowledge of the instruments and the associated pointing data (WCS information and jitter files) that would be needed to perform this task routinely.

Drawbacks: Blindly combining spectra obtained at different epochs could produce errant results for time-variable objects. Judicious scientific input on the part of a careful observer is generally required to assure the validity of any result. An automated system could circumvent the careful scrutiny usually applied when users combine data sets on their own. In the case of merging dithered long-slit spectral observations, we note that combining spectral images has the same (if not more) problems with alignment and registration that affect combining images. One often must tweak parameters manually in an iterative process to achieve an optimum result.

Processing Data Sets Based on Time History
==========================================

Description: Use the time history of an instrument's observing program and calibration state as an additional factor in reprocessing data.
Science Case: Some aspects of data processing are time-dependent, in the sense that the optimal data processing may depend on characteristics of the instrument which change over time, or which relate in some way to previous or subsequent science exposures in a series.

The HST data processing system already carries out some simple time-dependent procedures when pipelining HST data. E.g., "best" reference files are often selected based on the date on which an observation is taken. This may include "super" or "delta" dark reference files, for example.

Other, more sophisticated examples could be identified. For example, NICMOS data are subject to persistence, in which detector pixels which collect a large number of counts in one exposure continue to "glow" with a count rate that decays fairly gradually with time, producing afterimages in subsequent images. This occurs both due to astronomical sources (e.g., bright stars which leave afterimages in subsequent exposures) and to radiation events, especially after SAA passages when the entire array is heavily bombarded, leaving a spatially mottled pattern which gradually fades throughout the subsequent orbit.

In some cases, it may be possible to track and even correct persistent afterimages. For "astronomical" persistence, bright sources could be identified in one exposure, and those pixels could be flagged (at least) or corrected (at best, if a suitable persistence model could be defined) in subsequent science images. The SIRTF data pipeline will attempt to do this. For post-SAA persistence, in Cycle 11 STScI will begin taking automatically-scheduled "post-SAA darks" in every SAA-impacted orbit. There is hope that software can be developed which will scale and subtract these "darks" from subsequent images to reduce or remove the persistence signal, although this has not yet been generally demonstrated on-orbit. If this is successful, it might be implemented in a pipeline.
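
To make the "scale and subtract" idea concrete, the sketch below estimates a single scale factor for a post-SAA dark by a least-squares fit against the science image and then subtracts the scaled dark. This is only one conceivable approach, not a description of any existing or planned STScI software; the function name scale_and_subtract_dark, the flat pixel-list inputs, and the assumption that the scene is mostly blank sky (so that the persistence pattern dominates the pixel-to-pixel structure) are all hypothetical.

```python
def scale_and_subtract_dark(image, dark):
    """Fit image ~ scale * dark + constant and remove the scaled dark.

    Hypothetical sketch. `image` and `dark` are flat lists of pixel values of
    equal length. The scale factor is the ordinary least-squares slope,
    cov(image, dark) / var(dark). Returns (scale, corrected_image).
    """
    n = len(image)
    mean_img = sum(image) / n
    mean_dark = sum(dark) / n
    cov = sum((i - mean_img) * (d - mean_dark) for i, d in zip(image, dark))
    var = sum((d - mean_dark) ** 2 for d in dark)
    # A dark with no structure carries no persistence pattern to fit.
    scale = cov / var if var > 0 else 0.0
    corrected = [i - scale * d for i, d in zip(image, dark)]
    return scale, corrected
```

In real data, bright astronomical sources would bias such a fit and would need to be masked first, and the persistence signal decays through the orbit, so a single constant scale factor per image is itself a simplifying assumption.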
It is quite likely that there are other examples involving other instruments, where time-dependent processing could improve data quality for many users.

Unique STScI capabilities: In general, time- or history-dependent processing would require some means to search for, link together, and multiply process exposures with a given instrument that were taken over some time frame, regardless of whether they are part of the same HST proposal or not. STScI is in the best position to do this, using direct interfaces between the data archive and the OPUS system.

Enhance Quality Notations for Archived Data Sets
================================================

Description: Data that are obviously unsuitable for use in any scientific investigation or that may require special processing should be flagged as bad, that is, called to the user's attention, early in the observation request process. Examples of unsuitable data include, but are not limited to, the following.

  + NIC data taken within