The Data Reduction Expert Assistant

Glenn E. Miller

Space Telescope Science Institute

3700 San Martin Dr.

Baltimore, MD 21218 USA

miller@stsci.edu

ABSTRACT

Increased access to very large astronomical databases, the use of large format detectors and other developments in observational astronomy have the potential to overwhelm the capacity of most astronomers to analyze data unless new approaches to data reduction are found. This paper reports the initial progress in creating an expert system to assist in the reduction of scientific data. This system, called Draco, takes on much of the mechanics of data reduction, allowing the astronomer to spend more time understanding the physical nature of the data. Draco works in conjunction with existing data analysis systems such as STSDAS/IRAF and is designed to be extensible to new data reduction tasks.

1. Introduction

The task of data reduction presents severe obstacles to an astronomer: The volume of data may require much tedious work that is susceptible to errors (e.g., the flat-fielding and bias correction of a few dozen digital images can take several days' time and it is easy to accidentally apply the wrong calibrations to some of the images). Management of the data reduction process may require tracking tens or hundreds of files through many different steps. Limitations of disk space may constrain the order of the reduction (e.g., there may be room for only a few images on disk at any one time). The quality of each reduction step should be evaluated (e.g., stability of internal calibrations, or number of cosmic ray events). Often the entire reduction process must be repeated several times with improved calibration data or improved reduction algorithms. The chosen data analysis system must be mastered sufficiently by the scientist to correctly perform the reduction.

These are significant problems that inhibit progress by forcing the scientist to expend time and effort on the mechanics of reduction rather than understanding the physical nature of the data. The growing availability of large astronomical databases and increased use of large format detectors threaten to magnify these problems to an overwhelming degree. Other scientific disciplines share this concern, e.g., NASA's Earth Observing System (EOS) will collect many hundreds of megabytes of data each day.

We are developing Draco19, which is an expert system tool for the management and reduction of data.+ Draco builds on the foundation of existing data analysis systems such as STSDAS/IRAF. Draco gathers information about the available data (typically from header information in the data files), develops a plan for data reduction based on a template supplied by the astronomer and translates the plan into explicit reduction commands. An important feature of Draco is its generality and extensibility - new types of data analysis tasks or additional data analysis systems can easily be added without modifying existing software. This work is an extension of a successful prototype system for the calibration of CCD images developed by Johnston10.

Draco's role in the data reduction process is modeled after a human assistant (at the level of an advanced undergraduate or beginning graduate student). With a human assistant, the astronomer describes the reduction process, demonstrates it on some data and notes various steps to be checked during the reduction (e.g., typical number of cosmic ray hits per pixel per second, average variation in flat fields, etc.). Once trained, the human assistant will reliably perform the reductions on new data sets and call attention to any unusual situations (e.g., missing calibration files, abnormally large number of bad pixels, etc.). A human assistant is (usually) able to adapt to simple changes in the reduction process with little or no additional training (e.g., using new calibration data or adjusting parameters within an algorithm).

Our goals are for Draco to accurately perform the reductions according to the description provided by the user, to alert the user to potential problems in the reduction, and to be readily extensible to new types of data reduction and data analysis systems. (The analogy of Draco to a human assistant should not be carried to an extreme. Unlike a human assistant, Draco will not learn from its mistakes nor will it discover new information. Section 2 mentions some programs that have some of these capabilities.) This automation frees the scientist from much of the drudgery in data reduction and should allow more time for the exploration and modelling of data.

This paper is a progress report on our initial work on Draco. Section 2 discusses related work, while Section 3 presents the design and implementation of Draco. Section 4 describes the use of the first version of the software. A final section summarizes our investigation and outlines our future work.

2. Previous Work on Expert Scientific Assistants

Our current investigation ensues from the work of Johnston10 who developed the Data Analysis Assistant (DAA), a prototype system for data reduction. The problem addressed by this work was Charge Coupled Device (CCD) calibration since it is a common data reduction task and it provided a suitable test case for the concept. CCD calibration consisted of four steps:

1. Extraction of a subimage representing valid data

2. Bias subtraction

3. Dark current subtraction

4. Flat fielding

The first two steps depend only on the characteristics of the detector and instrument mode. The last two steps are more involved since dark and flat calibration images are usually taken periodically through an observing run and therefore must be identified, averaged and matched to the appropriate science images.
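The arithmetic behind the four steps can be sketched in a few lines of Python. This is only an illustration of conventional CCD calibration, not the DAA's code: the function names are hypothetical, images are plain nested lists, and the sketch assumes the dark frames are already expressed as counts per second on the extracted subimage and that the flat field is normalized to a mean of one.

```python
from statistics import mean

def extract_subimage(image, rows, cols):
    """Step 1: keep the region of valid data (half-open row and column ranges)."""
    (r0, r1), (c0, c1) = rows, cols
    return [row[c0:c1] for row in image[r0:r1]]

def subtract_bias(image, bias):
    """Step 2: subtract a constant bias level from every pixel."""
    return [[pix - bias for pix in row] for row in image]

def average_frames(frames):
    """Average several calibration frames pixel by pixel."""
    return [[mean(pixels) for pixels in zip(*rows)] for rows in zip(*frames)]

def subtract_dark(image, dark_rate, exposure):
    """Step 3: subtract dark current (assumed counts/second) scaled by exposure."""
    return [[pix - d * exposure for pix, d in zip(row, drow)]
            for row, drow in zip(image, dark_rate)]

def flat_field(image, flat):
    """Step 4: divide by the flat field after normalizing it to a mean of 1."""
    norm = mean(pix for row in flat for pix in row)
    return [[pix / (f / norm) for pix, f in zip(row, frow)]
            for row, frow in zip(image, flat)]

def calibrate(raw, rows, cols, bias, dark_frames, exposure, flat_frames):
    """Apply the four steps in order to one science image."""
    image = extract_subimage(raw, rows, cols)
    image = subtract_bias(image, bias)
    image = subtract_dark(image, average_frames(dark_frames), exposure)
    return flat_field(image, average_frames(flat_frames))
```

Steps 1 and 2 need only detector constants, while steps 3 and 4 need the averaged calibration frames, which is why the matching of calibration images to science images dominates the bookkeeping.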

Information about data and data reduction was organized in three knowledge bases: data, instrument modes and tasks. The data knowledge base described the astronomer's actual data (e.g., darks, flats, science). The instrument knowledge base held information such as the bias value or location of bad pixels. The task knowledge base recorded information about the data reduction process. Tasks were divided into two types: primitive and compound. Primitive tasks were those which could be implemented with a single command (or simple series of commands) in a specific data analysis system. Compound tasks represented the higher level operations.

To use the system, the astronomer first supplied the DAA with a description of all relevant images, i.e., dark, flat-field and science. (This information was available in the image header information, but a means to read headers was not implemented in the DAA.) The DAA generated a plan by using a set of forward-chaining production rules to associate flat fields and dark images with the proper science image, check for missing calibration files and expand compound tasks into primitive operations. Once a plan was complete, the user selected one of the two target languages, STSDAS or MIDAS, and the general plan was converted to an explicit script of image processing commands in the chosen language. The user then executed the script file on an image analysis workstation.
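The planning stage can be illustrated with a small Python sketch. It is not the DAA's implementation: the DAA used forward-chaining production rules in KEE, whereas this sketch uses an ordinary loop. The task names, the dictionary keys and the choice of instrument mode as the matching criterion are all assumptions made for illustration.

```python
# Hypothetical compound task and its expansion into primitive tasks.
COMPOUND_TASKS = {
    "calibrate-ccd": ["extract-subimage", "subtract-bias",
                      "subtract-dark", "flat-field"],
}

def plan(images, task="calibrate-ccd"):
    """Return (steps, problems) for every science image in `images`.

    Each image is a dict with the hypothetical keys 'name', 'kind'
    ('science', 'dark' or 'flat') and 'mode' (instrument mode).
    """
    steps, problems = [], []
    for sci in (im for im in images if im["kind"] == "science"):
        calib = {}
        for kind in ("dark", "flat"):
            # Associate calibration images with the science image by mode.
            calib[kind] = [im["name"] for im in images
                           if im["kind"] == kind and im["mode"] == sci["mode"]]
            if not calib[kind]:
                problems.append(
                    f"{sci['name']}: no {kind} for mode {sci['mode']}")
        if calib["dark"] and calib["flat"]:
            # Expand the compound task into its primitive operations.
            for primitive in COMPOUND_TASKS[task]:
                steps.append((primitive, sci["name"],
                              calib["dark"], calib["flat"]))
    return steps, problems
```

A science image with a missing calibration file produces an entry in `problems` instead of plan steps, which mirrors the DAA's check for missing calibrations before any script is written.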

The DAA was implemented in the Lisp-based Knowledge Engineering Environment (KEE) expert system shell (a product of Intellicorp, Inc.) on a Symbolics Lisp workstation. The DAA used KEE's rule and object systems as well as KEE's graphical user interface. Portions of the DAA were written in Lisp.

The DAA demonstrated two important concepts. First, it was possible to separate the system's knowledge of data reduction from the control strategy information. This allowed the system to accommodate new types of data or new data analysis functions without massive changes to the control software, as might be the case in a system written in a procedural language such as Fortran or C, or operating system command languages such as Unix shell scripts. For example, the format of individual reduction system commands was attached to the primitive task objects. This facilitates using existing information in new reduction procedures and provides a straightforward way to add new reduction systems. Second, it proved the feasibility of constructing a data reduction plan on the basis of a generalized knowledge of data reduction, specific knowledge of commands in a particular data analysis system and knowledge of the actual data. This yielded a very general framework which could readily accommodate new types of data reduction.

To conclude this section, I briefly mention some other work on expert scientific assistants. An expert assistant for the preparation of Hubble Space Telescope (HST) observing proposals was developed by Adorf and di Serego Alighieri2. A system which planned experiments in molecular genetics was originally developed by Stefik17 and recent work is reported in Noordewier and Travis15. Buchanan, et al.3 developed a system which controlled particle accelerator experimental parameters.

Abelson, et al.1 review their work on tools which prepare numerical experiments from high-level specifications of physical models. For example, the Bifurcation Interpreter and KAM programs investigate problems in dynamics by identifying interesting features, performing additional calculations of such features and reporting the results to the scientist. Keller and Rimon11 are developing a knowledge-based software environment to support the development of scientific models. Fabiano, Bettini and Chin5 describe a program which assists users in choosing parameters for complex quantum chemistry programs. Lucks and Gladwell12 describe a framework for representing and reasoning with expert knowledge that has been used to advise in the selection of differential equation software from numerical subroutine libraries and for the identification of parallel science observations for the Hubble Space Telescope.

Artificial intelligence technology has often been applied to classification problems. Fayyad, et al.6 are applying machine learning techniques to identify objects in the second Palomar Sky Survey. Cheeseman, et al.4 used a program to discover new classes of objects in IRAS spectra. Thonnat and Clement18 developed an expert system to control the processing of galactic images and the extraction of parameters such as size, ellipticity and luminosity profile. The results from this system are used by another expert system which classifies the galaxies.

These are just a few of the hundreds of scientific applications of expert systems and artificial intelligence (see Murtagh and Heck14 for more references). For the most recent developments, the reader should consult the proceedings of conferences such as the Conference on Artificial Intelligence Applications, Innovative Applications of Artificial Intelligence and the National Conference on Artificial Intelligence.

3. Design and Implementation

3.1 Scientific Data Analysis Systems

In the last decade, scientific data analysis systems have grown in number and functionality. Widely used astronomical data analysis systems include the Interactive Reduction and Analysis Facility (IRAF) developed at NOAO, the Space Telescope Science Data Analysis System (STSDAS) developed at the STScI, the Astronomical Image Processing System (AIPS) developed at NRAO and the Munich Interactive Data Analysis System (MIDAS) developed at ESO. The Interactive Data Language (IDL) is used in astronomy and other disciplines such as climate research. (See Hanisch9 for a review).

The philosophy of these systems is usually similar to the philosophy of most computer operating systems (e.g., Unix, VMS): there is a command language (CL) which serves as the user interface in a "command/prompt" mode. The CL executes either single commands interactively, or scripts (procedures) of commands (generally with a choice of interactive or batch execution). CL commands reduce to the execution of modular operators which work on standardized types of data files. Two major strengths of this philosophy are:

* Flexibility for the user - individual commands can be chained (or "pipelined") to construct powerful, customized procedures.

* Ease of development - there is a well-defined (though not usually simple) process for adding new modules to a system. Thus many programmers and scientists may independently contribute to the growth of a system.

The success of this approach is shown by the growth of these systems. Data analysis systems developed at one institution have been adopted as the standard at many universities and research institutes (e.g., IRAF) and systems developed for a particular wavelength range have been adapted to serve multiple spectral domains (e.g., AIPS). Packages developed independently have been incorporated into larger systems (e.g., the incorporation of DAOPHOT and Feigelson's Survival Analysis into IRAF/STSDAS).

However, this approach has some serious drawbacks (which are well known to users). Learning a system is not easy and even experts cannot be familiar with all parts of the system. Within a system, programs authored by different people may have different definitions or naming conventions, which add to the confusion of a novice user. To compound the problem, some users have to learn more than one system depending on where or how they obtain their data. This is especially true for multi-spectral observations which are often taken at several different observatories.

Although many commands are conceptually simple (e.g., subtract two images), some commands are quite sophisticated and require the specification of many parameters (some of which are interdependent) to get the correct results. If a complex procedure does not perform the desired tasks, the user is faced with the daunting possibilities of either modifying a large, complex program or writing a new program. Either choice can take many weeks or months.

It has also proven very difficult to capture and make available expert knowledge. Users can obtain assistance from manuals (often quite large, hard to use and harder to keep up to date), on-line help (often of little use to the non-expert and surprisingly hard to maintain), or by befriending the local expert on a particular topic.

Some of these problems will be lessened by efforts within scientific disciplines to adopt one analysis system as a standard (e.g., IRAF as a standard for astronomical data analysis), yet these efforts cannot solve all the difficulties listed above. In particular, standardization will not lessen the data management problem nor the time needed to learn the system (including new modules as they are added). Researchers working in multi-spectral or interdisciplinary domains are likely to be faced with an amalgam of analysis systems for years to come.

Several groups are investigating solutions to these problems. A graphical user interface based on X-windows is being added to IRAF and the development of a hypertext help system has been proposed. A number of groups are exploring visualization systems which allow a scientist to interact with data in a more intuitive way in order to facilitate communication of results, browsing and discovery of new features7,16. However, as visualization systems are (by design) highly interactive, they will not lessen the data management problems addressed by Draco.

3.2 Draco - Data Reduction Expert Assistant

The present work demonstrates one approach to solving some of the above problems in scientific data analysis. Draco is an expert assistant which does the following:

* gathers information about the actual data (from header information in the data files)

* develops a plan for data reduction based on the user's goals and actual properties of the data

* produces a command language script to perform the reduction in a specific data analysis system

* performs checks on the data for consistency and quality

By producing a command script in the language of a data analysis system, Draco builds on the foundation of these systems, rather than creating yet another analysis system.

For Draco we have adopted a different design from the rule-based approach of Johnston's DAA. Draco can be likened to an algebra or very simple programming language. The user defines a set of primitive operations and combines them to perform reduction procedures. Draco does not know anything about the semantics of the primitives, but it does know which combinations are syntactically valid. This design was motivated by the realities of cutting-edge scientific work: In our discussions with colleagues it became clear that an important characteristic of scientific data analysis is that experts often disagree on how to best perform reductions. For example, there are several algorithms for removing cosmic ray artifacts from HST Wide Field/Planetary Camera (WF/PC) data8,13 and the proper choice of algorithm depends on the type of science to be extracted from the data. The variety of techniques available to correct for the HST's spherical aberration provides another example. Since the "rules" for data reduction vary from user to user (and even week to week for a single user), it did not appear feasible to us to collect this information as a set of expert system production rules. The alternative provided by Draco allows the user to specify the reduction steps at an abstract level.

Figure 1 illustrates the organization of information in Draco, using the removal of cosmic ray noise from WF/PC images as an example. Primitives represent basic data analysis operations and each primitive has one or more implementations which are commands or programs which accomplish the primitive. A procedure is a template for data reduction built from primitives.

The command

(make-script remove-CR-noise :input "mydir")

causes Draco to generate a script for removing cosmic ray artifacts using the procedure remove-CR-noise. Science image files are taken from the directory mydir. The script is then executed and produces a log file which records the reduction steps.

A procedure for cosmic ray removal is defined by:

(define-procedure
  :name remove-CR-noise
  :documentation "procedure for removing CR noise"
  :primitives (find-like-images CR-removal))

The operations defined by the primitives are executed in the order specified; that is, Draco currently implements a pipeline for reductions.

The primitive CR-removal is defined as follows:

(define-primitive
  :name CR-removal
  :documentation "remove cosmic ray noise"
  :input image
  :output image
  :reconcile :conjunctive
  :concrete (STSDAS-CR-removal))

The :input and :output parameters specify the data types that this primitive reads and writes. (Data types are an abstraction which is realized in terms of file types, e.g., in SDAS, images are stored in Generic Edited Information Set files.) The :reconcile parameter defines the action when multiple inputs are encountered. The value of :conjunctive in the example indicates that the primitive should treat the data as a single input (i.e., multiple images are processed as a unit to determine the cosmic ray hits). An alternative value for this parameter is :distributive which causes the primitive to process each input separately, i.e., to iterate over its inputs. The :concrete keyword names the implementation which is defined as:

(define-implementation
  :name STSDAS-CR-removal
  :documentation "STSDAS cosmic ray removal function"
  :package IRAF
  :initialize-once ("stsdas" "wfpc")
  :syntax "combine ~in ~out option=\"crreject\" usedqf=\"yes\"")

The :syntax parameter records the format of specific procedures and commands such as the STSDAS "combine" procedure in our example. The ~in and ~out tokens are placeholders for the input and output file lists, respectively. Many analysis commands have initializations which must be invoked prior to execution. In this example, the "stsdas" and "wfpc" commands must be issued to IRAF to select the proper packages which contain the combine procedure. The actions listed under the :initialize-once keyword are performed once, before the first instance of this command. The optional parameter :initialize (not used in this example) is used when commands must be invoked with each use.
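How an implementation's fields might combine into script lines can be sketched in Python. This is a sketch under stated assumptions rather than Draco's code: the function names, the dictionary layout, the comma-separated form of the input file list and the numbering of output files in the distributive case are all hypothetical.

```python
def render(syntax, inputs, output):
    """Fill the ~in and ~out placeholders of a syntax string."""
    # Assumption: the input file list is written as comma-separated names.
    return syntax.replace("~in", ",".join(inputs)).replace("~out", output)

def make_commands(impl, reconcile, inputs, out_name):
    """Return the script lines for one primitive applied to `inputs`."""
    lines = list(impl.get("initialize_once", []))   # issued once, up front
    if reconcile == "conjunctive":
        groups = [list(inputs)]                     # all inputs as one unit
    elif reconcile == "distributive":
        groups = [[name] for name in inputs]        # one command per input
    else:
        raise ValueError(f"unknown reconcile mode: {reconcile}")
    for i, group in enumerate(groups):
        lines.extend(impl.get("initialize", []))    # repeated before each use
        # Hypothetical naming: number the outputs when there are several.
        output = out_name if len(groups) == 1 else f"{out_name}{i}"
        lines.append(render(impl["syntax"], group, output))
    return lines

# The STSDAS-CR-removal implementation from the text, as a plain dict.
STSDAS_CR_REMOVAL = {
    "initialize_once": ["stsdas", "wfpc"],
    "syntax": 'combine ~in ~out option="crreject" usedqf="yes"',
}
```

With the conjunctive setting, two input images yield a single combine command preceded by the two package-loading commands; with the distributive setting, the same inputs yield one combine command per image.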

Figure 1 - Data structures for the removal of cosmic ray noise from WF/PC images.

The definition of Draco's structures such as primitives, implementations and data types creates a base of knowledge which can be applied to new data reduction problems. A new analysis system can be added to Draco by defining the appropriate implementations and file types. It is common for astronomers to write operating system command language scripts (e.g., Unix Shell or VMS DCL) to reduce data. The advantages of the Draco-generated scripts are clear: Draco provides a higher level of abstraction and handles many lower level details for the user. It is usually very difficult to modify custom command language scripts to different reduction tasks whereas Draco facilitates reuse of its component data structures.
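Because each primitive declares the data types it reads and writes, a procedure can be checked before any script is generated. The text does not spell out how Draco decides which combinations of primitives are valid; the Python sketch below assumes the simplest reading, namely that each primitive's output type must match the input type of the next primitive in the pipeline. The class, the function and the "photometry" primitive are hypothetical, as are the types given to find-like-images.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Primitive:
    name: str
    input: str    # data type the primitive reads, e.g. "image"
    output: str   # data type the primitive writes

def check_pipeline(primitives):
    """Return a list of type mismatches between consecutive primitives."""
    errors = []
    for prev, nxt in zip(primitives, primitives[1:]):
        if prev.output != nxt.input:
            errors.append(f"{prev.name} writes {prev.output} "
                          f"but {nxt.name} reads {nxt.input}")
    return errors

# Types for find-like-images are assumed; CR-removal follows the text.
find_like_images = Primitive("find-like-images", "image", "image")
cr_removal = Primitive("CR-removal", "image", "image")
photometry = Primitive("photometry", "image", "table")   # hypothetical
```

A check of this kind needs no knowledge of what a primitive does, only of its declared types, which is consistent with a system that treats primitives as opaque building blocks.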

In addition to the ability to create scripts, Draco provides a tool to inventory the files within a directory. Rather than relying solely on file extension conventions, the inventory uses file recognizers which (generally) open and read files to determine their format and contents (e.g., particular FITS keywords). This tool is useful for the management of the large number of files involved in data reduction and in providing quality checks on the data.
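A file recognizer of this kind can be sketched in Python for the FITS format, whose headers consist of 80-byte cards beginning with a SIMPLE card and ending with an END card. The sketch is illustrative only: Draco's recognizers are written in Lisp, C and Bourne shell, and the function names, the label strings and the decision to read only the first 2880 bytes of each file are assumptions.

```python
import os

CARD = 80  # FITS header cards are 80 bytes long

def recognize_fits(head):
    """Return {keyword: value text} if `head` starts a FITS header, else None."""
    if not head.startswith(b"SIMPLE  ="):
        return None
    keywords = {}
    for start in range(0, len(head) - CARD + 1, CARD):
        card = head[start:start + CARD].decode("ascii", errors="replace")
        name = card[:8].strip()
        if name == "END":
            break
        if card[8:10] == "= ":
            # Naive split: a '/' inside a quoted string would truncate the value.
            keywords[name] = card[10:].split("/")[0].strip()
    return keywords

def inventory(directory, recognizers, nbytes=2880):
    """Label each file in `directory` using the first matching recognizer."""
    result = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            head = f.read(nbytes)   # longer headers would need more blocks
        result[name] = "unknown"
        for label, recognize in recognizers:
            if recognize(head) is not None:
                result[name] = label
                break
    return result
```

Since the recognizer returns the header keywords it found, the same pass over a directory could feed quality checks, for example confirming that every image carries the expected instrument keyword.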

Draco is written mostly in Common Lisp, with some of the file recognition functions in C and Bourne shell code. Object-oriented programming (using the Common Lisp Object System) is used to represent and operate on the data structures. The DAA was prototyped on a special-purpose workstation (Lisp machine) and used a costly expert system shell. Since then, both workstation and software technology have evolved to the point that Draco is implemented on the same class of workstation commonly used for data analysis (e.g., Sun SPARCstation) and requires only an inexpensive Lisp environment. This makes possible the distribution of Draco to the scientific community.

4. Experience with Draco

In discussions with research groups at the STScI (along with our own experience in astronomical research) we found a number of common problems:

* The data management problem is severe. Many astronomers today have more data than can quickly be reduced and analyzed. Some data may wait months or years before the scientist can hire a postdoc or graduate assistant to reduce and analyze it.

* Despite best efforts to calibrate data only once, there is a continuing need to recalibrate data. This is true even if an observatory provides calibrated data (as does STScI). Often this is because the best calibration data are not available until well after the observations are taken.

* The removal of instrumental signatures from data is seldom routine, especially when state-of-the-art detectors are involved or when striving for quantitative results or high accuracy. The scientist therefore needs to be able to experiment with different parameters in the reduction algorithms as well as different algorithms.

* There is an inertia to remain within the analysis systems, computer operating systems and programming languages with which the astronomer is comfortable, despite serious shortcomings with these systems. Part of this is due to a justifiable skepticism that a new system will actually be better. Another major factor is that a scientist must usually concentrate on research and may have little time to provide tools which are useful to others.

An important part of our project plan is the early involvement of scientists in the use and evaluation of Draco. We sought astronomers who were faced with a large amount of data to reduce and whose projects were such that even early versions of Draco would reward them for their investment of time in the project. Detailed discussions were held with three research groups at the STScI. Our first users are R. Griffiths and K. Ratnatunga of the HST Medium-Deep Survey (MDS) Key Project. HST Key Projects are those which were identified by the astronomical community as having high scientific importance and involving a large amount of HST observing time. Data is shared by many astronomers with different interests. The scientific objectives of the Medium-Deep Survey include serendipitous discoveries, observations of rare objects, morphology and distribution of faint galaxies, active nuclei of distant galaxies, galactic structure, and distant solar-system objects. The observing program involves obtaining image data with WF/PC and Faint Object Camera in parallel with other HST observing programs. Since removal of cosmic ray artifacts from WF/PC data is an important step in the reduction process, we adopted it as the first sample problem for Draco. A small set of MDS files has been processed with Draco. Further use of Draco by the MDS is on hold at this time as they are revising their basic data reduction procedures in order to quantitatively account for measurement and reduction errors. This revision may cause them to write new software or to modify existing packages.

Earlier work (see Section 2) has shown that it is possible to build software which provides expert assistance for scientific tasks. In our project we are trying to bring the expert assistant out of the prototype stage and into the hands of researchers. An open question is whether there is a sufficient audience for this type of tool. Some users either do not have much data or have less stringent analysis requirements and can be satisfied with existing analysis tools. Other users prefer to write their own software for analysis in order to be sure the reductions are done correctly (or must do so because no suitable software exists). Draco is aimed primarily at users between these two types.

5. Summary

This paper reports our initial efforts in developing an expert assistant for the reduction of scientific data. The first version of the software has been used to manage the removal of cosmic ray artifacts from HST Medium-Deep Survey WF/PC data using an STSDAS procedure. Our approach holds promise for addressing several critical problems in dealing with large amounts of data. Although astronomical data reduction has been the focus of our initial work, the Draco system is directly applicable to many other fields of science including space physics and earth sciences.

We plan to implement several more versions of Draco over the coming year, each with increasing capability and addressing more involved reduction tasks. It will then be made available to the community. There are several possibilities for future work. A graphical user interface for defining and editing Draco entities would make its use more intuitive. The ability for Draco to monitor the execution of procedures and provide status and diagnostic information would be helpful. A means for procedures (scripts) to branch or iterate in a general way might also be useful: a script would need to examine the output of a reduction program in order to determine the next implementation to invoke and, possibly, some of the parameters with which to invoke it. We have deferred implementing such a feature since it is not clear whether such a step can currently be automated for most reductions. Consider, for example, an iterative deconvolution algorithm. The number of iterations is usually determined by the astronomer's visual inspection since there exists no image analysis program which can determine the "best" number of iterations for an image (or at least no program which is generally accepted by astronomers). Even so, the current version of Draco can provide substantial assistance for such a task. Draco could create a number of deconvolved images (e.g., 25, 30, ... iterations) and run some statistical analysis modules on the images. The astronomer would examine this output to select the desired images.
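The deconvolution example amounts to generating one command per candidate iteration count. A minimal Python sketch follows; the "deconvolve" command, its "niter" parameter, the ~n placeholder and the output naming scheme are all hypothetical and stand in for whatever deconvolution task the astronomer actually uses.

```python
def iteration_sweep(syntax, image, counts):
    """Return one command per iteration count, each with its own output name."""
    commands = []
    for n in counts:
        output = f"{image}_it{n}"          # hypothetical naming scheme
        commands.append(syntax.replace("~in", image)
                              .replace("~out", output)
                              .replace("~n", str(n)))
    return commands

# Hypothetical command template with a ~n placeholder for the count.
SWEEP = iteration_sweep("deconvolve ~in ~out niter=~n", "m87", range(25, 40, 5))
```

Each generated image keeps its iteration count in its file name, so the astronomer can compare the results side by side and select the preferred one.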

Acknowledgements

Along with the author, Felix Yen, Mark Johnston and Robert Hanisch are investigators on the Draco project. We thank Ron Gilliland, Richard Griffiths, Keith Horne, Phil Martel and Kavan Ratnatunga for discussions about data reduction and their thoughtful comments on the design and development of Draco. This work is supported by NASA's Astrophysics Information Systems Research Program through the Center of Excellence in Space Data and Information Sciences by a contract with the Space Telescope Science Institute which is operated by AURA for NASA.

References

1. Abelson, H., Eisenberg, M., Halfant, M., Katzenelson, J., Sacks, E., Sussman, G., Wisdom, J., Yip, K., 1989, "Intelligence in Scientific Computing", Communications of the ACM, 32, 546.

2. Adorf, H.-M., and di Serego Alighieri, S., 1989, "An Expert Assistant Supporting Hubble Space Telescope Proposal Preparation," in Data Analysis in Astronomy III, ed. V. DiGesu, et al., (New York: Plenum Press), 225.

3. Buchanan, B., Sullivan, J., Cheng, T. and Clearwater, S., 1988, "Simulation-Assisted Inductive Learning", in Proceedings of the Seventh National Conference on Artificial Intelligence (San Mateo, CA: Morgan Kaufmann), 552.

4. Cheeseman, P., Stutz, J., Self, M., Taylor, W., Goebel, J., Volk, K., and Walker, H., 1989, Automatic Classification of Spectra from the Infrared Astronomical Satellite (IRAS), NASA Reference Publication 1217.

5. Fabiano, A., Bettini, C. and Chin, S., 1991, "A Model Based Assistant for Quantum Chemistry Programs", Seventh Conference on AI Applications, (Miami: IEEE), 114.

6. Fayyad, U., Doyle, R., Weir, N., and Djorgovski, S., 1992, "Applying Machine Learning Classification Techniques to Automate Sky Object Cataloguing", International Space Year Conference on Earth and Space Science Information Systems, in press.

7. Foley, J., 1990, "Scientific Data Visualization Software: Trends and Directions", International Journal of Supercomputer Applications, 4, 154.

8. Groth, E., 1992, "An Algorithm for Removing Cosmic Rays from Two or More Cosmic Ray Split Exposures", in preparation.

9. Hanisch, R., 1992, "Image Processing, Data Analysis Software and Computer Systems for CCD Data Reduction and Analysis" in Astronomical CCD Observing and Reduction, ed. S. Howell (San Francisco: Astronomical Society of the Pacific), 285.

10. Johnston, M., 1987, "An Expert System Approach to Astronomical Data Analysis," Proceedings of the 1987 Goddard Conference on Space Applications of Artificial Intelligence.

11. Keller, R. and Rimon, M., 1992, "A Knowledge-based Software Development Environment for Scientific Model-building", Proceedings of the Seventh Knowledge-Based Software Engineering Conference, in press.

12. Lucks, M. and Gladwell, I., 1992, "Functional Knowledge Representation in AI Applications for Scientific Computing", AAAI 1992 Fall Symposium on Intelligent Scientific Computation, in press.

13. Murtagh, F. and Adorf, H.-M., 1991, "Detecting Cosmic Ray Hits on HST WF/PC Images using Neural Networks and Other Discriminant Analysis Approaches", in Data Analysis in Astronomy IV, ed. V. DiGesu, et al., (New York: Plenum Press), 103.

14. Murtagh, F. and Heck, A., 1989, Knowledge Based Systems in Astronomy, Lecture Notes in Physics #329, (Berlin: Springer Verlag).

15. Noordewier, M., and Travis, L., 1990, "Case Study of a Knowledge-Based System Which Plans Molecular Genetics Experiments", 1990 Conference on Artificial Intelligence Applications, (San Mateo, CA: IEEE), 257.

16. Senay, H. and Ignatius, E., 1992, "A Knowledge Based System for Scientific Data Visualization", Center of Excellence in Space Data and Information Sciences Technical Report 92-79.

17. Stefik, M., 1981, "Planning with Constraints (MOLGEN)", Artificial Intelligence, 16, 111.

18. Thonnat, M., and Bijaoui, A., 1989, "Knowledge Based Classification of Galaxies," Knowledge-Based Systems in Astronomy, ed. A. Heck and F. Murtagh, (Berlin: Springer-Verlag), 121.

19. Yen, F., 1992, "Draco - A Data Reduction Expert Assistant", AAAI 1992 Fall Symposium on Intelligent Scientific Computation, in press.


Invited paper to appear in the Proceedings of Astronomy from Large Databases II, 14-16 September, 1992, Haguenau, France.

+ For the purposes of this paper, there is no need to distinguish between the terms data "reduction", "calibration" and "analysis" since Draco can provide assistance for all.