Introduction to numarray

------------------------
  This is the text of the original announcement to the Numeric community.
  It serves as a good explanation of what we are trying to accomplish
  with numarray but it is a bit dated and some of the details are no
  longer correct.
------------------------

We have been working on a reimplementation of Numeric, the
numeric array manipulation extension module for Python. 
The reimplementation is virtually a complete rewrite,
and because it is not completely backwards compatible
with Numeric, we have dubbed it numarray to prevent
confusion.

While we think this version is not quite mature enough for
most to use in everyday projects, we are interested in
feedback on the user interface and the open issues mentioned
below. While we believe that performance is generally very
good for arrays larger than 100K elements, there are functions
and aspects of numarray which have not been optimized yet, so
please be careful about drawing conclusions about its efficiency
from limited test cases. We also welcome those who would like
to contribute to this effort by helping with the development,
or adding libraries.

This new version was developed for a number of reasons.
To summarize, we regularly deal with large datasets and
the new version gives us the capabilities that we feel are
necessary for working with such datasets. In particular:

1) Avoiding promotion of array types in expressions involving
   Python scalars (e.g., 2.*<Float32 array> should not result
   in a <Float64 array>).
2) Ability to use memory mapped files [largely implemented].
3) Ability to access fields in arrays of records as
   numeric arrays without copying the data to a new array.
4) Ability to reference byteswapped data or non-aligned data
   (as might be found in record arrays) without producing new
   temporary arrays.
5) Reuse temporary arrays in expressions when possible
   [not implemented yet]
6) Provide more convenient use of index arrays (put and take).
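
As a rough illustration of item 2, the sketch below reads a file of
native-endian doubles through a memory map, without first loading the
file into a string. It uses only the Python standard library (mmap and
memoryview), not numarray, and the function name sum_mapped_doubles is
hypothetical.

```python
import mmap
import os
import struct
import tempfile


def sum_mapped_doubles(path):
    """Sum the native-endian float64 values stored in the file at `path`."""
    with open(path, "rb") as f:
        # Map the whole file read-only; the OS loads pages on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # View the mapped bytes as doubles without copying them.
            # Both views are released before the map itself is closed.
            with memoryview(mm) as raw, raw.cast("d") as values:
                return sum(values)


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(struct.pack("4d", 1.0, 2.0, 3.5, 4.0))
        print(sum_mapped_doubles(path))  # 10.5
    finally:
        os.remove(path)
```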

A new implementation was decided upon since many of the
existing Numeric developers agree that the existing implementation
is not suitable for massive changes and enhancements. Furthermore,
Guido is very reluctant to accept Numeric into the Standard
Library unless it is rewritten.

This version has nearly the full functionality of basic
Numeric. The only things missing are masked arrays, the
convolve function, and the use of the ufunc methods reduce,
accumulate, and outer for complex types. No other libraries
have been adapted to numarray yet. In particular, the FFT
and lapack modules are not yet available.

NUMARRAY IS NOT FULLY COMPATIBLE WITH NUMERIC (but it is
very similar in most respects). The incompatibilities are
listed below. The C interface is completely different, so
existing C extensions will not work with numarray.

Major Compatibility issues:

1) Coercion rules are different. Expressions involving
   scalars may not produce the same type of array as in
   Numeric.
2) Types are represented by type objects rather than character
   codes (though the old character codes may still be used
   as arguments to functions).
3) Arrays have no public attributes. Accessor functions must
   be used instead (e.g., to get the shape of array x, one must
   use x.shape() instead of x.shape).

There are more minor differences which we will try to catalog.

OPEN ISSUES:

There are still some open questions about the interface
which we hope to resolve with feedback from the Numeric 
community. These issues are listed below. Depending on
what the consensus is, we may change aspects of the user
interface. It is not our intent to make the case for either
side. This should be done by the proponents of each view
in more detail on the Numeric mailing list.

1) Should reduce, accumulate, etc. act on the first dimension
   or last?

   Currently a number of functions and methods (e.g., reduce)
   act by default on the first dimension rather than the last
   (as they do in Numeric). Some feel that it makes sense to always
   apply functions to the most rapidly varying dimension by 
   default and they are proposing that we change numarray to
   reflect this. Others say that the current behavior is more
   compatible with Python behavior, and that which dimension
   a function applies to depends on the function (reduce would
   apply to the first while FFT would apply to the last).
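
   To make the two choices concrete, here is a minimal pure-Python
   sketch on a nested list standing in for a 2-D array. The helper
   names reduce_first and reduce_last are hypothetical and are not
   part of numarray or Numeric.

```python
from functools import reduce
from operator import add


def reduce_first(rows):
    """Add-reduce a 2-D nested list along its first dimension (down columns)."""
    return [reduce(add, column) for column in zip(*rows)]


def reduce_last(rows):
    """Add-reduce a 2-D nested list along its last dimension (across each row)."""
    return [reduce(add, row) for row in rows]


x = [[1, 2, 3],
     [4, 5, 6]]
print(reduce_first(x))  # [5, 7, 9]
print(reduce_last(x))   # [6, 15]
```

   For data stored row by row, the last dimension is the most
   rapidly varying one, so reduce_last corresponds to the proposed
   change and reduce_first to the current default.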

2) Should the ones and zeros functions return Float64
   or Int32 arrays by default? (Currently Int32.)
   
3) Should complex comparisons work, particularly for equality?

   The current implementation does allow complex comparisons,
   even for <, >, >=, etc. Only the real component is checked
   for those; both real and imaginary components are checked
   for == and !=. Python itself no longer permits ordering
   comparisons such as < and > between complex scalars. Some
   say that allowing them avoids a lot of special-case code,
   but others say the incompatibility with Python behavior is
   worse.
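
   The rule described above can be sketched in plain Python on
   lists of complex numbers. The function names are hypothetical;
   this only mirrors the stated semantics and is not numarray code.

```python
def complex_less(a, b):
    """Elementwise a < b, comparing real parts only (imaginary parts ignored)."""
    return [x.real < y.real for x, y in zip(a, b)]


def complex_equal(a, b):
    """Elementwise a == b, comparing both real and imaginary parts."""
    return [x == y for x, y in zip(a, b)]


a = [1 + 5j, 2 + 0j, 3 - 1j]
b = [2 - 9j, 2 + 1j, 3 - 1j]
print(complex_less(a, b))   # [True, False, False]
print(complex_equal(a, b))  # [False, False, True]
```

   By contrast, evaluating (1 + 5j) < (2 - 9j) directly in Python
   raises a TypeError.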
   
4) What is the necessary degree of optimization for small
   arrays?
   
   The current implementation has much of the code
   in Python. Performance on large arrays is generally
   good (or easy to optimize), but reducing the overhead for
   setting up array operations will require either rewriting
   much more of the current implementation in C or using more
   complex algorithms in Python. Either approach involves
   significant work and will result in a module that is more
   difficult to maintain. We have not done any significant
   benchmarking lately, but we do have some order-of-magnitude
   estimates of relative performance. (The following numbers
   depend on the platform, so do not take them too seriously.)
   IDL (Interactive Data Language from Research Systems)
   already reaches about half its asymptotic throughput for
   200-element arrays. Numeric reaches half its asymptotic
   throughput for 2000-element arrays. We estimate that
   numarray is probably another order of magnitude worse,
   i.e., that it reaches half its asymptotic speed for
   20K-element arrays. How much should this be improved? We have some
   thoughts about other approaches that don't require this sort
   of optimization, but rather try casting iterative calculations
   on small arrays into a form that numarray can perform in 
   one call. These ideas will be elaborated on in a separate
   message.
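
   One way to read the half-throughput figures is through a simple
   cost model. This model is an assumption made here, not something
   stated in the announcement: the time for an n-element operation
   is a fixed setup overhead plus n divided by the peak rate.
   Throughput then reaches half its peak at n_half = overhead *
   peak rate, and the fraction of peak reached at size n is
   n / (n + n_half).

```python
def fraction_of_peak(n, n_half):
    """Fraction of asymptotic throughput reached for n-element arrays.

    Assumes time(n) = overhead + n / peak_rate, so that
    n_half = overhead * peak_rate is the half-throughput size.
    """
    return n / (n + n_half)


# Rough half-throughput sizes quoted in the text (platform dependent).
N_HALF = {"IDL": 200, "Numeric": 2000, "numarray": 20000}

for name, n_half in N_HALF.items():
    for n in (200, 20000, 2000000):
        print(f"{name:9s} n={n:<8d} {fraction_of_peak(n, n_half):.0%} of peak")
```

   Under this assumed model, a 2000-element array would run at
   roughly 9% of peak speed with a half-throughput size of 20K.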
   
5) Should numarray be bullet-proof enough to protect users who
   violate the rules regarding private attributes from themselves?

   The current implementation allows a user who messes with
   private attributes to crash Python. If they stick with
   the public interface, this should not be a problem. 

6) Should array properties be accessible as public attributes
   instead of through accessor methods? 

   We currently do not allow public array attributes, in order
   to keep the Python code simpler and faster (otherwise we
   would be forced to use __setattr__ and such). This results in
   incompatibility with previous code that uses such attributes.
   
7) Is it necessary to add any other types, e.g.: Int64, UInt32,
   Float128, Complex256?
   
8) What should the default behavior of index arrays be with
   regard to negative indices? Out-of-range indices? What other
   options should be available?

9) What options should be made available for index arrays? What
   behavior should be the default?
   
   In particular a number of different behaviors can be envisioned.
   For index values beyond the end of the array we can:
     o Raise an exception (as with single indexing for Python
       sequences.)
     o Clip the index (as with Python slicing) [current default]
     o Wrap the index (mod size of dimension).
   For negative index values:
     o Raise an exception
     o Index from end of array (as with Python) [planned,
        but not yet implemented default]
     o Clip the index at 0 [current default] 
     o Wrap the index (mod size of dimension).
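
   The candidate behaviors can be compared with a small pure-Python
   sketch. The names resolve_index and take, and the mode strings,
   are hypothetical. For brevity each mode here handles both
   negative and too-large indices, although the lists above treat
   those as separate choices.

```python
def resolve_index(i, size, mode):
    """Map index i onto range(size) under one of the policies listed above."""
    if mode == "raise":       # reject anything outside 0..size-1
        if not 0 <= i < size:
            raise IndexError(f"index {i} out of range for size {size}")
        return i
    if mode == "clip":        # clamp to the nearest valid index
        return min(max(i, 0), size - 1)
    if mode == "wrap":        # reduce modulo the size of the dimension
        return i % size
    if mode == "from_end":    # Python sequence rule: negatives count from the end
        if not -size <= i < size:
            raise IndexError(f"index {i} out of range for size {size}")
        return i + size if i < 0 else i
    raise ValueError(f"unknown mode {mode!r}")


def take(values, indices, mode="clip"):
    """Gather values[i] for each i in indices, resolving i according to mode."""
    return [values[resolve_index(i, len(values), mode)] for i in indices]


x = [0, 2, 4, 6, 8, 10]
print(take(x, [7, -1], mode="clip"))      # [10, 0]
print(take(x, [7, -1], mode="wrap"))      # [2, 10]
print(take(x, [5, -1], mode="from_end"))  # [10, 10]
```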

10) Should we implement 'indexing' with Bool arrays as being
    equivalent to compress?
    
    This isn't important to us, but has been wanted by others.
    Who still wants it? 
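
    For readers unfamiliar with compress, the standard library's
    itertools.compress shows the same selection on plain lists.
    The helper name select_by_mask is hypothetical; the proposal
    is that x[mask] would return the same elements.

```python
from itertools import compress


def select_by_mask(values, mask):
    """Keep the elements of values whose corresponding mask entry is true."""
    return list(compress(values, mask))


x = [10, 11, 12, 13, 14]
mask = [v % 2 == 0 for v in x]  # Boolean "array": True where x is even
print(select_by_mask(x, mask))  # [10, 12, 14]
```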

11) Should comparisons of array type objects work with the traditional
    type codes?

    We could enable comparisons of the type objects with the traditional
    single-character typecodes to yield true on effective matches.
    Would this be a good thing? The advantage is that it reduces the
    backward incompatibility. An alternative is to add a typecode method
    that returns the typecode string consistent with Numeric.
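
    A minimal sketch of the first option follows. The ArrayType
    class is a hypothetical stand-in, not numarray's actual type
    object; the codes 'f' and 'd' are the Numeric character codes
    for Float32 and Float64.

```python
class ArrayType:
    """Hypothetical stand-in for an array type object (not numarray's class)."""

    def __init__(self, name, typecode):
        self.name = name
        self.typecode = typecode  # Numeric-style single-character code

    def __eq__(self, other):
        if isinstance(other, ArrayType):
            return self.name == other.name
        if isinstance(other, str):
            # Treat a matching character code as equal to the type object.
            return self.typecode == other
        return NotImplemented

    def __hash__(self):
        return hash(self.name)

    def __repr__(self):
        return self.name


Float32 = ArrayType("Float32", "f")
Float64 = ArrayType("Float64", "d")

print(Float32 == "f")    # True
print(Float64 == "f")    # False
print("d" == Float64)    # True, via the reflected comparison
print(Float64.typecode)  # d
```

    One design caveat: objects that compare equal are expected to
    hash equal, which this sketch does not guarantee between a type
    object and its character code. Exposing only the typecode
    avoids that problem.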

New capabilities:

o Record arrays: arrays of records that have fixed format fields.
o Character arrays: arrays of fixed-length character strings.
o Support for memory mapped files.
o Use of index arrays within subscripts: e.g. 
  if ind = array([4, 4, 0, 2]) and x = 2*arange(6),
  x[ind] results in array([8, 8, 0, 4])
o Support for byteswapped representation.
o tofile and fromfile functionality which eliminates the need to
  use strings to read and write data from and to files.
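
For a feel of two of these capabilities without numarray installed,
the sketch below uses the standard library's array module as a
stand-in. It repeats the index-array example above as an explicit
loop, then round-trips binary data through a file with array.tofile
and array.fromfile. These are the standard library's methods, not
numarray's.

```python
import os
import tempfile
from array import array

x = array("d", [0.0, 2.0, 4.0, 6.0, 8.0, 10.0])  # like 2*arange(6), as doubles

# Index-array lookup spelled out as a loop: picks x[4], x[4], x[0], x[2].
ind = [4, 4, 0, 2]
print([x[i] for i in ind])  # [8.0, 8.0, 0.0, 4.0]

# Binary round trip through a file, with no intermediate string.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        x.tofile(f)
    y = array("d")
    with open(path, "rb") as f:
        y.fromfile(f, len(x))
finally:
    os.remove(path)

print(y == x)  # True
```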

Things yet to be done:

o More extensive C API; templates and examples for how to write
  C extensions. (currently top priority)
o Reimplement complex types in C.
o Masked Arrays
o Add all the commonly available Numeric libraries to numarray.
o An array type for generic Python objects. Various possibilities
  exist: one that requires them all to derive from a specified
  base class, and one that doesn't. Special purpose ones? (e.g.,
  one for Python strings?)
o Benchmarking and optimization. For example, some of the
  functions could be reimplemented with much better performance
  (e.g., array printing).

The early beta will be available for download from
http://sourceforge.net/projects/numpy as package numarray.

We are in the process of developing a revised version of the
Numeric manual for numarray (all in all, there is a great
deal of commonality).

We have developed a brief document summarizing the differences
between Numeric and numarray for use in the meantime.

