Introduction to numarray
------------------------
This is the text of the original announcement to the Numeric community.
It serves as a good explanation of what we are trying to accomplish
with numarray but it is a bit dated and some of the details are no
longer correct.
------------------------
We have been working on a reimplementation of Numeric, the
numeric array manipulation extension module for Python.
The reimplementation is virtually a complete rewrite,
and because it is not completely backwards compatible
with Numeric, we have dubbed it numarray to prevent
confusion.
While we think this version is not quite mature enough for
most to use in everyday projects, we are interested in
feedback on the user interface and the open issues mentioned
below. While we believe that performance is generally very
good for arrays larger than 100K elements, there are functions
and aspects of numarray which have not been optimized yet, so
please be careful about drawing conclusions about its efficiency
from limited test cases. We also welcome those who would like
to contribute to this effort by helping with the development,
or adding libraries.
This new version was developed for a number of reasons.
To summarize, we regularly deal with large datasets and
the new version gives us the capabilities that we feel are
necessary for working with such datasets. In particular:
1) Avoiding promotion of array types in expressions involving
Python scalars (e.g., 2.*<Float32 array> should not result
in a <Float64 array>).
2) Ability to use memory mapped files [largely implemented].
3) Ability to access fields in arrays of records as
numeric arrays without copying the data to a new array.
4) Ability to reference byteswapped data or non-aligned data
(as might be found in record arrays) without producing new
temporary arrays.
5) Reuse temporary arrays in expressions when possible
[not implemented yet].
6) Provide more convenient use of index arrays (put and take).
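Item 1 can be illustrated with a small pure-Python sketch. The
type table, the SCALAR_DEFAULT mapping, and the result_type helper
are hypothetical names made up for this illustration; they are not
numarray's API. The rule shown (a Python scalar changes the array
type only when the scalar is of a higher kind than the array) is an
assumption about one way to get the behavior described in item 1.

```python
# Hypothetical type table: type name -> kind rank (integer < float).
KIND = {"Int16": 0, "Int32": 0, "Float32": 1, "Float64": 1}

# Assumed default array type for each kind of Python scalar.
SCALAR_DEFAULT = {int: "Int32", float: "Float64"}


def result_type(array_type, scalar):
    """Return the assumed result type of <array> op <Python scalar>."""
    scalar_default = SCALAR_DEFAULT[type(scalar)]
    if KIND[scalar_default] <= KIND[array_type]:
        # Scalar is of the same or a lower kind: keep the array's type.
        return array_type
    # Scalar is of a higher kind: use that kind's default type.
    return scalar_default


print(result_type("Float32", 2.0))  # Float32, not Float64
print(result_type("Int16", 3))      # Int16
print(result_type("Int16", 2.0))    # Float64 (assumed)
```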
A new implementation was decided upon since many of the
existing Numeric developers agree that the existing implementation
is not suitable for massive changes and enhancements. Furthermore,
Guido is very reluctant to accept Numeric into the Standard
Library unless it is rewritten.
This version has nearly the full functionality of the basic
Numeric. Only masked arrays, the convolve function, and the
ufunc methods reduce, accumulate, and outer for complex types
are missing. No other libraries have been adapted to numarray
yet. In particular, the FFT and lapack modules are not yet
available.
NUMARRAY IS NOT FULLY COMPATIBLE WITH NUMERIC. (But it is
very similar in most respects.) The incompatibilities are
listed below. The C interface is completely different so
existing C extensions will not work with numarray.
Major Compatibility issues:
1) Coercion rules are different. Expressions involving
scalars may not produce the same type of arrays.
2) Types are represented by type Objects rather than character
codes (though the old character codes may still be used
as arguments to the functions).
3) Arrays have no public attributes. Accessor functions must
be used instead (e.g., to get shape for array x, one must
use x.shape() instead of x.shape).
There are more minor differences which we will try to catalog.
OPEN ISSUES:
There are still some open questions about the interface
which we hope to resolve with feedback from the Numeric
community. These issues are listed below. Depending on
what the consensus is, we may change aspects of the user
interface. It is not our intent to make the case for either
side. This should be done by the proponents of each view
in more detail on the Numeric mailing list.
1) Should reduce, accumulate, etc. act on the first dimension
or last?
Currently a number of functions or methods (e.g., reduce)
act on the first dimension rather than the last by default
(as does Numeric). Some feel that it makes sense to always
apply functions to the most rapidly varying dimension by
default and they are proposing that we change numarray to
reflect this. Others say that the current behavior is more
compatible with Python behavior, and that which dimension
a function applies to depends on the function (reduce would
apply to the first while FFT would apply to the last).
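The two choices in issue 1 can be seen on a small two-dimensional
example. This is a sketch using plain nested lists and the standard
library rather than numarray; reduce_first and reduce_last are
illustrative helper names, not part of either package.

```python
from functools import reduce
import operator


def reduce_first(op, rows):
    """Reduce a 2-D nested list along the first dimension (across rows)."""
    return [reduce(op, column) for column in zip(*rows)]


def reduce_last(op, rows):
    """Reduce a 2-D nested list along the last dimension (within each row)."""
    return [reduce(op, row) for row in rows]


a = [[1, 2, 3],
     [4, 5, 6]]

print(reduce_first(operator.add, a))  # [5, 7, 9]
print(reduce_last(operator.add, a))   # [6, 15]
```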
2) Should the ones and zeros functions return Float64
or Int32 arrays by default? (currently Int32)
3) Should Complex comparisons work, particularly for equality?
The current implementation does allow complex comparisons,
even for <, >, >=, etc. Only the real component is checked
for those; both real and imaginary components are checked
for == and !=. Python no longer permits ordering comparisons
of complex scalars. Some say that allowing this avoids lots
of special-case code, but others say the incompatibility
with Python behavior is worse.
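The rule described in issue 3 can be sketched for scalars as
follows. complex_compare is a hypothetical helper written for this
illustration (it is not a numarray function); the final lines show
that Python itself rejects ordering comparisons of complex scalars.

```python
import operator

ORDERING = {"<": operator.lt, "<=": operator.le,
            ">": operator.gt, ">=": operator.ge}


def complex_compare(op, a, b):
    """Compare two complex values under the rule described in issue 3."""
    if op == "==":
        return a == b                     # real and imaginary both checked
    if op == "!=":
        return a != b
    return ORDERING[op](a.real, b.real)   # ordering uses the real part only


print(complex_compare("<", 1 + 5j, 2 + 0j))   # True
print(complex_compare("==", 1 + 5j, 1 + 0j))  # False

try:
    (1 + 5j) < (2 + 0j)                   # plain Python complex scalars
except TypeError as exc:
    print("Python refuses:", exc)
```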
4) What is the necessary degree of optimization for small
arrays?
The current implementation has much of the code
in Python. The performance on large arrays is generally
fast (or easy to optimize), but reducing the overhead for
setting up array operations will require either rewriting much
more of the current implementation in C or using more complex
algorithms in Python. Either approach involves significant
work and will result in a more difficult-to-maintain module.
We have not done any significant benchmarking lately, but we
do have some order of magnitude estimates about relative
performance. (The following numbers depend on the platform
so do not take them too seriously). IDL (Interactive Data
Language from Research Systems) is already at about half its
asymptotic throughput for 200-element arrays. Numeric reaches
half its asymptotic throughput at 2000-element arrays. We estimate
that numarray is probably another order of magnitude worse,
i.e., that it reaches half its asymptotic speed at 20K-element
arrays. How much should this be improved? We have some
thoughts about other approaches that don't require this sort
of optimization, but rather try casting iterative calculations
on small arrays into a form that numarray can perform in
one call. These ideas will be elaborated on in a separate
message.
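One way to read the half-throughput figures in issue 4 is with a
simple cost model. The model is an assumption made here, not
something stated in the estimates: each array operation costs a
fixed setup time plus a constant time per element. Throughput is
then n / (setup + n * per_element), and it reaches half of its
asymptotic value 1 / per_element when n = setup / per_element. The
numbers below are illustrative only and were chosen to reproduce
the quoted array sizes.

```python
def throughput(n, setup, per_element):
    """Elements processed per unit time for an n-element operation."""
    return n / (setup + n * per_element)


def half_throughput_size(setup, per_element):
    """Array size at which throughput is half its asymptotic value."""
    return setup / per_element


# Per-element cost fixed at 1; setup cost is expressed in the same unit.
for name, setup in [("IDL", 200), ("Numeric", 2000), ("numarray", 20000)]:
    n_half = half_throughput_size(setup, 1.0)
    fraction = throughput(n_half, setup, 1.0) / (1 / 1.0)
    print(name, n_half, fraction)   # fraction is 0.5 in each case
```

Under this model the half-throughput size is just the setup
overhead measured in per-element times, which is why reducing
setup overhead is the lever for small arrays.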
5) Should numarray be bullet-proof enough to protect users who
violate the rules regarding private attributes from themselves?
The current implementation allows a user who messes with
private attributes to crash Python. If they stick with
the public interface, this should not be a problem.
6) Should array properties be accessible as public attributes
instead of through accessor methods?
To keep the Python code simpler and faster, we don't currently
allow public array attributes (otherwise we would be forced
to use __setattr__ and such). This results in incompatibility
with previous code that uses such attributes.
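The trade-off in issue 6 can be sketched with two toy classes.
AccessorArray, AttributeArray, and _check_shape are hypothetical
names for this illustration and do not reflect numarray's actual
implementation. The point is only that a validated public
attribute routes every attribute assignment through __setattr__.

```python
def _check_shape(shape, size):
    """Return shape as a tuple, or raise if it cannot hold size elements."""
    count = 1
    for dim in shape:
        count *= dim
    if count != size:
        raise ValueError("shape does not match number of elements")
    return tuple(shape)


class AccessorArray:
    """Shape is read and changed through accessor methods."""

    def __init__(self, data, shape):
        self._data = list(data)
        self._shape = _check_shape(shape, len(self._data))

    def shape(self):
        return self._shape

    def setshape(self, shape):
        self._shape = _check_shape(shape, len(self._data))


class AttributeArray:
    """Shape is a public attribute, validated on assignment."""

    def __init__(self, data, shape):
        self._data = list(data)
        self.shape = shape            # routed through __setattr__

    def __setattr__(self, name, value):
        # Runs for every attribute assignment, private ones included.
        if name == "shape":
            value = _check_shape(value, len(self._data))
        object.__setattr__(self, name, value)


x = AccessorArray(range(6), (2, 3))
print(x.shape())                      # (2, 3)

y = AttributeArray(range(6), (2, 3))
y.shape = (3, 2)
print(y.shape)                        # (3, 2)
```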
7) Is it necessary to add any other types, e.g.: Int64, UInt32,
Float128, Complex256?
8) What should the default be for behavior of index arrays with
regard to negative indices? Out of range indices? What other
options should be available?
9) What options should be made available for index arrays? What
behavior should be the default?
In particular a number of different behaviors can be envisioned.
For index values beyond the end of the array we can:
o Raise an exception (as with single indexing for Python
sequences.)
o Clip the index (as with Python slicing) [current default]
o Wrap the index (mod size of dimension).
For negative index values:
o Raise an exception
o Index from end of array (as with Python) [planned default,
not yet implemented]
o Clip the index at 0 [current default]
o Wrap the index (mod size of dimension).
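The options listed in issue 9 can be compared with a small
pure-Python sketch. The take function below and its "high" and
"negative" keyword names are hypothetical, chosen for this
illustration; they are not the numarray interface. The defaults
mirror the behavior marked [current default] in the lists.

```python
def take(values, indices, high="clip", negative="clip"):
    """Gather values[i] for each index i.

    high:     "raise", "clip" or "wrap" for indices >= len(values)
    negative: "raise", "end", "clip" or "wrap" for indices < 0
    """
    n = len(values)
    result = []
    for i in indices:
        if i >= n:
            if high == "raise":
                raise IndexError(i)
            i = n - 1 if high == "clip" else i % n
        elif i < 0:
            if negative == "raise":
                raise IndexError(i)
            if negative == "end":
                if i < -n:
                    raise IndexError(i)
                i += n                # index from the end, as in Python
            elif negative == "clip":
                i = 0
            else:
                i %= n                # wrap modulo the dimension size
        result.append(values[i])
    return result


x = [2 * k for k in range(6)]         # [0, 2, 4, 6, 8, 10]
print(take(x, [4, 4, 0, 2]))          # [8, 8, 0, 4]
print(take(x, [7, -1]))               # [10, 0]
print(take(x, [7, -1], high="wrap", negative="end"))  # [2, 10]
```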
10) Should we implement 'indexing' with Bool arrays as being
equivalent to compress?
This isn't important to us, but has been wanted by others.
Who still wants it?
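The semantics in question for issue 10 can be shown with the
standard library's itertools.compress, used here as a stand-in on
plain lists: only the elements whose mask entry is true are kept.
Under the proposal, subscripting an array with a Bool array would
give the same selection.

```python
from itertools import compress

x = [0, 2, 4, 6, 8, 10]
mask = [v > 4 for v in x]        # [False, False, False, True, True, True]

print(list(compress(x, mask)))   # [6, 8, 10]
```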
11) Should comparisons of array type objects work with the traditional
type codes?
We could enable comparisons of the type objects with the traditional
single character typecodes to yield true on effective matches.
Would this be a good thing? The benefit is that it reduces the backward
incompatibility. An alternative is to add a typecode method that
would return the typecode string consistent with Numeric.
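Both alternatives in issue 11 can be sketched with a toy type
object. NumericType is a hypothetical class written for this
illustration, not numarray's implementation; 'f' and 'd' are the
single-character codes Numeric uses for Float32 and Float64.

```python
class NumericType:
    """Hypothetical type object that also matches its old typecode."""

    def __init__(self, name, code):
        self.name = name
        self._code = code

    def typecode(self):
        # Alternative: expose the Numeric-style code through a method.
        return self._code

    def __eq__(self, other):
        if isinstance(other, str):
            return other == self._code      # match a traditional typecode
        if isinstance(other, NumericType):
            return other.name == self.name
        return NotImplemented

    def __hash__(self):
        return hash(self.name)

    def __repr__(self):
        return self.name


Float32 = NumericType("Float32", "f")
Float64 = NumericType("Float64", "d")

print(Float64 == "d")        # True
print(Float64 == "f")        # False
print(Float64 == Float32)    # False
print(Float64.typecode())    # d
```

One cost of the first approach: an object that compares equal to
a string but hashes differently from it can behave surprisingly
when used as a dictionary key, which the typecode method avoids.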
New capabilities:
o Record arrays: arrays of records that have fixed format fields.
o Character arrays: arrays of fixed-length character strings.
o Support for memory mapped files.
o Use of index arrays within subscripts: e.g.
if ind = array([4, 4, 0, 2]) and x = 2*arange(6),
x[ind] results in array([8, 8, 0, 4])
o Support for byteswapped representation.
o tofile and fromfile functionality which eliminates the need to
use strings to read and write data from and to files.
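As a rough illustration of two of these capabilities, Python's
standard-library array module (not numarray) offers similar
tofile, fromfile, and byteswap operations. Note one difference:
array.byteswap converts the data in place, whereas the goal stated
earlier for numarray is to reference byteswapped data without
converting it. The file name below is arbitrary.

```python
import array
import os
import tempfile

values = array.array("d", [1.0, 2.5, -3.0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "values.bin")
    with open(path, "wb") as f:
        values.tofile(f)                 # raw binary, no string step

    restored = array.array("d")
    with open(path, "rb") as f:
        restored.fromfile(f, len(values))

print(restored == values)                # True

swapped = array.array("d", values)
swapped.byteswap()                       # reverse bytes of each element
print(swapped.tobytes() == values.tobytes())   # False
swapped.byteswap()                       # swapping twice restores the data
print(swapped == values)                 # True
```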
Things yet to be done:
o More extensive C API; templates and examples for how to write
C extensions. (currently top priority)
o Reimplement complex types in C.
o Masked Arrays
o Add all the commonly available Numeric libraries to numarray.
o An array type for generic Python objects. Various possibilities
exist: one that requires them all to derive from a specified
base class, and one that doesn't. Special purpose ones? (e.g.,
one for Python strings?)
o Benchmarking and optimization. For example, some of the
functions could be reimplemented with much better performance
(e.g., array printing).
The early beta will be available for download from
http://sourceforge.net/projects/numpy as package numarray.
We are in the process of developing a revised version of the
Numeric manual for numarray (all in all, there is a great
deal of commonality).
We have developed a brief document summarizing the differences
between Numeric and numarray for use in the meantime.