Sunday, December 16, 2012

Passing the torch of NumPy and moving on to Blaze

I wrote this letter tonight to the NumPy mailing list --- a list I have been actively participating in for nearly 15 years.


Hello all, 

There is a lot happening in my life right now and I am spread quite thin among the various projects that I take an interest in. In particular, I am thrilled to publicly announce on this list that Continuum Analytics has received DARPA funding (to the tune of at least $3 million) for Blaze, Numba, and Bokeh, which we are writing to take NumPy, SciPy, and visualization into the domain of very large data sets. This is part of the XDATA program, and I will be taking an active role in it. You can read more about Blaze here: http://blaze.pydata.org. You can read more about XDATA here: http://www.darpa.mil/Our_Work/I2O/Programs/XDATA.aspx

I personally think Blaze is the future of array-oriented computing in Python. I will be putting effort and resources behind making that case next year. How it interacts with future incarnations of NumPy, Pandas, or other projects is an interesting and open question. I have no doubt the future will be a rich ecosystem of interoperating array-oriented data-structures. I invite anyone interested in Blaze to participate in the discussions and development at https://groups.google.com/a/continuum.io/forum/#!forum/blaze-dev or watch the project on our public GitHub repo: https://github.com/ContinuumIO/blaze. Blaze is being incubated under the ContinuumIO GitHub project for now, but I hope it will receive its own GitHub project page later next year. Development of Blaze is early but we are moving rapidly with it (and have deliverable deadlines --- thus, while we will welcome input and pull requests, we won't have a ton of time to respond to simple queries until at least May or June). There is more that we are working on behind the scenes with respect to Blaze that will be coming out next year as well but isn't quite ready to show yet.

As I look at the coming months and years, my time for direct involvement in NumPy development is therefore only going to get smaller.  As a result it is not appropriate that I remain as "head steward" of the NumPy project (a term I prefer to BDF12 or anything else).   I'm sure that it is apparent that while I've tried to help personally where I can this year on the NumPy project, my role has been more one of coordination, seeking funding, and providing expert advice on certain sections of code.    I fundamentally agree with Fernando Perez that the responsibility of care-taking open source projects is one of stewardship --- something akin to public service.    I have tried to emulate that belief this year --- even while not always succeeding.  

It is time for me to make official what is already becoming apparent to observers of this community, namely, that I am stepping down as someone who might be considered "head steward" for the NumPy project and officially leaving the development of the project in the hands of others in the community.   I don't think the project actually needs a new "head steward" --- especially from a development perspective.     Instead I see a lot of strong developers offering key opinions for the project as well as a great set of new developers offering pull requests.  

My strong suggestion is that development discussions of the project continue on this list with consensus among the active participants being the goal for development. I don't think 100% consensus is a rigid requirement --- but certainly a super-majority should be the goal, and serious changes should not be made without a clear consensus. I would pay special attention to under-represented people (users with intense usage of NumPy but small voices on this list). There are many of them. If you push me for specifics then at this point in NumPy's history, I would say that if Chuck, Nathaniel, and Ralf agree on a course of action, it will likely be a good thing for the project. I suspect that even if only 2 of the 3 agree at one time it might still be a good thing (but I would expect more detail and discussion). There are others whose opinion should be sought as well: Ondrej Certik, Perry Greenfield, Stefan van der Walt, David Warde-Farley, Pauli Virtanen, Robert Kern, David Cournapeau, Francesc Alted, and Mark Wiebe, to name a few (there are many other people as well whose opinions can only help NumPy). For some questions, I might even seek input from people like Konrad Hinsen and Paul Dubois --- if they have time to give it. I will still be willing to offer my view from time to time, and if I am asked.

Greg Wilson (of Software Carpentry fame) asked me recently what letter I would have written to myself 5 years ago.   What would I tell myself to do given the knowledge I have now?     I've thought about that for a bit, and I have some answers.   I don't know if these will help anyone, but I offer them as hopefully instructive:   

1) Do not promise to not break the ABI of NumPy --- and in fact emphasize that it will be broken at least once in the 1.X series. NumPy was designed to add new data-types --- but not without breaking the ABI. NumPy has needed more data-types and still needs even more. While it's not beautifully simple to add new data-types, it can be done. But, it is impossible to add them without breaking the ABI in some fashion. The desire to add new data-types *and* keep ABI compatibility has led to significant pain. I think the ABI non-breakage goal has been amplified by the poor state of package management in Python. The fact that it's painful for someone to update their downstream packages when an upstream ABI breaks (on Windows and Mac in particular) has put a lot of unfortunate pressure on this community, pressure that was not envisioned or understood when I was writing NumPy.

(As an aside: This is one reason Continuum has invested resources in building the conda tool and a completely free set of binary packages called Anaconda CE, which is becoming more and more usable thanks to the efforts of Bryan Van de Ven and Ilan Schnell and our testing team at Continuum. The conda tool (http://docs.continuum.io/conda/index.html) is open source and BSD licensed, and the next release will provide the ability to build packages, build indexes on package repositories, and interface with pip. Expect a blog-post in the near future about how cool conda is!)

2) Don't create array-scalars.  Instead, make the data-type object a meta-type object whose instances are the items returned from NumPy arrays.   There is no need for a separate array-scalar object and in fact it's confusing to the type-system.    I understand that now.  I did not understand that 5 years ago.   

3) Special-case small arrays to avoid the memory indirection and look at PDL so that generalized ufuncs are supported from the beginning.

4) Define missing-value data-types and labels on the dimensions and arrays.

5) Define a standard "dictionary of NumPy arrays" interface as the basic "structure of arrays" concept to go with the "array of structures" that structured arrays provide.

6) Start work on a SQL interface to NumPy arrays *now*.
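
For readers who want to see points 2 and 5 concretely, here is a minimal sketch, assuming a reasonably recent NumPy installation. The variable names are made up for illustration, and the plain dictionary at the end is only one possible reading of a "dictionary of NumPy arrays", not a proposed standard interface.

```python
import numpy as np

# Point 2: indexing an array returns an "array scalar", a separate NumPy type
# that is reachable from the dtype (dtype.type) but is not the dtype itself.
a = np.array([1.5, 2.5, 3.5], dtype=np.float64)
item = a[0]
print(type(item) is a.dtype.type)     # the item's class is the dtype's scalar type
print(isinstance(item, np.generic))   # all array scalars derive from np.generic
print(isinstance(a.dtype, type))      # the dtype object is an instance, not a class

# Point 5: "array of structures" (a structured array) next to a
# "structure of arrays" (here simply a dict holding one array per field).
records = np.array(
    [(1, 2.0), (2, 4.0), (3, 6.0)],
    dtype=[("id", np.int64), ("value", np.float64)],
)
# Hypothetical layout: one contiguous array per field, keyed by field name.
columns = {name: np.ascontiguousarray(records[name]) for name in records.dtype.names}

print(records["value"].strides)   # the field view steps over whole records
print(columns["value"].strides)   # the column steps over its own items only
```

The strides show the practical difference: a field pulled out of the structured array still walks across every record, while each entry of the dictionary is a packed array of its own.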

Additional comments I would make to someone today: 

1) Most of NumPy should be written in Python with Numba used as the compiler (particularly as soon as Numba gets the ability to create Python extension modules, which is in the next release).
2) There are still many, many optimizations that can be made in NumPy run-time (especially in the face of modern hardware). 

I will continue to be available to answer questions and I may chime in here and there on pull requests.    However, most of my time for NumPy will be on administrative aspects of the project where I will continue to take an active interest.    To help make sure that this happens in a transparent way,  I would like to propose that "administrative" support of the project be left to the NumFOCUS board of which I am currently 1 of 9 members.   The other board members are currently:  Ralf Gommers, Anthony Scopatz, Andy Terrel, Prabhu Ramachandran, Fernando Perez, Emmanuelle Gouillart, Jarrod Millman, and Perry Greenfield.      While NumFOCUS basically seeks to promote and fund the entire scientific Python stack,   I think it can also play a role in helping to administer some of the core projects which the board members themselves have a personal interest in. 

By administrative support, I mean decisions like "what should be done with any NumPy IP or web-domains" or "what kind of commercially-related ads or otherwise should go on the NumPy home page", or "what should be done with the NumPy GitHub account", etc. --- basically anything that requires an executive decision that is not directly development related. I don't expect there to be many of these decisions. But, when they show up, I would like them to be made in as transparent and public a way as possible. In practice, the way I see this working is that there are members of the NumPy community who are (like me) particularly interested in admin-related questions and serve on a NumPy team in the NumFOCUS organization. I just know I'll be attending NumFOCUS board meetings, and I would like to help move administrative decisions forward with NumPy as part of the time I spend thinking about NumFOCUS.

If people on this list would like to play an active role in those admin discussions, then I would heartily welcome them into NumFOCUS membership where they would work with interested members of the NumFOCUS board (like me and Ralf) to direct that organization.    I would really love to have someone from this list volunteer to serve on the NumPy team as part of the NumFOCUS project.   I am certainly going to be interested in the opinions of people who are active participants on this list and on GitHub pages for NumPy on anything admin related to NumPy, and I expect Ralf would also be very interested in those views.

One admin discussion that I will bring up in another email (as this one is already too long) is about making 2 or 3 lists for NumPy such as numpy-admin@numpy.org, numpy-dev@numpy.org, and numpy-users@numpy.org.

Just because I'll be spending more time on Blaze, Numba, Bokeh, and the PyData ecosystem does not mean that I won't be around for NumPy. I will continue to promote NumPy. My involvement with Continuum connects me to NumPy as Continuum continues to offer commercial support contracts for NumPy (and SciPy and other open source projects). Continuum will also continue to maintain its GitHub NumPy project, which will contain pull requests from our company that we are working to get into the mainline branch. Continuum will also continue to provide resources for release-management of NumPy (we have been funding Ondrej in this role for the past 6 months --- though I would like to see this happen through NumFOCUS in the future even if Continuum provides much of the money). We also offer optimized versions of NumPy in our commercial Anaconda distribution (Anaconda CE is free and open source).

Also, I will still be available for questions and help (I'm not disappearing --- just making it clear that I'm stepping back into an occasional NumPy developer role).   It has been extremely gratifying to see the number of pull-requests, GitHub-conversations, and code contributions increase this year.   Even though the 1.7 release has taken a long time to stabilize, there have been a lot of people participating in the discussion and in helping to track down the problems, figure out what to do, and fix them.    It even makes it possible for people to think about 1.7 as a long-term release.  

I will continue to hope that the spirit of openness, tolerance, respect, and gratitude continues to permeate this mailing list, and that we continue to seek to resolve any differences with trust and mutual respect. I know I have offended people in the past with quick remarks and actions made sometimes in haste without fully realizing how they might be taken. But, I also know that like many of you I have always done the very best I could for moving Python for scientific computing forward in the best way I know how.

Thank you for the great memories.   If you will forgive a little sentiment:  My daughter who is in college now was 3 years old when I began working with this community and went down a road that would lead to my involvement with SciPy and NumPy.   I have marked the building of my family and the passage of time with where the Python for Scientific Computing Community was at.   Like many of you, I have given a great deal of attention and time to building this community.   That sacrifice and time has led me to love what we have created.    I know that I leave this segment of the community with the tools in better hands than mine.   I am hopeful that NumPy will continue to be a useful array library for the Python community for many years to come even as we all continue to build new tools for the future. 

Very best regards,

-Travis