Delivering TESS Data to the World
M. Fox (mfox[at]stsci.edu) and I. Momcheva (imomcheva[at]stsci.edu)
The first data release from the Transiting Exoplanet Survey Satellite (TESS) was one of the most anticipated events of 2018 across the exoplanet community. In the months leading up to the first release, engineers and scientists across the Data Management Division (DMD) and the Data Science Mission Office (DSMO) spent significant time and effort setting up and testing our infrastructure, developing tools and interfaces, and preparing documentation and tutorials so that users would have a smooth experience downloading the data and getting started on their analysis. In this article we highlight some of the innovative solutions and services we used for this release to give you a "behind-the-scenes" look at the first TESS data release.
Our primary concern was our ability to serve data to users immediately after the release. Based on our previous experience hosting the Kepler and K2 datasets, we expected that the first public release—of Sectors 1 and 2 scheduled for early December 2018—would drive the highest volume of downloads within the shortest amount of time. While we expected that most users would only download data for their targets of interest, we also anticipated that many would download all the data in the release. With a data volume of ~6.9 TB in the first release (~3.5 TB per sector), multiple concurrent requests could quickly saturate our capacity to deliver data and result in slow downloads and time-outs for many users. Our highest demand prediction required the Mikulski Archive for Space Telescopes (MAST) to be capable of delivering 400 TB of data per day, which is over six times the current MAST bandwidth capacity! Clearly, we would need to make use of additional infrastructure.
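As a rough check on these figures, the arithmetic can be sketched as follows. Decimal units (1 TB = 10^12 bytes) are an assumption, and reading "over six times" as exactly six gives only an upper bound on the MAST capacity of the time, not a measured value.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TB = 10**12  # assumption: decimal terabytes
GB = 10**9   # assumption: decimal gigabytes


def sustained_rate_gb_per_s(tb_per_day):
    """Average rate in GB/s needed to deliver tb_per_day within one day."""
    return tb_per_day * TB / SECONDS_PER_DAY / GB


peak_demand_tb_per_day = 400  # highest demand prediction quoted in the text
release_volume_tb = 6.9       # Sectors 1 and 2, as quoted in the text

required_rate = sustained_rate_gb_per_s(peak_demand_tb_per_day)
full_copies_per_day = peak_demand_tb_per_day / release_volume_tb
# "Over six times" the capacity implies a capacity below 400 / 6 TB per day.
implied_capacity_ceiling = peak_demand_tb_per_day / 6

print(f"required sustained rate: {required_rate:.2f} GB/s")
print(f"equivalent full-release downloads per day: {full_copies_per_day:.0f}")
print(f"implied MAST capacity: under {implied_capacity_ceiling:.0f} TB/day")
```

In other words, the worst-case prediction corresponds to a sustained rate of several GB/s, or a few dozen complete copies of the release leaving the archive every day.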
Evaluating third-party infrastructure options required us to set up performance tests that simulated the expected global demand. Our team decided to use Amazon Web Services (AWS) Lambda to drive these performance tests. Lambda is a serverless compute service in which a user writes a function and executes it in different geographic regions around the world. We set up a simple Lambda function that simulated a user downloading TESS data products from MAST, checked for errors, and recorded performance metrics such as connect time, transfer time, and bytes downloaded. The Lambda function was run in multiple global regions, each with a pre-configured concurrency level. For example, for a concurrency of 1000, each region would run 1000 concurrent instances of our Lambda function, simulating 1000 users requesting data simultaneously.
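A minimal sketch of what such a download-simulating function might look like in Python is given below. It is not the code used for the tests. The `url` event field, the function names, and the chunk size are hypothetical, and the "connect time" here is approximated by the time until the response headers arrive.

```python
import time
import urllib.request


def timed_download(url, opener=urllib.request.urlopen, chunk_size=1 << 20):
    """Download url, discard the bytes, and return simple timing metrics."""
    metrics = {
        "url": url,
        "error": None,
        "bytes_downloaded": 0,
        "connect_time_s": None,
        "transfer_time_s": None,
    }
    start = time.perf_counter()
    try:
        with opener(url) as response:
            # Time until the response is available (approximates connect time).
            connected = time.perf_counter()
            metrics["connect_time_s"] = connected - start
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                metrics["bytes_downloaded"] += len(chunk)
            metrics["transfer_time_s"] = time.perf_counter() - connected
    except Exception as exc:
        # Record the failure instead of aborting the whole test run.
        metrics["error"] = type(exc).__name__
    return metrics


def lambda_handler(event, context):
    """AWS Lambda entry point; event["url"] is a hypothetical input field."""
    return timed_download(event["url"])
```

The `opener` argument exists only so the function can be exercised without network access; in a deployed function the default `urllib.request.urlopen` would be used.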
To understand the performance limits of third-party services, we needed to analyze the data captured in the tests. An elegant solution was to push the metrics and error codes directly from our Lambda function to AWS CloudWatch, an AWS service that provides a graphical interface for analyzing captured metrics. Tuning the performance test was mostly a matter of adjusting the Lambda concurrency and monitoring the CloudWatch graphs to note performance trends and any errors.
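Pushing metrics to CloudWatch from Python is typically done with the boto3 `put_metric_data` call. The sketch below shows one way the payload could be assembled. The namespace, metric names, dimension, and the keys of the input dictionary are all hypothetical choices for illustration, not those used in the actual tests.

```python
def build_metric_data(metrics, region):
    """Convert one download result into CloudWatch MetricData entries."""
    dimensions = [{"Name": "Region", "Value": region}]
    entries = [
        ("ConnectTime", metrics["connect_time_s"], "Seconds"),
        ("TransferTime", metrics["transfer_time_s"], "Seconds"),
        ("BytesDownloaded", metrics["bytes_downloaded"], "Bytes"),
        ("Errors", 0 if metrics["error"] is None else 1, "Count"),
    ]
    # Skip timings that were never measured (for example, a failed connection).
    return [
        {"MetricName": name, "Value": float(value), "Unit": unit,
         "Dimensions": dimensions}
        for name, value, unit in entries
        if value is not None
    ]


def publish(metrics, region, namespace="TESS/LoadTest"):
    """Send the metrics to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so build_metric_data works without AWS

    client = boto3.client("cloudwatch")
    client.put_metric_data(
        Namespace=namespace,
        MetricData=build_metric_data(metrics, region),
    )


# Hypothetical result of one simulated download.
sample = {"connect_time_s": 0.12, "transfer_time_s": 3.4,
          "bytes_downloaded": 35_000_000, "error": None}
payload = build_metric_data(sample, region="ap-northeast-1")
print([entry["MetricName"] for entry in payload])
```

Tagging each data point with a region dimension is what would allow the graphs to be split by geographic region when looking for performance trends.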
We evaluated two third-party infrastructure solutions: a cloud object storage provider and a content delivery network (CDN) service. Cloud object storage (e.g., AWS S3) provides high availability, but while it could potentially double our bandwidth, it fell short of our goal to get to six times our current bandwidth rates. Here is where the CDN service really shines, particularly given our expected large global concurrent demand. A CDN service caches files in data centers around the world. For example, when someone in Tokyo requests a file, the CDN caches a copy of that file at the CDN edge data center in Japan. All future requests from users in Japan would be delivered from the CDN edge data center there, thus taking pressure off MAST and using edge data centers around the globe to serve TESS data products. Our Lambda tests showed that CDN performance scaled linearly up to a point we felt would meet our expected demand.
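The effect of edge caching on the origin can be illustrated with a toy model. This is a deliberate simplification that ignores cache expiry, eviction, and simultaneous first requests. The edge names and file name are made up for the example.

```python
from collections import Counter


class EdgeCache:
    """Toy CDN: one cache per edge location in front of a single origin."""

    def __init__(self):
        self.cached = {}                  # edge location -> cached file names
        self.origin_requests = Counter()  # requests that reached the origin
        self.edge_hits = Counter()        # requests served from an edge copy

    def request(self, edge, filename):
        files = self.cached.setdefault(edge, set())
        if filename in files:
            self.edge_hits[edge] += 1
            return "edge"
        # Cache miss: fetch once from the origin, then keep a copy at the edge.
        self.origin_requests[edge] += 1
        files.add(filename)
        return "origin"


cdn = EdgeCache()
requests = [("tokyo", "sector1.fits")] * 1000 + [("paris", "sector1.fits")] * 500
served_from = Counter(cdn.request(edge, name) for edge, name in requests)
print(served_from)
```

In this model the origin sees one request per file per edge location, however many users in that area ask for the file, which is why demand on the origin can stay flat while the edges absorb the spikes.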
On release day, December 6th, 2018, we observed the CDN reducing bandwidth demands on MAST (the CDN origin) and pushing out data products at record rates. During the highest demand periods we saw TESS products moving out to the community at rates up to 1.2 GB/s, more than twice the MAST capacity. Figure 1 shows a fine-grained record of the bandwidth as a function of time (binned in five-minute increments) within the first day after the release. Demand on the MAST servers (orange line) remained almost constant while the edge locations (blue line) experienced significant spikes. Within the first 24 hours of the release, users downloaded 24 TB of data; Figure 2 shows the global distribution of those users. Performance was consistent across the user community, regardless of location. We received no complaints from users, and there was a significant amount of excitement across the community, with many posting on social media about the seamless experience. Equally important, users requesting data from other MAST missions were unaffected by the spike in demand for TESS data.
All in all, the CDN allowed us to seamlessly serve a large volume of data within a short period of time to the global astronomical community. Even though the demand did not reach our worst-case estimates, we were satisfied with the technological solution we chose, and the experience gained from the TESS delivery can be transferred to future data releases where we expect a high volume of downloads.
A number of other innovative solutions were developed for the TESS data.
For users interested in exploring the full suite of MAST holdings for a given known exoplanet, TESS data were added to exo.MAST, an interface that caters to the needs of the exoplanet community. Users can search on a TESS threshold-crossing event (TCE), find the TESS-derived metadata, view the TESS light curves, and download the TESS data products for their targets of interest directly in exo.MAST.
The full frame images (FFIs), which are saved every 30 minutes by TESS (as opposed to the monthly cadence for Kepler), are expected to be a major source of new discoveries. As part of our mission to provide high-quality access to astronomical datasets, we built an image cutout service for TESS FFIs. Users can request image cutouts in the form of TESS pipeline-compatible target pixel files (TPFs) without needing to download the entire set of images (>1400 images with a total volume of >750 GB). For users who wish to have more direct control or who want to cut out every single star in the sky, the cutout software (a Python package) is publicly available and installable for local use. The main barrier to writing performant TESS FFI cutout software was the number of files that must be opened and read. To streamline the cutout process, we performed a certain amount of one-time work up front, which allows individual cutouts to proceed much more efficiently. This one-time data manipulation takes an entire sector of FFIs and builds one large (~45 GB) cube file for each camera chip, so that the cutout software need not access several thousand FFIs individually. Additionally, we transpose the image cube, putting time on the short axis, thus minimizing the number of seeks per cutout. By creating these data cubes up front, we achieved a significant increase in performance.
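Why the transposition reduces seeks can be seen by counting how many contiguous stretches of the file a cutout touches under two layouts. The sketch below assumes a row-major cube and assumes that the transposition makes time the fastest-varying (last) axis, which is one reading of the description above. The cube and cutout sizes are arbitrary toy values, and each contiguous run is taken as a stand-in for one seek.

```python
def contiguous_runs(offsets):
    """Number of contiguous runs in a collection of flat element offsets."""
    ordered = sorted(offsets)
    return 1 + sum(1 for a, b in zip(ordered, ordered[1:]) if b != a + 1)


def cutout_offsets(n_time, n_y, n_x, y0, x0, h, w, time_last):
    """Flat offsets of an h-by-w cutout through all frames of a row-major cube."""
    offsets = []
    for t in range(n_time):
        for y in range(y0, y0 + h):
            for x in range(x0, x0 + w):
                if time_last:  # axes ordered (y, x, time)
                    offsets.append((y * n_x + x) * n_time + t)
                else:          # axes ordered (time, y, x)
                    offsets.append((t * n_y + y) * n_x + x)
    return offsets


# Toy cube: 100 frames of 64 x 64 pixels, with a 5 x 5 cutout at (10, 10).
shape = dict(n_time=100, n_y=64, n_x=64, y0=10, x0=10, h=5, w=5)
runs_time_first = contiguous_runs(cutout_offsets(**shape, time_last=False))
runs_time_last = contiguous_runs(cutout_offsets(**shape, time_last=True))
print(f"time-first layout: {runs_time_first} runs")
print(f"time-last layout:  {runs_time_last} runs")
```

With time first, every row of the cutout in every frame is a separate read. With time last, each cutout row is a single contiguous stretch covering all frames, so the number of reads no longer grows with the number of frames.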
Helping users get from zero to science as fast as possible is also a major part of our work. To expedite this process for TESS, our archive scientists developed a series of tutorials in the form of Jupyter notebooks. These covered a wide range of topics, including reading and displaying different types of files, searching the catalog around a target, retrieving data from a guest investigator program, and creating FFI cutouts. All notebooks conformed to our internal style guide and were automatically tested to guarantee their functionality. Users were excited by the availability of "executable documentation," and at least one user announced on social media that the notebooks were helpful in the fast turn-around of a discovery. The notebook library is open to contributions! In the future, we hope to expand the notebook library to other missions supported by the institute.
A final notable development was the staging of all TESS data on Amazon Web Services (AWS). The goal here is to make the TESS data highly available next to the vast computational resources offered by AWS. Using these data from within the US-East (N. Virginia) AWS region does not incur any charges, but downloading data from this copy to other AWS regions or outside of AWS does. For any large-scale analysis that requires touching most or all of the data, or that needs a large amount of computational resources, we recommend that users consider using the AWS dataset. An example of accessing and working with the AWS dataset is available on the MASTLabs blog.
Hosting the TESS data at MAST has been an exciting experience across STScI data management. It allowed us to experiment with new technologies, prototype new services and be part of the worldwide wave of science delivered by this amazing mission. We look forward to the many discoveries to come.