Our engineers have fine-tuned data processing, striking a chord of speed and efficiency.
Astronomers are continually rewarded with ever-greater quantities of data to analyze from a range of space- and ground-based telescopes. Following the launch of the Nancy Grace Roman Space Telescope, the volume will increase yet again: Roman will survey large areas of the sky, capturing datasets that need to be analyzed with cloud-based software systems. For several years now, staff at the institute have been working to address these well-known challenges.
In 2023, engineers at the institute’s Barbara A. Mikulski Archive for Space Telescopes (MAST) marked a major accomplishment: They ingested a comprehensive 1.3-petabyte catalog of the night sky, known as the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS), in only 12 days. Had they used their previous high-touch workflow, that same project might have taken months. Another example: Migrating 21 terabytes of data from the Galaxy Evolution Explorer (GALEX) took 120 hours with the old workflow; the same data could now be migrated in under 5 hours, almost 26 times faster.
This sea change was made possible by the engineers’ adoption of Apache Airflow, an open-source workflow management platform that handles data intake and can be configured to coordinate seamlessly with other software systems. The platform’s future applications are wide-ranging: it will, for example, support astronomers on the team as they ingest, process, and release new science products based on data from the James Webb and Hubble space telescopes, as well as the upcoming Nancy Grace Roman Space Telescope. Here, team members Sharon Shen, Johnny Gossick, and Ben Jordan explain how shifting to Airflow changed how quickly the institute can serve data to the world.
What problem did Airflow solve?
Sharon Shen: Pan-STARRS is a huge dataset. It used to take months to process and release through MAST, but now we can release it in under two weeks. It’s also changed how we approach other work throughout MAST. Our astronomers receive a lot of community-contributed high-level data products every month, and we archive them on MAST to benefit the research community. The previous process required a lot of labor and attention to detail from our colleagues. With Airflow, we reduced and simplified what they need to contribute by automating the process and standardizing the systems that process the data.
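Airflow pipelines are written in Python as DAGs (directed acyclic graphs of tasks). As a rough illustration of the kind of standardization Shen describes, a contributed-product ingest flow might look like the sketch below; the task names, path, and steps are invented for illustration, not MAST's actual pipeline.

```python
# Hypothetical sketch of a standardized ingest pipeline; the task names,
# path, and steps are illustrative, not MAST's actual workflow.
from airflow.decorators import dag, task
import pendulum

@dag(schedule=None, start_date=pendulum.datetime(2023, 1, 1), catchup=False)
def hlsp_ingest():
    @task
    def validate(delivery_path: str) -> str:
        # Check file formats and metadata before anything touches the archive.
        print(f"Validating {delivery_path}")
        return delivery_path

    @task
    def standardize(delivery_path: str) -> str:
        # Apply the same normalization rules to every contributed product.
        print(f"Standardizing {delivery_path}")
        return delivery_path

    @task
    def archive(delivery_path: str) -> None:
        # Register the product so it becomes searchable in the archive.
        print(f"Archiving {delivery_path}")

    # Chaining the calls defines the dependency order of the pipeline.
    archive(standardize(validate("/deliveries/incoming")))

hlsp_ingest()
```

Because every contribution runs through the same DAG, contributors no longer need to know the internal steps; they only deliver data in the expected form.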
Ben Jordan: When our on-staff scientists converted raw telescope data into something usable for researchers, the old process had many manual touchpoints. For example, if an issue arose, an engineer had to figure out what was happening, which took time. Plus, the old conversion process did not execute tasks in parallel, which meant each step had to be completed before the next could begin. Processing one dataset might have taken a whole day. With Airflow, we can run a whole batch of tasks at the same time.
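The parallelism Jordan describes is a core Airflow feature: one step can fan out across a whole batch using dynamic task mapping. A minimal sketch, with made-up dataset names standing in for real observations:

```python
# Illustrative only: the dataset names are invented, but the pattern is
# Airflow's dynamic task mapping, which fans one step out across a batch.
from airflow.decorators import dag, task
import pendulum

@dag(schedule=None, start_date=pendulum.datetime(2023, 1, 1), catchup=False)
def batch_conversion():
    @task
    def list_datasets() -> list[str]:
        # In the old serial workflow, each of these waited for the last.
        return ["obs_001", "obs_002", "obs_003"]

    @task
    def convert(dataset: str) -> str:
        # Each mapped task instance runs as its own unit of work.
        print(f"Converting {dataset}")
        return dataset

    @task
    def summarize(converted: list[str]) -> None:
        print(f"Finished {len(converted)} datasets")

    # expand() creates one convert task per dataset, all eligible to run at once.
    summarize(convert.expand(dataset=list_datasets()))

batch_conversion()
```

Because each mapped task is an independent instance, the scheduler can dispatch the whole batch to workers simultaneously instead of serializing them.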
Another benefit is automation. We no longer need somebody to keep track of files’ statuses and decide when things are ready for the next step. Instead, we set Airflow to manage that automatically, either by setting a schedule or by having the platform send emails or notifications to the relevant people when a step is complete. And since our engineers and astronomers are already familiar with its programming language, Python, there’s virtually no learning curve.
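Both automation hooks Jordan mentions, a schedule and a completion notice, amount to a few lines in an Airflow DAG. In this hedged sketch, the cron expression, address, and task logic are placeholders:

```python
# Sketch of the two automation hooks described above: a cron-style schedule
# and an email when a step completes. The address and timing are placeholders.
from airflow.decorators import dag, task
from airflow.operators.email import EmailOperator
import pendulum

@dag(
    schedule="0 6 * * *",  # run daily at 06:00 instead of a human kicking it off
    start_date=pendulum.datetime(2023, 1, 1),
    catchup=False,
)
def nightly_status():
    @task
    def check_file_statuses() -> str:
        # Stand-in for the bookkeeping a person used to do by hand.
        return "All deliveries processed"

    notify = EmailOperator(
        task_id="notify_team",
        to="archive-team@example.edu",
        subject="Ingest step complete",
        html_content="{{ ti.xcom_pull(task_ids='check_file_statuses') }}",
    )

    # The email goes out only after the status check succeeds.
    check_file_statuses() >> notify

nightly_status()
```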
Why is this new workflow so different?
Jordan: Imagine an orchestra conductor in front of the musicians. Each musician plays an individual instrument, but together they produce a complex piece of music. That’s what Airflow does. It doesn’t actually do the work itself, but it makes sure that the other resources we rely on are available and working in unison.
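In code, the conductor role shows up as tasks that trigger and monitor work running elsewhere rather than performing it in-process. A minimal sketch, assuming a hypothetical submission script and completion-marker file:

```python
# Airflow as conductor: these tasks hand work to external systems and watch
# for results. The script path and marker file are hypothetical.
from airflow.decorators import dag
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor
import pendulum

@dag(schedule=None, start_date=pendulum.datetime(2023, 1, 1), catchup=False)
def conducted_pipeline():
    # Kick off processing on another system; Airflow only issues the command.
    submit = BashOperator(
        task_id="submit_job",
        bash_command="submit_processing_job.sh /data/incoming",
    )

    # Wait until the external system drops a completion marker.
    done = FileSensor(
        task_id="wait_for_results",
        filepath="/data/results/_DONE",
        poke_interval=60,
    )

    submit >> done

conducted_pipeline()
```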
Shen: Any industry can use Airflow. It has a really big, open community that keeps growing. It’s scalable and you can plug in tools from other services. At the institute, we always favor open-source software because we can change the code to meet our needs and then contribute that updated code to support more users.
Johnny Gossick: Airflow complements our existing technology. For example, we have an on-premises platform called Rancher that manages virtual machines. Each machine we tap into provides a different resource, but they operate together and can complete the work much faster. Managing a bunch of virtual machines and their services is tough, but by integrating Airflow with Rancher, we can orchestrate the workflow a lot faster.
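One plausible shape of that integration, since Rancher commonly fronts Kubernetes clusters, is running each Airflow task as its own pod via Airflow's Kubernetes provider. The image, namespace, and command below are placeholders, not the institute's actual configuration:

```python
# Hedged sketch: run a task as a pod on a Rancher-managed Kubernetes cluster.
# Image, namespace, and command are placeholders.
from airflow.decorators import dag
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
import pendulum

@dag(schedule=None, start_date=pendulum.datetime(2023, 1, 1), catchup=False)
def cluster_backed_ingest():
    # Each run launches a fresh pod; the cluster supplies the compute,
    # Airflow supplies the coordination.
    KubernetesPodOperator(
        task_id="convert_on_cluster",
        name="convert-on-cluster",
        namespace="mast-pipelines",
        image="example/convert-tool:latest",
        cmds=["convert"],
        arguments=["--input", "/data/raw", "--output", "/data/products"],
        get_logs=True,
    )

cluster_backed_ingest()
```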
How will this change work for on-staff scientists when they upload new products?
Shen: Before, if scientists encountered issues on their own machines, they frequently spent a lot of time trying to figure out what went wrong or asking our team for help. Usually, the issue was related to the local environment on their individual machines. Now, we all work in a unified environment in the cloud, which means it’s far more powerful and the computing environment is the same for everyone.
Gossick: These advances will help our colleagues focus on their areas of expertise, only visiting the interface to monitor or trigger jobs as needed. They don’t need to understand every facet of the process to complete their work successfully.
How could Airflow be used to support the ingestion of data from the upcoming Nancy Grace Roman Space Telescope?
Shen: For the Roman mission, we need to build scalable processes that can handle very large datasets that will come in continuously. We can use Airflow to pick up those data and archive them automatically when they are delivered to us on the ground.
Jordan: Once those raw data are collected, their arrival will kick off predetermined steps to process and ingest them into MAST. What’s great about this is that it will all happen automatically, without needing a human to initiate the process.
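Neither answer specifies the mechanism, but Airflow's sensors fit this event-driven pattern: a task waits for a delivery to land in cloud storage, and its arrival releases the downstream processing chain. A hedged sketch, with the bucket, key pattern, and task bodies invented for illustration:

```python
# Hedged sketch of event-driven ingest: wait for a new delivery in cloud
# storage, then run the processing chain. Bucket, key, and tasks are invented.
from airflow.decorators import dag, task
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
import pendulum

@dag(schedule="@hourly", start_date=pendulum.datetime(2023, 1, 1), catchup=False)
def roman_auto_ingest():
    # Block until a delivery manifest appears; the wildcard matches any delivery.
    arrival = S3KeySensor(
        task_id="wait_for_delivery",
        bucket_name="example-roman-deliveries",
        bucket_key="incoming/*/manifest.json",
        wildcard_match=True,
        poke_interval=300,
    )

    @task
    def process_delivery() -> str:
        # Stand-in for the predetermined processing steps.
        return "processed"

    @task
    def ingest_to_archive(status: str) -> None:
        print(f"Archive ingest complete: {status}")

    # Processing begins only after the sensor confirms the data have arrived.
    processed = process_delivery()
    arrival >> processed
    ingest_to_archive(processed)

roman_auto_ingest()
```

With a pattern like this, no one has to notice that a delivery arrived; detecting it is itself a scheduled task.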