Data and Computing
Astrophysics research relies on large amounts of data collected by detectors located far from the scientists, usually members of international collaborations, who work with the data. The computing challenges for detectors such as IceCube or HAWC include not only building large facilities to store and provide access to data but also developing services that allow hundreds of scientists all over the world to analyze the data quickly and efficiently.
At WIPAC, a team of IT specialists works with local researchers and international collaborators to design and operate a computing infrastructure and to develop the associated services needed to produce science results. After data are collected at the South Pole, our team is in charge of data filtering, storage, and transfer to the UW–Madison data center via satellite. Once the data are at UW–Madison, the main computing services are data processing, simulation, and data access. We are also working to fully integrate our computing infrastructure into the Open Science Grid (OSG), contributing to the largest distributed high-throughput computing infrastructure for research in the US. Shared resources of this kind give us the flexibility required to respond to peak research demands.
We manage facilities at WIPAC’s headquarters along with two other locations on the UW–Madison campus and the South Pole data center, where raw data from IceCube, ARA, and DM-Ice are collected and sent north for processing and archiving. The core services that provide these capabilities at WIPAC are a data processing and simulation cluster, with 4000 CPU cores and 350 GPUs, and four petabytes of disk storage in high-performance file systems.
The IceCube project generates about 700 terabytes of data every year. These data are processed and filtered so that a reduced set can be kept on the data center disks for analysis. The bulk of the raw data is archived in two off-site collaborating data centers at DESY Zeuthen and at the National Energy Research Scientific Computing Center (NERSC). There, the data are preserved for the long term and can be retrieved for future reprocessing as the need arises.
Our data center at WIPAC hosts the core IT infrastructure servers and about half of the storage servers. A second data center is located at UW–Madison in Chamberlin Hall, home of the Department of Physics. The other half of the storage servers are located there along with 90 percent of the data processing and simulation cluster. Finally, the Wisconsin Institutes for Discovery hosts the remaining GPU nodes in our cluster.
The most critical computing facility is the one at the South Pole, consisting of around 150 servers. One hundred of those are custom-built servers called “DOM hubs,” which directly connect to the detector sensors. There is one DOM hub for each string of detectors in the IceCube/IceTop configuration. The other 50 are commodity servers that host the real-time data acquisition, filtering, archiving, and transfer services.
IceCube data analysis often requires very large amounts of simulated data in order to develop selection techniques or to estimate the impact of experimental effects, such as ice impurities, on the data. We make use of distributed computing resources at several collaborating institutions to reach the vast amount of computing power needed to generate these simulated data sets.
From Data to Science
One terabyte of data is collected from the IceCube, ARA, and DM-Ice detectors daily and is filtered down to about 70 gigabytes, which we call Level 1 data, for satellite transmission to the north. The raw data are copied to disks that are shipped up from the South Pole every year. Once these disks arrive at UW–Madison, they are read back and transferred to a long-term archiving facility at NERSC, and a second copy is transferred to DESY Zeuthen.
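As a rough illustration of what these figures imply, the sketch below computes the fraction of daily data kept as Level 1 and the average link rate needed to send it north. It assumes decimal units (1 terabyte = 1000 gigabytes) and, by default, a transfer spread evenly over 24 hours; the function names and the `window_hours` parameter are hypothetical, and the actual satellite coverage window is not stated here.

```python
# Figures quoted in the text; decimal units are an assumption.
RAW_GB_PER_DAY = 1000.0     # about one terabyte collected per day
LEVEL1_GB_PER_DAY = 70.0    # about 70 gigabytes of Level 1 data per day


def reduction_fraction(raw_gb: float, filtered_gb: float) -> float:
    """Fraction of the raw volume that survives filtering."""
    return filtered_gb / raw_gb


def average_rate_mbit_per_s(gb_per_day: float, window_hours: float = 24.0) -> float:
    """Average rate (Mbit/s) needed to move gb_per_day within window_hours.

    window_hours is a hypothetical parameter: the default assumes the
    link is available around the clock, which may not be the case.
    """
    bits = gb_per_day * 1e9 * 8
    seconds = window_hours * 3600
    return bits / seconds / 1e6


if __name__ == "__main__":
    kept = reduction_fraction(RAW_GB_PER_DAY, LEVEL1_GB_PER_DAY)
    rate = average_rate_mbit_per_s(LEVEL1_GB_PER_DAY)
    print(f"Level 1 keeps {kept:.0%} of the daily raw volume")
    print(f"Average rate over 24 h: {rate:.2f} Mbit/s")
```

Under these assumptions, Level 1 keeps about 7 percent of the daily volume. A shorter transfer window raises the required rate proportionally, which can be explored by passing a smaller `window_hours`.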
At WIPAC, data are further processed offline to produce data samples, called Level 2, which provide the reconstruction of some of the particles recorded by the detectors, e.g., upgoing and downgoing muons in IceCube. Further levels of data filtering are established and implemented by different analysis working groups within the collaboration. Each group has specific requirements for direction and energy reconstruction as they look for specific neutrino and cosmic-ray data.
In addition to processing data from the detector, data simulations are continually produced. These simulations are critical to help scientists understand the physics signatures from outer space recorded in the detector. They include theoretical calculations for generating the input physical events and their propagation through the Earth and in the instrumented volume of the experiment. Simulation data also reproduce the detailed, calibrated hardware response. Production of simulation data is planned according to the science requirements and the availability of computing infrastructure at WIPAC and throughout its collaborating institutions.
The WIPAC cluster is the main data analysis facility for IceCube, and WIPAC researchers also use it to analyze data from ARA, HAWC, CTA, and DECO. For IceCube, the cluster delivers about 25 million CPU hours and close to 500,000 GPU hours for its analysis jobs every year.
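To put these yearly totals in context, the sketch below compares them with the theoretical capacity of the cluster size quoted earlier (4000 CPU cores and 350 GPUs). It is only a back-of-the-envelope estimate: it assumes those counts apply to the whole year, that every unit is available for all 8760 hours, and that the quoted hours were delivered by this hardware alone. The function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years and downtime


def capacity_hours(units: int, hours_per_year: int = HOURS_PER_YEAR) -> int:
    """Theoretical unit-hours per year if every unit ran continuously."""
    return units * hours_per_year


def fraction_used(delivered_hours: float, units: int) -> float:
    """Share of the theoretical yearly capacity taken by delivered_hours."""
    return delivered_hours / capacity_hours(units)


if __name__ == "__main__":
    # Figures quoted in the text; pairing them this way is an assumption.
    cpu_share = fraction_used(25_000_000, 4000)
    gpu_share = fraction_used(500_000, 350)
    print(f"CPU: {cpu_share:.1%} of theoretical capacity")
    print(f"GPU: {gpu_share:.1%} of theoretical capacity")
```

With these inputs, IceCube analysis jobs account for roughly 71 percent of the theoretical CPU capacity and roughly 16 percent of the theoretical GPU capacity.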
Services Enabling Science
The computing infrastructure supporting WIPAC experiments is a natural extension of detector systems. Servers and networking equipment at the experiment site are an integral part of analyzing and filtering data. The archiving and transferring services are the synapses that link the online data production sensors with the off-line data processing and analysis machinery. Scientific results are the ultimate goal of our research, and a powerful and complex computing infrastructure is essential for turning data into scientific discoveries.