Data Analysis Systems and Facilities training module

Data analysis frameworks in high-energy physics (HEP) are often built by physicists themselves. Several C++-based frameworks, often steered from Python, are currently in use. These frameworks are closely integrated with the data formats used for archival storage, for example ROOT. Newer frameworks in the Python ecosystem, e.g. Scikit-HEP, are also being developed and build on advances in the larger data science ecosystem. HEP students and postdocs are often familiar with using these tools, but typically know little about how they work under the hood. This training module reviews the evolving HEP analysis ecosystem, including its performance and I/O limitations and tradeoffs, newer strategies such as columnar data analysis, and the implications for designing modern analysis facilities. The goal of the module is to prepare students to develop scalable, performant, and innovative tools within this ecosystem.
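
As a rough illustration of the columnar idea mentioned above, the following minimal sketch contrasts an event-loop selection with the same selection expressed column by column. It is not taken from the module material: the toy events, the field names `pt` and `eta`, the cut values, and the helper functions `select_rowwise` and `select_columnar` are all hypothetical. It uses only the Python standard library to show the difference in data layout; in practice, array libraries apply such column-wise operations with compiled, vectorized kernels.

```python
# Hypothetical toy events: transverse momentum (GeV) and pseudorapidity.
# Row-wise layout: one record per event, as an event loop would see it.
events = [
    {"pt": 25.0, "eta": 0.5},
    {"pt": 12.0, "eta": -1.2},
    {"pt": 40.0, "eta": 2.8},
    {"pt": 33.0, "eta": -0.1},
]


def select_rowwise(events, pt_min=20.0, eta_max=2.4):
    """Event loop: visit each record and test all of its fields at once."""
    return [
        ev["pt"]
        for ev in events
        if ev["pt"] > pt_min and abs(ev["eta"]) < eta_max
    ]


# Columnar layout: one sequence per quantity, all of the same length.
columns = {
    "pt": [ev["pt"] for ev in events],
    "eta": [ev["eta"] for ev in events],
}


def select_columnar(columns, pt_min=20.0, eta_max=2.4):
    """Columnar: build one boolean mask per column, combine, then apply."""
    pt_mask = [pt > pt_min for pt in columns["pt"]]
    eta_mask = [abs(eta) < eta_max for eta in columns["eta"]]
    mask = [a and b for a, b in zip(pt_mask, eta_mask)]
    return [pt for pt, keep in zip(columns["pt"], mask) if keep]


rowwise_result = select_rowwise(events)
columnar_result = select_columnar(columns)
print(rowwise_result)
print(columnar_result)
```

Both functions select the same events. The difference is that the columnar version touches only the columns it needs and expresses each cut as an operation over a whole column, which is the pattern that array libraries can execute efficiently.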

Development of this module is partially supported by the IRIS-HEP software institute.

Topics

The scientific Python ecosystem (3.5 hours)
Introduction to performance tuning and optimization tools (0.5 hours)
Columnar data analysis (3.5 hours)
Analysis facilities: coffea-casa and the Analysis Grand Challenge workflow (2 hours)
Analysis scale-out techniques (2 hours)
Julia for analysis (2 hours)