First semester

Topics, Case Studies, Conferences

Teacher(s): Romaric GAUDEL, Shadi IBRAHIM, Rémi LELUC, Franck ORAGA, Thomas ZAMOJSKI

Course type: STATISTICS

Correspondent: François PORTIER

Unit: UE-MSD06 : Case Studies and Project

Number of ECTS: 2.5

Course code: MSD 06-2

Distribution of courses: Lecture hours: 36

Language of teaching: English

Objectives

This part is divided into several seminar sessions, each dedicated to a recent data science topic:

- BANDIT THEORY: You will learn how to identify when exploration is necessary in a learning system, learn standard strategies for handling this requirement, and implement and test these strategies in notebooks. The need for exploration is ubiquitous in applications: it arises as soon as a model is learned from data that result from the choices made by that same model. This challenge is one of the fundamental obstacles in reinforcement learning and recommender systems.

- SOME RECENT ADVANCES FOR BIG DATA PROCESSING IN THE CLOUD: At the end of the lectures, students will be able to identify the main performance bottlenecks when running Big Data applications in clouds and will know how the performance of Hadoop can be improved accordingly. During this conference, we will discuss several approaches and methods used to optimize the performance of Hadoop in the cloud. We will also discuss the limitations of Hadoop and introduce state-of-the-art resource management systems and job schedulers for Big Data applications, including Mesos, Delay scheduler, ShuffleWatcher, and Tetrium. In addition, we will discuss how redundancy techniques, such as replication and erasure coding, affect the performance of MapReduce applications.

- STOCHASTIC OPTIMIZATION METHODS FOR MACHINE LEARNING: At the end of the lecture, students will have acquired a robust understanding of the theory and applications of stochastic optimization methods, a broad overview of stochastic optimization techniques such as Adam, Adagrad, (L)BFGS and general conditioning methods, and practical techniques for applying stochastic optimization methods to real-world machine learning problems.

- CASE STUDIES IN SMART DATA: At the end of the lecture, students will know the challenges of deploying and maintaining a machine learning model in operation, some best practices that address these concerns, how to create a Docker image and run a container, how to serve a model as a service in Python, and statistical methods for online and offline model monitoring.

Course outline

- BANDIT THEORY: Bandit setting and use cases; analysis of the Explore-then-Commit strategy; presentation, implementation and testing of standard solutions: epsilon-greedy, UCB, Thompson Sampling.

- SOME RECENT ADVANCES FOR BIG DATA PROCESSING IN THE CLOUD: Approaches to optimize Hadoop in clouds (2.5 hrs); resource management and job scheduling for Big Data applications: Mesos, Delay scheduler, ShuffleWatcher, Tetrium, etc. (2.5 hrs); independent work (tentative): students will be assigned to groups, and each group will give a 15-20 min presentation (1 hr).

- STOCHASTIC OPTIMIZATION METHODS FOR MACHINE LEARNING: This seminar covers stochastic optimization methods tailored for machine learning, in both theory and application. The focus is on the widely used Stochastic Gradient Descent (SGD) algorithm and its variants. After studying the theory behind SGD and its limitations, we turn to enhancements such as diagonal scaling, second-order techniques, and broader conditioning methods. The lecture is followed by a practical session on the direct application of stochastic optimization to reinforcement learning through policy gradient methods.

- CASE STUDIES IN SMART DATA: Machine learning models are notoriously hard to put into production and to maintain there. Why is that, and what can we do about it? In this course, we will explore the latest trends in MLOps. We will learn about technologies such as Docker containers and FastAPI. We will also learn statistical methods for automating model monitoring, and we will see how to put them into practice with Python packages such as scikit-multiflow and ruptures.
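
As an illustration of the kind of strategy implemented in the bandit sessions, here is a minimal sketch of epsilon-greedy on Bernoulli arms. It is not taken from the course material: the function name `epsilon_greedy`, the arm means, the horizon, and the value of epsilon are all assumptions chosen for the example.

```python
import random


def epsilon_greedy(arm_means, horizon=5000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on Bernoulli arms (illustrative sketch).

    arm_means: hypothetical success probability of each arm.
    Returns the number of pulls per arm and the total reward collected.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms        # number of pulls per arm
    estimates = [0.0] * n_arms   # empirical mean reward per arm
    total_reward = 0.0

    for _ in range(horizon):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the best empirical mean so far.
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # Draw a Bernoulli reward from the chosen arm.
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0

        # Incremental update of the empirical mean.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return counts, total_reward


counts, total_reward = epsilon_greedy([0.2, 0.5, 0.8])
print(counts, total_reward)
```

With epsilon set to 0, the strategy never explores and can lock onto a suboptimal arm, which is the failure mode that motivates the exploration strategies listed above.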

Prerequisites

- BANDIT THEORY: Basic knowledge of Python and of object-oriented programming; basic knowledge of machine learning is a plus.

- SOME RECENT ADVANCES FOR BIG DATA PROCESSING IN THE CLOUD: Having attended the course "Big Data processing in Clouds: Hadoop".

- STOCHASTIC OPTIMIZATION METHODS FOR MACHINE LEARNING: Convex analysis, linear algebra, Python (basics, NumPy, PyTorch).

- CASE STUDIES IN SMART DATA: Basic knowledge of the Python programming language.