Research
Second semester

Big Data IT Tools

Objectives

Find your way around the most common "big data" technologies
Identify bottlenecks in data processing execution and adapt processing to remedy them
Choose and implement the right architecture for a given processing task, in particular CPU vs. GPU, local vs. cloud, batch vs. streaming, high-level vs. low-level, etc.
Produce simple statistical analyses with Spark (see the sketch after this list)
Provision a simple infrastructure on AWS
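
To give a flavour of the Spark objective above, here is a minimal PySpark sketch of a simple statistical analysis. It is an illustration only, not course material; the file name and column names ("measurements.csv", "category", "value") are placeholders:

    from pyspark.sql import SparkSession

    # Start a local Spark session (the same code can run unchanged on a cluster)
    spark = SparkSession.builder.appName("simple-stats").getOrCreate()

    # Load a CSV file into a DataFrame; the file name is a placeholder
    df = spark.read.csv("measurements.csv", header=True, inferSchema=True)

    # Descriptive statistics (count, mean, stddev, min, max) for each column
    df.describe().show()

    # A grouped aggregation: mean of a numeric column per category (placeholder column names)
    df.groupBy("category").avg("value").show()

    spark.stop()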

Course outline

The term "big data" is being used more and more, both in business and in the general media. Unfortunately, it is often used as a catch-all term. This course begins with a deconstruction of the notion of big data, presenting the V’s of big data and introducing the notion of high-performance data processing.
It then presents an overview of the technologies labelled big data and the associated computing architectures, comparing them with traditional solutions:
General architecture of local (processor, RAM, storage) and distributed computing (centralized vs. peer-to-peer; advantages and disadvantages of distributed systems)
Storage architectures (file systems vs. databases, local vs. distributed)
Focus on distributed storage with HDFS
Focus on distributed computing with Spark and MapReduce (see the sketch after this outline)
Introduction to cloud computing with Amazon Web Services (AWS)
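
As an illustration of the Spark/MapReduce and HDFS items above, a minimal word-count sketch in PySpark; the hdfs:// path is a placeholder, not a path used in the course:

    from pyspark import SparkContext

    # Local Spark context; on a cluster the master would point to YARN or a standalone manager
    sc = SparkContext("local[*]", "wordcount")

    # Read a text file; the hdfs:// path is a placeholder for a file stored in HDFS
    lines = sc.textFile("hdfs:///data/sample.txt")

    # "Map" phase: split each line into words and emit (word, 1) pairs
    pairs = lines.flatMap(lambda line: line.split()).map(lambda w: (w, 1))

    # "Reduce" phase: sum the counts for each word across partitions
    counts = pairs.reduceByKey(lambda a, b: a + b)

    print(counts.take(10))
    sc.stop()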

Prerequisites

Basic knowledge of Python