Research
First semester

IT Tools 2 (NoSQL, Big Data Processing with Spark)

Objectives

NoSQL:
Understand the fundamentals of NoSQL databases, including the features they offer and the specific challenges they address compared with classic SQL databases. Evaluate and select appropriate NoSQL technologies for a given situation. Gain hands-on experience deploying and using NoSQL databases such as MongoDB or Neo4j.

Big Data Processing with Spark:
Understand the challenges of distributed computing through the Apache Spark architecture. Discover how to use Apache Spark and the platforms and tools available for it. Practice PySpark coding to learn Apache Spark features, from data management to machine learning.

Course outline

NoSQL:
– NoSQL origins (history & players)
– NoSQL / SQL comparison
– Key concepts of NoSQL databases:
  – Data models
  – Distribution models
  – Query languages
  – Consistency
– NoSQL database types
– NoSQL database technologies & comparisons (MongoDB, Cassandra, Neo4j, Redis, Elasticsearch…)
– Neo4j introduction + lab (see the sketch after this outline)
– Elasticsearch introduction + lab
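
To give a flavour of the Neo4j lab, the following minimal sketch uses the official Neo4j Python driver and the Cypher query language to create a tiny graph and query it. The connection URI, credentials and the example nodes are placeholders for illustration, not course material.

    from neo4j import GraphDatabase

    # Placeholder URI and credentials for a local Neo4j instance.
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        # Create two nodes and a relationship (Cypher MERGE is idempotent).
        session.run(
            "MERGE (a:Person {name: $a}) "
            "MERGE (b:Person {name: $b}) "
            "MERGE (a)-[:KNOWS]->(b)",
            a="Ada", b="Grace",
        )
        # Traverse the graph: who does Ada know?
        result = session.run(
            "MATCH (:Person {name: $name})-[:KNOWS]->(friend) RETURN friend.name",
            name="Ada",
        )
        for record in result:
            print(record["friend.name"])

    driver.close()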

Big Data Processing with Spark:
– Distributed computing introduction
– Apache Spark origins & history, links to Apache Hadoop
– Apache Spark architecture and main concepts:
  – Apache Spark “modules”
  – Architecture: driver & executors
  – Transformations vs. actions
  – Lazy evaluation
  – Data structures: RDDs, DataFrames & Datasets
– Using Apache Spark (see the sketch after this outline):
  – Create sessions and connect to clusters
  – Use data management functions
  – Leverage SQL with Spark SQL
  – Train & test machine learning models
  – Use the Spark Web UI
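
As a hint of what the PySpark sessions cover, the sketch below starts a local Spark session, chains lazy transformations, triggers them with actions, and runs a Spark SQL query over a temporary view. The application name, the local master and the toy data are assumptions made for illustration only.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Local session for illustration; on a real cluster the master URL would
    # point to the cluster manager rather than "local[*]".
    spark = SparkSession.builder.appName("spark-intro").master("local[*]").getOrCreate()

    # A small DataFrame built in the driver (toy data).
    df = spark.createDataFrame(
        [("Alice", 34), ("Bob", 45), ("Carol", 29)],
        ["name", "age"],
    )

    # Transformations (filter, withColumn) are lazy: nothing executes yet.
    adults = df.filter(F.col("age") >= 30).withColumn(
        "decade", (F.col("age") / 10).cast("int")
    )

    # Actions (show, count, collect) trigger execution on the executors.
    adults.show()

    # Spark SQL over the same data through a temporary view.
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age >= 30").show()

    spark.stop()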

Prerequisites

NoSQL:
Basic knowledge of SQL, databases, and computer systems

Big Data Processing with Spark:
Basic knowledge of computer systems and architecture, plus practical experience with the Python and SQL languages