Research
First semester

IT Tools 1 (Hadoop and Cloud Computing)

Objectives

At the end of the lectures, students will understand the potential of Big Data and will know the main tools for processing this tsunami of data at large scale. In particular, they will understand the main features of the MapReduce programming model and its open-source implementation Hadoop, and will be able to use Hadoop and test it under different configurations.

Data volumes are ever growing, across a broad application spectrum ranging from traditional database applications and scientific simulations to emerging applications such as Web 2.0 and online social networks. To cope with this deluge of Big Data, we have recently witnessed a paradigm shift in computing infrastructure, through Cloud Computing, and in the way data is processed, through the MapReduce model. First promoted by Google, MapReduce has become, thanks to the popularity of its open-source implementation Hadoop, the de facto programming paradigm for Big Data processing on large-scale infrastructures. Meanwhile, cloud computing continues to serve as a prominent infrastructure for Big Data applications.

The goal of this course is to give a brief introduction to Cloud Computing: definitions, types of cloud (IaaS/PaaS/SaaS, public/private/hybrid), challenges, applications, main cloud players (Amazon, Microsoft Azure, Google, etc.), and cloud-enabling technologies (virtualization). We will then explore the data processing models and tools used to handle Big Data in clouds, such as MapReduce and Hadoop. An overview of Big Data, including definitions, its sources, and the main challenges it introduces, will be presented. After that, we will discuss distributed file systems, and then present MapReduce as an important programming model for Big Data processing in the Cloud. Finally, the Hadoop ecosystem and some of its major features will be discussed.
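To give a concrete feel for the MapReduce model before the detailed lectures, here is a minimal sketch of the classic word-count example written against Hadoop's org.apache.hadoop.mapreduce Java API. The class names TokenizerMapper and IntSumReducer are illustrative (each would live in its own source file), not material taken from the course:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map phase: emit a (word, 1) pair for every word in the input split.
    public class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, ONE);
        }
      }
    }

and:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Reduce phase: sum the partial counts received for each word.
    public class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }

Everything between the two phases (partitioning, shuffling, and sorting of the intermediate pairs) is handled by the framework, which is exactly what makes the model attractive for large-scale processing.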

Course outline

Throughout the course we will cover the following topics:

– Cloud Computing: definitions, types, challenges, enabling technologies, and examples (2.25 hrs)
– Big Data: definitions, sources, and challenges (1.5 hrs)
– Google Distributed File System (1.5 hrs)
– The MapReduce programming model (1.5 hrs)
– Hadoop Ecosystem (2.25 hrs)
– Practical sessions on Hadoop (7 hrs):
  – How to use virtual machines/containers and public cloud platforms
  – Starting with Hadoop
  – Configuring HDFS
  – Configuring and optimising Hadoop
  – Writing MapReduce applications (see the driver sketch after this list)
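As a pointer to what the practical sessions build towards, below is a minimal driver sketch that wires the illustrative TokenizerMapper and IntSumReducer from the earlier sketch into a Hadoop Job, and shows how HDFS and job properties can be overridden programmatically. The property values are assumptions chosen for a small test deployment, not recommended settings:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Per-job overrides; cluster-wide defaults normally live in
        // hdfs-site.xml and mapred-site.xml. Values here are illustrative.
        conf.set("dfs.replication", "2");        // HDFS block replication factor
        conf.set("mapreduce.job.reduces", "2");  // number of reduce tasks

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on mappers
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not yet exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The same properties can instead be set cluster-wide in Hadoop's XML configuration files, which is the kind of configuration and tuning the practical sessions exercise.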

Prerequisites

– Familiarity with the Linux command line
– Familiarity with Java or Python