Python for Data Science Masterclass – 46 Hours HD Video

top University professor
What you’ll learn

  • Variables and datatypes

  • operators

  • tuples

  • sets

  • dictionary

  • and much much more
Requirements
  • beginners are welcome
Description

Welcome to this course on Python for Data Science. This is a four-week course in which we
are going to teach you some very basic programming aspects in Python. And since this is a
course that is geared towards data science, towards the end of the course, based on what has
been taught, we will also show you two different case studies: one is what we call a function
approximation case study and the other is a classification case study. We will then tell you
how to solve those case studies using the programming platform that you have learned. So, in
this first introductory lecture I am just going to talk about why we are looking at Python
for data science.

(Refer Slide Time: 01:10)

So, to look at that, we are first going to look at what data science is. This is something
that you would have seen in other videos and courses on NPTEL and in other places. Data
science is basically the science of analyzing raw data and deriving insights from this data.
You could use multiple techniques to derive insights: simple statistical techniques, more
complicated and more sophisticated machine learning techniques, and so on.

Nonetheless, the key focus of data science is in actually deriving these insights using
whatever techniques you want to use. Now, there is a lot of excitement about data science,
and this excitement comes because it has been shown that you can get very valuable insights
from large data. You can get insights about how different variables change together, how one
variable affects another variable and so on, which are not easy to see with very simple
computation.

So, you need to invest some time and energy into understanding how you could look at this
data and derive these insights from it. And from a utilitarian viewpoint, if you look at data
science in industry, doing proper data science allows these industries to make better
decisions. These decisions could be in multiple fields; for example, companies could make
better purchasing decisions, better hiring decisions, better decisions in terms of how to
operate their processes and so on.

So, when we talk about decisions, the decisions could be across multiple verticals in an
industry. And data science is not only useful from an industrial perspective, it is also
useful in the actual sciences themselves, where you look at lots of data to model your system
or test your hypotheses or theories about systems and so on. So, when we talk about data
science, we start by assuming that we have a large amount of data for the problem of
interest. We are going to inspect the data, clean and curate the data, and then do some
transformation of the data, modeling and so on, before we can derive insights that are
valuable to the organization or test a theory.

(Refer Slide Time: 03:47)

Now, coming to a more practical viewpoint of what we do once we have data. I have these four
bullet points, which roughly tell you what steps you will follow if you are solving a data
science problem. So, you start with someone giving you data, and you are trying to derive
insights from this data. The very first step is really to bring this data into your system.
You have to read the data so that it comes into the programming platform and you can use it.
Now, data could be in multiple formats: you could have data in a simple Excel sheet or in
some other format.
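
As a rough illustration of this first step, here is a minimal sketch of reading tabular data
into Python. It assumes a comma-separated format and uses only the standard library; the
column names and values are made up for the example, and the text is held in a string so the
snippet runs without a file on disk.

```python
import csv
import io

# Hypothetical sample data standing in for a file on disk.
raw_text = "name,age,city\nAsha,31,Chennai\nRavi,27,Pune\n"


def read_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)


rows = read_rows(raw_text)
print(len(rows))        # number of records read
print(rows[0]["name"])  # value of the "name" column in the first record
```

With a real file you would pass an opened file object to csv.DictReader instead of the
io.StringIO wrapper. Note that every value comes in as text, so numeric columns still need
converting.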

So, we will teach you how to pull data into your programming platform from multiple data
formats. If you think about how you are going to solve a problem, the first step is simply to
read the data. Once you read the data, many times you have to do some processing, because you
could have data that is not correct. For example, we all know that a mobile number has 10
digits. If there is a column of mobile numbers and there is one row with just five digits,
then you know there is something wrong. This is a very simple check; in real data processing
this gets much more complicated.
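
The mobile number check can be sketched in a few lines. This is only an illustration under
the assumption stated above, that a valid number has exactly 10 digits; the sample values and
the helper name is_valid_mobile are hypothetical.

```python
# Hypothetical column of mobile numbers; the third entry is deliberately too short.
mobile_numbers = ["9876543210", "9123456780", "98765", "9988776655"]


def is_valid_mobile(number, expected_digits=10):
    """Return True when the value consists of exactly `expected_digits` digits."""
    return number.isdigit() and len(number) == expected_digits


# Collect the row index and value of every entry that fails the check.
invalid_rows = [
    (index, number)
    for index, number in enumerate(mobile_numbers)
    if not is_valid_mobile(number)
]
print(invalid_rows)
```

Keeping the row index alongside the bad value makes it easy to go back and correct or remove
that record later.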

So, once you bring the data in and try to process it, you are going to find errors such as
this. How do you remove such errors, and how do you clean the data? That is one activity that
usually precedes doing more useful things with the data. This is not the only issue that we
look at; there could also be data that is missing.

So, for example, there is a variable for which you get a value in multiple situations, but in
some situations the value is missing. What do you do with this data? Do you throw the record
away, or do you do something to fill in the data? These are all data processing and cleaning
steps. In this course we will tell you about the tools that are available in Python to do
this data processing and cleaning.
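
The two options for missing values, throwing the record away or filling it in, can be
sketched as follows. The readings are made-up values, None is assumed to mark a missing
entry, and filling with the mean is just one possible choice among many.

```python
from statistics import mean

# Hypothetical readings; None marks a missing value.
readings = [12.0, None, 15.0, 9.0, None]

# Option 1: throw away the records with a missing value.
dropped = [value for value in readings if value is not None]

# Option 2: fill each gap with the mean of the observed values.
fill_value = mean(dropped)
filled = [fill_value if value is None else value for value in readings]

print(dropped)
print(filled)
```

Dropping keeps only what was actually observed but shrinks the data set, while filling keeps
every record at the cost of introducing values that were never measured.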

Now, what you have done at this point is get the data into the system, process and clean it,
and arrive at a data file or data structure that is reasonably complete, so that you think
you can work with this data set. At that point you will try to summarize the data. A very
simple way of summarizing is to compute simple statistical measures; for example, you could
compute the median, mode or mean of a particular column.
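
A minimal sketch of such a summary, using Python's built-in statistics module on a made-up
column of values (the numbers are assumptions chosen only for illustration):

```python
import statistics

# Hypothetical column of values.
column = [4, 8, 8, 5, 10]

print("mean:", statistics.mean(column))
print("median:", statistics.median(column))
print("mode:", statistics.mode(column))
print("variance:", statistics.variance(column))  # sample variance (divides by n - 1)
```

For this column the mean is 7, the median and mode are both 8, and the sample variance is 6.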

So, those are simple ideas for summarizing the data; you could also compute the variance and
so on. We are going to teach you how to use these statistical quantities to summarize the
data. Once you summarize the data, another activity which is usually taken up is what is
called visualization. Visualization means you look at the data more pictorially to get
insights about it before you bring heavy-duty algorithms to bear on it. This is a creative
aspect of data science: the same data could be visualized by multiple people in multiple
ways, and some visualizations are not only eye-catching, but also much more informative than
others.

So, plotting the data so that some of its attributes or aspects are made apparent is this
notion of visualization, and there are tools in Python that we will teach you to use for
visualizing data. At this point you have taken the data, cleaned it, got a set of data points
or a data structure that you can work with, and done some basic summary of the data that
gives you some insights. You have also looked at it more visually and got some more insights.
But when you have a large amount of data, big data, the last step is really deriving those
insights which are not readily apparent either through visualization or through a simple
summary of the data.

So, how do we then go and look at more sophisticated analytics or analysis of data so that
these insights come out? That is where machine learning comes in. As this course progresses
you will notice that you go through all of these steps, so that you are ready to look at data
science problems in a structured format and then use Python as a tool to solve some of these
problems.

(Refer Slide Time: 08:57)

Now, why Python for doing all of this? The number one reason is that there are Python
libraries which are already geared towards doing many of the things that we talked about, so
that it becomes easy to program and you can very quickly get some interesting outcomes out of
what you are trying to do.

So, as we talked about in the previous slide, you need to do data manipulation and
preprocessing. There are lots of functions and libraries in Python with which you can do data
wrangling, manipulation and so on. From a data summary viewpoint, many of the statistical
calculations that you want to do are already preprogrammed, and you simply have to invoke
them with your data to produce a data summary. The next step we talked about is
visualization, and there are libraries in Python which can be used to do the visualization.

And finally, for the more sophisticated analysis that we talked about, all kinds of machine
learning algorithms are already coded and available as libraries in Python. So, once you
understand a bit about these functions and get comfortable working in Python, applying
certain machine learning algorithms to these problems becomes trivial: you simply call these
libraries and run the algorithms.

(Refer Slide Time: 10:29)

At a higher level: in the previous slides we talked about the flow of how you get the data
in, clean it and go all the way up to insights, and in parallel we said why Python makes it
easy to do all of this. If you go a little further and ask about the other advantages of
Python, beyond very simple data science activities: Python provides you with several
libraries and is being continuously improved, so any time there is a new algorithm it comes
into the set of libraries. In that sense it is very varied, and there is also a good user
community.

So, if there are some issues with new libraries, those get fixed, so that you get robust
libraries to work with. We also talk about data, and data can be of different scales. The
examples that you will see in this course use data of reasonably small size, but in real-life
problems you are going to look at data which is much larger, which we call big data. Python
has the ability to integrate with big data frameworks like Hadoop, Spark and so on.

And Python also allows you to do more sophisticated programming, such as object-oriented
programming and functional programming. With all of these sophisticated tools and abilities,
Python is still a reasonably simple language to learn, and it is reasonably fast to prototype
in. It also gives you the ability to work with data which is on your local machine, in the
cloud and so on. These are all things that one looks for in a programming platform which is
capable of solving problems in real life.

So, these are real problems that you can solve: not only toy examples, but real data science
applications that you can build with Python.

(Refer Slide Time: 12:49)

And just as another pointer in terms of why we believe that Python is something that a lot of
our students and professionals in India should learn: as you know, there are paid tools for
machine learning with all of these libraries and so on.

And there are also open source tools, and in India, based on a survey, most people of course
prefer open source tools for a variety of reasons, cost being one, because they are free to
use. But if a tool is just free to use and does not have a robust user community, then it is
not really very useful. That is where Python really scores: it has a robust user community
which can help people working in Python. So, it is both open source and backed by a robust
user community, both of which are advantages for Python.

(Refer Slide Time: 13:48).

And if you think of other competing languages for machine learning: if you look at this
chart, in India about 44 percent of the people who were surveyed said they use Python or
prefer Python. And of course, a close second is R. In fact, R was much more preferred a few
years back, but over the last few years in India Python has started to become the programming
platform of choice. So, in that sense it is a good language to learn, because the
opportunities for jobs and so on are a lot more when you are comfortable with Python as a
language.

So, with this I will stop this brief introduction on why Python for data science. While we
are going to teach you Python as a programming language, please keep in mind that each module
that we teach is actually geared towards data science. As we teach Python, we will make the
connections to how you will use the things that you are seeing in data science, and all of
this will culminate in the two case studies that bring these ideas together. They will give
you an understanding both of how a data science problem is solved and of how it is solved in
Python, which is currently a programming platform of choice in India.

So, I hope this short four-week course helps you quickly get on to this programming platform
and learn data science. You can then enhance your skills with a much more detailed
understanding of both the programming language and data science techniques.

Thank you.

Now, the commonly used data exploration and visualization tools are Tableau and QlikView, and
of course you always have MS Excel. The next bucket that we are going to look into is for
when you have huge chunks of data. When you are collecting data on a real-time basis, you are
going to be collecting data every second or every minute. If you want to store all this data
and preprocess it, the regular desktop or computing systems that you have might not be
adequate.

So, that is when you use parallel or distributed computing, where you distribute the work
across different systems. Popular tools that are used for big data are Apache Spark and
Apache Hadoop. In this course we are going to focus mainly on tools that are required for
data preprocessing and analysis, and specifically we are going to look into Python.

(Refer Slide Time: 03:08)

So, let us look at the evolution of Python. Python was developed by Guido van Rossum in the
late eighties at the national research institute for mathematics and computer science, which
is located in the Netherlands.

There are different versions of Python: the first version was released in 1991, the second
version in 2000 and the third version in 2008, with version 3.7 being the latest at the time
of this lecture. So, let us look at the advantages of using Python.

(Refer Slide Time: 03:41)

So, Python has features that make it well suited for data science. Let us look at what these
features are. The first and foremost is that it is an open source tool, and the Python
community provides immense support and development to its users. Python was developed under
an Open Source Initiative approved license, which makes it free to use and distribute, even
for commercial purposes.

(Refer Slide Time: 04:05)

The next feature is that the syntax Python uses is fairly simple to understand and code,
which lowers the barrier if you are switching to a newer programming language. The next
important advantage of using Python is that the libraries contained in Python get installed
at the time of installation, and these libraries are designed keeping in mind specific data
science tasks and activities.

Python also integrates well with most cloud platform service providers, and this is a huge
advantage if you are looking to use big data. If you download Python from the website and
install it, you will see that most of the scripting is done in a shell. There are
applications that provide better graphical user interfaces for end users, and these are taken
care of by integrated development environments.

(Refer Slide Time: 04:57)

So, now let us see what an integrated development environment is. An IDE, as it is
abbreviated, is a software application that consists of the tools required for development.
All these tools are consolidated and brought together under one roof inside the application.
IDEs are also designed to simplify software development. This is very useful because, as an
end user who is not a developer, you might want all the tools available at a single click,
and using an IDE will be very beneficial in that case. The features provided by IDEs include
tools for managing, compiling, deploying and debugging software. These form the core features
of any IDE.

(Refer Slide Time: 05:44)

So, now let us look at the features of an IDE in more depth. Any IDE should consist of three
important features: the first is a source code or text editor, the second is a compiler and
the third is a debugger. These three features form the crux of any software development.

IDEs can also have additional features like syntax and error highlighting, code completion
and version control.

(Refer Slide Time: 06:09)

So, let us see what the commonly used IDEs for Python are. The most frequently used are
Spyder, PyCharm, Jupyter Notebook and Atom. Which one to use basically depends on what the
user is comfortable with.

(Refer Slide Time: 06:24)

So, now let us look at Spyder. Spyder is an IDE that is supported across Linux, Mac and
Windows platforms. It is also open source software, and it is bundled with the Anaconda
distribution, which comes with all the inbuilt Python libraries.

So, if you want to work with Spyder you do not have to install any of the libraries
separately; all the necessary libraries are taken care of by Anaconda. Another important
feature of Spyder is that it was specifically developed for data science, and it was
developed in Python and for Python.

(Refer Slide Time: 06:57)

So, this is how the interface of Spyder looks: you have the scripting window, you have the
console output here and you have a variable explorer here. We are going to look at all these
features in the next few lectures.

(Refer Slide Time: 07:11)

The other features of Spyder include a code editor with robust syntax and error highlighting;
it also helps with code completion and navigation. It has a debugger, and it also has
integrated documentation that can be viewed within the interface or on the web. Another
advantage of using Spyder is that its interface is very similar to those of MATLAB and
RStudio. So, if you have already worked with these two tools and are looking to switch to
Python, the transition is going to be seamless.

(Refer Slide Time: 07:44)

So, now let us look at the second IDE, which is PyCharm. PyCharm is also supported across all
operating systems, that is Linux, Mac and Windows. It has two versions: one is the community
version, which is open source software, and the other is the professional version, which is
paid software. PyCharm supports only Python, and it is bundled and packaged with the Anaconda
distribution, which comes with all the inbuilt Python libraries. However, if you want to
install PyCharm separately, that can also be done.

(Refer Slide Time: 08:14)

So, this is how the interface of PyCharm looks: you have a very well-defined structure for
naming your directories, and you have the scripting window here.

(Refer Slide Time: 08:25)

So, let us look at some of the features of PyCharm. First, it has a code editor which
provides syntax and error highlighting. It has a code completion and navigation feature, and
it also has a unit testing tool, which helps you check that each part of the code behaves as
expected. It also has a debugger and supports version control.

(Refer Slide Time: 08:48)

So, now let us look at the next IDE, which is Jupyter Notebook. Jupyter Notebook is very
different from the earlier two IDEs in the sense that it is a web application which allows
you to create and manipulate code. The documents you work in are called notebook documents,
and that is where the Notebook part of the name Jupyter Notebook comes from. Jupyter is
supported across all operating systems and is available as open source software.

(Refer Slide Time: 09:18)

Now, this is the interface of Jupyter. You can see that you have a few cells here as input
and you also have some output. Let me just zoom in and show you how the interface looks.

(Refer Slide Time: 09:33)

So, here you can see some of the code that is written, and if you scroll up you can see some
narrative about what has been written.

(Refer Slide Time: 09:41)

So, Jupyter is bundled with the Anaconda distribution, but it can also be installed
separately. It primarily supports Julia, Python, R and Scala. If you look at the name
Jupyter, it is put together from the names Julia, Python and R.

A notebook also consists of an ordered collection of input and output cells, like we saw
earlier, and these can contain narrative text, code, plots and any kind of media.

(Refer Slide Time: 10:13)

One of the key features of Jupyter Notebook is that it allows sharing of code and narrative
text through output formats like HTML, Markdown or PDF. If you are working in an education
environment, or if you would like a better presentation tool, you can use these kinds of
output formats to present. So, though Jupyter has features that give it very good aesthetic
appeal, it lacks the important features of a good IDE. By a good IDE I mean one that consists
of a source code editor, a compiler and a debugger, and Jupyter does not provide all three of
these.

(Refer Slide Time: 10:50)

So, the next IDE that we are going to look into is Atom. Atom is an open source text and
source code editor, and it is again supported across all operating systems. It supports
programming languages like Python, PHP, Java etcetera. It is very well suited for developers,
and it also lets users install plug-ins or packages. One common drawback with all these text
and source code editors is that they do not come installed with the basic libraries of any
programming language; you have to install such packages as and when you need them.

So, that is one major drawback of using any kind of text editor or source code editor.
However, Atom does provide packages that are suited for data science and for code completion,
code navigation or debugging. So, if you are a developer and you want to code in a text
editor environment, you can go ahead with Atom, but you will have to install all these
packages as and when you require them.

(Refer Slide Time: 11:52)

So, this is the interface of Atom; it is a proper text editor interface.

(Refer Slide Time: 12:00)

So, how will you choose the best IDE? That is an important question. It basically depends on
your requirements, but it is a good habit to first work with different IDEs to understand
what your own requirements are. If you are new to Python, it is better to work across all
these IDEs, and there are several other IDEs out there as well. See what suits you and then
take a call on which IDE to use.

But in this course we are going to be using Spyder, primarily because it is very good
software that has been developed specifically for data science and Python, and it has an
interface that is very appealing and easy to use for beginners.

(Refer Slide Time: 12:43)

So, to summarize, in this lecture we saw the popular tools used in a data science
environment. We also saw how Python evolved and which integrated development environments are
commonly used. We also looked at what each of these IDEs has to offer and some of the common
pros and cons of each.

Welcome to the lecture on Introduction to Spyder. In this lecture we are going to see what
the interface of Spyder looks like, how to set the working directory and how to create and
save a Python file.

(Refer Slide Time: 00:28)

(Refer Slide Time: 00:32)

So, let us see what Spyder looks like. On my left you can see a snapshot of the screen that
appears once you open Spyder. The Python version that I am using to illustrate this lecture
is version 3.6. Once you open it, you will get a small description with the author name and
when the file was created. There are a couple of windows here, so let us see what each of
these windows means.

So, the entire interface is split into three windows. The window on my left is called the
scripting window, and all the lines of code and commands that you type will be displayed
here; this is where you write all your commands and code. On my right I have two windows. The
top section is where you will find tabs that read File explorer, Help and Variable explorer.

Now, once you set the directory, any files that exist in your current working directory will
be displayed under the file explorer. Under the variable explorer you will have a display of
all the objects and variables that you have used in your code. Along with the variables you
also have their name, type and size. Name is the name of the variable, type is the data type
and size tells you whether it is an array or a single value. For an array the first few
values will be displayed, and if it is only a single value then that value will be displayed
under the heading Value. The section on the bottom is the console.

So, the console is an output window where you will see all your printed statements and
outputs. You can also perform elementary operations in the console, but the only disadvantage
is that you will not be able to save them. However, whatever you type in the scripting window
can always be saved. We are going to look into how to save the lines of commands that you
have typed in the scripting window as the lecture proceeds.

(Refer Slide Time: 02:34)

Now, let us see how to set the working directory. There are three ways to set a working
directory: the first is using an icon, the second is using the inbuilt library os and the
third is using the command cd, which means change directory.

(Refer Slide Time: 02:47)

Now, let us see how to set a working directory using the icon. If you look at the top section
here, you will see an icon with an open folder, and you can choose a working directory by
clicking on this icon. Once you click it, you will be prompted to choose a location or a
folder. Choose a suitable folder or location, and once you select it your directory is set.
This is an easy method: if you do not want to be typing commands every single time, you can
simply use the icon.

(Refer Slide Time: 03:28)

Now, let us look at the second and the third methods. For the second, you need a library
called os, which stands for operating system. Before you use a function from this library to
change the directory, you need to import it. import is the statement that you use to load a
library into your environment.

Now, once you load the library os into your environment, you can use the function chdir,
which means change directory. You need to use the name of the library, which is os in this
case, followed by a dot and then chdir. Within the parentheses you give the path inside
single or double quotes; copy the entire path of your directory and paste it here, or type it
out. The third method is using the command cd, which also means change directory; you give a
space after the command and then give the path. So, this is how you set a working directory.
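
The second method can be sketched as below. To keep the example runnable anywhere, a
temporary folder stands in for your own project path; that folder and the variable names are
assumptions made for illustration. In practice you would pass your own path as a quoted
string to os.chdir.

```python
import os
import tempfile

original_dir = os.getcwd()  # remember where we started

# A temporary folder stands in for your own project path (hypothetical).
with tempfile.TemporaryDirectory() as project_dir:
    os.chdir(project_dir)      # set the working directory
    current_dir = os.getcwd()  # confirm the change
    print(current_dir)
    os.chdir(original_dir)     # move back out before the folder is deleted
```

os.getcwd is a convenient way to confirm that the change took effect. The cd command is typed
directly in the console rather than in a script.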

(Refer Slide Time: 04:32)

Now, once you set the working directory, any folders, subfolders or other files inside it
will be displayed under the file explorer. I have a couple of files under this directory, and
hence they are displayed here for me. But of course, if you are opening a new folder you are
likely to see this space empty. You can check all your files and subdirectories here under
the file explorer.

(Refer Slide Time: 05:09)

So, we have seen how to set a working directory; now let us see how to create a file. There
are two ways to go about it. The first is by clicking an icon that looks like a page folded
on the right, which you can find on the toolbar. On the icon bar towards your extreme left
you will see a page that is folded on the right, and if you click on that a new script file
will open. I have also shown you a zoomed-in version of the icon, so this is how it looks;
the moment you click on it a new script file will pop up.

(Refer Slide Time: 05:39)

Now, the second method is by clicking on the File menu and then selecting New file. You can
see the File menu here, and from that click on New file. Apart from these two methods you
always have the fallback option of using the keyboard shortcut, which is Control plus N. All
three of these methods open a new script file for you right away. Till now, we have set the
working directory and created a script file. So, let us type a few pieces of code before we
save our script file, but even before we go there let us look at what a variable means.

(Refer Slide Time: 06:00)

So, a variable is an identifier that contains some known information, and the known
information contained within an identifier is referred to as a value. A variable name
actually points to a memory address or storage location, and this location is used to refer
to the stored value. A variable name can be descriptive or can consist of a single letter. We
will look into the conventions for naming a variable in the lectures to come.
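
The idea that a name points to a stored value can be seen in a small sketch. The names value
and alias are made up for illustration; the built-in id function returns an identifier for
the object a name refers to, and the is operator checks whether two names refer to the same
object.

```python
# Two names bound to the same stored list.
value = [1, 2, 3]
alias = value

print(alias is value)          # both names refer to the same object
print(id(alias) == id(value))  # so their identifiers match as well

# Changing the object through one name is visible through the other.
alias.append(4)
print(value)
```

Because both names refer to one stored list, the appended element shows up under either name.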

(Refer Slide Time: 06:47)

So, let us go ahead and create a few variables. You will see a snapshot of the code here on
my left, and I have zoomed in on the lines of code on my right. I am assigning a value of 11
to a. In Python the assignment operator that you use to assign a value is the equals sign.
So, I am storing a value of 11 in a, where a is my variable name, and I am saying b is equal
to 8 times 10.

So, this is a multiplication, and the multiplication operator in Python is the asterisk. Once
I create both my variables I would like to print the values of a and b. Because I want to
print two values together, I am going to separate them with a comma inside the print
statement. The print statement prints the output. However, if you just want to print one
value, you can give a single object inside the parentheses.
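
The lines described above would look roughly like this when typed into the scripting window.
This is a reconstruction from the description, not a copy of the slide.

```python
a = 11      # assign the value 11 to the variable a
b = 8 * 10  # the asterisk is the multiplication operator

print(a, b)  # two values separated by a comma; prints: 11 80
print(a)     # a single object inside the parentheses; prints: 11
```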

(Refer Slide Time: 07:59)

So, now let us go ahead and save our script file. To save your script file you can click on
the File menu again, and you can see there are three different options here. I am going to
zoom in a bit to show you the list of options that you have. The first option is Save, whose
keyboard shortcut is Control plus S. If you already have a file and you are making some
changes to it, you can save those changes by simply clicking on Save.

Now, if you have opened multiple files and are making changes in all of them, you can use the
option Save all, which saves the changes made across all the files that are open. The third
option is Save as. If you are creating a new file and you would like to name it and save it,
you would use Save as. So, let us see how to save a new script file for the very first time.

(Refer Slide Time: 09:07)

So, once you click on Save as, it will prompt you to give a name for the file. You can choose
the directory where you want to save it, or if you are already in your working directory you
can just save it there. .py is the extension that is used for a Python script file. Once you
do this you can click on Save and your file is saved.

(Refer Slide Time: 09:35)

So, to summarize in this lecture we saw how the interface of Spyder looks, we saw how

to set the working directory and how to create and save Python script files.

Who this course is for:
  • Beginner Python developers curious about data science