Thursday 14 May 2020

Machine Learning and Data Science using Python


Work Environment in Data Science

1) Work environment using Jupyter
2) Analysing our data using the pandas library
3) NumPy (a fundamental library for numerical computing in data science)
4) Data visualization (the matplotlib library lets us make neat graphs and visuals to describe our data)
5) Using models, training models and checking the accuracy of ML models (using the scikit-learn library)
6) Types of learning (supervised learning: regression and classification)
7) Special topics such as deep learning, neural networks and transfer learning, using the latest versions of TensorFlow and Keras


What is Machine Learning
In simple words: when a computer learns from given input data.

Machine learning is using an algorithm or computer program to learn about different patterns in data, and then taking that algorithm and what it has learned to make predictions about the future using similar data. Machine learning algorithms are also called models.

Types of Machine Learning
1) Supervised learning learns from labelled data, e.g. by drawing a line (or decision boundary) through the given data (see the sketch after this list). Examples are regression and classification.
2) Unsupervised learning is when we are given data without labels, e.g. a CSV file with no target column. Here we find association rules in the given data.
3) Reinforcement learning is teaching a machine through trial and error, using rewards and punishments.
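
As a toy illustration of "drawing a line" through data (a minimal sketch, not from the notes; np.polyfit fits a degree-1 polynomial, i.e. a straight line):

import numpy as np
# some noisy data that roughly follows y = 2x + 1
x = np.arange(10)
y = 2 * x + 1 + np.random.randn(10)
# fit a straight line to the data
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)  # should come out close to 2 and 1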

Steps in Data Modelling

1. Problem definition (what type of problem do we have at hand?)
2. Data (what type of data do we need to analyze?)
3. Evaluation (what counts as success or failure?)
4. Features (which features should we model?)
5. Model (what kind of model should we use to solve the problem?)
6. Experimentation (which different combinations should we try?)
Conclusion:
Keep repeating the steps until we solve the problem with our model.

Flow of data science work

A practical, typical data science workflow starts with:
1) opening a CSV file in a Jupyter notebook
2) exploring the data and performing data analysis using pandas
3) making visualizations/graphs and comparing different data points using matplotlib
4) building machine learning models on the data using scikit-learn (these four steps are sketched in code below the key)
Key
*pandas (a Python library for data analysis)
*matplotlib (a Python library for visualization and graphs)
*Jupyter notebook (a tool for building machine learning projects)
*scikit-learn, also known as sklearn (a free machine learning library for Python)
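
Sketched as code, the four steps might look like this (a minimal sketch; "data.csv" and the "target" column are hypothetical placeholders):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("data.csv")    # 1) open a CSV file
df.describe()                   # 2) explore and analyse with pandas
df.hist(figsize=(10, 10))       # 3) visualize with matplotlib
X, y = df.drop("target", axis=1), df["target"]
model = RandomForestClassifier().fit(X, y)  # 4) build a model with scikit-learn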




Types of Data

Unstructured data: things like images, voice data from mobile calls, and videos.
Structured data: things like a table of patients' medical disease history.

Types of Machine Learning
1. Supervised learning
2. Unsupervised learning
3. Transfer learning
4. Reinforcement learning

Types of Evaluation Metrics

Classification    Regression                        Recommendation
Accuracy          Mean Absolute Error (MAE)         Precision at K
Precision         Mean Squared Error (MSE)
Recall            Root Mean Squared Error (RMSE)
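
A minimal sketch of computing two of these metrics with scikit-learn (toy hand-made values, purely illustrative):

from sklearn.metrics import accuracy_score, mean_absolute_error

# classification: accuracy on toy labels
print(accuracy_score([0, 1, 1, 0], [0, 1, 0, 0]))   # 0.75

# regression: MAE on toy values
print(mean_absolute_error([3.0, 5.0], [2.5, 5.5]))  # 0.5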

 

Variables in Data Science
Feature variables are used to predict the target variable.
Feature variables are the attributes describing your data.
Kinds of feature variables:
1. Numerical features (like body height)
2. Categorical features (like gender: male or female)
3. Derived features (like the number of visits a person makes to hospital in a year)
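
A toy sketch of the three kinds as pandas columns (hypothetical values; a derived feature would be computed from raw records such as visit dates):

import pandas as pd

df = pd.DataFrame({
    "height_cm": [170, 165, 180],   # numerical feature
    "gender": ["M", "F", "M"],      # categorical feature
    "visits_2020": [3, 1, 7],       # derived feature (counted from raw visit records)
})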
Important concept in ML
Split your data into 3 different sets:
1. training data set     - to train your model
2. validation data set   - for choosing (tuning) your model
3. test data set         - to test and compare your model with different models
A usual split is:
70% for the training data set
15% for the validation data set
15% for the test data set
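
A minimal sketch of a 70/15/15 split, applying scikit-learn's train_test_split twice (X and y are toy placeholders):

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(200).reshape(100, 2), np.arange(100)  # toy data
# first split off 30%, then split that 30% in half (15% validation, 15% test)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5)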


Three parts of modelling
1. Choosing a model
2. Training a model
3. Comparing a model
Choosing a model
When working with structured data, the following models tend to work best:
1. decision trees, such as random forest and gradient-boosting algorithms (CatBoost, XGBoost)
When working with unstructured data, the following models tend to work best:
1. deep learning neural networks
2. transfer learning



Training the model
Align the input (feature variables) and output (target variable) of the data:
Feature variables (input) -> Model (the model finds the patterns) -> Target variable (output)
The input (feature variables), X, is used to predict the output (target variable), y.

Improving your model
Hyperparameters (try different hyperparameters for your model).
How well the model line fits the data points is conceptualized as follows:
1. overfitting (often caused by data leakage)
2. underfitting (often caused by data mismatch)
3. balanced (the "Goldilocks zone")

Data leakage & Data Mismatch
Data leakage happens when some of your test data leaks into your training data
This often results in overfishing or a model doing better on the test set then on the training dataset
Data mismatch happens data used of testing is different to the data used for training that is
Training data is not equal (! =) testing data set
features in the training data set! = features in the testing data set.
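
A quick practical check for overfitting/underfitting is to compare the train and test scores; a minimal self-contained sketch (toy random data, RandomForestClassifier as a stand-in model):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = np.random.rand(100, 3), np.random.randint(2, size=100)  # toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = RandomForestClassifier().fit(X_train, y_train)
# a test score much lower than the train score suggests overfitting;
# both scores low suggests underfitting (or mismatched train/test data)
print(model.score(X_train, y_train), model.score(X_test, y_test))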
Model Tuning
Model tuning is done on the training and validation data sets; the testing data set is kept for final evaluation.
The training/validation data sets are used for:
1. validation
2. model tuning
The testing data set is used for:
1. testing
2. model comparison

Model comparison (compare like with like, e.g. cars to cars)


To build the environment:

In the Anaconda prompt window (Miniconda3):
conda create --prefix ./env pandas numpy matplotlib scikit-learn
Or, to install Jupyter separately:
conda install jupyter
C:\Users\Tahir\Desktop\sample_pro\env>conda activate C:\Users\Tahir\Desktop\sample_project_1\env
After running, the prompt changes to the following:
(C:\Users\Tahir\Desktop\sample_pro\env)

Note: for a kernel error in Jupyter notebook, install as follows: pip install jupyter_client --upgrade

Pandas in Python for Data Analysis

import pandas as pd
names = pd.Series(["Tah", "Zah", "Mah"])
colour = pd.Series(["Red", "green", "Blue"])
df = pd.DataFrame({"k1": names, "k2": colour})
# export and re-import the data as CSV using a DataFrame
df.to_csv("abc.csv", index=False)
df2 = pd.read_csv("abc.csv")
# attributes of a DataFrame
df.dtypes
df.columns
df.index
# functions of a DataFrame
df.info()
df.describe()
df2["column_name"].sum()  # "column_name" is a placeholder for one of your columns
df2.column_name           # dot notation works too (if the name has no spaces)
      

Environment keys (Jupyter Notebook shortcuts):
Hit Esc, then m (markdown cell), x (cut cell), b (insert cell below), a (insert cell above)
Hit Shift+Tab to see the description of a function
Hit Shift+Enter to execute the cell
Hit Ctrl+Shift+- to split the cell (break the lines)

     

Data frames in pandas

Columns are represented by axis=1
Rows are represented by axis=0
Index numbering starts at 0 by default.

pd.crosstab(df["col1"], df["col2"])                 # frequency table of two columns
df["col1"].fillna(df["col1"].mean(), inplace=True)  # fill missing values with the column mean
df["col1"].dropna(inplace=True)                     # drops NaNs from this Series only; use df.dropna() for whole rows
## how to create a column using a pandas Series
new_column = pd.Series([7, 7, 7, 7])

## how to create a column from a Python list
## the list should be the same length as the data frame df

capacity_column = [7.2, 8.2, 3.2, 2.2]

df["Another_new_Column"] = capacity_column

## to drop a column, use the drop function

df.drop("ColumnName", axis=1)  # returns a new DataFrame without that column

## to shuffle the data frame

new_DF = df.sample(frac=0.5)  # random 50% sample; frac=1 shuffles 100% of the rows
## to remove the extra index column
df.reset_index(drop=True, inplace=True)

Anonymous functions

Apply a function to a column of the data frame:

df["columnName"] = df["columnName"].apply(lambda x: x / 2)

This says: the lambda takes each value x and returns x/2, and the result is assigned back to the column, so every value gets divided by 2.
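
For a simple operation like this, an equivalent vectorized form (usually faster than apply) is:

df["columnName"] = df["columnName"] / 2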

## ndarrays in Python's NumPy package

import numpy as np  # hit Shift+Tab on any function to see its description

mysamplearray = np.ones((2, 3))       # 2x3 array of ones
mysamplearray = np.zeros((2, 3))      # 2x3 array of zeros
mysamplearray = np.arange(0, 10, 2)   # start=0, stop=10, step=2
mysamplearray = np.random.randint(10, size=(3, 3))  # random integers from 0 up to (not including) 10
mysamplearray = np.random.rand(5, 3)  # 5-row by 3-column array of random floats
np.random.seed(0)                     # seed the pseudo-random number generator for reproducible results
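
Useful attributes for inspecting any NumPy array (standard NumPy, shown on the array above):

mysamplearray.shape  # dimensions, e.g. (5, 3)
mysamplearray.ndim   # number of dimensions
mysamplearray.dtype  # element type, e.g. float64
mysamplearray.size   # total number of elements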

## plotting using matplotlib in Python

%matplotlib inline

import matplotlib.pyplot as mplt
import pandas as pd
import numpy as np

x = [1, 2, 3, 4]
y = [11, 22, 33, 44]
mplt.plot(x, y)
------------------

## Flow of matplotlib

# 1. import matplotlib and get it ready for plotting in Jupyter
%matplotlib inline
import matplotlib.pyplot as plt

# 2. prepare data
x = [1, 2, 3, 4]
y = [11, 22, 33, 44]

# 3. set up the plot
fig, ax = plt.subplots(figsize=(10, 10))  # figsize=(width, height)

# add data to the plot
ax.plot(x, y)

# 4. customize the plot
ax.set(title="SimplePlot", xlabel="x values", ylabel="y values")

# 5. save your figure
fig.savefig("images/simpleplot.png")



## Making figures using NumPy arrays

# Line plot
# Scatter plot
# Bar plot
# Histogram

import numpy as np

# create some data
x = np.linspace(0, 10, 100)
x[:10]

# plot the data as a line plot
fig, ax = plt.subplots()
ax.plot(x, x**2);

# the same data as a scatter plot
fig, ax = plt.subplots()
ax.scatter(x, np.sin(x))


# making a bar plot from a dictionary

prices = {"butter": 10, "Cake": 20, "Milk": 50}
fig, ax = plt.subplots()
ax.bar(prices.keys(), prices.values())
ax.set(title='butter Cake Milk Shop', ylabel="prices")
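
The list above also names histograms; a minimal sketch (assuming the same plt and np imports as above):

x = np.random.randn(1000)  # 1000 samples from a normal distribution
fig, ax = plt.subplots()
ax.hist(x);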


# How to plot our data frame using pandas

import pandas as pd

# make a data frame
cs = pd.read_csv("CS.csv")
cs["Price"] = cs["Price"].str.replace('[\$\,\.]', '', regex=True)  # strip $ , . characters
cs["Price"] = cs["Price"].str[:-2]  # drop the last two digits (the cents)

# add a date column to our data frame
cs["S_date"] = pd.date_range("1/1/2020", periods=len(cs))

cs["Total_S"] = cs["Price"].astype(int).cumsum()  # running total

# now plot S_date against Total_S

cs.plot(x="S_date", y="Total_S");

# Plot a bar graph
x = np.random.rand(10, 4)
# now make a data frame from this random data
df = pd.DataFrame(x, columns=['a', 'b', 'c', 'd'])

df.plot.bar()

cs.plot(x="Make", y="Meter", kind="bar")

Histogram

cs["Price"].plot.hist()





cs["Price"] = cs["Price"].astype(int)  # make sure Price is numeric first
cs.plot(x="Odm", y="Price", kind="scatter");

# Using subplots to draw each column in its own panel

cs.head()
cs.plot.hist(figsize=(10, 30), subplots=True);


# Object-oriented method

fig, ax = plt.subplots(figsize=(10, 6))
df.plot(kind='scatter', x='age', y='high', c='tgt', ax=ax);
ax.set_xlim([50, 100])

## OO method from the base

fig, ax = plt.subplots(figsize=(10, 6))

# plot the data
scatter = ax.scatter(x=df["age"],
                     y=df["ch"],
                     c=df["trgt"])

# customize the plot
ax.set(title="HDD and LL",
       xlabel="Age",
       ylabel="ch")

# add a legend
ax.legend(*scatter.legend_elements(), title="trgt");

# add a horizontal line at the mean
ax.axhline(df["ch"].mean(), linestyle='--')

# subplots for age, height, weight
fig, (ax0, ax1) = plt.subplots(nrows=2, ncols=1, figsize=(10, 10))
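
The subplot call above only creates two empty axes; a minimal sketch of filling them (reusing the hypothetical "age" and "ch" columns from the scatter example):

ax0.hist(df["age"]);
ax1.hist(df["ch"]);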



Scikit-Learn (also known as sklearn) is a machine learning library for Python

Workflow of Machine Learning Models

Getting data ready (to be used with ML models)
Choosing a machine learning model
Fitting a model to the data
Making predictions with a model
Evaluating model predictions
Improving model predictions
Saving and loading models

An example of this workflow:

# Standard imports
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

#1. Get the data ready
###################################################################
# Import dataset
bad_disease = pd.read_csv("../data/bad-disease.csv")
# View the data
bad_disease.head()
####################################################################
# Create X (all the feature columns)
X = bad_disease.drop("target", axis=1)
# Create y (the target column)
y = bad_disease["target"]
####################################################################
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
# View the data shapes
X_train.shape, X_test.shape, y_train.shape, y_test.shape

#2. Choose the model/estimator
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()

#3. Fit the model to the data and use it to make a prediction
#A model will (attempt to) learn the patterns in a dataset by calling the fit() function on it and passing it the data.
model.fit(X_train, y_train)

#Once a model has learned patterns in data, you can use them to make a prediction with the predict() function.
# Make predictions
y_preds = model.predict(X_test)

# This will be in the same format as y_test
y_preds

X_test.loc[209]
bad_disease.loc[209]

# Make a prediction on a single sample (has to be array)
model.predict(np.array(X_test.loc[209]).reshape(1, -1))
#array([0])


#4. Evaluate the model
#A trained model/estimator can be evaluated by calling the score() function and passing it a collection of data.

# On the training set
model.score(X_train, y_train)

# On the test set (unseen)
model.score(X_test, y_test)

#5. Experiment to improve (hyperparameter tuning)
#A model's first evaluation metrics aren't always its last. One way to improve a model's predictions is with hyperparameter tuning.

# Try different numbers of estimators (n_estimators is a hyperparameter you can change)
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {model.score(X_test, y_test)}")
    print("")

#6. Save a model for later use
#A trained model can be exported and saved so it can be imported and used later. One way to save a model is using Python's pickle module.

import pickle
# Save trained model to file
pickle.dump(model, open("random_forest_model_1.pkl", "wb"))

# Load a saved model and make a prediction on a single example
loaded_model = pickle.load(open("random_forest_model_1.pkl", "rb"))
loaded_model.predict(np.array(X_test.loc[209]).reshape(1, -1))

Another version of the example


# We fit (train) the model on the training data and make predictions on the testing data
# A classification model (random forest) finds the patterns in the training data
# then makes predictions on the test data

import pandas as pd
hd=pd.read_csv("data/baddisease.csv")
#create X (the feature columns)
X=hd.drop("target",axis=1)
#create y the label
y=hd["target"]
#Choose the right model and hyperparameters
# RandomForestClassifier is a classification machine learning model
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(n_estimators=100)
# we will keep the default hyperparameters
clf.get_params()
# 3. Fit the model to the training data
from sklearn.model_selection import  train_test_split
X_train, X_test, y_train, y_test= train_test_split(X, y,test_size=0.2)
clf.fit(X_train, y_train)      # fitting the model to the training data
y_preds = clf.predict(X_test)  # making predictions on the test data
y_preds
#4 Evaluate the model on the training data and testing data.
clf.score(X_train,y_train)
clf.score(X_test,y_test)