1) Work environment using Jupyter
2) Data analysis using the Pandas library
3) NumPy (a fundamental library for data science)
4) Data visualization (the matplotlib library, which lets us make neat graphs and visuals that describe our data)
5) Building models, training models, and checking the accuracy of ML models (using the Scikit-learn library)
6) Types of learning (supervised learning: regression and classification)
7) Special topics such as deep learning, neural networks, and transfer learning, using the latest versions of TensorFlow and Keras
What is Machine Learning
In simple words, machine learning is when a computer learns from given input data.
Machine learning means using an algorithm or computer program to learn patterns in data, and then using that algorithm and what it has learned to make predictions about the future on similar data. Machine learning algorithms are also called models.
Types of Machine Learning
1) Supervised learning learns by drawing lines (decision boundaries) based on labelled data. Regression is an example.
2) Unsupervised learning is when we are given data without labels, e.g. CSV files without column names. Here we find association rules within the given data.
3) Reinforcement learning teaches a machine through trial and error, using rewards and punishments.
Steps in Data Modelling
1. Problem definition (what type of problem do we have at hand?)
2. Data (what type of data do we need to analyze?)
3. Evaluation (what counts as success and failure?)
4. Features (which features should we model?)
5. Model (what kind of model should we use to solve the problem?)
6. Experiments (which different combinations should we try?)
Conclusion: keep repeating the steps until the model solves the problem.
Flow of Data Science Work
A practical, typical data science workflow starts with:
1) Opening a CSV file in a Jupyter notebook
2) Exploring the data and performing data analysis using pandas
3) Making visualizations/graphs and comparing different data points using matplotlib
4) Building machine learning models on the data using Scikit-learn
Key
* Pandas (a Python library for data analysis)
* matplotlib (a Python library for visualizations and graphs)
* Jupyter Notebook (a tool for building machine learning projects)
* Scikit-learn, also known as sklearn (a free machine learning library for Python)
Types of Data
Unstructured data: things like images, voice data from mobile calls, and videos.
Structured data: things like a patient's medical history of disease, organized in rows and columns.
Types of Machine Learning
1. Supervised learning
2. Unsupervised learning
3. Transfer learning
4. Reinforcement learning
Types of Evaluation (common metrics per problem type)
Classification: Accuracy, Precision, Recall
Regression: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE)
Recommendation: Precision at K
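To make these metrics concrete, here is a minimal sketch using scikit-learn's metrics functions; the small label arrays are toy values made up purely for illustration:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             mean_absolute_error, mean_squared_error)
import numpy as np
# classification metrics on toy labels
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(accuracy_score(y_true, y_pred))   # fraction of predictions that were correct
print(precision_score(y_true, y_pred))  # of predicted positives, how many were truly positive
print(recall_score(y_true, y_pred))     # of actual positives, how many were found
# regression metrics on toy values
y_true_r = [3.0, 5.0, 2.5]
y_pred_r = [2.5, 5.0, 4.0]
print(mean_absolute_error(y_true_r, y_pred_r))  # MAE
mse = mean_squared_error(y_true_r, y_pred_r)    # MSE
print(mse, np.sqrt(mse))                        # MSE and RMSE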
Variables in Data Science
Feature variables are used to predict the target variable.
Feature variables are the attributes describing your data.
Kinds of feature variables
1. Numerical features (like body height)
2. Categorical features (like gender: male or female)
3. Derived features (like the number of visits a person makes to a hospital in a year)
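As a quick illustration, here is a hypothetical pandas DataFrame showing all three kinds of feature variables (the column names and values are invented for this sketch):

import pandas as pd
patients = pd.DataFrame({
    "height_cm": [170, 165, 180],           # 1. numerical feature
    "gender": ["male", "female", "male"],   # 2. categorical feature
    "hospital_visits": [3, 1, 5],           # yearly visit counts
})
# 3. derived feature, computed from an existing column
patients["visits_per_month"] = patients["hospital_visits"] / 12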
Important Concepts of ML
Split your data into 3 different sets:
1. Training dataset, to train your model
2. Validation dataset, to choose and tune your model
3. Test dataset, to test your model and compare it with different models
Usually you split your data as:
70% for the training dataset
15% for the validation dataset
15% for the test dataset
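scikit-learn's train_test_split only produces two sets at a time, so one common way to get a 70/15/15 split is to call it twice; a minimal sketch (X and y are the feature and target variables, defined as in the examples later in these notes):

from sklearn.model_selection import train_test_split
# first hold out 30% of the data, then split that 30% half-and-half
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50)
# result: ~70% train, ~15% validation, ~15% test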
Three Parts of Modelling
1. Choosing a model
2. Training a model
3. Comparing a model
Choosing a model
When working with structured data, the following models tend to work best:
1. Decision trees, such as random forests and gradient boosting algorithms (CatBoost, XGBoost)
When working with unstructured data, the following models tend to work best:
1. Deep learning neural networks
2. Transfer learning
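As a sketch, the tree-based models named above are available directly in scikit-learn (CatBoost and XGBoost ship as separate Python packages):

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
tree_model = RandomForestClassifier()       # ensemble of decision trees for structured data
boost_model = GradientBoostingClassifier()  # gradient boosting, also tree-based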
Training the Model
Training works by aligning the input (feature variables) and output (target variable) of the data:
feature variables as input → model (the model finds the patterns) → target variable as output
The input (feature variables) is called X: the data used to predict the output (target variable), which is called y.
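In code, that convention usually looks like the following sketch (assuming a DataFrame df whose label column is named "target", as in the examples later in these notes):

# X = feature variables (inputs), y = target variable (output)
X = df.drop("target", axis=1)
y = df["target"]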
Improving Your Model
Hyperparameters: choose different hyperparameters for your model.
How well the model line fits the data points is conceptualized as follows:
1. Overfitting (happens due to data leakage)
2. Underfitting (happens due to data mismatch)
3. Balanced (the "Goldilocks zone")
Data Leakage & Data Mismatch
Data leakage happens when some of your test data leaks into your training data. This often results in overfitting, or a model doing better on the test set than on the training dataset.
Data mismatch happens when the data used for testing is different from the data used for training, that is:
training dataset != testing dataset
features in the training dataset != features in the testing dataset
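A quick, rough way to spot these problems is to compare the model's score on the training set against its score on the test set; a sketch, assuming a fitted model and the train/test splits from the examples below:

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
# train much higher than test hints at overfitting;
# test much higher than train can hint at data leakage
print(f"train: {train_score:.2f}, test: {test_score:.2f}")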
Model Tuning
The training and testing datasets play different roles in model tuning:
The training dataset is used for:
1. Validation
2. Model tuning
The testing dataset is used for:
1. Testing
2. Model comparison
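Model tuning can also be automated; as a sketch, scikit-learn's GridSearchCV tries every hyperparameter combination in a grid using cross-validation on the training data (the grid values here are arbitrary examples):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {"n_estimators": [10, 50, 100], "max_depth": [None, 5, 10]}
# 5-fold cross-validation over every combination in the grid
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)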
To Build the Environment
In an Anaconda prompt window (Miniconda3):
conda create --prefix ./env pandas numpy matplotlib scikit-learn
Or, to install Jupyter separately:
conda install jupyter
C:\Users\Tahir\Desktop\sample_pro\env>conda activate C:\Users\Tahir\Desktop\sample_project_1\env
After running the command, the prompt changes to the following:
(C:\Users\Tahir\Desktop\sample_pro\env)
Note: for a kernel error in Jupyter Notebook, install as follows:
pip install jupyter_client --upgrade
Pandas in Python for Data Analysis
import pandas as pd
names=pd.Series(["Tah","Zah","Mah"])
colour=pd.Series(["Red","green","Blue"])
df=pd.DataFrame({"k1":names,"k2":colour})
# export to and read back from CSV using the DataFrame
df.to_csv("abc.csv",index=False)
df2=pd.read_csv("abc.csv")
# attributes of a DataFrame
df.dtypes
df.columns
df.index
# functions of a DataFrame
df.info()
df.describe()
df2["column_name"].sum()
df2.column_name
Environment Keys:
Hit Esc, then m, x, b, or a in Jupyter Notebook (m = markdown cell, x = cut cell, b = insert cell below, a = insert cell above)
Hit Shift+Tab to see the description of a function
Hit Shift+Enter to execute the command in Jupyter Notebook
Hit Ctrl+Shift+- to split a cell in Jupyter Notebook
Data Frames in Pandas
Columns are represented by axis=1; rows are represented by axis=0.
Index numbers start at 0 by default.
pd.crosstab(df["col1"],df["col2"])
df["col1"].fillna(df["col1"].mean(),inplace=True)
df.dropna(subset=["col1"],inplace=True)  # drop rows where col1 is missing
## how to create a column using a Series in Python
new_column=pd.Series([7,7,7,7])
df["New_Column"]=new_column
## how to create a column from a Python list
## (should be the same length as the DataFrame df)
capacity_column=[7.2,8.2,3.2,2.2]
df["An_other_new_Column"]=capacity_column
## to drop a column
df.drop("ColumnName", axis=1)
## to shuffle the DataFrame (frac=0.5 samples 50% of rows, frac=1 samples 100%)
new_DF=df.sample(frac=0.5)
## to remove the extra index column after shuffling
df.reset_index(drop=True,inplace=True)
## anonymous (lambda) function: apply a function to a column of the DataFrame
df["columnName"]=df["columnName"].apply(lambda x: x/2)
## the lambda takes each value x, divides it by 2, and the result is assigned back to the column
## ndarray in Python's NumPy package
import numpy as np
mysamplearray=np.ones((5,3))  # example shape: 5x3 array of ones; hit Shift+Tab to see the function's description
mysamplearray=np.zeros((5,3))  # example shape: 5x3 array of zeros
mysamplearray=np.arange(0,10,2)  # 0=start, 10=end (exclusive), 2=the step interval
mysamplearray=np.random.randint(0,10,size=(3,5))  # random integers from 0 up to (but not including) 10
mysamplearray=np.random.rand(5,3)  # 5-row and 3-column array of uniform random numbers
np.random.seed(42)  # seed the pseudo-random number generator so results are reproducible
## plotting using matplotlib in Python
%matplotlib inline
import matplotlib.pyplot as mplt
import pandas as pd
import numpy as np
x=[1,2,3,4]
y=[11,22,33,44]
mplt.plot(x,y)
------------------
## Flow of matplotlib
# 1. import matplotlib and get it ready for plotting in Jupyter
%matplotlib inline
import matplotlib.pyplot as plt
# 2. prepare data
x=[1,2,3,4]
y=[11,22,33,44]
# 3. set up plot
fig,ax=plt.subplots(figsize=(10,10))  # figsize=(width,height)
# add data
ax.plot(x,y)
# 4. customize plot
ax.set(title="SimplePlot", xlabel="x values", ylabel="y values")
# 5. save your figure
fig.savefig("images/simpleplot.png")
## Making figures using NumPy arrays
# Line plot
# Scatter plot
# Bar plot
# Histogram
import numpy as np
# create some data
x=np.linspace(0,10,100)
x[:10]
# plot the data and create a line plot
fig,ax=plt.subplots()
ax.plot(x,x**2);
# scatter plot
fig,ax=plt.subplots()
ax.scatter(x,np.sin(x))
# making a plot from dictionary
prices={"butter":10, "Cake":20, "Milk":50}
fig,ax=plt.subplots()
ax.bar(prices.keys(),prices.values())
ax.set(title='butter Cake Milk Shop',ylabel="prices")
#How to plot our DataFrame using Pandas
import pandas as pd
#make a data frame
cs=pd.read_csv("CS.csv")
#strip currency symbols, commas and dots from the Price column
cs["Price"]=cs["Price"].str.replace(r'[\$\,\.]','',regex=True)
cs["Price"]=cs["Price"].str[:-2]
#adding a date column to our dataframe
cs["S_date"]=pd.date_range("1/1/2020",periods=len(cs))
cs["Total_S"]=cs["Price"].astype(int).cumsum()
# now plot S_date against Total_S
cs.plot(x="S_date",y="Total_S");
# plot a bar graph
x=np.random.rand(10,4)
# now make a DataFrame of this random data
df=pd.DataFrame(x,columns=['a','b','c','d'])
df.plot.bar()
cs.plot(x="Make",y="Meter",kind="bar")
# histogram
cs["Price"]=cs["Price"].astype(int)
cs["Price"].plot.hist()
# scatter plot
cs.plot(x="Odm",y="Pr",kind="scatter");
#Using subplots to plot all columns in one figure
cs.head()
cs.plot.hist(figsize=(10,30),subplots=True);
#Object-Oriented Method
fig,ax=plt.subplots(figsize=(10,6))
df.plot(kind='scatter',x='age',y='high',c='tgt',ax=ax);
ax.set_xlim([50,100])
## OO Method from the base
fig,ax=plt.subplots(figsize=(10,6))
#Plot the data
scatter=ax.scatter(x=df["age"],
                   y=df["ch"],
                   c=df["trgt"])
#Customize the plot
ax.set(title="HDD and LL",
       xlabel="Age",
       ylabel="ch")
# Add a legend
ax.legend(*scatter.legend_elements(),title="trgt");
# Add a horizontal line
ax.axhline(df["ch"].mean(),linestyle='--')
# subplots (e.g. of age, height, weight)
fig,(ax0,ax1)=plt.subplots(nrows=2,ncols=1,figsize=(10,10))
Scikit-learn, also known as sklearn, is the machine learning library of Python.
Workflow of Machine Learning Models
1. Getting the data ready (to be used with ML models)
2. Choosing a machine learning model
3. Fitting the model to the data
4. Making predictions with the model
5. Evaluating the model's predictions
6. Improving the model's predictions
7. Saving and loading models
An example of this workflow:
# Standard imports
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
#1. Get the data ready
###################################################################
# Import dataset
bad_disease = pd.read_csv("../data/bad-disease.csv")
# View the data
bad_disease.head()
####################################################################
# Create X (all the feature columns)
X = bad_disease.drop("target", axis=1)
# Create y (the target column)
y = bad_disease["target"]
####################################################################
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
# View the data shapes
X_train.shape, X_test.shape, y_train.shape, y_test.shape
#2. Choose the model/estimator
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
#3. Fit the model to the data and use it to make a prediction
#A model will (attempt to) learn the patterns in a dataset by calling the fit() function on it and passing it the data.
model.fit(X_train, y_train)
#Once a model has learned patterns in data, you can use them to make a prediction with the predict() function.
# Make predictions
y_preds = model.predict(X_test)
# This will be in the same format as y_test
y_preds
X_test.loc[209]
bad_disease.loc[209]
# Make a prediction on a single sample (has to be array)
model.predict(np.array(X_test.loc[209]).reshape(1, -1))
#array([0])
#4. Evaluate the model
#A trained model/estimator can be evaluated by calling the score() function and passing it a collection of data.
# On the training set
model.score(X_train, y_train)
# On the test set (unseen)
model.score(X_test, y_test)
#5. Experiment to improve (hyperparameter tuning)
#A model's first evaluation metrics aren't always its last. One way to improve a model's predictions is with hyperparameter tuning.
# Try different numbers of estimators (n_estimators is a hyperparameter you can change)
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {model.score(X_test, y_test)}")
    print("")
#6. Save a model for later use
#A trained model can be exported and saved so it can be imported and used later. One way to save a model is using Python's pickle module.
import pickle
# Save trained model to file
pickle.dump(model, open("random_forest_model_1.pkl", "wb"))
# Load a saved model and make a prediction on a single example
loaded_model = pickle.load(open("random_forest_model_1.pkl", "rb"))
loaded_model.predict(np.array(X_test.loc[209]).reshape(1, -1))
Another Way of the Example
-- We fit (train) the model on the training data and make predictions on the testing data
-- The classification model (random forest) finds the patterns in the training data
-- We then make predictions on the test data
import pandas as pd
hd=pd.read_csv("data/baddisease.csv")
#create X (the feature columns)
X=hd.drop("target",axis=1)
#create y (the label)
y=hd["target"]
#2. Choose the right model and hyperparameters
# RandomForestClassifier is a classification machine learning model
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(n_estimators=100)
# we will keep the default hyperparameter
clf.get_params()
# 3. Fit the model to the training data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test= train_test_split(X, y,test_size=0.2)
clf.fit(X_train,y_train)  # fitting the model to the training data
y_preds=clf.predict(X_test)  # making predictions on the test data
y_preds
#4 Evaluate the model on the training data and testing data.
clf.score(X_train,y_train)
clf.score(X_test,y_test)