Incorporate Domain Knowledge into Your Model with Rule-Based Learning
You are given a labeled dataset and asked to predict labels for a new one. What would you do?
The first approach you would probably try is to train a machine learning model to find rules for labeling new data.
This is convenient, but it is hard to know why the machine learning model comes up with a particular prediction. You also can't incorporate your domain knowledge into the model.
Instead of relying on a machine learning model to make predictions, is there a way to set the rules for data labeling based on your own knowledge?
That is where human-learn comes in handy.
human-learn is a Python package for creating rule-based systems that are easy to construct and compatible with scikit-learn.
To install human-learn, type:
pip install human-learn
In the previous article, I talked about how to create a human learning model by drawing:
In this article, we will learn how to create a model with a simple function.
Feel free to play with and fork the source code of this article here:
To evaluate the performance of a rule-based model, let's start by predicting a dataset with a machine learning model.
We will use the Occupancy Detection Dataset from the UCI Machine Learning Repository as an example for this tutorial.
Our task is to predict room occupancy based on temperature, humidity, light, and CO2. A room is not occupied if Occupancy=0 and is occupied if Occupancy=1.
After downloading the dataset, unzip it and read the data:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Get train and test data
train = pd.read_csv("occupancy_data/datatraining.txt").drop(columns="date")
test = pd.read_csv("occupancy_data/datatest.txt").drop(columns="date")

# Get X and y
target = "Occupancy"
train_X, train_y = train.drop(columns=target), train[target]
val_X, val_y = test.drop(columns=target), test[target]
Take a look at the first ten records of the train dataset:
train.head(10)
Train scikit-learn's RandomForestClassifier model on the training dataset and use it to predict the test dataset:
# Train
forest_model = RandomForestClassifier(random_state=1)
forest_model.fit(train_X, train_y)

# Predict
machine_preds = forest_model.predict(val_X)

# Evaluate
print(classification_report(val_y, machine_preds))
The score is pretty good. However, we are unsure how the model comes up with these predictions.
Let's see if we can label the new data with simple rules.
There are four steps to creating rules for labeling data:
- Generate a hypothesis
- Observe the data to validate the hypothesis
- Start with simple rules based on the observations
- Improve the rules
Generate a Hypothesis
Light in a room is a good indicator of whether a room is occupied. Thus, we can assume that the brighter a room is, the more likely it is to be occupied.
Let's see if this is true by looking at the data.
Observe the Data
To validate our guess, let's use a box plot to find the difference in the amount of light between an occupied room (Occupancy=1) and an empty room (Occupancy=0).
import plotly.express as px
import plotly.graph_objects as go

feature = "Light"
px.box(data_frame=train, x=target, y=feature)
We can see a significant difference in the median amount of light between an occupied and an empty room.
Start with Simple Rules
Now we will create a rule for whether a room is occupied based on the light in that room. Specifically, if the amount of light is above a certain threshold, Occupancy=1, and Occupancy=0 otherwise.
But what should that threshold be? Let's start by choosing 100 as the threshold and see what we get.
To create a rule-based model with human-learn, we will:
- Write a simple Python function that specifies the rules
- Use FunctionClassifier to turn that function into a scikit-learn model
import numpy as np
from hulearn.classification import FunctionClassifier

def create_rule(data: pd.DataFrame, col: str, threshold: float = 100):
    return np.array(data[col] > threshold).astype(int)

mod = FunctionClassifier(create_rule, col="Light")
Predict the test set and evaluate the predictions:
mod.fit(train_X, train_y)
preds = mod.predict(val_X)
print(classification_report(val_y, preds))
The accuracy is better than what we got earlier using RandomForestClassifier!
Improve the Rules
Let's see if we can get a better result by experimenting with multiple thresholds. We will use parallel coordinates to analyze the relationship between a specific value of light and room occupancy.
from hulearn.experimental.interactive import parallel_coordinates

parallel_coordinates(train, label=target, height=200)
From the parallel coordinates, we can see that a room with light above 250 Lux has a high probability of being occupied. The optimal threshold that separates an occupied room from an empty room seems to lie somewhere between 250 Lux and 750 Lux.
Let's find the best threshold in this range using scikit-learn's GridSearchCV.
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(mod, cv=2, param_grid={"threshold": np.linspace(250, 750, 1000)})
grid.fit(train_X, train_y)
Get the best threshold:
best_threshold = grid.best_params_["threshold"]
best_threshold
> 364.61461461461465
Plot the threshold on the box plot.
Use the model with the best threshold to predict the test set:
human_preds = grid.predict(val_X)
print(classification_report(val_y, human_preds))
The threshold of 365 gives a better result than the threshold of 100.
Using domain knowledge to create rules with a rule-based model is nice, but there are some disadvantages:
- It doesn't generalize well to unseen data
- It's difficult to come up with rules for complex data
- There is no feedback loop to improve the model
Thus, combining a rule-based model and an ML model helps data scientists scale and improve the model while still being able to incorporate their domain expertise.
One easy way to combine the two models is to decide whether to reduce false negatives or false positives.
Reduce False Negatives
You might want to reduce false negatives in scenarios such as predicting whether a patient has cancer (it's better to mistakenly tell patients that they have cancer than to fail to detect it).
To reduce false negatives, choose positive labels when the two models disagree.
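With 0/1 label arrays, this policy amounts to a logical OR of the two models' predictions: a positive from either model wins. A minimal sketch (the prediction arrays here are hypothetical):

```python
import numpy as np

# Hypothetical predictions from the ML model and the rule-based model
ml_preds = np.array([0, 1, 0, 1])
rule_preds = np.array([1, 1, 0, 0])

# Reduce false negatives: predict positive whenever either model does (logical OR)
combined = np.maximum(ml_preds, rule_preds)
print(combined)  # [1 1 0 1]
```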
Reduce False Positives
You might want to reduce false positives in scenarios such as recommending videos that might be violent to kids (it's better to make the mistake of not recommending kid-friendly videos than to recommend adult videos to kids).
To reduce false positives, choose negative labels when the two models disagree.
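With 0/1 label arrays, this policy amounts to a logical AND of the two models' predictions: both models must agree before a positive is predicted. A minimal sketch (the prediction arrays here are hypothetical):

```python
import numpy as np

# Hypothetical predictions from the ML model and the rule-based model
ml_preds = np.array([0, 1, 0, 1])
rule_preds = np.array([1, 1, 0, 0])

# Reduce false positives: predict positive only when both models do (logical AND)
combined = np.minimum(ml_preds, rule_preds)
print(combined)  # [0 1 0 0]
```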
You can also use other, more complex policy layers to decide which prediction to choose.
For a deeper dive into how to combine an ML model and a rule-based model, I recommend checking out this excellent video by Jeremy Jordan.
Congratulations! You have just learned what a rule-based model is and how to combine it with a machine learning model. I hope this article gives you the knowledge needed to develop your own rule-based model.