Scikit-learn is one of the most popular free, open-source Python machine learning libraries among data scientists and machine learning practitioners. It contains many efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction.
In this article, I’m happy to share with you the top 5 new features presented in the new version of scikit-learn (1.0).
TABLE OF CONTENTS
- Install Scikit-learn v1.0
- New Flexible Plotting API
- Feature Names Support
- Pearson’s R Correlation Coefficient
- OneHot Encoder Improvements
- Histogram-based Gradient Boosting Models are now stable
Install Scikit-learn v1.0
Firstly, make sure you install the latest version (with pip):
pip install --upgrade scikit-learn
If you are using conda, use the following command:
conda install -c conda-forge scikit-learn
Note: Version 1.0.0 of scikit-learn requires Python 3.7+, NumPy 1.14.6+, and SciPy 1.1.0+. Matplotlib 2.2.2+ is an optional dependency, needed only for the plotting features.
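After installing, it can help to confirm which version is actually active in your environment. The snippet below is a minimal sketch: sklearn.__version__ is the library's real version attribute, while meets_minimum is a hypothetical helper written only for this illustration.

```python
import re

import sklearn


def meets_minimum(version, minimum=(1, 0)):
    """Hypothetical helper: True if a version string like '1.0.2' is >= minimum."""
    # Only the leading "major.minor" digits are compared, so suffixes
    # such as "rc1" or ".dev0" are ignored.
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


print("Installed scikit-learn version:", sklearn.__version__)
print("Is at least 1.0:", meets_minimum(sklearn.__version__))
```

If the second line prints False, the examples in the rest of this article may not run as written.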
Now, let’s look at the new features!
1. New Flexible Plotting API
Scikit-learn 1.0 introduces a more flexible plotting API for display classes such as metrics.PrecisionRecallDisplay, metrics.DetCurveDisplay, and inspection.PartialDependenceDisplay.
This plotting API comes with two class methods: from_estimator() and from_predictions().
from_estimator(): this class method takes a fitted estimator together with the data, computes the predictions, and plots the results in a single call.
Let’s look at an example by using PrecisionRecallDisplay to visualize Precision and Recall.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

classifier = RandomForestClassifier(random_state=42)
classifier.fit(X_train, y_train)

# Compute predictions from the fitted classifier and plot the curve
disp_pr = PrecisionRecallDisplay.from_estimator(classifier, X_test, y_test)
plt.show()
from_predictions(): with this class method, you pass the true labels and the predictions you have already computed, and get your plot.
Let’s look at an example by using ConfusionMatrixDisplay to visualize the confusion matrix.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

classifier = RandomForestClassifier(random_state=42)
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

# from_predictions expects the true labels first, then the predictions
disp_confusion = ConfusionMatrixDisplay.from_predictions(y_test, predictions, display_labels=classifier.classes_)
plt.show()
2. Feature Names Support (Pandas Dataframe)
In the new version of scikit-learn, you can track the column names of your pandas DataFrame when working with transformers or estimators.
When you pass a DataFrame to an estimator and call the fit method, the estimator stores the feature names in the feature_names_in_ attribute.
import pandas as pd
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["age", "days", "duration"])
scaler = StandardScaler().fit(X)

# The column names seen during fit are stored on the fitted scaler
print(scaler.feature_names_in_)
['age' 'days' 'duration']
Note: feature names support is only enabled when the column names in the dataframe are all strings.
3. Pearson’s R Correlation Coefficient
This is a new function in the feature selection module (feature_selection.r_regression) that measures the linear relationship between each feature and the target in regression tasks. The measure is also known as Pearson's r.
The cross-correlation between each regressor and the target is computed as
mean((X[:, i] - mean(X[:, i])) * (y - mean(y))) / (std(X[:, i]) * std(y)).
Note: X is the feature matrix of the dataset, X[:, i] is its i-th feature column, and y is the target variable. The outer mean is taken over the samples.
The following example shows how you can compute the Pearson’s r for each feature and the target.
from sklearn.datasets import fetch_california_housing
from sklearn.feature_selection import r_regression

X, y = fetch_california_housing(return_X_y=True)
print(X.shape)

# One correlation coefficient per feature
p = r_regression(X, y)
print(p)
[ 0.68807521 0.10562341 0.15194829 -0.04670051 -0.02464968 -0.02373741 -0.14416028 -0.04596662]
4. OneHot Encoder Improvements
The OneHotEncoder can accept categories it has not seen during fitting. You just need to set the handle_unknown parameter to 'ignore' (handle_unknown='ignore') when instantiating the transformer. In scikit-learn 1.0, this option can also be combined with the drop parameter.
When you transform data with an unknown category, the encoded columns for this feature will be all zeros.
In the following example, we pass an unknown category when we transform the data given.
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
X = [['secondary'], ['primary'], ['primary']]
enc.fit(X)

# 'degree' was not seen during fit, so it is an unknown category
transformed = enc.transform([['degree'], ['primary'], ['secondary']]).toarray()
print(transformed)
Note: In the inverse transform, an unknown category will be labeled as None.
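The round trip can be sketched as follows. The example reuses the same toy categories as above and adds a call to inverse_transform to show how the unknown category comes back; the variable names encoded and decoded are chosen only for this illustration.

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["secondary"], ["primary"], ["primary"]])

# The learned categories are sorted: 'primary' first, then 'secondary'
print(enc.categories_)

# 'degree' is unknown, so its row contains only zeros
encoded = enc.transform([["degree"], ["primary"], ["secondary"]]).toarray()
print(encoded)

# Going back: the all-zero row cannot be mapped to a category
decoded = enc.inverse_transform(encoded)
print(decoded)
```

The first row of encoded is [0, 0], and the corresponding entry of decoded is None, while the two known categories are recovered unchanged.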
5. Histogram-based Gradient Boosting Models are now Stable
The two supervised learning estimators HistGradientBoostingRegressor and HistGradientBoostingClassifier, first introduced as experimental in scikit-learn 0.21, are no longer experimental, and you can simply import and use them as:
from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor
There are more new features in scikit-learn 1.0.0 that I did not mention in this article. You can find the highlights of the other features in the official scikit-learn 1.0 release highlights.
Congratulations, you have made it to the end of this article! I hope you have learned something new that will help you on your next machine learning project.
If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!
You can also find me on Twitter @Davis_McDavid.
And you can read more articles like this here.