Data Cleaning in Machine Learning

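data.isnull().sum()  # count missing values in each column
data.dropna(axis=1)  # drop every column that contains a missing value
data.drop(features_list, axis=1)  # drop the listed columns
data.select_dtypes(exclude=['object'])  # keep only the non-text columns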

Use a for loop with an if data[col].isnull().any() condition to find the columns with missing values, then apply the same cleaning to both the training and test data.
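The loop described above can be sketched roughly as follows. The toy frames and their column names (rooms, area, age) are made up for illustration, and checking both sets for missing values is one possible reading of the note, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
train = pd.DataFrame({"rooms": [3, 4, 2], "area": [70.0, np.nan, 55.0], "age": [10, 5, 30]})
test = pd.DataFrame({"rooms": [2, 5], "area": [60.0, 80.0], "age": [np.nan, 12]})

# Collect every column that has a missing value in either set.
cols_with_missing = [
    col for col in train.columns
    if train[col].isnull().any() or test[col].isnull().any()
]

# Drop the same columns from both sets so their features stay aligned.
reduced_train = train.drop(cols_with_missing, axis=1)
reduced_test = test.drop(cols_with_missing, axis=1)
```

Dropping the same list from both frames keeps the model's input columns identical at training and prediction time.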

Imputer

from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
filled_train = my_imputer.fit_transform(train_data)  # fit on the training data and fill it
filled_test = my_imputer.transform(test_data)  # reuse the fitted imputer on the test data

Extension to imputer: work on a copy of the data.
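A minimal sketch of this extension, under one assumption: that besides copying the data, it also records which values were missing before imputing. The note itself only says to work on a copy, so the "_was_missing" indicator columns and the toy frame here are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data.
data = pd.DataFrame({"area": [70.0, np.nan, 55.0], "age": [10.0, 5.0, np.nan]})

# Work on a copy so the original frame is left untouched.
data_plus = data.copy()

# Assumed extra step: flag which rows were missing, one indicator column per feature.
cols_with_missing = [col for col in data.columns if data[col].isnull().any()]
for col in cols_with_missing:
    data_plus[col + "_was_missing"] = data_plus[col].isnull().astype(int)

# SimpleImputer's default strategy fills each gap with the column mean.
my_imputer = SimpleImputer()
filled = pd.DataFrame(
    my_imputer.fit_transform(data_plus),
    columns=data_plus.columns,
    index=data_plus.index,
)
```

fit_transform returns a plain NumPy array, so the result is wrapped back into a DataFrame to keep the column names.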

Use low cardinality (a small number of unique values) to select which categorical columns to encode.

#low_cardinality_cols
data[feature].nunique() < 10
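As a rough sketch, the check above can drive column selection before one-hot encoding with get_dummies. The toy frame, the column names, and the choice to treat every non-numeric column as categorical are assumptions made for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data: "color" has 3 unique values, "id_code" has 12.
n = 12
data = pd.DataFrame({
    "color": ["red", "blue", "green"] * 4,
    "id_code": [f"item_{i}" for i in range(n)],
    "price": [float(i) for i in range(n)],
})

# Non-numeric columns with fewer than 10 unique values.
low_cardinality_cols = [
    feature for feature in data.columns
    if not is_numeric_dtype(data[feature]) and data[feature].nunique() < 10
]
numeric_cols = [col for col in data.columns if is_numeric_dtype(data[col])]

# Keep numeric and low-cardinality columns, then one-hot encode the latter.
selected = data[low_cardinality_cols + numeric_cols]
encoded = pd.get_dummies(selected, columns=low_cardinality_cols)
```

Here id_code is left out because encoding it would add one column per row without giving the model anything to generalize from.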

Cross-validation

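from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
cross_val_score(RandomForestRegressor(n_estimators=50), X, y, scoring='neg_mean_absolute_error').mean()  # negative MAE; multiply by -1 for the error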
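For a version that runs end to end, here is a small sketch on synthetic data. The make_regression dataset, cv=5, and the random_state values are assumptions chosen only to make the example reproducible.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for real features X and target y.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# One score per fold; scikit-learn reports MAE negated so that higher is better.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")

# Flip the sign to get the usual (positive) mean absolute error.
mean_mae = -scores.mean()
print(f"Mean absolute error across folds: {mean_mae:.2f}")
```

The scores come back negative because scikit-learn scorers follow a "higher is better" convention, so the sign flip is needed before reading the result as an error.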

Terminology

fit(X, y); predict(X); tree; mean_absolute_error(y, pred); ensemble; metrics; RandomForestRegressor; DecisionTreeRegressor; describe; data.columns; model_selection; train_test_split; cross_val_score; max_leaf_nodes; get_dummies
