Data Cleaning in Machine Learning


Use a for loop over the columns with an if data[col].isnull().any() check to find the columns that have missing values, and apply the same cleaning to both the training and the test data.


from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
filled_train = my_imputer.fit_transform(train_data)  # fit on the training data and fill it
filled_test = my_imputer.transform(test_data)  # reuse the fitted imputer on the test data
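A minimal sketch of the fit-on-train, transform-on-test pattern with a missing-value check on each dataset. The toy frames and their column names (rooms, area) are hypothetical and only illustrate the idea. Note that isnull().any() on a whole DataFrame returns one boolean per column, so the check is done per column here.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; column names are illustrative only.
train = pd.DataFrame({"rooms": [3.0, np.nan, 4.0], "area": [70.0, 85.0, np.nan]})
test = pd.DataFrame({"rooms": [np.nan, 2.0], "area": [60.0, 90.0]})

# Report which columns contain missing values in each dataset.
for name, frame in (("train", train), ("test", test)):
    missing = [col for col in frame.columns if frame[col].isnull().any()]
    print(name, "columns with missing values:", missing)

imputer = SimpleImputer()  # default strategy fills with the column mean

# Learn the fill values from the training data only, then reuse them on test data.
filled_train = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)
filled_test = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)
```

Fitting only on the training data keeps information from the test set out of the fill values.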

Extension to imputation: work on a copy of the data so the original stays unchanged.
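A rough sketch of working on a copy. The note only says to copy the data; adding a "_was_missing" indicator column per imputed column is an assumption here, based on a common form of this extension. The toy data and column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; column names are illustrative only.
original = pd.DataFrame({"rooms": [3.0, np.nan, 4.0], "area": [70.0, 85.0, np.nan]})

data_plus = original.copy()  # all changes go to the copy, not the original

# Assumed extension: record which values were missing before imputing.
cols_with_missing = [col for col in data_plus.columns if data_plus[col].isnull().any()]
for col in cols_with_missing:
    # Stored as 0.0/1.0 so every column stays numeric for the imputer.
    data_plus[col + "_was_missing"] = data_plus[col].isnull().astype(float)

imputer = SimpleImputer()
filled = pd.DataFrame(
    imputer.fit_transform(data_plus), columns=data_plus.columns, index=data_plus.index
)
```

The original frame still contains its missing values afterwards, so other cleaning strategies can be compared against the same starting point.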

Select categorical columns by low cardinality, that is, a small number of unique values:

data[feature].nunique() < 10
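As an illustration, the cardinality check can be combined with get_dummies to one-hot encode only the low-cardinality categorical columns. The data and column names below are hypothetical; the threshold of 10 is the one used in the check above.

```python
import pandas as pd

# Hypothetical toy data: "city" has 3 unique values, "listing_id" has 12.
data = pd.DataFrame({
    "city": ["Paris", "Rome", "Oslo"] * 4,
    "listing_id": ["id" + str(i) for i in range(12)],
    "price": [100, 150, 120, 90, 110, 130, 95, 105, 140, 125, 115, 135],
})

# Keep non-numeric columns with fewer than 10 unique values.
low_card_cols = [
    col for col in data.columns
    if not pd.api.types.is_numeric_dtype(data[col]) and data[col].nunique() < 10
]
numeric_cols = [col for col in data.columns if pd.api.types.is_numeric_dtype(data[col])]

# One-hot encode the selected columns; high-cardinality columns are dropped.
encoded = pd.get_dummies(data[numeric_cols + low_card_cols], columns=low_card_cols)
```

Dropping high-cardinality columns avoids creating one new column per unique value, which would add many sparse features.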


from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
cross_val_score(RandomForestRegressor(50), X, y, scoring='neg_mean_absolute_error').mean()
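A small runnable sketch of the cross-validation call on synthetic data. The dataset from make_regression, cv=5, and random_state=0 are assumptions for illustration. Because the scoring is the negated mean absolute error, the result is multiplied by -1 to get an error where lower is better.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data, used only to make the example self-contained.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# One score per fold; each is the negated MAE on that fold's held-out part.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")

mean_mae = -scores.mean()  # flip the sign to report a positive error
print("Mean absolute error across folds:", mean_mae)
```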


fit(X,y); predict(X); tree; mean_absolute_error(y, pred); ensemble; metrics; RandomForestRegressor; DecisionTreeRegressor; describe; data.columns; model_selection; train_test_split; cross_val_score; max_leaf_nodes; get_dummies


