Data Cleaning in Machine Learning
All you need to know about cleaning data
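data.isnull().sum()  # count the missing values in each column
data.dropna(axis=1)  # drop every column that contains a missing value
data.drop(features_list, axis=1)  # drop the listed columns
data.select_dtypes(exclude=[dtypes_list])  # exclude takes dtypes such as 'object', not column names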
Use a for loop with an if data[col].isnull().any() condition to collect the columns that have missing values,
then apply the same list of columns when cleaning both the training and the test data.
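A minimal sketch of that loop, assuming two hypothetical pandas DataFrames named train and test that share the same columns. The column names and values are made up for illustration, and choosing the columns from the training data alone is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
train = pd.DataFrame({"rooms": [3, 4, np.nan], "area": [70.0, 85.0, 60.0]})
test = pd.DataFrame({"rooms": [2, np.nan], "area": [55.0, 90.0]})

# Collect the columns with at least one missing value in the training data.
cols_with_missing = []
for col in train.columns:
    if train[col].isnull().any():
        cols_with_missing.append(col)

# Drop the same columns from both sets so they keep identical features.
reduced_train = train.drop(cols_with_missing, axis=1)
reduced_test = test.drop(cols_with_missing, axis=1)
```

Dropping by a shared list keeps the two sets aligned, which matters because a fitted model expects the same features at prediction time.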
Imputer
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
filled_train = my_imputer.fit_transform(train_data)  # fit on the training data and fill it
filled_test = my_imputer.transform(test_data)  # reuse the fitted imputer on the test data
Extension to the imputer: work on a copy of the data so that the original stays unchanged.
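One way this could look, as a sketch on a hypothetical toy DataFrame. The _was_missing indicator columns are an assumption of this example (a common companion to imputation), not something the notes above spell out.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; names and values are illustrative only.
data = pd.DataFrame({"rooms": [3.0, np.nan, 5.0], "area": [70.0, 85.0, np.nan]})

# Work on a copy so the original frame is left untouched.
data_plus = data.copy()

# Assumption of this sketch: record where values were missing before imputing.
cols_with_missing = [col for col in data.columns if data[col].isnull().any()]
for col in cols_with_missing:
    data_plus[col + "_was_missing"] = data_plus[col].isnull().astype(int)

# The default SimpleImputer strategy fills each gap with the column mean.
my_imputer = SimpleImputer()
filled = pd.DataFrame(my_imputer.fit_transform(data_plus), columns=data_plus.columns)
```

fit_transform returns a NumPy array, so the result is wrapped in a DataFrame again to keep the column names.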
Use low cardinality (a small number of unique values) to select which categorical columns to encode.
#low_cardinality_cols
data[feature].nunique() < 10
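A short sketch of that selection, followed by one-hot encoding with get_dummies. The DataFrame and its column names are hypothetical, and treating every non-numeric column as categorical is an assumption of this example.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data: "color" has 3 unique values, "id_code" has 12.
data = pd.DataFrame({
    "color": ["red", "blue", "green"] * 4,
    "id_code": [f"id{i}" for i in range(12)],
    "price": [float(i) for i in range(12)],
})

# Keep the non-numeric columns with fewer than 10 unique values.
low_cardinality_cols = [
    col for col in data.columns
    if not is_numeric_dtype(data[col]) and data[col].nunique() < 10
]
numeric_cols = [col for col in data.columns if is_numeric_dtype(data[col])]

# One-hot encode only the selected categorical columns.
encoded = pd.get_dummies(data[low_cardinality_cols + numeric_cols])
```

Here id_code is left out because encoding it would add one new column per row, which is the situation the cardinality threshold is meant to avoid.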
Cross-validation
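from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
cross_val_score(RandomForestRegressor(n_estimators=50), X, y, scoring='neg_mean_absolute_error').mean()  # the score is the negated MAE; multiply by -1 to get the error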
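For a version that runs end to end, here is a sketch on synthetic data. The data, the cv=5 setting, and the random_state values are assumptions chosen only to make the example reproducible.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data, made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=100)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# One score per fold; each score is the negated mean absolute error.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")

# Flip the sign to report the error as a positive number.
mean_mae = -scores.mean()
print("Mean MAE across folds:", mean_mae)
```

scikit-learn negates the error so that a higher score always means a better model, which is why the sign is flipped before reporting.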
Terminology
fit(X, y); predict(X); tree; mean_absolute_error(y, pred); ensemble; metrics; RandomForestRegressor; DecisionTreeRegressor; describe; data.columns; model_selection; train_test_split; cross_val_score; max_leaf_nodes; get_dummies