python - Using the predict_proba() function of RandomForestClassifier in the safe and right way
I'm using scikit-learn to apply a machine learning algorithm to my datasets. Sometimes I need the probabilities of the labels/classes instead of the labels/classes themselves. Instead of having spam/not spam as the labels of emails, I wish to have, for example, a 0.78 probability that a given email is spam.
For this purpose, I'm using predict_proba() with RandomForestClassifier as follows:
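from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# min_samples_split must be at least 2 in current scikit-learn releases
clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y)
print(scores.mean())
classifier = clf.fit(X, y)
predictions = classifier.predict_proba(Xtest)
print(predictions)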
and I got these results:
[ 0.4 0.6]
[ 0.1 0.9]
[ 0.2 0.8]
[ 0.7 0.3]
[ 0.3 0.7]
[ 0.3 0.7]
[ 0.7 0.3]
[ 0.4 0.6]
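where the second column is for the class: spam. However, I have two main issues with the results that I am not confident about. The first issue is whether the results represent the probabilities of the labels without being affected by the size of my data. The second issue is that the results show only one digit, which is not very specific in some cases where a 0.701 probability is very different from 0.708. Is there any way to get the next 5 digits, for example?
Many thanks in advance for your time in reading these two issues and their questions.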
I get more than 1 digit in my results. Are you sure it is not due to your dataset? (For example, using a very small dataset would yield simple decision trees and so 'simple' probabilities.) Otherwise it may only be the display that shows 1 digit; try printing predictions[0,0].
I am not sure I understand what you mean by "the probabilities aren't affected by the size of my data". If your concern is that you don't want to predict, e.g., too many spams, what is usually done is to use a threshold t such that you predict 1 if proba(label==1) > t. This way you can use the threshold to balance your predictions, for example to limit the global probability of spams. And if you want to globally analyse your model, we usually compute the area under the curve (AUC) of the receiver operating characteristic (ROC) curve (see the Wikipedia article on the ROC curve). Basically, the ROC curve is a description of your predictions depending on the threshold t.
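To make the points above concrete, here is a minimal sketch. It runs on synthetic data from make_classification as a stand-in for the spam set, and the threshold value t = 0.7 and the tree counts are only illustrative assumptions. A forest's predict_proba is the mean of the per-tree class probabilities, so with 10 fully grown trees the values tend to fall on multiples of 0.1; adding trees is one way to get a finer grid. The sketch also applies a threshold and computes the ROC AUC.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (assumption: not the asker's dataset).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 10 fully grown trees: each tree votes 0 or 1 for a sample, so the
# averaged probability is a multiple of 0.1.
small = RandomForestClassifier(n_estimators=10, random_state=0)
proba_small = small.fit(X_train, y_train).predict_proba(X_test)[:, 1]

# More trees give a finer grid of possible probability values.
large = RandomForestClassifier(n_estimators=500, random_state=0)
proba_large = large.fit(X_train, y_train).predict_proba(X_test)[:, 1]

# Ask NumPy to display up to 5 decimals instead of its default formatting.
np.set_printoptions(precision=5, suppress=True)
print(proba_small[:5])
print(proba_large[:5])

# Predict "spam" (1) only when the probability exceeds a threshold t.
t = 0.7  # illustrative value, tune it for your own data
pred_spam = (proba_large > t).astype(int)
print("fraction predicted as spam:", pred_spam.mean())

# Threshold-independent summary: area under the ROC curve.
auc = roc_auc_score(y_test, proba_large)
print("ROC AUC:", auc)
```

Raising t makes the classifier flag fewer emails as spam, while the AUC summarises how well the probabilities rank spam above non-spam across all thresholds.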
Hope this helps!