Scikit-Learn Library

scikit-learn

Data and Feature Processing

categorical encoders
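As a minimal sketch of categorical encoding, the snippet below one-hot encodes a single string column with `OneHotEncoder`. The `colors` data is hypothetical, and `OneHotEncoder` is only one of several encoders in `sklearn.preprocessing` (chosen here as an assumption about what the note refers to).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical column
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

# handle_unknown="ignore" encodes unseen categories as all zeros at transform time
encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform returns a sparse matrix by default; convert it for display
encoded = encoder.fit_transform(colors).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)                 # one row per sample, one column per category
```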

Modelling

3 Steps

Instantiate model
model.fit
model.predict
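The three steps above can be sketched as follows. The toy data is hypothetical, and `LogisticRegression` is used only as a convenient example estimator; other scikit-learn estimators follow the same instantiate / fit / predict pattern.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, two separable classes
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()   # 1. instantiate the model
model.fit(X, y)                # 2. fit it to the training data

# 3. predict labels for new samples
pred = model.predict(np.array([[-1.0], [6.0]]))
print(pred)
```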

Supervised Learning

linear_model
- LogisticRegression
tree
ensemble
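For the `tree` and `ensemble` modules, a rough sketch might look like the following. `DecisionTreeClassifier` and `RandomForestClassifier` are picked here as representative estimators (the notes do not name specific classes), and the iris dataset and hyperparameter values are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A single shallow decision tree
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# An ensemble of trees (random forest)
forest_clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Accuracy on the training data, only to show that score() works the same way
tree_acc = tree_clf.score(X, y)
forest_acc = forest_clf.score(X, y)
print(tree_acc, forest_acc)
```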

Unsupervised Learning

neighbors
- KNeighborsClassifier(3, weights='distance')
  - a supervised classifier (needs labels); NearestNeighbors is the unsupervised counterpart
cluster
- KMeans
decomposition
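A small sketch of the modules listed above, on hypothetical synthetic data: `KMeans` clusters the points without labels, `PCA` (one option from `decomposition`, chosen here as an assumption) projects them to two dimensions, and `KNeighborsClassifier` is shown with labels because it is a supervised estimator.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Two well-separated 3-D blobs (hypothetical data)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 3))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 3))
X = np.vstack([blob_a, blob_b])
true_labels = np.array([0] * 50 + [1] * 50)

# Clustering: no labels are passed to fit_predict
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Dimensionality reduction: project onto the first two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Nearest neighbours classifier: 3 neighbours, weighted by inverse distance
knn = KNeighborsClassifier(3, weights="distance").fit(X, true_labels)
knn_pred = knn.predict(np.array([[5.0, 5.0, 5.0]]))

print(labels[:5], labels[-5:])
print(X_2d.shape, knn_pred)
```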

Model Training and Model Selection

sklearn.

model_selection
- train_test_split(X, y, test_size, [random_state, stratify])
- GridSearchCV
  - tune hyperparameters
metrics
- roc_curve, auc
- confusion_matrix, precision_score, recall_score, classification_report
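The pieces above can be combined roughly as follows. This is a sketch under assumptions: the breast cancer dataset, the scaling pipeline, the `C` grid, and `scoring="roc_auc"` are illustrative choices, not something the notes prescribe.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (auc, classification_report, confusion_matrix,
                             precision_score, recall_score, roc_curve)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hold out a stratified test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Tune the regularisation strength C with 5-fold cross-validation
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out data
y_pred = search.predict(X_test)
y_score = search.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)
cm = confusion_matrix(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print(search.best_params_)
print(cm)
print(classification_report(y_test, y_pred))
```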

Patsy

Patsy is closely inspired by, and compatible with, the formula mini-language used in R and S.

patsy.

dmatrices('y ~ x0 + x1 [+ 0]')
- append + 0 to drop the intercept
- returns (y, X) as DesignMatrix objects: NumPy ndarrays with additional metadata
  - X.design_info
- can use standardize(x), center(x), C(x)
  - C(x) - categorical data
    - treated as dummy variables automatically
build_design_matrices([<design_info>], new_data)
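A minimal sketch of the calls above, assuming `patsy` and `pandas` are installed. The DataFrame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from patsy import build_design_matrices, dmatrices

# Hypothetical data: one numeric and one categorical predictor
data = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0],
    "x0": [0.5, 1.5, 2.5, 3.5],
    "x1": ["a", "b", "a", "b"],
})

# C(x1) marks x1 as categorical; patsy dummy-codes it automatically
y, X = dmatrices("y ~ x0 + C(x1)", data)
print(X.design_info.column_names)

# Re-apply the same encoding to new data using the stored design_info
new_data = pd.DataFrame({"x0": [4.5], "x1": ["a"]})
(X_new,) = build_design_matrices([X.design_info], new_data)
print(np.asarray(X_new))
```

Reusing `design_info` keeps the column layout and category levels of the new matrix consistent with the matrix the model was originally built on.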

Patsy objects can be passed directly to functions such as

numpy.linalg

lstsq
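As a sketch of that hand-off, the design matrices below go into `numpy.linalg.lstsq` for an ordinary least-squares fit. The data is hypothetical and constructed so that y = 1 + 2 * x0 exactly; the `np.asarray` conversions are a defensive choice, not a requirement stated in the notes.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices

# Hypothetical data lying exactly on y = 1 + 2 * x0
data = pd.DataFrame({"x0": [0.0, 1.0, 2.0, 3.0], "y": [1.0, 3.0, 5.0, 7.0]})

# X has columns [Intercept, x0]; y is a single-column matrix
y, X = dmatrices("y ~ x0", data)

# Least-squares solution of X @ coef = y
coef, residuals, rank, singular_values = np.linalg.lstsq(
    np.asarray(X), np.asarray(y), rcond=None
)
print(dict(zip(X.design_info.column_names, coef.ravel())))
```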

Statsmodels

Statsmodels includes classical frequentist statistical models such as

Linear Models, generalized linear models
Linear Mixed Effects Models
Analysis of Variance methods
Time Series Processing and State Space Models
Generalized Method of Moments

Basic Usage

statsmodels.api - array-based model API

sm.

add_constant()
OLS
- yields a model
tsa

statsmodels.formula.api - formula-based (Patsy-style) model API

smf.

model.

fit()
predict()

Additional Libraries

Boosting Trees

XGBoost

LightGBM

Examples

Examples of Gradient Boosting

H2O

H2O is Java-based software for data modeling and general computing. It serves many purposes, but its primary role is as a distributed (many machines), parallel (many CPUs), in-memory (several hundred GB of JVM heap, set via -Xmx) processing engine.

H2O algorithms

Vowpal Wabbit

ScikitLearn