Scikit-Learn Library
Data and Feature Processing
Modelling
3 Steps
- Instantiate model
- model.fit
- model.predict
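The three steps above can be sketched as follows. This is a minimal illustration, assuming the bundled iris dataset and a LogisticRegression estimator; the dataset choice and the max_iter value are assumptions made for the example, not part of the notes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Illustrative data: the iris dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# 1. Instantiate the model (max_iter raised so the solver converges).
model = LogisticRegression(max_iter=1000)

# 2. Fit the model to the training data.
model.fit(X, y)

# 3. Predict labels for (here, the first five) samples.
predictions = model.predict(X[:5])
print(predictions)
```

The same instantiate / fit / predict pattern applies to most scikit-learn estimators, which is why swapping one model for another usually only changes the first step.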
Supervised Learning
- linear_model
- LogisticRegression
- tree
- ensemble
Unsupervised Learning
- neighbors
- KNeighborsClassifier(3, weights='distance') (a supervised classifier; NearestNeighbors is the unsupervised counterpart in the same module)
- cluster
- KMeans
- decomposition
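A short sketch of the cluster and decomposition modules listed above, assuming two synthetic, well-separated blobs of points. The data, the number of clusters, and the use of PCA as the decomposition example are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic data: two blobs of 50 points each, centred at (0, 0) and (5, 5).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])

# Clustering: no labels are supplied; KMeans assigns each point to a cluster.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Decomposition: project the 2-D points onto their first principal component.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)

print(labels[:5], labels[-5:])
print(X_reduced.shape)
```

Unsupervised estimators follow the same fit-based interface, but they expose fit_predict or fit_transform because there is no target y to learn from.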
Model Training and Model Selection
sklearn.
- model_selection
- train_test_split(X, y, test_size, [random_state, stratify])
- GridSearchCV
- tunes hyperparameters via cross-validated search over a parameter grid
- metrics
- roc_curve, auc
- confusion_matrix, precision_score, recall_score, classification_report
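The model selection and metrics functions above could be combined roughly as follows. This is a sketch under assumptions: the breast cancer dataset, the DecisionTreeClassifier estimator, and the max_depth grid are all illustrative choices rather than anything prescribed by the notes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import auc, classification_report, confusion_matrix, roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of the data; stratify keeps the class ratio equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Tune one hyperparameter with 5-fold cross-validation on the training split.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},
    cv=5,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out test split.
y_pred = search.predict(X_test)
y_score = search.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)
cm = confusion_matrix(y_test, y_pred)

print(search.best_params_)
print(cm)
print(classification_report(y_test, y_pred))
```

Keeping the test split out of the grid search means the reported metrics are not biased by the hyperparameter choice.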
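Patsy
Patsy builds design matrices from formula strings. Its syntax is closely inspired by and compatible with the formula mini-language used in R and S.
patsy.
- dmatrices('y ~ x0 + x1', data); append '+ 0' to the formula to drop the intercept
- returns the design matrices y and X (ndarray subclasses carrying additional metadata)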
- X.design_info
- can use standardize(x), center(x), C(x)
- C(x) - categorical data
- automatically expanded into dummy variables
- build_design_matrices([X.design_info], new_data) applies the same transformations to new data
Patsy design matrices can be passed directly to functions such as those in numpy.linalg
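The Patsy calls listed above might be used like this. The sketch assumes a small hand-made pandas DataFrame; the column names y, x0, and x1 and the values are invented for illustration.

```python
import numpy as np
import pandas as pd
from patsy import build_design_matrices, dmatrices

# Illustrative data: y depends linearly on x0; x1 is a categorical column.
data = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0],
    "x0": [0.5, 1.5, 2.5, 3.5],
    "x1": ["a", "b", "a", "b"],
})

# Build the response and design matrices; C(x1) is dummy-coded automatically.
y, X = dmatrices("y ~ x0 + C(x1)", data)
print(X.design_info.column_names)

# Design matrices are ndarray subclasses, so numpy.linalg accepts them as-is.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Re-apply the stored design to new data before predicting.
new_data = pd.DataFrame({"x0": [4.5], "x1": ["a"]})
(new_X,) = build_design_matrices([X.design_info], new_data)
prediction = np.asarray(new_X) @ coef
print(prediction)
```

Reusing X.design_info guarantees that new data gets the same column layout and categorical coding as the data the model was fitted on.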
Statsmodels
Statsmodels includes classical frequentist statistical models such as
- Linear Models, generalized linear models
- Linear Mixed Effects Models
- Analysis of Variance methods
- Time Series Processing and State Space Models
- Generalized Method of Moments
Basic Usage
statsmodels.api - array-based model API
sm.
- add_constant()
- OLS
- yields a model
- tsa
statsmodels.formula.api - formula-based (Patsy) model API
smf.
- ols
model.
- fit()
- predict()
Additional Libraries
Boosting Trees
Examples
H2O
H2O is Java-based software for data modeling and general computing. It does many things, but its primary purpose is to serve as a distributed (many machines), parallel (many CPUs), in-memory (several hundred GB of JVM heap, set with -Xmx) processing engine.
H2O algorithms
Vowpal Wabbit