Pandas
Basic Data Structure
Creation and Attributes
- pd.Series(<list>, [index])
- Attributes: both object and index have name attribute
- pd.DataFrame(<ndarray/dictionary/list>, index = <pd.date_range/list>, columns = [list])
- index = pd.MultiIndex.from_tuples(<list of tuples>, names = [names of each layer]): will yield a dataframe with hierarchical indexing
- df.dtypes - display the dtype of every column
- float, datetime, category, object...
- pd.Categorical()
- pd.get_dummies(<categorical data>, prefix) - create dummy variables
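A minimal sketch of the constructors above (the data and names are illustrative, not from the notes):

```python
import pandas as pd

# Series from a list with a custom index; both the Series and its index can be named
s = pd.Series([1, 2, 3], index=["a", "b", "c"], name="counts")
s.index.name = "letter"

# DataFrame from a dictionary of columns
df = pd.DataFrame({"x": [1.0, 2.0], "y": ["u", "v"]})

# hierarchical (MultiIndex) rows built from tuples
mi = pd.MultiIndex.from_tuples([("A", 1), ("A", 2), ("B", 1)],
                               names=["group", "item"])
df2 = pd.DataFrame({"val": [10, 20, 30]}, index=mi)

print(df.dtypes)     # one dtype per column: x float64, y object
print(df2.loc["A"])  # selecting the outer level leaves the inner level
```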
Manipulation
methods can be chained (linked together by .)
Operations
- +, -, *, / - operate element-wise on values
Modify
df.
- reindex( index = <index>, columns = <list>)
- rename(index = <dict/func>, columns = <dict/func>)
- shift(<shift num>)
- T
- transpose
- sort_index([axis], [ascending])
- modify values
- add, radd, sub, rsub, div, rdiv, floordiv, rfloordiv, mul, rmul, pow, rpow
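The modify methods above, sketched with made-up data (the arithmetic methods like add take a fill_value for labels missing on one side):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])

# reindex introduces NaN for labels that did not exist before
wider = df.reindex(index=["r1", "r2", "r3"], columns=["a", "b", "c"])

# rename takes a dict or a function per axis
renamed = df.rename(index=str.upper, columns={"a": "alpha"})

# shift moves values down by n rows, leaving NaN at the top
shifted = df.shift(1)

# add() is + with a fill_value for labels missing on one side
other = pd.DataFrame({"a": [10]}, index=["r1"])
summed = df.add(other, fill_value=0)
```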
Binning
pd
- cut(<df>, <bins>)
- resulting values are Interval categories
- qcut
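A short sketch of the difference (illustrative data): cut uses fixed bin edges, qcut uses quantiles so each bin holds roughly the same count.

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 40, 70])

# cut: explicit bin edges; resulting values are Interval categories
groups = pd.cut(ages, bins=[0, 18, 65, 100])

# qcut: quantile-based bins
quartiles = pd.qcut(ages, 4)

print(groups.value_counts())  # histogram keyed by Interval
```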
View/ Conversion to other types
df.
- copy
- index, columns,
- returns Index
- values
- returns NumPy ndarray
Indexing and Slicing (and Add/Modify in place)
- df.<col name> - select one column
- df[<col name/ col name list>] - select one or more columns (a list returns a DataFrame)
- df[<index range>] -select a number of rows
- df.filter(regex = <regular expression>): select columns whose names match
- df.loc[<index range/index value>, <column list/column name>] - select by label
- df.at[<row label>, <col label>] - single value by label
- df.iloc[<row positions>, [column positions]]
- df.iat[<row position>, <column position>]
- df[ <logic expression>] - logical indexing
- df[<df/col>.isin(<vals>)]
- df.drop_duplicates()
- df.head(n), df.tail(n)
- df.sample(frac = [frac]/n = [sample size]) :select a fraction of rows randomly
- df.nlargest(n, 'value'), df.nsmallest(n, 'value')
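The indexing styles above side by side, on a throwaway frame (names are my own): .loc/.at take labels, .iloc/.iat take positions, and a boolean expression inside [] filters rows.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "c", "a"],
                   "val": [10, 20, 30, 40]},
                  index=["w", "x", "y", "z"])

col = df["val"]                      # one column as a Series
sub = df[["key", "val"]]             # list of columns -> DataFrame
rows = df["w":"x"]                   # label slice of rows (inclusive)
by_label = df.loc["w", "val"]        # .loc / .at use labels
by_pos = df.iloc[0, 1]               # .iloc / .iat use positions
mask = df[df["val"] > 15]            # boolean (logical) indexing
members = df[df["key"].isin(["a"])]  # membership test
```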
Random Samples
df.
- take(<positions, e.g. a random permutation>)
- sample
Functions
df.
- apply
- applymap
use with lambda functions
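A sketch with lambdas (toy data): apply works per column or per row, applymap element-wise over the whole frame.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# apply: function over each column (axis=0) or each row (axis=1)
col_range = df.apply(lambda s: s.max() - s.min())  # per column
row_sum = df.apply(lambda r: r.sum(), axis=1)      # per row

# applymap (note: not apply_map) is element-wise over the whole frame;
# pandas >= 2.1 prefers the equivalent DataFrame.map
doubled = df.applymap(lambda x: x * 2)
```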
Windows
df.
- expanding()
- rolling()
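The two window types on a toy series: rolling is a fixed-size moving window (NaN until it fills), expanding grows from the start.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# rolling: fixed-size moving window (NaN until the window fills)
roll_mean = s.rolling(window=3).mean()

# expanding: window grows from the start, each point sees all history
expand_sum = s.expanding().sum()
```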
Operations
df.
- query(<expr>)
- filter rows with a boolean string expression over column names
- stats
- mean([axis])
- var, std, skew, kurt
- mad
- Mathematics
- sub(<df>)
- cumsum(), cummax(), cummin(), cumprod()
- rank(method = [dense/min/first/max/average], pct = [True/False])
- shift([n])
- pct_change()
- String Operations
- str.<oper func>()
summarize
- .is_unique (attribute), unique()
- .nunique() - counts unique values in each col
- .value_counts(): produce a histogram
- len()
- describe()
- idx: labels, arg: positions
- argmin, argmax, idxmin, idxmax
- sum(), min(), max(), mean(), median(), var(), std()
- arguments
- axis, skipna, level
- count()
- quantile
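Several of the operations above on one small frame (data is illustrative); note idx* returns labels while arg* returns positions.

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [30, 10, 20]},
                  index=["x", "y", "z"])

top = df.query("a > 1")            # expression-based row filter
r = df["a"].rank(method="dense")   # dense ranks: 3, 1, 2
running = df["b"].cumsum()         # 30, 40, 60
stats = df.describe()              # count/mean/std/quantiles per column
pos = df["a"].argmax()             # position of the max
lab = df["a"].idxmax()             # label of the max
```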
Load and Write Data
df.
- to_csv(),
- read_csv(<filename> ,[args]), read_table(<file_name>, [args])
- sep = "|", ","", "\t"
- names = names
- encoding
- to_hdf(),
- to_excel(), read_excel
pd.
- read_csv
- nrows
- chunksize
- returns chunks via a TextParser iterator
- to_csv
- ExcelFile
- read_excel
- to_excel
- read_html
- read_pickle(), to_pickle
- to binary format
- other formats
- bcolz
- Feather
- HDFStore
- returns HDF5 object with dict-like API
- 2 storage schemas : 'fixed' 'table' - latter slightly slower
- methods
- put
- select
- to_hdf
- read_hdf
- read_sql(<command>, <db>)
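A sketch of chunked reading (an in-memory StringIO stands in for a file on disk): with chunksize, read_csv returns an iterator of DataFrames instead of one frame.

```python
import io
import pandas as pd

# a small in-memory CSV stands in for a file on disk
csv = io.StringIO("x,y\n1,a\n2,b\n3,c\n4,d\n")

# chunksize makes read_csv return an iterator of DataFrame chunks
total = 0
for chunk in pd.read_csv(csv, chunksize=2):
    total += chunk["x"].sum()
print(total)  # 10
```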
Merging and Grouping
pd.
- merge
- concat
pd.
- groupby(<key>, [axis], [level=index name], [as_index], [group_keys]).[columns].<aggregator>
- keys
- column name
- dictionary or series (mapping)
- map column (row) name to different values then group by values
- function
- eg.len
- Categorical object
- eg. ones returned by pd.cut
- aggregator
- count, sum, mean, median, std, var, min, max
- prod
- first, last - Non NA Values
- aggregate(lambda or list of lambda functions) or agg(lambda)
- aggregate([(<name1>, <lambda1>), (<name2>, <lambda2>)])
- give column names when aggregate using lambda
- apply - most general form
- function is called on each row group from the data frame, then the results are glued together with pandas.concat()
- as_index
- as_index=False: group keys come back as ordinary columns instead of a (possibly hierarchical) index
- group_keys
- group_keys=False: do not add the group keys to the result index, keep them as original columns
- keys
- looping
- for (k1, k2, ...), group in <grouped>:
- groupby(<key1>)['data1']
- syntactic sugar for df['data1'].groupby(df['key1'])
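A groupby sketch on toy data showing column selection, named (name, func) aggregations, and looping over groups:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"],
                   "data1": [1, 2, 3],
                   "data2": [10, 20, 30]})

# column selection after groupby, then an aggregator
sums = df.groupby("key")["data1"].sum()

# named aggregations via a list of (name, func) pairs
agg = df.groupby("key")["data2"].agg(
    [("total", "sum"),
     ("spread", lambda s: s.max() - s.min())])

# looping over groups yields (key, sub-frame) pairs
for k, grp in df.groupby("key"):
    print(k, len(grp))
```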
Pivot Table
Special case of groupby
- pd.pivot_table(<df>, values=<col name>, index=<col names>, columns=[cols to pivot on], [aggfunc=np.mean])
- df.pivot_table([cols to pivot], <index=>, [columns], [margins], [aggfunc=], [fill_value=], [dropna])
- margins=True will include partial total
- default aggfunc is mean
- df.pivot(<index>, <columns>, <values>)
- pd.crosstab
- special type of pivot table
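A pivot_table sketch with made-up sales data; margins=True adds the "All" partial totals, and crosstab is the frequency-count special case:

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "NY", "LA", "LA"],
                   "year": [2020, 2021, 2020, 2021],
                   "sales": [1, 2, 3, 4]})

# rows from `city`, columns from `year`; aggfunc defaults to mean
pt = pd.pivot_table(df, values="sales", index="city", columns="year",
                    aggfunc="sum", margins=True)  # margins adds "All" totals

# crosstab: a pivot table of frequencies
ct = pd.crosstab(df["city"], df["year"])
```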
Pandas Sql
from pandasql import sqldf
def pysqldf(q):
    return sqldf(q, globals())
q = '''
SELECT *
FROM df_uk_rain a LEFT JOIN df_right b
ON a.year = b.year
LIMIT 5
'''
pysqldf(q)
Combine DataSets
df[df.col.isin(<a col in another df>)]
- append(<rows>, ignore_index = [T/F]) (deprecated in newer pandas; use pd.concat)
Data Preprocessing
Descriptive
df.
- head(), tail()
- info()
- describe()
- isnull().sum(), notnull().sum()
- plot()
- plot.hist()
- plot.scatter()
Pandas Data Profiling
import pandas_profiling
pandas_profiling.ProfileReport(df)
Reshaping
df.stack(): compress a level of the data frame's columns into the row index (columns become an index level)
- unstack([index level]):unstack a index level to column
pd.melt(<df>, [id_vars]): columns not listed in id_vars are unpivoted into (variable, value) rows, one per value combination
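The reshape trio on a toy frame: melt goes wide-to-long, stack folds columns into the index, unstack undoes it.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "a": [10, 20], "b": [30, 40]})

# melt: columns not named in id_vars become (variable, value) rows
long = pd.melt(df, id_vars=["id"])  # 4 rows: id x {a, b}

# stack compresses the columns into an inner index level;
# unstack moves an index level back out to columns
stacked = df.set_index("id").stack()
wide_again = stacked.unstack()
```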
Duplication
df.
- duplicated - returns boolean (duplicated rows)
- [np.where(df_uk_rain[['water_year', 'rain_octsep']].duplicated())]
- drop_duplicates()
Outliers
use boxplot and winsorizing
Missing Data
Detection
df.
- isna
- isnull
- notnull
can combine with any()
Drop or Imputation
df.
- dropna([axis], [how], [thresh]) - drop rows/columns with missing values
- fillna(<value/method>) - fill missing values
# fill missing values with a model: train a classifier on the complete rows,
# then predict the missing column for the incomplete rows
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(3, weights='distance')
clf.fit(X_no_nan[['age', 'preMLScore', 'postMLScore']], X_no_nan['gender'])
x_imputed = clf.predict(X_with_nan[['age', 'preMLScore', 'postMLScore']])
X_with_imputed = X.copy()
# predict() already returns a 1-D array; no reshape needed for column assignment
X_with_imputed.loc[idx_with_nan, 'gender'] = x_imputed
X_with_imputed
Categorical Data
Get Dummies
Converting categorical data to "dummy" or "indicator" matrix
df.
- get_dummies()
- returns dummy variables
- get_indexer
- add_prefix
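A get_dummies sketch (illustrative data): one 0/1 indicator column per category, with a shared prefix.

```python
import pandas as pd

s = pd.Series(["red", "green", "red"])

# one indicator column per category, named <prefix>_<category>
dummies = pd.get_dummies(s, prefix="color")
print(dummies.columns.tolist())  # ['color_green', 'color_red']
```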
Vectorized String
df[<col>].str
- contains
- findall(<pattern>, flag)
- extract
- cat
- count
- endswith, startswith
- findall, get, isalnum, isalpha, isdecimal, isdigit, islower, isnumeric, isupper
- join, len, lower, upper
- match
- pad
- center
- repeat
- replace
- split
- strip, lstrip, rstrip
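A few of the vectorized string methods above on a toy Series (the .str accessor applies each operation element-wise):

```python
import pandas as pd

s = pd.Series(["alpha", "Beta", "gamma"])

has_mm = s.str.contains("mm")             # element-wise substring test
upper = s.str.upper()
parts = pd.Series(["x_y"]).str.split("_")  # a list per element
padded = s.str.pad(8, side="left")         # pad to width 8
```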
Time-Series Data(pd. Series)
df.
- resample(<arg>).<func>()
- aggregators : max, mean....
- ohlc
- rolling().<func> (apply)
- rolling window function
- tz_localize()
- tz_convert()
- to_period()
- convert to period data
- to_timestamp()
pd
- to_datetime()
- date_range()
- period_range(,[freq])
- asfreq()
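A time-series sketch on a toy daily series: resample buckets the index into a coarser frequency before aggregating, and to_period converts a DatetimeIndex into a PeriodIndex.

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=6, freq="D")
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# resample to a coarser frequency, then aggregate each bucket
by_3d = s.resample("3D").sum()  # [1+2+3, 4+5+6]

# parse strings / convert the index to periods
ts = pd.to_datetime(["2024-01-01", "2024-01-02"])
monthly = s.to_period("M")      # DatetimeIndex -> PeriodIndex
```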
Categorical Data
- df[cols].astype("category")
- df[<col>].cat
- categories (attribute)
- set_categories()
- df.sort_values(by)
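The pieces above combined (illustrative data): imposing an order on the categories makes sort_values semantic rather than alphabetic.

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "large", "medium"]})
df["size"] = df["size"].astype("category")

# impose an order on the categories so sorting is semantic, not alphabetic
df["size"] = df["size"].cat.set_categories(["small", "medium", "large"],
                                           ordered=True)
print(df.sort_values(by="size")["size"].tolist())
# ['small', 'medium', 'large']
```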