Pandas
Basic Data Structure
Creation and Attributes
- pd.Series(<list>, [index])
- Attributes: both object and index have name attribute
- pd.DataFrame(<ndarray/dictionary/list>, index = <pd.date_range/list>, columns = [list])
- index = pd.MultiIndex.from_tuples(<list of tuples>, names = [names of each layer]): will yield a dataframe with hierarchical indexing
- df.dtypes - display the dtype of every column
- float, datetime, category, object...
- pd.Categorical()
- pd.get_dummies(<categorical data>, prefix) - create dummy variables
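A minimal sketch of the constructors above (the data and names are illustrative, not from the notes):

```python
import pandas as pd

# Series from a list with a custom index; both the Series and its index can be named
s = pd.Series([1, 2, 3], index=["a", "b", "c"], name="counts")
s.index.name = "letter"

# DataFrame from a dictionary of columns
df = pd.DataFrame({"x": [1.0, 2.0], "y": ["u", "v"]})

# hierarchical (MultiIndex) rows built from tuples
mi = pd.MultiIndex.from_tuples([("A", 1), ("A", 2), ("B", 1)],
                               names=["group", "item"])
df2 = pd.DataFrame({"val": [10, 20, 30]}, index=mi)

print(df.dtypes)     # one dtype per column: x float64, y object
print(df2.loc["A"])  # selecting the outer level leaves the inner level
```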
Manipulation
methods can be chained (linked together by .)
Operations
- +, -, *, / - operate element-wise on values
Modify
df.
- reindex( index = <index>, columns = <list>)
- rename(index = <dict/func>, columns = <dict/func>)
- shift(<shift num>)
- T
- transpose
- sort_index([axis], [ascending])
- modify values
- add, radd, sub, rsub, div, rdiv, floordiv, rfloordiv, mul, rmul, pow, rpow
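The modify methods above, sketched with made-up data (the arithmetic methods like add take a fill_value for labels missing on one side):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])

# reindex introduces NaN for labels that did not exist before
wider = df.reindex(index=["r1", "r2", "r3"], columns=["a", "b", "c"])

# rename takes a dict or a function per axis
renamed = df.rename(index=str.upper, columns={"a": "alpha"})

# shift moves values down by n rows, leaving NaN at the top
shifted = df.shift(1)

# add() is + with a fill_value for labels missing on one side
other = pd.DataFrame({"a": [10]}, index=["r1"])
summed = df.add(other, fill_value=0)
```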
Binning
pd
- cut(<df>, <bins>)
- resulting values are Interval categories
- qcut
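A short sketch of the difference (illustrative data): cut uses fixed bin edges, qcut uses quantiles so each bin holds roughly the same count.

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 40, 70])

# cut: explicit bin edges; resulting values are Interval categories
groups = pd.cut(ages, bins=[0, 18, 65, 100])

# qcut: quantile-based bins
quartiles = pd.qcut(ages, 4)

print(groups.value_counts())  # histogram keyed by Interval
```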
View/ Conversion to other types
df.
- copy
- index, columns,
- returns Index
- values
- returns NumPy ndarray
Indexing and Slicing (and Add/Modify in place)
- df.<col name> - select one column
- df[<col name/ col name list>] - select one or more columns (a list returns a DataFrame)
- df[<index range>] -select a number of rows
- df.filter(regex = <regular expression>): select columns whose names match
- df.loc[<index range/index value>, <column list/column name>] - select by label
- df.at[<row label>, <col label>] - single value by label
- df.iloc[<row positions>, [column positions]]
- df.iat[<row position>, <column position>]
- df[ <logic expression>] - logical indexing
- df[<df/col>.isin(<vals>)]
- df.drop_duplicates()
- df.head(n), df.tail(n)
- df.sample(frac = [frac]/n = [sample size]) :select a fraction of rows randomly
- df.nlargest(n, 'value'), df.nsmallest(n, 'value')
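The indexing styles above side by side, on a throwaway frame (names are my own): .loc/.at take labels, .iloc/.iat take positions, and a boolean expression inside [] filters rows.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "c", "a"],
                   "val": [10, 20, 30, 40]},
                  index=["w", "x", "y", "z"])

col = df["val"]                      # one column as a Series
sub = df[["key", "val"]]             # list of columns -> DataFrame
rows = df["w":"x"]                   # label slice of rows (inclusive)
by_label = df.loc["w", "val"]        # .loc / .at use labels
by_pos = df.iloc[0, 1]               # .iloc / .iat use positions
mask = df[df["val"] > 15]            # boolean (logical) indexing
members = df[df["key"].isin(["a"])]  # membership test
```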
Random Samples
df.
- take(<positions, e.g. a random permutation>)
- sample
Functions
df.
- apply
- applymap
use with lambda functions
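A sketch with lambdas (toy data): apply works per column or per row, applymap element-wise over the whole frame.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# apply: function over each column (axis=0) or each row (axis=1)
col_range = df.apply(lambda s: s.max() - s.min())  # per column
row_sum = df.apply(lambda r: r.sum(), axis=1)      # per row

# applymap (note: not apply_map) is element-wise over the whole frame;
# pandas >= 2.1 prefers the equivalent DataFrame.map
doubled = df.applymap(lambda x: x * 2)
```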
Windows
df.
- expanding()
- rolling()
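The two window types on a toy series: rolling is a fixed-size moving window (NaN until it fills), expanding grows from the start.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# rolling: fixed-size moving window (NaN until the window fills)
roll_mean = s.rolling(window=3).mean()

# expanding: window grows from the start, each point sees all history
expand_sum = s.expanding().sum()
```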
Operations
df.
- query(<expr>)
- filter rows with a boolean string expression over column names
- stats
- mean([axis])
- var, std, skew, kurt
- mad
- Mathematics
- sub(<df>)
- cumsum(), cummax(), cummin(), cumprod()
- rank(method = [dense/min/first/max/average], pct = [True/False])
- shift([n])
- pct_change()
- String Operations
- str.<oper func>()
summarize
- .is_unique (attribute), unique()
- .nunique() - counts unique values in each col
- .value_counts(): produce a histogram
- len()
- describe()
- idx: labels, arg: positions
- argmin, argmax, idxmin, idxmax
- sum(), min(), max(), mean(), median(), var(), std()
- arguments
- axis, skipna, level
- count()
- quantile
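Several of the operations above on one small frame (data is illustrative); note idx* returns labels while arg* returns positions.

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [30, 10, 20]},
                  index=["x", "y", "z"])

top = df.query("a > 1")            # expression-based row filter
r = df["a"].rank(method="dense")   # dense ranks: 3, 1, 2
running = df["b"].cumsum()         # 30, 40, 60
stats = df.describe()              # count/mean/std/quantiles per column
pos = df["a"].argmax()             # position of the max
lab = df["a"].idxmax()             # label of the max
```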
Load and Write Data
df.
- to_csv(),
- read_csv(<filename> ,[args]), read_table(<file_name>, [args])
- sep = "|", ","", "\t"
- names = names
- encoding
- to_hdf(),
- to_excel(), read_excel
pd.
- read_csv
- nrows
- chunksize
- returns chunks via a TextParser iterator
- to_csv
- ExcelFile
- read_excel
- to_excel
- read_html
- read_pickle(), to_pickle
- to binary format
- other formats
- bcolz
- Feather
- HDFStore
- returns HDF5 object with dict-like API
- 2 storage schemas : 'fixed' 'table' - latter slightly slower
- methods
- put
- select
- to_hdf
- read_hdf
- read_sql(<command>, <db>)
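A sketch of chunked reading (an in-memory StringIO stands in for a file on disk): with chunksize, read_csv returns an iterator of DataFrames instead of one frame.

```python
import io
import pandas as pd

# a small in-memory CSV stands in for a file on disk
csv = io.StringIO("x,y\n1,a\n2,b\n3,c\n4,d\n")

# chunksize makes read_csv return an iterator of DataFrame chunks
total = 0
for chunk in pd.read_csv(csv, chunksize=2):
    total += chunk["x"].sum()
print(total)  # 10
```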
Merging and Grouping
pd.
- merge
- concat
pd.
- groupby(<key>, [axis], [level=index name], [as_index], [group_keys]).[columns].<aggregator>
- keys
- column name
- dictionary or series (mapping)
- map column (row) name to different values then group by values
- function
- eg.len
- Categorical object
- eg. ones returned by pd.cut
- aggregator
- count, sum, mean, median, std, var, min, max
- prod
- first, last - Non NA Values
- aggregate(lambda or list of lambda functions) or agg(lambda)
- aggregate([(<name1>, <lambda1>), (<name2>, <lambda2>)])
- give column names when aggregate using lambda
- apply - most general form
- function is called on each row group from the data frame, then the results are glued together with pandas.concat()
- as_index
- as_index=False: group keys come back as ordinary columns instead of a (possibly hierarchical) index
- group_keys
- group_keys=False: do not add the group keys to the result index, keep them as original columns
- keys
- looping
- for (k1, k2, ...), group in <grouped>:
- groupby(<key1>)['data1']
- syntactic sugar for df['data1'].groupby(df['key1'])
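A groupby sketch on toy data showing column selection, named (name, func) aggregations, and looping over groups:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"],
                   "data1": [1, 2, 3],
                   "data2": [10, 20, 30]})

# column selection after groupby, then an aggregator
sums = df.groupby("key")["data1"].sum()

# named aggregations via a list of (name, func) pairs
agg = df.groupby("key")["data2"].agg(
    [("total", "sum"),
     ("spread", lambda s: s.max() - s.min())])

# looping over groups yields (key, sub-frame) pairs
for k, grp in df.groupby("key"):
    print(k, len(grp))
```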
Pivot Table
Special case of groupby
- pd.pivot_table(<df>, values=<col name>, index=<col names>, columns=[cols to pivot on], [aggfunc=np.mean])
- df.pivot_table([cols to pivot], <index=>, [columns], [margins], [aggfunc=], [fill_value=], [dropna])
- margins=True will include partial total
- default aggfunc is mean
- df.pivot(<index>, <columns>, <values>)
- pd.crosstab
- special type of pivot table
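A pivot_table sketch with made-up sales data; margins=True adds the "All" partial totals, and crosstab is the frequency-count special case:

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "NY", "LA", "LA"],
                   "year": [2020, 2021, 2020, 2021],
                   "sales": [1, 2, 3, 4]})

# rows from `city`, columns from `year`; aggfunc defaults to mean
pt = pd.pivot_table(df, values="sales", index="city", columns="year",
                    aggfunc="sum", margins=True)  # margins adds "All" totals

# crosstab: a pivot table of frequencies
ct = pd.crosstab(df["city"], df["year"])
```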
Pandas Sql
from pandasql import sqldf
def pysqldf(q):
    return sqldf(q, globals())
q = '''
SELECT *
FROM df_uk_rain a LEFT JOIN df_right b
ON a.year = b.year
LIMIT 5
'''
pysqldf(q)
Combine DataSets
df[df.col.isin(<a col in another df>)]
- append(<rows>, ignore_index = [T/F]) (deprecated in newer pandas; use pd.concat)
Data Preprocessing
Descriptive
df.
- head(), tail()
- info()
- describe()
- isnull().sum(), notnull().sum()
- plot()
- plot.hist()
- plot.scatter()
Pandas Data Profiling
import pandas_profiling
pandas_profiling.ProfileReport(df)
Reshaping
df.stack(): compress a level of the data frame's columns into the row index (columns become an index level)
- unstack([index level]):unstack a index level to column
pd.melt(<df>, [id_vars]): columns not listed in id_vars are unpivoted into (variable, value) rows, one per value combination
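The reshape trio on a toy frame: melt goes wide-to-long, stack folds columns into the index, unstack undoes it.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "a": [10, 20], "b": [30, 40]})

# melt: columns not named in id_vars become (variable, value) rows
long = pd.melt(df, id_vars=["id"])  # 4 rows: id x {a, b}

# stack compresses the columns into an inner index level;
# unstack moves an index level back out to columns
stacked = df.set_index("id").stack()
wide_again = stacked.unstack()
```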
Duplication
df.
- duplicated - returns boolean (duplicated rows)
- [np.where(df_uk_rain[['water_year', 'rain_octsep']].duplicated())]
- drop_duplicates()
Outliers
use boxplot and winsorizing
Missing Data
Detection
df.
- isna
- isnull
- notnull
can combine with any()
Drop or Imputation
df.
- dropna([axis], [how], [thresh]) - drop rows/columns with missing values
- fillna(<value/method>) - fill missing values
# fill missing values with a model: train a classifier on the complete rows,
# then predict the missing column for the incomplete rows
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(3, weights='distance')
clf.fit(X_no_nan[['age', 'preMLScore', 'postMLScore']], X_no_nan['gender'])
x_imputed = clf.predict(X_with_nan[['age', 'preMLScore', 'postMLScore']])
X_with_imputed = X.copy()
# predict() already returns a 1-D array; no reshape needed for column assignment
X_with_imputed.loc[idx_with_nan, 'gender'] = x_imputed
X_with_imputed
Categorical Data
Get Dummies
Converting categorical data to "dummy" or "indicator" matrix
df.
- get_dummies()
- returns dummy variables
- get_indexer
- add_prefix
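A get_dummies sketch (illustrative data): one 0/1 indicator column per category, with a shared prefix.

```python
import pandas as pd

s = pd.Series(["red", "green", "red"])

# one indicator column per category, named <prefix>_<category>
dummies = pd.get_dummies(s, prefix="color")
print(dummies.columns.tolist())  # ['color_green', 'color_red']
```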
Vectorized String
df[<col>].str
- contains
- findall(<pattern>, flag)
- extract
- cat
- count
- endswith, startswith
- findall, get, isalnum, isalpha, isdecimal, isdigit, islower, isnumeric, isupper
- join, len, lower, upper
- match
- pad
- center
- repeat
- replace
- split
- strip, lstrip, rstrip
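A few of the vectorized string methods above on a toy Series (the .str accessor applies each operation element-wise):

```python
import pandas as pd

s = pd.Series(["alpha", "Beta", "gamma"])

has_mm = s.str.contains("mm")             # element-wise substring test
upper = s.str.upper()
parts = pd.Series(["x_y"]).str.split("_")  # a list per element
padded = s.str.pad(8, side="left")         # pad to width 8
```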
Time-Series Data(pd. Series)
df.
- resample(<arg>).<func>()
- aggregators : max, mean....
- ohlc
- rolling().<func> (apply)
- rolling window function
- tz_localize()
- tz_convert()
- to_period()
- convert to period data
- to_timestamp()
pd
- to_datetime()
- date_range()
- period_range(,[freq])
- asfreq()
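A time-series sketch on a toy daily series: resample buckets the index into a coarser frequency before aggregating, and to_period converts a DatetimeIndex into a PeriodIndex.

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=6, freq="D")
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# resample to a coarser frequency, then aggregate each bucket
by_3d = s.resample("3D").sum()  # [1+2+3, 4+5+6]

# parse strings / convert the index to periods
ts = pd.to_datetime(["2024-01-01", "2024-01-02"])
monthly = s.to_period("M")      # DatetimeIndex -> PeriodIndex
```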
Categorical Data
- df[cols].astype("category")
- df[<col>].cat
- categories (attribute)
- set_categories()
- df.sort_values(by)
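The pieces above combined (illustrative data): imposing an order on the categories makes sort_values semantic rather than alphabetic.

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "large", "medium"]})
df["size"] = df["size"].astype("category")

# impose an order on the categories so sorting is semantic, not alphabetic
df["size"] = df["size"].cat.set_categories(["small", "medium", "large"],
                                           ordered=True)
print(df.sort_values(by="size")["size"].tolist())
# ['small', 'medium', 'large']
```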