Pandas


Basic Data Structure

Creation and Attributes

  • pd.Series(<list>, [index])
    • Attributes: both object and index have name attribute
  • pd.DataFrame(<ndarray/dictionary/list>, index = <pd.date_range/list>, columns = [list] )
    • index = pd.MultiIndex.from_tuples(<list of tuples>, names = [names of each layer]) : will yield a dataframe with hierarchical indexing
  • df.dtypes - display the dtype of every column
    • float, datetime, category, object...
    • pd.Categorical()
      • pd.get_dummies(<categorical data>, prefix) - create dummy variables
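A minimal sketch of the constructors and attributes above (toy data, hypothetical names):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"], name="counts")   # the Series and its index both take a name
s.index.name = "letter"

idx = pd.MultiIndex.from_tuples([("x", 1), ("x", 2), ("y", 1)], names=["group", "num"])
df = pd.DataFrame(np.random.randn(3, 2), index=idx, columns=["A", "B"])   # hierarchical row index
df.dtypes                                      # dtype of every column

cat = pd.Categorical(["low", "high", "low"])
pd.get_dummies(cat, prefix="lvl")              # dummy/indicator columns lvl_high, lvl_low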

Manipulation

Methods can be chained (linked together with .)

Operations

  • + - * / : operate element-wise on values

Modify

df.

  • reindex( index = <index>, columns = <list>)
  • rename(index = <mapping/func>, columns = <mapping/func>)
  • shift(<shift num>)
  • T
    • transpose
  • sort_index([axis], [level], [ascending])
  • modify values
    • add, radd, sub, rsub, div, rdiv, floordiv, rfloordiv, mul, rmul, pow, rpow
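A sketch of chaining several of the modification methods above, assuming a numeric DataFrame df like the one in the earlier example:

out = (df
       .rename(columns=str.lower)          # rename columns with a function
       .reindex(columns=["b", "a"])        # reorder / align columns
       .sort_index(ascending=False)
       .add(1)                             # element-wise arithmetic method
       .T)                                 # transpose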

Binning

pd

  • cut(<series/array>, <bins>)
    • result is categorical; the categories are intervals
  • qcut(<series/array>, <q>)
    • quantile-based binning (roughly equal-sized bins)
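Sketch: cut vs. qcut on a toy Series:

import pandas as pd

ages = pd.Series([5, 17, 25, 42, 67])
pd.cut(ages, bins=[0, 18, 35, 65, 100])   # fixed bin edges -> Interval categories
pd.qcut(ages, q=4)                        # quantile-based bins, roughly equal counts per bin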

View/ Conversion to other types

df.

  • copy
  • index, columns,
    • returns Index
  • values
    • returns a NumPy ndarray

Indexing and Slicing (and Add/Modify in place)

  • df.<col name> - select one column
  • df[<col name / col name list>] - select one or more columns
  • df[<index range>] - select a number of rows
  • df.filter(regex = <regular expression>): select columns whose names match
  • df.loc[<index range/index value>, <column list/column name>] - select by label
    • df.at[<row label>, <col label>]
  • df.iloc[<row positions>, [column positions]]
    • df.iat[<row position>, <column position>]
  • df[ <logic expression>] - logical indexing
    • df[<df/col>.isin(<vals>)]
    • df.drop_duplicates()
  • df.head(n), df.tail(n)
  • df.sample(frac = [frac]/n = [sample size]) :select a fraction of rows randomly
  • df.nlargest(n, 'value'), df.nsmallest(n, 'value')
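A sketch of the selection idioms above, assuming a hypothetical df with columns 'year' and 'value':

df["value"]                        # single column
df[["year", "value"]]              # list of columns
df[0:3]                            # slice of rows
df.loc[2, "value"]                 # by label
df.iloc[0, 1]                      # by position
df[df["value"] > 10]               # boolean / logical indexing
df[df["year"].isin([2001, 2002])]
df.filter(regex="^val")            # columns whose name matches a pattern
df.nlargest(3, "value")
df.sample(frac=0.1)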
Random Samples

df.

  • take(<integer positions>, e.g. a random permutation)
  • sample
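Sketch, assuming df and numpy imported as np:

sampler = np.random.permutation(len(df))   # random permutation of row positions
df.take(sampler)                           # reorder rows by position
df.sample(n=5)                             # or sample directly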

Functions

df.

  • apply
  • applymap

use with lambda functions
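Sketch, assuming a numeric df:

df.apply(lambda col: col.max() - col.min())   # once per column
df.apply(lambda row: row.sum(), axis=1)       # once per row
df.applymap(lambda x: round(x, 2))            # once per element (newer pandas also offers df.map)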

Windows

df.

  • expanding()
  • rolling()
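Sketch on a numeric Series s:

s.rolling(window=3).mean()    # fixed-size moving window
s.expanding().sum()           # window grows from the start of the series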

Operations

df.

  • query(<expr>)
    • filter rows using a boolean expression written as a string
  • stats
    • mean([axis])
    • var, std, skew, kurt
    • mad
  • Mathematics
    • sub(<df>)
    • cumsum(), cummax(), cummin(), cumprod()
    • rank(method = [dense/min/first/max/average], pct = [True/False])
    • shift([n])
    • pct_change()
  • String Operations
    • str.<oper func>()
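A sketch of a few of these operations, assuming df has a numeric 'value' column and a string 'name' column:

df.query("value > 10")                         # filter rows with a string expression
df["value"].mean(), df["value"].std(), df["value"].skew(), df["value"].kurt()
df["value"].cumsum()
df["value"].rank(method="dense", pct=True)
df["value"].pct_change()
df["name"].str.upper()                         # vectorized string operation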

summarize

  • is_unique (attribute), unique()
  • .nunique() - counts unique values in each col
  • .value_counts(): produce a histogram
  • len()
  • describe()
  • idx* methods return labels, arg* methods return positions
  • argmin, argmax, idxmin, idxmax
  • sum(), min(), max(), mean(), median(), var(), std()
  • arguments
    • axis, skipna, level
  • count()
  • quantile
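Sketch of the summary helpers, same hypothetical df:

df["name"].nunique()                 # number of distinct values
df["name"].value_counts()            # frequency table
df.describe()
df["value"].idxmax()                 # label of the max; argmax gives the position
df["value"].sum(skipna=True)
df["value"].quantile([0.25, 0.5, 0.75])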


Load and Write Data

df.

  • to_csv(),
  • read_csv(<filename> ,[args]), read_table(<file_name>, [args])
    • sep = "|", ","", "\t"
    • names = names
    • encoding
  • to_hdf(),
  • to_excel(), read_excel

pd.

  • read_csv
    • nrows
    • chunksize
      • returns an iterator over chunks
    • TextParser
  • to_csv
  • ExcelFile
  • read_excel
  • to_excel
  • read_html
  • read_pickle(), to_pickle
    • to binary format
    • other formats
      • bcolz
      • Feather
  • HDFStore
    • returns an HDF5 store object with a dict-like API
      • two storage schemas: 'fixed' and 'table' - the latter is slightly slower but supports queries
      • methods
        • put
        • select
  • to_hdf
  • read_hdf
  • read_sql(<command>, <db>)
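A sketch of chunked CSV reading and HDF5 storage (file names are hypothetical; HDF5 needs the PyTables package):

import pandas as pd

for chunk in pd.read_csv("big.csv", sep=",", chunksize=10_000):
    handle(chunk)                          # handle() is a placeholder for your own processing

store = pd.HDFStore("store.h5")
store.put("df", df, format="table")        # 'table' schema: slower but queryable
subset = store.select("df", where="index < 10")
store.close()

df.to_hdf("other.h5", key="df", mode="w")
pd.read_hdf("other.h5", "df")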

Merging and Grouping

Merge, Join and Concatenate

pd.

  • merge
  • concat

Grouping

pd.

  • groupby(<key>, [axis], [level=index name], [as_index], [group_keys]).[columns].<aggregator>
    • keys
      • column name
      • dictionary or series (mapping)
        • maps column (or row) names to values, then groups by those values
      • function
        • e.g. len
      • Categorical object
    • aggregator
      • count, sum, mean, median, std, var, min, max
      • prod
      • first, last - Non NA Values
      • aggregate(lambda or list of lambda functions) or agg(lambda)
        • aggregate([(<name1>, <lambda1>), (<name2>, <lambda2>)])
          • gives column names when aggregating with lambdas
      • apply - most general form
        • function is called on each row group from the data frame, then the results are glued together with pandas.concat()
    • as_index=False
      • keep group keys as regular columns instead of making them a (possibly hierarchical) index
    • group_keys=False
      • do not add the group keys to the index of results returned by apply
  • looping
    • for (k1, k2, ...), group in <grouped>:
  • groupby(<key1>)['data1']
    • syntactic sugar for df['data1'].groupby(df['key1'])
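A sketch of the groupby patterns above, assuming a hypothetical df with columns 'key1', 'key2', 'data1':

grouped = df.groupby("key1")["data1"]      # sugar for df["data1"].groupby(df["key1"])
grouped.mean()

# named lambda aggregations via (name, func) tuples
df.groupby(["key1", "key2"])["data1"].agg([("total", "sum"),
                                           ("spread", lambda x: x.max() - x.min())])

df.groupby("key1", as_index=False).sum(numeric_only=True)   # keys stay as columns
df.groupby("key1").apply(lambda g: g.nlargest(2, "data1"))  # apply: most general form

for (k1, k2), group in df.groupby(["key1", "key2"]):        # looping over groups
    print(k1, k2, len(group))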

Pivot Table

Special case of groupby

  • pd.pivot_table(<df>, values=<col name>, index=<col names>, columns=[cols to pivot on], [aggfunc=np.mean])
  • df.pivot_table([cols to pivot], <index=>, [columns], [margins], [aggfunc=], [fill_value=], [dropna])
    • margins=True will include partial total
    • default aggfunc is mean
  • df.pivot(<index>, <columns>, <values>)
  • pd.crosstab
    • special type of pivot table
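Sketch, assuming a hypothetical df with columns 'year', 'region', 'rain':

pd.pivot_table(df, values="rain", index="year", columns="region", aggfunc="mean")
df.pivot_table(values="rain", index="year", columns="region", margins=True, fill_value=0)
df.pivot(index="year", columns="region", values="rain")   # no aggregation; key pairs must be unique
pd.crosstab(df["year"], df["region"])                      # counts per combination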

Pandas Sql

from pandasql import sqldf

def pysqldf(q):
  return sqldf(q, globals())

q = '''
SELECT * 
FROM df_uk_rain a LEFT JOIN df_right b
ON a.year = b.year
LIMIT 5
'''

pysqldf(q)

Combine DataSets

  • df[df.col.isin(<a col in another df>)] - keep rows whose key appears in another DataFrame
  • append(<rows>, ignore_index = [T/F]) - deprecated in newer pandas; use pd.concat instead
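Sketch with hypothetical frames df_left / df_right sharing a 'year' column:

common = df_left[df_left["year"].isin(df_right["year"])]         # rows whose key exists in the other frame
combined = pd.concat([df_left, df_right], ignore_index=True)     # stack rows; replaces append()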

Data Preprocessing

Descriptive

df.

  • head(), tail()
  • info()
  • describe()
  • isnull().sum(), notnull().sum()
  • plot()
    • plot.hist()
    • plot.scatter()

Pandas Data Profiling

import pandas_profiling
pandas_profiling.ProfileReport(df)

Reshaping

  • df.stack(): compress a level of the DataFrame's columns (columns become an index level)

    • unstack([index level]): pivot an index level back into columns
  • pd.melt(<df>, [id_vars]): columns not in id_vars are unpivoted into (variable, value) rows
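Sketch on a hypothetical df with columns 'year', 'A', 'B':

wide = df.set_index("year")
stacked = wide.stack()            # columns become the innermost index level
stacked.unstack()                 # pivot that level back into columns
pd.melt(df, id_vars=["year"])     # 'A'/'B' unpivoted into variable / value rows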

Duplication

df.

  • duplicated() - returns a boolean Series marking duplicated rows
    • np.where(df_uk_rain[['water_year', 'rain_octsep']].duplicated()) - positions of the duplicated rows
  • drop_duplicates()

Outliers

Detect with a boxplot; handle by winsorizing (clipping extreme values).

Missing Data

Detection

df.

  • isna
  • isnull
  • notnull

can combine with any()
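Sketch:

df.isna().any()              # per column: does it contain missing values?
df.isnull().sum()            # count of missing values per column
df[df["value"].notnull()]    # keep rows where one column is present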

Drop or Imputation

df.

  • dropna
  • fillna(<method>)
    • method = 'ffill', method = 'bfill'
  • fill with model
# fill with a model: predict missing 'gender' from the rows that have no NaN
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(3, weights='distance')
clf.fit(X_no_nan[['age', 'preMLScore', 'postMLScore']], X_no_nan['gender'])
x_imputed = clf.predict(X_with_nan[['age', 'preMLScore', 'postMLScore']])
X_with_imputed = X.copy()
X_with_imputed.loc[idx_with_nan, 'gender'] = x_imputed   # predict() already returns a 1-D array
X_with_imputed
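The simpler drop/fill options from the list above, as a sketch:

df.dropna(how="any")
df.fillna(method="ffill")     # or 'bfill'; newer pandas prefers df.ffill() / df.bfill()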

Categorical Data

Get Dummies

Converting categorical data to "dummy" or "indicator" matrix

pd. / df.

  • pd.get_dummies(<series/df>)
    • returns dummy/indicator variables
    • Index.get_indexer
    • df.add_prefix(<prefix>)

Vectorized String

df[<col>].str (the Series.str accessor)

  • contains
  • findall(<pattern>, flag)
  • extract
  • cat
  • count
  • endswith, startswith
  • findall, get, isalnum, isalpha, isdecimal, isdigit, islower, isnumeric, isupper
  • join, len, lower, upper
  • match
  • pad
  • center
  • repeat
  • replace
  • split
  • strip, lstrip, rstrip
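Sketch, assuming a string column df['name']:

s = df["name"]
s.str.contains("rain")
s.str.findall(r"\d+")
s.str.extract(r"(\d{4})")                # capture group -> new column
s.str.replace("-", "_", regex=False)
s.str.split("_").str.get(0)
s.str.strip().str.lower()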

Time-Series Data (pd.Series)

df.

  • resample(<arg>).<func>()
    • aggregators : max, mean....
      • ohlc
  • rolling(<window>).<func>() or .apply(<func>)
    • rolling window function
  • tz_localize()
  • tz_convert()
  • to_period()
    • convert to period data
  • to_timestamp()

pd

  • to_datetime()
  • date_range()
  • period_range(<start>, <end>, [freq])
    • asfreq() - convert to a different frequency
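A sketch of the time-series helpers above on a toy daily series:

import numpy as np
import pandas as pd

idx = pd.date_range("2020-01-01", periods=100, freq="D")
ts = pd.Series(np.random.randn(100), index=idx)
ts.resample("M").mean()                          # downsample to monthly (ohlc, max, ... also work)
ts.rolling(window=7).mean()                      # 7-day rolling window
ts.tz_localize("UTC").tz_convert("US/Eastern")
ts.to_period("M").to_timestamp()
pd.to_datetime(["2020-01-01", "2020-02-15"])
pd.period_range("2020-01", periods=6, freq="M")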

Categorical Data

  • df[cols].astype("category")
  • df[<col>].cat
    • categories
    • set_categories(<list>, [ordered])
  • df.sort_values(by)
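Sketch, assuming a hypothetical column df['size'] with values like 'S', 'M', 'L':

df["size"] = df["size"].astype("category")
df["size"].cat.categories
df["size"] = df["size"].cat.set_categories(["S", "M", "L"], ordered=True)
df.sort_values(by="size")       # sorts by the declared category order, not alphabetically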
