Sign up with your email address to be the first to know about new products, VIP offers, blog features & more.

Pandas Cheat Sheet

Introduction

Pandas is a tool for data analysis. It organizes the data in a tabular form object called DataFrame. It has labeled rows and columns which allows fast search and powerful relational operations. It has some special data types like Data Frame, Index and Series.

Data Analysis

import pandas as pd # importing pandas

df = pd.read_csv('path/to/file.csv') #df mean data frame # reading a dataset from a csv

df.shape() # displaying the shape

df.head() # head of the table

df.info() # returns the index type, the columns and the type of each column

df.tail() # the tail of the table

df.columns # columns

df.describe() # shows statistics: mean, std, min, 25%, 50%, 75% and max, about the data - all of them ignores null values

df['column_name'].value_counts() # display the total if each value found in the table

df['some_column'].describe() # returns the count (total of non null), unique (distinct values), top (most frequent category), freq (occurrences of top)

df['some_column'].unique() # displays the count of distinct values for the column

df.nunique() # displays the distinct values for each column

Slicing / Splitting

df.iloc[:10,:] # using index location, from beginning, get first 10 rows and all columns

df.iloc[:10:3,:] # using index location, from beginning, get first 10 rows, EVERY 3 STRIDES, and all columns

df.iloc[-10:, :] # using index location, from the end, get -10 and all columns

X = data.iloc[:, 0:13] # get the first to thirteenth columns 
y = data.iloc[:, -1:] # get the last column

new_sliced_df = df.loc['2018-1-21':'2018-4-20'] # creates a new data frame with the start date and end date, for time series pandas allows partial strings

df.loc['some_value', 'some_column']

Dropping Columns

X = df.drop(['id_column', 'label_column', 'some_column'], axis=1) # drop the given columns

Dropping Rows

indexes = df[df['Age'] == 30 ].index
df.drop(indexes, inplace=True)

Filtering by Rows and Columns

Creating a new Dataframe copy with specific Columns

df_new = df.filter(['col1','col2','col3'], axis=1)

Single condition:

indices = df['some_column'] == 'some_value' # other comparison operators can be used, like !=, <, >...
new_df = df.loc[indices, :] # extract a new DataFrame

Multiple Conditions:

df_filtered = df[(df['num_preg'] >= start) & (df['num_preg'] <= end ) & (df['diabetes'] == 0 )]

Removing Duplicate Rows

df.drop_duplicates() # usually retains the first row when duplicates are found

Series

Extracting a single column from a data frame returns a Series object, which is a specialized object. A pandas’s series is a 1D labelled Numpy array and a DataFrame is a 2D labelled array whose columns are Series.

high = df['high'] # returns a Series
highs = high.values # returns a numpy array

Grouping with groupby()

df['some_column'].mean().groupby('some_column')
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
df['some_column'].mean().groupby('some_column')
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html

Building DataFrames

From CSV

df = pf.read_csv('path/to/file.csv') # create a dataframe from csv, index_col=0 to define the col to be the index

From Dictionary

adict = { 
 'name':['Fernando', 'Mariana', 'Kobe', Jordan],
 'scores':[100, 100, 98, 200]
}

df = pd.DataFrame(adict)
print(df)

'''
    name     scores
0   Fernando 100
1   Mariana  100
2   Kobe     98
3   Jordan   200
'''

From Lists (and then Dictionary)

list1 = ['Gohan', 'Fernando', 'Goku']
list2 = [31, 25, 45]
list3 = ['Half-Human','Human','Sayajin']
list_labels = ['name', 'age', 'type']
list_cols = [list1, list2, list3]

# zip will return iterable tuples
zipped = list(zip(list_labels, list_cols))

data_in_dict = dict(zipped)

df = pd.DataFrame(data_in_dict)
print(df)

'''
   age  name      type
0  31   Gohan     Half-Human
1  25   Fernando  Human
2  45   Goku      Sayajin
'''

Broadcasting

df.loc[::3,:] = np.nan # assigning scalar value to column slice BROADCASTS value to each row. In this case each 3 rows from beginning in the last column
df['attack'] = 'Final Flash' # will create a column in the data frame and assign the 'Final Flash' as value for each row
print(df)

'''
   age  name      type        attack
0  31   Gohan     Half-Human  Final Flash
1  25   Fernando  Human       Final Flash
2  45   Goku      Sayajin     Final Flash
'''

Index and Columns

# renaming columns
# the number of rows and columns must be the same
df.columns = ['possible_age','first_name','kind','attack?']
print(df)

'''
   possible_age first_name  kind        attack?
0  31           Gohan       Half-Human  Final Flash
1  25           Fernando    Human       Final Flash
2  45           Goku        Sayajin     Final Flash
'''
# changing index 
df.index = ['A','B','C']
print(df)

'''
   possible_age  first_name  kind        attack?
A  31            Gohan       Half-Human  Final Flash
B  25            Fernando    Human       Final Flash
C  45            Goku        Sayajin     Final Flash
'''
df.index = df['some_column']  # changing the index column 
df.index.name = 'new_name' # change the name for the index
# reindexing 
# go to: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html

Importing DataFrames

df = pf.read_csv('path/to/file.csv')
# extra important parameters for read_csv()
# header=None # will create a 0 based column name, can be useful if no header is provided or it does not make sense
# names = ['name1', 'name2', 'name3']
# na_values = [' '] # changes ' ' (spaces) to NaN in the data frame
# na_values = { 'name1': [' '], 'name2':[' -1']} # can be a dictionary that uses the column name
# parse_dates = [[0,1,2]] # will create a amalgamated (merged) column with, i.e. year_month_day
# parse_dates = True # try to transform date strings in datetime objects, the format of the date is ISO 8601

Trimming/Extracting unnecessary or redundant columns

interesting_cols = ['col', 'col2']
df = df[interesting_cols]

Exporting DataFrames

df.to_csv('filename.csv')
df.to_csv('filename.tsv', sep='\t')
df.to_excel('filename.xlsx')

References:
https://pandas.pydata.org/pandas-docs/stable/
Data Camp – Pandas Foundations

No Comments Yet.

What do you think?

Your email address will not be published. Required fields are marked *