Table of Contents
- 1 Introduction
- 2 Data Analysis
- 3 Slicing / Splitting
- 4 Dropping Columns
- 5 Dropping Rows
- 6 Filtering by Rows and Columns
- 7 Removing Duplicate Rows
- 8 Series
- 9 Building DataFrames
- 10 Broadcasting
- 11 Index and Columns
- 12 Importing DataFrames
- 13 Trimming/Extracting unnecessary or redundant columns
- 14 Exporting DataFrames
Introduction
Pandas is a tool for data analysis. It organizes the data in a tabular form object called DataFrame. It has labeled rows and columns which allows fast search and powerful relational operations. It has some special data types like Data Frame, Index and Series.
Data Analysis
import pandas as pd # importing pandas df = pd.read_csv('path/to/file.csv') #df mean data frame # reading a dataset from a csv df.shape() # displaying the shape df.head() # head of the table df.info() # returns the index type, the columns and the type of each column df.tail() # the tail of the table df.columns # columns df.describe() # shows statistics: mean, std, min, 25%, 50%, 75% and max, about the data - all of them ignores null values df['column_name'].value_counts() # display the total if each value found in the table df['some_column'].describe() # returns the count (total of non null), unique (distinct values), top (most frequent category), freq (occurrences of top) df['some_column'].unique() # displays the count of distinct values for the column df.nunique() # displays the distinct values for each column
Slicing / Splitting
df.iloc[:10,:] # using index location, from beginning, get first 10 rows and all columns df.iloc[:10:3,:] # using index location, from beginning, get first 10 rows, EVERY 3 STRIDES, and all columns df.iloc[-10:, :] # using index location, from the end, get -10 and all columns X = data.iloc[:, 0:13] # get the first to thirteenth columns y = data.iloc[:, -1:] # get the last column new_sliced_df = df.loc['2018-1-21':'2018-4-20'] # creates a new data frame with the start date and end date, for time series pandas allows partial strings df.loc['some_value', 'some_column']
Dropping Columns
X = df.drop(['id_column', 'label_column', 'some_column'], axis=1) # drop the given columns
Dropping Rows
indexes = df[df['Age'] == 30 ].index df.drop(indexes, inplace=True)
Filtering by Rows and Columns
Creating a new Dataframe copy with specific Columns
df_new = df.filter(['col1','col2','col3'], axis=1)
Single condition:
indices = df['some_column'] == 'some_value' # other comparison operators can be used, like !=, <, >... new_df = df.loc[indices, :] # extract a new DataFrame
Multiple Conditions:
df_filtered = df[(df['num_preg'] >= start) & (df['num_preg'] <= end ) & (df['diabetes'] == 0 )]
Removing Duplicate Rows
df.drop_duplicates() # usually retains the first row when duplicates are found
Series
Extracting a single column from a data frame returns a Series object, which is a specialized object. A pandas’s series is a 1D labelled Numpy array and a DataFrame is a 2D labelled array whose columns are Series.
high = df['high'] # returns a Series highs = high.values # returns a numpy array
Grouping with groupby()
df['some_column'].mean().groupby('some_column') # https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
df['some_column'].mean().groupby('some_column') # https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
Building DataFrames
From CSV
df = pf.read_csv('path/to/file.csv') # create a dataframe from csv, index_col=0 to define the col to be the index
From Dictionary
adict = { 'name':['Fernando', 'Mariana', 'Kobe', Jordan], 'scores':[100, 100, 98, 200] } df = pd.DataFrame(adict) print(df) ''' name scores 0 Fernando 100 1 Mariana 100 2 Kobe 98 3 Jordan 200 '''
From Lists (and then Dictionary)
list1 = ['Gohan', 'Fernando', 'Goku'] list2 = [31, 25, 45] list3 = ['Half-Human','Human','Sayajin'] list_labels = ['name', 'age', 'type'] list_cols = [list1, list2, list3] # zip will return iterable tuples zipped = list(zip(list_labels, list_cols)) data_in_dict = dict(zipped) df = pd.DataFrame(data_in_dict) print(df) ''' age name type 0 31 Gohan Half-Human 1 25 Fernando Human 2 45 Goku Sayajin '''
Broadcasting
df.loc[::3,:] = np.nan # assigning scalar value to column slice BROADCASTS value to each row. In this case each 3 rows from beginning in the last column
df['attack'] = 'Final Flash' # will create a column in the data frame and assign the 'Final Flash' as value for each row print(df) ''' age name type attack 0 31 Gohan Half-Human Final Flash 1 25 Fernando Human Final Flash 2 45 Goku Sayajin Final Flash '''
Index and Columns
# renaming columns # the number of rows and columns must be the same df.columns = ['possible_age','first_name','kind','attack?'] print(df) ''' possible_age first_name kind attack? 0 31 Gohan Half-Human Final Flash 1 25 Fernando Human Final Flash 2 45 Goku Sayajin Final Flash '''
# changing index df.index = ['A','B','C'] print(df) ''' possible_age first_name kind attack? A 31 Gohan Half-Human Final Flash B 25 Fernando Human Final Flash C 45 Goku Sayajin Final Flash '''
df.index = df['some_column'] # changing the index column df.index.name = 'new_name' # change the name for the index
# reindexing # go to: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html
Importing DataFrames
df = pf.read_csv('path/to/file.csv') # extra important parameters for read_csv() # header=None # will create a 0 based column name, can be useful if no header is provided or it does not make sense # names = ['name1', 'name2', 'name3'] # na_values = [' '] # changes ' ' (spaces) to NaN in the data frame # na_values = { 'name1': [' '], 'name2':[' -1']} # can be a dictionary that uses the column name # parse_dates = [[0,1,2]] # will create a amalgamated (merged) column with, i.e. year_month_day # parse_dates = True # try to transform date strings in datetime objects, the format of the date is ISO 8601
Trimming/Extracting unnecessary or redundant columns
interesting_cols = ['col', 'col2'] df = df[interesting_cols]
Exporting DataFrames
df.to_csv('filename.csv') df.to_csv('filename.tsv', sep='\t') df.to_excel('filename.xlsx')
References:
https://pandas.pydata.org/pandas-docs/stable/
Data Camp – Pandas Foundations
What do you think?