Python pandas Quick Reference¶
DataFrame¶
Basics¶
import pandas as pd
# Initialization
df = pd.DataFrame(columns=['A', 'B', 'C'])
# sort by column
col_name = 'A'
df = df.sort_values(col_name)
Adding new row¶
TL;DR: Do NOT grow your DataFrame in loops.
Grow the data in a list, then create the DataFrame with the finalized list.
ref: https://stackoverflow.com/questions/13784192/creating-an-empty-pandas-dataframe-and-then-filling-it
Access¶
- Access the rows by integer position: df.iloc[r]
- Access the rows and columns by integer position: df.iloc[r, c]
- The input r and c can be:
- scalar integer: df.iloc[0]
- list of integer: df.iloc[ [0, 1, 2] ]
- slice object: df.iloc[:3]
- boolean array of the same length as the index: df.iloc[ [True, False, False] ]
- callable that takes the DataFrame as input and returns one of the above valid input type
Misc¶
Find the rows that contains a specific string pattern pattern
in the column MyColumnName
selected_rows = df['MyColumnName'].str.contains('pattern', na=False)
print( df.loc[selected_rows] )
Use multiple conditions to filter the rows
selected_rows = df['MyColumnName'].str.contains('pattern1', na=False)
selected_rows &= df['MyColumnName'].str.contains('pattern2', na=False)
selected_rows &= ~df['MyColumnName'].str.contains('pattern3', na=False)
print( df.loc[selected_rows] )
CSV¶
"Merge" two data CSV files that
- the first line is the column header
- at least 1 common column exists in both CSV files
import pandas as pd
a = pd.read_csv('file1.csv')
b = pd.read_csv('file2.csv')
a = a.merge(b, on="CommonColumn", how="outer")
a.to_csv('output.csv', index=False)
- Notes:
- The parameter
how
determines how the merge is performed. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html
- The parameter
"Merge" two data CSV files that
- the first line is the column header
- the column headers are the same
- the rows are from different data sets
import pandas as pd
merged_data = pd.DataFrame()
a = pd.read_csv('file1.csv')
b = pd.read_csv('file2.csv')
# Introduce a new column 'DataSource', if necessary, to indicate the
# source of the entry
a['DataSource'] = 'file1'
b['DataSource'] = 'file2'
merged_data = merged_data.append(a, ignore_index=True, sort=False)
merged_data = merged_data.append(b, ignore_index=True, sort=False)
merged_data.to_csv('output.csv', index=False)