Extract data from DataFrame in Python
The article is maintained by the team at commabot.
Extracting data from a DataFrame in Python is useful for data analysis, manipulation, and visualization. This guide will walk you through the basics of extracting data from DataFrames, including selecting columns, filtering rows, and advanced techniques like conditional selections and data aggregation.
Prerequisites
- The pandas library. If you haven't installed it yet, you can do so by running `pip install pandas` in your terminal or command prompt.
Creating a DataFrame
Before extracting data, let's create a simple DataFrame to work with:
import pandas as pd
# Sample data
data = {
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 34, 29, 32],
'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
Selecting Columns
Single Column: To extract a single column, use the column label in square brackets.
names = df['Name']
Multiple Columns: To extract multiple columns, use a list of column labels.
subset = df[['Name', 'City']]
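The two forms return different types, which matters for later steps. The following is a minimal sketch that rebuilds the sample DataFrame so it runs on its own; the name `one_col_df` is introduced here for illustration only.

```python
import pandas as pd

# Same sample data as above, repeated so the snippet is self-contained
df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
})

names = df['Name']             # a single label returns a pandas Series
subset = df[['Name', 'City']]  # a list of labels returns a DataFrame
one_col_df = df[['Name']]      # a one-element list still returns a DataFrame

print(type(names).__name__)       # Series
print(type(one_col_df).__name__)  # DataFrame
print(subset.shape)               # (4, 2)
```

If a later step expects a DataFrame (for example, to keep the column header when exporting), the list form is usually the safer choice.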
Filtering Rows
Based on Conditions: Use conditions inside the square brackets to filter rows.
under_30 = df[df['Age'] < 30]
Using `loc` and `iloc`: For more advanced row selection based on index labels (`loc`) or integer-location based indexing (`iloc`).
# Select the rows at integer positions 0 and 2
selected_rows = df.iloc[[0, 2]]
# Select rows where Name is 'Peter' (loc also accepts a boolean mask)
peter_row = df.loc[df['Name'] == 'Peter']
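With the default index (0, 1, 2, ...), labels and positions happen to coincide, so the difference between the two accessors is easy to miss. The sketch below assumes a hypothetical step that is not part of the original example: setting Name as the index so that labels and positions differ.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
})

# Hypothetical step for illustration: index the rows by name
by_name = df.set_index('Name')

anna_by_label = by_name.loc['Anna']   # row whose index label is 'Anna'
anna_by_position = by_name.iloc[1]    # second row, counted from 0

# Label slices include the end label; position slices exclude the end position
label_slice = by_name.loc['Anna':'Peter']  # Anna and Peter
position_slice = by_name.iloc[1:2]         # Anna only

# loc can filter rows and pick columns in a single call
names_and_cities_under_30 = df.loc[df['Age'] < 30, ['Name', 'City']]
```

The inclusive end of a `loc` slice is a common source of off-by-one surprises when switching between the two accessors.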
Conditional Selections
You can use logical operators to perform conditional selections:
# Select people aged below 30 and living in Berlin
young_in_berlin = df[(df['Age'] < 30) & (df['City'] == 'Berlin')]
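Besides `&` (and), pandas masks can be combined with `|` (or) and negated with `~` (not). Here is a short sketch on the same sample data; the variable names are illustrative only. Each comparison needs its own parentheses because `&` and `|` bind more tightly than `<` or `==`, and Python's plain `and`/`or` keywords raise an error on a Series.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
})

# OR: under 30, or living in Paris
under_30_or_paris = df[(df['Age'] < 30) | (df['City'] == 'Paris')]

# NOT: everyone who does not live in Berlin
not_berlin = df[~(df['City'] == 'Berlin')]

# isin: membership test against several values at once
in_paris_or_london = df[df['City'].isin(['Paris', 'London'])]

print(under_30_or_paris['Name'].tolist())   # ['John', 'Anna', 'Peter']
print(not_berlin['Name'].tolist())          # ['John', 'Anna', 'Linda']
print(in_paris_or_london['Name'].tolist())  # ['Anna', 'Linda']
```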
Extracting Specific Data Points
Using `at` and `iat`: For extracting single data points using a label (`at`) or integer location (`iat`).
# Using `at` to get age of John
john_age = df.at[0, 'Age']
# Using `iat` for the same purpose
john_age_iat = df.iat[0, 1]
Data Aggregation
Pandas provides methods like `groupby`, `sum`, `mean`, etc., for aggregating data based on some criteria.
# Average age by city
average_age_by_city = df.groupby('City')['Age'].mean()
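In the sample data every city appears only once, so each group mean is simply that person's age. To make the aggregation visible, the sketch below assumes two extra, invented rows (Maria and Tom) that are not part of the original data, and computes several statistics per city with `agg`.

```python
import pandas as pd

# Hypothetical data: the four sample rows plus two invented people,
# so that Paris and Berlin each contain more than one row
df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Maria', 'Tom'],
    'Age': [28, 34, 29, 32, 40, 22],
    'City': ['New York', 'Paris', 'Berlin', 'London', 'Paris', 'Berlin'],
})

# Whole-column aggregates
total_age = df['Age'].sum()
mean_age = df['Age'].mean()

# Several aggregates per group; the result is a DataFrame indexed by City
age_stats_by_city = df.groupby('City')['Age'].agg(['mean', 'min', 'max', 'count'])
print(age_stats_by_city)
```

Passing a list to `agg` yields one column per statistic, which is often easier to read than running `groupby` once for each measure.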
Advanced Data Extraction
- Using the `query` method: For filtering rows using a query string.
adults_in_london = df.query("Age >= 18 and City == 'London'")
- Using `pivot_table` for data summarization: To create a pivot table that summarizes data.
pivot = df.pivot_table(values='Age', index='City', aggfunc='mean')
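Both techniques can be tried on the sample data as follows. This is a sketch rather than a definitive recipe: `min_age` and `cities` are hypothetical variables introduced for the example, and the `@` prefix is how a `query` string refers to Python variables from the surrounding scope.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
})

# Hypothetical threshold and city list, used only in this example
min_age = 30
cities = ['Paris', 'Berlin']

# @name pulls in a Python variable instead of a column
aged_30_plus = df.query("Age >= @min_age")
in_cities = df.query("City in @cities")

# pivot_table returns a DataFrame indexed by City with a single 'Age' column
pivot = df.pivot_table(values='Age', index='City', aggfunc='mean')
berlin_mean = pivot.loc['Berlin', 'Age']

print(aged_30_plus['Name'].tolist())  # ['Anna', 'Linda']
print(in_cities['Name'].tolist())     # ['Anna', 'Peter']
```

Keeping thresholds in variables avoids building query strings by hand, which is easy to get wrong when values contain quotes.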
Extracting data from DataFrames is a core pandas skill that lets you select, filter, and aggregate data efficiently. Practice these techniques with different datasets to become proficient in data manipulation and analysis.