Libraries: NumPy/Pandas

~60 min · 12 pages

What is Libraries: NumPy/Pandas?

Concept from package: Comprehensive Python programming course covering syntax, data structures, functions, and object-oriented programming.

Tags: python · data-analysis · data-science

NumPy and Pandas are foundational Python libraries for scientific computing and data analysis. NumPy provides high-performance multidimensional arrays and a rich set of mathematical functions that operate on them. Its core object is the ndarray, a dense n-dimensional array that supports vectorized operations and broadcasting, making numerical computation fast and expressive.

Pandas builds on NumPy to offer labeled data structures: Series, a one-dimensional labeled array, and DataFrame, a two-dimensional, size-mutable, potentially heterogeneous table with labeled rows and columns. This combination enables efficient data manipulation, cleaning, and analysis: selecting subsets with boolean masks, aligning data by labels, handling missing values, and performing group operations.

A typical workflow starts with loading data into a DataFrame from CSV or Excel, inspecting the first few rows with head(), checking data types with dtypes, and then applying vectorized computations or aggregations. Because many operations run in optimized C code under the hood, data tasks scale well. To write robust data pipelines, you'll learn how to index, slice, and transform data, how to handle missing values with methods like fillna and dropna, and how to merge and join datasets on common keys. As you practice, you'll see how NumPy arrays act as efficient numerical backends for Pandas computations, and how Pandas adds richer data semantics that mirror real-world data problems.
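Example: Inspecting a DataFrame

The loading-and-inspection workflow described above can be sketched as follows; the product and price columns are made-up sample data standing in for a real CSV (pd.read_csv would yield the same kind of DataFrame).

```python
import pandas as pd

# Sample data standing in for a CSV load
df = pd.DataFrame({
    'product': ['pen', 'book', 'lamp'],
    'price': [1.5, 12.0, 24.99],
})

print(df.head())          # first few rows
print(df.dtypes)          # data type of each column
print(df['price'].sum())  # a vectorized aggregation
```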

Which statement best describes the primary role of Pandas in data analysis?

Pandas provides low-level numerical kernels for fast matrix multiplication
Pandas offers labeled data structures (Series and DataFrame) for data manipulation
Pandas replaces NumPy as the core numerical engine
Pandas is used only for plotting data

Example: Creating a DataFrame from a dictionary

python
import pandas as pd

data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 32, 37],
    'city': ['New York', 'Paris', 'Berlin']
}

df = pd.DataFrame(data)
print(df)
# Output:
#       name  age      city
# 0    Alice   25  New York
# 1      Bob   32     Paris
# 2  Charlie   37    Berlin

What are the two primary Pandas data structures for tabular data?

Series and DataFrame used for 1D and 2D data respectively
Array and Matrix used for numeric data
List and Dict used for generic data
Panel and Tree used for hierarchical data

In Pandas, the function to read a CSV file is _____ and it returns a DataFrame.

Example: Basic DataFrame Operations

python
import pandas as pd

df = pd.DataFrame({'A':[1,2,3], 'B':[4,5,6]})
print(df.head())
df.info()  # prints schema info; returns None, so don't wrap it in print()
print('Summary stats:')
print(df.describe())

Which Pandas method would you use to summarize numerical columns by count, mean, std, min, and max?

df.groupby
df.describe
df.mean
df.aggregate

To select a column named _____ from a DataFrame df, you can use df['column_name'].

Example: Filtering with Boolean Masks

python
import pandas as pd

df = pd.DataFrame({'A':[10,20,30,40], 'B':[5,5,0,8]})
filtered = df[df['B'] > 4]
print(filtered)

Which operation in Pandas is commonly used to align two DataFrames on a key column?

merge (or join)
concat only
split
pivot

Example: Simple Merge

python
import pandas as pd

df1 = pd.DataFrame({'id':[1,2,3], 'val':[10,20,30]})
df2 = pd.DataFrame({'id':[1,2,3], 'name':['A','B','C']})
merged = df1.merge(df2, on='id')
print(merged)

What is a key advantage of using NumPy arrays over Python lists for numerical computations?

NumPy arrays require more memory
NumPy arrays support element-wise operations and broadcasting with optimized C code
NumPy arrays cannot store numeric types
NumPy arrays are immutable

NumPy focuses on efficient numerical computation with n-dimensional arrays. At its core is the ndarray, a homogeneous array in which every element has the same data type. This homogeneity allows contiguous memory storage and vectorized operations, which run much faster than Python loops. NumPy offers a broad set of universal functions (ufuncs) for elementwise operations like sin, cos, and arithmetic, plus aggregations like sum, mean, min, and max. Broadcasting allows arithmetic between arrays of different shapes, expanding smaller arrays to match larger ones without explicit looping.

When you load data from files or compute statistics, NumPy arrays often back Pandas computations, providing the numerical engine under the hood. NumPy also includes random number generation, linear algebra routines, and Fourier transforms, making it a versatile toolkit for data analysis, simulations, and numerical methods. In practice, you begin by creating arrays via np.array or np.arange, exploring their shape and dtype, and then applying vectorized operations that avoid Python-level loops. Indexing and slicing in NumPy mirror Python lists but add richer capabilities like boolean indexing and fancy indexing. Integration with Pandas allows smooth transitions between raw numerical matrices and labeled data frames, enabling scalable data pipelines.
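Example: Ufuncs and Aggregations

The ufuncs and aggregations mentioned above can be illustrated on a small array; the shape and values here are arbitrary.

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

# Ufuncs apply elementwise without Python loops
print(np.sin(arr))

# Aggregations can reduce the whole array or a single axis
print(arr.sum())         # 15
print(arr.mean(axis=0))  # column means: [1.5 2.5 3.5]
```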

How do you create a NumPy array from a Python list?

np.create([1,2,3])
np.array([1,2,3])
np.list([1,2,3])
np.new_array([1,2,3])

Example: Basic NumPy Array

python
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print('dtype:', arr.dtype)
print('shape:', arr.shape)

The NumPy function to generate evenly spaced values is _____, which is often used for creating ranges with a specific step.

Example: Boolean Indexing in NumPy

python
import numpy as np
arr = np.array([10, 15, 20, 25])
idx = arr > 15
print(arr[idx])

Which Pandas method is most suitable for reading CSV data into a DataFrame?

pd.read_excel
pd.read_csv
pd.read_sql
pd.read_json

Example: Read CSV (simulated with a string)

python
import pandas as pd
from io import StringIO
csv_data = StringIO("name,age\nAlice,28\nBob,34")
df = pd.read_csv(csv_data)
print(df)

Which operation would you use to combine two DataFrames on a common key column?

merge
split
reshape
slice

Example: GroupBy in Pandas

python
import pandas as pd

df = pd.DataFrame({'team':['A','A','B','B'], 'points':[10,12,8,9]})
group = df.groupby('team').sum()
print(group)

The method to insert a new column with a constant value in a DataFrame is _____, for example df['new'] = 0

NumPy and Pandas together enable powerful data pipelines. In practice you will load raw data into a DataFrame, inspect the schema, and perform data-cleaning steps: handling missing values with fillna or dropna, converting data types with astype, and parsing dates with to_datetime. Pandas provides robust indexing and selection capabilities: label-based loc for rows and columns, iloc for integer-position indexing, and at/iat for fast scalar access. You'll often normalize or transform columns, apply functions with apply or map, and create new features through vectorized operations that leverage NumPy under the hood.

When working with large datasets, it's common to optimize memory usage by downcasting numeric types or converting text columns to categorical dtypes. Pandas also supports time-series operations like resample and asfreq to aggregate data at different frequencies, which is essential for financial data and sensor streams. The synergy between NumPy and Pandas is foundational for data-science workflows: NumPy handles raw numerical operations efficiently, while Pandas provides the labeled structure and rich methods for data wrangling. Building familiarity with indexing, grouping, and merging will empower you to perform complex analyses with concise, readable code.
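Example: loc, iloc, and at

The three indexing styles mentioned above (label-based loc, position-based iloc, and scalar at) can be contrasted on a tiny frame; the column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'price': [10.0, 20.0, 30.0],
                   'qty':   [1, 2, 3]},
                  index=['a', 'b', 'c'])

print(df.loc['b', 'price'])  # label-based: 20.0
print(df.iloc[0, 1])         # integer-position based: 1
print(df.at['c', 'qty'])     # fast scalar access: 3

# A new feature built with a vectorized operation
df['total'] = df['price'] * df['qty']
print(df)
```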

Which Pandas function converts a column to datetime type, enabling time-based operations?

pd.to_datetime
pd.as_datetime
pd.dt_to_datetime
pd.convert_datetime

Example: Handling Missing Values

python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A':[1, None, 3], 'B':[4, 5, np.nan]})
print('Original')
print(df)
print('\nCleaned: fill with mean')
df['A'] = df['A'].fillna(df['A'].mean())
df['B'] = df['B'].fillna(df['B'].mean())
print(df)

What does the Pandas method dropna() do by default?

Removes all rows with any missing values
Removes all columns with missing values
Fills missing values with zeros
Keeps only rows where all values are non-null

Example: Merging DataFrames on Index

python
import pandas as pd

df1 = pd.DataFrame({'val':[1,2]}, index=['a','b'])
df2 = pd.DataFrame({'name':['Alpha','Beta']}, index=['a','b'])
merged = df1.join(df2)
print(merged)

The Pandas function to read Excel files is _____, though it may require an engine like openpyxl.

NumPy supports linear algebra through the linalg submodule, which includes matrix operations such as solving systems of equations, computing eigenvalues, and performing decompositions. Knowledge of matrix algebra is essential for many algorithms in machine learning and data analysis, and NumPy arrays serve as the numerical backbone for many SciPy routines as well. Practically, you'll often construct small numerical problems to test ideas: generate matrices with random values using numpy.random, perform matrix multiplication with dot or the @ operator, and compute determinants or inverses when needed. Understanding shapes and axes is critical as you move beyond 2D into higher dimensions, and broadcasting lets you apply operations between arrays of different shapes, simplifying code and avoiding explicit loops. Pandas, in turn, makes table-like data manipulation ergonomic; you can extract slices, filter rows, group by columns, and apply functions across whole data frames. These skills together enable you to clean, transform, and analyze real-world datasets efficiently, whether you're engineering features for a model or performing exploratory data analysis.
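Example: Solving a Linear System

As a small sketch of the linalg submodule described above, the system 3x + y = 9, x + 2y = 8 can be solved and then verified with the @ operator:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)  # solves A @ x == b
print(x)                   # [2. 3.]

# Verify the solution by multiplying back
print(np.allclose(A @ x, b))  # True
```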

Which NumPy function creates an array of random numbers from a standard normal distribution?

np.random.normal
np.random.rand
np.random.randint
np.random.randn

Example: Matrix Multiplication

python
import numpy as np
A = np.array([[1,2],[3,4]])
B = np.array([[5,6],[7,8]])
C = A @ B  # or A.dot(B)
print(C)

In NumPy, the operation to perform element-wise multiplication between arrays A and B is _____

Example: Broadcasting

python
import numpy as np
arr = np.array([[1,2,3],[4,5,6]])
vec = np.array([1,0,1])
print(arr * vec)  # broadcasts vec across rows

What is broadcasting in NumPy?

A way to expand smaller arrays to match larger ones for arithmetic operations
A method to broadcast signals in real-time
A type of data serialization
A debugging technique

Example: Array Slicing

python
import numpy as np
arr = np.arange(12).reshape(3,4)
print(arr[1:, :2])

Which Pandas method returns the first n rows of a DataFrame?

tail
head
sample
take

Example: DataFrame Slicing with loc

python
import pandas as pd

df = pd.DataFrame({'A':[10,20,30], 'B':[100,200,300]}, index=['x','y','z'])
print(df.loc['y':'z'])

To get a boolean mask selecting rows where A > 15, you would write df[_____]

Pandas provides robust date and time support, which is essential for time-series analysis. You can parse strings into datetime objects with to_datetime, enabling time-based indexing, resampling, and rolling windows. Date handling is a common source of bugs, but Pandas offers flexible parsing, timezone localization, and frequency-based resampling. Also remember that missing-data handling can differ across columns and dtypes, so you often coerce types, fill missing values, or forward-fill sequences in a way that preserves the integrity of your analysis.

When dealing with heterogeneous datasets, you'll frequently convert columns to appropriate dtypes (int, float, object, category) to optimize memory or enable specific operations. Performance considerations matter here: vectorized operations on NumPy-backed arrays are generally faster than Python loops, and using categorical types for text data can dramatically reduce memory usage and speed up operations like groupby. Building a dataset-aware workflow means combining these techniques: clean, type-cast, and time-align data, then apply analytical transformations that reveal trends and insights.
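Example: Parsing Dates and Rolling Windows

As a sketch of the date handling described above, strings can be parsed with to_datetime and then used as a time index; the dates and values are arbitrary sample data.

```python
import pandas as pd

# Parse date strings into a datetime Series
dates = pd.to_datetime(pd.Series(['2024-01-01', '2024-01-02', '2024-01-03']))
print(dates.dt.day_name())  # Monday, Tuesday, Wednesday

# A time-indexed Series supports rolling windows
ts = pd.Series([1.0, 2.0, 3.0], index=dates)
print(ts.rolling(2).mean())  # NaN, then pairwise means 1.5 and 2.5
```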

Which Pandas method converts a column to datetime type?

pd.to_datetime
pd.to_date
pd.convert_datetime
pd.parse_datetime

Example: Resampling Time Series

python
import pandas as pd

dates = pd.date_range('20240101', periods=6)
df = pd.DataFrame({'value':[1,2,3,4,5,6]}, index=dates)
print(df.resample('2D').sum())  # aggregate the daily values into 2-day bins

To group data by a categorical column and compute the mean, you would use df.groupby('category')._____()

Example: Joining DataFrames on Multiple Keys

python
import pandas as pd

df1 = pd.DataFrame({'id':[1,1,2,2], 'val':[10,20,30,40]})
df2 = pd.DataFrame({'id':[1,2], 'name':['Alice','Bob']})
merged = df1.merge(df2, on='id', how='left')
print(merged)

What does the 'how' parameter in DataFrame.merge control?

The type of join (left, right, inner, outer)
The sorting order of keys
The kind of comparison used for equality
Whether to ignore missing values

Example: Creating a Categorical Column

python
import pandas as pd

df = pd.DataFrame({'city':['NY','LA','NY','LA','SF']})
df['city'] = df['city'].astype('category')
print(df.dtypes)

The Pandas option to reduce memory usage by converting text data to categories is _____

Example: Handling Missing Data with fillna and dropna

python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A':[1, None, 3], 'B':[np.nan, 2, 3]})
filled = df.fillna({'A': 0, 'B': df['B'].mean()})
dropped = df.dropna()
print('Filled:')
print(filled)
print('\nDropped:')
print(dropped)

Which Pandas method would you use to replace missing values with a specified value?

fillna
dropna
replace
interpolate

Example: Interacting with NumPy Arrays in Pandas

python
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': np.arange(5), 'y': np.random.randn(5)})
df['z'] = df['x'] * 2 + df['y']
print(df)

Which statement best describes the relationship between NumPy and Pandas?

NumPy provides the numerical engine; Pandas offers labeled data structures built on NumPy.
Pandas is independent of NumPy and does not rely on it
NumPy is a plotting library that complements Pandas
Pandas replaces NumPy for numerical computations

This page continues exploring practical workflows with NumPy and Pandas. You will build a small data analysis script that reads a CSV, cleans missing values, converts date columns, and computes a few summary statistics. Start by loading a dataset, then inspect its columns with df.columns and df.info(). Next, identify numeric columns and compute basic statistics with df.describe(). Use boolean indexing to filter rows that meet a condition and assign results to a new DataFrame. Group data by a category to compute aggregate metrics like mean or sum. Finally, save the processed data to a new CSV for downstream steps. As you implement, keep in mind the efficiency considerations: vectorized operations with NumPy, avoiding Python loops, and leveraging the powerful, expressive capabilities of Pandas for labeling, indexing, and joining. The combination of these tools often yields concise code that is easy to read and will scale well as your data grows.

Which Pandas method would you use to save a DataFrame to a CSV file?

df.export_csv
df.to_csv
pd.write_csv
df.save_csv

Example: End-to-End Mini-Analysis

python
import pandas as pd
import numpy as np

# Create sample data
df = pd.DataFrame({"date": pd.date_range('2024-01-01', periods=6),
                   "category": ["A","A","B","B","A","B"],
                   "value": [1.2, 3.4, None, 2.2, 4.5, 3.3]})

# Clean data
df['value'] = df['value'].fillna(df['value'].mean())
df['date'] = pd.to_datetime(df['date'])

# Analysis
summary = df.groupby('category').agg(mean_value=('value','mean'), count=('value','count'))
print(summary)

# Output to CSV
summary.to_csv('summary.csv')

What is a common first step in cleaning a dataset before analysis?

Compute means of all columns
Handle missing values (fill or drop)
Plot all columns
Sort by index

Example: Using apply for elementwise transformation

python
import pandas as pd

df = pd.DataFrame({'val':[1, -2, 3, -4]})
df['abs'] = df['val'].apply(abs)
print(df)

To select rows where 'value' is greater than 10, use df[df['value'] > 10], which is a form of _____ indexing.
