Philip Vishnevsky¶
- Pandas
- Powerful data manipulation tool. Built for speed and efficiency.
- Lets you handle data with columns and rows.
- NumPy
- Extremely fast mathematical library. Most scientific computing python packages are built ontop of NumPy
- Super fast N-Dimensional Arrays
- Matplotlib
- Graphical visualizations of array data, either in pandas, numpy, or list format.
- Third party package seaborn offers more appealing API and color schemes.
In this notebook, we'll install all of the packages at once, then run through NumPy, Pandas, and Matplotlib sections. You'll need to fill in some code, marked by #TODO.
Installation¶
Run the following cell to install all four packages at once. If this does not work on your system, you can manually use the command line to install these packages.
!pip install numpy pandas matplotlib seaborn
Requirement already satisfied: numpy in /opt/anaconda3/lib/python3.12/site-packages (1.26.4) Requirement already satisfied: pandas in /opt/anaconda3/lib/python3.12/site-packages (2.2.2) Requirement already satisfied: matplotlib in /opt/anaconda3/lib/python3.12/site-packages (3.8.4) Requirement already satisfied: seaborn in /opt/anaconda3/lib/python3.12/site-packages (0.13.2) Requirement already satisfied: python-dateutil>=2.8.2 in /opt/anaconda3/lib/python3.12/site-packages (from pandas) (2.9.0.post0) Requirement already satisfied: pytz>=2020.1 in /opt/anaconda3/lib/python3.12/site-packages (from pandas) (2024.1) Requirement already satisfied: tzdata>=2022.7 in /opt/anaconda3/lib/python3.12/site-packages (from pandas) (2023.3) Requirement already satisfied: contourpy>=1.0.1 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (1.2.0) Requirement already satisfied: cycler>=0.10 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (0.11.0) Requirement already satisfied: fonttools>=4.22.0 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (4.51.0) Requirement already satisfied: kiwisolver>=1.3.1 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (1.4.4) Requirement already satisfied: packaging>=20.0 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (23.2) Requirement already satisfied: pillow>=8 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (10.3.0) Requirement already satisfied: pyparsing>=2.3.1 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (3.0.9) Requirement already satisfied: six>=1.5 in /opt/anaconda3/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
NumPy¶
In this section, we'll work wtih NumPy arrays and understand how how we can work with this data structure
import numpy as np
N-Dimensional Arrays¶
Run the two cells below. Observe how the first cell is a plain python list. The output of the second cell is a np.array with the same values. There is no "shape" to the plain python list. However, we can call .shape on any np.array to get the shape of that array. (Remember, arrays are N-Dimensional!)
[1,2,3]
[1, 2, 3]
arr1 = np.array([1,2,3])
arr1
array([1, 2, 3])
arr1.shape
(3,)
Notice how a 1 dimensional array's shape has only one length followed by a trailing comma. This tells us that the first dimension's length is 3. Why is it not (3,1)? In the case of an array with shape (3,1), this would tell us it's a 2-dimensional array with the first dimension's length is 3 and the second's is 1. This may seem like the same kind of array, but it is distinctly different -- especially when we get to Pandas!
arr2 = np.array([
[1,2,3],
[4,5,6]
])
arr2
array([[1, 2, 3],
[4, 5, 6]])
#TODO: Write a line of code below to output the shape of the arr2 array. Expected output is (2,3)
arr2.shape
(2, 3)
#TODO: Write some code to create an np.array that has the shape (3,3).
arr3 = np.array([
[1,2,3],
[4,5,6],
[7,8,9]
])
arr3.shape
(3, 3)
Utility properties¶
Using arr1 and arr2 from the previous cells, let's explore some utility properties that numpy offers.
# Finding the number of dimensions
arr1.ndim
1
arr2.ndim
2
Notice that the .ndim is the same if we took the length of shape:
len(arr1.shape)
1
len(arr2.shape)
2
# Finding what kind of number is stored
arr1.dtype
dtype('int64')
Notice that this means we can't have floats and ints in the same array. If we try to do that, our ints will be converted into floats.
np.array([1.4, 1])
array([1.4, 1. ])
Pandas¶
import pandas as pd
Most of Pandas' strength comes from DataFrames, which are like 2 dimensional arrays. There is also Pandas Series, which are like 1-dimensional arrays. Unlike NumPy, they can store objects (such as strings)
animals = pd.Series(['dog','cat','frog','bee'])
animals
0 dog 1 cat 2 frog 3 bee dtype: object
colors = pd.Series(['brown', 'black', 'green', 'yellow'])
colors
0 brown 1 black 2 green 3 yellow dtype: object
animal_df = pd.DataFrame({
"animal": animals,
"color": colors
})
animal_df
| animal | color | |
|---|---|---|
| 0 | dog | brown |
| 1 | cat | black |
| 2 | frog | green |
| 3 | bee | yellow |
#TODO: Create your own dataframe! It can be anything, as long as it has at least 2 columns and 3 rows.
shapes_df = pd.DataFrame({
"parallelograms": ['square','rectangle','rhombus'],
"gons": ['hexagon','octagon','decagon'],
"3d_shapes": ['sphere','cube','cone'],
})
shapes_df
| parallelograms | gons | 3d_shapes | |
|---|---|---|---|
| 0 | square | hexagon | sphere |
| 1 | rectangle | octagon | cube |
| 2 | rhombus | decagon | cone |
Working with data!¶
In this section, we'll be using the two csv files from canvas:
- honeyproduction.csv
- honeyproduction_withnulls.csv
Download the files from canvas and place them in the same folder as this notebook.
honey_df = pd.read_csv("data/honeyproduction.csv")
Now let's take a look at the first five rows. We can use .head() to display the "head"
honey_df.head()
| state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year | |
|---|---|---|---|---|---|---|---|---|
| 0 | AL | 16000.0 | 71 | 1136000.0 | 159000.0 | 0.72 | 818000.0 | 1998 |
| 1 | AZ | 55000.0 | 60 | 3300000.0 | 1485000.0 | 0.64 | 2112000.0 | 1998 |
| 2 | AR | 53000.0 | 65 | 3445000.0 | 1688000.0 | 0.59 | 2033000.0 | 1998 |
| 3 | CA | 450000.0 | 83 | 37350000.0 | 12326000.0 | 0.62 | 23157000.0 | 1998 |
| 4 | CO | 27000.0 | 72 | 1944000.0 | 1594000.0 | 0.70 | 1361000.0 | 1998 |
DataFrame utility properties!¶
# get the data type for each column.
honey_df.dtypes
state object numcol float64 yieldpercol int64 totalprod float64 stocks float64 priceperlb float64 prodvalue float64 year int64 dtype: object
Notice the similarity in how these are presented compared to NumPy dtypes!
# get a list-like object (Index) of the columns. This is an iterable object.
honey_df.columns
Index(['state', 'numcol', 'yieldpercol', 'totalprod', 'stocks', 'priceperlb',
'prodvalue', 'year'],
dtype='object')
Now, try to get the shape of the DataFrame. Think of how shapes work in NumPy.
#TODO get shape of honey_df. Expected output: (626, 8)
honey_df.shape
(626, 8)
We can also use .describe() to see some statistical information on the numerical columns.
honey_df.describe()
| numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year | |
|---|---|---|---|---|---|---|---|
| count | 626.000000 | 626.000000 | 6.260000e+02 | 6.260000e+02 | 626.000000 | 6.260000e+02 | 626.000000 |
| mean | 60284.345048 | 62.009585 | 4.169086e+06 | 1.318859e+06 | 1.409569 | 4.715741e+06 | 2004.864217 |
| std | 91077.087231 | 19.458754 | 6.883847e+06 | 2.272964e+06 | 0.638599 | 7.976110e+06 | 4.317306 |
| min | 2000.000000 | 19.000000 | 8.400000e+04 | 8.000000e+03 | 0.490000 | 1.620000e+05 | 1998.000000 |
| 25% | 9000.000000 | 48.000000 | 4.750000e+05 | 1.430000e+05 | 0.932500 | 7.592500e+05 | 2001.000000 |
| 50% | 26000.000000 | 60.000000 | 1.533000e+06 | 4.395000e+05 | 1.360000 | 1.841500e+06 | 2005.000000 |
| 75% | 63750.000000 | 74.000000 | 4.175250e+06 | 1.489500e+06 | 1.680000 | 4.703250e+06 | 2009.000000 |
| max | 510000.000000 | 136.000000 | 4.641000e+07 | 1.380000e+07 | 4.150000 | 6.961500e+07 | 2012.000000 |
Slicing DataFrames¶
Let's practice slicing the DataFrame to precicely access certain parts of the data.
# get the column with state information
honey_df['state']
0 AL
1 AZ
2 AR
3 CA
4 CO
..
621 VA
622 WA
623 WV
624 WI
625 WY
Name: state, Length: 626, dtype: object
# get the columns with state, priceperlb, and year information
honey_df[['state', 'priceperlb', 'year']]
| state | priceperlb | year | |
|---|---|---|---|
| 0 | AL | 0.72 | 1998 |
| 1 | AZ | 0.64 | 1998 |
| 2 | AR | 0.59 | 1998 |
| 3 | CA | 0.62 | 1998 |
| 4 | CO | 0.70 | 1998 |
| ... | ... | ... | ... |
| 621 | VA | 3.77 | 2012 |
| 622 | WA | 2.38 | 2012 |
| 623 | WV | 2.91 | 2012 |
| 624 | WI | 2.05 | 2012 |
| 625 | WY | 1.87 | 2012 |
626 rows × 3 columns
CHECKPOINT!
Notice how the syntax is different for accessing one column compared to multiple. Also note how when accessing one column, we get a Series object. Contrast this to multiple columns, where we get a DataFrame that is a subset of honey_df
But what about rows?
# let's take a look at our DataFrame for reference.
honey_df.head()
| state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year | |
|---|---|---|---|---|---|---|---|---|
| 0 | AL | 16000.0 | 71 | 1136000.0 | 159000.0 | 0.72 | 818000.0 | 1998 |
| 1 | AZ | 55000.0 | 60 | 3300000.0 | 1485000.0 | 0.64 | 2112000.0 | 1998 |
| 2 | AR | 53000.0 | 65 | 3445000.0 | 1688000.0 | 0.59 | 2033000.0 | 1998 |
| 3 | CA | 450000.0 | 83 | 37350000.0 | 12326000.0 | 0.62 | 23157000.0 | 1998 |
| 4 | CO | 27000.0 | 72 | 1944000.0 | 1594000.0 | 0.70 | 1361000.0 | 1998 |
# if we want to LOCate the row with the label 0:
honey_df.loc[0]
state AL numcol 16000.0 yieldpercol 71 totalprod 1136000.0 stocks 159000.0 priceperlb 0.72 prodvalue 818000.0 year 1998 Name: 0, dtype: object
# if we want to locate the row based on position (not necessarily the labeled index)
honey_df.iloc[2]
state AR numcol 53000.0 yieldpercol 65 totalprod 3445000.0 stocks 1688000.0 priceperlb 0.59 prodvalue 2033000.0 year 1998 Name: 2, dtype: object
# we can do ranges as well. they work like normal python slicing syntax
honey_df.iloc[1:4]
| state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year | |
|---|---|---|---|---|---|---|---|---|
| 1 | AZ | 55000.0 | 60 | 3300000.0 | 1485000.0 | 0.64 | 2112000.0 | 1998 |
| 2 | AR | 53000.0 | 65 | 3445000.0 | 1688000.0 | 0.59 | 2033000.0 | 1998 |
| 3 | CA | 450000.0 | 83 | 37350000.0 | 12326000.0 | 0.62 | 23157000.0 | 1998 |
# just to prove the point, we can even get the first few even rows:
honey_df.iloc[0:11:2]
| state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year | |
|---|---|---|---|---|---|---|---|---|
| 0 | AL | 16000.0 | 71 | 1136000.0 | 159000.0 | 0.72 | 818000.0 | 1998 |
| 2 | AR | 53000.0 | 65 | 3445000.0 | 1688000.0 | 0.59 | 2033000.0 | 1998 |
| 4 | CO | 27000.0 | 72 | 1944000.0 | 1594000.0 | 0.70 | 1361000.0 | 1998 |
| 6 | GA | 75000.0 | 56 | 4200000.0 | 307000.0 | 0.69 | 2898000.0 | 1998 |
| 8 | ID | 120000.0 | 50 | 6000000.0 | 2220000.0 | 0.65 | 3900000.0 | 1998 |
| 10 | IN | 9000.0 | 92 | 828000.0 | 489000.0 | 0.85 | 704000.0 | 1998 |
Filtering DataFrames with conditionals¶
# filter the honey_df DataFrame so that we have all rows from NC
honey_df[honey_df['state'] == 'NC']
| state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year | |
|---|---|---|---|---|---|---|---|---|
| 27 | NC | 8000.0 | 59 | 472000.0 | 151000.0 | 1.38 | 651000.0 | 1998 |
| 70 | NC | 9000.0 | 46 | 414000.0 | 104000.0 | 1.62 | 671000.0 | 1999 |
| 113 | NC | 11000.0 | 49 | 539000.0 | 243000.0 | 1.43 | 771000.0 | 2000 |
| 156 | NC | 13000.0 | 44 | 572000.0 | 172000.0 | 1.48 | 847000.0 | 2001 |
| 200 | NC | 16000.0 | 42 | 672000.0 | 74000.0 | 1.41 | 948000.0 | 2002 |
| 244 | NC | 10000.0 | 44 | 440000.0 | 79000.0 | 1.92 | 845000.0 | 2003 |
| 287 | NC | 9000.0 | 40 | 360000.0 | 72000.0 | 1.93 | 695000.0 | 2004 |
| 328 | NC | 10000.0 | 54 | 540000.0 | 146000.0 | 1.88 | 1015000.0 | 2005 |
| 369 | NC | 10000.0 | 50 | 500000.0 | 215000.0 | 1.57 | 785000.0 | 2006 |
| 410 | NC | 12000.0 | 45 | 540000.0 | 76000.0 | 2.49 | 1345000.0 | 2007 |
| 451 | NC | 12000.0 | 52 | 624000.0 | 137000.0 | 2.18 | 1360000.0 | 2008 |
| 491 | NC | 11000.0 | 45 | 495000.0 | 84000.0 | 2.57 | 1272000.0 | 2009 |
| 531 | NC | 13000.0 | 46 | 598000.0 | 144000.0 | 2.66 | 1591000.0 | 2010 |
| 571 | NC | 14000.0 | 62 | 868000.0 | 95000.0 | 2.83 | 2456000.0 | 2011 |
| 611 | NC | 13000.0 | 39 | 507000.0 | 106000.0 | 3.76 | 1906000.0 | 2012 |
This syntax is confusing, but let's break it down:
Here's the column of honey_df that has state data
honey_df['state']
0 AL
1 AZ
2 AR
3 CA
4 CO
..
621 VA
622 WA
623 WV
624 WI
625 WY
Name: state, Length: 626, dtype: object
We can use a conditional to return a series of boolean values, which indicate whether that conditional was evaluated as true or false for a given row.
honey_df['state'] == 'NC'
0 False
1 False
2 False
3 False
4 False
...
621 False
622 False
623 False
624 False
625 False
Name: state, Length: 626, dtype: bool
Bringing it together, we get the rows of honey_df where our conditional was true!
honey_df[honey_df['state'] == 'NC']
| state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year | |
|---|---|---|---|---|---|---|---|---|
| 27 | NC | 8000.0 | 59 | 472000.0 | 151000.0 | 1.38 | 651000.0 | 1998 |
| 70 | NC | 9000.0 | 46 | 414000.0 | 104000.0 | 1.62 | 671000.0 | 1999 |
| 113 | NC | 11000.0 | 49 | 539000.0 | 243000.0 | 1.43 | 771000.0 | 2000 |
| 156 | NC | 13000.0 | 44 | 572000.0 | 172000.0 | 1.48 | 847000.0 | 2001 |
| 200 | NC | 16000.0 | 42 | 672000.0 | 74000.0 | 1.41 | 948000.0 | 2002 |
| 244 | NC | 10000.0 | 44 | 440000.0 | 79000.0 | 1.92 | 845000.0 | 2003 |
| 287 | NC | 9000.0 | 40 | 360000.0 | 72000.0 | 1.93 | 695000.0 | 2004 |
| 328 | NC | 10000.0 | 54 | 540000.0 | 146000.0 | 1.88 | 1015000.0 | 2005 |
| 369 | NC | 10000.0 | 50 | 500000.0 | 215000.0 | 1.57 | 785000.0 | 2006 |
| 410 | NC | 12000.0 | 45 | 540000.0 | 76000.0 | 2.49 | 1345000.0 | 2007 |
| 451 | NC | 12000.0 | 52 | 624000.0 | 137000.0 | 2.18 | 1360000.0 | 2008 |
| 491 | NC | 11000.0 | 45 | 495000.0 | 84000.0 | 2.57 | 1272000.0 | 2009 |
| 531 | NC | 13000.0 | 46 | 598000.0 | 144000.0 | 2.66 | 1591000.0 | 2010 |
| 571 | NC | 14000.0 | 62 | 868000.0 | 95000.0 | 2.83 | 2456000.0 | 2011 |
| 611 | NC | 13000.0 | 39 | 507000.0 | 106000.0 | 3.76 | 1906000.0 | 2012 |
#TODO write your own conditional to filter a subset of the DataFrame.
#It can be anything EXCEPT filtering by state. Hint: Try year!
honey_df[honey_df['year'] == 2004]
| state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year | |
|---|---|---|---|---|---|---|---|---|
| 261 | AL | 12000.0 | 87 | 1044000.0 | 282000.0 | 1.41 | 1472000.0 | 2004 |
| 262 | AZ | 32000.0 | 55 | 1760000.0 | 774000.0 | 1.11 | 1954000.0 | 2004 |
| 263 | AR | 40000.0 | 57 | 2280000.0 | 388000.0 | 0.87 | 1984000.0 | 2004 |
| 264 | CA | 390000.0 | 45 | 17550000.0 | 5792000.0 | 1.05 | 18428000.0 | 2004 |
| 265 | CO | 23000.0 | 80 | 1840000.0 | 791000.0 | 1.35 | 2484000.0 | 2004 |
| 266 | FL | 205000.0 | 98 | 20090000.0 | 2009000.0 | 1.02 | 20492000.0 | 2004 |
| 267 | GA | 63000.0 | 49 | 3087000.0 | 648000.0 | 1.20 | 3704000.0 | 2004 |
| 268 | HI | 8000.0 | 96 | 768000.0 | 77000.0 | 1.59 | 1221000.0 | 2004 |
| 269 | ID | 100000.0 | 63 | 6300000.0 | 2520000.0 | 1.02 | 6426000.0 | 2004 |
| 270 | IL | 7000.0 | 55 | 385000.0 | 193000.0 | 1.86 | 716000.0 | 2004 |
| 271 | IN | 7000.0 | 59 | 413000.0 | 145000.0 | 1.47 | 607000.0 | 2004 |
| 272 | IA | 35000.0 | 67 | 2345000.0 | 1337000.0 | 1.06 | 2486000.0 | 2004 |
| 273 | KS | 14000.0 | 80 | 1120000.0 | 683000.0 | 1.18 | 1322000.0 | 2004 |
| 274 | KY | 5000.0 | 56 | 280000.0 | 34000.0 | 1.96 | 549000.0 | 2004 |
| 275 | LA | 35000.0 | 98 | 3430000.0 | 240000.0 | 0.79 | 2710000.0 | 2004 |
| 276 | ME | 7000.0 | 31 | 217000.0 | 37000.0 | 1.28 | 278000.0 | 2004 |
| 277 | MI | 65000.0 | 67 | 4355000.0 | 2439000.0 | 1.16 | 5052000.0 | 2004 |
| 278 | MN | 135000.0 | 75 | 10125000.0 | 1924000.0 | 1.08 | 10935000.0 | 2004 |
| 279 | MS | 18000.0 | 65 | 1170000.0 | 421000.0 | 0.79 | 924000.0 | 2004 |
| 280 | MO | 16000.0 | 41 | 656000.0 | 151000.0 | 1.36 | 892000.0 | 2004 |
| 281 | MT | 140000.0 | 77 | 10780000.0 | 3773000.0 | 1.08 | 11642000.0 | 2004 |
| 282 | NE | 51000.0 | 89 | 4539000.0 | 2043000.0 | 1.01 | 4584000.0 | 2004 |
| 283 | NV | 14000.0 | 55 | 770000.0 | 316000.0 | 1.78 | 1371000.0 | 2004 |
| 284 | NJ | 12000.0 | 27 | 324000.0 | 45000.0 | 1.40 | 454000.0 | 2004 |
| 285 | NM | 8000.0 | 44 | 352000.0 | 127000.0 | 1.19 | 419000.0 | 2004 |
| 286 | NY | 64000.0 | 67 | 4288000.0 | 1887000.0 | 1.36 | 5832000.0 | 2004 |
| 287 | NC | 9000.0 | 40 | 360000.0 | 72000.0 | 1.93 | 695000.0 | 2004 |
| 288 | ND | 390000.0 | 78 | 30420000.0 | 9126000.0 | 1.05 | 31941000.0 | 2004 |
| 289 | OH | 16000.0 | 58 | 928000.0 | 353000.0 | 1.53 | 1420000.0 | 2004 |
| 290 | OR | 42000.0 | 54 | 2268000.0 | 1111000.0 | 1.21 | 2744000.0 | 2004 |
| 291 | PA | 30000.0 | 54 | 1620000.0 | 810000.0 | 1.42 | 2300000.0 | 2004 |
| 292 | SD | 215000.0 | 105 | 22575000.0 | 13545000.0 | 1.01 | 22801000.0 | 2004 |
| 293 | TN | 6000.0 | 54 | 324000.0 | 91000.0 | 1.73 | 561000.0 | 2004 |
| 294 | TX | 116000.0 | 76 | 8816000.0 | 1411000.0 | 0.97 | 8552000.0 | 2004 |
| 295 | UT | 24000.0 | 70 | 1680000.0 | 554000.0 | 1.10 | 1848000.0 | 2004 |
| 296 | VT | 6000.0 | 68 | 408000.0 | 192000.0 | 1.51 | 616000.0 | 2004 |
| 297 | VA | 7000.0 | 38 | 266000.0 | 69000.0 | 2.10 | 559000.0 | 2004 |
| 298 | WA | 56000.0 | 63 | 3528000.0 | 1376000.0 | 0.98 | 3457000.0 | 2004 |
| 299 | WV | 9000.0 | 55 | 495000.0 | 183000.0 | 1.41 | 698000.0 | 2004 |
| 300 | WI | 68000.0 | 86 | 5848000.0 | 2632000.0 | 1.19 | 6959000.0 | 2004 |
| 301 | WY | 39000.0 | 75 | 2925000.0 | 380000.0 | 1.10 | 3218000.0 | 2004 |
Working with missing data/nulls!¶
Now, real world data is never 'clean'. We should practice on some data that has null values. Let's re-assign honey_df to a DataFrame with data from honeyproduction_withnulls.csv
honey_df = pd.read_csv('data/honeyproduction_withnulls.csv')
honey_df.head()
| state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year | |
|---|---|---|---|---|---|---|---|---|
| 0 | AL | 16000.0 | 71.0 | 1136000.0 | 159000.0 | 0.72 | 818000.0 | 1998 |
| 1 | AZ | 55000.0 | 60.0 | 3300000.0 | 1485000.0 | 0.64 | 2112000.0 | 1998 |
| 2 | AR | 53000.0 | 65.0 | NaN | 1688000.0 | 0.59 | 2033000.0 | 1998 |
| 3 | CA | 450000.0 | 83.0 | 37350000.0 | 12326000.0 | 0.62 | 23157000.0 | 1998 |
| 4 | CO | 27000.0 | 72.0 | 1944000.0 | 1594000.0 | NaN | 1361000.0 | 1998 |
We can see some nulls: specifically under totalprod and priceperlb. These are marked as NaN, as that's how NumPy handles non-existant data. Let's see how many nulls we have:
honey_df.isna().sum()
state 0 numcol 2 yieldpercol 2 totalprod 5 stocks 5 priceperlb 4 prodvalue 3 year 0 dtype: int64
This returns the sum of the true values from isna(), which gives us a ccount of all nulls in our DataFrame, by column. For totalprod, let's just full the null values with the mean. This can work since there's very few nulls.
honey_df['totalprod'] = honey_df['totalprod'].fillna(
honey_df['totalprod'].mean()
)
honey_df.isna().sum()
state 0 numcol 2 yieldpercol 2 totalprod 0 stocks 5 priceperlb 4 prodvalue 3 year 0 dtype: int64
#TODO do the same for priceperlb
honey_df['priceperlb'] = honey_df['priceperlb'].fillna(
honey_df['priceperlb'].mean()
)
honey_df.isna().sum()
state 0 numcol 2 yieldpercol 2 totalprod 0 stocks 5 priceperlb 0 prodvalue 3 year 0 dtype: int64
Now we're going to just drop the rows that have null values. If you're certain that you only have a few null datapoints and that the presence of null values have no impact on your dataset, this can be a good approach. Let's see the length of honey_df before and after as well.
len(honey_df)
626
honey_df = honey_df.dropna(axis=0)
len(honey_df)
615
Only dropping 11 rows is great for this dataset!
Adding new columns¶
Let's say we need to add the price per ounce as well. Here's how we can add another column to our DataFrame.
honey_df['priceperoz'] = honey_df['priceperlb'] / 16
honey_df.head()
| state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year | priceperoz | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AL | 16000.0 | 71.0 | 1.136000e+06 | 159000.0 | 0.720000 | 818000.0 | 1998 | 0.045000 |
| 1 | AZ | 55000.0 | 60.0 | 3.300000e+06 | 1485000.0 | 0.640000 | 2112000.0 | 1998 | 0.040000 |
| 2 | AR | 53000.0 | 65.0 | 4.133610e+06 | 1688000.0 | 0.590000 | 2033000.0 | 1998 | 0.036875 |
| 3 | CA | 450000.0 | 83.0 | 3.735000e+07 | 12326000.0 | 0.620000 | 23157000.0 | 1998 | 0.038750 |
| 4 | CO | 27000.0 | 72.0 | 1.944000e+06 | 1594000.0 | 1.411994 | 1361000.0 | 1998 | 0.088250 |
Sampling data¶
If we need to take a sampling of our data, say 23%, we can do this easily.
sampled_df = honey_df.sample(frac=0.23)
len(sampled_df)
141
len(honey_df)
615
Applying functions to a column¶
Let's say we need the price per gram. We could do this using a similar syntax to when we created priceperoz, but for the sake of exercise let's use .apply()
honey_df['priceperoz'].apply(lambda p: p * 28.35)
0 1.275750
1 1.134000
2 1.045406
3 1.098563
4 2.501876
...
621 6.679969
622 4.217062
623 5.156156
624 3.632344
625 3.313406
Name: priceperoz, Length: 615, dtype: float64
To add this to our DataFrame, we can just assign a column to this Series.
honey_df['pricepergram'] = honey_df['priceperoz'].apply(lambda p: p * 28.35)
honey_df.head()
| state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year | priceperoz | pricepergram | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AL | 16000.0 | 71.0 | 1.136000e+06 | 159000.0 | 0.720000 | 818000.0 | 1998 | 0.045000 | 1.275750 |
| 1 | AZ | 55000.0 | 60.0 | 3.300000e+06 | 1485000.0 | 0.640000 | 2112000.0 | 1998 | 0.040000 | 1.134000 |
| 2 | AR | 53000.0 | 65.0 | 4.133610e+06 | 1688000.0 | 0.590000 | 2033000.0 | 1998 | 0.036875 | 1.045406 |
| 3 | CA | 450000.0 | 83.0 | 3.735000e+07 | 12326000.0 | 0.620000 | 23157000.0 | 1998 | 0.038750 | 1.098563 |
| 4 | CO | 27000.0 | 72.0 | 1.944000e+06 | 1594000.0 | 1.411994 | 1361000.0 | 1998 | 0.088250 | 2.501876 |
Matplotlib¶
import matplotlib.pyplot as plt
Matplotlib is a powerful and versatile tool for creating graphs. Follow along with the code below:
# an empty plot
plt.plot()
[]
# plotting a list
plt.plot([1, 2, 3, 4])
[<matplotlib.lines.Line2D at 0x165482a50>]
# plotting two lists against eachother
x = [1, 2, 3, 4]
y = [5, 19, 12, 12]
fig, ax = plt.subplots()
ax.plot(x, y)
[<matplotlib.lines.Line2D at 0x1654d35c0>]
# a more professional looking plot, with saving to png!
# 1. prepare data
x = [1, 2, 3, 4]
y = [13, 53, 39, 2]
# 2. setup plot
fig, ax = plt.subplots(figsize=(20,10))
# 3. plotting data
ax.plot(x, y)
# 4. customize plot
ax.set(title="simple plot", xlabel="x axis", ylabel="yaxis")
# 5. save and show figure
fig.savefig('sample-plot.png')
Working with DataFrames¶
# it's very easy to use .plot() on a DataFrame.
honey_df.plot(x='year', y='stocks', kind='scatter')
<Axes: xlabel='year', ylabel='stocks'>
#TODO plot your own scatter plot on honey_df:
honey_df.plot(x='year', y='pricepergram', kind='scatter')
<Axes: xlabel='year', ylabel='pricepergram'>
Seaborn¶
We can use seaborn for prettier graphics
import seaborn as sns
sns.set(rc={"figure.figsize":(20,12)}) # change the figure size to width=20, height=12
sns.lineplot(data=honey_df, x='year', y='stocks', hue='state')
<Axes: xlabel='year', ylabel='stocks'>
Notice here that we need to pass in the data to .lineplot()
What happens if we want a scatterplot against all columns?
sns.pairplot(data=honey_df)
<seaborn.axisgrid.PairGrid at 0x174910560>
#TODO: Change this line plot to a scatter plot. (HINT: Change sns.lineplot)
sns.set(rc={"figure.figsize":(20,12)}) # change the figure size to width=20, height=12
sns.scatterplot(data=honey_df, x='year', y='stocks', hue='state')
<Axes: xlabel='year', ylabel='stocks'>