Data Understanding and Visualization¶

with Pandas, NumPy, and Matplotlib¶

Philip Vishnevsky¶

  • Pandas
    • Powerful data manipulation tool. Built for speed and efficiency.
    • Lets you handle data with columns and rows.
  • NumPy
    • Extremely fast mathematical library. Most scientific computing Python packages are built on top of NumPy.
    • Super-fast N-dimensional arrays.
  • Matplotlib
    • Graphical visualization of array data, whether in pandas, NumPy, or plain list format.
    • The third-party package seaborn offers a more appealing API and color schemes.

In this notebook, we'll install all of the packages at once, then run through NumPy, Pandas, and Matplotlib sections. You'll need to fill in some code, marked by #TODO.

Installation¶

Run the following cell to install all four packages at once. If this does not work on your system, you can manually use the command line to install these packages.

In [7]:
!pip install numpy pandas matplotlib seaborn
Requirement already satisfied: numpy in /opt/anaconda3/lib/python3.12/site-packages (1.26.4)
Requirement already satisfied: pandas in /opt/anaconda3/lib/python3.12/site-packages (2.2.2)
Requirement already satisfied: matplotlib in /opt/anaconda3/lib/python3.12/site-packages (3.8.4)
Requirement already satisfied: seaborn in /opt/anaconda3/lib/python3.12/site-packages (0.13.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /opt/anaconda3/lib/python3.12/site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /opt/anaconda3/lib/python3.12/site-packages (from pandas) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /opt/anaconda3/lib/python3.12/site-packages (from pandas) (2023.3)
Requirement already satisfied: contourpy>=1.0.1 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (4.51.0)
Requirement already satisfied: kiwisolver>=1.3.1 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (1.4.4)
Requirement already satisfied: packaging>=20.0 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (23.2)
Requirement already satisfied: pillow>=8 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (10.3.0)
Requirement already satisfied: pyparsing>=2.3.1 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (3.0.9)
Requirement already satisfied: six>=1.5 in /opt/anaconda3/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)

NumPy¶

In this section, we'll work with NumPy arrays and understand how we can work with this data structure.

In [10]:
import numpy as np

N-Dimensional Arrays¶

Run the two cells below. Observe how the first cell outputs a plain Python list, while the second outputs an np.array with the same values. There is no "shape" to a plain Python list; however, we can call .shape on any np.array to get the shape of that array. (Remember, arrays are N-dimensional!)

In [12]:
[1,2,3]
Out[12]:
[1, 2, 3]
In [13]:
arr1 = np.array([1,2,3])
arr1
Out[13]:
array([1, 2, 3])
In [14]:
arr1.shape
Out[14]:
(3,)

Notice how a one-dimensional array's shape has only one length followed by a trailing comma. This tells us that the first dimension's length is 3. Why is it not (3,1)? A shape of (3,1) would tell us it's a 2-dimensional array whose first dimension has length 3 and whose second has length 1. This may seem like the same kind of array, but it is distinctly different -- especially when we get to Pandas!
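To see the difference concretely, we can reshape a (3,) array into a (3,1) array. A quick sketch, reusing arr1 from above:

```python
import numpy as np

# a 1-dimensional array: shape (3,)
arr1 = np.array([1, 2, 3])

# reshape it into a 2-dimensional "column": shape (3, 1)
col = arr1.reshape(3, 1)

print(arr1.shape)  # (3,)
print(col.shape)   # (3, 1)
print(col.ndim)    # 2 -- one extra dimension, even though the values are the same
```

Same three values, but the (3,1) version is a genuinely 2-dimensional array.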

In [16]:
arr2 = np.array([
    [1,2,3],
    [4,5,6]
])
arr2
Out[16]:
array([[1, 2, 3],
       [4, 5, 6]])
In [17]:
#TODO: Write a line of code below to output the shape of the arr2 array. Expected output is (2,3)
arr2.shape
Out[17]:
(2, 3)
In [18]:
#TODO: Write some code to create an np.array that has the shape (3,3).
arr3 = np.array([
    [1,2,3],
    [4,5,6],
    [7,8,9]
])
arr3.shape
Out[18]:
(3, 3)

Utility properties¶

Using arr1 and arr2 from the previous cells, let's explore some utility properties that numpy offers.

In [20]:
# Finding the number of dimensions
arr1.ndim
Out[20]:
1
In [21]:
arr2.ndim
Out[21]:
2

Notice that .ndim is the same as the length of .shape:

In [23]:
len(arr1.shape)
Out[23]:
1
In [24]:
len(arr2.shape)
Out[24]:
2
In [25]:
# Finding what kind of number is stored
arr1.dtype
Out[25]:
dtype('int64')

Notice that this means we can't mix floats and ints in the same array. If we try, our ints will be converted into floats.

In [27]:
np.array([1.4, 1])
Out[27]:
array([1.4, 1. ])
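If you want to control the dtype yourself rather than rely on this coercion, you can pass dtype= at creation time or convert later with .astype(). A quick sketch:

```python
import numpy as np

# force floats even when given ints
floats = np.array([1, 2, 3], dtype=np.float64)

# convert floats to ints, truncating the fractional part
ints = np.array([1.4, 1.0]).astype(np.int64)

print(floats.dtype)  # float64
print(ints)          # [1 1]
```

Note that .astype() truncates rather than rounds, so 1.4 becomes 1.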

Pandas¶

In [29]:
import pandas as pd

Most of Pandas' strength comes from DataFrames, which are like 2-dimensional arrays. There are also Pandas Series, which are like 1-dimensional arrays. Unlike NumPy arrays, they can easily store objects (such as strings).

In [31]:
animals = pd.Series(['dog','cat','frog','bee'])
animals
Out[31]:
0     dog
1     cat
2    frog
3     bee
dtype: object
In [32]:
colors = pd.Series(['brown', 'black', 'green', 'yellow'])
colors
Out[32]:
0     brown
1     black
2     green
3    yellow
dtype: object
In [33]:
animal_df = pd.DataFrame({
    "animal": animals,
    "color": colors
})
animal_df
Out[33]:
animal color
0 dog brown
1 cat black
2 frog green
3 bee yellow
In [34]:
#TODO: Create your own dataframe! It can be anything, as long as it has at least 2 columns and 3 rows.
shapes_df = pd.DataFrame({
    "parallelograms": ['square','rectangle','rhombus'],
    "gons": ['hexagon','octagon','decagon'],
    "3d_shapes": ['sphere','cube','cone'],
})
shapes_df
Out[34]:
parallelograms gons 3d_shapes
0 square hexagon sphere
1 rectangle octagon cube
2 rhombus decagon cone

Working with data!¶

In this section, we'll be using the two CSV files from Canvas:

  • honeyproduction.csv
  • honeyproduction_withnulls.csv

Download the files from Canvas and place them in a data folder next to this notebook (the cells below read them from the data/ path).

In [37]:
honey_df = pd.read_csv("data/honeyproduction.csv")

Now let's take a look at the first five rows. We can use .head() to display the "head"

In [39]:
honey_df.head()
Out[39]:
state numcol yieldpercol totalprod stocks priceperlb prodvalue year
0 AL 16000.0 71 1136000.0 159000.0 0.72 818000.0 1998
1 AZ 55000.0 60 3300000.0 1485000.0 0.64 2112000.0 1998
2 AR 53000.0 65 3445000.0 1688000.0 0.59 2033000.0 1998
3 CA 450000.0 83 37350000.0 12326000.0 0.62 23157000.0 1998
4 CO 27000.0 72 1944000.0 1594000.0 0.70 1361000.0 1998

DataFrame utility properties!¶

In [41]:
# get the data type for each column.
honey_df.dtypes
Out[41]:
state           object
numcol         float64
yieldpercol      int64
totalprod      float64
stocks         float64
priceperlb     float64
prodvalue      float64
year             int64
dtype: object

Notice the similarity in how these are presented compared to NumPy dtypes!

In [43]:
# get a list-like object (Index) of the columns. This is an iterable object.
honey_df.columns
Out[43]:
Index(['state', 'numcol', 'yieldpercol', 'totalprod', 'stocks', 'priceperlb',
       'prodvalue', 'year'],
      dtype='object')

Now, try to get the shape of the DataFrame. Think of how shapes work in NumPy.

In [45]:
#TODO get shape of honey_df. Expected output: (626, 8)
honey_df.shape
Out[45]:
(626, 8)

We can also use .describe() to see some statistical information on the numerical columns.

In [47]:
honey_df.describe()
Out[47]:
numcol yieldpercol totalprod stocks priceperlb prodvalue year
count 626.000000 626.000000 6.260000e+02 6.260000e+02 626.000000 6.260000e+02 626.000000
mean 60284.345048 62.009585 4.169086e+06 1.318859e+06 1.409569 4.715741e+06 2004.864217
std 91077.087231 19.458754 6.883847e+06 2.272964e+06 0.638599 7.976110e+06 4.317306
min 2000.000000 19.000000 8.400000e+04 8.000000e+03 0.490000 1.620000e+05 1998.000000
25% 9000.000000 48.000000 4.750000e+05 1.430000e+05 0.932500 7.592500e+05 2001.000000
50% 26000.000000 60.000000 1.533000e+06 4.395000e+05 1.360000 1.841500e+06 2005.000000
75% 63750.000000 74.000000 4.175250e+06 1.489500e+06 1.680000 4.703250e+06 2009.000000
max 510000.000000 136.000000 4.641000e+07 1.380000e+07 4.150000 6.961500e+07 2012.000000

Slicing DataFrames¶

Let's practice slicing the DataFrame to precisely access certain parts of the data.

In [49]:
# get the column with state information
honey_df['state']
Out[49]:
0      AL
1      AZ
2      AR
3      CA
4      CO
       ..
621    VA
622    WA
623    WV
624    WI
625    WY
Name: state, Length: 626, dtype: object
In [50]:
# get the columns with state, priceperlb, and year information
honey_df[['state', 'priceperlb', 'year']]
Out[50]:
state priceperlb year
0 AL 0.72 1998
1 AZ 0.64 1998
2 AR 0.59 1998
3 CA 0.62 1998
4 CO 0.70 1998
... ... ... ...
621 VA 3.77 2012
622 WA 2.38 2012
623 WV 2.91 2012
624 WI 2.05 2012
625 WY 1.87 2012

626 rows × 3 columns

CHECKPOINT! Notice how the syntax is different for accessing one column compared to multiple. Also note that when accessing one column, we get a Series object. Contrast this with multiple columns, where we get a DataFrame that is a subset of honey_df.
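We can confirm this with type(). A sketch using a small toy DataFrame so it stands alone; the same holds for honey_df:

```python
import pandas as pd

df = pd.DataFrame({"state": ["AL", "AZ"], "year": [1998, 1998]})

# a bare string gives a Series
print(type(df["state"]))            # <class 'pandas.core.series.Series'>

# a list of strings gives a DataFrame
print(type(df[["state", "year"]]))  # <class 'pandas.core.frame.DataFrame'>

# even a ONE-element list gives a DataFrame, not a Series
print(type(df[["state"]]))          # <class 'pandas.core.frame.DataFrame'>
```

The double brackets in honey_df[['state', 'priceperlb', 'year']] are just a list of column names inside the selection brackets.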

But what about rows?

In [53]:
# let's take a look at our DataFrame for reference.
honey_df.head()
Out[53]:
state numcol yieldpercol totalprod stocks priceperlb prodvalue year
0 AL 16000.0 71 1136000.0 159000.0 0.72 818000.0 1998
1 AZ 55000.0 60 3300000.0 1485000.0 0.64 2112000.0 1998
2 AR 53000.0 65 3445000.0 1688000.0 0.59 2033000.0 1998
3 CA 450000.0 83 37350000.0 12326000.0 0.62 23157000.0 1998
4 CO 27000.0 72 1944000.0 1594000.0 0.70 1361000.0 1998
In [54]:
# if we want to LOCate the row with the label 0:
honey_df.loc[0]
Out[54]:
state                 AL
numcol           16000.0
yieldpercol           71
totalprod      1136000.0
stocks          159000.0
priceperlb          0.72
prodvalue       818000.0
year                1998
Name: 0, dtype: object
In [55]:
# if we want to locate the row based on position (not necessarily the labeled index)
honey_df.iloc[2]
Out[55]:
state                 AR
numcol           53000.0
yieldpercol           65
totalprod      3445000.0
stocks         1688000.0
priceperlb          0.59
prodvalue      2033000.0
year                1998
Name: 2, dtype: object
In [56]:
# we can do ranges as well. they work like normal python slicing syntax
honey_df.iloc[1:4]
Out[56]:
state numcol yieldpercol totalprod stocks priceperlb prodvalue year
1 AZ 55000.0 60 3300000.0 1485000.0 0.64 2112000.0 1998
2 AR 53000.0 65 3445000.0 1688000.0 0.59 2033000.0 1998
3 CA 450000.0 83 37350000.0 12326000.0 0.62 23157000.0 1998
In [57]:
# just to prove the point, we can even get the first few even rows:
honey_df.iloc[0:11:2]
Out[57]:
state numcol yieldpercol totalprod stocks priceperlb prodvalue year
0 AL 16000.0 71 1136000.0 159000.0 0.72 818000.0 1998
2 AR 53000.0 65 3445000.0 1688000.0 0.59 2033000.0 1998
4 CO 27000.0 72 1944000.0 1594000.0 0.70 1361000.0 1998
6 GA 75000.0 56 4200000.0 307000.0 0.69 2898000.0 1998
8 ID 120000.0 50 6000000.0 2220000.0 0.65 3900000.0 1998
10 IN 9000.0 92 828000.0 489000.0 0.85 704000.0 1998

Filtering DataFrames with conditionals¶

In [59]:
# filter the honey_df DataFrame so that we have all rows from NC
honey_df[honey_df['state'] == 'NC']
Out[59]:
state numcol yieldpercol totalprod stocks priceperlb prodvalue year
27 NC 8000.0 59 472000.0 151000.0 1.38 651000.0 1998
70 NC 9000.0 46 414000.0 104000.0 1.62 671000.0 1999
113 NC 11000.0 49 539000.0 243000.0 1.43 771000.0 2000
156 NC 13000.0 44 572000.0 172000.0 1.48 847000.0 2001
200 NC 16000.0 42 672000.0 74000.0 1.41 948000.0 2002
244 NC 10000.0 44 440000.0 79000.0 1.92 845000.0 2003
287 NC 9000.0 40 360000.0 72000.0 1.93 695000.0 2004
328 NC 10000.0 54 540000.0 146000.0 1.88 1015000.0 2005
369 NC 10000.0 50 500000.0 215000.0 1.57 785000.0 2006
410 NC 12000.0 45 540000.0 76000.0 2.49 1345000.0 2007
451 NC 12000.0 52 624000.0 137000.0 2.18 1360000.0 2008
491 NC 11000.0 45 495000.0 84000.0 2.57 1272000.0 2009
531 NC 13000.0 46 598000.0 144000.0 2.66 1591000.0 2010
571 NC 14000.0 62 868000.0 95000.0 2.83 2456000.0 2011
611 NC 13000.0 39 507000.0 106000.0 3.76 1906000.0 2012

This syntax is confusing, but let's break it down:

Here's the column of honey_df that has state data

In [61]:
honey_df['state']
Out[61]:
0      AL
1      AZ
2      AR
3      CA
4      CO
       ..
621    VA
622    WA
623    WV
624    WI
625    WY
Name: state, Length: 626, dtype: object

We can use a conditional to return a Series of boolean values, which indicates whether the conditional evaluated to True or False for each row.

In [63]:
honey_df['state'] == 'NC'
Out[63]:
0      False
1      False
2      False
3      False
4      False
       ...  
621    False
622    False
623    False
624    False
625    False
Name: state, Length: 626, dtype: bool

Bringing it together, we get the rows of honey_df where our conditional was true!

In [65]:
honey_df[honey_df['state'] == 'NC']
Out[65]:
state numcol yieldpercol totalprod stocks priceperlb prodvalue year
27 NC 8000.0 59 472000.0 151000.0 1.38 651000.0 1998
70 NC 9000.0 46 414000.0 104000.0 1.62 671000.0 1999
113 NC 11000.0 49 539000.0 243000.0 1.43 771000.0 2000
156 NC 13000.0 44 572000.0 172000.0 1.48 847000.0 2001
200 NC 16000.0 42 672000.0 74000.0 1.41 948000.0 2002
244 NC 10000.0 44 440000.0 79000.0 1.92 845000.0 2003
287 NC 9000.0 40 360000.0 72000.0 1.93 695000.0 2004
328 NC 10000.0 54 540000.0 146000.0 1.88 1015000.0 2005
369 NC 10000.0 50 500000.0 215000.0 1.57 785000.0 2006
410 NC 12000.0 45 540000.0 76000.0 2.49 1345000.0 2007
451 NC 12000.0 52 624000.0 137000.0 2.18 1360000.0 2008
491 NC 11000.0 45 495000.0 84000.0 2.57 1272000.0 2009
531 NC 13000.0 46 598000.0 144000.0 2.66 1591000.0 2010
571 NC 14000.0 62 868000.0 95000.0 2.83 2456000.0 2011
611 NC 13000.0 39 507000.0 106000.0 3.76 1906000.0 2012
In [66]:
#TODO write your own conditional to filter a subset of the DataFrame.
#It can be anything EXCEPT filtering by state. Hint: Try year!
honey_df[honey_df['year'] == 2004]
Out[66]:
state numcol yieldpercol totalprod stocks priceperlb prodvalue year
261 AL 12000.0 87 1044000.0 282000.0 1.41 1472000.0 2004
262 AZ 32000.0 55 1760000.0 774000.0 1.11 1954000.0 2004
263 AR 40000.0 57 2280000.0 388000.0 0.87 1984000.0 2004
264 CA 390000.0 45 17550000.0 5792000.0 1.05 18428000.0 2004
265 CO 23000.0 80 1840000.0 791000.0 1.35 2484000.0 2004
266 FL 205000.0 98 20090000.0 2009000.0 1.02 20492000.0 2004
267 GA 63000.0 49 3087000.0 648000.0 1.20 3704000.0 2004
268 HI 8000.0 96 768000.0 77000.0 1.59 1221000.0 2004
269 ID 100000.0 63 6300000.0 2520000.0 1.02 6426000.0 2004
270 IL 7000.0 55 385000.0 193000.0 1.86 716000.0 2004
271 IN 7000.0 59 413000.0 145000.0 1.47 607000.0 2004
272 IA 35000.0 67 2345000.0 1337000.0 1.06 2486000.0 2004
273 KS 14000.0 80 1120000.0 683000.0 1.18 1322000.0 2004
274 KY 5000.0 56 280000.0 34000.0 1.96 549000.0 2004
275 LA 35000.0 98 3430000.0 240000.0 0.79 2710000.0 2004
276 ME 7000.0 31 217000.0 37000.0 1.28 278000.0 2004
277 MI 65000.0 67 4355000.0 2439000.0 1.16 5052000.0 2004
278 MN 135000.0 75 10125000.0 1924000.0 1.08 10935000.0 2004
279 MS 18000.0 65 1170000.0 421000.0 0.79 924000.0 2004
280 MO 16000.0 41 656000.0 151000.0 1.36 892000.0 2004
281 MT 140000.0 77 10780000.0 3773000.0 1.08 11642000.0 2004
282 NE 51000.0 89 4539000.0 2043000.0 1.01 4584000.0 2004
283 NV 14000.0 55 770000.0 316000.0 1.78 1371000.0 2004
284 NJ 12000.0 27 324000.0 45000.0 1.40 454000.0 2004
285 NM 8000.0 44 352000.0 127000.0 1.19 419000.0 2004
286 NY 64000.0 67 4288000.0 1887000.0 1.36 5832000.0 2004
287 NC 9000.0 40 360000.0 72000.0 1.93 695000.0 2004
288 ND 390000.0 78 30420000.0 9126000.0 1.05 31941000.0 2004
289 OH 16000.0 58 928000.0 353000.0 1.53 1420000.0 2004
290 OR 42000.0 54 2268000.0 1111000.0 1.21 2744000.0 2004
291 PA 30000.0 54 1620000.0 810000.0 1.42 2300000.0 2004
292 SD 215000.0 105 22575000.0 13545000.0 1.01 22801000.0 2004
293 TN 6000.0 54 324000.0 91000.0 1.73 561000.0 2004
294 TX 116000.0 76 8816000.0 1411000.0 0.97 8552000.0 2004
295 UT 24000.0 70 1680000.0 554000.0 1.10 1848000.0 2004
296 VT 6000.0 68 408000.0 192000.0 1.51 616000.0 2004
297 VA 7000.0 38 266000.0 69000.0 2.10 559000.0 2004
298 WA 56000.0 63 3528000.0 1376000.0 0.98 3457000.0 2004
299 WV 9000.0 55 495000.0 183000.0 1.41 698000.0 2004
300 WI 68000.0 86 5848000.0 2632000.0 1.19 6959000.0 2004
301 WY 39000.0 75 2925000.0 380000.0 1.10 3218000.0 2004
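Conditionals can also be combined with & (and) and | (or); each condition needs its own parentheses. A sketch using a toy DataFrame so it stands alone -- the same idea applies directly to honey_df:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["NC", "NC", "CA"],
    "year": [2004, 2010, 2004],
})

# rows where BOTH conditions hold -- note the parentheses around each condition
both = df[(df["state"] == "NC") & (df["year"] == 2004)]
print(both)
```

Using Python's plain `and`/`or` here raises an error, because those operators can't handle element-wise comparison over a whole Series.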

Working with missing data/nulls!¶

Now, real-world data is never 'clean'. We should practice on some data that has null values. Let's reassign honey_df to a DataFrame with data from honeyproduction_withnulls.csv.

In [68]:
honey_df = pd.read_csv('data/honeyproduction_withnulls.csv')
In [69]:
honey_df.head()
Out[69]:
state numcol yieldpercol totalprod stocks priceperlb prodvalue year
0 AL 16000.0 71.0 1136000.0 159000.0 0.72 818000.0 1998
1 AZ 55000.0 60.0 3300000.0 1485000.0 0.64 2112000.0 1998
2 AR 53000.0 65.0 NaN 1688000.0 0.59 2033000.0 1998
3 CA 450000.0 83.0 37350000.0 12326000.0 0.62 23157000.0 1998
4 CO 27000.0 72.0 1944000.0 1594000.0 NaN 1361000.0 1998

We can see some nulls: specifically under totalprod and priceperlb. These are marked as NaN, as that's how NumPy represents nonexistent data. Let's see how many nulls we have:

In [71]:
honey_df.isna().sum()
Out[71]:
state          0
numcol         2
yieldpercol    2
totalprod      5
stocks         5
priceperlb     4
prodvalue      3
year           0
dtype: int64

This returns the sum of the True values from isna(), which gives us a count of all nulls in our DataFrame, by column. For totalprod, let's just fill the null values with the mean. This can work since there are very few nulls.

In [73]:
honey_df['totalprod'] = honey_df['totalprod'].fillna(
    honey_df['totalprod'].mean()
)
In [74]:
honey_df.isna().sum()
Out[74]:
state          0
numcol         2
yieldpercol    2
totalprod      0
stocks         5
priceperlb     4
prodvalue      3
year           0
dtype: int64
In [75]:
#TODO do the same for priceperlb
honey_df['priceperlb'] = honey_df['priceperlb'].fillna(
    honey_df['priceperlb'].mean()
)
honey_df.isna().sum()
Out[75]:
state          0
numcol         2
yieldpercol    2
totalprod      0
stocks         5
priceperlb     0
prodvalue      3
year           0
dtype: int64

Now we're going to just drop the rows that have null values. If you're certain that you only have a few null datapoints and that the presence of null values has no impact on your dataset, this can be a good approach. Let's see the length of honey_df before and after as well.

In [77]:
len(honey_df)
Out[77]:
626
In [78]:
honey_df = honey_df.dropna(axis=0)
In [79]:
len(honey_df)
Out[79]:
615

Only dropping 11 rows is great for this dataset!
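dropna has a few useful knobs beyond axis=0: subset= limits the null check to specific columns, and axis=1 drops columns instead of rows. A sketch with a toy DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],
})

# drop rows only when 'a' is null -- nulls in 'b' are ignored
print(df.dropna(subset=["a"]))

# drop any COLUMN that contains a null (here, both columns go)
print(df.dropna(axis=1))
```

Dropping columns is rarely what you want with only a few nulls, but it can be handy when a column is almost entirely empty.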

Adding new columns¶

Let's say we need to add the price per ounce as well. Here's how we can add another column to our DataFrame.

In [83]:
honey_df['priceperoz'] = honey_df['priceperlb'] / 16
In [84]:
honey_df.head()
Out[84]:
state numcol yieldpercol totalprod stocks priceperlb prodvalue year priceperoz
0 AL 16000.0 71.0 1.136000e+06 159000.0 0.720000 818000.0 1998 0.045000
1 AZ 55000.0 60.0 3.300000e+06 1485000.0 0.640000 2112000.0 1998 0.040000
2 AR 53000.0 65.0 4.133610e+06 1688000.0 0.590000 2033000.0 1998 0.036875
3 CA 450000.0 83.0 3.735000e+07 12326000.0 0.620000 23157000.0 1998 0.038750
4 CO 27000.0 72.0 1.944000e+06 1594000.0 1.411994 1361000.0 1998 0.088250

Sampling data¶

If we need to take a sample of our data, say 23%, we can do this easily.

In [87]:
sampled_df = honey_df.sample(frac=0.23)
In [88]:
len(sampled_df)
Out[88]:
141
In [89]:
len(honey_df)
Out[89]:
615
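Note that .sample() draws a different random subset each run. If you need reproducibility, pass random_state=; if you need an exact row count rather than a fraction, pass n=. A sketch:

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})

# same seed -> same sample every run
s1 = df.sample(frac=0.23, random_state=42)
s2 = df.sample(frac=0.23, random_state=42)
print(s1.equals(s2))  # True

# exactly 10 rows instead of a fraction
print(len(df.sample(n=10)))  # 10
```

Fixing random_state is especially useful in notebooks, where rerunning cells would otherwise give you a different sample each time.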

Applying functions to a column¶

Let's say we need the price per gram. We could do this using a similar syntax to when we created priceperoz, but for the sake of the exercise let's use .apply().

In [92]:
honey_df['priceperoz'].apply(lambda p: p * 28.35)
Out[92]:
0      1.275750
1      1.134000
2      1.045406
3      1.098563
4      2.501876
         ...   
621    6.679969
622    4.217062
623    5.156156
624    3.632344
625    3.313406
Name: priceperoz, Length: 615, dtype: float64

To add this to our DataFrame, we can just assign a column to this Series.

In [94]:
honey_df['pricepergram'] = honey_df['priceperoz'].apply(lambda p: p * 28.35)
In [95]:
honey_df.head()
Out[95]:
state numcol yieldpercol totalprod stocks priceperlb prodvalue year priceperoz pricepergram
0 AL 16000.0 71.0 1.136000e+06 159000.0 0.720000 818000.0 1998 0.045000 1.275750
1 AZ 55000.0 60.0 3.300000e+06 1485000.0 0.640000 2112000.0 1998 0.040000 1.134000
2 AR 53000.0 65.0 4.133610e+06 1688000.0 0.590000 2033000.0 1998 0.036875 1.045406
3 CA 450000.0 83.0 3.735000e+07 12326000.0 0.620000 23157000.0 1998 0.038750 1.098563
4 CO 27000.0 72.0 1.944000e+06 1594000.0 1.411994 1361000.0 1998 0.088250 2.501876
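.apply() also works across entire rows with axis=1, which is handy when a computation needs several columns at once. A sketch with a toy DataFrame using two of the honey columns:

```python
import pandas as pd

df = pd.DataFrame({
    "totalprod": [1000.0, 2000.0],
    "priceperlb": [0.72, 0.64],
})

# estimate production value row by row, combining two columns
value = df.apply(lambda row: row["totalprod"] * row["priceperlb"], axis=1)
print(value)
```

For simple arithmetic like this, the vectorized form df["totalprod"] * df["priceperlb"] is faster; axis=1 apply earns its keep when the per-row logic is more complex than a single expression.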

Matplotlib¶

In [97]:
import matplotlib.pyplot as plt

Matplotlib is a powerful and versatile tool for creating graphs. Follow along with the code below:

In [99]:
# an empty plot
plt.plot()
Out[99]:
[]
In [100]:
# plotting a list
plt.plot([1, 2, 3, 4])
Out[100]:
[<matplotlib.lines.Line2D at 0x165482a50>]
In [101]:
# plotting two lists against each other
x = [1, 2, 3, 4]
y = [5, 19, 12, 12]
fig, ax = plt.subplots()
ax.plot(x, y)
Out[101]:
[<matplotlib.lines.Line2D at 0x1654d35c0>]
In [102]:
# a more professional-looking plot, with saving to PNG!
# 1. prepare data
x = [1, 2, 3, 4]
y = [13, 53, 39, 2]

# 2. setup plot
fig, ax = plt.subplots(figsize=(20,10))

# 3. plotting data
ax.plot(x, y)

# 4. customize plot
ax.set(title="simple plot", xlabel="x axis", ylabel="yaxis")

# 5. save and show figure
fig.savefig('sample-plot.png')

Working with DataFrames¶

In [104]:
# it's very easy to use .plot() on a DataFrame.
honey_df.plot(x='year', y='stocks', kind='scatter')
Out[104]:
<Axes: xlabel='year', ylabel='stocks'>
In [105]:
#TODO plot your own scatter plot on honey_df:
honey_df.plot(x='year', y='pricepergram', kind='scatter')
Out[105]:
<Axes: xlabel='year', ylabel='pricepergram'>

Seaborn¶

We can use seaborn for prettier graphics.

In [107]:
import seaborn as sns
In [108]:
sns.set(rc={"figure.figsize":(20,12)}) # change the figure size to width=20, height=12
sns.lineplot(data=honey_df, x='year', y='stocks', hue='state')
Out[108]:
<Axes: xlabel='year', ylabel='stocks'>

Notice here that we need to pass the DataFrame to .lineplot() via the data= keyword.

What happens if we want a scatterplot against all columns?

In [111]:
sns.pairplot(data=honey_df)
Out[111]:
<seaborn.axisgrid.PairGrid at 0x174910560>
In [112]:
#TODO: Change this line plot to a scatter plot. (HINT: Change sns.lineplot)
sns.set(rc={"figure.figsize":(20,12)}) # change the figure size to width=20, height=12
sns.scatterplot(data=honey_df, x='year', y='stocks', hue='state')
Out[112]:
<Axes: xlabel='year', ylabel='stocks'>