Before we can look at any plot types, we need some data to work with.
The iris flower dataset has been available since 1936 and is still incredibly useful almost a century later. It consists of measurements on 150 irises. There are three species of iris in the dataset and for each flower four measurements are given: petal length and width, and sepal length and width.
Sepals are leaves that surround the flower and somewhat resemble petals, but aren’t actually petals. They are easily seen on cauliflowers, for example (a cauliflower being an actual flower bud), where you typically cut them off before cooking.
The iris dataset, along with many other useful datasets, can be loaded automatically by the scikit-learn package. You may need to install scikit-learn first with pip install scikit-learn.
We can then load the data, retrieving it in a dictionary-like structure.
Important keys are as follows.
- data contains the actual iris data: sepal length, sepal width, petal length, petal width (all in centimetres)
- target contains a number, 0, 1 or 2, that identifies the iris species for each sample
- target_names contains the actual species names for each of the indices in target
import sklearn.datasets as ds
iris = ds.load_iris()
print(iris['data'])
print(iris['target_names'])
print(iris['target'])
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
...
[6.5 3. 5.2 2. ]
[6.2 3.4 5.4 2.3]
[5.9 3. 5.1 1.8]]
['setosa' 'versicolor' 'virginica']
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
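As a quick check on the structure described above, the following sketch lists the keys of the returned object and prints the shapes of the two main arrays. It also prints feature_names, a key not covered in the list above, which holds the labels of the four measurement columns.

```python
import sklearn.datasets as ds

iris = ds.load_iris()

# The returned object behaves like a dictionary, so its keys can be listed
print(list(iris.keys()))

# One row per flower, one column per measurement
print(iris['data'].shape)    # (150, 4)

# One species index per flower
print(iris['target'].shape)  # (150,)

# Labels for the four measurement columns
print(iris['feature_names'])
```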
To make this a little easier to work with, we can use the as_frame parameter to return the main data segment as a Pandas data frame. This again is a dictionary-like structure where the keys are the column names.
import sklearn.datasets as ds
iris = ds.load_iris(as_frame=True)
print(iris['data'])
print(iris['target_names'])
print(iris['target'])
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
.. ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8
[150 rows x 4 columns]
['setosa' 'versicolor' 'virginica']
0 0
1 0
2 0
3 0
4 0
..
145 2
146 2
147 2
148 2
149 2
Name: target, Length: 150, dtype: int64
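Because the data frame is dictionary-like, a single column can be pulled out by its name. Here is a minimal sketch of that idea; the variable names data and petal_length are only chosen for illustration.

```python
import sklearn.datasets as ds

iris = ds.load_iris(as_frame=True)
data = iris['data']

# Look up one column by its name, just like a dictionary key
petal_length = data['petal length (cm)']
print(petal_length.head())

# The column names act as the keys of the data frame
print(list(data.columns))

# A column supports summary calculations directly
print(petal_length.mean())
```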
If you want to put all this data in a single data frame (a data frame basically being a table of information with numbered rows), it’s easily done.
After getting the measurements from the data key, we can add a new column containing the target information, and we can add the actual species names by mapping the target indices to text using, for example, the NumPy choose function.
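import sklearn.datasets as ds
import numpy as np
iris = ds.load_iris(as_frame=True)
# Copy so that adding columns does not modify iris['data'] itself
df = iris['data'].copy()
# Numeric species index (0, 1 or 2) for each sample
df['species_index'] = iris['target']
# Map each index to its species name
df['species'] = np.choose(iris['target'], iris['target_names'])
print(df)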
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species_index species
0 5.1 3.5 1.4 0.2 0 setosa
1 4.9 3.0 1.4 0.2 0 setosa
2 4.7 3.2 1.3 0.2 0 setosa
3 4.6 3.1 1.5 0.2 0 setosa
4 5.0 3.6 1.4 0.2 0 setosa
.. ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 2 virginica
146 6.3 2.5 5.0 1.9 2 virginica
147 6.5 3.0 5.2 2.0 2 virginica
148 6.2 3.4 5.4 2.3 2 virginica
149 5.9 3.0 5.1 1.8 2 virginica
[150 rows x 6 columns]
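The choose function is only one way to do the mapping. As a sketch of an alternative, the target_names array can be indexed directly with the array of target indices, which picks out one name per sample. The name df_alt is made up for this example.

```python
import sklearn.datasets as ds

iris = ds.load_iris(as_frame=True)

# Copy the measurements so the loaded data is left untouched
df_alt = iris['data'].copy()
df_alt['species_index'] = iris['target']

# Index the array of names with the array of indices: one name per sample
df_alt['species'] = iris['target_names'][iris['target'].to_numpy()]

# Count how many samples there are of each species
print(df_alt['species'].value_counts())
```

Either approach should give the same species column; which one reads better is largely a matter of taste.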
This is getting a bit hard to read when displayed on a web page because the rows contain too many columns to easily fit on one line, but for some purposes it’s very convenient to work with.
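One possible way to make the printed table narrower, sketched here, is to rename the columns to shorter labels before printing. The short labels below are arbitrary choices made for this example, not names used by scikit-learn.

```python
import sklearn.datasets as ds

iris = ds.load_iris(as_frame=True)

# Hypothetical short labels, chosen only to save horizontal space
short_names = {
    'sepal length (cm)': 'sep_len',
    'sepal width (cm)': 'sep_wid',
    'petal length (cm)': 'pet_len',
    'petal width (cm)': 'pet_wid',
}

# rename returns a new data frame and leaves the original unchanged
narrow = iris['data'].rename(columns=short_names)
narrow['species'] = iris['target_names'][iris['target'].to_numpy()]

print(narrow.head())
```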
Scikit-learn lets you load many other datasets in much the same way.
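As an illustration, here is a sketch that loads the wine dataset, another of the small datasets bundled with scikit-learn, using the same pattern. The choice of load_wine is ours; the keys work the same way as for the iris data.

```python
import sklearn.datasets as ds

wine = ds.load_wine(as_frame=True)

# Same dictionary-like structure: data, target and target_names
print(wine['data'].shape)  # (178, 13)
print(wine['target_names'])
print(wine['target'].head())
```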