Creating DataFrames

The Pandas package enables us to create “dataframes”, which are a lot like sheets in a spreadsheet app.

This means you can handle numerical and textual data together in the same data structure.
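
To make this concrete, here is a minimal sketch of a dataframe holding text and numbers side by side. The names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: one text column and one numeric column.
people = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cat'],
    'weight': [80, 50, 90],
})

# Each column keeps its own data type.
print(people.dtypes)

# Numeric operations apply only to the numeric column.
mean_weight = people['weight'].mean()
print(mean_weight)
```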

There are many different ways to create a dataframe, and we’ll take a look at a few here.

Creating and Populating Dataframes from Scratch

We can easily create empty dataframes and then populate them with data in a variety of different ways.

import pandas as pd

df = pd.DataFrame()

df['weight'] = [80, 50, 90]
df['height'] = [182, 159, 181]

print(df)
   weight  height
0      80     182
1      50     159
2      90     181

Notice the first pseudo-column is an index that labels all the rows. This can be accessed via df.index, which is an iterable Index object (here a RangeIndex) rather than a plain list or iterator.
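
As a rough sketch of what the index looks like in practice, the snippet below builds the same small dataframe and inspects its index. The variable names labels and first_row are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'weight': [80, 50, 90], 'height': [182, 159, 181]})

# The default index is a RangeIndex counting up from 0.
print(df.index)

# It is iterable, so it can be looped over or turned into a list.
labels = list(df.index)
print(labels)

# Index labels can be used to look up a row.
first_row = df.loc[0]
print(first_row['weight'])
```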

We can alternatively supply data in a dictionary.

import pandas as pd

# Column labels match the first example: weights first, then heights.
data = {
    'weight': [80, 50, 90],
    'height': [182, 159, 181],
}

df = pd.DataFrame(data)

print(df)
   weight  height
0      80     182
1      50     159
2      90     181
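
Another option, not covered above, is to supply a list of rows, each given as a dictionary. This is a minimal sketch using the same values as the first example. The name df_rows is only illustrative.

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names.
rows = [
    {'weight': 80, 'height': 182},
    {'weight': 50, 'height': 159},
    {'weight': 90, 'height': 181},
]

df_rows = pd.DataFrame(rows)

print(df_rows)
```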

Loading Existing Data

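We can easily load CSV data from files. Suppose the file heightweight.txt contains the following:

weight,height
80,182
50,159
90,181

We can then read it with read_csv.

import pandas as pd

# A comma is already the default delimiter; it is given here for clarity.
df = pd.read_csv('heightweight.txt', delimiter=',')

print(df)
   weight  height
0      80     182
1      50     159
2      90     181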
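
If you want to try this without creating a file, read_csv also accepts a file-like object. The sketch below wraps a string in io.StringIO as a stand-in for a real file. The CSV text is sample data assumed for illustration.

```python
import io

import pandas as pd

# Sample CSV text standing in for the contents of a file on disk.
csv_text = """weight,height
80,182
50,159
90,181
"""

# StringIO makes the string behave like an open text file.
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv)

# The header row supplies the column names, and the values are parsed as integers.
print(df_csv.dtypes)
```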

Loading Larger Datasets

Now let’s try to load the iris flower dataset. We’ll add the species into its own column.

import pandas as pd
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris(as_frame=True)

df = pd.DataFrame(iris['data'])

df['species'] = np.choose(iris['target'], iris['target_names'])

print(df)
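
The species column is filled by using each integer target code to pick a name from the list of target names. Here is a minimal sketch of that lookup, using small stand-in arrays rather than the real iris data. It also shows plain NumPy indexing as an equivalent alternative to np.choose.

```python
import numpy as np

# Stand-in values (assumed for illustration, not loaded from scikit-learn).
target_names = np.array(['setosa', 'versicolor', 'virginica'])
target = np.array([0, 0, 1, 2, 1])

# np.choose picks, for each code, the entry at that position in target_names.
species_choose = np.choose(target, target_names)

# Indexing the names array with the codes gives the same result.
species_index = target_names[target]

print(species_choose)
print(species_index)
```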

This output is actually easier to read in a terminal. In a browser I find the rows are too long to display on one line and so they wrap, but this will depend on your font size.

Notice that by default not all rows are displayed. If you have many columns, not all of them may be displayed either. Usually this is what you want, but this truncation may be reconfigured if needed.

Here the output is rather long, so I won’t reproduce it. But notice the calls to the pd.set_option function, which raise the display limits for rows and columns.

import pandas as pd
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris(as_frame=True)

df = pd.DataFrame(iris['data'])

df['species'] = np.choose(iris['target'], iris['target_names'])

pd.set_option("display.max_rows", 200)
pd.set_option("display.max_columns", 20)

print(df)
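
Note that set_option changes the display settings for the rest of the session. If you only want a different limit for one printout, pandas also provides pd.option_context, which applies options inside a with block only. This is a minimal sketch using a made-up 100-row dataframe rather than the iris data.

```python
import pandas as pd

# Stand-in data: 100 rows, enough to trigger truncation at small limits.
big = pd.DataFrame({'n': range(100), 'square': [i * i for i in range(100)]})

# Inside this with block, at most 10 rows are shown before truncating.
with pd.option_context('display.max_rows', 10):
    short_text = repr(big)

# With a limit above the row count, every row is shown.
with pd.option_context('display.max_rows', 200):
    full_text = repr(big)

print(short_text)
print(len(full_text.splitlines()))
```

Once each with block ends, the previous display settings are restored.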
