Notes

Before we can even get to the fun of data analysis, we first need to learn how to load in our data!

Today, we’ll learn to work with the following categories of data inputs and outputs:

Reading and Writing Data in Text Format

read_csv Arguments

Table 6.1 lists the various data types pandas can read.

Each function can be called with pd.read_* (for example, pd.read_csv).

Note

Wes points out that the number of arguments can be overwhelming. pd.read_csv has about 50. The pandas documentation is a good resource for finding the right arguments.

Table 6.2 lists frequently used options in pd.read_csv.

Let’s import the Palmer Penguins dataset to explore this function and some of the csv arguments. Note: I added random numbers for month and day to demonstrate date parsing.

import pandas as pd

penguins = pd.read_csv("data/penguins.csv")

penguins.head(5)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex month day year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 4 10 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 3 6 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 7 22 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2 13 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 8 21 2007

Index Columns

Indexing options let you treat one or more columns as the returned DataFrame's index, and control whether column names come from the file, from arguments you provide, or not at all.

penguins_indexed = pd.read_csv("data/penguins.csv", index_col = "species")
penguins_indexed.head(5)
island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex month day year
species
Adelie Torgersen 39.1 18.7 181.0 3750.0 male 4 10 2007
Adelie Torgersen 39.5 17.4 186.0 3800.0 female 3 6 2007
Adelie Torgersen 40.3 18.0 195.0 3250.0 female 7 22 2007
Adelie Torgersen NaN NaN NaN NaN NaN 2 13 2007
Adelie Torgersen 36.7 19.3 193.0 3450.0 female 8 21 2007

Infer or Convert Data Type

Type inference and data conversion includes user-defined value conversions and custom lists of missing value markers.

If your data uses another string value as the missing placeholder, you can add it to na_values.

penguins_NA = pd.read_csv(
  "data/penguins.csv", 
  na_values = ["male"]
  )
  
penguins_NA.head(5)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex month day year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 NaN 4 10 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 3 6 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 7 22 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2 13 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 8 21 2007

Parse Date and Time

Date and time parsing includes a combining capability: date and time information spread over multiple columns can be merged into a single column in the result.

penguins_dates = pd.read_csv(
  "data/penguins.csv", 
  parse_dates = {"date": ["month", "day", "year"]}
  )
  
penguins_dates["date"] = pd.to_datetime(
  penguins_dates.date, 
  format = "%m%d%Y"
  )
  
print(penguins_dates.date.head(5))

print(penguins_dates.date.dtypes)
0   2007-04-10
1   2007-03-06
2   2007-07-22
3   2007-02-13
4   2007-08-21
Name: date, dtype: datetime64[ns]
datetime64[ns]

Iterate Through Large Files

Iterating provides support for reading very large files in chunks rather than all at once.

Using nrows to read in only 5 rows:

pd.read_csv("data/penguins.csv", nrows = 5
  )
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex month day year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 4 10 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 3 6 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 7 22 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2 13 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 8 21 2007

Using chunksize and the TextFileReader to aggregate and summarize the data by species:

chunker = pd.read_csv("data/penguins.csv", chunksize = 10)

print(type(chunker))

tot = pd.Series([], dtype = 'int64')
for piece in chunker:
    tot = tot.add(piece["species"].value_counts(), fill_value = 0)

tot
<class 'pandas.io.parsers.readers.TextFileReader'>
Adelie       152.0
Chinstrap     68.0
Gentoo       124.0
dtype: float64

Import Semi-Clean Data

Unclean data options handle skipping rows or a footer, comments, and other minor issues like numeric data with thousands separators.

We can import a subset of the columns using usecols and rename them (header = 0 marks the first row as the header to replace; names supplies the new column names).

penguins_custom = pd.read_csv(
  "data/penguins.csv", 
  usecols = [0,1,6],
  header = 0, 
  names = ["Species", "Island", "Sex"]
  )

penguins_custom.head(5)
Species Island Sex
0 Adelie Torgersen male
1 Adelie Torgersen female
2 Adelie Torgersen female
3 Adelie Torgersen NaN
4 Adelie Torgersen female

Writing Data to Text Format

To write to a CSV file, we can use the pandas DataFrame's to_csv method with index = False so the row numbers are not stored in the first column. Missing values are written as empty strings by default, so we can specify a placeholder with na_rep = "NA":

penguins_custom.to_csv(
  "data/penguins_custom.csv", 
  index = False,
  na_rep = "NA"
  )

Working with Other Delimited Formats

Reading

In case your tabular data makes pandas trip up and you need a little extra manual processing, you can use Python's built-in csv module.

Let’s read in the penguins dataset the hard, manual way.

import csv

penguin_reader = csv.reader(penguins)

print(penguin_reader)
<_csv.reader object at 0x0000026704996340>

Now we have a _csv.reader object.

Next, Wes iterates through the reader to print the lines. Because we passed the DataFrame penguins rather than an open file object, iterating yields only the column names, one per line:

for line in penguin_reader:
  print(line)
['species']
['island']
['bill_length_mm']
['bill_depth_mm']
['flipper_length_mm']
['body_mass_g']
['sex']
['month']
['day']
['year']

We’ll keep following along to wrangle it into a form we can use:

with open("data/penguins.csv") as penguin_reader:
  lines = list(csv.reader(penguin_reader))
  
header, values = lines[0], lines[1:]

print(header)
print(values[5])
['species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'month', 'day', 'year']
['Adelie', 'Torgersen', '39.3', '20.6', '190', '3650', 'male', '3', '3', '2007']

Now we have two lists: header and values. To build a dictionary of data columns, we combine a dictionary comprehension with the expression zip(*values), which transposes the rows into columns. This is generally faster than iterating through a loop, though Wes warns that it can use a lot of memory on large files.

penguin_dict = {h: v for h, v in zip(header, zip(*values))}

# too big to print in full; a way to preview a few key-value pairs is sketched below
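One way to preview a few pairs is to slice the dictionary's keys (a small sketch; at this point each value is a tuple of strings):

for key in list(penguin_dict)[:2]:
    print(key, penguin_dict[key][:5])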
Recall

For a reminder on dictionary comprehensions, see Chapter 3.

Now, to finally get this into a usable DataFrame, we'll use the pandas DataFrame.from_dict method!

penguin_df = pd.DataFrame.from_dict(penguin_dict)
penguin_df.head(5)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex month day year
0 Adelie Torgersen 39.1 18.7 181 3750 male 4 10 2007
1 Adelie Torgersen 39.5 17.4 186 3800 female 3 6 2007
2 Adelie Torgersen 40.3 18 195 3250 female 7 22 2007
3 Adelie Torgersen NA NA NA NA NA 2 13 2007
4 Adelie Torgersen 36.7 19.3 193 3450 female 8 21 2007

csv.Dialect

Since there are many kinds of delimited files, string quoting conventions, and line terminators, you may find yourself wanting to define a “Dialect” to read in your delimited file. The options available are found in Table 6.3.

You can either define a csv.Dialect subclass or pass dialect parameters to csv.reader.

# option 1

## define a dialect subclass

class my_dialect(csv.Dialect):
    lineterminator = "\n"
    delimiter = ";"
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

## use the subclass on an open file object (csv.reader expects lines, not a DataFrame)

with open("data/penguins.csv") as f:
    reader = csv.reader(f, dialect = my_dialect)

# option 2

## pass dialect parameters directly

with open("data/penguins.csv") as f:
    reader = csv.reader(f, delimiter = ",")
Recap: when to use what?

For most data, pandas read_* functions, plus the overwhelming number of options, will likely get you close to what you need.

If there are additional, minor wrangling needs, you can try using Python’s csv.reader with either a csv.Dialect subclass or just by passing in dialect parameters.

If you have complicated or multicharacter delimiters, you'll likely need to fall back on the string split method or the regular expression function re.split (see the sketch below).
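For example, a quick sketch with a made-up two-character delimiter (the line content here is hypothetical):

import re

line = "Adelie||Torgersen||39.1"   # hypothetical "||"-delimited record

line.split("||")         # ['Adelie', 'Torgersen', '39.1']
re.split(r"\|\|", line)  # same result with a regular expression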

Writing

csv.writer is the companion to csv.reader, with the same dialect and format options. The first argument to open is the path of the file you want to write to, and the second argument "w" opens the file for writing.

Note

Python documentation notes that newline="" should be specified in case there are newlines embedded inside quoted fields to ensure they are interpreted correctly.

with open("data/write_data.csv", "w", newline = "") as f:
    writer = csv.writer(f, dialect = my_dialect)
    writer.writerow(("one", "two", "three"))
    writer.writerow(("1", "2", "3"))
    writer.writerow(("4", "5", "6"))
    writer.writerow(("7", "8", "9"))

JavaScript Object Notation (JSON) Data

JSON is a standard format for sending data by HTTP request between web browsers, applications, and APIs. It's almost valid Python code:

  • Instead of NaN, it uses null

  • Doesn’t allow trailing commas at end of lists

  • Data types: objects (dictionaries), arrays (lists), strings, numbers, booleans, and nulls.

We’ll make up a simple file of my pets’ names, types, and sex to demonstrate JSON data loading and writing.

Import the json module and use json.loads to convert a JSON string to Python. There are multiple ways to convert JSON objects to a DataFrame.

import json

obj = """
{"name": "Jadey",
 "pets": [{"name": "Mai", "type": "cat", "sex": "Female"},
          {"name": "Tai", "type": "cat", "sex": "Male"},
          {"name": "Skye", "type": "cat", "sex": "Female"}]
}
"""

json_to_py = json.loads(obj)

print(json_to_py)
type(json_to_py)
{'name': 'Jadey', 'pets': [{'name': 'Mai', 'type': 'cat', 'sex': 'Female'}, {'name': 'Tai', 'type': 'cat', 'sex': 'Male'}, {'name': 'Skye', 'type': 'cat', 'sex': 'Female'}]}
dict

Since this imported the object as a dictionary, we can use pd.DataFrame to create a DataFrame of the pets’ names, type, and sex.

pets_df = pd.DataFrame(json_to_py["pets"], columns = ["name", "type", "sex"])

print(type(pets_df))
pets_df
<class 'pandas.core.frame.DataFrame'>
name type sex
0 Mai cat Female
1 Tai cat Male
2 Skye cat Female
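Another way is pd.json_normalize, which flattens nested records into rows (a quick sketch using the same object; record_prefix is there to avoid a collision between the owner's "name" and each pet's "name"):

pd.json_normalize(
    json_to_py,
    record_path = "pets",    # expand each pet record into its own row
    meta = "name",           # repeat the owner-level "name" on every row
    record_prefix = "pet_"   # prefix the pet columns to avoid the "name" collision
)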

Use json.dumps to convert from Python (class: dictionary) back to JSON (class: string).

py_to_json = json.dumps(json_to_py)

print("json_to_py type:", type(json_to_py))
print("py_to_json type:", type(py_to_json))
py_to_json
json_to_py type: <class 'dict'>
py_to_json type: <class 'str'>
'{"name": "Jadey", "pets": [{"name": "Mai", "type": "cat", "sex": "Female"}, {"name": "Tai", "type": "cat", "sex": "Male"}, {"name": "Skye", "type": "cat", "sex": "Female"}]}'

We can use the pandas pd.read_json function and the DataFrame to_json method to read and write JSON files.

pets_df.to_json("data/pets.json")

We can easily import a JSON file using pandas.read_json.

pet_data = pd.read_json("data/pets.json")
pet_data
name type sex
0 Mai cat Female
1 Tai cat Male
2 Skye cat Female

Web Scraping

HTML

pd.read_html uses third-party libraries to parse tables out of HTML and XML:

  • Try: lxml [faster]

  • Catch: beautifulsoup4 and html5lib [better equipped for malformed files]

If you want to specify which parsing engine is used, you can use the flavor argument.

tables = pd.read_html(
  "https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/", 
  flavor = "html5lib"
  )

print("Table Length:", len(tables))

# since this outputs a list of tables, we can grab just the first table

tables[0].head(5)
Table Length: 1
Bank Name City State Cert Acquiring Institution Closing Date Fund
0 Almena State Bank Almena KS 15426 Equity Bank October 23, 2020 10538
1 First City Bank of Florida Fort Walton Beach FL 16748 United Fidelity Bank, fsb October 16, 2020 10537
2 The First State Bank Barboursville WV 14361 MVB Bank, Inc. April 3, 2020 10536
3 Ericson State Bank Ericson NE 18265 Farmers and Merchants Bank February 14, 2020 10535
4 City National Bank of New Jersey Newark NJ 21111 Industrial Bank November 1, 2019 10534

XML

XML format is more general than HTML, but they are structurally similar. See pandas documentation for pd.read_xml.

This snippet of an xml file is from Microsoft.

<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>

books = pd.read_xml("data/books.xml")

books.head(5)
id author title genre price publish_date description
0 bk101 Gambardella, Matthew XML Developer's Guide Computer 44.95 2000-10-01 An in-depth look at creating applications \n ...
1 bk102 Ralls, Kim Midnight Rain Fantasy 5.95 2000-12-16 A former architect battles corporate zombies, ...
2 bk103 Corets, Eva Maeve Ascendant Fantasy 5.95 2000-11-17 After the collapse of a nanotechnology \n ...
3 bk104 Corets, Eva Oberon's Legacy Fantasy 5.95 2001-03-10 In post-apocalypse England, the mysterious \n ...
4 bk105 Corets, Eva The Sundered Grail Fantasy 5.95 2001-09-10 The two daughters of Maeve, half-sisters, \n ...

If you’d like to manually parse a file, Wes demonstrates the process in the textbook before showing how the following steps collapse into the single pd.read_xml call above (a sketch follows the list):

  1. Import objectify from lxml (from lxml import objectify).
  2. Parse the file with lxml.objectify and get its root node.
  3. Build a list of dictionaries of tag names to data values.
  4. Convert that list of dictionaries into a DataFrame.
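A minimal sketch of those steps against the books.xml snippet above (the variable names are mine):

from lxml import objectify

# steps 1-2: parse the file and get the root <catalog> node
with open("data/books.xml", "rb") as f:
    root = objectify.parse(f).getroot()

# step 3: build a list of dicts mapping tag names to data values
rows = []
for book in root.book:
    row = {child.tag: child.pyval for child in book.iterchildren()}
    row["id"] = book.get("id")  # the book id is an attribute, not a child tag
    rows.append(row)

# step 4: convert the list of dicts into a DataFrame
pd.DataFrame(rows).head(5)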

Binary Data Formats

Pickle

Python has a built-in pickle module that converts pandas objects into the pickle format (serializes the data into a byte stream), which is generally readable only in Python.

More information can be found in Python documentation.

Here’s a demo to show pickling and unpickling the penguins dataset.

print("Unpickled penguins type:", type(penguins))

penguins.to_pickle("data/penguins_pickle")

# do some machine learning

pickled_penguins = pd.read_pickle("data/penguins_pickle")
pickled_penguins
Unpickled penguins type: <class 'pandas.core.frame.DataFrame'>
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex month day year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 4 10 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 3 6 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 7 22 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2 13 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 8 21 2007
... ... ... ... ... ... ... ... ... ... ...
339 Chinstrap Dream 55.8 19.8 207.0 4000.0 male 6 4 2009
340 Chinstrap Dream 43.5 18.1 202.0 3400.0 female 6 8 2009
341 Chinstrap Dream 49.6 18.2 193.0 3775.0 male 4 8 2009
342 Chinstrap Dream 50.8 19.0 210.0 4100.0 male 8 24 2009
343 Chinstrap Dream 50.2 18.7 198.0 3775.0 female 2 11 2009

344 rows × 10 columns

Warning

pickle is recommended only as a short-term storage format (e.g., loading and unloading your machine learning models) because the format may not be stable over time. Also, the module is not secure: pickle data can be maliciously tampered with. The Python docs recommend signing data with hmac to ensure it hasn't been tampered with.
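A rough sketch of that hmac recommendation (the secret key and workflow here are made up for illustration):

import hashlib
import hmac
import pickle

secret_key = b"replace-with-a-real-secret"  # hypothetical shared secret

# sign the pickled bytes before storing or sending them
pickled = pickle.dumps(pickled_penguins)
signature = hmac.new(secret_key, pickled, hashlib.sha256).hexdigest()

# later: recompute the signature and compare before unpickling
expected = hmac.new(secret_key, pickled, hashlib.sha256).hexdigest()
if hmac.compare_digest(signature, expected):
    trusted = pickle.loads(pickled)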

Microsoft Excel Files

The pd.ExcelFile class and pd.read_excel function use the packages xlrd (for older .xls files) and openpyxl (for newer .xlsx files), which must be installed separately from pandas.

conda install xlrd openpyxl

pd.read_excel takes most of the same arguments as pd.read_csv.

penguins_excel = pd.read_excel(
  "data/penguins.xlsx", 
  index_col = "species",
  parse_dates = {"date": ["month", "day", "year"]}
)

penguins_excel.head(5)
date island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
species
Adelie 2007-04-10 Torgersen 39.1 18.7 181.0 3750.0 male
Adelie 2007-03-06 Torgersen 39.5 17.4 186.0 3800.0 female
Adelie 2007-07-22 Torgersen 40.3 18.0 195.0 3250.0 female
Adelie 2007-02-13 Torgersen NaN NaN NaN NaN NaN
Adelie 2007-08-21 Torgersen 36.7 19.3 193.0 3450.0 female

To read multiple sheets, use pd.ExcelFile.

penguins_sheets = pd.ExcelFile("data/penguins_sheets.xlsx")

print("Available sheet names:", penguins_sheets.sheet_names)

penguins_sheets
Available sheet names: ['chinstrap', 'gentoo', 'adelie']
<pandas.io.excel._base.ExcelFile at 0x26705acdd90>

Then we can parse all sheets into a dictionary by specifying the sheet_name argument as None. Or, we can read in a subset of sheets.

sheets = penguins_sheets.parse(sheet_name = None)

sheets
{'chinstrap':       species island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
 0   Chinstrap  Dream            46.5           17.9                192   
 1   Chinstrap  Dream            50.0           19.5                196   
 2   Chinstrap  Dream            51.3           19.2                193   
 3   Chinstrap  Dream            45.4           18.7                188   
 4   Chinstrap  Dream            52.7           19.8                197   
 ..        ...    ...             ...            ...                ...   
 63  Chinstrap  Dream            55.8           19.8                207   
 64  Chinstrap  Dream            43.5           18.1                202   
 65  Chinstrap  Dream            49.6           18.2                193   
 66  Chinstrap  Dream            50.8           19.0                210   
 67  Chinstrap  Dream            50.2           18.7                198   
 
     body_mass_g     sex  month  day  year  
 0          3500  female      7    4  2007  
 1          3900    male      9    6  2007  
 2          3650    male      4   15  2007  
 3          3525  female      6   10  2007  
 4          3725    male      8   19  2007  
 ..          ...     ...    ...  ...   ...  
 63         4000    male      6    4  2009  
 64         3400  female      6    8  2009  
 65         3775    male      4    8  2009  
 66         4100    male      8   24  2009  
 67         3775  female      2   11  2009  
 
 [68 rows x 10 columns],
 'gentoo':     species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
 0    Gentoo  Biscoe            46.1           13.2              211.0   
 1    Gentoo  Biscoe            50.0           16.3              230.0   
 2    Gentoo  Biscoe            48.7           14.1              210.0   
 3    Gentoo  Biscoe            50.0           15.2              218.0   
 4    Gentoo  Biscoe            47.6           14.5              215.0   
 ..      ...     ...             ...            ...                ...   
 119  Gentoo  Biscoe             NaN            NaN                NaN   
 120  Gentoo  Biscoe            46.8           14.3              215.0   
 121  Gentoo  Biscoe            50.4           15.7              222.0   
 122  Gentoo  Biscoe            45.2           14.8              212.0   
 123  Gentoo  Biscoe            49.9           16.1              213.0   
 
      body_mass_g     sex  month  day  year  
 0         4500.0  female      3    8  2007  
 1         5700.0    male      2    4  2007  
 2         4450.0  female      7    1  2007  
 3         5700.0    male      9   15  2007  
 4         5400.0    male     11   19  2007  
 ..           ...     ...    ...  ...   ...  
 119          NaN     NaN     12   11  2009  
 120       4850.0  female      7   20  2009  
 121       5750.0    male      9   18  2009  
 122       5200.0  female     12   11  2009  
 123       5400.0    male      6   15  2009  
 
 [124 rows x 10 columns],
 'adelie':     species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
 0    Adelie  Torgersen            39.1           18.7              181.0   
 1    Adelie  Torgersen            39.5           17.4              186.0   
 2    Adelie  Torgersen            40.3           18.0              195.0   
 3    Adelie  Torgersen             NaN            NaN                NaN   
 4    Adelie  Torgersen            36.7           19.3              193.0   
 ..      ...        ...             ...            ...                ...   
 147  Adelie      Dream            36.6           18.4              184.0   
 148  Adelie      Dream            36.0           17.8              195.0   
 149  Adelie      Dream            37.8           18.1              193.0   
 150  Adelie      Dream            36.0           17.1              187.0   
 151  Adelie      Dream            41.5           18.5              201.0   
 
      body_mass_g     sex  month  day  year  
 0         3750.0    male      4   10  2007  
 1         3800.0  female      3    6  2007  
 2         3250.0  female      7   22  2007  
 3            NaN     NaN      2   13  2007  
 4         3450.0  female      8   21  2007  
 ..           ...     ...    ...  ...   ...  
 147       3475.0  female     11    4  2009  
 148       3450.0  female      5   21  2009  
 149       3750.0    male      8   15  2009  
 150       3700.0  female      1   16  2009  
 151       4000.0    male      5    8  2009  
 
 [152 rows x 10 columns]}

Then we can subset one of the sheets as a pandas DataFrame object.

chinstrap = sheets["chinstrap"].head(5)
chinstrap
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex month day year
0 Chinstrap Dream 46.5 17.9 192 3500 female 7 4 2007
1 Chinstrap Dream 50.0 19.5 196 3900 male 9 6 2007
2 Chinstrap Dream 51.3 19.2 193 3650 male 4 15 2007
3 Chinstrap Dream 45.4 18.7 188 3525 female 6 10 2007
4 Chinstrap Dream 52.7 19.8 197 3725 male 8 19 2007

Write one sheet to a file using to_excel:

chinstrap.to_excel("data/chinstrap.xlsx")

If you want to write to multiple sheets, create an ExcelWriter and write each DataFrame to it. Using it as a context manager saves and closes the file on exit (the standalone writer.save() call was removed in newer pandas versions):

gentoo = sheets["gentoo"].head(5)

with pd.ExcelWriter("data/chinstrap_gentoo.xlsx") as writer:
    chinstrap.to_excel(writer, sheet_name = "chinstrap")
    gentoo.to_excel(writer, sheet_name = "gentoo")

HDF5 Format

Hierarchical data format (HDF) is used in Python, C, Java, Julia, MATLAB, and others for storing big scientific array data (multiple datasets and metadata within one file). HDF5 can be used to efficiently read/write chunks of large arrays.

The PyTables package must first be installed.

conda install pytables

pip install tables # the package is called "tables" in PyPI

pandas provides a dictionary-like class for HDF5 files called HDFStore:

store = pd.HDFStore("data/pets.h5")

store["pets"] = pets_df
store["pets"]
name type sex
0 Mai cat Female
1 Tai cat Male
2 Skye cat Female

HDFStore can store data with a "fixed" or a "table" schema. The table format is generally slower, but it supports queries with the where argument:

pets_df.to_hdf("data/petnames.h5", "pets", format = "table")
pd.read_hdf("data/petnames.h5", "pets", where=["columns = name"])
name
0 Mai
1 Tai
2 Skye
When should I use HDF5?

Wes recommends using HDF5 for write-once, read-many datasets that are worked with locally. If your data is stored on remote servers, then you may try other binary formats designed for distributed storage (for example, Apache Parquet).
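For example, pandas can round-trip Parquet via to_parquet and read_parquet (a quick sketch, assuming a Parquet engine such as pyarrow is installed; the file path is mine):

pets_df.to_parquet("data/pets.parquet")   # requires pyarrow or fastparquet

pd.read_parquet("data/pets.parquet")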

Interacting with Web APIs

To access data from APIs, Wes suggests using the requests package.

conda install requests

Let’s pull from this free zoo animal API.

import requests

url = "https://zoo-animal-api.herokuapp.com/animals/rand"

resp = requests.get(url)

resp.raise_for_status()

print("HTTP status", resp)

animal = resp.json()
animal

animal_df = pd.DataFrame([animal]) # important to wrap the dictionary object into a list
animal_df
HTTP status <Response [200]>
name latin_name animal_type active_time length_min length_max weight_min weight_max lifespan habitat diet geo_range image_link id
0 Brazilian Porcupine Coendou prehensilis Mammal Nocturnal 1.5 1.7 9 11 17 Tropical rainforest Leaves, bark, fruits, shoots and small animals Northern and eastern South America https://upload.wikimedia.org/wikipedia/commons... 45
Note

It is important to note that the dictionary is wrapped into a list. If it isn’t, then you will get the following error: ValueError: If using all scalar values, you must pass an index.

Interacting with Databases

Some popular SQL-based relational databases are: SQL Server, PostgreSQL, MySQL, SQLite3. We can use pandas to load the results of a SQL query into a DataFrame.

Import sqlite3 and create a database.

import sqlite3

con = sqlite3.connect("data/data.sqlite")

This creates a table.

query = """
  CREATE TABLE states
  (Capital VARCHAR(20), State VARCHAR(20),
  x1 REAL, x2 INTEGER
);"""

con.execute(query)

con.commit()

This inserts the rows of data:

data = [("Atlanta", "Georgia", 1.25, 6), ("Seattle", "Washington", 2.6, 3), ("Sacramento", "California", 1.7, 5)]
        
stmt = "INSERT INTO states VALUES(?, ?, ?, ?)"

con.executemany(stmt, data)

con.commit()

Now we can look at the data (the INSERT statements above were executed several times while drafting these notes, so the three rows repeat):

cursor = con.execute("SELECT * FROM states")

rows = cursor.fetchall()

rows
[('Atlanta', 'Georgia', 1.25, 6),
 ('Seattle', 'Washington', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5),
 ('Atlanta', 'Georgia', 1.25, 6),
 ('Seattle', 'Washington', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5),
 ('Atlanta', 'Georgia', 1.25, 6),
 ('Seattle', 'Washington', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5),
 ('Atlanta', 'Georgia', 1.25, 6),
 ('Seattle', 'Washington', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5),
 ('Atlanta', 'Georgia', 1.25, 6),
 ('Seattle', 'Washington', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5),
 ('Atlanta', 'Georgia', 1.25, 6),
 ('Seattle', 'Washington', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5),
 ('Atlanta', 'Georgia', 1.25, 6),
 ('Seattle', 'Washington', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]

To get the data into a pandas DataFrame, we also need the column names, which are found in the cursor's description attribute.

print(cursor.description)

pd.DataFrame(rows, columns = [x[0] for x in cursor.description])
(('Capital', None, None, None, None, None, None), ('State', None, None, None, None, None, None), ('x1', None, None, None, None, None, None), ('x2', None, None, None, None, None, None))
Capital State x1 x2
0 Atlanta Georgia 1.25 6
1 Seattle Washington 2.60 3
2 Sacramento California 1.70 5
3 Atlanta Georgia 1.25 6
4 Seattle Washington 2.60 3
5 Sacramento California 1.70 5
6 Atlanta Georgia 1.25 6
7 Seattle Washington 2.60 3
8 Sacramento California 1.70 5
9 Atlanta Georgia 1.25 6
10 Seattle Washington 2.60 3
11 Sacramento California 1.70 5
12 Atlanta Georgia 1.25 6
13 Seattle Washington 2.60 3
14 Sacramento California 1.70 5
15 Atlanta Georgia 1.25 6
16 Seattle Washington 2.60 3
17 Sacramento California 1.70 5
18 Atlanta Georgia 1.25 6
19 Seattle Washington 2.60 3
20 Sacramento California 1.70 5

As per usual, Wes likes to show us the manual way first and then the easier version. Using SQLAlchemy, we can much less verbosely create our DataFrame.

import sqlalchemy as sqla

db = sqla.create_engine("sqlite:///data/data.sqlite")

pd.read_sql("SELECT * FROM states", db)
Capital State x1 x2
0 Atlanta Georgia 1.25 6
1 Seattle Washington 2.60 3
2 Sacramento California 1.70 5
3 Atlanta Georgia 1.25 6
4 Seattle Washington 2.60 3
5 Sacramento California 1.70 5
6 Atlanta Georgia 1.25 6
7 Seattle Washington 2.60 3
8 Sacramento California 1.70 5
9 Atlanta Georgia 1.25 6
10 Seattle Washington 2.60 3
11 Sacramento California 1.70 5
12 Atlanta Georgia 1.25 6
13 Seattle Washington 2.60 3
14 Sacramento California 1.70 5
15 Atlanta Georgia 1.25 6
16 Seattle Washington 2.60 3
17 Sacramento California 1.70 5
18 Atlanta Georgia 1.25 6
19 Seattle Washington 2.60 3
20 Sacramento California 1.70 5