1  Notes

1.0.1 Why Python for Data Analysis (Data Science)?

Python is one of the most important languages for data science, machine learning, and general software development in both academia and industry.

1.0.2 Solving the “Two-Language” Problem

Python is suitable not only for research and prototyping but also for building production systems, so one language can serve both stages. (The Julia programming language targets the same problem.)

1.0.3 What Kinds of Data?

1.1 Essential Python Libraries

  • NumPy (Numerical Python)
  • pandas
  • Matplotlib
  • IPython and Jupyter
  • SciPy
  • scikit-learn
  • statsmodels

1.2 NumPy

NumPy, short for Numerical Python, has long been a cornerstone of numerical computing in Python. It provides the data structures, algorithms, and library glue needed for most scientific applications involving numerical data in Python.
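As a minimal sketch of what "data structures and algorithms" means in practice, the example below builds an ndarray and applies element-wise arithmetic, a reduction, and a boolean mask. The temperature values and variable names are invented for illustration.

```python
import numpy as np

# A small array of made-up temperatures in Celsius (illustrative values only)
celsius = np.array([12.5, 18.0, 21.5, 9.0])

# Arithmetic applies element-wise to the whole array, with no explicit loop
fahrenheit = celsius * 9 / 5 + 32

# Reductions summarize the array in a single value
mean_c = celsius.mean()

# A boolean mask selects the elements that satisfy a condition
warm = celsius[celsius > 15]

print(fahrenheit)
print(mean_c)
print(warm)
```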

1.3 Pandas

pandas provides high-level data structures and functions designed to make working with structured or tabular data intuitive and flexible.

  • It provides convenient indexing functionality that lets you reshape, slice and dice, aggregate, and select subsets of data. Data manipulation, preparation, and cleaning are essential skills in data analysis.

  • See R vs Pandas comparison
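The operations named above (selecting subsets, aggregating, reshaping) can be sketched on a tiny table. The sales data and column names are hypothetical, chosen only to keep the example short.

```python
import pandas as pd

# Hypothetical sales records, invented for this example
df = pd.DataFrame(
    {
        "city": ["Oslo", "Oslo", "Lima", "Lima"],
        "year": [2021, 2022, 2021, 2022],
        "sales": [10, 12, 7, 9],
    }
)

# Select a subset of rows with a boolean condition
recent = df[df["year"] == 2022]

# Aggregate: total sales per city
totals = df.groupby("city")["sales"].sum()

# Reshape: one row per city, one column per year
wide = df.pivot(index="city", columns="year", values="sales")

print(recent)
print(totals)
print(wide)
```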

1.4 Matplotlib

Matplotlib is the most popular Python library for producing plots and other two-dimensional data visualizations.

1.5 IPython and Jupyter

  • The IPython system can now be used as a kernel (a programming language mode) for using Python with Jupyter.

1.6 SciPy

SciPy is a collection of packages addressing a number of foundational problems in scientific computing.

1.7 Scikit-learn

scikit-learn is a general-purpose machine learning toolkit for Python programmers.

1.8 Statsmodels

statsmodels is a statistical analysis package.

  • Compared with scikit-learn, statsmodels contains algorithms for classical (primarily frequentist) statistics and econometrics.

1.9 Other Packages

  • TensorFlow or PyTorch or Keras

1.10 Installing Necessary Packages

We can install Python packages using pip or conda. Read more about pip vs conda.

The author recommends:

  • Miniconda, a minimal installation of the conda package manager, along with conda-forge, a community-maintained software distribution based on conda.

  • This book uses Python 3.10 throughout.

1.11 Miniconda

Conda is a packaging tool and installer that aims to do more than pip does: it handles library dependencies outside of the Python packages as well as the Python packages themselves. Conda also creates virtual environments, as virtualenv does.

1.12 Miniforge

  • miniforge is the community (conda-forge) driven minimalistic conda installer. Subsequent package installations thus come from the conda-forge channel.

  • miniconda is the Anaconda (company) driven minimalistic conda installer. Subsequent package installations come from the anaconda channels (default or otherwise).

  • miniforge started because miniconda did not support aarch64. The PyPy community quickly joined, and there are now miniforge versions for all Linux architectures as well as macOS.

  • AARCH64, sometimes also referred to as ARM64, is a CPU architecture developed by ARM Ltd., and a 64-bit extension of the pre-existing ARM architecture. ARM architectures are primarily known for their energy efficiency and low power consumption. For that reason, virtually all mobile phones and tablets today use ARM architecture-based CPUs.

  • Although AARCH64 and x64 (Intel, AMD, …) are both 64-bit CPU architectures, their internals are vastly different. Programs compiled for one platform will not run on the other (except through emulation or translation), and vice versa. Software therefore needs to be recompiled for each platform, and often requires extensive optimization as well.

The first step is to configure conda-forge as your default package channel by running the following commands in a shell:

! conda config --add channels conda-forge
! conda config --set channel_priority strict
Warning: 'conda-forge' already in 'channels' list, moving to the top

Now we create a new environment and install the essential packages used throughout the book (along with their dependencies) with conda install:

conda create -y -n pydata-book python=3.10 # create an environment with Python 3.10 installed
conda activate pydata-book # activate the environment
(pydata-book) $ conda install -y pandas jupyter matplotlib # install a

Install the complete set of packages used in the book:

conda install lxml beautifulsoup4 html5lib openpyxl \
    requests sqlalchemy seaborn scipy statsmodels \
    patsy scikit-learn pyarrow pytables numba

1.13 Should I use pip or conda?

While you can use both conda and pip to install packages, you should avoid updating packages originally installed with conda using pip (and vice versa), as doing so can lead to environment problems. I recommend sticking to conda if you can and falling back on pip only for packages which are unavailable with conda install.

conda install should always be preferred, but some packages are not available through conda so if conda install $package_name fails, try pip install $package_name.
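Before deciding whether a package still needs to be installed, it can help to check what the active environment can already import. The sketch below uses only the standard library; the helper name missing_modules and the list of module names are hypothetical. Note that it checks import names, which can differ from install names (scikit-learn is imported as sklearn).

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in module_names if find_spec(name) is None]


# Import names to check; adjust this list to the packages you need
wanted = ["numpy", "pandas", "matplotlib", "sklearn", "statsmodels"]

for name in missing_modules(wanted):
    print(f"{name} is not importable in this environment")
```

Anything the loop reports is a candidate for conda install first, and pip install only if conda cannot provide it.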

1.14 What can we do with Conda?

  • Conda provides commands to manage environments: create (conda create), activate (conda activate), delete (conda env remove), and list (conda env list).

  • Install tldr (https://github.com/tldr-pages/tldr): the tldr-pages project is a collection of community-maintained help pages for command-line tools that aims to be a simpler, more approachable complement to traditional man pages.

1.16 Import Conventions

The Python community has adopted a number of naming conventions for commonly used modules:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels as sm

This means that when you see np.arange, it refers to the arange function in NumPy. The alias is used because it is considered bad practice in Python software development to import everything (from numpy import *) from a large package like NumPy.
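One reason often given for keeping the np. prefix is that NumPy defines functions whose names match Python built-ins but behave differently, so a wildcard import may silently change what a bare name means. The sketch below compares np.sum with the built-in sum on a small array made up for the example.

```python
import numpy as np

grid = np.array([[1, 2], [3, 4]])

# NumPy's sum adds every element by default
total = np.sum(grid)

# The built-in sum iterates over the rows and adds them, giving column sums
column_sums = sum(grid)

print(total)        # 10
print(column_sums)  # [4 6]
```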

import numpy as np 

x = np.random.random((64, 3, 32, 10)) 
y = np.random.random((32, 10)) 

z = np.maximum(x, y)
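The snippet above relies on NumPy broadcasting: shapes are aligned from the right, so the (32, 10) array is compared against every (32, 10) slice of the larger array. A minimal sketch that repeats the same computation and inspects the resulting shape:

```python
import numpy as np

x = np.random.random((64, 3, 32, 10))
y = np.random.random((32, 10))

# Shape that NumPy will use when combining x and y element-wise
expected_shape = np.broadcast(x, y).shape

# Element-wise maximum; y is reused across the first two axes of x
z = np.maximum(x, y)

print(expected_shape)
print(z.shape)
```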