Python Tutorial Part 2: Libraries
While Part 1 of this compact Python tutorial introduced
basic syntax and commands, Part 2 explains how to install, import and use libraries and gives a brief overview of some popular libraries.
Python has a built-in standard library,
which is available immediately after installation and includes functions for mathematics, system and file access, encryption and other
frequently needed tasks.
Additionally, Python has many specialized libraries that provide functionality
for data analysis, machine learning, cryptography and other applications.
An up-to-date list of Python libraries and software is maintained in the
Python Package Index (PyPI).
Motivation
The ease of use of free data science libraries such as NumPy, Matplotlib, Pandas, Scikit-Learn, Keras and many others makes Python an attractive choice for beginners and professionals alike. Python libraries not only come with extensive documentation and many available examples; Python also offers two powerful package management systems, pip and conda, that make it easy to create working environments for different tasks.
Overview
This Python tutorial is available online on Google Colab.
1 Install Python libraries
Python libraries that are not already included after installation must be installed using the package management tools pip or conda. Both are command-line tools operated via the terminal. While pip is mainly for Python package management, conda supports other languages such as the statistical programming language R and, more importantly, supports environments. The syntax of the commands is similar, e.g. pip list (install, update, uninstall ...) or conda list (install, update, remove ...).
1-1 Package manager pip
Pip is Python's default package management system. The general syntax of a pip command is pip command [options] package. The command is one of: list, show, install, uninstall. The options can be specified in a long form, such as pip --help, or in a short form, such as pip -h.
Manage packages using pip
This example shows how to use pip to list packages, show information for a specific package, install / update / uninstall a package.
pip --help
pip list
pip show numpy pandas
pip install pandas
pip install numpy --upgrade
pip uninstall numpy
Larger Python projects may require the installation of multiple Python packages in matching versions. To simplify this process, the required dependencies for a project are written into a requirements.txt file, which is passed to the pip install command using the -r option. The -r option installs all packages listed in the requirements.txt file.
Example: Install libraries using requirements.txt
In this example, all libraries required by a Reinforcement Learning project are specified
with exact versions in the file requirements.txt and installed into a dedicated conda environment using pip.
Step 1: Create requirements.txt
This file should be placed in your project folder.
# requirements.txt
tensorflow==2.12.0
keras-rl2==1.0.5
gym==0.25.2
Step 2: Install using requirements.txt
The next commands are executed in the terminal.
It is important that you first change the directory to the path where the requirements.txt sits.
Then, create a new conda environment, activate it, and install the libraries using pip.
All the specified packages and dependencies will be installed in this environment.
cd C:\users\yourname\rl
conda create --name env-rl
conda activate env-rl
pip install -r requirements.txt
1-2 Package manager conda
Conda is a package management system that can be used for
installing and updating packages and also for managing application environments.
The general syntax of a conda command is conda command [options] package.
Example: Manage packages using conda
This example shows how to use conda to list packages and install / update / uninstall a package.
conda list
conda install pandas # install into current environment
conda install --name myenv pandas # install into specified environment
conda update pandas
conda remove pandas
Example: Manage environments using conda
Conda environments are directories that store specific versions of Python + libraries + dependencies as needed by different projects. Environments can be activated and deactivated, so that you can have setups for different projects in parallel and switch between them. In this example we create an environment called ML, activate it, install a special set of program packages into it, then deactivate it.
conda create --name ML
conda activate ML
conda install pandas tensorflow keras
conda deactivate
2 Import Python libraries
Python libraries and functions are imported using the import command, followed by the name of the library to be imported, for example import numpy. After importing a library, all of its classes and functions can be used by prepending the name of the library to the class or function name, as in numpy.array or numpy.linspace.
import numpy
x = numpy.array([0, 0.25, 0.5, 0.75, 1])
print(x)
x = numpy.linspace(0, 1, 5) # [0. 0.25 0.5 0.75 1. ]
print(x)
2-1 Import package using alias name
A best practice for imports is to use an abbreviated alias name, e.g. np for numpy, or plt for matplotlib. In the following example, the packages NumPy and Matplotlib are each imported with an alias name, which must then be prepended to the function calls, e.g. np.linspace(), or plt.plot().
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0,10,40) # x-values
y = np.sin(x); # y-values
plt.plot(x, y,'r*', label='sin');
2-2 Import selected functions from package
With the import statement you import entire packages, e.g. import numpy
or selectively classes / functions / constants from packages, e.g. from numpy import linspace, sin.
In this example we import only the needed functions: linspace, sin and plot.
from numpy import linspace, sin
from matplotlib.pyplot import plot
x = linspace(0, 10, 40)
y = sin(x)
plot(x, y,'r*', label='sin');
2-3 Best practices
The best practice for importing packages is to keep it as explicit and unambiguous as possible. Usually, this means to import the entire package using a short alias name, such as np for numpy and plt for matplotlib, and prepending this alias to every class and function call. This ensures that the code is transparent, making it clear which function belongs to which package, although the function calls get a bit longer.
When working with specific classes in a deeper hierarchy, it may be more convenient to import that single module. For example, assume you use the classes and functions of the module tree from the Scikit-Learn library. Then you would import the module tree, and subsequently access its classes and functions by prepending "tree" to all the calls.
from sklearn import tree
model = tree.DecisionTreeClassifier(criterion='entropy', splitter='best')
model.fit(X_train, y_train)
tree.plot_tree(model)
3 Python libraries for data analysis
Frequently used Python libraries for data analysis are:
- NumPy - provides support for creating and manipulating multidimensional arrays and random numbers
- Matplotlib - provides functions for data visualization
- Pandas - provides special data structures and methods for processing tabular data, e.g. Series, DataFrames
- Scikit-Learn - provides algorithms for Machine Learning, Classification, Regression
The following sections give a short description for each package and illustrate its basic usage with a representative code fragment.
3-1 NumPy
NumPy is a Python library for data management, providing support for array creation and manipulation. NumPy arrays have a fixed size, contain elements of the same data type and efficiently support elementwise operations. With NumPy, you can create arrays, initialize and extract data, perform elementwise operations on arrays, sort, search, count elements and calculate array statistics. NumPy also provides mathematical constants and functions (pi, sin, cos ...).
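A minimal sketch of the sorting, counting and statistics functions mentioned above (the array values are made up for illustration):

```python
import numpy as np

x = np.array([4, 1, 7, 2, 7])
print(np.sort(x))               # Sorted copy: [1 2 4 7 7]
print(np.where(x == 7))         # Indices where x equals 7
print(np.count_nonzero(x > 2))  # Count elements greater than 2: 3
print(x.mean(), x.max())        # Statistics: 4.2 7
```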
Create and manipulate NumPy 1D-arrays
Functions: array, arange
import numpy as np
# One-dimensional NumPy arrays
x1 = np.array([1, 2, 3, 4])
x2 = np.arange(1, 8, 2)
sum = x1 + x2 # Elementwise sum
print('x1:', x1, '\nx2:', x2, '\nsum:', sum)
Create and manipulate NumPy 2D-arrays
Functions: array, eye
# Two-dimensional NumPy arrays
a1 = np.array([[1, 2], [3, 4]])
a2 = np.eye(2)
prod = a1 * a2 # Elementwise product
print('a1:\n', a1, '\na2:\n', a2, '\nprod:\n', prod)
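Note that the * operator above works elementwise; for the matrix product, NumPy provides the @ operator (or np.matmul). A short sketch of the difference:

```python
import numpy as np

a1 = np.array([[1, 2], [3, 4]])
a2 = np.eye(2)           # 2x2 identity matrix
print(a1 * a2)           # Elementwise product: [[1. 0.] [0. 4.]]
print(a1 @ a2)           # Matrix product: [[1. 2.] [3. 4.]]
```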
NumPy arrays vs. Python lists
What is the difference between NumPy arrays and Python's built-in lists, and which should be used
in which context?
➔ Python's built-in lists are of variable length, may contain elements of mixed data types and support
a reduced set of operations, as needed to store and retrieve data efficiently.
➔ NumPy arrays have fixed size, contain elements of the same data type,
and efficiently support element-wise operations and a variety of statistical functions.
They are therefore preferably used in data analysis tasks, where you need all these element-wise operations
and statistical functions.
Adding NumPy arrays vs. Python lists
The following example shows the difference in using the "+" operator on Python lists vs. NumPy arrays.
➔ If you add two Python lists, the result is a new list that contains the elements of both lists (concatenation).
➔ If you add two NumPy arrays, the result is a new array whose elements are formed by elementwise
addition. Element-wise addition can also be done with Python lists, but then a loop or a so-called list comprehension must be used.
import numpy as np
# Create 2 Python lists
list1 = [1, 2, 3, 4]
list2 = [5, 6, 7, 8]
# Convert them to NumPy arrays
arr1 = np.array(list1) # NumPy Array [1 2 3 4]
arr2 = np.array(list2) # NumPy Array [5 6 7 8]
# Adding Python lists means: appending
sum = list1 + list2
print(sum) # [1, 2, 3, 4, 5, 6, 7, 8]
# Adding NumPy arrays means: elementwise addition
sum = arr1 + arr2
print(sum) # [6 8 10 12]
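The element-wise addition of Python lists mentioned above can be written with a list comprehension and zip:

```python
list1 = [1, 2, 3, 4]
list2 = [5, 6, 7, 8]
# Elementwise addition of two Python lists via list comprehension
sum_list = [a + b for a, b in zip(list1, list2)]
print(sum_list)  # [6, 8, 10, 12]
```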
3-2 Matplotlib
Matplotlib is a Python library for data visualization that supports the creation of various types of charts via the pyplot package: line plots, scatter plots, histograms, bar charts, one- and two-dimensional plots, static or interactive. Frequently used commands are plot for two-dimensional line plots and plot_surface (from the mplot3d toolkit) for three-dimensional surface plots. The plot command receives as parameters the x and y coordinates of the data to be plotted, and optionally a string with formatting information. There are also many options for adding labels, titles, legends, etc.
Create a line plot with Matplotlib
Functions: figure, plot, title, grid, legend, xlabel, ylabel
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0,10,40) # x-values
y1 = np.sin(x); y2 = np.cos(x)  # y-values
fig = plt.figure(figsize=[6, 3])
plt.plot(x, y1,'r*', label='sin');
plt.plot(x, y2,'b+', label='cos');
plt.title('Sin and Cos Functions');
plt.grid(True)
plt.legend(loc="upper center")
plt.xlabel('x');plt.ylabel('y');
Create a histogram with Matplotlib
Functions: hist, show
A histogram is the graphical representation of the frequency distribution of a variable. The values of the variable are sorted by size and divided into a number of classes or bins that can, but do not have to, be of the same width.
import matplotlib.pyplot as plt
import numpy as np
# Create 100 random numbers
# with mean 50 and standard deviation 20
x = np.random.normal(50, 20, 100)
# Create histogram with 7 bins
plt.hist(x, 7, density=False, fill=False, hatch='//')
plt.show()
3-3 Pandas
Pandas is a Python library for creating and manipulating spreadsheet-like data. Pandas provides support for loading large data files (CSV, XLS) into memory and then cleansing and visualizing the data. Pandas indexers and functions such as iloc[], loc[] and resample() are used to select rows, columns and cells, to group and aggregate data, etc.
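A minimal sketch of row, column and cell selection with loc and iloc, using a made-up DataFrame (note that loc and iloc are indexers, used with square brackets):

```python
import pandas as pd

df = pd.DataFrame({'temp': [20.3, 35.6, 40.0], 'humidity': [60, 80, 55]})
print(df.iloc[0])         # first row, selected by position
print(df.loc[:, 'temp'])  # column 'temp', selected by label
print(df.iloc[1, 1])      # single cell: row 1, column 1 -> 80
```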
Pandas vs Python Lists and NumPy Arrays
Pandas is suitable for loading an entire, already existing dataset into a DataFrame.
Although it is possible to create an empty DataFrame with given column names and then grow
it by appending rows, this is not the intended use case and performs badly.
Pandas is meant to be used together with Python lists and NumPy arrays.
So first create Python lists for the columns that you need to store; here you can use append, and it is fast.
Then create a dictionary that maps each column name to its values, and finally create
a Pandas DataFrame from this dictionary, as in our example.
There are a number of functions that support conversion between Pandas DataFrame columns and Python lists,
such as tolist().
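The recommended workflow (lists, then dictionary, then DataFrame) and the conversion back with tolist() might look like this sketch (the column names and values are made up):

```python
import pandas as pd

# Collect values in plain Python lists first (append is fast here)
names = []
ages = []
for name, age in [('Max', 35), ('Anna', 24)]:
    names.append(name)
    ages.append(age)

# Build the DataFrame from a dictionary in one step
df = pd.DataFrame({'Name': names, 'Age': ages})
print(df)

# Convert a DataFrame column back to a Python list
print(df['Age'].tolist())  # [35, 24]
```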
Read from and write to files using Pandas
Functions: read_excel, to_csv
In this example we read the data stored in the Excel file students.xlsx, store them in a DataFrame df and then write them to a CSV file.
import pandas as pd
# Read data from students.xlsx
df = pd.read_excel('students.xlsx', index_col=0)
# Write data to a csv file
df.to_csv('students.csv', index=True, sep = ';')
df
Create and plot Pandas DataFrame
import pandas as pd
persons = {
"Surname": ["Miller", "Simpson", "Doe", "Smith", "Dunham"],
"Name": ["Max", "Anna", "John", "Jack", "Olive"],
"Age": [35, 24, 23, 22, 32],
"Income": [3000, 2000, 1800, 1800, 2500]
}
# Create data frame from dictionary
df = pd.DataFrame(persons)
print(df)
# Scatter-plot Income vs Age
df.plot.scatter(x = 'Age', y = 'Income', c='Red');
3-4 Scikit-Learn
Scikit-Learn is a
Python library for machine learning that provides
support for the usual steps of supervised and unsupervised learning:
data preparation, training and model evaluation, as well as powerful algorithms for classification,
regression and clustering problems. Scikit-Learn is used together with NumPy, Matplotlib and Pandas:
with Pandas we read data from files and convert them to NumPy arrays, with Matplotlib we plot them,
and as NumPy arrays we pass the data to Scikit-Learn functions.
Scikit-Learn is installed via pip using its full name:
pip install scikit-learn
Example: Failure classification using DecisionTreeClassifier
This example code shows how to train a decision tree model and subsequently use it
for the classification of failures.
Preparation: Set up a data set
The data set data2.csv contains 8 measurements of temperature and humidity, and a column "failure" that contains the label of each measurement. The first row contains the column headers. The first data row can be interpreted as follows: for temperature 20.3 degrees Celsius and humidity 60%, there has been no failure.
Data set data2.csv
id;temp;humidity;failure
1;20.3;60;no
2;35.6;80;yes
3;40;55;yes
4;25;50;no
5;17;60;no
6;15;75;yes
7;20.3;80;yes
8;35.6;60;yes
Step 1: Read data from file
The first step is to read the data from the file using Pandas' read_csv() function. This function takes a number of parameters that control the reading process; we specify here: file: the name of the file, header=0: the row with index 0 contains the column headers, index_col=0: the column with index 0 contains the row headers.
import pandas as pd
# 1.
file = "https://evamariakiss.de/tutorial/python/data2.csv"
df = pd.read_csv(file, header=0, sep = ";", index_col=0)
print('DataFrame:\n', df);
Step 2: Extract features and target variable
Next we extract the features and the target variable into NumPy arrays x and y. This must be done because the Scikit-Learn classifier, or more precisely its fit method, expects NumPy arrays as parameters.
# 2.
x = df.iloc[:,0:2].to_numpy() # Extract features
y = df[['failure']] # Extract target variable
y = y.values
Step 3: Create a train-test-split
In Step 3 we split the data set into training and validation data using Scikit-Learn's train_test_split. The parameter test_size specifies that 10% of the data set is used for testing / validation.
from sklearn import model_selection as ms
# 3.
X_train, X_test, y_train, y_test = \
ms.train_test_split(x, y, test_size=0.1, random_state=1)
Step 4: Train a decision tree
In Step 4 we create a new instance of the DecisionTreeClassifier, conveniently named "model". We train this model using the fit function and the training data that were created previously by the train_test_split function.
from sklearn import tree, model_selection as ms
# 4.
model = tree.DecisionTreeClassifier(criterion='entropy', splitter='best')
model.fit(X_train, y_train) # Train decision tree
Step 5: Visualize the decision tree
We visualize the decision tree using Scikit-Learn's visualization functions plot_tree or, alternatively, export_graphviz. Both functions create a decision tree visualization that specifies for each node the splitting criterion, the entropy, the number of samples in that node and how many samples are in each class. While plot_tree is simpler, export_graphviz produces a sharp graph in SVG format, suitable for re-use in websites and presentations.
Visualization using plot_tree
import matplotlib.pyplot as plt
from sklearn import tree
# 5. Visualize decision tree
fig, ax = plt.subplots(figsize=(5, 6))
tree.plot_tree(model, filled=True,
feature_names=df.columns[0:2],
class_names=['no','yes'])
plt.show()
Visualization using export_graphviz
import graphviz as gv
import IPython.display as disp
from sklearn.tree import export_graphviz
# 5. Visualize decision tree
dot = export_graphviz(model, out_file=None, filled=True,
                      feature_names=df.columns[0:2],
                      class_names=['no', 'yes'])
graph = gv.Source(dot)
disp.display(graph)
Steps 1 to 5: the complete code
Finally, we put together the code from steps 1 to 5 in one single script.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import tree, model_selection as ms
# 1.
file = "https://evamariakiss.de/tutorial/python/data2.csv"
df = pd.read_csv(file, header=0, sep = ";", index_col=0)
print('DataFrame:\n', df);
# 2.
x = df.iloc[:,0:2].to_numpy() # Extract features
y = df[['failure']] # Extract target variable
y = y.values
# 3.
X_train, X_test, y_train, y_test = \
ms.train_test_split(x, y, test_size=0.1, random_state=1)
# 4.
model = tree.DecisionTreeClassifier(criterion='entropy', splitter='best')
model.fit(X_train, y_train) # Train decision tree
# 5.
fig, ax = plt.subplots(figsize=(5, 6)) # Visualize decision tree
tree.plot_tree(model, filled=True, feature_names=df.columns[0:2], class_names=['no','yes'])
plt.show()
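Scikit-Learn also supports model evaluation, as mentioned above; the trained model can be scored on the held-out test data. A minimal self-contained sketch using accuracy_score and a small made-up data set (not the data2.csv set from the example):

```python
import numpy as np
from sklearn import tree
from sklearn.metrics import accuracy_score

# Made-up training data: 2 features (temp, humidity), binary labels
X_train = np.array([[20, 60], [22, 55], [36, 80], [40, 50]])
y_train = np.array(['no', 'no', 'yes', 'yes'])
# Made-up test data
X_test = np.array([[18, 70], [38, 65]])
y_test = np.array(['no', 'yes'])

# Train a decision tree and evaluate it on the test data
model = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))  # 1.0
```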
References and tools
- [1] Python Documentation at python.org: docs.python.org/3/tutorial/
- [2] Anaconda: anaconda.com/: package management system, also needed for installing Jupyter Notebook
- [3] pip Package Manager: pypi.org/project/pip/
- [4] Conda Cheatsheet: conda-cheatsheet.pdf
- [5] Visual Studio Code: code.visualstudio.com/
- [6] NumPy: numpy.org/ – Arrays, Random Number Creation
- [7] Matplotlib: matplotlib.org/ – Data Visualization, Plotting
- [8] Pandas: pandas.pydata.org/ – DataFrames, Series
- [9] Scikit-Learn: scikit-learn.org – Algorithms for Machine Learning