Sparse matrices, graphs and maps – HAX712X: Software development for data science

Memory efficiency: sparse matrices

Sparse matrices are useful to handle potentially huge matrices with structures. The most standard structure is sparsity, i.e., matrices that have only a few non-zero coefficients.

The most common formats are

COO_matrix(arg1[, shape, dtype, copy]): A sparse matrix in COOrdinate format.
csr_matrix(arg1[, shape, dtype, copy]): Compressed Sparse Row matrix (default format in scipy)
csc_matrix(arg1[, shape, dtype, copy]): Compressed Sparse Column matrix

Below are examples of common scenarios where sparse matrices are often used:

Natural Language Processing (NLP)

A matrix might encode the presence/absence in a text of a word from a dictionary (say the set of French words). The number 0 (resp. 1) represents the presence (resp. absence) of a word in a text.

One-hot encoding/Dummy variables

To handle categorical data encoding as strings, many algorithms require that you first transform such variables as numerical vectors. The simplest protocol is to create one column to represent each modality of the categorical variable and to set the value to 1 if the observation is of that modality, and 0 otherwise. This is called one-hot encoding or dummy encoding. By construction, this can be encoded into a sparse (binary) matrix.

An example with pandas and sklearn is given below (recall that 0 represents False and 1 represents True):

import os
import pandas as pd
import pooch  # download data / avoid re-downloading
from sklearn import datasets
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

url = "http://josephsalmon.eu/enseignement/datasets/titanic.csv"
path_target = "./titanic.csv"
path, fname = os.path.split(path_target)
pooch.retrieve(url, path=path, fname=fname, known_hash=None)

categorical_column = ['Embarked']
df_titanic_raw = pd.read_csv("titanic.csv", usecols=categorical_column)
print(df_titanic_raw.info())
print(df_titanic_raw.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Embarked  889 non-null    object
dtypes: object(1)
memory usage: 7.1+ KB
None
  Embarked
0        S
1        C
2        S
3        S
4        S

pandas approach with get_dummies:

dummies = pd.get_dummies(df_titanic_raw.Embarked, sparse=True)
dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype              
---  ------  --------------  -----              
 0   C       1 non-null      Sparse[bool, False]
 1   Q       1 non-null      Sparse[bool, False]
 2   S       1 non-null      Sparse[bool, False]
dtypes: Sparse[bool, False](3)
memory usage: 4.5 KB

sklearn approach with OneHotEncoder:

transformer=make_column_transformer((OneHotEncoder(sparse_output=True), ['Embarked']), remainder="passthrough")
matrix_embarked = transformer.fit_transform(df_titanic_raw)
print(matrix_embarked)
print(type(matrix_embarked))

  (0, 2)    1.0
  (1, 0)    1.0
  (2, 2)    1.0
  (3, 2)    1.0
  (4, 2)    1.0
  (5, 1)    1.0
  (6, 2)    1.0
  (7, 2)    1.0
  (8, 2)    1.0
  (9, 0)    1.0
  (10, 2)   1.0
  (11, 2)   1.0
  (12, 2)   1.0
  (13, 2)   1.0
  (14, 2)   1.0
  (15, 2)   1.0
  (16, 1)   1.0
  (17, 2)   1.0
  (18, 2)   1.0
  (19, 0)   1.0
  (20, 2)   1.0
  (21, 2)   1.0
  (22, 1)   1.0
  (23, 2)   1.0
  (24, 2)   1.0
  : :
  (866, 0)  1.0
  (867, 2)  1.0
  (868, 2)  1.0
  (869, 2)  1.0
  (870, 2)  1.0
  (871, 2)  1.0
  (872, 2)  1.0
  (873, 2)  1.0
  (874, 0)  1.0
  (875, 0)  1.0
  (876, 2)  1.0
  (877, 2)  1.0
  (878, 2)  1.0
  (879, 0)  1.0
  (880, 2)  1.0
  (881, 2)  1.0
  (882, 2)  1.0
  (883, 2)  1.0
  (884, 2)  1.0
  (885, 1)  1.0
  (886, 2)  1.0
  (887, 2)  1.0
  (888, 2)  1.0
  (889, 0)  1.0
  (890, 1)  1.0
<class 'scipy.sparse._csr.csr_matrix'>

Note

The differences between the two approaches are: - the encoding of absence/presence (True/False vs. 0/1) - the output created handles missing values differently by default

Discretization of a physical system

For instance, when very distant particles/objects have little influence, their value can be set to zero when representing physical quantities (e.g., heat diffusion, fluid mechanics, electro/magnetism, etc.)

Graphs

Graphs are naturally represented by adjacency or incidence matrices (cf. below), and therefore beyond the graphs, maps! We illustrate that in the last section of the course.

References:

Scipy Lectures: why sparse matrices
Sparse data structures in Python, by Artem Golubin
Introduction to Sparse Matrices in Python with SciPy, by cmdlinetips
Scipy doc on sparse matrices

Usage in `scipy`

from scipy import sparse
from scipy.sparse import isspmatrix

Id = sparse.eye(3)
print(Id.toarray())
print(f'Q: Is the matrix Id is sparse?\nA: {isspmatrix(Id)}')

n1 = 9
n2 = 9
mat_rnd = sparse.rand(n1, n2, density=0.25, format="csr",
                      random_state=42)
print(mat_rnd.toarray())
print(f'Q: Is the matrix mat_rnd sparse?\nA: {isspmatrix(mat_rnd)}')

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
Q: Is the matrix Id is sparse?
A: True
[[0.54269608 0.         0.07455064 0.         0.         0.11586906
  0.         0.         0.        ]
 [0.         0.77224477 0.         0.98688694 0.         0.
  0.33089802 0.         0.86310343]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.        ]
 [0.         0.81546143 0.         0.28093451 0.         0.
  0.         0.         0.        ]
 [0.00552212 0.         0.14092422 0.80219698 0.06355835 0.70685734
  0.         0.77127035 0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.35846573 0.        ]
 [0.         0.         0.         0.72900717 0.         0.
  0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.62329813 0.19871568 0.        ]
 [0.         0.         0.         0.07404465 0.         0.
  0.         0.         0.        ]]
Q: Is the matrix mat_rnd sparse?
A: True

A matrix-vector product can be performed as usual, with the @ operators:

import numpy as np
v = np.random.rand(n2)
mat_rnd@v

array([0.3804422 , 1.8111527 , 0.        , 0.57522109, 1.76339961,
       0.26322729, 0.33582318, 0.28643412, 0.03410928])

EXERCISE: Linear models & sparse matrices

Create a function that can fit ordinary least squares for sparse matrices (or not). In particular, handle the usual pre-processing step of standardizing the columns of the design matrix (i.e., centering columns and dividing by standard deviation)? Beware: often you cannot center a sparse matrix without making it non-sparse.

The choice of a sparse format depends on the nature and structure of the data. For instance: - csc_matrix is more efficient for slicing by column, - csr_matrix is more efficient for slicing by row.

Graphs and sparsity

A classical field for the application of sparse matrices is when using graphs: although the number of nodes can be huge, each node of a graph is in general not connected to all nodes. Let us represent a graph by its adjacency matrix:

Definition: adjacency matrix

Suppose that G=(V,E) is a graph, where \left|V\right|=n. Suppose that the vertices of G are arbitrarily numbered v_1,\ldots,v_n. The adjacency matrix A of G is the matrix n \times n of general term:

A_{{i,j}}= \left\{ \begin{array}{rl} 1, & \text{if } (v_i,v_j) \in E \\ 0, & \text{o.w.} \end{array} \right.

Note that instead of 1, the value could vary on a per-edge basis (cf. Figure 1).

For standard graph manipulation the package networkx is very useful.

import networkx as nx

It allows one to define a graph very easily:

G = nx.Graph()
G.add_edge('A', 'B', weight=4)
G.add_edge('A', 'C', weight=3)
G.add_edge('B', 'D', weight=2)
G.add_edge('C', 'D', weight=4)
G.add_edge('D', 'A', weight=2)

and then visualize it:

import matplotlib.pyplot as plt
my_seed = 44
nx.draw_networkx(
    G, with_labels=True,
    node_size=1000,
    pos=nx.spring_layout(G, seed=my_seed)
)

labels = nx.get_edge_attributes(G, "weight")
nx.draw_networkx_edges(
    G,
    pos=nx.spring_layout(G, seed=my_seed),
    width=[6 * i for i in list(labels.values())],
)
nx.draw_networkx_edge_labels(
    G, pos=nx.spring_layout(G, seed=my_seed),
    edge_labels=labels
)
plt.axis("off")
plt.show()

You can get (a variant of) the adjacency matrix using

A = nx.adjacency_matrix(G)
print(A.todense(), f"Q: Is A spase?\nA: {isspmatrix(A)}")

[[0 4 3 2]
 [4 0 0 2]
 [3 0 0 4]
 [2 2 4 0]] Q: Is A spase?
A: False

where the adjacency matrix is stored in a sparse format, and can also encode the weights of the edges.

With networkx, it is simple to get the shortest path between two nodes in a graph using the shortest_path function:

shortest_path = nx.shortest_path(G, 'C', 'B', weight='weight')
str_shortest_path = ' -> '.join(shortest_path)
print(f"The shortest path between C and B is:\n {str_shortest_path}")

The shortest path between C and B is:
 C -> D -> B

Definition : incidence matrix

Let G = (V,E) be a (non-oriented) graph with n vertices, V = [1,\dots,n], and p edges, E = [1,\dots,p]. The graph can be represented by its vertex-edge incidence matrix D^\top \in \mathbb{R}^{p \times n} defined by

(D^\top)_{{e,v}} = \left\{ \begin{array}{rl} + 1, & \text{if } v = \min(i,j) \\ -1, & \text{si } v = \max(i,j) \\ 0, & \text{sinon} \end{array} \right.

where e = (i,j).

Definition : Laplacian matrix

The matrix L=D D^\top is the so-called graph Laplacian of G

D = nx.incidence_matrix(G, oriented=True).T
L = nx.laplacian_matrix(G)
print(isspmatrix(D))
print(D.todense(), L.todense())

False
[[-1.  1.  0.  0.]
 [-1.  0.  1.  0.]
 [-1.  0.  0.  1.]
 [ 0. -1.  0.  1.]
 [ 0.  0. -1.  1.]] [[ 9 -4 -3 -2]
 [-4  6  0 -2]
 [-3  0  7 -4]
 [-2 -2 -4  8]]

More graph visualizations

g = nx.karate_club_graph()

# Return a list of tuples each tuple is (node, deg)
list_degree = list(g.degree())
# Build a node list and corresponding degree list
nodes, degree = map(list, zip(*list_degree))

fig, ax = plt.subplots(1, 1, figsize=(8, 6))
nx.draw(
    g,
    ax=ax,
    nodelist=nodes,
    node_size=[(v * 30) + 1 for v in degree],
    width=4,
    alpha=0.7,
    edgecolors="white",
    node_color="#1f78b4",
    edge_color="#1f78b4",
)
plt.axis("off")
plt.show()

Back to sparse matrices

A = nx.adjacency_matrix(g).T
print(A.todense())

fig, ax = plt.subplots()
ax = plt.spy(A)
print(f"Pourcentage of active edges: {(g.number_of_edges() / g.number_of_nodes()**2) * 100:.2f} %")

[[0 4 5 ... 2 0 0]
 [4 0 6 ... 0 0 0]
 [5 6 0 ... 0 2 0]
 ...
 [2 0 0 ... 0 4 4]
 [0 0 2 ... 4 0 5]
 [0 0 0 ... 4 5 0]]
Pourcentage of active edges: 6.75 %

Remark: a possible visualization with Javascript (not so stable though, can be skipped)

Planar graphs and maps

Open Street Map can interface with networkx using the osmnx package.

The `osmnx` package

Known bug for old versions: - Cannot import name ‘CRS’ from ‘pyproj’ in osmnx

TypeError: argument of type ‘CRS’ is not iterable” with osmnx

So pick version 0.14 at least conda install osmnx>=0.14 or pip install osmnx>=0.10.

For Windows users, there might be some trouble with installing the fiona package, see:

Special case for osmnx on Windows follow the next step in order:

pip install osmnx
pip install Rtree
conda install -c conda-forge libspatialindex=1.9.3
pip install osmnx
Install all packages required up to fiona.
conda install -c conda-forge geopandas
Say yes to everything
Once done, launch pip install osmnx==1.0.1

import osmnx as ox
ox.settings.use_cache=True
ox.__version__

'1.9.4'

G = ox.graph_from_place('Montpellier, France', network_type='bike')
print(f"nb edges: {G.number_of_edges()}")
print(f"nb nodes: {G.number_of_nodes()}")

nb edges: 33992
nb nodes: 15345

fig, ax = ox.plot_graph(G)

You also need to install the package folium for the visualization of the maps.

import folium

map_osm = folium.Map(location=[43.610769, 3.876716])

map_osm.add_child(folium.RegularPolygonMarker(location=[43.610769, 3.876716],
                  fill_color='#132b5e', radius=5))
map_osm

Make this Notebook Trusted to load map: File -> Trust Notebook

D = nx.incidence_matrix(G, oriented=True).T

element = np.zeros(1, dtype=float)
mem = np.prod(D.shape) * element.data.nbytes / (1024**2)
print('Size of full matrix with zeros: {0:3.2f}  MB'.format(mem))

print('Size of sparse matrix: {0:3.2f}  MB'.format(D.data.nbytes/(1024**2) ))

print('Ratio  of full matrix size / sparse: {0:3.2f}%'.format(100 * D.data.nbytes / (1024**2 * mem)))
print(isspmatrix(D))

Size of full matrix with zeros: 3979.55  MB
Size of sparse matrix: 0.51  MB
Ratio  of full matrix size / sparse: 0.01%
False

Alternatively: you can uncomment the following line, and check that the size of a similar matrix (with a non-sparse format) would be huge,

>>> Size of a full matrix encoding the zeros: 4 gB

Create a matrix of similar size. BEWARE: This creates a huge matrix:

M = np.random.randn(G.number_of_nodes(), G.number_of_nodes())
print('Size of a full encoding the zeros: {0:3.2f}  MB'.format(M.nbytes/(1024**2)))

Size of a full encoding the zeros: 1796.49  MB

Now you can check the memory usage of the sparse matrix representation associated:

print(" {0:.2} % of edges only are needed to represent the graph of Montpellier".format(100 * G.number_of_edges() / G.number_of_nodes() ** 2))

 0.014 % of edges only are needed to represent the graph of Montpellier

Hopefully, this is convincing enough to use sparse matrices when dealing with graphs and maps!

Visualize the shortest path between two points

References:

OpenStreetMap Roads Data Using osmnx

origin = ox.geocoder.geocode('Place Eugène Bataillon, Montpellier, France')
destination = ox.geocoder.geocode('Maison du Lez, Montpellier, France')

origin_node = ox.nearest_nodes(G, origin[1], origin[0])
destination_node = ox.nearest_nodes(G, destination[1], destination[0])

print(origin)
print(destination)
route = ox.routing.shortest_path(G, origin_node, destination_node)
route_back = ox.routing.shortest_path(G, destination_node, origin_node)

(43.6314565, 3.8607694)
(43.61032245, 3.8966295)

fig, ax = ox.plot_graph_routes(G, [route, route_back], route_linewidth=6, route_colors=['red', 'blue'], node_size=0)

Visualize the same routes, but on a map:

route_edges = ox.utils_graph.route_to_gdf(G, route)
route_back_edges = ox.utils_graph.route_to_gdf(G, route_back)

m = route_edges.explore(color="red", style_kwds={"weight": 5, "opacity": 0.75})
m = route_back_edges.explore(m=m, color="blue", style_kwds={"weight": 5, "opacity": 0.75})
m

/tmp/ipykernel_22968/3590511927.py:1: FutureWarning:

The `route_to_gdf` function has moved to the `routing` module. Calling `utils_graph.route_to_gdf` is deprecated and will be removed in the v2.0.0 release. Call it via `routing.route_to_gdf` instead. See the OSMnx v2 migration guide: https://github.com/gboeing/osmnx/issues/1123

/tmp/ipykernel_22968/3590511927.py:2: FutureWarning:

The `route_to_gdf` function has moved to the `routing` module. Calling `utils_graph.route_to_gdf` is deprecated and will be removed in the v2.0.0 release. Call it via `routing.route_to_gdf` instead. See the OSMnx v2 migration guide: https://github.com/gboeing/osmnx/issues/1123

Make this Notebook Trusted to load map: File -> Trust Notebook

Note

Note that even if we call G a graph it is rather a multigraph (there might be several edges between two nodes):

G.is_multigraph()

True

References:

Memory efficiency: sparse matrices

Natural Language Processing (NLP)

One-hot encoding/Dummy variables

Discretization of a physical system

Graphs

Usage in scipy

Graphs and sparsity

Definition: adjacency matrix

Definition : incidence matrix

Definition : Laplacian matrix

More graph visualizations

Back to sparse matrices

Planar graphs and maps

The osmnx package

Visualize the shortest path between two points

Usage in `scipy`

The `osmnx` package