Bayesian networks in Python
This is an unambitious Python library for working with Bayesian networks. For serious usage, you should probably be using a more established project, such as pomegranate, PyMC, Stan, Edward, and Pyro. There's also the welldocumented bnlearn package in R. Hey, you could even go medieval and use something like Netica — I'm just jesting, they actually have a nice tutorial on Bayesian networks. On a same not, if you're not familiar with Bayesian networks, then I highly recommend Patrick Winston's MIT courses on probabilistic inference (part 1, part 2).
The main goal of this project is to be used for educational purposes. As such, more emphasis is put on tidyness and conciseness than on performance. I find libraries such as pomegranate
wonderful, but they literally contain several thousand lines of code, at the detriment of simplicity and ease of comprehension. Nonetheless, hedgehog
is reasonably efficient and should be able to satisfy most usecases in a timely manner.
Table of contents
Installation
You should be able to install and use this library with any Python version above 3.5.
$ pip install git+https://github.com/MaxHalford/hedgehog
Note that under the hood hedgehog
uses vose
for random sampling, which is written in Cython.
Usage
✍️
Manual definition
The central construct in hedgehog
is the BayesNet
class. The edges of the network can be provided manually when initialising a BayesNet
. As an example, let's use Judea Pearl's famous alarm network:
>>> import hedgehog as hh
>>> bn = hh.BayesNet(
... ('Burglary', 'Alarm'),
... ('Earthquake', 'Alarm'),
... ('Alarm', 'John calls'),
... ('Alarm', 'Mary calls')
... )
You can also use the following notation, which is slightly more terse:
>>> import hedgehog as hh
>>> bn = hh.BayesNet(
... (('Burglary', 'Earthquake'), 'Alarm'),
... ('Alarm', ('John calls', 'Mary calls'))
... )
In Judea Pearl's example, the conditional probability tables are given. Therefore, we can define them manually by setting the values of the P
attribute:
>>> import pandas as pd
# P(Burglary)
>>> bn.P['Burglary'] = pd.Series({False: .999, True: .001})
# P(Earthquake)
>>> bn.P['Earthquake'] = pd.Series({False: .998, True: .002})
# P(Alarm  Burglary, Earthquake)
>>> bn.P['Alarm'] = pd.Series({
... (True, True, True): .95,
... (True, True, False): .05,
...
... (True, False, True): .94,
... (True, False, False): .06,
...
... (False, True, True): .29,
... (False, True, False): .71,
...
... (False, False, True): .001,
... (False, False, False): .999
... })
# P(John calls  Alarm)
>>> bn.P['John calls'] = pd.Series({
... (True, True): .9,
... (True, False): .1,
... (False, True): .05,
... (False, False): .95
... })
# P(Mary calls  Alarm)
>>> bn.P['Mary calls'] = pd.Series({
... (True, True): .7,
... (True, False): .3,
... (False, True): .01,
... (False, False): .99
... })
The prepare
method has to be called whenever the structure and/or the P are manually specified. This will do some housekeeping and make sure everything is sound. Just like brushing your teeth, it is not compulsory but highly recommended.
>>> bn.prepare()
🔮
Probabilistic inference
A Bayesian network is a generative model. Therefore, it can be used for many purposes. First of all, it can answer probabilistic queries, such as:
What is the likelihood of there being a burglary if both John and Mary call?
This question can be answered by using the query
method, which returns the probability distribution for the possible outcomes.
>>> bn.query('Burglary', event={'Mary calls': True, 'John calls': True})
Burglary
False 0.715828
True 0.284172
Name: P(Burglary), dtype: float64
We can also answer questions that involve multiple query variables, for instance:
What are the chances that John and Mary call if an earthquake happens?
>>> bn.query('John calls', 'Mary calls', event={'Earthquake': True})
John calls Mary calls
False False 0.675854
True 0.027085
True False 0.113591
True 0.183470
Name: P(John calls, Mary calls), dtype: float64
By default, the answer is found via an exact inference procedure. For small networks this isn't very expensive to perform. However, for larger networks, you might want to prefer using approximate inference. The latter is a class of methods that randomly sample the network and return an estimate of the answer. The quality of the estimate increases with the number of iterations that are performed. For instance, you can use Gibbs sampling:
>>> import numpy as np
>>> np.random.seed(42)
>>> bn.query(
... 'Burglary',
... event={'Mary calls': True, 'John calls': True},
... algorithm='gibbs',
... n_iterations=1000
... )
Burglary
False 0.73
True 0.27
Name: P(Burglary), dtype: float64
The supported inference methods are:
exact
for variable elimination.gibbs
for Gibbs sampling.likekihood
for likelihood weighting.rejection
for rejection sampling.
❓
Missing value imputation
A usecase for probabilistic inference is to impute missing values. The impute
method fills the missing values with the most likely replacements, given the present information. This is usually more accurate than simply replacing by the mean or the most common value. Additionally, such an approach can be much more efficient than modelbased iterative imputation.
>>> from pprint import pprint
>>> sample = {
... 'Alarm': True,
... 'Burglary': True,
... 'Earthquake': False,
... 'John calls': None, # missing
... 'Mary calls': None # missing
... }
>>> sample = bn.impute(sample)
>>> pprint(sample)
{'Alarm': True,
'Burglary': True,
'Earthquake': False,
'John calls': True,
'Mary calls': True}
Note that the impute
method can be seen as the equivalent of pomegranate
's predict
method.
🎲
Random sampling
You can use a Bayesian network to generate random samples. The samples will follow the distribution induced by the network's structure and it's conditional probability tables.
>>> pprint(bn.sample())
{'Alarm': False,
'Burglary': False,
'Earthquake': False,
'John calls': False,
'Mary calls': False}
>>> bn.sample(5)
Alarm Burglary Earthquake John calls Mary calls
0 False False False False False
1 False False False False False
2 False False False False False
3 False False False True False
4 False False False False False
🧮
Parameter estimation
You can determine the values of the P from a dataset. This is a straightforward procedure, as it only requires perfoming a groupby
followed by a value_counts
for each CPT.
>>> samples = bn.sample(1000)
>>> bn = bn.fit(samples)
Note that in this case you do not have to call the prepare
method because it is done for you implicitely.
If you want to update an already existing Bayesian networks with new observations, then you can use partial_fit
:
>>> bn = bn.partial_fit(samples[:500])
>>> bn = bn.partial_fit(samples[500:])
The same result will be obtained whether you use fit
once or partial_fit
multiple times in succession.
🧱
Structure learning
This isn't available yet. I kind of know what I'm doing and there are some sweet algorithms as of 2020. However, it would be nice to take a modern approach and actually take into account causality. Indeed, most structure learning algorithms don't make a difference between different edge directions. Stay tuned!
👀
Visualization
You can use the graphviz
method to return a graphviz.Digraph
.
>>> bn = hh.load_asia()
>>> dot = bn.graphviz()
>>> path = dot.render('asia', directory='figures', format='svg', cleanup=True)
Note that the graphviz
library is not installed by default because it requires a platform dependent binary. Therefore, you have to install it by yourself.
👁️
Graphical user interface
A sidegoal of this project is to provide a user interface to play around with a given user interface. Fortunately, we live in wonderful times where many powerful and opensource tools are available. At the moment, I have a preference for streamlit
.
You can install the GUI dependencies by running the following command:
$ pip install git+https://github.com/MaxHalford/hedgehog installoption="extrasrequire=gui"
You can then launch a demo by running the hedgehog
command:
$ hedgehog
This will launch a streamlit
interface where you can play around with the examples that hedgehog
provides. You can see a running instance of it in this Heroku app.
An obvious next step would be to allow users to run this with their own Bayesian networks. Then again, using streamlit
is so easy that you might as well do this yourself.
🔢
Support for continuous variables
Bayesian networks that handle both discrete and continuous are said to be hybrid. There are two approaches to deal with continuous variables. The first approach is to use parametric distributions within nodes that pertain to a continuous variable. This has two disavantages. First of all, it is complex because there are different cases to handle: a discrete variable conditioned by a continuous one, a continuous variable conditioned by a discrete one, or combinations of the former with the latter. Secondly, such an approach requires having to pick a parametric distribution for each variable. Although there are methods to automate this choice for you, they are expensive and are far from being foolproof.
The second approach is to simply discretise the continuous variables. Although this might seem naive, it is generally a good enough approach and definitely makes things simpler implementationwise. There are many ways to go about discretising a continuous attribute. For instance, you can apply a quantilebased discretization function. You could also round each number to it's closest integer. In some cases you might be able to apply a manual rule. For instance, you can convert a numeric temperature to "cold", "mild", and "hot".
To summarize, we prefer to give the user the flexibility to discretize the variables by herself. Indeed, most of the time the best procedure depends on the problem at hand and cannot be automated adequatly.
Toy networks
Several toy networks are available to fool around with:

🚨 load_alarm
— the alarm network introduced by Judea Pearl. 
🐉 load_asia
— a popular example introduced in Local computations with probabilities on graphical structures and their application to expert systems. 
🎓 load_grades
— an example from Stanford's CS 228 class. 
💦 load_sprinkler
— the network used in chapter 14 of Artificial Intelligence: A Modern Approach (3rd edition).
Here is some example usage:
>>> bn = hh.load_sprinkler()
>>> bn.nodes
['Cloudy', 'Rain', 'Sprinkler', 'Wet grass']
>>> pprint(bn.parents)
{'Rain': ['Cloudy'],
'Sprinkler': ['Cloudy'],
'Wet grass': ['Sprinkler', 'Rain']}
>>> pprint(bn.children)
{'Cloudy': ['Sprinkler', 'Rain'],
'Rain': ['Wet grass'],
'Sprinkler': ['Wet grass']}
Development
# Download and navigate to the source code
$ git clone https://github.com/MaxHalford/hedgehog
$ cd hedgehog
# Create a virtual environment
$ python3 m venv env
$ source env/bin/activate
# Install in development mode
$ pip install e ".[dev]"
$ python setup.py develop
# Run tests
$ pytest
License
This project is free and opensource software licensed under the MIT license.