-
Notifications
You must be signed in to change notification settings - Fork 13
Python Library Usage
The XCSF Python library has been designed to be as familiar as possible to users of scikit-learn and Keras. This document contains a brief guide to the functions available and the ways in which it can be customised.
Older versions:
Documentation can be displayed within Python:
import xcsf
help(xcsf.xcsf)
Specifying the input dimensionality x_dim
, the output dimensionality
y_dim
, and the number of actions (or classes) n_actions
is required.
Other parameters may be specified when changing from the default value.
import xcsf
xcs = xcsf.XCS(x_dim=1, y_dim=1, n_actions=1)
def __init__(
self,
#############################################
# Required
#############################################
x_dim: int = 1, # input dimensionality
y_dim: int = 1, # output dimensionality
n_actions: int = 1, # number of actions or classes; 1 for supervised learning
#############################################
# General
#############################################
omp_num_threads: int = 8, # number of CPU cores to use
random_state: int | None = None, # sets random number seed; uses current time if None or negative
population_file: str = "", # JSON file with an initial population set
pop_init: bool = True, # whether to seed the population with random rules
max_trials: int = 100000, # number of trials to execute for each xcs.fit()
perf_trials: int = 1000, # number of trials to average performance output
pop_size: int = 2000, # maximum population size
set_subsumption: bool = False, # whether to perform set subsumption
theta_sub: int = 100, # minimum experience of a classifier to become a subsumer
#############################################
# Error Function
#############################################
loss_func: str = "mae",
# "mae" = mean absolute error
# "mse" = mean squared error
# "rmse" = root mean squared error
# "log" = log loss (cross-entropy)
# "binary_log" = binary log loss
# "onehot" = one-hot encoding classification error
# "huber" = Huber error
huber_delta: float = 1, # delta parameter for Huber error calculation
#############################################
# Multi-step Problems
#############################################
teletransportation: int = 50, # num steps to reset a multistep problem if goal not found
gamma: float = 0.95, # discount factor in calculating the reward for multistep problems
p_explore: float = 0.9, # probability of exploring vs. exploiting in a multistep trial
#############################################
# Classifier
#############################################
e0: float = 0.01, # target error, under which accuracy is set to 1
alpha: float = 0.1, # accuracy offset for rules above E0 (1=disabled)
nu: float = 5, # accuracy slope for rules with error above E0
beta: float = 0.1, # learning rate for updating error, fitness, and set size
delta: float = 0.1, # fraction of least fit classifiers to increase deletion vote
theta_del: int = 20, # min experience before fitness used in probability of deletion
init_fitness: float = 0.01, # initial classifier fitness
init_error: float = 0, # initial classifier error
m_probation: int = 10000, # trials since creation a rule must match at least 1 input or be deleted
stateful: bool = True, # whether classifiers should retain state across trials
compaction: bool = False, # if enabled and sys err < E0, the largest of 2 roulette spins is deleted
#############################################
# Evolutionary Algorithm
#############################################
ea: dict = {
"select_type": "roulette", # parental selection; options = {"roulette", "tournament"}
"select_size": 0.4, # fraction of set size for tournament parental selection
"theta_ea": 50, # average set time between EA invocations
"lambda": 2, # number of offspring to create each EA invocation (use multiples of 2)
"p_crossover": 0.8, # probability of applying crossover (usage depends on representation)
"err_reduc": 1.0, # amount to reduce an offspring error (1=disabled)
"fit_reduc": 0.1, # amount to reduce an offspring fitness (1=disabled)
"subsumption": False, # whether to try and subsume offspring classifiers
"pred_reset": False, # whether to reset offspring predictions instead of copying
},
#############################################
# Action: See below for options.
#############################################
action: dict = {
"type": "integer",
},
#############################################
# Condition: See below for options.
#############################################
condition: dict = {
"type": "hyperrectangle_csr",
"args": {
"eta": 0.0,
"min": 0.0,
"max": 1.0,
"spread_min": 0.1,
},
},
#############################################
# Prediction: See below for options.
#############################################
prediction: dict = {
"type": "nlms_linear",
"args": {
"x0": 1.0,
"eta": 0.1,
"evolve_eta": True,
"eta_min": 1e-05,
},
},
) -> None: ...
If n_actions > 1
the system will be configured for Reinforcement Learning and
action sets will be constructed from the match sets using the chosen representation as in traditional XCS.
If n_actions = 1
the system will be configured for Supervised Learning and the
action component will be removed. That is, only match sets will be created as in traditional XCSF.
The loss_func
is used when calculating each classifier's error for fitness determination, and therefore
affects the reproductive success with the evolutionary algorithm (EA). It is also used when measuring the system
performance; for example, when comparing the fitness weighted match set predictions with the true values
in score()
.
However, the loss_func
is not used when updating classifier predictions since these updates depend on the representation chosen.
For example, Constant predictions minimise the mean absolute error, whereas NLMS
and RLS minimise the mean squared error. Neural Networks
also minimise the mean squared error, however if using a one-hot encoding and a softmax layer, it will minimise
the mean log loss (cross-entropy).
A given random_state
is not guaranteed to generate reproducible behaviour across different platforms
(e.g., Linux vs. Windows) due to compiler optimisations, etc.
The default hyperparameters are not intended as general values suitable for all problems and must be set appropriately for the specific learning task. This is best done in a systematic way, e.g., as either a grid search or using a hyperparameter tuning framework such as Optuna. However, below are some general hints that may help finding optimal values.
One of the most important hyperparameters pop_size
will vary considerably depending on the representation chosen (e.g.,
evolving hyperrectangle conditions will require larger rule sets than neural networks) and also based on problem complexity.
Faster convergence will generally be seen with larger values, although with a higher resultant computational cost. Note that LCS
are notorious for requiring much larger population sizes than typical EAs. For example, Butz et al. (2008) use pop_size = 6400
to solve a number of sine function regression tasks with
hyperellipsoidal conditions and RLS predictions. The first priority should be to achieve acceptable and stable performance, and
then rule compaction algorithms can be applied subsequently if necessary.
Another of the most important hyperparameters is max_trials
, which specifies the total number of "trials" to execute. For supervised learning, a trial is a single iteration where a single sample is drawn and a match set created and updated; see Fitting for more information. For reinforcement learning, a trial (also known as an episode) consists of one or more steps, where each step involves creating a match set and action sets based on the current environment state. Similar to pop_size
, the value of max_trials
will vary significantly based on both the representation and problem complexity.
Increasing theta_ea
will slow down the rate of EA invocation. In simple problems this will lead to slower convergence,
but in more complex ones it may be necessary to provide enough samples for the classifier predictions to converge correctly
and therefore provide a more accurate error/fitness calculation. In simple problems, values of 25 or 50 are typically used,
whereas in more complex problems, values of 100, 500, or even larger may be required.
The hyperparameter theta_del
is closely related to theta_ea
because it controls the minimum experience a classifier can
have before fitness is used in the probability of deletion. As a general rule of thumb, these two parameters should usually
be similar values. Also closely related is beta
which controls the update rate for classifier metrics such as error and
set size. Therefore, small values of theta_ea
will require larger values of beta
to more quickly update classifier
metrics, but with larger values of theta_ea
it can be reduced to allow a smoother and more accurate fitness determination.
A good starting point for beta
is 0.1; in very simple problems a value of 0.2 may converge faster, but in more complex
problems values of 0.05 and smaller may be beneficial with larger theta_ea
.
For regression tasks, it is typical to set alpha
to 1, which disables the accuracy offset since there is almost always
no cliff edge in the fitness landscape as compared with predicting a different label for classification. In reinforcement
learning however, the offset can be useful for stabilising performance and typically a value of 0.1 is used.
The hyperparameter nu
controls the exponent used in the accuracy calculation, which affects the slope of the accuracy curve.
Larger values therefore increase the difference in fitness between rules of similar errors. This increases the pressure
toward evolving accurate rules and consequently increase the pressure to niche during the early stages of learning. In some
early papers comparing roulette wheel and tournament selection for EA parental select_type
there was a debate about whether
tournament selection was faster and more robust, but it was later shown that when nu
is correctly set the differences are
minimal. nu
can therefore be seen to have a similar affect on roulette wheel selection as select_size
does for tournament.
Generally nu = 5
provides a good performance without compromising diversity by increasing the fitness pressure too high,
however there are occasionally some problems where higher values such as 10, 20, or even 50, can increase performance.
A value of 0.1 for delta
is almost always used, which results in the 10% least fit classifiers in the population receiving
a bigger deletion vote and are therefore more likely to be removed from the population. Disabling this by using a value of 0
will likely slow convergence.
The parameters listed above can be modified on an instantiated xcsf.XCS
object using set_params()
For example:
xcs.set_params(beta=0.5)
Note that while EA parameters require specifying a dictionary, only the values that should be
modified need to be present, for example only changing theta_ea
:
xcs.set_params(ea={"theta_ea": 100})
However, setting the action
, condition
, and prediction
requires providing a complete dict
including the type and args.
Library Stub:
def set_params(**kwargs) -> xcsf.XCS: ...
The use of always matching conditions results in the match set being equal to the population set, i.e., [M] = [P]. The EA and classifier updates are thus performed within [P], and global models are designed (e.g., neural networks) that cover the entire state-space. This configuration operates as a more traditional EA, which can be useful for debugging and benchmarking.
Additionally, a single global model (e.g., a linear regression) can be fit by
also setting pop_size = 1
and disabling the EA by setting
the invocation frequency to a larger number than will ever be executed, e.g.,
ea = {"theta_ea": 5000000}
. This can also be useful for debugging and benchmarking.
condition = {
"type": "dummy",
}
With ternary bitstrings, each classifier's condition is represented as x_dim
multiplied by the number of encoding bits
.
For binary problems, the number of encoding bits is simply: bits = 1
.
For real-valued inputs, the values are binarised to the specified number of bits
with the
assumption that the inputs are in the range [0,1].
For example with bits = 2
, an input vector [0.23,0.76,0.45,0.5]
will be converted to
[0,0,1,1,0,1,0,1]
before being tested for matching with the ternary bitstring
using the alphabet {0,1,#}
where the don't care symbol #
matches either bit.
Uniform crossover is applied with probability p_crossover
and a single
self-adaptive mutation rate (log normal) is used.
condition = {
"type": "ternary",
"args": {
"bits": 2, # number of bits per float to binarise inputs
"p_dontcare": 0.5, # don't care probability during covering
}
}
Related Literature:
- S. W. Wilson (1995) Classifier fitness based on accuracy
Hyperellipsoids currently use the center-spread representation (and axis-rotation is not yet implemented.)
Hyperrectangles currently implement the center-spread and unordered-bound representations.
With the hyperrectangle center-spread representation, each classifier condition is represented as a concatenation
of interval predicates, x_dim
and
With the hyperrectangle unordered-bound representation, each classifier condition is represented as a concatenation
of interval predicates, x_dim
and
Uniform crossover is applied with probability p_crossover
. A single
self-adaptive mutation rate (log normal) specifies the standard
deviation used to sample a random Gaussian (with zero mean) which is added to
each center and spread value (or bound for unordered-bounds).
For center-spread representations, if eta > 0
each classifier's centers are
adjusted at rate
condition = {
"type": "hyperrectangle_csr", # center-spread
# "type": "hyperrectangle_ubr", # unordered-bound
# "type": "hyperellipsoid", # center-spread
"args": {
"min": 0, # minimum value of a center/bound
"max": 1, # maximum value of a center/bound
"spread_min": 0.1, # minimum initial spread
"eta": 0, # gradient descent rate for moving centers to mean inputs matched
}
}
The input features will need to be numpy arrays of type numpy.float64
and when using hyperrectangles or hyperellipsoids it is recommended to scale the input features in the range [0,1], for example as below. Monitor the average number of matching classifiers with the mset
metric to make sure that a sufficient number of rules are covering the input domain.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
x = scaler.fit_transform(x)
Related Literature:
- S. W. Wilson (2000) Get real! XCS with continuous-valued inputs
- C. Stone and L. Bull (2003) For real! XCS with continuous-valued inputs
- M. V. Butz (2005) Kernel-based, ellipsoidal conditions in the real-valued XCS classifier system
- M. V. Butz, P.-L. Lanzi, and S. W. Wilson (2006) Hyper-ellipsoidal conditions in XCS: rotation, linear approximation, and solution structure
- K. Tamee, L. Bull, and O. Pinngern (2007) Towards clustering with XCS
GP trees currently use
arithmetic operators from the set {+,-,/,*}
. Return values from each node
are clamped [-1000,1000]. The rule matches if the output node is greater than
0.5. Subsumption is not implemented.
Sub-tree crossover is applied with probability p_crossover
. A single
self-adaptive mutation rate (rate selection) is used to specify the
per allele probability of performing mutation where terminals are randomly
replaced with other terminals and functions randomly replaced with other
functions.
condition = {
"type": "tree_gp",
"args": {
"min_constant": 0, # minimum value of a constant
"max_constant": 1, # maximum value of a constant
"n_constants": 100, # number of (global) constants available
"init_depth": 5, # initial depth of a tree
"max_len": 10000, # maximum initial length of a tree
}
}
See also: Visualising GP Trees.
Related Literature:
- M. Ahluwalia and L. Bull (1999) A genetic programming-based classifier system.
- P.-L. Lanzi (1999) Extending the representation of classifier conditions Part II: From messy coding to S-expressions
- P.-L. Lanzi (2003) XCS with stack-based genetic programming
- C. Ioannides and W. Browne (2007) Investigating scaling of an abstracted LCS utilising ternary and S-expression alphabets
- S. W. Wilson (2008) Classifier conditions using gene expression programming
- M. Iqbal, W. N. Browne, and M. Zhang (2014) Reusing building blocks of extracted knowledge to solve complex, large-scale Boolean problems
Temporally dynamic graphs
with fuzzy symbolic functions selected
from the CFMQVS set: {fuzzy NOT, fuzzy AND, fuzzy OR}
. Each graph is initialised
with a randomly selected function assigned to each node and random connectivity
(including recurrent connections) and is synchronously updated in parallel for T cycles
before sampling the output node(s). These graphs can exhibit inherent memory by
retaining state across inputs. Inputs must be in the range [0,1].
Currently implements a fixed number of nodes with the connectivity and update cycles evolved along with the function for each node. Log normal self-adaptive mutation is used for node function and connectivity and uniform self-adaptive mutation for the number of update cycles.
When used as conditions, the number of nodes n
must be at least 1 and the
rule matches a given input if the state of that node is greater than 0.5 after
updating the graph T times. When used as condition + action rules, the action
is encoded as binary (discretising the node outputs with threshold 0.5); for
example with 8 actions, a minimum of 3 additional nodes are required.
Subsumption is not implemented.
condition = {
"type": "dgp",
# "type": "rule_dgp" # conditions + actions in single DGP graphs
"args": {
"max_k": 2, # number of connections per node
"max_t": 10, # maximum number of cycles to update graphs
"n": 20, # number of nodes in the graph
"evolve_cycles": True, # whether to evolve the number of update cycles
}
}
See also: Visualising DGP Graphs.
Related Literature:
- R. J. Preen and L. Bull (2013) Dynamical genetic programming in XCSF
- R. J. Preen and L. Bull (2014) Discrete and fuzzy dynamical genetic programming in the XCSF learning classifier system
- M. Iqbal, W. N. Browne, and M. Zhang (2017) Extending XCS with cyclic graphs for scalability on complex Boolean problems
Condition output layers should be set to a single neuron, i.e., "n_init": 1
. A classifier matches an input if this output neuron is greater than 0.5.
When used to represent conditions and actions within a single network ("rules")
the output layers should be "n_init": 1 + binary
where binary is the
number of outputs required to output binary actions. For example, for 8
actions, 3 binary outputs are required and the output layer should contain 4
neurons. Again, the neuron states of the action outputs are discretised with
threshold 0.5. Subsumption is not implemented.
See Neural Network Initialisation.
condition = {
"type": "neural",
"args": layer_args,
}
condition = {
"type": "rule_neural" # conditions + actions in single neural nets
"args": layer_args,
}
Related Literature:
- L. Bull (2002) On using constructivism in neural classifier systems
- R. J. Preen, S. W. Wilson, and L. Bull (2021) Autoencoding with a classifier system
- R. J. Preen and L. Bull (2021) Deep learning with a classifier system: Initial results
A constant integer value. A single self-adaptive mutation rate (log normal) specifies the probability of randomly reselecting the value.
action = {
"type": "integer",
}
Related Literature:
- S. W. Wilson (1995) Classifier fitness based on accuracy
Output layer should be a softmax.
See Neural Network Initialisation.
action = {
"type": "neural",
"args": layer_args,
}
Related Literature:
- T. O'Hara and L. Bull (2005) A memetic accuracy-based neural learning classifier system
- P.-L. Lanzi and D. Loiacono (2007) Classifier systems that compute action mappings
- D. Howard, L. Bull, and P.-L. Lanzi (2015) A cognitive architecture based on a learning classifier system with spiking classifiers
Original XCS behaviour can be specified with (piece-wise) constant predictions where each classifier maintains a scalar
- if
$cl.exp < 1 / \beta$ :$cl.p \leftarrow (cl.p \times (cl.exp - 1) + y) / cl.exp$
- otherwise:
$cl.p \leftarrow cl.p + \beta (y - cl.p)$
beta = 0.1 # classifier update rate includes constant predictions
prediction = {
"type": "constant",
}
Related Literature:
- S. W. Wilson (1995) Classifier fitness based on accuracy
Each classifier maintains a vector of weights that are used to compute the prediction from the input state x_dim
. Therefore, the length of the weight vector is equal to (x_dim + 1) * y_dim
.
For quadratic prediction, there are 1 (offset) +
For linear prediction, each classifier computes its prediction:
Where x0
and
When updating the classifier, each weight is adjusted by:
Where eta
.
The classifier weight vector is then updated:
If eta
is evolved, the rate is initialised uniformly random [eta_min, eta]
.
Offspring inherit the rate and a single (log normal) self-adaptive
mutation rate specifies the standard deviation used to sample a random Gaussian
(with zero mean) which is added to eta
(similar to
evolution strategies).
prediction = {
"type": "nlms_linear", # options: {"nlms_linear", "nlms_quadratic"}
"args": {
"x0": 1, # offset value
"eta": 0.1, # gradient descent update rate (maximum value, if evolved)
"eta_min": 0.0001, # minimum gradient descent update rate (if evolved)
"evolve_eta": True, # whether to evolve the gradient descent rate
}
}
Related Literature:
- S. W. Wilson (2001) Function approximation with a classifier system
- S. W. Wilson (2002) Classifiers that approximate functions
- P.-L. Lanzi, D. Loiacono, S. W. Wilson, and D. E. Goldberg (2005) XCS with computed prediction for the learning of Boolean functions
- P.-L. Lanzi, D. Loiacono, S. W. Wilson, and D. E. Goldberg (2005) XCS with computed prediction in multistep environments
- P.-L. Lanzi, D. Loiacono, S. W. Wilson, and D. E. Goldberg (2005) Extending XCSF beyond linear approximation
RLS predictions use the same number of weights as for NLMS and are computed in the same way. However, the weight vectors (coefficients) are updated using the recursive least squares algorithm via the use of a gain matrix, which converges extremely quickly, although with a high computational cost.
prediction = {
"type": "rls_linear", # options: {"rls_linear", "rls_quadratic"}
"args": {
"x0": 1, # offset value
"scale_factor": 1000, # initial diagonal values of the gain-matrix
"lambda": 1, # forget rate (small values may be unstable)
}
}
Related Literature:
- P.-L. Lanzi, D. Loiacono, S. W. Wilson, and D. E. Goldberg (2006) Prediction update algorithms for XCSF: RLS, Kalman filter, and gain adaptation
- D. Loiacono and P.-L. Lanzi (2007) Recursive least squares and quadratic prediction in continuous multistep problems
- M. V. Butz, P.-L. Lanzi, and S. W. Wilson (2008) Function approximation with XCS: Hyperellipsoidal conditions, recursive least squares, and compaction
- D. Loiacono and P.-L. Lanzi (2008) Computed prediction in binary multistep problems
- D. Loiacono and P.-L. Lanzi (2009) Recursive least squares and quadratic prediction in continuous multistep problems
Output layer should be "n_init": y_dim
.
See Neural Network Initialisation.
prediction = {
"type": "neural",
"args": layer_args,
}
Related Literature:
- P.-L. Lanzi and D. Loiacono (2006) XCSF with neural prediction
- T. O'Hara and L. Bull (2007) Backpropagation in accuracy-based neural learning classifier systems
- H. H. Dam, H. A. Abbass, C. Lokan, and X. Yao (2008) Neural-based learning classifier systems
- R. J. Preen, S. W. Wilson, and L. Bull (2021) Autoencoding with a classifier system
- R. J. Preen and L. Bull (2021) Deep learning with a classifier system: Initial results
Neural networks support self-adaptive mutation with a number of different operators that may be enabled, however crossover is not implemented. Stochastic gradient descent may be enabled for predictions. Details may be found in:
- R. J. Preen, S. W. Wilson, and L. Bull (2021) Autoencoding with a classifier system
- R. J. Preen and L. Bull (2021) Deep learning with a classifier system: Initial results
layer_args = {
"layer_0": { # first hidden layer
"type": "connected", # layer type
..., # layer specific parameters
},
..., # as many layers as desired
"layer_n": { # output layer
"type": "connected", # layer type
..., # layer specific parameters
},
}
Note: Neuron states are clamped [-100,100] before activations are applied. Weights are clamped [-10,10].
"logistic", # logistic [0,1]
"relu", # rectified linear unit [0,inf]
"tanh", # tanh [-1,1]
"linear", # linear [-inf,inf]
"gaussian", # Gaussian (0,1]
"sin", # sine [-1,1]
"cos", # cosine [-1,1]
"softplus", # soft plus [0,inf]
"leaky", # leaky rectified linear unit [-inf,inf]
"selu", # scaled exponential linear unit [-1.7581,inf]
"loggy", # logistic [-1,1]
layer_args = {
"layer_0": {
"type": "connected", # layer type
"activation": "relu", # activation function
"evolve_weights": True, # whether to evolve weights
"evolve_connect": True, # whether to evolve connectivity
"evolve_functions": True, # whether to evolve activation function
"evolve_neurons": True, # whether to evolve the number of neurons
"max_neuron_grow": 5, # maximum number of neurons to add or remove per mut
"n_init": 10, # initial number of neurons
"n_max": 100, # maximum number of neurons (if evolved)
"sgd_weights": True, # whether to use gradient descent (only for predictions)
"evolve_eta": True, # whether to evolve the gradient descent rate
"eta": 0.1, # gradient descent update rate (maximum value, if evolved)
"eta_min": 0.0001, # minimum gradient descent update rate (if evolved)
"momentum": 0.9, # momentum for gradient descent update
"decay": 0, # weight decay during gradient descent update
},
}
layer_args = {
"layer_0": {
"type": "recurrent",
..., # other parameters same as for connected layers
}
}
layer_args = {
"layer_0": {
"type": "lstm",
"activation": "tanh", # activation function
"recurrent_activation": "logistic", # recurrent activation function
..., # other parameters same as for connected layers
}
}
Softmax layers can be composed of a linear connected layer and softmax:
layer_args = {
"layer_0": {
"type": "connected",
"activation": "linear",
"n_init": N_ACTIONS, # number of (softmax) outputs
..., # other parameters same as for connected layers
},
"layer_1": {
"type": "softmax",
"scale": 1, # softmax temperature
},
}
layer_args = {
"layer_0": {
"type": "dropout",
"probability": 0.2, # probability of dropping an input
}
}
Gaussian noise adding layers.
layer_args = {
"layer_0": {
"type": "noise",
"probability": 0.2, # probability of adding noise to an input
"scale": 1.0, # standard deviation of Gaussian noise added
}
}
Convolutional layers require image inputs and produce image outputs. If used as the first layer, the width, height, and number of channels must be specified. RGB images must be in the array format: R1,R2,R3,...,G1,G2,G3,...,B1,B2,B3,..., this is in contrast with Keras for example, which requires R1,G1,B1,R2,G2,B2, etc.
If "evolve_neurons": True
the number of filters will be evolved using an
initial number of filters "n_init"
and maximum number "n_max"
.
layer_args = {
"layer_0": {
"type": "convolutional",
"activation": "relu", # activation function
"height": 16, # input height
"width": 16, # input width
"channels": 1, # number of input channels
"n_init": 6, # number of convolutional kernel filters
"size": 3, # the size of the convolution window
"stride": 1, # the stride of the convolution window
"pad": 1, # the padding of the convolution window
..., # other parameters same as for connected layers
},
"layer_1": {
"type": "convolutional",
..., # parameters same as above; height, width, channels not needed
},
}
Max-pooling layers require image inputs and produce image outputs. If used as the first layer, the width, height, and number of channels must be specified.
layer_args = {
"layer_0": {
"type": "maxpool",
"height": 16, # input height
"width": 16, # input width
"channels": 1, # number of input channels
"size": 2, # the size of the maxpooling operation
"stride": 2, # the stride of the maxpooling operation
"pad": 0, # the padding of the maxpooling operation
},
"layer_1": {
"type": "maxpool",
"size": 2,
"stride": 2,
"pad": 0,
},
}
Average-pooling layers require image inputs. If used as the first layer, the width, height, and number of channels must be specified. Outputs an average for each input channel.
layer_args = {
"layer_0": {
"type": "avgpool",
"height": 16, # input height
"width": 16, # input width
"channels": 1, # number of input channels
},
"layer_1": {
"type": "avgpool",
},
}
Upsampling layers require image inputs and produce image outputs. If used as the first layer, the width, height, and number of channels must be specified.
layer_args = {
"layer_0": {
"type": "upsample",
"height": 16, # input height
"width": 16, # input width
"channels": 1, # number of input channels
"stride": 2, # the stride of the upsampling operation
},
"layer_1": {
"type": "upsample",
"stride": 2,
},
}
When using layers that include random numbers during forward propagation (e.g., dropout and noise), results are not guaranteed to be reproducible with a given random_state
if parallel CPU processing is enabled.
XCSF provides support for pickle and also provides the following functions for serializing to a binary file.
Example saving the entire current state of XCSF to a binary file:
xcs.save("saved_name.bin")
Example loading the entire state of XCSF from a binary file:
xcs.load("saved_name.bin")
Functions return the total number of elements written or read.
Library Stub:
def save(self, filename: str) -> int: ...
def load(self, filename: str) -> int: ...
Example storing the current XCSF population in memory for later retrieval, overwriting any previously stored population:
xcs.store()
Example retrieving the previously stored XCSF population from memory:
xcs.retrieve()
Library Stub:
def store(self) -> None: ...
def retrieve(self) -> None: ...
Example printing the current XCSF parameters:
print(json.dumps(xcs.internal_params(), indent=4))
Example printing the current XCSF population:
xcs.print_pset()
Library Stub:
def print_pset(self, condition: bool = True, action: bool = True, prediction: bool = True) -> None: ...
The learning performance metrics dictionary is updated after perf_trials
number of trials during fit()
and can be accessed through the following method:
metrics: dict = xcs.get_metrics()
trials: list[int] = metrics["trials"] # cumulative total number of learning trials performed at each update
train: list[float] = metrics["train"] # system error on the training set at each update
val: list[float] = metrics["val"] # system error on the validation set at each update
psize: list[float] = metrics["psize"] # population set size (macro-classifiers) at each update
msize: list[float] = metrics["msize"] # average match set size (macro-classifiers) at each update
mfrac: list[float] = metrics["mfrac"] # fraction of samples matched by the best rule at each update
Library Stub:
# General
def get_metrics(self) -> dict: ... # returns a dictionary of performance metrics
def aset_size(self) -> float: ... # returns the mean action set size
def mfrac(self) -> float: ... # returns the mean fraction of inputs matched by the best rule
def mset_size(self) -> float: ... # returns the mean match set size
def pset_mean_cond_size(self) -> float: ... # returns the mean condition size
def pset_mean_pred_size(self) -> float: ... # returns the mean prediction size
def pset_num(self) -> int: ... # returns the mean population numerosity
def pset_size(self) -> int: ... # returns the mean population size
def time(self) -> int: ... # returns the current EA time
# Neural network specific - population set averages
# "layer" argument is an integer specifying the location of a layer: first layer=0
def pset_mean_cond_connections(self, layer: int) -> float: ... # returns the number of active connections for a condition layer
def pset_mean_cond_layers(self) -> float: ... # returns the mean number of layers in the condition networks
def pset_mean_cond_neurons(self, layer: int) -> float: ... # returns the mean number of neurons for a condition layer
def pset_mean_pred_connections(self, layer: int) -> float: ... # returns the number of active connections for a prediction layer
def pset_mean_pred_eta(self, layer: int) -> float: ... # returns the mean eta for a prediction layer
def pset_mean_pred_layers(self) -> float: ... # returns the mean number of layers in the prediction networks
def pset_mean_pred_neurons(self, layer: int) -> float: ... # returns the mean number of neurons for a prediction layer
import json
json_string = xcs.json() # returns the current population as JSON
parsed = json.loads(json_string) # convert to dict
Then to print the current population:
print(json.dumps(parsed, indent=4))
Example printing ternary conditions, integer actions, and fitnesses:
fitness = [cl["fitness"] for cl in parsed["classifiers"]]
ternary = [cl["condition"]["string"] for cl in parsed["classifiers"]]
actions = [cl["action"]["action"] for cl in parsed["classifiers"]]
for i in range(len(fitness)):
print("%s %d %.5f" % (ternary[i], actions[i], fitness[i]))
Printing and returning the individual weights from neural networks is disabled by default.
To enable, change the flags in the neural_json_export()
functions in cond_neural.c
, pred_neural.c
, etc.
Library Stub:
def json(self, condition: bool = True, action: bool = True, prediction: bool = True) -> str: ...
xcs.get_params() # returns external parameters (used for sklearn)
xcs.internal_params() # returns the internal parameters actually used by XCSF
Example getting and printing the current parameters:
print(json.dumps(xcs.internal_params(), indent=4))
Library Stub:
def get_params(self) -> dict: ...
def internal_params(self) -> dict: ...
Classifiers can be inserted into the population in a number of ways.
Firstly, the initial population can be set with a JSON file containing an
existing population set. This can be specified in the constructor with the
argument population_file
.
The json_insert_cl()
function can be used to insert a single new classifier
into the population. The new classifier is initialised with a random
condition, action, prediction, and then any supplied properties overwrite these
values. This means that all properties are optional. If the population set
numerosity exceeds pop_size
after inserting the rule, the standard roulette
wheel deletion mechanism will be invoked to maintain the population limit.
GP trees and neural networks are not yet implemented.
See notebook example.
Note: when manually adding classifiers, be careful that the keys are correct because if an exact match is not found it will be ignored silently.
Multiple classifiers can be added through the same mechanism as a single JSON
string with json_insert()
.
Additionally, the entire population set can be written in JSON format to a plain text file:
xcs.json_write("pset.json")
And read into the population with:
xcs.json_read("pset.json")
Notes:
Make sure to set warm_start=True
in fit()
to continue using the population.
This is not the recommended way to backup the system to persistent storage since temporary memory buffers (e.g., update matrices) and parameters are not saved and reloaded. For this purpose, see Saving and Loading XCSF.
Library Stub:
def json_insert(self, clset_json: str) -> None: ...
def json_insert_cl(self, cl_json: str) -> None: ...
def json_read(self, filename: str) -> None: ...
def json_write(self, filename: str) -> None: ...
The TreeViz
class from viz.py
will generate a tree with
graphviz. The first argument must be the
tree array; and the second, the filename to save the output as a pdf.
Optionally accepts a list of strings representing the feature_names
.
Optionally accepts a string note
, which will add a note/caption at the
bottom.
Example plotting the first classifier condition:
import json
from xcsf.utils.viz import TreeViz
parsed = json.loads(xcs.json())
trees = [cl["condition"]["tree"]["array"] for cl in parsed["classifiers"]]
TreeViz(trees[0], "test")
Note this will require the graphviz package installed with:
$ pip install graphviz
TreeViz Stub:
def __init__(self,
tree: list[str],
filename: str,
note: str | None = None,
feature_names: list[str] | None = None,
) -> None: ...
The DGPViz
class from viz.py
will generate a graph with
graphviz. The first argument must be the
graph; and the second, the filename to save the output as a pdf. Optionally
accepts a list of strings representing the feature_names
. Optionally accepts
a string note
, which will add a note/caption at the bottom.
Example plotting the first classifier condition and passing the error as a note:
import json
from xcsf.utils.viz import DGPViz
parsed = json.loads(xcs.json())
errors = [cl["error"] for cl in parsed["classifiers"]]
graphs = [cl["condition"]["graph"] for cl in parsed["classifiers"]]
note = "Error = %.5f" % errors[0]
DGPViz(graphs[0], "test", note=note)
DGPViz Stub:
def __init__(self,
graph: dict,
filename: str,
note: str | None = None,
feature_names: list[str] | None = None,
) -> None: ...
Initialise XCSF with y_dim = 1
for predictions to estimate the scalar reward.
The number of actions n_actions
must be greater than 1 otherwise the system
will be configured for supervised learning.
import xcsf
xcs = xcsf.XCS(x_dim=X_DIM, y_dim=1, n_actions=N_ACTIONS)
The standard method involves the basic loop as shown below. state
must be a
1-D numpy array representing the feature values of a single instance; reward
must be a scalar value representing the current environmental reward for having
performed the action; and done
must be a boolean value representing whether
the environment is currently in a terminal state.
state = env.reset()
xcs.init_trial()
for cnt in range(xcs.TELETRANSPORTATION):
xcs.init_step()
action = xcs.decision(state, explore) # explore specifies whether to explore/exploit
next_state, reward, done = env.step(action)
xcs.update(reward, done) # update the current action set and/or previous action set
err += xcs.error(reward, done, env.max_payoff()) # system prediction error
xcs.end_step()
if done:
break
state = next_state
cnt += 1
xcs.end_trial()
See notebook example.
Library Stub:
def init_step(self) -> None: ...
def init_trial(self) -> None: ...
def end_step(self) -> None: ...
def end_trial(self) -> None: ...
def error(self) -> float: ...
def update(self, reward: float, done: bool) -> None: ...
def decision(
self,
state: np.ndarray[Any, np.dtype[np.float64]], # shape = (x_dim, )
explore: bool,
) -> int: ...
The fit()
function may be used as below to execute one single-step learning
trial, i.e., creation of the match and action sets, updating the action set and
running the EA as appropriate. The vector state
must be a 1-D numpy array
representing the feature values of a single instance; action
must be an
integer representing the selected action (and therefore the action set to
update); and reward
must be a scalar value representing the current
environmental reward for having performed the action.
xcs.fit(state, action, reward)
The entire prediction array for a given state can be returned using the
supervised predict()
function, which must receive a 2-D numpy array. For
example:
prediction_array = xcs.predict(state.reshape(1,-1))[0]
See notebook example.
Library Stub:
@typing.overload
def fit(
self,
state: np.ndarray[Any, np.dtype[np.float64]], # shape = (x_dim, )
action: int,
reward: float,
) -> float: ...
def predict(
self,
X_predict: np.ndarray[Any, np.dtype[np.float64]], # shape = (n_samples, x_dim)
) -> np.ndarray[Any, np.dtype[np.float64]]: ... # shape = (n_samples, y_dim)
The supervised fit()
and predict()
functions can be used for reinforcement
learning without action sets, i.e., [A] = [M].
See notebook example using experience replay.
Related Literature:
- A. Stein, R. Maier, L. Rosenbauer, and J. Hähner (2020) XCS classifier system with experience replay
- L. Rosenbauer, A. Stein, D. Pätzel, and J. Hähner (2020) XCSF with experience replay for automatic test case prioritization
Initialise XCSF with n_actions=1
, set x_dim
and y_dim
as needed. Set
conditions and
predictions as desired. In this configuration
[A] = [M]. That is, no action sets are actually constructed: the action component
of classifiers is not used. A match set [M] is constructed and the prediction
array of length y_dim
is calculated using the usual fitness weighted method.
Updates are performed in [M].
Example:
import xcsf
xcs = xcsf.XCS(x_dim=10, y_dim=2, n_actions=1)
The fit()
function may be used as below to execute max_trials
number of learning iterations (i.e., match set creations) using a supplied training set. Note that while the training data is supplied as a batch, learning proceeds in the usual online way: one sample at a time. To execute a single trial simply pass a batch size of one by reshaping the data and set max_trials=1
and warm_start=True
.
If shuffle=True
, for each iteration a single random training example is selected (with replacement) and a single match set is created and updated. This is known as online learning or stochastic training. It's similar to doing SGD with a batch size of 1, where training examples are sampled randomly with replacement. These samples are drawn max_trials
number of times in total, but for displaying metrics, the system performance is averaged over the most recent perf_trials
number of iterations.
Here we refer to an "epoch" as perf_trials
number of iterations. The system performance will thus be displayed after each epoch and any callbacks used will run then. The total number of possible epochs will be equal to ceil(max_trials/perf_trials)
as defined in the constructor.
The input arrays X_train
and y_train
must be 2-D numpy arrays of the shape (n_samples, x_dim) and (n_samples, y_dim).
Example:
xcs.fit(
X_train, # numpy array of shape (n_samples, x_dim).
y_train, # numpy array of shape (n_samples, y_dim).
shuffle=True, # whether to randomly draw samples
warm_start=False, # whether to continue using the existing population
verbose=True, # whether to display performance
validation_data=(X_val, y_val), # optional validation data
callbacks=[callback], # optional callbacks
)
Library Stub:
@typing.overload
def fit(
self,
X_train: np.ndarray[Any, np.dtype[np.float64]], # shape = (n_samples, x_dim)
y_train: np.ndarray[Any, np.dtype[np.float64]], # shape = (n_samples, y_dim)
shuffle: bool = True,
warm_start: bool = False,
verbose: bool = True,
validation_data: tuple[np.ndarray, np.ndarray] | None = None,
callbacks: list | None = None,
) -> xcsf.XCS: ...
Several callbacks can be provided to fit()
to perform tasks based on the learning performance at runtime.
Callbacks are executed after each "epoch", where an epoch is equal to perf_trials
number of learning trials.
Stop training when a monitored metric has stopped improving. Either training or validation error can be chosen.
If the training error is selected, the match set fitness weighted prediction for each of the most recent perf_trials
samples is compared with the true value and the mean error as defined by loss_func
is used.
If the validation error is selected, the mean system error is computed over the entire validation data provided
(the same as if score()
had been called with the validation set).
The patience
and start_from
parameters are defined in number of trials but since the callback is only checked
after perf_trials
, their values should be a multiple of perf_trials
.
Example:
callback = xcsf.EarlyStoppingCallback(
# note: perf_trials is considered an "epoch" for callbacks
monitor="val", # which metric to monitor: {"train", "val"}
patience=20000, # trials with no improvement after which training will be stopped
restore_best=True, # whether to restore the best population after terminating
min_delta=0, # minimum change to qualify as an improvement
start_from=0, # trials to wait before starting to monitor improvement
verbose=True, # whether to display when an action is taken
)
xcs.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[callback])
Callback to save XCSF at some frequency.
Example:
callback = xcsf.CheckpointCallback(
# note: perf_trials is considered an "epoch" for callbacks
monitor="val", # which metric to monitor: {"train", "val"}
filename="xcsf.bin", # name of the file to save XCSF
save_best_only=False, # Whether to only save the best population.
save_freq=0, # Trial frequency to (possibly) make checkpoints.
verbose=True, # whether to display when an action is taken
)
xcs.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[callback])
The score()
function may be used as below to calculate the prediction error
over a single pass of a supplied data set without updates or the EA being
invoked (e.g., for scoring a validation set). An argument N
may be
supplied that specifies the maximum number of iterations performed; if this
value is less than the number of instances supplied, samples will be drawn
randomly. Returns a scalar representing the error. 2-D numpy arrays are expected
as inputs.
Note that if the match set is empty for a given sample, the prediction will be
set to zeros. An optional argument cover
can be used to specify the values to
use as system output instead of returning zeros. cover
must be an array of
length y_dim
.
val_error = xcs.score(X_val, y_val)
val_error = xcs.score(X_val, y_val, N=1000, cover=[0.1])
Library Stub:
def score(
self,
X_val: np.ndarray[Any, np.dtype[np.float64]], # shape = (n_samples, x_dim)
y_val: np.ndarray[Any, np.dtype[np.float64]], # shape = (n_samples, y_dim)
N: int = 0, # max number of samples to use
cover: Optional[np.ndarray[Any, np.dtype[np.float64]]], # shape = (1, y_dim)
) -> float: ...
The predict()
function may be used as below to calculate the XCSF predictions
for a supplied data set. That is, the fitness weighted match set prediction for each sample.
No updates or EA invocations are performed. The input vector must be a 2-D numpy array
of the shape (n_samples, x_dim). Returns a 2-D numpy array of shape (n_samples, y_dim).
Note that similar to score()
, if the match set is empty for a given sample,
the prediction will be set to zeros. An optional argument cover
can be used
to specify the values to use as system output instead of returning zeros.
cover
must be an array of length y_dim
.
predictions = xcs.predict(X_test)
predictions = xcs.predict(X_test, cover=[0.1])
Library Stub:
def predict(
self,
X_test: np.ndarray[Any, np.dtype[np.float64]], # shape = (n_samples, x_dim)
cover: Optional[np.ndarray[Any, np.dtype[np.float64]]], # shape = (1, y_dim)
) -> np.ndarray[Any, np.dtype[np.float64]]: ... # shape = (n_samples, y_dim)
Currently 3 self-adaptive mutation methods are implemented and their use is
defined within the various implementations of conditions, actions, and
predictions. The smallest allowable mutation rate MU_EPSILON = 0.0005
.
- Uniform adaptation: selects rates from a uniform random distribution. Initially the rate is drawn at random ~U[MU_EPSILON,1]. Offspring inherit the parent's rate, but with 10% probability the rate is randomly redrawn.
- Log normal adaptation: selects rates using a log normal method (similar to
evolution strategies).
Initially the rate is selected at random from a uniform distribution
~U[MU_EPSILON,1]. Offspring inherit the parent's rate, before
applying log normal adaptation:
$\mu \leftarrow \mu e^{\mathcal{N}(0,1)}$ . - Rate selection adaptation: selects rates from the following set of 10 values:
{0.0005, 0.001, 0.002, 0.003, 0.005, 0.01, 0.015, 0.02, 0.05, 0.1}
. Initially the rate is selected at random. Offspring inherit the parent's rate, but with 10% probability the rate is randomly reselected.
Related Literature:
- L. Bull and J. Hurst (2003) A neural learning classifier system with self-adaptive constructivism
- G. D. Howard, L. Bull, and P.-L. Lanzi (2008) Self-adaptive constructivism in neural XCS and XCSF
- M. V. Butz, P. O. Stalph, and P.-L. Lanzi (2008) Self-adaptive mutation in XCSF
- M. Serpell and J. E. Smith (2010) Self-adaptation of mutation operator and probability for permutation representations in genetic algorithms
This project is released under the terms of the GNU General Public License v3.0 (GPLv3).