Code integration
To integrate DeepCASE into your own project, you can use it as a standalone module. Here we show some simple examples of how to use the DeepCASE package in your own Python code. For complete documentation, we refer to the Reference guide.
Note
The code used in this section is also available in the GitHub repository under examples/example.py.
Import
To import components from DeepCASE, simply use the following format:
from deepcase(.<module>) import <Object>
For example, the following code imports the Preprocessor, ContextBuilder, and Interpreter.
from deepcase.preprocessing import Preprocessor
from deepcase.context_builder import ContextBuilder
from deepcase.interpreter import Interpreter
Loading data
DeepCASE can load sequences from .csv and specifically formatted .txt files (see the Preprocessor class).
# Create preprocessor
preprocessor = Preprocessor(
length = 10, # 10 events in context
timeout = 86400, # Ignore events older than 1 day (60*60*24 = 86400 seconds)
)
# Load data from file
context, events, labels, mapping = preprocessor.csv('data/example.csv')
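The mapping returned by the Preprocessor lets you translate encoded event IDs back to their original values. The sketch below illustrates this decoding on toy data; the mapping contents and event values here are hypothetical stand-ins, not output of the actual Preprocessor.

```python
import numpy as np

# Hypothetical mapping: encoded event ID -> original event value
mapping = {0: 'login_failed', 1: 'login_success', 2: 'port_scan'}

# Toy encoded events, a stand-in for the events returned by the Preprocessor
events = np.array([2, 0, 0, 1])

# Decode the encoded events back to their original values
decoded = [mapping[event] for event in events]
print(decoded)  # ['port_scan', 'login_failed', 'login_failed', 'login_success']
```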
In case no labels were explicitly provided as an argument, and no labels could be extracted from the file, we may set labels for each sequence manually.
Note that we assign the labels as a numpy array, which requires importing numpy using import numpy as np.
# In case no labels are provided, set labels to -1
if labels is None:
labels = np.full(events.shape[0], -1, dtype=int)
By default, the Tensors returned by the Preprocessor are set to the cpu device. If you have a system that supports cuda Tensors, you can cast the Tensors to cuda using the following code. Note that the check in this code requires you to import PyTorch using import torch.
# Cast to cuda if available
if torch.cuda.is_available():
events = events .to('cuda')
context = context.to('cuda')
Splitting data
Once we have loaded the data, we will split it into train and test data. This step is not necessarily required, depending on the setup you use, but we will use the training and test data in the remainder of this example.
# Split into train and test sets (20:80) by time - assuming events are ordered chronologically
events_train = events [:events.shape[0]//5 ]
events_test = events [ events.shape[0]//5:]
context_train = context[:events.shape[0]//5 ]
context_test = context[ events.shape[0]//5:]
labels_train = labels [:events.shape[0]//5 ]
labels_test  = labels [ events.shape[0]//5:]
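To make the slicing above concrete, here is a minimal sketch of the same 20:80 split on toy numpy arrays. The arrays are stand-ins for the Tensors loaded by the Preprocessor; only the indexing logic is the same.

```python
import numpy as np

# Toy stand-ins: 10 events with matching context windows
events  = np.arange(10)
context = np.arange(10 * 3).reshape(10, 3)

# First 20% of sequences for training, remaining 80% for testing
split = events.shape[0] // 5

events_train, events_test   = events[:split],  events[split:]
context_train, context_test = context[:split], context[split:]

print(events_train.shape[0], events_test.shape[0])  # 2 8
```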
ContextBuilder
First we create an instance of DeepCASE’s ContextBuilder using the following code:
# Create ContextBuilder
context_builder = ContextBuilder(
input_size = 100, # Number of input features to expect
output_size = 100, # Same as input size
hidden_size = 128, # Number of nodes in hidden layer, in paper we set this to 128
max_length = 10, # Length of the context, should be same as context in Preprocessor
)
# Cast to cuda if available
if torch.cuda.is_available():
context_builder = context_builder.to('cuda')
Once the context_builder is created, we train it using the fit() method.
# Train the ContextBuilder
context_builder.fit(
X = context_train, # Context to train with
y = events_train.reshape(-1, 1), # Events to train with, note that these should be of shape=(n_events, 1)
epochs = 10, # Number of epochs to train with
batch_size = 128, # Number of samples in each training batch, in paper this was 128
learning_rate = 0.01, # Learning rate to train with, in paper this was 0.01
verbose = True, # If True, prints progress
)
I/O methods
We can load and save the ContextBuilder to and from a file using the following code:
# Save ContextBuilder to file
context_builder.save('path/to/file.save')
# Load ContextBuilder from file
context_builder = ContextBuilder.load('path/to/file.save')
Interpreter
Once we have fitted the context_builder, we create an Interpreter instance using the following code:
# Create Interpreter
interpreter = Interpreter(
context_builder = context_builder, # ContextBuilder used to fit data
features = 100, # Number of input features to expect, should be same as ContextBuilder
eps = 0.1, # Epsilon value to use for DBSCAN clustering, in paper this was 0.1
min_samples = 5, # Minimum number of samples to use for DBSCAN clustering, in paper this was 5
threshold = 0.2, # Confidence threshold used for determining if attention from the ContextBuilder can be used, in paper this was 0.2
)
Once the interpreter is created, we can use it to cluster samples using the cluster() method.
# Cluster samples with the interpreter
clusters = interpreter.cluster(
X = context_train, # Context to train with
y = events_train.reshape(-1, 1), # Events to train with, note that these should be of shape=(n_events, 1)
iterations = 100, # Number of iterations to use for attention query, in paper this was 100
batch_size = 1024, # Batch size to use for attention query, used to limit CUDA memory usage
verbose = True, # If True, prints progress
)
I/O methods
We can load and save the Interpreter to and from a file using the following code:
# Save Interpreter to file
interpreter.save('path/to/file.save')
# Load Interpreter from file
interpreter = Interpreter.load(
'path/to/file.save',
context_builder = context_builder, # When loading the Interpreter, make sure it is linked to the same ContextBuilder used for training.
)
Manual Mode
When we have used the Interpreter to cluster samples, we can assign a score to the individual clusters.
Assigning a score is done through the score() method. However, this method has two requirements:

1. all sequences used to create clusters are assigned a score;
2. all sequences in the same cluster are assigned the same score.

Therefore, to make sure these two conditions hold, we first call the score_clusters() method and use its result for the score() method.
# Compute scores for each cluster based on individual labels per sequence
scores = interpreter.score_clusters(
scores = labels_train, # Labels used to compute score (either as loaded by Preprocessor, or put your own labels here)
strategy = "max", # Strategy to use for scoring (one of "max", "min", "avg")
NO_SCORE = -1, # Any sequence with this score will be ignored in the strategy.
# If assigned a cluster, the sequence will inherit the cluster score.
# If the sequence is not present in a cluster, it will receive a score of NO_SCORE.
)
# Assign scores to clusters in interpreter
# Note that all sequences should be given a score and each sequence in the
# same cluster should have the same score.
interpreter.score(
scores = scores, # Scores to assign to sequences
verbose = True, # If True, prints progress
)
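To clarify the semantics of the "max" strategy and the NO_SCORE value, the sketch below reimplements the scoring logic on toy data. This is an illustration of the behaviour described above, not the library's actual implementation; all arrays are hypothetical.

```python
import numpy as np

NO_SCORE = -1

# Toy per-sequence labels and their cluster assignments (-1 = not clustered)
labels   = np.array([1, 3, NO_SCORE, 2, NO_SCORE, 5])
clusters = np.array([0, 0, 0,        1, -1,       1])

scores = np.full(labels.shape[0], NO_SCORE, dtype=int)
for cluster in np.unique(clusters[clusters != -1]):
    members = clusters == cluster
    # Ignore NO_SCORE labels when applying the strategy
    valid = labels[members][labels[members] != NO_SCORE]
    if valid.size:
        # "max" strategy: every member inherits the highest label in the cluster
        scores[members] = valid.max()

print(scores.tolist())  # [3, 3, 3, 5, -1, 5]
```

Note how the unlabelled sequence in cluster 0 inherits the cluster score, while the sequence outside any cluster keeps NO_SCORE.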
Semi-automatic Mode
Once we have used the Interpreter for clustering and assigned a score to each cluster, we can use the predict() method to predict labels of new sequences. When no cluster could be matched, the predict() method gives one of three scores:

-1, if the ContextBuilder is not confident enough for a prediction.
-2, if the event was not in the training dataset.
-3, if the nearest cluster is a larger distance than epsilon away from the nearest sequence.
# Compute predicted scores
prediction = interpreter.predict(
X = context_test, # Context to predict
y = events_test.reshape(-1, 1), # Events to predict, note that these should be of shape=(n_events, 1)
iterations = 100, # Number of iterations to use for attention query, in paper this was 100
batch_size = 1024, # Batch size to use for attention query, used to limit CUDA memory usage
verbose = True, # If True, prints progress
)
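A common follow-up is to separate the sequences DeepCASE scored automatically from those flagged for manual analysis (the negative special codes listed above). A minimal sketch on a hypothetical prediction array:

```python
import numpy as np

# Toy prediction output, a stand-in for interpreter.predict(); negative values
# are the special codes described above, non-negative values are cluster scores
prediction = np.array([0, 3, -1, -2, 0, -3, 3])

n_automatic = int((prediction >= 0).sum())  # handled automatically
n_manual    = int((prediction <  0).sum())  # require manual inspection

print(n_automatic, n_manual)  # 4 3
```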