Title: | Artificial Intelligence for Education |
---|---|
Description: | In social and educational settings, the use of Artificial Intelligence (AI) is a challenging task. Relevant data is often only available in handwritten forms, or the use of data is restricted by privacy policies. This often leads to small data sets. Furthermore, in the educational and social sciences, data is often unbalanced in terms of frequencies. To support educators as well as educational and social researchers in using the potentials of AI for their work, this package provides a unified interface for neural nets in 'PyTorch' to deal with natural language problems. In addition, the package ships with a shiny app that provides a graphical user interface, allowing the use of AI by people without skills in writing python/R scripts. The tools integrate existing mathematical and statistical methods for dealing with small data sets via pseudo-labeling (e.g. Cascante-Bonilla et al. (2020) <doi:10.48550/arXiv.2001.06001>) and imbalanced data via the creation of synthetic cases (e.g. Bunkhumpornpat et al. (2012) <doi:10.1007/s10489-011-0287-y>). Performance evaluation of AI is connected to measures from content analysis which educational and social researchers are generally more familiar with (e.g. Berding & Pargmann (2022) <doi:10.30819/5581>, Gwet (2014) <ISBN:978-0-9708062-8-4>, Krippendorff (2019) <doi:10.4135/9781071878781>). Estimation of energy consumption and CO2 emissions during model training is done with the 'python' library 'codecarbon'. Finally, all objects created with this package allow trained AI models to be shared with other people. |
Authors: | Berding Florian [aut, cre] |
Maintainer: | Berding Florian <[email protected]> |
License: | GPL-3 |
Version: | 1.0.2 |
Built: | 2025-02-05 12:45:47 UTC |
Source: | https://github.com/fberding/aifeducation |
R6 class for creation and definition of .AIFE*Transformer-like classes

This base class is used to create and define .AIFE*Transformer-like classes. It serves as a skeleton for a future concrete transformer and cannot be used to create an object of itself (an attempt to call the new method will produce an error).
See p.1 Base Transformer Class in Transformers for Developers for details.
The create method is a basic algorithm that is used to create a new transformer, but it cannot be called directly.
The train method is a basic algorithm that is used to train and tune the transformer, but it cannot be called directly.
There are already implemented concrete (child) transformers (e.g. BERT, DeBERTa-V2, etc.); to implement a new one, see p.4 Implement A Custom Transformer in Transformers for Developers, and the sketch below.
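For orientation, the following is a minimal sketch of how a concrete child transformer might be wired up with the setter methods documented below. The class name and the step-function bodies are hypothetical placeholders, and the exact signatures expected for the step functions are defined by the package internals; see p.4 Implement A Custom Transformer for the authoritative recipe.
# Hypothetical sketch of a custom child transformer built on the base class.
# Class name and step-function bodies are illustrative placeholders only.
.AIFEMyTransformer <- R6::R6Class(
  classname = ".AIFEMyTransformer",
  inherit = aifeducation:::.AIFEBaseTransformer,
  public = list(
    initialize = function() {
      super$set_title("My Transformer")
      # Register the required creation steps (placeholders only).
      super$set_required_SFC(list(
        create_tokenizer_draft   = function() NULL,  # draft the tokenizer
        calculate_vocab          = function() NULL,  # train the tokenizer
        save_tokenizer_draft     = function() NULL,  # save the draft
        create_final_tokenizer   = function() NULL,  # build the final tokenizer
        create_transformer_model = function() NULL   # build the model
      ))
    }
  )
)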
params
A list() containing the transformer's parameters ('static', 'dynamic' and 'dependent' parameters). Can be set with set_model_param().
Regardless of the transformer, the following parameters are always included:
ml_framework
text_dataset
sustain_track
sustain_iso_code
sustain_region
sustain_interval
trace
pytorch_safetensors
log_dir
log_write_interval
In the case of create it also contains (see the create method for details):
model_dir
vocab_size
max_position_embeddings
hidden_size
hidden_act
hidden_dropout_prob
attention_probs_dropout_prob
intermediate_size
num_attention_heads
In the case of train it also contains (see the train method for details):
output_dir
model_dir_path
p_mask
whole_word
val_size
n_epoch
batch_size
chunk_size
min_seq_len
full_sequences_only
learning_rate
n_workers
multi_process
keras_trace
pytorch_trace
Depending on the transformer and the method used, the class may contain different parameters:
vocab_do_lower_case
num_hidden_layer
add_prefix_space
etc.
temp
A list() containing all the temporary local variables that need to be accessed between the step functions. Can be set with set_model_temp().
For example, it can be a variable tok_new that stores the tokenizer from steps_for_creation$create_tokenizer_draft. To train the tokenizer, access the variable tok_new in steps_for_creation$calculate_vocab through the temp list of this class.
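As a concrete illustration of this pattern, here is a hedged sketch of two creation steps passing a tokenizer draft through temp. Only the tokenizers library calls are real API; how the step functions receive the object (here a self argument) is an assumption.
# Hedged sketch: sharing state between creation steps via the temp list.
tokenizers <- reticulate::import("tokenizers")

create_tokenizer_draft <- function(self) {
  # Store the draft tokenizer so later steps can reach it.
  self$set_model_temp(
    "tok_new",
    tokenizers$Tokenizer(tokenizers$models$WordPiece(unk_token = "[UNK]"))
  )
}

calculate_vocab <- function(self) {
  # Retrieve the draft stored by the previous step and train it here.
  tok_new <- self$temp$tok_new
  # ... train tok_new on the raw text dataset ...
}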
new()
An object of this class cannot be created. Thus, calling this method will produce an error.
.AIFEBaseTransformer$new()
This method returns an error.
set_title()
Setter for the title. Sets a new value for the private attribute title.
.AIFEBaseTransformer$set_title(title)
title
string
A new title.
This method returns nothing.
set_model_param()
Setter for the parameters. Adds a new parameter and its value to the params list.
.AIFEBaseTransformer$set_model_param(param_name, param_value)
param_name
string
Parameter's name.
param_value
any
Parameter's value.
This method returns nothing.
set_model_temp()
Setter for the temporary model's parameters. Adds a new temporary parameter and its value to the temp list.
.AIFEBaseTransformer$set_model_temp(temp_name, temp_value)
temp_name
string
Parameter's name.
temp_value
any
Parameter's value.
This method returns nothing.
set_SFC_check_max_pos_emb()
Setter for the check_max_pos_emb element of the private steps_for_creation list. Sets a new fun function as the check_max_pos_emb step.
.AIFEBaseTransformer$set_SFC_check_max_pos_emb(fun)
fun
function()
A new function.
This method returns nothing.
set_SFC_create_tokenizer_draft()
Setter for the create_tokenizer_draft element of the private steps_for_creation list. Sets a new fun function as the create_tokenizer_draft step.
.AIFEBaseTransformer$set_SFC_create_tokenizer_draft(fun)
fun
function()
A new function.
This method returns nothing.
set_SFC_calculate_vocab()
Setter for the calculate_vocab element of the private steps_for_creation list. Sets a new fun function as the calculate_vocab step.
.AIFEBaseTransformer$set_SFC_calculate_vocab(fun)
fun
function()
A new function.
This method returns nothing.
set_SFC_save_tokenizer_draft()
Setter for the save_tokenizer_draft element of the private steps_for_creation list. Sets a new fun function as the save_tokenizer_draft step.
.AIFEBaseTransformer$set_SFC_save_tokenizer_draft(fun)
fun
function()
A new function.
This method returns nothing.
set_SFC_create_final_tokenizer()
Setter for the create_final_tokenizer element of the private steps_for_creation list. Sets a new fun function as the create_final_tokenizer step.
.AIFEBaseTransformer$set_SFC_create_final_tokenizer(fun)
fun
function()
A new function.
This method returns nothing.
set_SFC_create_transformer_model()
Setter for the create_transformer_model element of the private steps_for_creation list. Sets a new fun function as the create_transformer_model step.
.AIFEBaseTransformer$set_SFC_create_transformer_model(fun)
fun
function()
A new function.
This method returns nothing.
set_required_SFC()
Setter for all required elements of the private steps_for_creation list. Executes setters for all required creation steps.
.AIFEBaseTransformer$set_required_SFC(required_SFC)
required_SFC
list()
A list of all new required steps.
This method returns nothing.
set_SFT_load_existing_model()
Setter for the load_existing_model element of the private steps_for_training list. Sets a new fun function as the load_existing_model step.
.AIFEBaseTransformer$set_SFT_load_existing_model(fun)
fun
function()
A new function.
This method returns nothing.
set_SFT_cuda_empty_cache()
Setter for the cuda_empty_cache element of the private steps_for_training list. Sets a new fun function as the cuda_empty_cache step.
.AIFEBaseTransformer$set_SFT_cuda_empty_cache(fun)
fun
function()
A new function.
This method returns nothing.
set_SFT_create_data_collator()
Setter for the create_data_collator element of the private steps_for_training list. Sets a new fun function as the create_data_collator step. Use this method to make a custom data collator for a transformer; a sketch follows below.
.AIFEBaseTransformer$set_SFT_create_data_collator(fun)
fun
function()
A new function.
This method returns nothing.
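As an illustration of this setter, here is a hedged sketch installing a masked-language-modeling collator from the transformers library. The object tr stands for a child-transformer object and is hypothetical, and the way the step function reads the tokenizer and parameters from the object is an assumption; only the DataCollatorForLanguageModeling call is real API.
# Hedged sketch: installing a custom data collator step.
transformers <- reticulate::import("transformers")

tr$set_SFT_create_data_collator(function(self) {
  self$set_model_temp(
    "data_collator",
    transformers$DataCollatorForLanguageModeling(
      tokenizer = self$temp$tokenizer,      # tokenizer stored by earlier steps
      mlm = TRUE,                           # masked language modeling
      mlm_probability = self$params$p_mask  # masking ratio from params
    )
  )
})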
create()
This method creates a transformer configuration based on the child-transformer architecture and a vocabulary using the python libraries transformers and tokenizers.
This method adds the following parameters to the temp list:
log_file
raw_text_dataset
pt_safe_save
value_top
total_top
message_top
This method uses the following parameters from the temp list:
log_file
raw_text_dataset
tokenizer
.AIFEBaseTransformer$create( ml_framework, model_dir, text_dataset, vocab_size, max_position_embeddings, hidden_size, num_attention_heads, intermediate_size, hidden_act, hidden_dropout_prob, attention_probs_dropout_prob, sustain_track, sustain_iso_code, sustain_region, sustain_interval, trace, pytorch_safetensors, log_dir, log_write_interval )
ml_framework
string
Framework to use for training and inference: ml_framework = "tensorflow" for 'tensorflow', ml_framework = "pytorch" for 'pytorch'.
model_dir
string
Path to the directory where the model should be saved.
text_dataset
Object of class LargeDataSetForText.
vocab_size
int
Size of the vocabulary.
max_position_embeddings
int
Number of maximum position embeddings. This parameter also determines the maximum length of a sequence which can be processed with the model.
hidden_size
int
Number of neurons in each layer. This parameter determines the dimensionality of the resulting text embedding.
num_attention_heads
int
Number of attention heads.
intermediate_size
int
Number of neurons in the intermediate layer of the attention mechanism.
hidden_act
string
Name of the activation function.
hidden_dropout_prob
double
Ratio of dropout.
attention_probs_dropout_prob
double
Ratio of dropout for attention probabilities.
sustain_track
bool
If TRUE, energy consumption is tracked during training via the python library codecarbon.
sustain_iso_code
string
ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes.
sustain_region
string
Region within a country. Only available for USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html.
sustain_interval
integer
Interval in seconds for measuring power usage.
trace
bool
TRUE if information about the progress should be printed to the console.
pytorch_safetensors
bool
Only relevant for pytorch models. TRUE: a 'pytorch' model is saved in safetensors format. FALSE (or if 'safetensors' is not available): the model is saved in the standard pytorch format (.bin).
log_dir
Path to the directory where the log files should be saved.
log_write_interval
int
Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_dir is not NULL.
This method does not return an object. Instead, it saves the configuration and vocabulary of the new model to disk.
train()
This method can be used to train or fine-tune a transformer based on the child-transformer architecture with the help of the python libraries transformers, datasets, and tokenizers.
This method adds the following parameters to the temp list:
log_file
loss_file
from_pt
from_tf
load_safe
raw_text_dataset
pt_safe_save
value_top
total_top
message_top
This method uses the following parameters from the temp list:
log_file
raw_text_dataset
tokenized_dataset
tokenizer
.AIFEBaseTransformer$train( ml_framework, output_dir, model_dir_path, text_dataset, p_mask, whole_word, val_size, n_epoch, batch_size, chunk_size, full_sequences_only, min_seq_len, learning_rate, n_workers, multi_process, sustain_track, sustain_iso_code, sustain_region, sustain_interval, trace, keras_trace, pytorch_trace, pytorch_safetensors, log_dir, log_write_interval )
ml_framework
string
Framework to use for training and inference: ml_framework = "tensorflow" for 'tensorflow', ml_framework = "pytorch" for 'pytorch'.
output_dir
string
Path to the directory where the final model should be saved. If the directory does not exist, it will be created.
model_dir_path
string
Path to the directory where the original model is stored.
text_dataset
Object of class LargeDataSetForText.
p_mask
double
Ratio that determines the number of words/tokens used for masking.
whole_word
bool
TRUE: whole word masking should be applied. FALSE: token masking is used.
val_size
double
Ratio that determines the amount of token chunks used for validation.
n_epoch
int
Number of epochs for training.
batch_size
int
Size of batches.
chunk_size
int
Size of every chunk for training.
full_sequences_only
bool
TRUE for using only chunks with a sequence length equal to chunk_size.
min_seq_len
int
Only relevant if full_sequences_only = FALSE. Value determines the minimal sequence length included in the training process.
learning_rate
double
Learning rate for the adam optimizer.
n_workers
int
Number of workers. Only relevant if ml_framework = "tensorflow".
multi_process
bool
TRUE if multiple processes should be activated. Only relevant if ml_framework = "tensorflow".
sustain_track
bool
If TRUE, energy consumption is tracked during training via the python library codecarbon.
sustain_iso_code
string
ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes.
sustain_region
string
Region within a country. Only available for USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html.
sustain_interval
integer
Interval in seconds for measuring power usage.
trace
bool
TRUE if information about the progress should be printed to the console.
keras_trace
int
keras_trace = 0: does not print any information about the training process from keras on the console. keras_trace = 1: prints a progress bar. keras_trace = 2: prints one line of information for every epoch. Only relevant if ml_framework = "tensorflow".
pytorch_trace
int
pytorch_trace = 0: does not print any information about the training process from pytorch on the console. pytorch_trace = 1: prints a progress bar.
pytorch_safetensors
bool
Only relevant for pytorch models. TRUE: a 'pytorch' model is saved in safetensors format. FALSE (or if 'safetensors' is not available): the model is saved in the standard pytorch format (.bin).
log_dir
Path to the directory where the log files should be saved.
log_write_interval
int
Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_dir is not NULL.
This method does not return an object. Instead, the trained or fine-tuned model is saved to disk.
clone()
The objects of this class are cloneable with this method.
.AIFEBaseTransformer$clone(deep = FALSE)
deep
Whether to make a deep clone.
Hugging Face transformers documentation.
Other Transformers for developers: .AIFEBertTransformer, .AIFEDebertaTransformer, .AIFEFunnelTransformer, .AIFELongformerTransformer, .AIFEMpnetTransformer, .AIFERobertaTransformer, .AIFETrObj
R6 class for creation and training of BERT transformers

This class has the following methods:
create: creates a new transformer based on BERT.
train: trains and fine-tunes a BERT model.
New models can be created using the .AIFEBertTransformer$create method.
To train the model, pass the directory of the model to the method .AIFEBertTransformer$train.
Pre-trained models that can be fine-tuned using this method are available at https://huggingface.co/.
The model is trained using dynamic masking, as opposed to the original paper, which used static masking.
aifeducation::.AIFEBaseTransformer -> .AIFEBertTransformer
aifeducation::.AIFEBaseTransformer$set_SFC_calculate_vocab()
aifeducation::.AIFEBaseTransformer$set_SFC_check_max_pos_emb()
aifeducation::.AIFEBaseTransformer$set_SFC_create_final_tokenizer()
aifeducation::.AIFEBaseTransformer$set_SFC_create_tokenizer_draft()
aifeducation::.AIFEBaseTransformer$set_SFC_create_transformer_model()
aifeducation::.AIFEBaseTransformer$set_SFC_save_tokenizer_draft()
aifeducation::.AIFEBaseTransformer$set_SFT_create_data_collator()
aifeducation::.AIFEBaseTransformer$set_SFT_cuda_empty_cache()
aifeducation::.AIFEBaseTransformer$set_SFT_load_existing_model()
aifeducation::.AIFEBaseTransformer$set_model_param()
aifeducation::.AIFEBaseTransformer$set_model_temp()
aifeducation::.AIFEBaseTransformer$set_required_SFC()
aifeducation::.AIFEBaseTransformer$set_title()
new()
Creates a new transformer based on BERT and sets the title.
.AIFEBertTransformer$new()
This method returns nothing.
create()
This method creates a transformer configuration based on the BERT base architecture and a vocabulary based on WordPiece by using the python libraries transformers and tokenizers.
This method adds the following 'dependent' parameters to the base class's inherited params list:
vocab_do_lower_case
num_hidden_layer
.AIFEBertTransformer$create( ml_framework = "pytorch", model_dir, text_dataset, vocab_size = 30522, vocab_do_lower_case = FALSE, max_position_embeddings = 512, hidden_size = 768, num_hidden_layer = 12, num_attention_heads = 12, intermediate_size = 3072, hidden_act = "gelu", hidden_dropout_prob = 0.1, attention_probs_dropout_prob = 0.1, sustain_track = FALSE, sustain_iso_code = NULL, sustain_region = NULL, sustain_interval = 15, trace = TRUE, pytorch_safetensors = TRUE, log_dir = NULL, log_write_interval = 2 )
ml_framework
string
Framework to use for training and inference: ml_framework = "tensorflow" for 'tensorflow', ml_framework = "pytorch" for 'pytorch'.
model_dir
string
Path to the directory where the model should be saved.
text_dataset
Object of class LargeDataSetForText.
vocab_size
int
Size of the vocabulary.
vocab_do_lower_case
bool
TRUE if all words/tokens should be lower case.
max_position_embeddings
int
Number of maximum position embeddings. This parameter also determines the maximum length of a sequence which can be processed with the model.
hidden_size
int
Number of neurons in each layer. This parameter determines the dimensionality of the resulting text embedding.
num_hidden_layer
int
Number of hidden layers.
num_attention_heads
int
Number of attention heads.
intermediate_size
int
Number of neurons in the intermediate layer of the attention mechanism.
hidden_act
string
Name of the activation function.
hidden_dropout_prob
double
Ratio of dropout.
attention_probs_dropout_prob
double
Ratio of dropout for attention probabilities.
sustain_track
bool
If TRUE, energy consumption is tracked during training via the python library codecarbon.
sustain_iso_code
string
ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes.
sustain_region
string
Region within a country. Only available for USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html.
sustain_interval
integer
Interval in seconds for measuring power usage.
trace
bool
TRUE if information about the progress should be printed to the console.
pytorch_safetensors
bool
Only relevant for pytorch models. TRUE: a 'pytorch' model is saved in safetensors format. FALSE (or if 'safetensors' is not available): the model is saved in the standard pytorch format (.bin).
log_dir
Path to the directory where the log files should be saved.
log_write_interval
int
Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_dir is not NULL.
This method does not return an object. Instead, it saves the configuration and vocabulary of the new model to disk.
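The following hedged sketch shows what a call to this method might look like for a small BERT configuration. The model directory and the my_text_dataset object are illustrative placeholders; the remaining arguments keep their documented defaults.
# Hedged sketch: creating a BERT configuration and vocabulary on disk.
# my_text_dataset must be an object of class LargeDataSetForText.
bert <- aifeducation:::.AIFEBertTransformer$new()
bert$create(
  ml_framework = "pytorch",
  model_dir = "models/my_bert",
  text_dataset = my_text_dataset,
  vocab_size = 30522,
  max_position_embeddings = 512,
  hidden_size = 768,
  num_hidden_layer = 12,
  num_attention_heads = 12,
  sustain_track = FALSE,
  trace = TRUE
)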
train()
This method can be used to train or fine-tune a transformer based on the BERT architecture with the help of the python libraries transformers, datasets, and tokenizers.
.AIFEBertTransformer$train( ml_framework = "pytorch", output_dir, model_dir_path, text_dataset, p_mask = 0.15, whole_word = TRUE, val_size = 0.1, n_epoch = 1, batch_size = 12, chunk_size = 250, full_sequences_only = FALSE, min_seq_len = 50, learning_rate = 0.003, n_workers = 1, multi_process = FALSE, sustain_track = FALSE, sustain_iso_code = NULL, sustain_region = NULL, sustain_interval = 15, trace = TRUE, keras_trace = 1, pytorch_trace = 1, pytorch_safetensors = TRUE, log_dir = NULL, log_write_interval = 2 )
ml_framework
string
Framework to use for training and inference: ml_framework = "tensorflow" for 'tensorflow', ml_framework = "pytorch" for 'pytorch'.
output_dir
string
Path to the directory where the final model should be saved. If the directory does not exist, it will be created.
model_dir_path
string
Path to the directory where the original model is stored.
text_dataset
Object of class LargeDataSetForText.
p_mask
double
Ratio that determines the number of words/tokens used for masking.
whole_word
bool
TRUE: whole word masking should be applied. FALSE: token masking is used.
val_size
double
Ratio that determines the amount of token chunks used for validation.
n_epoch
int
Number of epochs for training.
batch_size
int
Size of batches.
chunk_size
int
Size of every chunk for training.
full_sequences_only
bool
TRUE for using only chunks with a sequence length equal to chunk_size.
min_seq_len
int
Only relevant if full_sequences_only = FALSE. Value determines the minimal sequence length included in the training process.
learning_rate
double
Learning rate for the adam optimizer.
n_workers
int
Number of workers. Only relevant if ml_framework = "tensorflow".
multi_process
bool
TRUE if multiple processes should be activated. Only relevant if ml_framework = "tensorflow".
sustain_track
bool
If TRUE, energy consumption is tracked during training via the python library codecarbon.
sustain_iso_code
string
ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes.
sustain_region
string
Region within a country. Only available for USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html.
sustain_interval
integer
Interval in seconds for measuring power usage.
trace
bool
TRUE if information about the progress should be printed to the console.
keras_trace
int
keras_trace = 0: does not print any information about the training process from keras on the console. keras_trace = 1: prints a progress bar. keras_trace = 2: prints one line of information for every epoch. Only relevant if ml_framework = "tensorflow".
pytorch_trace
int
pytorch_trace = 0: does not print any information about the training process from pytorch on the console. pytorch_trace = 1: prints a progress bar.
pytorch_safetensors
bool
Only relevant for pytorch models. TRUE: a 'pytorch' model is saved in safetensors format. FALSE (or if 'safetensors' is not available): the model is saved in the standard pytorch format (.bin).
log_dir
Path to the directory where the log files should be saved.
log_write_interval
int
Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_dir is not NULL.
This method does not return an object. Instead, the trained or fine-tuned model is saved to disk.
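As a companion to the create sketch above, a hedged example of fine-tuning the saved model; the directories and the my_text_dataset object are placeholders, and the remaining arguments keep their documented defaults.
# Hedged sketch: fine-tuning the BERT model created above.
bert$train(
  ml_framework = "pytorch",
  output_dir = "models/my_bert_trained",
  model_dir_path = "models/my_bert",
  text_dataset = my_text_dataset,
  p_mask = 0.15,
  whole_word = TRUE,
  n_epoch = 1,
  batch_size = 12,
  chunk_size = 250,
  sustain_track = FALSE
)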
clone()
The objects of this class are cloneable with this method.
.AIFEBertTransformer$clone(deep = FALSE)
deep
Whether to make a deep clone.
This model uses a WordPiece tokenizer like BERT and can be trained with whole word masking. The transformers library may display a warning, which can be ignored.
Devlin, J., Chang, M.‑W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In J. Burstein, C. Doran, & T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186). Association for Computational Linguistics. doi:10.18653/v1/N19-1423
Hugging Face documentation
https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertForMaskedLM
https://huggingface.co/docs/transformers/model_doc/bert#transformers.TFBertForMaskedLM
Other Transformers for developers: .AIFEBaseTransformer, .AIFEDebertaTransformer, .AIFEFunnelTransformer, .AIFELongformerTransformer, .AIFEMpnetTransformer, .AIFERobertaTransformer, .AIFETrObj
R6 class for creation and training of DeBERTa-V2 transformers

This class has the following methods:
create: creates a new transformer based on DeBERTa-V2.
train: trains and fine-tunes a DeBERTa-V2 model.
New models can be created using the .AIFEDebertaTransformer$create method.
To train the model, pass the directory of the model to the method .AIFEDebertaTransformer$train.
Pre-trained models which can be fine-tuned with this function are available at https://huggingface.co/.
Training of this model makes use of dynamic masking.
aifeducation::.AIFEBaseTransformer -> .AIFEDebertaTransformer
aifeducation::.AIFEBaseTransformer$set_SFC_calculate_vocab()
aifeducation::.AIFEBaseTransformer$set_SFC_check_max_pos_emb()
aifeducation::.AIFEBaseTransformer$set_SFC_create_final_tokenizer()
aifeducation::.AIFEBaseTransformer$set_SFC_create_tokenizer_draft()
aifeducation::.AIFEBaseTransformer$set_SFC_create_transformer_model()
aifeducation::.AIFEBaseTransformer$set_SFC_save_tokenizer_draft()
aifeducation::.AIFEBaseTransformer$set_SFT_create_data_collator()
aifeducation::.AIFEBaseTransformer$set_SFT_cuda_empty_cache()
aifeducation::.AIFEBaseTransformer$set_SFT_load_existing_model()
aifeducation::.AIFEBaseTransformer$set_model_param()
aifeducation::.AIFEBaseTransformer$set_model_temp()
aifeducation::.AIFEBaseTransformer$set_required_SFC()
aifeducation::.AIFEBaseTransformer$set_title()
new()
Creates a new transformer based on DeBERTa-V2 and sets the title.
.AIFEDebertaTransformer$new()
This method returns nothing.
create()
This method creates a transformer configuration based on the DeBERTa-V2 base architecture and a vocabulary based on the SentencePiece tokenizer using the python libraries transformers and tokenizers.
This method adds the following 'dependent' parameters to the base class's inherited params list:
vocab_do_lower_case
num_hidden_layer
.AIFEDebertaTransformer$create( ml_framework = "pytorch", model_dir, text_dataset, vocab_size = 128100, vocab_do_lower_case = FALSE, max_position_embeddings = 512, hidden_size = 1536, num_hidden_layer = 24, num_attention_heads = 24, intermediate_size = 6144, hidden_act = "gelu", hidden_dropout_prob = 0.1, attention_probs_dropout_prob = 0.1, sustain_track = TRUE, sustain_iso_code = NULL, sustain_region = NULL, sustain_interval = 15, trace = TRUE, pytorch_safetensors = TRUE, log_dir = NULL, log_write_interval = 2 )
ml_framework
string
Framework to use for training and inference: ml_framework = "tensorflow" for 'tensorflow', ml_framework = "pytorch" for 'pytorch'.
model_dir
string
Path to the directory where the model should be saved.
text_dataset
Object of class LargeDataSetForText.
vocab_size
int
Size of the vocabulary.
vocab_do_lower_case
bool
TRUE if all words/tokens should be lower case.
max_position_embeddings
int
Number of maximum position embeddings. This parameter also determines the maximum length of a sequence which can be processed with the model.
hidden_size
int
Number of neurons in each layer. This parameter determines the dimensionality of the resulting text embedding.
num_hidden_layer
int
Number of hidden layers.
num_attention_heads
int
Number of attention heads.
intermediate_size
int
Number of neurons in the intermediate layer of the attention mechanism.
hidden_act
string
Name of the activation function.
hidden_dropout_prob
double
Ratio of dropout.
attention_probs_dropout_prob
double
Ratio of dropout for attention probabilities.
sustain_track
bool
If TRUE, energy consumption is tracked during training via the python library codecarbon.
sustain_iso_code
string
ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes.
sustain_region
string
Region within a country. Only available for USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html.
sustain_interval
integer
Interval in seconds for measuring power usage.
trace
bool
TRUE if information about the progress should be printed to the console.
pytorch_safetensors
bool
Only relevant for pytorch models. TRUE: a 'pytorch' model is saved in safetensors format. FALSE (or if 'safetensors' is not available): the model is saved in the standard pytorch format (.bin).
log_dir
Path to the directory where the log files should be saved.
log_write_interval
int
Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_dir is not NULL.
This method does not return an object. Instead, it saves the configuration and vocabulary of the new model to disk.
train()
This method can be used to train or fine-tune a transformer based on the DeBERTa-V2 architecture with the help of the python libraries transformers, datasets, and tokenizers.
.AIFEDebertaTransformer$train( ml_framework = "pytorch", output_dir, model_dir_path, text_dataset, p_mask = 0.15, whole_word = TRUE, val_size = 0.1, n_epoch = 1, batch_size = 12, chunk_size = 250, full_sequences_only = FALSE, min_seq_len = 50, learning_rate = 0.03, n_workers = 1, multi_process = FALSE, sustain_track = TRUE, sustain_iso_code = NULL, sustain_region = NULL, sustain_interval = 15, trace = TRUE, keras_trace = 1, pytorch_trace = 1, pytorch_safetensors = TRUE, log_dir = NULL, log_write_interval = 2 )
ml_framework
string
Framework to use for training and inference: ml_framework = "tensorflow" for 'tensorflow', ml_framework = "pytorch" for 'pytorch'.
output_dir
string
Path to the directory where the final model should be saved. If the directory does not exist, it will be created.
model_dir_path
string
Path to the directory where the original model is stored.
text_dataset
Object of class LargeDataSetForText.
p_mask
double
Ratio that determines the number of words/tokens used for masking.
whole_word
bool
TRUE: whole word masking should be applied. FALSE: token masking is used.
val_size
double
Ratio that determines the amount of token chunks used for validation.
n_epoch
int
Number of epochs for training.
batch_size
int
Size of batches.
chunk_size
int
Size of every chunk for training.
full_sequences_only
bool
TRUE for using only chunks with a sequence length equal to chunk_size.
min_seq_len
int
Only relevant if full_sequences_only = FALSE. Value determines the minimal sequence length included in the training process.
learning_rate
double
Learning rate for the adam optimizer.
n_workers
int
Number of workers. Only relevant if ml_framework = "tensorflow".
multi_process
bool
TRUE if multiple processes should be activated. Only relevant if ml_framework = "tensorflow".
sustain_track
bool
If TRUE, energy consumption is tracked during training via the python library codecarbon.
sustain_iso_code
string
ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes.
sustain_region
string
Region within a country. Only available for USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html.
sustain_interval
integer
Interval in seconds for measuring power usage.
trace
bool
TRUE if information about the progress should be printed to the console.
keras_trace
int
keras_trace = 0: does not print any information about the training process from keras on the console. keras_trace = 1: prints a progress bar. keras_trace = 2: prints one line of information for every epoch. Only relevant if ml_framework = "tensorflow".
pytorch_trace
int
pytorch_trace = 0: does not print any information about the training process from pytorch on the console. pytorch_trace = 1: prints a progress bar.
pytorch_safetensors
bool
Only relevant for pytorch models. TRUE: a 'pytorch' model is saved in safetensors format. FALSE (or if 'safetensors' is not available): the model is saved in the standard pytorch format (.bin).
log_dir
Path to the directory where the log files should be saved.
log_write_interval
int
Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_dir is not NULL.
This method does not return an object. Instead, the trained or fine-tuned model is saved to disk.
clone()
The objects of this class are cloneable with this method.
.AIFEDebertaTransformer$clone(deep = FALSE)
deep
Whether to make a deep clone.
For this model a WordPiece tokenizer is created. The standard implementation of DeBERTa version 2 from HuggingFace uses a SentencePiece tokenizer. Thus, please use AutoTokenizer from the transformers library to work with this model.
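Following the note above, a minimal sketch of loading the saved tokenizer with AutoTokenizer via reticulate; the model path is an illustrative placeholder.
# Minimal sketch, assuming the model was saved to "models/my_deberta":
# load the tokenizer with AutoTokenizer as recommended in the note above.
transformers <- reticulate::import("transformers")
tokenizer <- transformers$AutoTokenizer$from_pretrained("models/my_deberta")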
He, P., Liu, X., Gao, J. & Chen, W. (2020). DeBERTa: Decoding-enhanced BERT with Disentangled Attention. doi:10.48550/arXiv.2006.03654
Hugging Face documentation
https://huggingface.co/docs/transformers/model_doc/deberta-v2
https://huggingface.co/docs/transformers/model_doc/deberta-v2#transformers.DebertaV2ForMaskedLM
https://huggingface.co/docs/transformers/model_doc/deberta-v2#transformers.TFDebertaV2ForMaskedLM
Other Transformers for developers: .AIFEBaseTransformer, .AIFEBertTransformer, .AIFEFunnelTransformer, .AIFELongformerTransformer, .AIFEMpnetTransformer, .AIFERobertaTransformer, .AIFETrObj
R6 class for creation and training of Funnel transformers

This class has the following methods:
create: creates a new transformer based on Funnel.
train: trains and fine-tunes a Funnel model.
New models can be created using the .AIFEFunnelTransformer$create method.
The model is created with separate_cls = TRUE, truncate_seq = TRUE, and pool_q_only = TRUE.
To train the model, pass the directory of the model to the method .AIFEFunnelTransformer$train.
Pre-trained models which can be fine-tuned with this function are available at https://huggingface.co/.
Training of the model makes use of dynamic masking.
aifeducation::.AIFEBaseTransformer -> .AIFEFunnelTransformer
aifeducation::.AIFEBaseTransformer$set_SFC_calculate_vocab()
aifeducation::.AIFEBaseTransformer$set_SFC_check_max_pos_emb()
aifeducation::.AIFEBaseTransformer$set_SFC_create_final_tokenizer()
aifeducation::.AIFEBaseTransformer$set_SFC_create_tokenizer_draft()
aifeducation::.AIFEBaseTransformer$set_SFC_create_transformer_model()
aifeducation::.AIFEBaseTransformer$set_SFC_save_tokenizer_draft()
aifeducation::.AIFEBaseTransformer$set_SFT_create_data_collator()
aifeducation::.AIFEBaseTransformer$set_SFT_cuda_empty_cache()
aifeducation::.AIFEBaseTransformer$set_SFT_load_existing_model()
aifeducation::.AIFEBaseTransformer$set_model_param()
aifeducation::.AIFEBaseTransformer$set_model_temp()
aifeducation::.AIFEBaseTransformer$set_required_SFC()
aifeducation::.AIFEBaseTransformer$set_title()
new()
Creates a new transformer based on Funnel and sets the title.
.AIFEFunnelTransformer$new()
This method returns nothing.
create()
This method creates a transformer configuration based on the Funnel transformer base architecture and a vocabulary based on WordPiece using the python libraries transformers and tokenizers.
This method adds the following 'dependent' parameters to the base class's inherited params list:
vocab_do_lower_case
target_hidden_size
block_sizes
num_decoder_layers
pooling_type
activation_dropout
.AIFEFunnelTransformer$create( ml_framework = "pytorch", model_dir, text_dataset, vocab_size = 30522, vocab_do_lower_case = FALSE, max_position_embeddings = 512, hidden_size = 768, target_hidden_size = 64, block_sizes = c(4, 4, 4), num_attention_heads = 12, intermediate_size = 3072, num_decoder_layers = 2, pooling_type = "mean", hidden_act = "gelu", hidden_dropout_prob = 0.1, attention_probs_dropout_prob = 0.1, activation_dropout = 0, sustain_track = TRUE, sustain_iso_code = NULL, sustain_region = NULL, sustain_interval = 15, trace = TRUE, pytorch_safetensors = TRUE, log_dir = NULL, log_write_interval = 2 )
ml_framework
string
Framework to use for training and inference: ml_framework = "tensorflow" for 'tensorflow', ml_framework = "pytorch" for 'pytorch'.
model_dir
string
Path to the directory where the model should be saved.
text_dataset
Object of class LargeDataSetForText.
vocab_size
int
Size of the vocabulary.
vocab_do_lower_case
bool
TRUE if all words/tokens should be lower case.
max_position_embeddings
int
Number of maximum position embeddings. This parameter also determines the maximum length of a sequence which can be processed with the model.
hidden_size
int
Number of neurons in each layer. This parameter determines the dimensionality of the resulting text embedding.
target_hidden_size
int
Number of neurons in the final layer. This parameter determines the dimensionality of the resulting text embedding.
block_sizes
vector of int
Vector determining the number and size of each block.
num_attention_heads
int
Number of attention heads.
intermediate_size
int
Number of neurons in the intermediate layer of the attention mechanism.
num_decoder_layers
int
Number of decoding layers.
pooling_type
string
Type of pooling. "mean" for pooling with mean, "max" for pooling with maximum values.
hidden_act
string
Name of the activation function.
hidden_dropout_prob
double
Ratio of dropout.
attention_probs_dropout_prob
double
Ratio of dropout for attention probabilities.
activation_dropout
float
Dropout probability between the layers of the feed-forward blocks.
sustain_track
bool
If TRUE, energy consumption is tracked during training via the python library codecarbon.
sustain_iso_code
string
ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes.
sustain_region
string
Region within a country. Only available for USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html.
sustain_interval
integer
Interval in seconds for measuring power usage.
trace
bool
TRUE if information about the progress should be printed to the console.
pytorch_safetensors
bool
Only relevant for pytorch models. TRUE: a 'pytorch' model is saved in safetensors format. FALSE (or if 'safetensors' is not available): the model is saved in the standard pytorch format (.bin).
log_dir
Path to the directory where the log files should be saved.
log_write_interval
int
Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_dir is not NULL.
This method does not return an object. Instead, it saves the configuration and vocabulary of the new model to disk.
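A hedged sketch of a call highlighting the Funnel-specific arguments; the path and the my_text_dataset object are placeholders, and everything else keeps its documented default.
# Hedged sketch: creating a Funnel configuration. The block/decoder arguments
# show the Funnel-specific knobs; my_text_dataset is a LargeDataSetForText.
funnel <- aifeducation:::.AIFEFunnelTransformer$new()
funnel$create(
  model_dir = "models/my_funnel",
  text_dataset = my_text_dataset,
  block_sizes = c(4, 4, 4),   # three blocks of four layers each
  target_hidden_size = 64,    # dimensionality of the final embedding
  num_decoder_layers = 2,
  pooling_type = "mean",
  sustain_track = FALSE
)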
train()
This method can be used to train or fine-tune a transformer based on the Funnel Transformer architecture with the help of the python libraries transformers, datasets, and tokenizers.
.AIFEFunnelTransformer$train( ml_framework = "pytorch", output_dir, model_dir_path, text_dataset, p_mask = 0.15, whole_word = TRUE, val_size = 0.1, n_epoch = 1, batch_size = 12, chunk_size = 250, full_sequences_only = FALSE, min_seq_len = 50, learning_rate = 0.003, n_workers = 1, multi_process = FALSE, sustain_track = TRUE, sustain_iso_code = NULL, sustain_region = NULL, sustain_interval = 15, trace = TRUE, keras_trace = 1, pytorch_trace = 1, pytorch_safetensors = TRUE, log_dir = NULL, log_write_interval = 2 )
ml_framework
string
Framework to use for training and inference: ml_framework = "tensorflow" for 'tensorflow', ml_framework = "pytorch" for 'pytorch'.
output_dir
string
Path to the directory where the final model should be saved. If the directory does not exist, it will be created.
model_dir_path
string
Path to the directory where the original model is stored.
text_dataset
Object of class LargeDataSetForText.
p_mask
double
Ratio that determines the number of words/tokens used for masking.
whole_word
bool
TRUE: whole word masking should be applied. FALSE: token masking is used.
val_size
double
Ratio that determines the amount of token chunks used for validation.
n_epoch
int
Number of epochs for training.
batch_size
int
Size of batches.
chunk_size
int
Size of every chunk for training.
full_sequences_only
bool
TRUE for using only chunks with a sequence length equal to chunk_size.
min_seq_len
int
Only relevant if full_sequences_only = FALSE. Value determines the minimal sequence length included in the training process.
learning_rate
double
Learning rate for the adam optimizer.
n_workers
int
Number of workers. Only relevant if ml_framework = "tensorflow".
multi_process
bool
TRUE if multiple processes should be activated. Only relevant if ml_framework = "tensorflow".
sustain_track
bool
If TRUE, energy consumption is tracked during training via the python library codecarbon.
sustain_iso_code
string
ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes.
sustain_region
string
Region within a country. Only available for USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html.
sustain_interval
integer
Interval in seconds for measuring power usage.
trace
bool
TRUE if information about the progress should be printed to the console.
keras_trace
int
keras_trace = 0: does not print any information about the training process from keras on the console. keras_trace = 1: prints a progress bar. keras_trace = 2: prints one line of information for every epoch. Only relevant if ml_framework = "tensorflow".
pytorch_trace
int
pytorch_trace = 0: does not print any information about the training process from pytorch on the console. pytorch_trace = 1: prints a progress bar.
pytorch_safetensors
bool
Only relevant for pytorch models. TRUE: a 'pytorch' model is saved in safetensors format. FALSE (or if 'safetensors' is not available): the model is saved in the standard pytorch format (.bin).
log_dir
Path to the directory where the log files should be saved.
log_write_interval
int
Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_dir is not NULL.
This method does not return an object. Instead, the trained or fine-tuned model is saved to disk.
clone()
The objects of this class are cloneable with this method.
.AIFEFunnelTransformer$clone(deep = FALSE)
deep
Whether to make a deep clone.
The model uses a configuration with truncate_seq = TRUE to avoid implementation problems with tensorflow.
This model uses a WordPiece tokenizer like BERT and can be trained with whole word masking. The transformers library may display a warning, which can be ignored.
Dai, Z., Lai, G., Yang, Y. & Le, Q. V. (2020). Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing. doi:10.48550/arXiv.2006.03236
Hugging Face documentation
https://huggingface.co/docs/transformers/model_doc/funnel#funnel-transformer
https://huggingface.co/docs/transformers/model_doc/funnel#transformers.FunnelModel
https://huggingface.co/docs/transformers/model_doc/funnel#transformers.TFFunnelModel
Other Transformers for developers: .AIFEBaseTransformer, .AIFEBertTransformer, .AIFEDebertaTransformer, .AIFELongformerTransformer, .AIFEMpnetTransformer, .AIFERobertaTransformer, .AIFETrObj
R6 class for creation and training of Longformer transformers

This class has the following methods:
create: creates a new transformer based on Longformer.
train: trains and fine-tunes a Longformer model.
New models can be created using the .AIFELongformerTransformer$create method.
To train the model, pass the directory of the model to the method .AIFELongformerTransformer$train.
Pre-trained models which can be fine-tuned with this function are available at https://huggingface.co/.
Training of this model makes use of dynamic masking.
aifeducation::.AIFEBaseTransformer -> .AIFELongformerTransformer
aifeducation::.AIFEBaseTransformer$set_SFC_calculate_vocab()
aifeducation::.AIFEBaseTransformer$set_SFC_check_max_pos_emb()
aifeducation::.AIFEBaseTransformer$set_SFC_create_final_tokenizer()
aifeducation::.AIFEBaseTransformer$set_SFC_create_tokenizer_draft()
aifeducation::.AIFEBaseTransformer$set_SFC_create_transformer_model()
aifeducation::.AIFEBaseTransformer$set_SFC_save_tokenizer_draft()
aifeducation::.AIFEBaseTransformer$set_SFT_create_data_collator()
aifeducation::.AIFEBaseTransformer$set_SFT_cuda_empty_cache()
aifeducation::.AIFEBaseTransformer$set_SFT_load_existing_model()
aifeducation::.AIFEBaseTransformer$set_model_param()
aifeducation::.AIFEBaseTransformer$set_model_temp()
aifeducation::.AIFEBaseTransformer$set_required_SFC()
aifeducation::.AIFEBaseTransformer$set_title()
new()
Creates a new transformer based on Longformer and sets the title.
.AIFELongformerTransformer$new()
This method returns nothing.
create()
This method creates a transformer configuration based on the Longformer base architecture and a vocabulary based on the Byte-Pair Encoding (BPE) tokenizer using the python libraries transformers and tokenizers.
This method adds the following 'dependent' parameters to the base class's inherited params list:
add_prefix_space
trim_offsets
num_hidden_layer
attention_window
.AIFELongformerTransformer$create( ml_framework = "pytorch", model_dir, text_dataset, vocab_size = 30522, add_prefix_space = FALSE, trim_offsets = TRUE, max_position_embeddings = 512, hidden_size = 768, num_hidden_layer = 12, num_attention_heads = 12, intermediate_size = 3072, hidden_act = "gelu", hidden_dropout_prob = 0.1, attention_probs_dropout_prob = 0.1, attention_window = 512, sustain_track = TRUE, sustain_iso_code = NULL, sustain_region = NULL, sustain_interval = 15, trace = TRUE, pytorch_safetensors = TRUE, log_dir = NULL, log_write_interval = 2 )
ml_framework
string
Framework to use for training and inference: ml_framework = "tensorflow" for 'tensorflow', ml_framework = "pytorch" for 'pytorch'.
model_dir
string
Path to the directory where the model should be saved.
text_dataset
Object of class LargeDataSetForText.
vocab_size
int
Size of the vocabulary.
add_prefix_space
bool
TRUE if an additional space should be inserted before the leading words.
trim_offsets
bool
TRUE trims the whitespaces from the produced offsets.
max_position_embeddings
int
Number of maximum position embeddings. This parameter also determines the maximum length of a sequence which can be processed with the model.
hidden_size
int
Number of neurons in each layer. This parameter determines the dimensionality of the resulting text embedding.
num_hidden_layer
int
Number of hidden layers.
num_attention_heads
int
Number of attention heads.
intermediate_size
int
Number of neurons in the intermediate layer of the attention mechanism.
hidden_act
string
Name of the activation function.
hidden_dropout_prob
double
Ratio of dropout.
attention_probs_dropout_prob
double
Ratio of dropout for attention probabilities.
attention_window
int
Size of the window around each token for the attention mechanism in every layer.
sustain_track
bool
If TRUE, energy consumption is tracked during training via the python library codecarbon.
sustain_iso_code
string
ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes.
sustain_region
string
Region within a country. Only available for USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html.
sustain_interval
integer
Interval in seconds for measuring power usage.
trace
bool
TRUE if information about the progress should be printed to the console.
pytorch_safetensors
bool
Only relevant for pytorch models. TRUE: a 'pytorch' model is saved in safetensors format. FALSE (or if 'safetensors' is not available): the model is saved in the standard pytorch format (.bin).
log_dir
Path to the directory where the log files should be saved.
log_write_interval
int
Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_dir is not NULL.
This method does not return an object. Instead, it saves the configuration and vocabulary of the new model to disk.
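A hedged sketch of a call highlighting the Longformer-specific arguments for long documents; the path and the my_text_dataset object are placeholders.
# Hedged sketch: creating a Longformer configuration for longer sequences.
# Note the Longformer-specific max_position_embeddings and attention_window.
longformer <- aifeducation:::.AIFELongformerTransformer$new()
longformer$create(
  model_dir = "models/my_longformer",
  text_dataset = my_text_dataset,
  max_position_embeddings = 4096,  # allow long documents
  attention_window = 512,          # local attention window per token
  sustain_track = FALSE
)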
train()
This method can be used to train or fine-tune a transformer based on the Longformer architecture with the help of the python libraries transformers, datasets, and tokenizers.
.AIFELongformerTransformer$train( ml_framework = "pytorch", output_dir, model_dir_path, text_dataset, p_mask = 0.15, val_size = 0.1, n_epoch = 1, batch_size = 12, chunk_size = 250, full_sequences_only = FALSE, min_seq_len = 50, learning_rate = 0.03, n_workers = 1, multi_process = FALSE, sustain_track = TRUE, sustain_iso_code = NULL, sustain_region = NULL, sustain_interval = 15, trace = TRUE, keras_trace = 1, pytorch_trace = 1, pytorch_safetensors = TRUE, log_dir = NULL, log_write_interval = 2 )
ml_framework
string
Framework to use for training and inference: ml_framework = "tensorflow" for 'tensorflow', ml_framework = "pytorch" for 'pytorch'.
output_dir
string
Path to the directory where the final model should be saved. If the directory does not exist, it will be created.
model_dir_path
string
Path to the directory where the original model is stored.
text_dataset
Object of class LargeDataSetForText.
p_mask
double
Ratio that determines the number of words/tokens used for masking.
val_size
double
Ratio that determines the amount of token chunks used for validation.
n_epoch
int
Number of epochs for training.
batch_size
int
Size of batches.
chunk_size
int
Size of every chunk for training.
full_sequences_only
bool
TRUE for using only chunks with a sequence length equal to chunk_size.
min_seq_len
int
Only relevant if full_sequences_only = FALSE. Value determines the minimal sequence length included in the training process.
learning_rate
double
Learning rate for the adam optimizer.
n_workers
int
Number of workers. Only relevant if ml_framework = "tensorflow".
multi_process
bool
TRUE if multiple processes should be activated. Only relevant if ml_framework = "tensorflow".
sustain_track
bool
If TRUE, energy consumption is tracked during training via the python library codecarbon.
sustain_iso_code
string
ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes.
sustain_region
string
Region within a country. Only available for USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html.
sustain_interval
integer
Interval in seconds for measuring power usage.
trace
bool
TRUE if information about the progress should be printed to the console.
keras_trace
int
keras_trace = 0: does not print any information about the training process from keras on the console. keras_trace = 1: prints a progress bar. keras_trace = 2: prints one line of information for every epoch. Only relevant if ml_framework = "tensorflow".
pytorch_trace
int
pytorch_trace = 0: does not print any information about the training process from pytorch on the console. pytorch_trace = 1: prints a progress bar.
pytorch_safetensors
bool
Only relevant for pytorch models. TRUE: a 'pytorch' model is saved in safetensors format. FALSE (or if 'safetensors' is not available): the model is saved in the standard pytorch format (.bin).
log_dir
Path to the directory where the log files should be saved.
log_write_interval
int
Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_dir is not NULL.
This method does not return an object. Instead, the trained or fine-tuned model is saved to disk.
clone()
The objects of this class are cloneable with this method.
.AIFELongformerTransformer$clone(deep = FALSE)
deep
Whether to make a deep clone.
Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The Long-Document Transformer. doi:10.48550/arXiv.2004.05150
Hugging Face Documentation
https://huggingface.co/docs/transformers/model_doc/longformer
https://huggingface.co/docs/transformers/model_doc/longformer#transformers.LongformerModel
https://huggingface.co/docs/transformers/model_doc/longformer#transformers.TFLongformerModel
Other Transformers for developers: .AIFEBaseTransformer, .AIFEBertTransformer, .AIFEDebertaTransformer, .AIFEFunnelTransformer, .AIFEMpnetTransformer, .AIFERobertaTransformer, .AIFETrObj
R6 class for creation and training of MPNet transformers

This class has the following methods:
create: creates a new transformer based on MPNet.
train: trains and fine-tunes an MPNet model.
New models can be created using the .AIFEMpnetTransformer$create method.
To train the model, pass the directory of the model to the method .AIFEMpnetTransformer$train.
aifeducation::.AIFEBaseTransformer -> .AIFEMpnetTransformer
special_tokens_list
list
List for special tokens with the following elements:
cls - CLS token representation (<s>)
pad - pad token representation (<pad>)
sep - sep token representation (</s>)
unk - unk token representation (<unk>)
mask - mask token representation (<mask>)
aifeducation::.AIFEBaseTransformer$set_SFC_calculate_vocab()
aifeducation::.AIFEBaseTransformer$set_SFC_check_max_pos_emb()
aifeducation::.AIFEBaseTransformer$set_SFC_create_final_tokenizer()
aifeducation::.AIFEBaseTransformer$set_SFC_create_tokenizer_draft()
aifeducation::.AIFEBaseTransformer$set_SFC_create_transformer_model()
aifeducation::.AIFEBaseTransformer$set_SFC_save_tokenizer_draft()
aifeducation::.AIFEBaseTransformer$set_SFT_create_data_collator()
aifeducation::.AIFEBaseTransformer$set_SFT_cuda_empty_cache()
aifeducation::.AIFEBaseTransformer$set_SFT_load_existing_model()
aifeducation::.AIFEBaseTransformer$set_model_param()
aifeducation::.AIFEBaseTransformer$set_model_temp()
aifeducation::.AIFEBaseTransformer$set_required_SFC()
aifeducation::.AIFEBaseTransformer$set_title()
new()
Creates a new transformer based on MPNet and sets the title.
.AIFEMpnetTransformer$new()
This method returns nothing.
create()
This method creates a transformer configuration based on the MPNet base architecture.
This method adds the following 'dependent' parameters to the base class's inherited params list:
vocab_do_lower_case
num_hidden_layer
.AIFEMpnetTransformer$create( ml_framework = "pytorch", model_dir, text_dataset, vocab_size = 30522, vocab_do_lower_case = FALSE, max_position_embeddings = 512, hidden_size = 768, num_hidden_layer = 12, num_attention_heads = 12, intermediate_size = 3072, hidden_act = "gelu", hidden_dropout_prob = 0.1, attention_probs_dropout_prob = 0.1, sustain_track = FALSE, sustain_iso_code = NULL, sustain_region = NULL, sustain_interval = 15, trace = TRUE, pytorch_safetensors = TRUE, log_dir = NULL, log_write_interval = 2 )
ml_framework
string
Framework to use for training and inference: ml_framework = "tensorflow" for 'tensorflow', ml_framework = "pytorch" for 'pytorch'.
model_dir
string
Path to the directory where the model should be saved.
text_dataset
Object of class LargeDataSetForText.
vocab_size
int
Size of the vocabulary.
vocab_do_lower_case
bool
TRUE if all words/tokens should be lower case.
max_position_embeddings
int
Number of maximum position embeddings. This parameter also determines the maximum length of a sequence which can be processed with the model.
hidden_size
int
Number of neurons in each layer. This parameter determines the dimensionality of the resulting text embedding.
num_hidden_layer
int
Number of hidden layers.
num_attention_heads
int
Number of attention heads.
intermediate_size
int
Number of neurons in the intermediate layer of the attention mechanism.
hidden_act
string
Name of the activation function.
hidden_dropout_prob
double
Ratio of dropout.
attention_probs_dropout_prob
double
Ratio of dropout for attention probabilities.
sustain_track
bool
If TRUE, energy consumption is tracked during training via the python library codecarbon.
sustain_iso_code
string
ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes.
sustain_region
string
Region within a country. Only available for USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html.
sustain_interval
integer
Interval in seconds for measuring power usage.
trace
bool
TRUE if information about the progress should be printed to the console.
pytorch_safetensors
bool
Only relevant for pytorch models. TRUE: a 'pytorch' model is saved in safetensors format. FALSE (or if 'safetensors' is not available): the model is saved in the standard pytorch format (.bin).
log_dir
Path to the directory where the log files should be saved.
log_write_interval
int
Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_dir is not NULL.
This method does not return an object. Instead, it saves the configuration and vocabulary of the new model to disk.
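A minimal sketch of a create() call; the paths and the LargeDataSetForText object my_dataset are placeholders, not part of the package:
# Assumption: 'my_dataset' is a LargeDataSetForText holding the raw corpus.
mpnet <- aife_transformer_maker$make(AIFETrType$mpnet)
mpnet$create(
  ml_framework = "pytorch",
  model_dir = "models/my_mpnet",   # placeholder path
  text_dataset = my_dataset,
  trace = TRUE
)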
train()
This method can be used to train or fine-tune a transformer based on the MPNet architecture with the help of the python libraries transformers, datasets, and tokenizers.
This method adds the following 'dependent' parameter to the base class's inherited params
list:
p_perm
.AIFEMpnetTransformer$train(
  ml_framework = "pytorch",
  output_dir,
  model_dir_path,
  text_dataset,
  p_mask = 0.15,
  p_perm = 0.15,
  whole_word = TRUE,
  val_size = 0.1,
  n_epoch = 1,
  batch_size = 12,
  chunk_size = 250,
  full_sequences_only = FALSE,
  min_seq_len = 50,
  learning_rate = 0.003,
  n_workers = 1,
  multi_process = FALSE,
  sustain_track = FALSE,
  sustain_iso_code = NULL,
  sustain_region = NULL,
  sustain_interval = 15,
  trace = TRUE,
  keras_trace = 1,
  pytorch_trace = 1,
  pytorch_safetensors = TRUE,
  log_dir = NULL,
  log_write_interval = 2
)
ml_framework
string
Framework to use for training and inference.
ml_framework = "tensorflow"
: for 'tensorflow'.
ml_framework = "pytorch"
: for 'pytorch'.
output_dir
string
Path to the directory where the final model should be saved. If the directory does not exist, it will be
created.
model_dir_path
string
Path to the directory where the original model is stored.
text_dataset
Object of class LargeDataSetForText.
p_mask
double
Ratio that determines the number of words/tokens used for masking.
p_perm
double
Ratio that determines the number of words/tokens used for permutation.
whole_word
bool
TRUE
: whole word masking should be applied.
FALSE
: token masking is used.
val_size
double
Ratio that determines the amount of token chunks used for validation.
n_epoch
int
Number of epochs for training.
batch_size
int
Size of batches.
chunk_size
int
Size of every chunk for training.
full_sequences_only
bool
TRUE
for using only chunks with a sequence length equal to chunk_size
.
min_seq_len
int
Only relevant if full_sequences_only = FALSE
. Value determines the minimal sequence length included in
training process.
learning_rate
double
Learning rate for adam optimizer.
n_workers
int
Number of workers. Only relevant if ml_framework = "tensorflow"
.
multi_process
bool
TRUE
if multiple processes should be activated. Only relevant if ml_framework = "tensorflow"
.
sustain_track
bool
If TRUE
energy consumption is tracked during training via the python library codecarbon.
sustain_iso_code
string
ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A
list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes.
sustain_region
string
Region within a country. Only available for USA and Canada. See the documentation of codecarbon for more
information https://mlco2.github.io/codecarbon/parameters.html.
sustain_interval
integer
Interval in seconds for measuring power usage.
trace
bool
TRUE
if information about the progress should be printed to the console.
keras_trace
int
keras_trace = 0
: does not print any information about the training process from keras on the console.
keras_trace = 1
: prints a progress bar.
keras_trace = 2
: prints one line of information for every epoch. Only relevant if ml_framework = "tensorflow"
.
pytorch_trace
int
pytorch_trace = 0
: does not print any information about the training process from pytorch on the console.
pytorch_trace = 1
: prints a progress bar.
pytorch_safetensors
bool
Only relevant for pytorch models.
TRUE
: a 'pytorch' model is saved in safetensors format.
FALSE
(or 'safetensors' is not available): model is saved in the standard pytorch format (.bin).
log_dir
Path to the directory where the log files should be saved.
log_write_interval
int
Time in seconds determining the interval in which the logger should try to update the log files. Only relevant
if log_dir
is not NULL
.
This method does not return an object. Instead, the trained or fine-tuned model is saved to disk.
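A minimal sketch of a train() call continuing the example above; all paths and my_dataset are placeholders:
# Assumption: 'models/my_mpnet' contains the model created with create().
mpnet$train(
  ml_framework = "pytorch",
  output_dir = "models/my_mpnet_trained",
  model_dir_path = "models/my_mpnet",
  text_dataset = my_dataset,
  p_mask = 0.15,
  p_perm = 0.15,
  n_epoch = 1,
  batch_size = 12
)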
clone()
The objects of this class are cloneable with this method.
.AIFEMpnetTransformer$clone(deep = FALSE)
deep
Whether to make a deep clone.
Using this class with tensorflow is not supported. The only supported framework is pytorch.
Song,K., Tan, X., Qin, T., Lu, J. & Liu, T.-Y. (2020). MPNet: Masked and Permuted Pre-training for Language Understanding. doi:10.48550/arXiv.2004.09297
Hugging Face documentation
https://huggingface.co/docs/transformers/model_doc/mpnet#transformers.MPNetForMaskedLM
https://huggingface.co/docs/transformers/model_doc/mpnet#transformers.TFMPNetForMaskedLM
Other Transformers for developers:
.AIFEBaseTransformer
,
.AIFEBertTransformer
,
.AIFEDebertaTransformer
,
.AIFEFunnelTransformer
,
.AIFELongformerTransformer
,
.AIFERobertaTransformer
,
.AIFETrObj
R6 class for creation and training of RoBERTa transformers
This class has the following methods:
create
: creates a new transformer based on RoBERTa
.
train
: trains and fine-tunes a RoBERTa
model.
New models can be created using the .AIFERobertaTransformer$create
method.
To train the model, pass the directory of the model to the method .AIFERobertaTransformer$train
.
Pre-Trained models which can be fine-tuned with this function are available at https://huggingface.co/.
Training of this model makes use of dynamic masking.
aifeducation::.AIFEBaseTransformer
-> .AIFERobertaTransformer
aifeducation::.AIFEBaseTransformer$set_SFC_calculate_vocab()
aifeducation::.AIFEBaseTransformer$set_SFC_check_max_pos_emb()
aifeducation::.AIFEBaseTransformer$set_SFC_create_final_tokenizer()
aifeducation::.AIFEBaseTransformer$set_SFC_create_tokenizer_draft()
aifeducation::.AIFEBaseTransformer$set_SFC_create_transformer_model()
aifeducation::.AIFEBaseTransformer$set_SFC_save_tokenizer_draft()
aifeducation::.AIFEBaseTransformer$set_SFT_create_data_collator()
aifeducation::.AIFEBaseTransformer$set_SFT_cuda_empty_cache()
aifeducation::.AIFEBaseTransformer$set_SFT_load_existing_model()
aifeducation::.AIFEBaseTransformer$set_model_param()
aifeducation::.AIFEBaseTransformer$set_model_temp()
aifeducation::.AIFEBaseTransformer$set_required_SFC()
aifeducation::.AIFEBaseTransformer$set_title()
new()
Creates a new transformer based on RoBERTa
and sets the title.
.AIFERobertaTransformer$new()
This method returns nothing.
create()
This method creates a transformer configuration based on the RoBERTa
base architecture and a
vocabulary based on Byte-Pair Encoding
(BPE) tokenizer using the python transformers
and tokenizers
libraries.
This method adds the following 'dependent' parameters to the base class's inherited params
list:
add_prefix_space
trim_offsets
num_hidden_layer
.AIFERobertaTransformer$create(
  ml_framework = "pytorch",
  model_dir,
  text_dataset,
  vocab_size = 30522,
  add_prefix_space = FALSE,
  trim_offsets = TRUE,
  max_position_embeddings = 512,
  hidden_size = 768,
  num_hidden_layer = 12,
  num_attention_heads = 12,
  intermediate_size = 3072,
  hidden_act = "gelu",
  hidden_dropout_prob = 0.1,
  attention_probs_dropout_prob = 0.1,
  sustain_track = TRUE,
  sustain_iso_code = NULL,
  sustain_region = NULL,
  sustain_interval = 15,
  trace = TRUE,
  pytorch_safetensors = TRUE,
  log_dir = NULL,
  log_write_interval = 2
)
ml_framework
string
Framework to use for training and inference.
ml_framework = "tensorflow"
: for 'tensorflow'.
ml_framework = "pytorch"
: for 'pytorch'.
model_dir
string
Path to the directory where the model should be saved.
text_dataset
Object of class LargeDataSetForText.
vocab_size
int
Size of the vocabulary.
add_prefix_space
bool
TRUE
if an additional space should be inserted to the leading words.
trim_offsets
bool
TRUE
trims the whitespaces from the produced offsets.
max_position_embeddings
int
Number of maximum position embeddings. This parameter also determines the maximum length of a sequence which
can be processed with the model.
hidden_size
int
Number of neurons in each layer. This parameter determines the dimensionality of the resulting text
embedding.
num_hidden_layer
int
Number of hidden layers.
num_attention_heads
int
Number of attention heads.
intermediate_size
int
Number of neurons in the intermediate layer of the attention mechanism.
hidden_act
string
Name of the activation function.
hidden_dropout_prob
double
Ratio of dropout.
attention_probs_dropout_prob
double
Ratio of dropout for attention probabilities.
sustain_track
bool
If TRUE
energy consumption is tracked during training via the python library codecarbon.
sustain_iso_code
string
ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A
list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes.
sustain_region
string
Region within a country. Only available for USA and Canada. See the documentation of codecarbon for more
information https://mlco2.github.io/codecarbon/parameters.html.
sustain_interval
integer
Interval in seconds for measuring power usage.
trace
bool
TRUE
if information about the progress should be printed to the console.
pytorch_safetensors
bool
Only relevant for pytorch models.
TRUE
: a 'pytorch' model is saved in safetensors format.
FALSE
(or 'safetensors' is not available): model is saved in the standard pytorch format (.bin).
log_dir
Path to the directory where the log files should be saved.
log_write_interval
int
Time in seconds determining the interval in which the logger should try to update the log files. Only relevant
if log_dir
is not NULL
.
This method does not return an object. Instead, it saves the configuration and vocabulary of the new model to disk.
train()
This method can be used to train or fine-tune a transformer based on the RoBERTa architecture with the help of the python libraries transformers, datasets, and tokenizers.
.AIFERobertaTransformer$train(
  ml_framework = "pytorch",
  output_dir,
  model_dir_path,
  text_dataset,
  p_mask = 0.15,
  val_size = 0.1,
  n_epoch = 1,
  batch_size = 12,
  chunk_size = 250,
  full_sequences_only = FALSE,
  min_seq_len = 50,
  learning_rate = 0.03,
  n_workers = 1,
  multi_process = FALSE,
  sustain_track = TRUE,
  sustain_iso_code = NULL,
  sustain_region = NULL,
  sustain_interval = 15,
  trace = TRUE,
  keras_trace = 1,
  pytorch_trace = 1,
  pytorch_safetensors = TRUE,
  log_dir = NULL,
  log_write_interval = 2
)
ml_framework
string
Framework to use for training and inference.
ml_framework = "tensorflow"
: for 'tensorflow'.
ml_framework = "pytorch"
: for 'pytorch'.
output_dir
string
Path to the directory where the final model should be saved. If the directory does not exist, it will be
created.
model_dir_path
string
Path to the directory where the original model is stored.
text_dataset
Object of class LargeDataSetForText.
p_mask
double
Ratio that determines the number of words/tokens used for masking.
val_size
double
Ratio that determines the amount of token chunks used for validation.
n_epoch
int
Number of epochs for training.
batch_size
int
Size of batches.
chunk_size
int
Size of every chunk for training.
full_sequences_only
bool
TRUE
for using only chunks with a sequence length equal to chunk_size
.
min_seq_len
int
Only relevant if full_sequences_only = FALSE
. Value determines the minimal sequence length included in
training process.
learning_rate
double
Learning rate for adam optimizer.
n_workers
int
Number of workers. Only relevant if ml_framework = "tensorflow"
.
multi_process
bool
TRUE
if multiple processes should be activated. Only relevant if ml_framework = "tensorflow"
.
sustain_track
bool
If TRUE
energy consumption is tracked during training via the python library codecarbon.
sustain_iso_code
string
ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A
list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes.
sustain_region
string
Region within a country. Only available for USA and Canada. See the documentation of codecarbon for more
information https://mlco2.github.io/codecarbon/parameters.html.
sustain_interval
integer
Interval in seconds for measuring power usage.
trace
bool
TRUE
if information about the progress should be printed to the console.
keras_trace
int
keras_trace = 0
: does not print any information about the training process from keras on the console.
keras_trace = 1
: prints a progress bar.
keras_trace = 2
: prints one line of information for every epoch. Only relevant if ml_framework = "tensorflow"
.
pytorch_trace
int
pytorch_trace = 0
: does not print any information about the training process from pytorch on the console.
pytorch_trace = 1
: prints a progress bar.
pytorch_safetensors
bool
Only relevant for pytorch models.
TRUE
: a 'pytorch' model is saved in safetensors format.
FALSE
(or 'safetensors' is not available): model is saved in the standard pytorch format (.bin).
log_dir
Path to the directory where the log files should be saved.
log_write_interval
int
Time in seconds determining the interval in which the logger should try to update the log files. Only relevant
if log_dir
is not NULL
.
This method does not return an object. Instead, the trained or fine-tuned model is saved to disk.
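A minimal sketch combining create() and train() for RoBERTa; paths and my_dataset are placeholders, and sustainability tracking is switched off here because sustain_iso_code would otherwise have to be set:
# Assumption: 'my_dataset' is a LargeDataSetForText holding the raw corpus.
roberta <- aife_transformer_maker$make(AIFETrType$roberta)
roberta$create(
  model_dir = "models/my_roberta",           # placeholder path
  text_dataset = my_dataset,
  sustain_track = FALSE
)
roberta$train(
  output_dir = "models/my_roberta_trained",  # placeholder path
  model_dir_path = "models/my_roberta",
  text_dataset = my_dataset,
  sustain_track = FALSE
)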
clone()
The objects of this class are cloneable with this method.
.AIFERobertaTransformer$clone(deep = FALSE)
deep
Whether to make a deep clone.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. doi:10.48550/arXiv.1907.11692
Hugging Face Documentation
https://huggingface.co/docs/transformers/model_doc/roberta#transformers.RobertaModel
https://huggingface.co/docs/transformers/model_doc/roberta#transformers.TFRobertaModel
Other Transformers for developers:
.AIFEBaseTransformer
,
.AIFEBertTransformer
,
.AIFEDebertaTransformer
,
.AIFEFunnelTransformer
,
.AIFELongformerTransformer
,
.AIFEMpnetTransformer
,
.AIFETrObj
R6 object of the AIFETransformerMaker class
Object for creating transformers of different types. See the AIFETransformerMaker class for details.
aife_transformer_maker
aife_transformer_maker
An object of class AIFETransformerMaker
(inherits from R6
) of length 3.
Other Transformer:
AIFETrType
,
AIFETransformerMaker
# Use the 'make' method of the 'aifeducation::aife_transformer_maker' object.
# Pass a string with the type of transformer.
# Allowed types are "bert", "deberta_v2", "funnel", etc. See the
# aifeducation::AIFETrType list.
my_bert <- aife_transformer_maker$make("bert")

# Or use elements of the 'aifeducation::AIFETrType' list
my_longformer <- aife_transformer_maker$make(AIFETrType$longformer)

# Run the 'create' or 'train' methods of the transformer in order to create a
# new transformer or train the newly created one, respectively
# my_bert$create(...)
# my_bert$train(...)
# my_longformer$create(...)
# my_longformer$train(...)
Abstract class for all models that do not rely on the python library 'transformers'.
Objects of this class contain fields and methods used in several other classes in 'ai for education'. This class is not designed for direct application and should only be used by developers.
model
('tensorflow_model' or 'pytorch_model')
Field for storing the 'tensorflow' or 'pytorch' model after loading.
model_config
('list()')
List for storing information about the configuration of the model.
last_training
('list()')
List for storing the history, the configuration, and the results of the last
training. This information will be overwritten if a new training is started.
last_training$start_time
: Time point when training started.
last_training$learning_time
: Duration of the training process.
last_training$finish_time
: Time when the last training finished.
last_training$history
: History of the last training.
last_training$data
: Object of class table
storing the initial frequencies of the passed data.
last_training$config
: List storing the configuration used for the last training.
get_model_info()
Method for requesting the model information.
AIFEBaseModel$get_model_info()
list
of all relevant model information.
get_text_embedding_model()
Method for requesting the text embedding model information.
AIFEBaseModel$get_text_embedding_model()
list
of all relevant model information on the text embedding model underlying the model.
set_publication_info()
Method for setting publication information of the model.
AIFEBaseModel$set_publication_info(authors, citation, url = NULL)
authors
List of authors.
citation
Free text citation.
url
URL of a corresponding homepage.
Function does not return a value. It is used for setting the private members for publication information.
get_publication_info()
Method for requesting the bibliographic information of the model.
AIFEBaseModel$get_publication_info()
list
with all saved bibliographic information.
set_model_license()
Method for setting the license of the model.
AIFEBaseModel$set_model_license(license = "CC BY")
license
string
containing the abbreviation of the license or the license text.
Function does not return a value. It is used for setting the private member for the software license of the model.
get_model_license()
Method for getting the license of the model.
AIFEBaseModel$get_model_license()
license
string
containing the abbreviation of the license or the license text.
string
representing the license for the model.
set_documentation_license()
Method for setting the license of the model's documentation.
AIFEBaseModel$set_documentation_license(license = "CC BY")
license
string
containing the abbreviation of the license or the license text.
Function does not return a value. It is used for setting the private member for the documentation license of the model.
get_documentation_license()
Method for getting the license of the model's documentation.
AIFEBaseModel$get_documentation_license()
license
string
containing the abbreviation of the license or the license text.
Returns the license as a string
.
set_model_description()
Method for setting a description of the model.
AIFEBaseModel$set_model_description(
  eng = NULL,
  native = NULL,
  abstract_eng = NULL,
  abstract_native = NULL,
  keywords_eng = NULL,
  keywords_native = NULL
)
eng
string
A text describing the training, its theoretical and empirical background, and output in
English.
native
string
A text describing the training, its theoretical and empirical background, and output in
the native language of the model.
abstract_eng
string
A text providing a summary of the description in English.
abstract_native
string
A text providing a summary of the description in the native language of the
model.
keywords_eng
vector
of keywords in English.
keywords_native
vector
of keywords in the native language of the model.
Function does not return a value. It is used for setting the private members for the description of the model.
get_model_description()
Method for requesting the model description.
AIFEBaseModel$get_model_description()
list
with the description of the classifier in English and the native language.
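A brief sketch of documenting a model with these setters; my_model stands for an object of a concrete subclass of AIFEBaseModel and all values are placeholders:
# Assumption: 'my_model' is an object of a concrete subclass of AIFEBaseModel.
my_model$set_publication_info(
  authors = list("Jane Doe"),                     # placeholder author
  citation = "Doe, J. (2024). Example model.",
  url = "https://example.org"
)
my_model$set_model_license("CC BY")
my_model$set_model_description(
  eng = "A short description of the training and output in English.",
  keywords_eng = c("education", "classification")
)
my_model$get_model_description()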
save()
Method for saving a model.
AIFEBaseModel$save(dir_path, folder_name)
dir_path
string
Path of the directory where the model should be saved.
folder_name
string
Name of the folder that should be created within the directory.
Function does not return a value. It saves the model to disk.
load()
Method for importing a model.
AIFEBaseModel$load(dir_path)
dir_path
string
Path of the directory where the model is saved.
Function does not return a value. It is used to load the weights of a model.
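A brief sketch of saving and re-loading a model; the paths and my_model are placeholders:
# Save the model into 'models/my_model' ...
my_model$save(dir_path = "models", folder_name = "my_model")
# ... and load its weights again later.
my_model$load(dir_path = "models/my_model")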
get_package_versions()
Method for requesting a summary of the R and python packages' versions used for creating the model.
AIFEBaseModel$get_package_versions()
Returns a list
containing the versions of the relevant R and python packages.
get_sustainability_data()
Method for requesting a summary of tracked energy consumption during training and an estimate of the resulting CO2 equivalents in kg.
AIFEBaseModel$get_sustainability_data()
Returns a list
containing the tracked energy consumption, CO2 equivalents in kg, information on the
tracker used, and technical information on the training infrastructure.
get_ml_framework()
Method for requesting the machine learning framework used for the model.
AIFEBaseModel$get_ml_framework()
Returns a string
describing the machine learning framework used for the classifier.
get_text_embedding_model_name()
Method for requesting the name (unique id) of the underlying text embedding model.
AIFEBaseModel$get_text_embedding_model_name()
Returns a string
describing name of the text embedding model.
check_embedding_model()
Method for checking if the provided text embeddings are created with the same TextEmbeddingModel as the model.
AIFEBaseModel$check_embedding_model(text_embeddings)
text_embeddings
Object of class EmbeddedText or LargeDataSetForTextEmbeddings.
TRUE if the underlying TextEmbeddingModel is the same. FALSE if the models differ.
count_parameter()
Method for counting the trainable parameters of a model.
AIFEBaseModel$count_parameter()
Returns the number of trainable parameters of the model.
is_configured()
Method for checking if the model was successfully configured. An object can only be used if this
value is TRUE
.
AIFEBaseModel$is_configured()
bool
TRUE
if the model is fully configured. FALSE
if not.
get_private()
Method for requesting all private fields and methods. Used for loading and updating an object.
AIFEBaseModel$get_private()
Returns a list
with all private fields and methods.
get_all_fields()
Return all fields.
AIFEBaseModel$get_all_fields()
Method returns a list
containing all public and private fields
of the object.
clone()
The objects of this class are cloneable with this method.
AIFEBaseModel$clone(deep = FALSE)
deep
Whether to make a deep clone.
R6 class for transformer creation
This class was developed to make the creation of transformers easier for users. Pass the transformer's type to the make method to get the desired transformer. Then run the create and/or train methods of the new transformer.
The already created aife_transformer_maker object of this class can be used.
See p.3 Transformer Maker in Transformers for Developers for details.
See .AIFEBaseTransformer class for details.
make()
Creates a new transformer with the passed type.
AIFETransformerMaker$make(type)
type
string
A type of the new transformer. Allowed types are bert, roberta, deberta_v2, funnel, longformer, mpnet. See
AIFETrType list.
On success, a new transformer; otherwise an error (the passed type is invalid).
clone()
The objects of this class are cloneable with this method.
AIFETransformerMaker$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other Transformer:
AIFETrType
,
aife_transformer_maker
# Create a transformer maker
tr_maker <- AIFETransformerMaker$new()

# Use the 'make' method of the 'tr_maker' object.
# Pass a string with the type of transformer.
# Allowed types are "bert", "deberta_v2", "funnel", etc. See the
# aifeducation::AIFETrType list.
my_bert <- tr_maker$make("bert")

# Or use elements of the 'aifeducation::AIFETrType' list
my_longformer <- tr_maker$make(AIFETrType$longformer)

# Run the 'create' or 'train' methods of the transformer in order to create a
# new transformer or train the newly created one, respectively
# my_bert$create(...)
# my_bert$train(...)
# my_longformer$create(...)
# my_longformer$train(...)
This list contains transformer types. Elements of the list can be used in the public make method of the AIFETransformerMaker R6 class as the input parameter type.
It has the following elements:
bert
= 'bert'
roberta
= 'roberta'
deberta_v2
= 'deberta_v2'
funnel
= 'funnel'
longformer
= 'longformer'
mpnet
= 'mpnet'
Elements can be used like AIFETrType$bert
, AIFETrType$deberta_v2
, AIFETrType$funnel
, etc.
AIFETrType
AIFETrType
An object of class list
of length 6.
Other Transformer:
AIFETransformerMaker
,
aife_transformer_maker
Function for getting the number of cores that should be used for parallel processing of tasks. The number of cores is set to 75 % of the available cores. If the environment variable CI is set to "true" or if the process is running on CRAN, 2 is returned.
auto_n_cores()
Returns an int as the number of cores.
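For example, requesting the number of cores before creating a DataManagerClassifier:
# Request the number of cores to use for parallel tasks.
n_cores <- auto_n_cores()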
Other Utils:
clean_pytorch_log_transformers()
,
create_config_state()
,
create_dir()
,
generate_id()
,
get_file_extension()
,
get_py_package_versions()
,
is.null_or_na()
,
output_message()
,
print_message()
,
run_py_file()
Function for calculating recall, precision, and f1.
calc_standard_classification_measures(true_values, predicted_values)
Arguments: true_values, predicted_values
Returns a matrix which contains the categories of the cases in the rows and the measures (precision, recall, f1) in the columns.
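A brief sketch, assuming both arguments are factors over the same categories:
true_values <- factor(c("a", "b", "a", "b", "a"))
predicted_values <- factor(c("a", "b", "b", "b", "a"))
# One row per category; columns: precision, recall, f1.
calc_standard_classification_measures(true_values, predicted_values)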
Other classifier_utils:
get_coder_metrics()
This function checks if all python modules necessary for the package aifeducation to work are available.
check_aif_py_modules(trace = TRUE, check = "pytorch")
Arguments: trace, check
The function prints a table with all relevant packages and shows which modules are available or unavailable.
If all relevant modules are available, the function returns TRUE. In all other cases it returns FALSE.
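For example, a check before starting to work with the package:
# Check whether the 'pytorch' stack is ready before training.
if (check_aif_py_modules(trace = TRUE, check = "pytorch")) {
  # start working with the package
}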
Other Installation and Configuration:
install_aifeducation()
,
install_py_modules()
,
set_transformers_logger()
Function for preparing and cleaning the log created by an object of class Trainer from the python library 'transformers'.
clean_pytorch_log_transformers(log)
Argument: log
Returns a data.frame
containing epochs, loss, and val_loss.
Other Utils:
auto_n_cores()
,
create_config_state()
,
create_dir()
,
generate_id()
,
get_file_extension()
,
get_py_package_versions()
,
is.null_or_na()
,
output_message()
,
print_message()
,
run_py_file()
This function calculates different versions of Cohen's Kappa.
cohens_kappa(rater_one, rater_two)
Arguments: rater_one, rater_two
Returns a list containing the results for Cohen's Kappa if no weights are applied (kappa_unweighted), if linearly increasing weights are applied (kappa_linear), and if quadratically increasing weights are applied (kappa_squared).
Cohen, J (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213–220. doi:10.1037/h0026256
Cohen, J (1960). A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1), 37–46. doi:10.1177/001316446002000104
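A brief sketch, assuming the codings of both raters are supplied as vectors of equal length:
rater_one <- c(1, 2, 2, 3, 1, 3)
rater_two <- c(1, 2, 3, 3, 1, 2)
res <- cohens_kappa(rater_one, rater_two)
res$kappa_unweighted
res$kappa_linear
res$kappa_squared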
Other performance measures:
fleiss_kappa()
,
kendalls_w()
,
kripp_alpha()
Check whether the passed dir_path
directory exists. If not, creates a new directory and prints a msg
message if trace
is TRUE
.
create_dir(dir_path, trace, msg = "Creating Directory", msg_fun = TRUE)
Arguments: dir_path, trace, msg, msg_fun
TRUE
or FALSE
depending on whether the shiny app is active.
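For example:
# Create 'output/models' if it does not exist yet, printing a message.
create_dir("output/models", trace = TRUE, msg = "Creating Directory")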
Other Utils:
auto_n_cores()
,
clean_pytorch_log_transformers()
,
create_config_state()
,
generate_id()
,
get_file_extension()
,
get_py_package_versions()
,
is.null_or_na()
,
output_message()
,
print_message()
,
run_py_file()
Function for creating synthetic cases in order to balance the data for training with TEClassifierRegular or TEClassifierProtoNet. This is an auxiliary function for use with get_synthetic_cases_from_matrix to allow parallel computations.
create_synthetic_units_from_matrix(matrix_form, target, required_cases, k, method, cat, k_s, max_k)
Arguments: matrix_form (named), target (named), required_cases, k, method, cat, k_s, max_k
Returns a list
which contains the text embeddings of the new synthetic cases as a named data.frame
and
their labels as a named factor
.
Other data_management_utils:
get_n_chunks()
,
get_synthetic_cases_from_matrix()
Abstract class for managing the data and samples during the training of a classifier. The DataManagerClassifier is used with TEClassifierRegular and TEClassifierProtoNet.
Objects of this class are used for ensuring the correct data management for training different types of classifiers. Objects of this class are also used for data augmentation by creating synthetic cases with different techniques.
config
('list')
Field for storing configuration of the DataManagerClassifier.
state
('list')
Field for storing the current state of the DataManagerClassifier.
datasets
('list')
Field for storing the data sets used during training. All elements of the list are data sets of class
datasets.arrow_dataset.Dataset
. The following data sets are available:
data_labeled: all cases which have a label.
data_unlabeled: all cases which have no label.
data_labeled_synthetic: all synthetic cases with their corresponding labels.
data_labeled_pseudo: subset of data_unlabeled if pseudo labels were estimated by a classifier.
name_idx
('named vector')
Field for storing the pairs of indexes and names of every case. The pairs for labeled and unlabeled data are
separated.
samples
('list')
Field for storing the assignment of every case to a train, validation, or test data set depending on the concrete fold. Only the indexes and not the names are stored. In addition, the list contains the assignment for the final training, which excludes a test data set. If the DataManagerClassifier uses i folds, the sample for the final training can be requested with i+1.
new()
Creating a new instance of this class.
DataManagerClassifier$new(
  data_embeddings,
  data_targets,
  folds = 5,
  val_size = 0.25,
  class_levels,
  one_hot_encoding = TRUE,
  add_matrix_map = TRUE,
  sc_methods = "dbsmote",
  sc_min_k = 1,
  sc_max_k = 10,
  trace = TRUE,
  n_cores = auto_n_cores()
)
data_embeddings
Object of class EmbeddedText or LargeDataSetForTextEmbeddings from which the DataManagerClassifier should be created.
data_targets
factor
containing the labels for cases stored in data_embeddings
. Factor must be named
and has to use the same names used in data_embeddings
. Missing values are supported and should be supplied
(e.g., for pseudo labeling).
folds
int
determining the number of cross-fold samples. Value must be at least 2.
val_size
double
between 0 and 1, indicating the proportion of cases of each class which should be used
for the validation sample. The remaining cases are part of the training data.
class_levels
vector
containing the possible levels of the labels.
one_hot_encoding
bool
If TRUE
all labels are converted to one hot encoding.
add_matrix_map
bool
If TRUE
all embeddings are transformed into a two dimensional matrix. The number
of rows equals the number of cases. The number of columns equals times*features
.
sc_methods
string
determining the technique used for creating synthetic cases.
sc_min_k
int
determining the minimal number of neighbors during the creation of synthetic cases.
sc_max_k
int
determining the maximal number of neighbors during the creation of synthetic cases.
trace
bool
If TRUE
information on the process is printed to the console.
n_cores
int
Number of cores which should be used during the calculation of synthetic cases.
Method returns an initialized object of class DataManagerClassifier.
get_config()
Method for requesting the configuration of the DataManagerClassifier.
DataManagerClassifier$get_config()
Returns a list
storing the configuration of the DataManagerClassifier.
get_labeled_data()
Method for requesting the complete labeled data set.
DataManagerClassifier$get_labeled_data()
Returns an object of class datasets.arrow_dataset.Dataset
containing all cases with labels.
get_unlabeled_data()
Method for requesting the complete unlabeled data set.
DataManagerClassifier$get_unlabeled_data()
Returns an object of class datasets.arrow_dataset.Dataset
containing all cases without labels.
get_samples()
Method for requesting the assignments to train, validation, and test data sets for every fold and the final training.
DataManagerClassifier$get_samples()
Returns a list
storing the assignments to a train, validation, and test data set for every fold. In the
case of the sample for the final training the test data set is always empty (NULL
).
set_state()
Method for setting the current state of the DataManagerClassifier.
DataManagerClassifier$set_state(iteration, step = NULL)
iteration
int
determining the current iteration of the training. That is, iteration determines the fold to use for training, validation, and testing. If i is the number of folds, i+1 requests the sample for the final training. Alternatively, iteration can take the string "final" to request the sample for the final training.
step
int
determining the step for estimating and using pseudo labels during training. Only relevant if
training is requested with pseudo labels.
Method does not return anything. It is used for setting the internal state of the DataManager.
get_n_folds()
Method for requesting the number of folds the DataManagerClassifier can use with the current data.
DataManagerClassifier$get_n_folds()
Returns the number of folds the DataManagerClassifier uses.
get_n_classes()
Method for requesting the number of classes.
DataManagerClassifier$get_n_classes()
Returns the number of classes.
get_statistics()
Method for requesting descriptive sample statistics.
DataManagerClassifier$get_statistics()
Returns a table describing the absolute frequencies of the labeled and unlabeled data. The rows contain the length of the sequences while the columns contain the labels.
get_dataset()
Method for requesting a data set for training depending on the current state of the DataManagerClassifier.
DataManagerClassifier$get_dataset(
  inc_labeled = TRUE,
  inc_unlabeled = FALSE,
  inc_synthetic = FALSE,
  inc_pseudo_data = FALSE
)
inc_labeled
bool
If TRUE
the data set includes all cases which have labels.
inc_unlabeled
bool
If TRUE
the data set includes all cases which have no labels.
inc_synthetic
bool
If TRUE
the data set includes all synthetic cases with their corresponding labels.
inc_pseudo_data
bool
If TRUE
the data set includes all cases which have pseudo labels.
Returns an object of class datasets.arrow_dataset.Dataset
containing the requested kind of data along
with all requested transformations for training. Please note that this method returns a data set that is
designed for training only. The corresponding validation data set is requested with get_val_dataset
and the
corresponding test data set with get_test_dataset
.
get_val_dataset()
Method for requesting a data set for validation depending on the current state of the DataManagerClassifier.
DataManagerClassifier$get_val_dataset()
Returns an object of class datasets.arrow_dataset.Dataset
containing the requested kind of data along
with all requested transformations for validation. The corresponding data set for training can be requested
with get_dataset
and the corresponding data set for testing with get_test_dataset
.
get_test_dataset()
Method for requesting a data set for testing depending on the current state of the DataManagerClassifier.
DataManagerClassifier$get_test_dataset()
Returns an object of class datasets.arrow_dataset.Dataset
containing the requested kind of data along
with all requested transformations for testing. The corresponding data set for training can be requested
with get_dataset
and the corresponding data set for validation with get_val_dataset
.
create_synthetic()
Method for generating synthetic data used during training. The process uses all labeled data belonging to the current state of the DataManagerClassifier.
DataManagerClassifier$create_synthetic(trace = TRUE, inc_pseudo_data = FALSE)
trace
bool
If TRUE
information on the process is printed to the console.
inc_pseudo_data
bool
If TRUE
data with pseudo labels are used in addition to the labeled data for
generating synthetic cases.
This method does not return anything. It generates a new data set of synthetic cases which is stored as an object of class datasets.arrow_dataset.Dataset in the field datasets$data_labeled_synthetic. Please note that a call of this method will overwrite an existing data set in the corresponding field.
add_replace_pseudo_data()
Method for adding data with pseudo labels generated by a classifier.
DataManagerClassifier$add_replace_pseudo_data(inputs, labels)
inputs
array
or matrix
representing the input data.
labels
factor
containing the corresponding pseudo labels.
This method does not return anything. It generates a new data set of cases with pseudo labels which is stored as an object of class datasets.arrow_dataset.Dataset in the field datasets$data_labeled_pseudo. Please note that a call of this method will overwrite an existing data set in the corresponding field.
clone()
The objects of this class are cloneable with this method.
DataManagerClassifier$clone(deep = FALSE)
deep
Whether to make a deep clone.
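A condensed sketch of a typical workflow with this class; embeddings and labels are placeholders for an EmbeddedText (or LargeDataSetForTextEmbeddings) object and a named factor:
dm <- DataManagerClassifier$new(
  data_embeddings = embeddings,   # placeholder: object of class EmbeddedText
  data_targets = labels,          # placeholder: named factor with the labels
  folds = 5,
  class_levels = c("low", "high"),
  n_cores = auto_n_cores()
)
dm$get_statistics()
dm$set_state(iteration = 1)       # select the first fold
train_data <- dm$get_dataset()
val_data <- dm$get_val_dataset()
test_data <- dm$get_test_dataset()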
Other Data Management:
EmbeddedText
,
LargeDataSetForText
,
LargeDataSetForTextEmbeddings
Object of class R6
which stores the text embeddings generated by an object of class
TextEmbeddingModel. The text embeddings are stored within memory/RAM. In the case of a high number of documents
the data may not fit into memory/RAM. Thus, please use this object only for a small sample of texts. In general, it
is recommended to use an object of class LargeDataSetForTextEmbeddings which can deal with any number of texts.
Returns an object of class EmbeddedText. These objects are used for storing and managing the text embeddings created with objects of class TextEmbeddingModel. Objects of class EmbeddedText serve as input for objects of class TEClassifierRegular, TEClassifierProtoNet, and TEFeatureExtractor. The main aim of this class is to provide a structured link between embedding models and classifiers. Since objects of this class save information on the text embedding model that created the text embeddings, it is ensured that only embeddings generated with the same embedding model are combined. Furthermore, the stored information allows objects to check if embeddings of the correct text embedding model are used for training and predicting.
embeddings
('data.frame()')
data.frame containing the text embeddings for all chunks. Documents are in the rows. Embedding dimensions are
in the columns.
configure()
Creates a new object representing text embeddings.
EmbeddedText$configure(
  model_name = NA,
  model_label = NA,
  model_date = NA,
  model_method = NA,
  model_version = NA,
  model_language = NA,
  param_seq_length = NA,
  param_chunks = NULL,
  param_features = NULL,
  param_overlap = NULL,
  param_emb_layer_min = NULL,
  param_emb_layer_max = NULL,
  param_emb_pool_type = NULL,
  param_aggregation = NULL,
  embeddings
)
model_name
string
Name of the model that generates this embedding.
model_label
string
Label of the model that generates this embedding.
model_date
string
Date when the embedding generating model was created.
model_method
string
Method of the underlying embedding model.
model_version
string
Version of the model that generated this embedding.
model_language
string
Language of the model that generated this embedding.
param_seq_length
int
Maximum number of tokens that the generating model processes for a chunk.
param_chunks
int
Maximum number of chunks which are supported by the generating model.
param_features
int
Number of dimensions of the text embeddings.
param_overlap
int
Number of tokens that were added at the beginning of the sequence for the next chunk
by this model.
param_emb_layer_min
int
or string
determining the first layer to be included in the creation of
embeddings.
param_emb_layer_max
int
or string
determining the last layer to be included in the creation of
embeddings.
param_emb_pool_type
string
determining the method for pooling the token embeddings within each layer.
param_aggregation
string
Aggregation method of the hidden states. Deprecated. Only included for backward
compatibility.
embeddings
data.frame
containing the text embeddings.
Returns an object of class EmbeddedText which stores the text embeddings produced by an objects of class TextEmbeddingModel.
save()
Saves a data set to disk.
EmbeddedText$save(dir_path, folder_name, create_dir = TRUE)
dir_path
Path where to store the data set.
folder_name
string
Name of the folder for storing the data set.
create_dir
bool
If TRUE
the directory will be created if it does not exist.
Method does not return anything. It writes the data set to disk.
is_configured()
Method for checking if the model was successfully configured. An object can only be used if this
value is TRUE
.
EmbeddedText$is_configured()
bool
TRUE
if the model is fully configured. FALSE
if not.
load_from_disk()
loads an object of class EmbeddedText from disk and updates the object to the current version of the package.
EmbeddedText$load_from_disk(dir_path)
dir_path
Path where the data set is stored.
Method does not return anything. It loads an object from disk.
get_model_info()
Method for retrieving information about the model that generated this embedding.
EmbeddedText$get_model_info()
list
contains all saved information about the underlying text embedding model.
get_model_label()
Method for retrieving the label of the model that generated this embedding.
EmbeddedText$get_model_label()
string
Label of the corresponding text embedding model
get_times()
Number of chunks/times of the text embeddings.
EmbeddedText$get_times()
Returns an int
describing the number of chunks/times of the text embeddings.
get_features()
Number of actual features/dimensions of the text embeddings. In the case a feature extractor was used, the number of features is smaller than the original number of features. To receive the original number of features (the number of features before applying a feature extractor) you can use the method get_original_features of this class.
EmbeddedText$get_features()
Returns an int
describing the number of features/dimensions of the text embeddings.
get_original_features()
Number of original features/dimensions of the text embeddings.
EmbeddedText$get_original_features()
Returns an int
describing the number of features/dimensions if no feature extractor is used or before a feature extractor is applied.
is_compressed()
Checks if the text embeddings were reduced by a feature extractor.
EmbeddedText$is_compressed()
Returns TRUE
if the number of dimensions was reduced by a feature extractor. If not, it returns FALSE.
add_feature_extractor_info()
Method setting information on the feature extractor that was used to reduce the number of dimensions of the text embeddings. This information should only be used if a feature extractor was applied.
EmbeddedText$add_feature_extractor_info(
  model_name,
  model_label = NA,
  features = NA,
  method = NA,
  noise_factor = NA,
  optimizer = NA
)
model_name
string
Name of the underlying TextEmbeddingModel.
model_label
string
Label of the underlying TextEmbeddingModel.
features
int
Number of dimension (features) for the compressed text embeddings.
method
string
Method that the TEFeatureExtractor applies for generating the compressed text
embeddings.
noise_factor
double
Noise factor of the TEFeatureExtractor.
optimizer
string
Optimizer used during training the TEFeatureExtractor.
Method does not return anything. It sets information on a feature extractor.
get_feature_extractor_info()
Method for receiving information on the feature extractor that was used to reduce the number of dimensions of the text embeddings.
EmbeddedText$get_feature_extractor_info()
Returns a list
with information on the feature extractor. If no
feature extractor was used it returns NULL
.
convert_to_LargeDataSetForTextEmbeddings()
Method for converting this object to an object of class LargeDataSetForTextEmbeddings.
EmbeddedText$convert_to_LargeDataSetForTextEmbeddings()
Returns an object of class LargeDataSetForTextEmbeddings which uses memory mapping allowing to work with large data sets.
n_rows()
Number of rows.
EmbeddedText$n_rows()
Returns the number of rows of the text embeddings which represent the number of cases.
get_all_fields()
Return all fields.
EmbeddedText$get_all_fields()
Method returns a list
containing all public and private fields
of the object.
clone()
The objects of this class are cloneable with this method.
EmbeddedText$clone(deep = FALSE)
deep
Whether to make a deep clone.
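A condensed sketch of working with an EmbeddedText object; emb is a placeholder for the object returned by a TextEmbeddingModel:
# Assumption: 'emb' is an EmbeddedText created by a TextEmbeddingModel.
emb$n_rows()          # number of cases
emb$get_times()       # number of chunks/times
emb$get_features()    # number of features/dimensions
# Convert to a memory-mapped data set for larger corpora.
large_emb <- emb$convert_to_LargeDataSetForTextEmbeddings()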
Other Data Management:
DataManagerClassifier
,
LargeDataSetForText
,
LargeDataSetForTextEmbeddings
This function calculates Fleiss' Kappa.
fleiss_kappa(rater_one, rater_two, additional_raters = NULL)
Arguments: rater_one, rater_two, additional_raters
Returns the value for Fleiss' Kappa.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382. doi:10.1037/h0031619
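A brief sketch with three raters, assuming the ratings are vectors of equal length and additional raters are supplied as a list:
rater_one <- c(1, 2, 2, 3, 1)
rater_two <- c(1, 2, 3, 3, 1)
rater_three <- c(1, 1, 2, 3, 1)
fleiss_kappa(rater_one, rater_two, additional_raters = list(rater_three))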
Other performance measures:
cohens_kappa()
,
kendalls_w()
,
kripp_alpha()
Function for generating an ID suffix for objects of class TextEmbeddingModel, TEClassifierRegular, and TEClassifierProtoNet.
generate_id(length = 16)
Argument: length
Returns a string
of the requested length.
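For example:
# Generate a random 16-character ID suffix.
id <- generate_id(length = 16)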
Other Utils:
auto_n_cores()
,
clean_pytorch_log_transformers()
,
create_config_state()
,
create_dir()
,
get_file_extension()
,
get_py_package_versions()
,
is.null_or_na()
,
output_message()
,
print_message()
,
run_py_file()
Function for requesting a vector
containing the alpha-3 codes for most countries.
get_alpha_3_codes()
Returns a vector
containing the alpha-3 codes for most countries.
Other Auxiliary Functions:
matrix_to_array_c()
,
summarize_tracked_sustainability()
,
to_categorical_c()
This function calculates different reliability measures which are based on the empirical research method of content analysis.
get_coder_metrics(true_values = NULL, predicted_values = NULL, return_names_only = FALSE)
Arguments: true_values, predicted_values, return_names_only
If return_names_only = FALSE
returns a vector
with the following reliability measures:
iota_index: Iota Index from the Iota Reliability Concept Version 2.
min_iota2: Minimal Iota from Iota Reliability Concept Version 2.
avg_iota2: Average Iota from Iota Reliability Concept Version 2.
max_iota2: Maximum Iota from Iota Reliability Concept Version 2.
min_alpha: Minimal Alpha Reliability from Iota Reliability Concept Version 2.
avg_alpha: Average Alpha Reliability from Iota Reliability Concept Version 2.
max_alpha: Maximum Alpha Reliability from Iota Reliability Concept Version 2.
static_iota_index: Static Iota Index from Iota Reliability Concept Version 2.
dynamic_iota_index: Dynamic Iota Index from Iota Reliability Concept Version 2.
kalpha_nominal: Krippendorff's Alpha for nominal variables.
kalpha_ordinal: Krippendorff's Alpha for ordinal variables.
kendall: Kendall's coefficient of concordance W with correction for ties.
c_kappa_unweighted: Cohen's Kappa unweighted.
c_kappa_linear: Weighted Cohen's Kappa with linear increasing weights.
c_kappa_squared: Weighted Cohen's Kappa with quadratic increasing weights.
kappa_fleiss: Fleiss' Kappa for multiple raters without exact estimation.
percentage_agreement: Percentage Agreement.
balanced_accuracy: Average accuracy within each class.
gwet_ac: Gwet's AC1/AC2 agreement coefficient.
If return_names_only = TRUE
returns only the names of the vector elements.
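A brief sketch, assuming true and predicted codings are factors of equal length:
true_values <- factor(c("a", "b", "a", "b", "a"))
predicted_values <- factor(c("a", "b", "b", "b", "a"))
metrics <- get_coder_metrics(true_values, predicted_values)
# Request only the names of the measures:
get_coder_metrics(return_names_only = TRUE)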
Other classifier_utils:
calc_standard_classification_measures()
Function for requesting the file extension of a file.
get_file_extension(file_path)
Argument: file_path
Returns the extension of a file as a string.
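For example:
# Expected to return "pdf" for a path ending in '.pdf' (placeholder path).
get_file_extension("corpus/report.pdf")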
Other Utils:
auto_n_cores()
,
clean_pytorch_log_transformers()
,
create_config_state()
,
create_dir()
,
generate_id()
,
get_py_package_versions()
,
is.null_or_na()
,
output_message()
,
print_message()
,
run_py_file()
Function for calculating the number of chunks/sequences for every case.
get_n_chunks(text_embeddings, features, times)
Arguments: text_embeddings, features, times
Named vector of integers representing the number of chunks/sequences for every case.
Other data_management_utils:
create_synthetic_units_from_matrix()
,
get_synthetic_cases_from_matrix()
Function for requesting a summary of the versions of all critical python components.
get_py_package_versions()
Returns a list that contains the version number of python and
the versions of critical python packages. If a package is not available, the version is set to NA.
Other Utils:
auto_n_cores()
,
clean_pytorch_log_transformers()
,
create_config_state()
,
create_dir()
,
generate_id()
,
get_file_extension()
,
is.null_or_na()
,
output_message()
,
print_message()
,
run_py_file()
This function creates synthetic cases for balancing the training with an object of the class TEClassifierRegular or TEClassifierProtoNet.
get_synthetic_cases_from_matrix(matrix_form, times, features, target, sequence_length, method = c("smote"), min_k = 1, max_k = 6)
Arguments: matrix_form (named), times, features, target (named), sequence_length, method, min_k, max_k
list
with the following components:
syntetic_embeddings
: Named data.frame
containing the text embeddings of the synthetic cases.
syntetic_targets
: Named factor
containing the labels of the corresponding synthetic cases.
n_syntetic_units
: table
showing the number of synthetic cases for every label/category.
Other data_management_utils:
create_synthetic_units_from_matrix()
,
get_n_chunks()
Function for installing 'aifeducation' on a machine. This function assumes that neither 'python' nor 'miniconda' is installed. Only 'pytorch' is installed.
install_aifeducation(install_aifeducation_studio = TRUE)
Argument: install_aifeducation_studio
Function does not return anything. It installs python, optional R packages, and the necessary 'python' packages on a machine.
Other Installation and Configuration:
check_aif_py_modules()
,
install_py_modules()
,
set_transformers_logger()
Function for installing the necessary python modules.
install_py_modules(
  envname = "aifeducation",
  install = "pytorch",
  transformer_version = "<=4.46",
  tokenizers_version = "<=0.20.4",
  pandas_version = "<=2.2.3",
  datasets_version = "<=3.1.0",
  codecarbon_version = "<=2.8.2",
  safetensors_version = "<=0.4.5",
  torcheval_version = "<=0.0.7",
  accelerate_version = "<=1.1.1",
  pytorch_cuda_version = "12.1",
  python_version = "3.9",
  remove_first = FALSE
)
Arguments: envname, install, transformer_version, tokenizers_version, pandas_version, datasets_version, codecarbon_version, safetensors_version, torcheval_version, accelerate_version, pytorch_cuda_version, python_version, remove_first
Returns no values or objects. Function is used for installing the necessary python libraries in a conda environment.
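A brief sketch of a default installation followed by a check:
# Install the python modules into the default conda environment ...
install_py_modules(envname = "aifeducation", install = "pytorch")
# ... and verify that all modules are available.
check_aif_py_modules(check = "pytorch")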
Other Installation and Configuration:
check_aif_py_modules()
,
install_aifeducation()
,
set_transformers_logger()
Function for checking if an object is NULL or NA.
is.null_or_na(object)
Argument: object. An object to test.
Returns FALSE
if the object is not NULL
and not NA
. Returns TRUE
in all other cases.
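For example:
is.null_or_na(NULL)  # TRUE
is.null_or_na(NA)    # TRUE
is.null_or_na(1)     # FALSE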
Other Utils:
auto_n_cores()
,
clean_pytorch_log_transformers()
,
create_config_state()
,
create_dir()
,
generate_id()
,
get_file_extension()
,
get_py_package_versions()
,
output_message()
,
print_message()
,
run_py_file()
This function calculates Kendall's coefficient of concordance w with and without correction.
kendalls_w(rater_one, rater_two, additional_raters = NULL)
Arguments: rater_one, rater_two, additional_raters
Returns a list
containing the results for Kendall's coefficient of concordance w
with and without correction.
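A brief sketch, assuming the ratings are vectors of equal length:
rater_one <- c(1, 2, 2, 3, 1)
rater_two <- c(1, 2, 3, 3, 1)
kendalls_w(rater_one, rater_two)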
Other performance measures:
cohens_kappa()
,
fleiss_kappa()
,
kripp_alpha()
This function calculates different Krippendorff's Alpha for nominal and ordinal variables.
kripp_alpha(rater_one, rater_two, additional_raters = NULL)
Arguments: rater_one, rater_two, additional_raters
Returns a list
containing the results for Krippendorff's Alpha for
nominal and ordinal data.
Krippendorff, K. (2019). Content Analysis: An Introduction to Its Methodology (4th Ed.). SAGE
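A brief sketch, assuming the codings of both raters are vectors of equal length:
rater_one <- c(1, 2, 2, 3, 1)
rater_two <- c(1, 2, 3, 3, 1)
kripp_alpha(rater_one, rater_two)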
Other performance measures:
cohens_kappa()
,
fleiss_kappa()
,
kendalls_w()
This object contains public and private methods which may be useful for every large data set. Objects of this class are not intended to be used directly; use LargeDataSetForTextEmbeddings or LargeDataSetForText instead.
Returns a new object of this class.
n_cols()
Number of columns in the data set.
LargeDataSetBase$n_cols()
int
describing the number of columns in the data set.
n_rows()
Number of rows in the data set.
LargeDataSetBase$n_rows()
int
describing the number of rows in the data set.
get_colnames()
Get names of the columns in the data set.
LargeDataSetBase$get_colnames()
vector
containing the names of the columns as string
s.
get_dataset()
Get data set.
LargeDataSetBase$get_dataset()
Returns the data set of this object as an object of class datasets.arrow_dataset.Dataset
.
reduce_to_unique_ids()
Reduces the data set to a data set containing only unique ids. In the case an id exists multiple times in the data set the first case remains in the data set. The other cases are dropped.
Attention: Calling this method will change the data set in place.
LargeDataSetBase$reduce_to_unique_ids()
Method does not return anything. It changes the data set of this object in place.
select()
Returns a data set which contains only the cases belonging to the specific indices.
LargeDataSetBase$select(indicies)
indicies
vector
of int
for selecting rows in the data set. Attention: The indices are zero-based.
Returns a data set of class datasets.arrow_dataset.Dataset
with the selected rows.
get_ids()
Get ids
LargeDataSetBase$get_ids()
Returns a vector
containing the ids of every row as string
s.
save()
Saves a data set to disk.
LargeDataSetBase$save(dir_path, folder_name, create_dir = TRUE)
dir_path
Path where to store the data set.
folder_name
string
Name of the folder for storing the data set.
create_dir
bool
If TRUE
the directory will be created if it does not exist.
Method does not return anything. It writes the data set to disk.
load_from_disk()
loads an object of class LargeDataSetBase from disk and updates the object to the current version of the package.
LargeDataSetBase$load_from_disk(dir_path)
dir_path
Path where the data set is stored.
Method does not return anything. It loads an object from disk.
load()
Loads a data set from disk.
LargeDataSetBase$load(dir_path)
dir_path
Path where the data set is stored.
Method does not return anything. It loads a data set from disk.
get_all_fields()
Return all fields.
LargeDataSetBase$get_all_fields()
Method returns a list
containing all public and private fields of the object.
clone()
The objects of this class are cloneable with this method.
LargeDataSetBase$clone(deep = FALSE)
deep
Whether to make a deep clone.
This object stores raw texts. The data of these objects is not stored in memory directly. By using memory mapping, these objects allow working with data sets that do not fit into memory/RAM.
Returns a new object of this class.
aifeducation::LargeDataSetBase
-> LargeDataSetForText
aifeducation::LargeDataSetBase$get_all_fields()
aifeducation::LargeDataSetBase$get_colnames()
aifeducation::LargeDataSetBase$get_dataset()
aifeducation::LargeDataSetBase$get_ids()
aifeducation::LargeDataSetBase$load()
aifeducation::LargeDataSetBase$load_from_disk()
aifeducation::LargeDataSetBase$n_cols()
aifeducation::LargeDataSetBase$n_rows()
aifeducation::LargeDataSetBase$reduce_to_unique_ids()
aifeducation::LargeDataSetBase$save()
aifeducation::LargeDataSetBase$select()
new()
Method for the creation of a LargeDataSetForText instance. It can be initialized with the init_data parameter if passed (uses the add_from_data.frame() method if init_data is a data.frame).
LargeDataSetForText$new(init_data = NULL)
init_data
Initial data.frame
for dataset.
A new instance of this class initialized with init_data
if passed.
add_from_files_txt()
Method for adding raw texts saved within .txt files to the data set. Please note that the directory should contain one folder for each .txt file. In order to create an informative data set, every folder can contain the following additional files:
bib_entry.txt: containing a text version of the bibliographic information of the raw text.
license.txt: containing a statement about the license to use the raw text such as "CC BY".
url_license.txt: containing the url/link to the license on the internet.
text_license.txt: containing the license in raw text.
url_source.txt: containing the url/link to the source on the internet.
The id of every .txt file is the file name without the file extension. Please be aware that you must provide unique file names. Id and raw text are mandatory; bibliographic and license information are optional.
LargeDataSetForText$add_from_files_txt( dir_path, batch_size = 500, log_file = NULL, log_write_interval = 2, log_top_value = 0, log_top_total = 1, log_top_message = NA, trace = TRUE )
dir_path
Path to the directory where the files are stored.
batch_size
int
determining the number of files to process at once.
log_file
string
Path to the file where the log should be saved. If no logging is desired set this
argument to NULL
.
log_write_interval
int
Time in seconds determining the interval in which the logger should try to update
the log files. Only relevant if log_file
is not NULL
.
log_top_value
int
indicating the current iteration of the process.
log_top_total
int
determining the maximal number of iterations.
log_top_message
string
providing additional information about the process.
trace
bool
If TRUE
information on the progress is printed to the console.
The method does not return anything. It adds new raw texts to the data set.
add_from_files_pdf()
Method for adding raw texts saved within .pdf files to the data set. Please note that the directory should contain one folder for each .pdf file. In order to create an informative data set, every folder can contain the following additional files:
bib_entry.txt: containing a text version of the bibliographic information of the raw text.
license.txt: containing a statement about the license to use the raw text such as "CC BY".
url_license.txt: containing the url/link to the license on the internet.
text_license.txt: containing the license in raw text.
url_source.txt: containing the url/link to the source on the internet.
The id of every .pdf file is the file name without the file extension. Please be aware that you must provide unique file names. Id and raw text are mandatory; bibliographic and license information are optional.
LargeDataSetForText$add_from_files_pdf( dir_path, batch_size = 500, log_file = NULL, log_write_interval = 2, log_top_value = 0, log_top_total = 1, log_top_message = NA, trace = TRUE )
dir_path
Path to the directory where the files are stored.
batch_size
int
determining the number of files to process at once.
log_file
string
Path to the file where the log should be saved. If no logging is desired set this
argument to NULL
.
log_write_interval
int
Time in seconds determining the interval in which the logger should try to update
the log files. Only relevant if log_file
is not NULL
.
log_top_value
int
indicating the current iteration of the process.
log_top_total
int
determining the maximal number of iterations.
log_top_message
string
providing additional information about the process.
trace
bool
If TRUE
information on the progress is printed to the console.
The method does not return anything. It adds new raw texts to the data set.
add_from_files_xlsx()
Method for adding raw texts saved within .xlsx files to the data set. The method assumes that the texts are saved in the rows and that the id and the raw texts are stored in columns. In addition, columns for the bibliographic information and the license can be added. The column names for these columns must be specified with the following arguments and must be the same for all .xlsx files in the chosen directory. Id and raw text are mandatory; bibliographic information, license, license's url, license's text, and source's url are optional. Additional columns are dropped.
LargeDataSetForText$add_from_files_xlsx( dir_path, trace = TRUE, id_column = "id", text_column = "text", bib_entry_column = "bib_entry", license_column = "license", url_license_column = "url_license", text_license_column = "text_license", url_source_column = "url_source", log_file = NULL, log_write_interval = 2, log_top_value = 0, log_top_total = 1, log_top_message = NA )
dir_path
Path to the directory where the files are stored.
trace
bool
If TRUE
prints information on the progress to the console.
id_column
string
Name of the column storing the ids for the texts.
text_column
string
Name of the column storing the raw text.
bib_entry_column
string
Name of the column storing the bibliographic information of the texts.
license_column
string
Name of the column storing information about the licenses.
url_license_column
string
Name of the column storing information about the url to the license on the internet.
text_license_column
string
Name of the column storing the license as text.
url_source_column
string
Name of the column storing information about the url to the source on the internet.
log_file
string
Path to the file where the log should be saved. If no logging is desired set this
argument to NULL
.
log_write_interval
int
Time in seconds determining the interval in which the logger should try to update
the log files. Only relevant if log_file
is not NULL
.
log_top_value
int
indicating the current iteration of the process.
log_top_total
int
determining the maximal number of iterations.
log_top_message
string
providing additional information about the process.
The method does not return anything. It adds new raw texts to the data set.
add_from_data.frame()
Method for adding raw texts from a data.frame.
LargeDataSetForText$add_from_data.frame(data_frame)
data_frame
Object of class data.frame
with at least the following columns: "id", "text", "bib_entry", "license", "url_license", "text_license", and "url_source". If "id" and/or "text" is missing, an error occurs. If the other columns are not present in the data.frame, they are added with empty values (NA). Additional columns are dropped.
The method does not return anything. It adds new raw texts to the data set.
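A minimal sketch for building a data set from a data.frame; the column values are made up, and the optional columns are omitted so that they are filled with NA:

library(aifeducation)

# Hypothetical raw texts with mandatory columns only
texts <- data.frame(
  id = c("doc_1", "doc_2"),
  text = c("First raw text.", "Second raw text.")
)

ds <- LargeDataSetForText$new(init_data = texts)
ds$n_rows()  # 2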
get_private()
Method for requesting all private fields and methods. Used for loading and updating an object.
LargeDataSetForText$get_private()
Returns a list
with all private fields and methods.
clone()
The objects of this class are cloneable with this method.
LargeDataSetForText$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other Data Management: DataManagerClassifier, EmbeddedText, LargeDataSetForTextEmbeddings
This object stores text embeddings which are usually produced by an object of class TextEmbeddingModel. The data of these objects is not stored in memory directly. By using memory mapping, these objects allow working with data sets that do not fit into memory/RAM.
LargeDataSetForTextEmbeddings are used for storing and managing the text embeddings created with objects of class TextEmbeddingModel. Objects of class LargeDataSetForTextEmbeddings serve as input for objects of class TEClassifierRegular, TEClassifierProtoNet, and TEFeatureExtractor. The main aim of this class is to provide a structured link between embedding models and classifiers. Since objects of this class save information on the text embedding model that created the text embeddings, it is ensured that only embeddings generated with the same embedding model are combined. Furthermore, the stored information allows objects to check if embeddings of the correct text embedding model are used for training and predicting.
Returns a new object of this class.
aifeducation::LargeDataSetBase
-> LargeDataSetForTextEmbeddings
LargeDataSetForTextEmbeddings$get_text_embedding_model_name()
LargeDataSetForTextEmbeddings$add_embeddings_from_EmbeddedText()
LargeDataSetForTextEmbeddings$add_embeddings_from_LargeDataSetForTextEmbeddings()
aifeducation::LargeDataSetBase$get_all_fields()
aifeducation::LargeDataSetBase$get_colnames()
aifeducation::LargeDataSetBase$get_dataset()
aifeducation::LargeDataSetBase$get_ids()
aifeducation::LargeDataSetBase$load()
aifeducation::LargeDataSetBase$n_cols()
aifeducation::LargeDataSetBase$n_rows()
aifeducation::LargeDataSetBase$reduce_to_unique_ids()
aifeducation::LargeDataSetBase$save()
aifeducation::LargeDataSetBase$select()
configure()
Creates a new object representing text embeddings.
LargeDataSetForTextEmbeddings$configure( model_name = NA, model_label = NA, model_date = NA, model_method = NA, model_version = NA, model_language = NA, param_seq_length = NA, param_chunks = NULL, param_features = NULL, param_overlap = NULL, param_emb_layer_min = NULL, param_emb_layer_max = NULL, param_emb_pool_type = NULL, param_aggregation = NULL )
model_name
string
Name of the model that generates this embedding.
model_label
string
Label of the model that generates this embedding.
model_date
string
Date when the embedding generating model was created.
model_method
string
Method of the underlying embedding model.
model_version
string
Version of the model that generated this embedding.
model_language
string
Language of the model that generated this embedding.
param_seq_length
int
Maximum number of tokens that the generating model processes for a chunk.
param_chunks
int
Maximum number of chunks which are supported by the generating model.
param_features
int
Number of dimensions of the text embeddings.
param_overlap
int
Number of tokens that were added at the beginning of the sequence for the next chunk
by this model.
param_emb_layer_min
int
or string
determining the first layer to be included in the creation of
embeddings.
param_emb_layer_max
int
or string
determining the last layer to be included in the creation of
embeddings.
param_emb_pool_type
string
determining the method for pooling the token embeddings within each layer.
param_aggregation
string
Aggregation method of the hidden states. Deprecated. Only included for backward
compatibility.
The method returns a new object of this class.
is_configured()
Method for checking if the model was successfully configured. An object can only be used if this
value is TRUE
.
LargeDataSetForTextEmbeddings$is_configured()
bool
TRUE
if the model is fully configured. FALSE
if not.
get_text_embedding_model_name()
Method for requesting the name (unique id) of the underlying text embedding model.
LargeDataSetForTextEmbeddings$get_text_embedding_model_name()
Returns a string
describing the name of the text embedding model.
get_model_info()
Method for retrieving information about the model that generated this embedding.
LargeDataSetForTextEmbeddings$get_model_info()
list
containing all saved information about the underlying text embedding model.
load_from_disk()
Loads an object of class LargeDataSetForTextEmbeddings from disk and updates the object to the current version of the package.
LargeDataSetForTextEmbeddings$load_from_disk(dir_path)
dir_path
Path where the data set is stored.
Method does not return anything. It loads an object from disk.
get_model_label()
Method for retrieving the label of the model that generated this embedding.
LargeDataSetForTextEmbeddings$get_model_label()
string
Label of the corresponding text embedding model
add_feature_extractor_info()
Method setting information on the TEFeatureExtractor that was used to reduce the number of dimensions of the text embeddings. This information should only be used if a TEFeatureExtractor was applied.
LargeDataSetForTextEmbeddings$add_feature_extractor_info( model_name, model_label = NA, features = NA, method = NA, noise_factor = NA, optimizer = NA )
model_name
string
Name of the underlying TextEmbeddingModel.
model_label
string
Label of the underlying TextEmbeddingModel.
features
int
Number of dimensions (features) for the compressed text embeddings.
method
string
Method that the TEFeatureExtractor applies for generating the compressed text
embeddings.
noise_factor
double
Noise factor of the TEFeatureExtractor.
optimizer
string
Optimizer used during training the TEFeatureExtractor.
Method does not return anything. It sets information on a TEFeatureExtractor.
get_feature_extractor_info()
Method for retrieving information on the TEFeatureExtractor that was used to reduce the number of dimensions of the text embeddings.
LargeDataSetForTextEmbeddings$get_feature_extractor_info()
Returns a list
with information on the TEFeatureExtractor. If no TEFeatureExtractor was used it
returns NULL
.
is_compressed()
Checks if the text embeddings were reduced by a TEFeatureExtractor.
LargeDataSetForTextEmbeddings$is_compressed()
Returns TRUE
if the number of dimensions was reduced by a TEFeatureExtractor and FALSE if not.
get_times()
Number of chunks/times of the text embeddings.
LargeDataSetForTextEmbeddings$get_times()
Returns an int
describing the number of chunks/times of the text embeddings.
get_features()
Number of actual features/dimensions of the text embeddings. In the case a TEFeatureExtractor was used, the number of features is smaller than the original number of features. To obtain the original number of features (the number of features before applying a TEFeatureExtractor) you can use the method get_original_features of this class.
LargeDataSetForTextEmbeddings$get_features()
Returns an int
describing the number of features/dimensions of the text embeddings.
get_original_features()
Number of original features/dimensions of the text embeddings.
LargeDataSetForTextEmbeddings$get_original_features()
Returns an int
describing the number of features/dimensions if no TEFeatureExtractor is used, or before a TEFeatureExtractor is applied.
add_embeddings_from_array()
Method for adding new data to the data set from an array
. Please note that the method does not
check if cases already exist in the data set. To reduce the data set to unique cases call the method
reduce_to_unique_ids
.
LargeDataSetForTextEmbeddings$add_embeddings_from_array(embedding_array)
embedding_array
array
containing the text embeddings.
The method does not return anything. It adds new data to the data set.
add_embeddings_from_EmbeddedText()
Method for adding new data to the data set from an EmbeddedText. Please note that the method does
not check if cases already exist in the data set. To reduce the data set to unique cases call the method
reduce_to_unique_ids
.
LargeDataSetForTextEmbeddings$add_embeddings_from_EmbeddedText(EmbeddedText)
EmbeddedText
Object of class EmbeddedText.
The method does not return anything. It adds new data to the data set.
add_embeddings_from_LargeDataSetForTextEmbeddings()
Method for adding new data to the data set from a LargeDataSetForTextEmbeddings. Please note that
the method does not check if cases already exist in the data set. To reduce the data set to unique cases call
the method reduce_to_unique_ids
.
LargeDataSetForTextEmbeddings$add_embeddings_from_LargeDataSetForTextEmbeddings( dataset )
dataset
Object of class LargeDataSetForTextEmbeddings.
The method does not return anything. It adds new data to the data set.
convert_to_EmbeddedText()
Method for converting this object to an object of class EmbeddedText.
Attention This object uses memory mapping to allow the usage of data sets that do not fit into memory. By calling this method the data set will be loaded and stored into memory/RAM. This may lead to an out-of-memory error.
LargeDataSetForTextEmbeddings$convert_to_EmbeddedText()
Returns an object of class EmbeddedText which is stored in memory/RAM.
clone()
The objects of this class are cloneable with this method.
LargeDataSetForTextEmbeddings$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other Data Management: DataManagerClassifier, EmbeddedText, LargeDataSetForText
Function for loading objects created with 'aifeducation'.
load_from_disk(dir_path)
dir_path: Path to the directory where the object is stored.
Returns an object of class TEClassifierRegular, TEClassifierProtoNet, TEFeatureExtractor, TextEmbeddingModel, LargeDataSetForTextEmbeddings, LargeDataSetForText or EmbeddedText.
Other Saving and Loading: save_to_disk()
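For example (the path is illustrative; the class of the returned object depends on what was saved):

# Restore a previously saved object, e.g. a classifier
model <- load_from_disk(dir_path = "models/my_classifier")
class(model)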
Function loads the target data for a long-running task.
long_load_target_data(file_path, selectet_column)
file_path: Path to the file storing the target data.
selectet_column: Name of the column containing the target data.
This function assumes that the target data is stored in columns, with the cases in the rows and the categories in the columns. The ids of the cases must be stored in a column called "id".
Returns a named factor containing the target data.
Other studio_utils: create_data_embeddings_description()
Function written in C++ for reshaping a matrix containing sequential data into an array for use with keras.
matrix_to_array_c(matrix, times, features)
matrix: matrix containing the sequential data.
times: int determining the number of sequences/times.
features: int determining the number of features within each sequence.
Returns an array. The first dimension corresponds to the cases, the second to the times, and the third to the features.
Other Auxiliary Functions: get_alpha_3_codes(), summarize_tracked_sustainability(), to_categorical_c()
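A small sketch with made-up values; the assumed time-major column layout (all features of time 1, then all features of time 2, and so on) is an assumption for illustration:

# 2 cases, 3 times with 2 features each -> 6 columns per case
m <- matrix(1:12, nrow = 2, byrow = TRUE)
a <- matrix_to_array_c(m, times = 3, features = 2)
dim(a)  # 2 3 2: cases, times, features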
Prints a message msg if the trace parameter is TRUE, together with the current date, using the message() or cat() function.
output_message(msg, trace, msg_fun)
msg: message to print.
trace: bool; if TRUE the message is printed.
msg_fun: determines whether message() or cat() is used for printing.
This function returns nothing.
Other Utils: auto_n_cores(), clean_pytorch_log_transformers(), create_config_state(), create_dir(), generate_id(), get_file_extension(), get_py_package_versions(), is.null_or_na(), print_message(), run_py_file()
Prints a message msg if the trace parameter is TRUE, together with the current date, using the message() function.
print_message(msg, trace)
msg: message to print.
trace: bool; if TRUE the message is printed.
This function returns nothing.
Other Utils: auto_n_cores(), clean_pytorch_log_transformers(), create_config_state(), create_dir(), generate_id(), get_file_extension(), get_py_package_versions(), is.null_or_na(), output_message(), run_py_file()
Used to run python files with reticulate::py_run_file() from the folder python.
run_py_file(py_file_name)
py_file_name: string; name of the python file to run.
This function returns nothing.
Other Utils: auto_n_cores(), clean_pytorch_log_transformers(), create_config_state(), create_dir(), generate_id(), get_file_extension(), get_py_package_versions(), is.null_or_na(), output_message(), print_message()
Function for saving objects created with 'aifeducation'.
save_to_disk(object, dir_path, folder_name)
object: Object of class TEClassifierRegular, TEClassifierProtoNet, TEFeatureExtractor, TextEmbeddingModel, LargeDataSetForTextEmbeddings, LargeDataSetForText or EmbeddedText which should be saved.
dir_path: Path to the directory where the object should be saved.
folder_name: Name of the folder that should be created within the directory.
Function does not return a value. It saves the object to disk and is called for its side effects.
Other Saving and Loading: load_from_disk()
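A round-trip sketch (classifier and the paths are illustrative):

# Saves the object into models/my_classifier
save_to_disk(object = classifier, dir_path = "models", folder_name = "my_classifier")

# Restore it later
classifier <- load_from_disk("models/my_classifier")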
This function configures 'tensorflow' to use only CPUs.
set_config_cpu_only()
This function does not return anything. It is used for its side effects.
Internally, this function calls os$environ$setdefault("CUDA_VISIBLE_DEVICES","-1") to hide GPUs from 'tensorflow'.
Other Installation and Configuration Tensorflow: set_config_gpu_low_memory(), set_config_os_environ_logger(), set_config_tf_logger()
This function changes the memory usage of the GPUs to allow computations on machines with small memory. With this function, some computations of large models may be possible, but the speed of computation decreases.
set_config_gpu_low_memory()
This function does not return anything. It is used for its side effects.
This function sets TF_GPU_ALLOCATOR to "cuda_malloc_async"
and sets memory growth to TRUE
.
Other Installation and Configuration Tensorflow: set_config_cpu_only(), set_config_os_environ_logger(), set_config_tf_logger()
This function changes the level for logging information with 'tensorflow' via the os environment. This function must be called before importing 'tensorflow'.
set_config_os_environ_logger(level = "ERROR")
level: string determining the level of logging information (default: "ERROR").
This function does not return anything. It is used for its side effects.
Other Installation and Configuration Tensorflow: set_config_cpu_only(), set_config_gpu_low_memory(), set_config_tf_logger()
This function changes the level for logging information with 'tensorflow'.
set_config_tf_logger(level = "ERROR")
level: string determining the level of logging information (default: "ERROR").
This function does not return anything. It is used for its side effects.
Other Installation and Configuration Tensorflow: set_config_cpu_only(), set_config_gpu_low_memory(), set_config_os_environ_logger()
This function changes the level for logging information of the 'transformers' library. It influences the output printed to console for creating and training transformer models as well as TextEmbeddingModels.
set_transformers_logger(level = "ERROR")
level: string determining the level of logging information (default: "ERROR").
This function does not return anything. It is used for its side effects.
Other Installation and Configuration: check_aif_py_modules(), install_aifeducation(), install_py_modules()
Function starts a shiny app that provides Aifeducation Studio.
start_aifeducation_studio()
This function does not return anything. It is used to start a shiny app.
Abstract class for neural nets with 'keras'/'tensorflow' and 'pytorch'.
This object represents an implementation of a prototypical network for few-shot learning as described by Snell, Swersky, and Zemel (2017). The network uses a multi-way contrastive loss described by Zhang et al. (2019). The network learns to scale the metric as described by Oreshkin, Rodriguez, and Lacoste (2018).
Objects of this class are used for assigning texts to classes/categories. For the creation and training of a
classifier an object of class EmbeddedText or LargeDataSetForTextEmbeddings and a factor
are necessary. The
object of class EmbeddedText or LargeDataSetForTextEmbeddings contains the numerical text representations (text
embeddings) of the raw texts generated by an object of class TextEmbeddingModel. The factor
contains the
classes/categories for every text. Missing values (unlabeled cases) are supported. For predictions an object of
class EmbeddedText or LargeDataSetForTextEmbeddings has to be used which was created with the same
TextEmbeddingModel as for training.
aifeducation::AIFEBaseModel
-> aifeducation::TEClassifierRegular
-> TEClassifierProtoNet
aifeducation::AIFEBaseModel$count_parameter()
aifeducation::AIFEBaseModel$get_all_fields()
aifeducation::AIFEBaseModel$get_documentation_license()
aifeducation::AIFEBaseModel$get_ml_framework()
aifeducation::AIFEBaseModel$get_model_description()
aifeducation::AIFEBaseModel$get_model_info()
aifeducation::AIFEBaseModel$get_model_license()
aifeducation::AIFEBaseModel$get_package_versions()
aifeducation::AIFEBaseModel$get_private()
aifeducation::AIFEBaseModel$get_publication_info()
aifeducation::AIFEBaseModel$get_sustainability_data()
aifeducation::AIFEBaseModel$get_text_embedding_model()
aifeducation::AIFEBaseModel$get_text_embedding_model_name()
aifeducation::AIFEBaseModel$is_configured()
aifeducation::AIFEBaseModel$load()
aifeducation::AIFEBaseModel$set_documentation_license()
aifeducation::AIFEBaseModel$set_model_description()
aifeducation::AIFEBaseModel$set_model_license()
aifeducation::AIFEBaseModel$set_publication_info()
aifeducation::TEClassifierRegular$check_embedding_model()
aifeducation::TEClassifierRegular$check_feature_extractor_object_type()
aifeducation::TEClassifierRegular$load_from_disk()
aifeducation::TEClassifierRegular$predict()
aifeducation::TEClassifierRegular$requires_compression()
aifeducation::TEClassifierRegular$save()
configure()
Creating a new instance of this class.
TEClassifierProtoNet$configure( ml_framework = "pytorch", name = NULL, label = NULL, text_embeddings = NULL, feature_extractor = NULL, target_levels = NULL, dense_size = 4, dense_layers = 0, rec_size = 4, rec_layers = 2, rec_type = "gru", rec_bidirectional = FALSE, embedding_dim = 2, self_attention_heads = 0, intermediate_size = NULL, attention_type = "fourier", add_pos_embedding = TRUE, rec_dropout = 0.1, repeat_encoder = 1, dense_dropout = 0.4, recurrent_dropout = 0.4, encoder_dropout = 0.1, optimizer = "adam" )
ml_framework
string
Currently only pytorch is supported (ml_framework="pytorch"
).
name
string
Name of the new classifier. Please refer to common name conventions. Free text can be used
with parameter label
.
label
string
Label for the new classifier. Here you can use free text.
text_embeddings
An object of class EmbeddedText or LargeDataSetForTextEmbeddings.
feature_extractor
Object of class TEFeatureExtractor which should be used in order to reduce the number
of dimensions of the text embeddings. If no feature extractor should be applied set NULL
.
target_levels
vector
containing the levels (categories or classes) within the target data. Please note
that order matters. For ordinal data please ensure that the levels are sorted correctly with later levels
indicating a higher category/class. For nominal data the order does not matter.
dense_size
int
Number of neurons for each dense layer.
dense_layers
int
Number of dense layers.
rec_size
int
Number of neurons for each recurrent layer.
rec_layers
int
Number of recurrent layers.
rec_type
string
Type of the recurrent layers. rec_type="gru"
for Gated Recurrent Unit and
rec_type="lstm"
for Long Short-Term Memory.
rec_bidirectional
bool
If TRUE
a bidirectional version of the recurrent layers is used.
embedding_dim
int
determining the number of dimensions for the text embedding.
self_attention_heads
int
determining the number of attention heads for a self-attention layer. Only
relevant if attention_type="multihead"
.
intermediate_size
int
determining the size of the projection layer within each transformer encoder.
attention_type
string
Choose the relevant attention type. Possible values are "fourier" and "multihead". Please note that, if you choose fourier on Linux, you may see different values for the same case depending on the input order.
add_pos_embedding
bool
TRUE
if positional embedding should be used.
rec_dropout
double
ranging between 0 and 1 (exclusive), determining the dropout between bidirectional
recurrent layers.
repeat_encoder
int
determining how many times the encoder should be added to the network.
dense_dropout
double
ranging between 0 and 1 (exclusive), determining the dropout between dense layers.
recurrent_dropout
double
ranging between 0 and 1 (exclusive), determining the recurrent dropout for each
recurrent layer. Only relevant for keras models.
encoder_dropout
double
ranging between 0 and 1 (exclusive), determining the dropout for the dense projection
within the encoder layers.
optimizer
string
"adam"
or "rmsprop"
.
Returns an object of class TEClassifierProtoNet which is ready for training.
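A hedged configuration sketch, assuming the usual create-then-configure workflow; embeddings is an existing object of class EmbeddedText or LargeDataSetForTextEmbeddings, and the name, label, and target levels are made up:

classifier <- TEClassifierProtoNet$new()
classifier$configure(
  ml_framework = "pytorch",
  name = "protonet_classifier_v1",
  label = "Example prototypical network classifier",
  text_embeddings = embeddings,
  target_levels = c("low", "medium", "high"),  # ordered, as order matters for ordinal data
  embedding_dim = 2
)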
train()
Method for training a neural net.
Training includes a routine for early stopping: training stops if loss < 0.0001, Accuracy = 1.00, and Average Iota = 1.00. The history uses the values of the last trained epoch for the remaining epochs.
After training the model with the best values for Average Iota, Accuracy, and Loss on the validation data set is used as the final model.
TEClassifierProtoNet$train( data_embeddings, data_targets, data_folds = 5, data_val_size = 0.25, use_sc = TRUE, sc_method = "dbsmote", sc_min_k = 1, sc_max_k = 10, use_pl = TRUE, pl_max_steps = 3, pl_max = 1, pl_anchor = 1, pl_min = 0, sustain_track = TRUE, sustain_iso_code = NULL, sustain_region = NULL, sustain_interval = 15, epochs = 40, batch_size = 35, Ns = 5, Nq = 3, loss_alpha = 0.5, loss_margin = 0.5, sampling_separate = FALSE, sampling_shuffle = TRUE, dir_checkpoint, trace = TRUE, ml_trace = 1, log_dir = NULL, log_write_interval = 10, n_cores = auto_n_cores() )
data_embeddings
Object of class EmbeddedText or LargeDataSetForTextEmbeddings.
data_targets
factor
containing the labels for cases stored in data_embeddings
. Factor must be named
and has to use the same names used in data_embeddings
.
data_folds
int
determining the number of cross-fold samples.
data_val_size
double
between 0 and 1, indicating the proportion of cases of each class which should be
used for the validation sample during the estimation of the model. The remaining cases are part of the training
data.
use_sc
bool
TRUE
if the estimation should integrate synthetic cases. FALSE
if not.
sc_method
vector
containing the method for generating synthetic cases. Possible are sc_method="adas"
,
sc_method="smote"
, and sc_method="dbsmote"
.
sc_min_k
int
determining the minimal number of k which is used for creating synthetic units.
sc_max_k
int
determining the maximal number of k which is used for creating synthetic units.
use_pl
bool
TRUE
if the estimation should integrate pseudo-labeling. FALSE
if not.
pl_max_steps
int
determining the maximum number of steps during pseudo-labeling.
pl_max
double
between 0 and 1, setting the maximal level of confidence for considering a case for
pseudo-labeling.
pl_anchor
double
between 0 and 1 indicating the reference point for sorting the new cases of every
label. See notes for more details.
pl_min
double
between 0 and 1, setting the minimal level of confidence for considering a case for
pseudo-labeling.
sustain_track
bool
If TRUE
energy consumption is tracked during training via the python library
'codecarbon'.
sustain_iso_code
string
ISO code (Alpha-3-Code) for the country. This variable must be set if
sustainability should be tracked. A list can be found on Wikipedia:
https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes.
sustain_region
Region within a country. Only available for the USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html
sustain_interval
int
Interval in seconds for measuring power usage.
epochs
int
Number of training epochs.
batch_size
int
Size of the batches for training.
Ns
int
Number of cases for every class in the sample.
Nq
int
Number of cases for every class in the query.
loss_alpha
double
Value between 0 and 1 indicating how strong the loss should focus on pulling cases to
its corresponding prototypes or pushing cases away from other prototypes. The higher the value the more the
loss concentrates on pulling cases to its corresponding prototypes.
loss_margin
double
Value greater than 0 indicating the minimal distance of every case from prototypes of
other classes.
sampling_separate
bool
If TRUE
the cases for every class are divided into a data set for sample and for query.
These are never mixed. If FALSE
sample and query cases are drawn from the same data pool. That is, a case can be
part of sample in one epoch and part of query in another epoch. It is ensured that a case is never part of
sample and query at the same time. In addition, it is ensured that every case exists only once during
a training step.
sampling_shuffle
bool
If TRUE
cases are randomly drawn from the data during every step. If FALSE
the cases are not shuffled.
dir_checkpoint
string
Path to the directory where the checkpoint during training should be saved. If the
directory does not exist, it is created.
trace
bool
TRUE
, if information about the estimation phase should be printed to the console.
ml_trace
int
ml_trace=0
does not print any information about the training process from pytorch on the
console.
log_dir
string
Path to the directory where the log files should be saved. If no logging is desired set
this argument to NULL
.
log_write_interval
int
Time in seconds determining the interval in which the logger should try to update
the log files. Only relevant if log_dir
is not NULL
.
n_cores
int
Number of cores which should be used during the calculation of synthetic cases. Only relevant if
use_sc=TRUE
.
balance_class_weights
bool
If TRUE
class weights are generated based on the frequencies of the
training data with the method 'Inverse Class Frequency'. If FALSE
each class has the weight 1.
balance_sequence_length
bool
If TRUE
sample weights are generated for the length of sequences based on
the frequencies of the training data with the method 'Inverse Class Frequency'. If FALSE each sequence length
has the weight 1.
sc_max_k
: All values from sc_min_k up to sc_max_k are successively used. If
the value of sc_max_k is too high, it is reduced to a number that allows the calculation of synthetic units.
pl_anchor:
With the help of this value, the new cases are sorted. For
this aim, the distance from the anchor is calculated and all cases are arranged into an ascending order.
Function does not return a value. It changes the object into a trained classifier.
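A hedged training sketch based on the signature above; embeddings and targets are assumed to exist, and the ISO code and checkpoint path are illustrative:

classifier$train(
  data_embeddings = embeddings,
  data_targets = targets,   # named factor; names match the ids in embeddings
  data_folds = 5,
  use_sc = TRUE, sc_method = "dbsmote",
  use_pl = TRUE,
  sustain_track = TRUE, sustain_iso_code = "DEU",
  epochs = 40, batch_size = 35,
  Ns = 5, Nq = 3,
  dir_checkpoint = "checkpoints"
)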
embed()
Method for embedding documents. Please do not confuse this type of embeddings with the embeddings of texts created by an object of class TextEmbeddingModel. These embeddings embed documents according to their similarity to specific classes.
TEClassifierProtoNet$embed(embeddings_q = NULL, batch_size = 32)
embeddings_q
Object of class EmbeddedText or LargeDataSetForTextEmbeddings containing the text embeddings for all cases which should be embedded into the classification space.
batch_size
int
batch size.
Returns a list
containing the following elements
embeddings_q
: embeddings for the cases (query sample).
embeddings_prototypes
: embeddings of the prototypes which were learned during training. They represent the centers of the different classes.
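A usage sketch for a trained classifier (embeddings is assumed to exist):

emb <- classifier$embed(embeddings_q = embeddings, batch_size = 32)
emb$embeddings_prototypes  # learned class centers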
plot_embeddings()
Method for creating a plot to visualize embeddings and their corresponding centers (prototypes).
TEClassifierProtoNet$plot_embeddings( embeddings_q, classes_q = NULL, batch_size = 12, alpha = 0.5, size_points = 3, size_points_prototypes = 8, inc_unlabeled = TRUE )
embeddings_q
Object of class EmbeddedText or LargeDataSetForTextEmbeddings containing the text embeddings for all cases which should be embedded into the classification space.
classes_q
Named factor
containing the true classes for every case. Please note that the names must match
the names/ids in embeddings_q
.
batch_size
int
batch size.
alpha
float
Value indicating how transparent the points should be (important
if many points overlap). Does not apply to points representing prototypes.
size_points
int
Size of the points excluding the points for prototypes.
size_points_prototypes
int
Size of points representing prototypes.
inc_unlabeled
bool
If TRUE
the plot includes unlabeled cases as data points.
Returns a plot of class ggplot
visualizing embeddings.
clone()
The objects of this class are cloneable with this method.
TEClassifierProtoNet$clone(deep = FALSE)
deep
Whether to make a deep clone.
Oreshkin, B. N., Rodriguez, P. & Lacoste, A. (2018). TADAM: Task dependent adaptive metric for improved few-shot learning. https://doi.org/10.48550/arXiv.1805.10123
Snell, J., Swersky, K. & Zemel, R. S. (2017). Prototypical Networks for Few-shot Learning. https://doi.org/10.48550/arXiv.1703.05175
Zhang, X., Nie, J., Zong, L., Yu, H. & Liang, W. (2019). One Shot Learning with Margin. In Q. Yang, Z.-H. Zhou, Z. Gong, M.-L. Zhang & S.-J. Huang (Eds.), Lecture Notes in Computer Science. Advances in Knowledge Discovery and Data Mining (Vol. 11440, pp. 305–317). Springer International Publishing. https://doi.org/10.1007/978-3-030-16145-3_24
Other Classification: TEClassifierRegular
Abstract class for neural nets with 'keras'/'tensorflow' and 'pytorch'.
Objects of this class are used for assigning texts to classes/categories. For the creation and training of a classifier an object of class EmbeddedText or LargeDataSetForTextEmbeddings on the one hand and a factor on the other hand are necessary.
The object of class EmbeddedText or LargeDataSetForTextEmbeddings contains the numerical text representations (text embeddings) of the raw texts generated by an object of class TextEmbeddingModel. For supporting large data sets it is recommended to use LargeDataSetForTextEmbeddings instead of EmbeddedText.
The factor
contains the classes/categories for every text. Missing values (unlabeled cases) are supported and can
be used for pseudo labeling.
For predictions an object of class EmbeddedText or LargeDataSetForTextEmbeddings has to be used which was created with the same TextEmbeddingModel as for training.
aifeducation::AIFEBaseModel
-> TEClassifierRegular
feature_extractor
('list()')
List for storing information and objects about the feature_extractor.
reliability
('list()')
List for storing central reliability measures of the last training.
reliability$test_metric
: Array containing the reliability measures for the test data for
every fold and step (in case of pseudo-labeling).
reliability$test_metric_mean
: Array containing the reliability measures for the test data.
The values represent the mean values for every fold.
reliability$raw_iota_objects
: List containing all iota_object generated with the package iotarelr
for every fold at the end of the last training for the test data.
reliability$raw_iota_objects$iota_objects_end
: List of objects with class iotarelr_iota2
containing the
estimated iota reliability of the second generation for the final model for every fold for the test data.
reliability$raw_iota_objects$iota_objects_end_free
: List of objects with class iotarelr_iota2
containing
the estimated iota reliability of the second generation for the final model for every fold for the test data.
Please note that the model is estimated without forcing the Assignment Error Matrix to be in line with the
assumption of weak superiority.
reliability$iota_object_end
: Object of class iotarelr_iota2
as a mean of the individual objects
for every fold for the test data.
reliability$iota_object_end_free
: Object of class iotarelr_iota2
as a mean of the individual objects
for every fold. Please note that the model is estimated without forcing the Assignment Error Matrix to be in
line with the assumption of weak superiority.
reliability$standard_measures_end
: Object of class list
containing the final measures for precision,
recall, and f1 for every fold.
reliability$standard_measures_mean
: matrix
containing the mean measures for precision, recall, and f1.
aifeducation::AIFEBaseModel$count_parameter()
aifeducation::AIFEBaseModel$get_all_fields()
aifeducation::AIFEBaseModel$get_documentation_license()
aifeducation::AIFEBaseModel$get_ml_framework()
aifeducation::AIFEBaseModel$get_model_description()
aifeducation::AIFEBaseModel$get_model_info()
aifeducation::AIFEBaseModel$get_model_license()
aifeducation::AIFEBaseModel$get_package_versions()
aifeducation::AIFEBaseModel$get_private()
aifeducation::AIFEBaseModel$get_publication_info()
aifeducation::AIFEBaseModel$get_sustainability_data()
aifeducation::AIFEBaseModel$get_text_embedding_model()
aifeducation::AIFEBaseModel$get_text_embedding_model_name()
aifeducation::AIFEBaseModel$is_configured()
aifeducation::AIFEBaseModel$load()
aifeducation::AIFEBaseModel$set_documentation_license()
aifeducation::AIFEBaseModel$set_model_description()
aifeducation::AIFEBaseModel$set_model_license()
aifeducation::AIFEBaseModel$set_publication_info()
configure()
Creating a new instance of this class.
TEClassifierRegular$configure( ml_framework = "pytorch", name = NULL, label = NULL, text_embeddings = NULL, feature_extractor = NULL, target_levels = NULL, dense_size = 4, dense_layers = 0, rec_size = 4, rec_layers = 2, rec_type = "gru", rec_bidirectional = FALSE, self_attention_heads = 0, intermediate_size = NULL, attention_type = "fourier", add_pos_embedding = TRUE, rec_dropout = 0.1, repeat_encoder = 1, dense_dropout = 0.4, recurrent_dropout = 0.4, encoder_dropout = 0.1, optimizer = "adam" )
ml_framework
string
Framework to use for training and inference. ml_framework="tensorflow"
for
'tensorflow' and ml_framework="pytorch"
for 'pytorch'
name
string
Name of the new classifier. Please refer to common name conventions. Free text can be used
with parameter label
.
label
string
Label for the new classifier. Here you can use free text.
text_embeddings
An object of class EmbeddedText or LargeDataSetForTextEmbeddings.
feature_extractor
Object of class TEFeatureExtractor which should be used in order to reduce the number
of dimensions of the text embeddings. If no feature extractor should be applied set NULL
.
target_levels
vector
containing the levels (categories or classes) within the target data. Please note
that order matters. For ordinal data please ensure that the levels are sorted correctly with later levels
indicating a higher category/class. For nominal data the order does not matter.
dense_size
int
Number of neurons for each dense layer.
dense_layers
int
Number of dense layers.
rec_size
int
Number of neurons for each recurrent layer.
rec_layers
int
Number of recurrent layers.
rec_type
string
Type of the recurrent layers. rec_type="gru"
for Gated Recurrent Unit and
rec_type="lstm"
for Long Short-Term Memory.
rec_bidirectional
bool
If TRUE
a bidirectional version of the recurrent layers is used.
self_attention_heads
int
determining the number of attention heads for a self-attention layer. Only
relevant if attention_type="multihead"
intermediate_size
int
determining the size of the projection layer within each transformer encoder.
attention_type
string
Choose the relevant attention type. Possible values are "fourier" and "multihead". Please note that, if you choose fourier on Linux, you may see different values for the same case depending on the input order.
add_pos_embedding
bool
TRUE
if positional embedding should be used.
rec_dropout
double
ranging between 0 and 1 (exclusive), determining the dropout between bidirectional recurrent
layers.
repeat_encoder
int
determining how many times the encoder should be added to the network.
dense_dropout
double
ranging between 0 and 1 (exclusive), determining the dropout between dense layers.
recurrent_dropout
double
ranging between 0 and 1 (exclusive), determining the recurrent dropout for each
recurrent layer. Only relevant for keras models.
encoder_dropout
double
ranging between 0 and 1 (exclusive), determining the dropout for the dense projection
within the encoder layers.
optimizer
string
"adam"
or "rmsprop"
.
Returns an object of class TEClassifierRegular which is ready for training.
train()
Method for training a neural net.
Training includes a routine for early stopping: training stops if loss < 0.0001, Accuracy = 1.00, and Average Iota = 1.00. The history uses the values of the last trained epoch for the remaining epochs.
After training the model with the best values for Average Iota, Accuracy, and Loss on the validation data set is used as the final model.
TEClassifierRegular$train( data_embeddings, data_targets, data_folds = 5, data_val_size = 0.25, balance_class_weights = TRUE, balance_sequence_length = TRUE, use_sc = TRUE, sc_method = "dbsmote", sc_min_k = 1, sc_max_k = 10, use_pl = TRUE, pl_max_steps = 3, pl_max = 1, pl_anchor = 1, pl_min = 0, sustain_track = TRUE, sustain_iso_code = NULL, sustain_region = NULL, sustain_interval = 15, epochs = 40, batch_size = 32, dir_checkpoint, trace = TRUE, ml_trace = 1, log_dir = NULL, log_write_interval = 10, n_cores = auto_n_cores() )
data_embeddings
Object of class EmbeddedText or LargeDataSetForTextEmbeddings.
data_targets
factor
containing the labels for cases stored in data_embeddings
. Factor must be named
and has to use the same names used in data_embeddings
.
data_folds
int
determining the number of cross-fold samples.
data_val_size
double
between 0 and 1, indicating the proportion of cases of each class which should be
used for the validation sample during the estimation of the model. The remaining cases are part of the training
data.
balance_class_weights
bool
If TRUE
class weights are generated based on the frequencies of the
training data with the method 'Inverse Class Frequency'. If FALSE
each class has the weight 1.
balance_sequence_length
bool
If TRUE
sample weights are generated for the length of sequences based on
the frequencies of the training data with the method 'Inverse Class Frequency'. If FALSE each sequence length
has the weight 1.
use_sc
bool
TRUE
if the estimation should integrate synthetic cases. FALSE
if not.
sc_method
vector
containing the method for generating synthetic cases. Possible are sc_method="adas"
,
sc_method="smote"
, and sc_method="dbsmote"
.
sc_min_k
int
determining the minimal number of k which is used for creating synthetic units.
sc_max_k
int
determining the maximal number of k which is used for creating synthetic units.
use_pl
bool
TRUE
if the estimation should integrate pseudo-labeling. FALSE
if not.
pl_max_steps
int
determining the maximum number of steps during pseudo-labeling.
pl_max
double
between 0 and 1, setting the maximal level of confidence for considering a case for
pseudo-labeling.
pl_anchor
double
between 0 and 1 indicating the reference point for sorting the new cases of every
label. See notes for more details.
pl_min
double
between 0 and 1, setting the minimal level of confidence for considering a case for
pseudo-labeling.
sustain_track
bool
If TRUE
energy consumption is tracked during training via the python library
'codecarbon'.
sustain_iso_code
string
ISO code (Alpha-3-Code) for the country. This variable must be set if
sustainability should be tracked. A list can be found on Wikipedia:
https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes.
sustain_region
Region within a country. Only available for the USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html
sustain_interval
int
Interval in seconds for measuring power usage.
epochs
int
Number of training epochs.
batch_size
int
Size of the batches for training.
dir_checkpoint
string
Path to the directory where the checkpoint during training should be saved. If the
directory does not exist, it is created.
trace
bool
TRUE
, if information about the estimation phase should be printed to the console.
ml_trace
int
ml_trace=0
does not print any information about the training process from pytorch on the
console.
log_dir
string
Path to the directory where the log files should be saved. If no logging is desired set
this argument to NULL
.
log_write_interval
int
Time in seconds determining the interval in which the logger should try to update
the log files. Only relevant if log_dir
is not NULL
.
n_cores
int
Number of cores which should be used during the calculation of synthetic cases. Only relevant if
use_sc=TRUE
.
sc_max_k
: All values from sc_min_k up to sc_max_k are successively used. If
the value of sc_max_k is too high, it is reduced to a number that allows the calculation of synthetic
units.
pl_anchor
: With the help of this value, the new cases are sorted. For
this aim, the distance from the anchor is calculated and all cases are arranged into an ascending order.
Function does not return a value. It changes the object into a trained classifier.
predict()
Method for predicting new data with a trained neural net.
TEClassifierRegular$predict(newdata, batch_size = 32, ml_trace = 1)
newdata
Object of class EmbeddedText or LargeDataSetForTextEmbeddings for which predictions
should be made. In addition, this method allows to use objects of class array
and
datasets.arrow_dataset.Dataset
. However, these should be used only by developers.
batch_size
int
Size of batches.
ml_trace
int
ml_trace=0
does not print any information on the process from the machine learning
framework.
Returns a data.frame
containing the predictions and the probabilities of the different labels for each
case.
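A prediction sketch for a trained classifier; new_embeddings is assumed to be created with the same TextEmbeddingModel as used for training:

pred <- classifier$predict(newdata = new_embeddings, batch_size = 32, ml_trace = 0)
head(pred)  # data.frame with predicted labels and per-label probabilities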
check_embedding_model()
Method for checking if the provided text embeddings are created with the same TextEmbeddingModel as the classifier.
TEClassifierRegular$check_embedding_model( text_embeddings, require_compressed = FALSE )
text_embeddings
Object of class EmbeddedText or LargeDataSetForTextEmbeddings.
require_compressed
TRUE
if a compressed version of the embeddings is necessary. Compressed embeddings
are created by an object of class TEFeatureExtractor.
TRUE
if the underlying TextEmbeddingModel is the same. FALSE
if the models differ.
check_feature_extractor_object_type()
Method for checking an object of class TEFeatureExtractor.
TEClassifierRegular$check_feature_extractor_object_type(feature_extractor)
feature_extractor
Object of class TEFeatureExtractor
This method does not return anything. It raises an error if
the object is NULL,
the object does not rely on the same machine learning framework as the classifier, or
the object is not trained.
requires_compression()
Method for checking if provided text embeddings must be compressed via a TEFeatureExtractor before processing.
TEClassifierRegular$requires_compression(text_embeddings)
text_embeddings
Object of class EmbeddedText, LargeDataSetForTextEmbeddings, array
or
datasets.arrow_dataset.Dataset
.
Returns TRUE
if a compression is necessary and FALSE
if not.
save()
Method for saving a model.
TEClassifierRegular$save(dir_path, folder_name)
dir_path
string
Path of the directory where the model should be saved.
folder_name
string
Name of the folder that should be created within the directory.
Function does not return a value. It saves the model to disk.
load_from_disk()
loads an object from disk and updates the object to the current version of the package.
TEClassifierRegular$load_from_disk(dir_path)
dir_path
Path where the object is stored.
Method does not return anything. It loads an object from disk.
clone()
The objects of this class are cloneable with this method.
TEClassifierRegular$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other Classification: TEClassifierProtoNet
Abstract class for autoencoders with 'pytorch'.
Objects of this class are used for reducing the number of dimensions of text embeddings created by an object of class TextEmbeddingModel.
For training an object of class EmbeddedText or LargeDataSetForTextEmbeddings generated by an object of class TextEmbeddingModel is necessary. Passing raw texts is not supported.
For prediction an object of class EmbeddedText or LargeDataSetForTextEmbeddings is necessary that was generated with the same TextEmbeddingModel as during training. Prediction outputs a new object of class EmbeddedText or LargeDataSetForTextEmbeddings which contains a text embedding with a lower number of dimensions.
All models use tied weights for the encoder and decoder layers (except method="lstm"
) and apply the estimation of
orthogonal weights. In addition, training encourages the model to produce uncorrelated features.
Objects of class TEFeatureExtractor are designed to be used with classifiers such as TEClassifierRegular and TEClassifierProtoNet.
aifeducation::AIFEBaseModel
-> TEFeatureExtractor
aifeducation::AIFEBaseModel$check_embedding_model()
aifeducation::AIFEBaseModel$count_parameter()
aifeducation::AIFEBaseModel$get_all_fields()
aifeducation::AIFEBaseModel$get_documentation_license()
aifeducation::AIFEBaseModel$get_ml_framework()
aifeducation::AIFEBaseModel$get_model_description()
aifeducation::AIFEBaseModel$get_model_info()
aifeducation::AIFEBaseModel$get_model_license()
aifeducation::AIFEBaseModel$get_package_versions()
aifeducation::AIFEBaseModel$get_private()
aifeducation::AIFEBaseModel$get_publication_info()
aifeducation::AIFEBaseModel$get_sustainability_data()
aifeducation::AIFEBaseModel$get_text_embedding_model()
aifeducation::AIFEBaseModel$get_text_embedding_model_name()
aifeducation::AIFEBaseModel$is_configured()
aifeducation::AIFEBaseModel$load()
aifeducation::AIFEBaseModel$save()
aifeducation::AIFEBaseModel$set_documentation_license()
aifeducation::AIFEBaseModel$set_model_description()
aifeducation::AIFEBaseModel$set_model_license()
aifeducation::AIFEBaseModel$set_publication_info()
configure()
Creating a new instance of this class.
TEFeatureExtractor$configure( ml_framework = "pytorch", name = NULL, label = NULL, text_embeddings = NULL, features = 128, method = "lstm", noise_factor = 0.2, optimizer = "adam" )
ml_framework
string
Framework to use for training and inference. Currently only ml_framework="pytorch"
is supported.
name
string
Name of the new classifier. Please refer to common name conventions. Free text can be used
with parameter label
.
label
string
Label for the new classifier. Here you can use free text.
text_embeddings
An object of class EmbeddedText or LargeDataSetForTextEmbeddings.
features
int
determining the number of dimensions to which the dimension of the text embedding should be
reduced.
method
string
Method to use for the feature extraction. "lstm"
for an extractor based on LSTM-layers or
"dense"
for dense layers.
noise_factor
double
between 0 and a value lower 1 indicating how much noise should be added for the
training of the feature extractor.
optimizer
string
"adam"
or "rmsprop"
.
Returns an object of class TEFeatureExtractor which is ready for training.
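A hedged configuration sketch, assuming the usual create-then-configure workflow; embeddings is an existing object of class EmbeddedText or LargeDataSetForTextEmbeddings, and name and label are made up:

extractor <- TEFeatureExtractor$new()
extractor$configure(
  ml_framework = "pytorch",
  name = "feature_extractor_v1",
  label = "Example feature extractor",
  text_embeddings = embeddings,
  features = 128,        # target number of dimensions
  method = "lstm",
  noise_factor = 0.2,
  optimizer = "adam"
)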
train()
Method for training a neural net.
TEFeatureExtractor$train( data_embeddings, data_val_size = 0.25, sustain_track = TRUE, sustain_iso_code = NULL, sustain_region = NULL, sustain_interval = 15, epochs = 40, batch_size = 32, dir_checkpoint, trace = TRUE, ml_trace = 1, log_dir = NULL, log_write_interval = 10 )
data_embeddings
Object of class EmbeddedText or LargeDataSetForTextEmbeddings.
data_val_size
double
between 0 and 1, indicating the proportion of cases which should be used for the
validation sample.
sustain_track
bool
If TRUE
energy consumption is tracked during training via the python library
'codecarbon'.
sustain_iso_code
string
ISO code (Alpha-3-Code) for the country. This variable must be set if
sustainability should be tracked. A list can be found on Wikipedia:
https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes.
sustain_region
Region within a country. Only available for USA and Canada See the documentation of 'codecarbon' for more information. https://mlco2.github.io/codecarbon/parameters.html
sustain_interval
int
Interval in seconds for measuring power usage.
epochs
int
Number of training epochs.
batch_size
int
Size of batches.
dir_checkpoint
string
Path to the directory where the checkpoint during training should be saved. If the
directory does not exist, it is created.
trace
bool
TRUE
, if information about the estimation phase should be printed to the console.
ml_trace
int
ml_trace=0
does not print any information about the training process from pytorch on
the console. ml_trace=1
prints a progress bar.
log_dir
string
Path to the directory where the log files should be saved. If no logging is desired set
this argument to NULL
.
log_write_interval
int
Time in seconds determining the interval in which the logger should try to update
the log files. Only relevant if log_dir
is not NULL
.
Function does not return a value. It changes the object into a trained feature extractor.
load_from_disk()
loads an object from disk and updates the object to the current version of the package.
TEFeatureExtractor$load_from_disk(dir_path)
dir_path
Path where the object is stored.
Method does not return anything. It loads an object from disk.
extract_features()
Method for extracting features. Applying this method reduces the number of dimensions of the text
embeddings. Please note that this method should only be used if a small number of cases needs to be compressed,
since the data is loaded completely into memory. For a high number of cases please use the method
extract_features_large
.
TEFeatureExtractor$extract_features(data_embeddings, batch_size)
data_embeddings
Object of class EmbeddedText, LargeDataSetForTextEmbeddings,
datasets.arrow_dataset.Dataset
or array
containing the text embeddings which should be reduced in their
dimensions.
batch_size
int
batch size.
Returns an object of class EmbeddedText containing the compressed embeddings.
extract_features_large()
Method for extracting features from a large number of cases. Applying this method reduces the number of dimensions of the text embeddings.
TEFeatureExtractor$extract_features_large( data_embeddings, batch_size, trace = FALSE )
data_embeddings
Object of class EmbeddedText or LargeDataSetForTextEmbeddings containing the text embeddings which should be reduced in their dimensions.
batch_size
int
batch size.
trace
bool
If TRUE
information about the progress is printed to the console.
Returns an object of class LargeDataSetForTextEmbeddings containing the compressed embeddings.
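A minimal sketch of both extraction methods, assuming the trained extractor from above; my_embeddings and my_large_embeddings are placeholder objects of the documented input classes.
# Sketch: compress a small set of embeddings in memory
compressed <- extractor$extract_features(
  data_embeddings = my_embeddings,
  batch_size = 32
)
# Sketch: compress a large set of embeddings batch-wise
compressed_large <- extractor$extract_features_large(
  data_embeddings = my_large_embeddings,
  batch_size = 32,
  trace = TRUE
)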
is_trained()
Check if the TEFeatureExtractor is trained.
TEFeatureExtractor$is_trained()
Returns TRUE
if the object is trained and FALSE
if not.
clone()
The objects of this class are cloneable with this method.
TEFeatureExtractor$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other Text Embedding:
TextEmbeddingModel
This R6
class stores a text embedding model which can be used to tokenize, encode, decode, and embed
raw texts. The object provides a unified interface for different text processing methods.
Objects of class TextEmbeddingModel transform raw texts into numerical representations that can be used for downstream tasks. To this end, objects of this class allow users to tokenize raw texts, encode tokens into sequences of integers, and decode sequences of integers back into tokens.
last_training
('list()')
List for storing the history and the results of the last training. This
information will be overwritten if a new training is started.
tokenizer_statistics
('matrix()')
Matrix containing the tokenizer statistics for the creation of the tokenizer
and all training runs according to Kaya & Tantuğ (2024).
Kaya, Y. B., & Tantuğ, A. C. (2024). Effect of tokenization granularity for Turkish large language models. Intelligent Systems with Applications, 21, 200335. https://doi.org/10.1016/j.iswa.2024.200335
configure()
Method for creating a new text embedding model
TextEmbeddingModel$configure( model_name = NULL, model_label = NULL, model_language = NULL, method = NULL, ml_framework = "pytorch", max_length = 0, chunks = 2, overlap = 0, emb_layer_min = "middle", emb_layer_max = "2_3_layer", emb_pool_type = "average", model_dir = NULL, trace = FALSE )
model_name
string
containing the name of the new model.
model_label
string
containing the label/title of the new model.
model_language
string
containing the language which the model
represents (e.g., English).
method
string
determining the kind of embedding model. Currently
the following models are supported:
method="bert"
for Bidirectional Encoder Representations from Transformers (BERT),
method="roberta"
for A Robustly Optimized BERT Pretraining Approach (RoBERTa),
method="longformer"
for Long-Document Transformer,
method="funnel"
for Funnel-Transformer,
method="deberta_v2"
for Decoding-enhanced BERT with Disentangled Attention (DeBERTa V2),
method="glove"
for GlobalVector Clusters, and method="lda"
for topic modeling. See
details for more information.
ml_framework
string
Framework to use for the model.
ml_framework="tensorflow"
for 'tensorflow' and ml_framework="pytorch"
for 'pytorch'. Only relevant for transformer models. To request a bag-of-words model
set ml_framework=NULL
.
max_length
int
determining the maximum length of token
sequences used in transformer models. Not relevant for the other methods.
chunks
int
Maximum number of chunks. Must be at least 2.
overlap
int
determining the number of tokens which should be added
at the beginning of the next chunk. Only relevant for transformer models.
emb_layer_min
int
or string
determining the first layer to be included
in the creation of embeddings. An integer corresponds to the layer number. The first
layer has the number 1. Instead of an integer, the following strings are possible:
"start"
for the first layer, "middle"
for the middle layer,
"2_3_layer"
for the layer at two-thirds of the model's depth, and "last"
for the last layer.
emb_layer_max
int
or string
determining the last layer to be included
in the creation of embeddings. An integer corresponds to the layer number. The first
layer has the number 1. Instead of an integer, the following strings are possible:
"start"
for the first layer, "middle"
for the middle layer,
"2_3_layer"
for the layer at two-thirds of the model's depth, and "last"
for the last layer.
emb_pool_type
string
determining the method for pooling the token embeddings
within each layer. If "cls"
only the embedding of the CLS token is used. If
"average"
the token embeddings of all tokens are averaged (excluding padding tokens).
"cls" is not supported for method="funnel".
model_dir
string
path to the directory where the
BERT model is stored.
trace
bool
TRUE
prints information about the progress.
FALSE
does not.
In the case of any transformer (e.g. method="bert"
,
method="roberta"
, and method="longformer"
),
a pretrained transformer model must be supplied via model_dir
.
Returns an object of class TextEmbeddingModel.
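A minimal sketch of configuring a BERT-based embedding model. Object creation via $new(), the model name, and the directory path are assumptions for illustration; model_dir must point to a pretrained transformer model already saved on disk.
# Sketch: configure a text embedding model on top of a pretrained BERT
embedding_model <- TextEmbeddingModel$new()
embedding_model$configure(
  model_name = "my_bert_embeddings",
  model_label = "BERT-based embeddings for my corpus",
  model_language = "english",
  method = "bert",
  ml_framework = "pytorch",
  max_length = 512,
  chunks = 4,
  overlap = 30,
  emb_layer_min = "middle",
  emb_layer_max = "2_3_layer",
  emb_pool_type = "average",
  model_dir = "models/bert_base"
)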
load_from_disk()
loads an object from disk and updates the object to the current version of the package.
TextEmbeddingModel$load_from_disk(dir_path)
dir_path
Path where the saved object is stored.
Method does not return anything. It loads an object from disk.
load()
Method for loading a transformers model into R.
TextEmbeddingModel$load(dir_path)
dir_path
string
containing the path to the relevant
model directory.
Function does not return a value. It is used for loading a saved transformer model into the R interface.
save()
Method for saving a transformer model on disk. Relevant only for transformer models.
TextEmbeddingModel$save(dir_path, folder_name)
dir_path
string
containing the path to the relevant
model directory.
folder_name
string
Name for the folder created within the directory.
This folder contains all model files.
Function does not return a value. It is used for saving a transformer model to disk.
encode()
Method for encoding words of raw texts into integers.
TextEmbeddingModel$encode( raw_text, token_encodings_only = FALSE, to_int = TRUE, trace = FALSE )
raw_text
vector
containing the raw texts.
token_encodings_only
bool
If TRUE
, only the token
encodings are returned. If FALSE
, the complete encoding is returned
which is important for some transformer models.
to_int
bool
If TRUE
the integer ids of the tokens are
returned. If FALSE
the tokens are returned. Argument only applies
for transformer models and if token_encodings_only=TRUE
.
trace
bool
If TRUE
, information of the progress
is printed. FALSE
if not requested.
list
containing the integer or token sequences of the raw texts with
special tokens.
decode()
Method for decoding a sequence of integers into tokens
TextEmbeddingModel$decode(int_seqence, to_token = FALSE)
int_seqence
list
containing the integer sequences which
should be transformed to tokens or plain text.
to_token
bool
If FALSE
plain text is returned.
If TRUE
a sequence of tokens is returned. Argument only relevant
if the model is based on a transformer.
list
of token sequences
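A minimal sketch of an encode/decode round trip, assuming the configured embedding_model from above (note that the parameter is spelled int_seqence in the package API).
# Sketch: encode raw text into integer ids and decode them back
ids <- embedding_model$encode(
  raw_text = c("This is an example sentence."),
  token_encodings_only = TRUE,
  to_int = TRUE
)
texts <- embedding_model$decode(int_seqence = ids, to_token = FALSE)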
get_special_tokens()
Method for retrieving the special tokens of the model
TextEmbeddingModel$get_special_tokens()
Returns a matrix
containing the special tokens in the rows
and their type, token, and id in the columns.
embed()
Method for creating text embeddings from raw texts.
This method should only be used to transform a small number of texts into text embeddings. For a large number of texts please use the method embed_large.
If a GPU runs out of memory while using 'tensorflow', reduce the batch size or restart R and switch to CPU-only mode via set_config_cpu_only. This is generally not relevant for 'pytorch'.
TextEmbeddingModel$embed( raw_text = NULL, doc_id = NULL, batch_size = 8, trace = FALSE, return_large_dataset = FALSE )
raw_text
vector
containing the raw texts.
doc_id
vector
containing the corresponding IDs for every text.
batch_size
int
determining the maximal size of every batch.
trace
bool
TRUE
, if information about the progression
should be printed on console.
return_large_dataset
bool
If TRUE the returned object is of class LargeDataSetForTextEmbeddings. If FALSE it is of class EmbeddedText.
Method returns an object of class EmbeddedText or LargeDataSetForTextEmbeddings. This object contains the embeddings as a data.frame and information about the model creating the embeddings.
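A minimal sketch of embedding a few raw texts with the configured model; the texts and document IDs are placeholders.
# Sketch: embed a small set of raw texts
embeddings <- embedding_model$embed(
  raw_text = c("First document.", "Second document."),
  doc_id = c("doc_1", "doc_2"),
  batch_size = 8
)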
embed_large()
Method for creating text embeddings from raw texts.
TextEmbeddingModel$embed_large( large_datas_set, batch_size = 32, trace = FALSE, log_file = NULL, log_write_interval = 2 )
large_datas_set
Object of class LargeDataSetForText containing the raw texts.
batch_size
int
determining the maximal size of every batch.
trace
bool
TRUE
, if information about the progression
should be printed on console.
log_file
string
Path to the file where the log should be saved.
If no logging is desired set this argument to NULL
.
log_write_interval
int
Time in seconds determining the interval in which
the logger should try to update the log files. Only relevant if log_file
is not NULL
.
Method returns an object of class LargeDataSetForTextEmbeddings.
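A minimal sketch for large corpora, assuming my_dataset is a placeholder object of class LargeDataSetForText (note that the parameter is spelled large_datas_set in the package API).
# Sketch: embed a large dataset batch-wise
embeddings_large <- embedding_model$embed_large(
  large_datas_set = my_dataset,
  batch_size = 32,
  trace = TRUE
)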
fill_mask()
Method for predicting the tokens behind mask tokens.
TextEmbeddingModel$fill_mask(text, n_solutions = 5)
text
string
Text containing mask tokens.
n_solutions
int
Number of estimated tokens for every mask.
Returns a list
containing a data.frame
for every
mask. The data.frame
contains the solutions in the rows and reports
the score, token id, and token string in the columns.
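A minimal sketch of filling a mask. The mask token depends on the underlying model (e.g., [MASK] for BERT-like models), and the example text is a placeholder.
# Sketch: request the five most likely tokens for a masked position
solutions <- embedding_model$fill_mask(
  text = "Education is the [MASK] of society.",
  n_solutions = 5
)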
set_publication_info()
Method for setting the bibliographic information of the model.
TextEmbeddingModel$set_publication_info(type, authors, citation, url = NULL)
type
string
Type of information which should be changed/added. Possible values are developer and modifier.
authors
List of people.
citation
string
Citation in free text.
url
string
Corresponding URL if applicable.
Function does not return a value. It is used to set the private members for publication information of the model.
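A minimal sketch of setting publication information. That authors accepts person objects created with base R's person() is an assumption here; names and citation are placeholders.
# Sketch: record bibliographic information for the model
embedding_model$set_publication_info(
  type = "developer",
  authors = list(person("Jane", "Doe")),  # person() from base R 'utils' (assumed input format)
  citation = "Doe, J. (2024). My embedding model.",
  url = NULL
)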
get_publication_info()
Method for getting the bibliographic information of the model.
TextEmbeddingModel$get_publication_info()
list
of bibliographic information.
set_model_license()
Method for setting the license of the model
TextEmbeddingModel$set_model_license(license = "CC BY")
license
string
containing the abbreviation of the license or
the license text.
Function does not return a value. It is used for setting the private member for the software license of the model.
get_model_license()
Method for requesting the license of the model
TextEmbeddingModel$get_model_license()
string
License of the model
set_documentation_license()
Method for setting the license of models' documentation.
TextEmbeddingModel$set_documentation_license(license = "CC BY")
license
string
containing the abbreviation of the license or
the license text.
Function does not return a value. It is used to set the private member for the documentation license of the model.
get_documentation_license()
Method for getting the license of the models' documentation.
TextEmbeddingModel$get_documentation_license()
license
string
containing the abbreviation of the license or
the license text.
set_model_description()
Method for setting a description of the model
TextEmbeddingModel$set_model_description( eng = NULL, native = NULL, abstract_eng = NULL, abstract_native = NULL, keywords_eng = NULL, keywords_native = NULL )
eng
string
A text describing the training of the classifier,
its theoretical and empirical background, and the different output labels
in English.
native
string
A text describing the training of the classifier,
its theoretical and empirical background, and the different output labels
in the native language of the model.
abstract_eng
string
A text providing a summary of the description
in English.
abstract_native
string
A text providing a summary of the description
in the native language of the classifier.
keywords_eng
vector
of keywords in English.
keywords_native
vector
of keywords in the native language of the classifier.
Function does not return a value. It is used to set the private members for the description of the model.
get_model_description()
Method for requesting the model description.
TextEmbeddingModel$get_model_description()
list
with the description of the model in English
and the native language.
get_model_info()
Method for requesting the model information
TextEmbeddingModel$get_model_info()
list
of all relevant model information
get_package_versions()
Method for requesting a summary of the R and python packages' versions used for creating the model.
TextEmbeddingModel$get_package_versions()
Returns a list
containing the versions of the relevant
R and python packages.
get_basic_components()
Method for requesting the part of the interface's configuration that is necessary for all models.
TextEmbeddingModel$get_basic_components()
Returns a list
.
get_transformer_components()
Method for requesting the part of the interface's configuration that is necessary for transformer models.
TextEmbeddingModel$get_transformer_components()
Returns a list
.
get_sustainability_data()
Method for requesting a log of tracked energy consumption during training and an estimate of the resulting CO2 equivalents in kg.
TextEmbeddingModel$get_sustainability_data()
Returns a matrix
containing the tracked energy consumption,
CO2 equivalents in kg, information on the tracker used, and technical
information on the training infrastructure for every training run.
get_ml_framework()
Method for requesting the machine learning framework used for the model.
TextEmbeddingModel$get_ml_framework()
Returns a string
describing the machine learning framework used
for the model.
count_parameter()
Method for counting the trainable parameters of a model.
TextEmbeddingModel$count_parameter(with_head = FALSE)
with_head
bool
If TRUE
the number of parameters is returned including
the language modeling head of the model. If FALSE
only the number of parameters of
the core model is returned.
Returns the number of trainable parameters of the model.
is_configured()
Method for checking if the model was successfully configured.
An object can only be used if this value is TRUE
.
TextEmbeddingModel$is_configured()
bool
TRUE
if the model is fully configured. FALSE
if not.
get_private()
Method for requesting all private fields and methods. Used for loading and updating an object.
TextEmbeddingModel$get_private()
Returns a list
with all private fields and methods.
get_all_fields()
Return all fields.
TextEmbeddingModel$get_all_fields()
Method returns a list
containing all public and private fields
of the object.
clone()
The objects of this class are cloneable with this method.
TextEmbeddingModel$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other Text Embedding:
TEFeatureExtractor
Function written in C++ transforming a vector of classes (int) into a binary class matrix.
to_categorical_c(class_vector, n_classes)
class_vector
vector containing the class of every case as an integer.
n_classes
int
Total number of classes.
Returns a matrix
containing the binary representation for
every class.
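A minimal sketch of the usage. Whether class indices are expected to start at 0 is an assumption here and should be checked against the actual implementation.
# Sketch: convert integer class labels into a binary class matrix
# (zero-based class indices are assumed)
binary_matrix <- to_categorical_c(
  class_vector = c(0, 2, 1, 2),
  n_classes = 3
)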
Other Auxiliary Functions:
get_alpha_3_codes()
,
matrix_to_array_c()
,
summarize_tracked_sustainability()