| Title: | Artificial Intelligence for Education |
|---|---|
| Description: | In social and educational settings, the use of Artificial Intelligence (AI) is a challenging task. Relevant data is often only available in handwritten forms, or the use of data is restricted by privacy policies. This often leads to small data sets. Furthermore, in the educational and social sciences, data is often unbalanced in terms of frequencies. To support educators as well as educational and social researchers in using the potential of AI for their work, this package provides a unified interface for neural nets in 'PyTorch' to deal with natural language problems. In addition, the package ships with a shiny app, providing a graphical user interface. This allows people without skills in writing python/R scripts to use AI. The tools integrate existing mathematical and statistical methods for dealing with small data sets via pseudo-labeling (e.g. Cascante-Bonilla et al. (2020) <doi:10.48550/arXiv.2001.06001>) and imbalanced data via the creation of synthetic cases (e.g. Islam et al. (2012) <doi:10.1016/j.asoc.2021.108288>). Performance evaluation of AI is connected to measures from content analysis which educational and social researchers are generally more familiar with (e.g. Berding & Pargmann (2022) <doi:10.30819/5581>, Gwet (2014) <ISBN:978-0-9708062-8-4>, Krippendorff (2019) <doi:10.4135/9781071878781>). Estimation of energy consumption and CO2 emissions during model training is done with the 'python' library 'codecarbon'. Finally, all objects created with this package allow trained AI models to be shared with other people. |
| Authors: | Berding Florian [aut, cre] (ORCID: <https://orcid.org/0000-0002-3593-1695>), Tykhonova Yuliia [aut] (ORCID: <https://orcid.org/0009-0006-9015-1006>), Pargmann Julia [ctb] (ORCID: <https://orcid.org/0000-0003-3616-0172>), Leube Anna [ctb] (ORCID: <https://orcid.org/0009-0001-6949-1608>), Riebenbauer Elisabeth [ctb] (ORCID: <https://orcid.org/0000-0002-8535-3694>), Rebmann Karin [ctb], Slopinski Andreas [ctb] |
| Maintainer: | Berding Florian <[email protected]> |
| License: | GPL-3 |
| Version: | 1.1.5 |
| Built: | 2026-05-26 06:19:05 UTC |
| Source: | https://github.com/fberding/aifeducation |
This function is designed for taking the output of
summarize_args_for_long_task as input. It adds the missing arguments.
In general these are arguments that rely on objects of class R6 which can not
be exported to a new R session.
add_missing_args(args, path_args, meta_args)
args |
Named |
path_args |
Named |
meta_args |
Named |
Returns a named list of all arguments that a method of a specific class
requires.
Other Utils Studio Developers:
create_data_embeddings_description(),
long_load_target_data(),
summarize_args_for_long_task()
Objects of this class contain fields and methods used in several other classes in 'AI for Education'.
This class is not designed for a direct application and should only be used by developers.
A new object of this class.
aifeducation::AIFEMaster -> AIFEBaseModel
Inherited methods: aifeducation::AIFEMaster$get_all_fields(), aifeducation::AIFEMaster$get_documentation_license(), aifeducation::AIFEMaster$get_ml_framework(), aifeducation::AIFEMaster$get_model_config(), aifeducation::AIFEMaster$get_model_description(), aifeducation::AIFEMaster$get_model_info(), aifeducation::AIFEMaster$get_model_license(), aifeducation::AIFEMaster$get_package_versions(), aifeducation::AIFEMaster$get_private(), aifeducation::AIFEMaster$get_publication_info(), aifeducation::AIFEMaster$get_sustainability_data(), aifeducation::AIFEMaster$is_configured(), aifeducation::AIFEMaster$is_trained(), aifeducation::AIFEMaster$set_documentation_license(), aifeducation::AIFEMaster$set_model_description(), aifeducation::AIFEMaster$set_model_license(), aifeducation::AIFEMaster$set_publication_info()
count_parameter()
Method for counting the trainable parameters of a model.
AIFEBaseModel$count_parameter()
Returns the number of trainable parameters of the model.
clone()
The objects of this class are cloneable with this method.
AIFEBaseModel$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other R6 Classes for Developers:
AIFEMaster,
BaseModelCore,
ClassifiersBasedOnTextEmbeddings,
DataManagerClassifier,
LargeDataSetBase,
ModelsBasedOnTextEmbeddings,
TEClassifiersBasedOnProtoNet,
TEClassifiersBasedOnRegular,
TokenizerBase
Objects of this class contain fields and methods used in several other classes in 'AI for Education'.
This class is not designed for a direct application and should only be used by developers.
A new object of this class.
last_training ('list()')
List for storing the history, the configuration, and the results of the last
training. This information will be overwritten if a new training is started.
last_training$start_time: Time point when training started.
last_training$learning_time: Duration of the training process.
last_training$finish_time: Time when the last training finished.
last_training$history: History of the last training.
last_training$data: Object of class table storing the initial frequencies of the passed data.
last_training$config: List storing the configuration used for the last training.
get_model_info()
Method for requesting the model information.
AIFEMaster$get_model_info()
list of all relevant model information.
set_publication_info()
Method for setting publication information of the model.
AIFEMaster$set_publication_info(authors, citation, url = NULL)
authors: List of authors.
citation: Free text citation.
url: URL of a corresponding homepage.
Function does not return a value. It is used for setting the private members for publication information.
get_publication_info()
Method for requesting the bibliographic information of the model.
AIFEMaster$get_publication_info()
list with all saved bibliographic information.
set_model_license()
Method for setting the license of the model.
AIFEMaster$set_model_license(license = "CC BY")
license: string containing the abbreviation of the license or the license text.
Function does not return a value. It is used for setting the private member for the software license of the model.
get_model_license()
Method for getting the license of the model.
AIFEMaster$get_model_license()
license: string containing the abbreviation of the license or the license text.
string representing the license for the model.
set_documentation_license()
Method for setting the license of the model's documentation.
AIFEMaster$set_documentation_license(license = "CC BY")
license: string containing the abbreviation of the license or the license text.
Function does not return a value. It is used for setting the private member for the documentation license of the model.
get_documentation_license()
Method for getting the license of the model's documentation.
AIFEMaster$get_documentation_license()
license: string containing the abbreviation of the license or the license text.
Returns the license as a string.
set_model_description()
Method for setting a description of the model.
AIFEMaster$set_model_description( eng = NULL, native = NULL, abstract_eng = NULL, abstract_native = NULL, keywords_eng = NULL, keywords_native = NULL )
eng: string A text describing the training, its theoretical and empirical background, and output in
English.
native: string A text describing the training, its theoretical and empirical background, and output in
the native language of the model.
abstract_eng: string A text providing a summary of the description in English.
abstract_native: string A text providing a summary of the description in the native language of the
model.
keywords_eng: vector of keywords in English.
keywords_native: vector of keywords in the native language of the model.
Function does not return a value. It is used for setting the private members for the description of the model.
get_model_description()
Method for requesting the model description.
AIFEMaster$get_model_description()
list with the description of the classifier in English and the native language.
get_package_versions()
Method for requesting a summary of the R and python packages' versions used for creating the model.
AIFEMaster$get_package_versions()
Returns a list containing the versions of the relevant R and python packages.
get_sustainability_data()
Method for requesting a summary of tracked energy consumption during training and an estimate of the resulting CO2 equivalents in kg.
AIFEMaster$get_sustainability_data(track_mode = "training")
track_mode: string Determines the step to which the data refer. Allowed values: 'training', 'inference'
Returns a list containing the tracked energy consumption, CO2 equivalents in kg, information on the
tracker used, and technical information on the training infrastructure.
get_ml_framework()
Method for requesting the machine learning framework used for the model.
AIFEMaster$get_ml_framework()
Returns a string describing the machine learning framework used for the classifier.
is_configured()
Method for checking if the model was successfully configured. An object can only be used if this
value is TRUE.
AIFEMaster$is_configured()
bool TRUE if the model is fully configured. FALSE if not.
is_trained()
Method for checking if the model is trained.
AIFEMaster$is_trained()
Returns TRUE if the object is trained and FALSE if not.
get_private()
Method for requesting all private fields and methods. Used for loading and updating an object.
AIFEMaster$get_private()
Returns a list with all private fields and methods.
get_all_fields()
Return all fields.
AIFEMaster$get_all_fields()
Method returns a list containing all public and private fields
of the object.
get_model_config()
Method for requesting the model configuration.
AIFEMaster$get_model_config()
Returns a list with all configuration parameters used during configuration.
clone()
The objects of this class are cloneable with this method.
AIFEMaster$clone(deep = FALSE)
deepWhether to make a deep clone.
Other R6 Classes for Developers:
AIFEBaseModel,
BaseModelCore,
ClassifiersBasedOnTextEmbeddings,
DataManagerClassifier,
LargeDataSetBase,
ModelsBasedOnTextEmbeddings,
TEClassifiersBasedOnProtoNet,
TEClassifiersBasedOnRegular,
TokenizerBase
Function for getting the number of cores that should be used
for parallel processing of tasks. The number of cores is set to 75% of the
available cores. If the environment variable CI is set to "true" or if the
process is running on CRAN, 2 is returned.
auto_n_cores()
Returns int as the number of cores.
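The rule described above can be sketched in base R. This is a minimal sketch under stated assumptions, not the package's implementation: the helper name sketch_n_cores is hypothetical, rounding down is an assumption, and the CRAN check is left out because the text does not say how it is detected.

```r
# Hypothetical helper illustrating the documented rule; not part of aifeducation.
sketch_n_cores <- function(available = parallel::detectCores(),
                           ci = Sys.getenv("CI")) {
  # On continuous integration the documentation states that 2 is returned.
  if (identical(tolower(ci), "true")) {
    return(2L)
  }
  # detectCores() may return NA on some platforms; fall back to one core.
  if (is.na(available)) {
    return(1L)
  }
  # Assumption: 75% of the cores, rounded down, but never fewer than one.
  max(1L, as.integer(floor(0.75 * available)))
}

sketch_n_cores(available = 8L, ci = "")      # 6
sketch_n_cores(available = 8L, ci = "true")  # 2
```

Passing the number of cores and the CI value as arguments keeps the sketch testable without changing the environment of the running session.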
Other Utils Developers:
create_object(),
create_synthetic_units_from_matrix(),
generate_id(),
get_n_chunks(),
get_synthetic_cases_from_matrix(),
get_time_stamp(),
matrix_to_array_c(),
tensor_to_matrix_c(),
to_categorical_c()
Represents models based on BERT.
Returns a new object of this class.
aifeducation::AIFEMaster -> aifeducation::AIFEBaseModel -> aifeducation::BaseModelCore -> BaseModelBert
Inherited methods: aifeducation::AIFEMaster$get_all_fields(), aifeducation::AIFEMaster$get_documentation_license(), aifeducation::AIFEMaster$get_ml_framework(), aifeducation::AIFEMaster$get_model_config(), aifeducation::AIFEMaster$get_model_description(), aifeducation::AIFEMaster$get_model_info(), aifeducation::AIFEMaster$get_model_license(), aifeducation::AIFEMaster$get_package_versions(), aifeducation::AIFEMaster$get_private(), aifeducation::AIFEMaster$get_publication_info(), aifeducation::AIFEMaster$get_sustainability_data(), aifeducation::AIFEMaster$is_configured(), aifeducation::AIFEMaster$is_trained(), aifeducation::AIFEMaster$set_documentation_license(), aifeducation::AIFEMaster$set_model_description(), aifeducation::AIFEMaster$set_model_license(), aifeducation::BaseModelCore$calc_flops_architecture_based(), aifeducation::BaseModelCore$count_parameter(), aifeducation::BaseModelCore$create_from_hf(), aifeducation::BaseModelCore$estimate_sustainability_inference_fill_mask(), aifeducation::BaseModelCore$fill_mask(), aifeducation::BaseModelCore$get_final_size(), aifeducation::BaseModelCore$get_flops_estimates(), aifeducation::BaseModelCore$get_model(), aifeducation::BaseModelCore$get_model_type(), aifeducation::BaseModelCore$get_n_layers(), aifeducation::BaseModelCore$get_special_tokens(), aifeducation::BaseModelCore$get_tokenizer_statistics(), aifeducation::BaseModelCore$load_from_disk(), aifeducation::BaseModelCore$plot_training_history(), aifeducation::BaseModelCore$save(), aifeducation::BaseModelCore$set_publication_info(), aifeducation::BaseModelCore$train()
configure()
Configures a new object of this class. Please ensure that your chosen configuration complies with the following guidelines:
hidden_size is a multiple of num_attention_heads.
BaseModelBert$configure( tokenizer, max_position_embeddings = 512L, hidden_size = 768L, num_hidden_layers = 12L, num_attention_heads = 12L, intermediate_size = 3072L, hidden_act = "GELU", hidden_dropout_prob = 0.1, attention_probs_dropout_prob = 0.1 )
tokenizer: TokenizerBase Tokenizer for the model.
max_position_embeddings: int Number of maximum position embeddings. This parameter also determines the maximum length of a sequence which
can be processed with the model. Allowed values:
hidden_size: int Number of neurons in each layer. This parameter determines the dimensionality of the resulting text
embedding. Allowed values:
num_hidden_layers: int Number of hidden layers. Allowed values:
num_attention_heads: int determining the number of attention heads for a self-attention layer. Only relevant if attention_type='multihead'. Allowed values:
intermediate_size: int determining the size of the projection layer within each transformer encoder. Allowed values:
hidden_act: string Name of the activation function. Allowed values: 'GELU', 'relu', 'silu', 'gelu_new'
hidden_dropout_prob: double Ratio of dropout. Allowed values:
attention_probs_dropout_prob: double Ratio of dropout for attention probabilities. Allowed values:
Returns nothing.
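The guideline that hidden_size is a multiple of num_attention_heads can be verified before calling configure(). The following base R sketch is an illustration only: the function name check_heads is hypothetical and not part of the package, and the defaults are copied from the signature above.

```r
# Hypothetical pre-check; the function name is not part of aifeducation.
check_heads <- function(hidden_size = 768L, num_attention_heads = 12L) {
  # The documented guideline: hidden_size is a multiple of num_attention_heads.
  if (hidden_size %% num_attention_heads != 0L) {
    stop("hidden_size must be a multiple of num_attention_heads.")
  }
  # Number of dimensions handled by each attention head.
  hidden_size %/% num_attention_heads
}

check_heads()  # 64 with the documented defaults

# A configuration that violates the guideline raises an error.
invalid <- tryCatch(check_heads(770L, 12L), error = function(e) NA)
```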
clone()
The objects of this class are cloneable with this method.
BaseModelBert$clone(deep = FALSE)
deep: Whether to make a deep clone.
Devlin, J., Chang, M.‑W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In J. Burstein, C. Doran, & T. Solorio (Eds.), Proceedings of the 2019 Conference of the North (pp. 4171–4186). Association for Computational Linguistics. doi:10.18653/v1/N19-1423
Other Base Model:
BaseModelDebertaV2,
BaseModelFunnel,
BaseModelMPNet,
BaseModelModernBert,
BaseModelRoberta
This class contains all methods shared by all BaseModels.
Returns a new object of this class.
aifeducation::AIFEMaster -> aifeducation::AIFEBaseModel -> BaseModelCore
Tokenizer ('TokenizerBase')
Objects of class TokenizerBase.
Inherited methods: aifeducation::AIFEMaster$get_all_fields(), aifeducation::AIFEMaster$get_documentation_license(), aifeducation::AIFEMaster$get_ml_framework(), aifeducation::AIFEMaster$get_model_config(), aifeducation::AIFEMaster$get_model_description(), aifeducation::AIFEMaster$get_model_info(), aifeducation::AIFEMaster$get_model_license(), aifeducation::AIFEMaster$get_package_versions(), aifeducation::AIFEMaster$get_private(), aifeducation::AIFEMaster$get_publication_info(), aifeducation::AIFEMaster$get_sustainability_data(), aifeducation::AIFEMaster$is_configured(), aifeducation::AIFEMaster$is_trained(), aifeducation::AIFEMaster$set_documentation_license(), aifeducation::AIFEMaster$set_model_description(), aifeducation::AIFEMaster$set_model_license()
create_from_hf()
Creates a BaseModel from a pretrained model.
BaseModelCore$create_from_hf(model_dir = NULL, tokenizer_dir = NULL)
model_dir: Path where the model is stored.
tokenizer_dir: string Path to the directory where the tokenizer is saved. Allowed values: any
Returns a new object of this class.
train()
Trains a BaseModel.
BaseModelCore$train( text_dataset, p_mask = 0.15, whole_word = TRUE, val_size = 0.1, n_epoch = 1L, batch_size = 12L, max_sequence_length = 250L, full_sequences_only = FALSE, min_seq_len = 50L, learning_rate = 0.003, sustain_track = FALSE, sustain_iso_code = NULL, sustain_region = NULL, sustain_interval = 15L, sustain_log_level = "warning", trace = TRUE, pytorch_trace = 1L, log_dir = NULL, log_write_interval = 2L )
text_dataset: LargeDataSetForText Object storing textual data.
p_mask: double Ratio that determines the number of tokens used for masking. Allowed values:
whole_word: bool TRUE: whole word masking should be applied. Only relevant if a WordPieceTokenizer is used.
FALSE: token masking is used.
val_size: double between 0 and 1, indicating the proportion of cases which should be
used for the validation sample during the estimation of the model.
The remaining cases are part of the training data. Allowed values:
n_epoch: int Number of training epochs. Allowed values:
batch_size: int Size of the batches for training. Allowed values:
max_sequence_length: int Maximal number of tokens for every sequence. Allowed values:
full_sequences_only: bool TRUE for using only chunks with a sequence length equal to chunk_size.
min_seq_len: int Only relevant if full_sequences_only = FALSE. Value determines the minimal sequence length included in
the training process. Allowed values:
learning_rate: double Initial learning rate for the training. Sets the maximal learning rate. Allowed values:
sustain_track: bool If TRUE energy consumption is tracked during training via the python library 'codecarbon'.
sustain_iso_code: string ISO code (Alpha-3-Code) for the country. This variable must be set if
sustainability should be tracked. A list can be found on Wikipedia:
https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes. Allowed values: any
sustain_region: string Region within a country. Only available for USA and Canada. See the documentation of
codecarbon for more information. https://docs.codecarbon.io/latest/ Allowed values: any
sustain_interval: int Interval in seconds for measuring power usage. Allowed values:
sustain_log_level: string Level for printing information to the console. Allowed values: 'debug', 'info', 'warning', 'error', 'critical'
trace: bool TRUE if information about the estimation phase should be printed to the console.
pytorch_trace: int pytorch_trace=0 does not print any information about the training process from pytorch on the console. Allowed values:
log_dir: string Path to the directory where the log files should be saved.
If no logging is desired set this argument to NULL. Allowed values: any
log_write_interval: int Time in seconds determining the interval in which the logger should try to update
the log files. Only relevant if log_dir is not NULL. Allowed values:
Returns nothing.
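To make the roles of val_size and p_mask more concrete, the base R sketch below shows how the two ratios translate into counts of validation cases and masked tokens. It is a simplified illustration under stated assumptions and not the package's training routine: the function split_and_mask, the rounding rule, the "[MASK]" placeholder, and the toy token vector are all hypothetical, and whole word masking is not modelled.

```r
# Hypothetical illustration; not the procedure used inside train().
split_and_mask <- function(n_cases, tokens, val_size = 0.1, p_mask = 0.15,
                           seed = 1L) {
  set.seed(seed)
  # val_size is the proportion of cases reserved for validation.
  n_val <- round(val_size * n_cases)
  val_idx <- sample.int(n_cases, n_val)
  train_idx <- setdiff(seq_len(n_cases), val_idx)
  # p_mask is the proportion of tokens replaced by the mask placeholder.
  n_mask <- max(1L, round(p_mask * length(tokens)))
  mask_idx <- sample.int(length(tokens), n_mask)
  masked <- tokens
  masked[mask_idx] <- "[MASK]"
  list(train_idx = train_idx, val_idx = val_idx, masked = masked)
}

tokens <- paste0("tok", 1:20)  # toy sequence of 20 tokens
res <- split_and_mask(n_cases = 100L, tokens = tokens)
length(res$val_idx)            # 10 cases for validation
sum(res$masked == "[MASK]")    # 3 of 20 tokens are masked
```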
count_parameter()
Method for counting the trainable parameters of a model.
BaseModelCore$count_parameter()
Returns the number of trainable parameters of the model.
plot_training_history()
Method for requesting a plot of the training history. This method requires the R package 'ggplot2' to work.
BaseModelCore$plot_training_history( x_min = NULL, x_max = NULL, y_min = NULL, y_max = NULL, ind_best_model = TRUE, text_size = 10L )
x_min: int Minimal value for x-axis. Set to NULL for an automatic adjustment. Allowed values:
x_max: int Maximal value for x-axis. Set to NULL for an automatic adjustment. Allowed values:
y_min: int Minimal value for y-axis. Set to NULL for an automatic adjustment. Allowed values:
y_max: int Maximal value for y-axis. Set to NULL for an automatic adjustment. Allowed values:
ind_best_model: bool If TRUE the plot indicates the best states of the model according to the chosen measure.
text_size: int Size of text elements. Allowed values:
Returns a plot of class ggplot visualizing the training process.
get_special_tokens()
Method for requesting the special tokens of the model.
BaseModelCore$get_special_tokens()
Returns a matrix containing the special tokens in the rows
and their type, token, and id in the columns.
get_tokenizer_statistics()
Tokenizer statistics
BaseModelCore$get_tokenizer_statistics()
Returns a data.frame containing the tokenizer's statistics.
fill_mask()
Method for calculating tokens behind mask tokens.
BaseModelCore$fill_mask(masked_text, n_solutions = 5L)
masked_text: string Text with mask tokens. Allowed values: any
n_solutions: int Number of solutions the model should predict. Allowed values:
Returns a list containing a data.frame for every
mask. The data.frame contains the solutions in the rows and reports
the score, token id, and token string in the columns.
save()
Method for saving a model on disk.
BaseModelCore$save(dir_path, folder_name)
dir_path: Path to the directory where to save the object.
folder_name: string Name of the folder where the model should be saved. Allowed values: any
Function returns nothing. It is used to save an object on disk.
load_from_disk()
Loads an object from disk and updates the object to the current version of the package.
BaseModelCore$load_from_disk(dir_path)
dir_path: Path where the object is stored.
Function returns nothing. It loads an object from disk.
get_model()
Get 'PyTorch' model
BaseModelCore$get_model()
Returns the underlying 'PyTorch' model.
get_model_type()
Type of the underlying model.
BaseModelCore$get_model_type()
Returns a string describing the model's architecture.
get_final_size()
Size of the final layer.
BaseModelCore$get_final_size()
Returns an int describing the number of dimensions of the last
hidden layer.
get_n_layers()
Number of layers.
BaseModelCore$get_n_layers()
Returns an int describing the number of layers available for
embedding.
get_flops_estimates()
Flop estimates
BaseModelCore$get_flops_estimates()
Returns a data.frame containing statistics about the flops.
set_publication_info()
Method for setting the bibliographic information of the model.
BaseModelCore$set_publication_info(type, authors, citation, url = NULL)
type: string Type of information which should be changed/added.
'developer' and 'modifier' are possible.
authors: List of people.
citation: string Citation in free text.
url: string Corresponding URL if applicable.
Function does not return a value. It is used to set the private members for publication information of the model.
estimate_sustainability_inference_fill_mask()
Calculates the energy consumption for inference of the given task.
BaseModelCore$estimate_sustainability_inference_fill_mask( text_dataset = NULL, n_samples = NULL, sustain_iso_code = NULL, sustain_region = NULL, sustain_interval = 15L, sustain_log_level = "warning", trace = TRUE )
text_dataset: LargeDataSetForText Object storing textual data.
n_samples: int Number of samples. Allowed values:
sustain_iso_code: string ISO code (Alpha-3-Code) for the country. This variable must be set if
sustainability should be tracked. A list can be found on Wikipedia:
https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes. Allowed values: any
sustain_region: string Region within a country. Only available for USA and Canada. See the documentation of
codecarbon for more information. https://docs.codecarbon.io/latest/ Allowed values: any
sustain_interval: int Interval in seconds for measuring power usage. Allowed values:
sustain_log_level: string Level for printing information to the console. Allowed values: 'debug', 'info', 'warning', 'error', 'critical'
trace: bool TRUE if information about the estimation phase should be printed to the console.
Returns nothing. Method saves the statistics internally.
The statistics can be accessed with the method get_sustainability_data("inference").
calc_flops_architecture_based()
Calculates FLOPs based on the model's architecture.
BaseModelCore$calc_flops_architecture_based(batch_size, n_batches, n_epoch)
batch_size: int Size of the batches for training. Allowed values:
n_batches: int Number of batches. Allowed values:
n_epoch: int Number of training epochs. Allowed values:
Returns a data.frame storing the estimates.
clone()
The objects of this class are cloneable with this method.
BaseModelCore$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other R6 Classes for Developers:
AIFEBaseModel,
AIFEMaster,
ClassifiersBasedOnTextEmbeddings,
DataManagerClassifier,
LargeDataSetBase,
ModelsBasedOnTextEmbeddings,
TEClassifiersBasedOnProtoNet,
TEClassifiersBasedOnRegular,
TokenizerBase
Represents models based on DeBERTa version 2.
Returns a new object of this class.
aifeducation::AIFEMaster -> aifeducation::AIFEBaseModel -> aifeducation::BaseModelCore -> BaseModelDebertaV2
Inherited methods: aifeducation::AIFEMaster$get_all_fields(), aifeducation::AIFEMaster$get_documentation_license(), aifeducation::AIFEMaster$get_ml_framework(), aifeducation::AIFEMaster$get_model_config(), aifeducation::AIFEMaster$get_model_description(), aifeducation::AIFEMaster$get_model_info(), aifeducation::AIFEMaster$get_model_license(), aifeducation::AIFEMaster$get_package_versions(), aifeducation::AIFEMaster$get_private(), aifeducation::AIFEMaster$get_publication_info(), aifeducation::AIFEMaster$get_sustainability_data(), aifeducation::AIFEMaster$is_configured(), aifeducation::AIFEMaster$is_trained(), aifeducation::AIFEMaster$set_documentation_license(), aifeducation::AIFEMaster$set_model_description(), aifeducation::AIFEMaster$set_model_license(), aifeducation::BaseModelCore$calc_flops_architecture_based(), aifeducation::BaseModelCore$count_parameter(), aifeducation::BaseModelCore$create_from_hf(), aifeducation::BaseModelCore$estimate_sustainability_inference_fill_mask(), aifeducation::BaseModelCore$fill_mask(), aifeducation::BaseModelCore$get_final_size(), aifeducation::BaseModelCore$get_flops_estimates(), aifeducation::BaseModelCore$get_model(), aifeducation::BaseModelCore$get_model_type(), aifeducation::BaseModelCore$get_n_layers(), aifeducation::BaseModelCore$get_special_tokens(), aifeducation::BaseModelCore$get_tokenizer_statistics(), aifeducation::BaseModelCore$load_from_disk(), aifeducation::BaseModelCore$plot_training_history(), aifeducation::BaseModelCore$save(), aifeducation::BaseModelCore$set_publication_info(), aifeducation::BaseModelCore$train()
configure()
Configures a new object of this class. Please ensure that your chosen configuration complies with the following guidelines:
hidden_size is a multiple of num_attention_heads.
BaseModelDebertaV2$configure( tokenizer, max_position_embeddings = 512L, hidden_size = 768L, num_hidden_layers = 12L, num_attention_heads = 12L, intermediate_size = 3072L, hidden_act = "GELU", hidden_dropout_prob = 0.1, attention_probs_dropout_prob = 0.1 )
tokenizer: TokenizerBase Tokenizer for the model.
max_position_embeddings: int Number of maximum position embeddings. This parameter also determines the maximum length of a sequence which
can be processed with the model. Allowed values:
hidden_size: int Number of neurons in each layer. This parameter determines the dimensionality of the resulting text
embedding. Allowed values:
num_hidden_layers: int Number of hidden layers. Allowed values:
num_attention_heads: int determining the number of attention heads for a self-attention layer. Only relevant if attention_type='multihead'. Allowed values:
intermediate_size: int determining the size of the projection layer within each transformer encoder. Allowed values:
hidden_act: string Name of the activation function. Allowed values: 'GELU', 'relu', 'silu', 'gelu_new'
hidden_dropout_prob: double Ratio of dropout. Allowed values:
attention_probs_dropout_prob: double Ratio of dropout for attention probabilities. Allowed values:
Returns nothing.
clone()
The objects of this class are cloneable with this method.
BaseModelDebertaV2$clone(deep = FALSE)
deep: Whether to make a deep clone.
He, P., Liu, X., Gao, J. & Chen, W. (2020). DeBERTa: Decoding-enhanced BERT with Disentangled Attention. doi:10.48550/arXiv.2006.03654
Other Base Model:
BaseModelBert,
BaseModelFunnel,
BaseModelMPNet,
BaseModelModernBert,
BaseModelRoberta
Represents models based on the Funnel-Transformer.
Returns a new object of this class.
aifeducation::AIFEMaster -> aifeducation::AIFEBaseModel -> aifeducation::BaseModelCore -> BaseModelFunnel
Inherited methods: aifeducation::AIFEMaster$get_all_fields(), aifeducation::AIFEMaster$get_documentation_license(), aifeducation::AIFEMaster$get_ml_framework(), aifeducation::AIFEMaster$get_model_config(), aifeducation::AIFEMaster$get_model_description(), aifeducation::AIFEMaster$get_model_info(), aifeducation::AIFEMaster$get_model_license(), aifeducation::AIFEMaster$get_package_versions(), aifeducation::AIFEMaster$get_private(), aifeducation::AIFEMaster$get_publication_info(), aifeducation::AIFEMaster$get_sustainability_data(), aifeducation::AIFEMaster$is_configured(), aifeducation::AIFEMaster$is_trained(), aifeducation::AIFEMaster$set_documentation_license(), aifeducation::AIFEMaster$set_model_description(), aifeducation::AIFEMaster$set_model_license(), aifeducation::BaseModelCore$calc_flops_architecture_based(), aifeducation::BaseModelCore$count_parameter(), aifeducation::BaseModelCore$create_from_hf(), aifeducation::BaseModelCore$estimate_sustainability_inference_fill_mask(), aifeducation::BaseModelCore$fill_mask(), aifeducation::BaseModelCore$get_final_size(), aifeducation::BaseModelCore$get_flops_estimates(), aifeducation::BaseModelCore$get_model(), aifeducation::BaseModelCore$get_model_type(), aifeducation::BaseModelCore$get_special_tokens(), aifeducation::BaseModelCore$get_tokenizer_statistics(), aifeducation::BaseModelCore$load_from_disk(), aifeducation::BaseModelCore$plot_training_history(), aifeducation::BaseModelCore$save(), aifeducation::BaseModelCore$set_publication_info(), aifeducation::BaseModelCore$train()
configure()
Configures a new object of this class. Please ensure that your chosen configuration complies with the following guidelines:
hidden_size is a multiple of num_attention_heads.
BaseModelFunnel$configure( tokenizer, max_position_embeddings = 512L, hidden_size = 768L, block_sizes = c(4L, 4L, 4L), num_attention_heads = 12L, intermediate_size = 3072L, num_decoder_layers = 2L, d_head = 64L, funnel_pooling_type = "Mean", hidden_act = "GELU", hidden_dropout_prob = 0.1, attention_probs_dropout_prob = 0.1, activation_dropout = 0 )
tokenizer: TokenizerBase Tokenizer for the model.
max_position_embeddings: int Number of maximum position embeddings. This parameter also determines the maximum length of a sequence which
can be processed with the model. Allowed values:
hidden_size: int Number of neurons in each layer. This parameter determines the dimensionality of the resulting text
embedding. Allowed values:
block_sizes: vector of int determining the number and sizes of each block.
num_attention_heads: int determining the number of attention heads for a self-attention layer. Only relevant if attention_type='multihead'. Allowed values:
intermediate_size: int determining the size of the projection layer within each transformer encoder. Allowed values:
num_decoder_layers: int Number of decoding layers. Allowed values:
d_head: int Number of neurons of the final layer. Allowed values:
funnel_pooling_type: string Method for pooling over the sequence length. Allowed values: 'Mean', 'Max'
hidden_act: string Name of the activation function. Allowed values: 'GELU', 'relu', 'silu', 'gelu_new'
hidden_dropout_prob: double Ratio of dropout. Allowed values:
attention_probs_dropout_prob: double Ratio of dropout for attention probabilities. Allowed values:
activation_dropout: double Dropout probability between the layers of the feed-forward blocks. Allowed values:
num_hidden_layers: int Number of hidden layers. Allowed values:
Returns nothing.
get_n_layers()
Number of layers.
BaseModelFunnel$get_n_layers()
Returns an int describing the number of layers available for
embedding.
clone()
The objects of this class are cloneable with this method.
BaseModelFunnel$clone(deep = FALSE)
deep: Whether to make a deep clone.
Dai, Z., Lai, G., Yang, Y. & Le, Q. V. (2020). Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing. doi:10.48550/arXiv.2006.03236
Other Base Model:
BaseModelBert,
BaseModelDebertaV2,
BaseModelMPNet,
BaseModelModernBert,
BaseModelRoberta
Represents models based on ModernBERT.
Returns a new object of this class.
aifeducation::AIFEMaster -> aifeducation::AIFEBaseModel -> aifeducation::BaseModelCore -> BaseModelModernBert
Inherited methods: aifeducation::AIFEMaster$get_all_fields(), aifeducation::AIFEMaster$get_documentation_license(), aifeducation::AIFEMaster$get_ml_framework(), aifeducation::AIFEMaster$get_model_config(), aifeducation::AIFEMaster$get_model_description(), aifeducation::AIFEMaster$get_model_info(), aifeducation::AIFEMaster$get_model_license(), aifeducation::AIFEMaster$get_package_versions(), aifeducation::AIFEMaster$get_private(), aifeducation::AIFEMaster$get_publication_info(), aifeducation::AIFEMaster$get_sustainability_data(), aifeducation::AIFEMaster$is_configured(), aifeducation::AIFEMaster$is_trained(), aifeducation::AIFEMaster$set_documentation_license(), aifeducation::AIFEMaster$set_model_description(), aifeducation::AIFEMaster$set_model_license(), aifeducation::BaseModelCore$calc_flops_architecture_based(), aifeducation::BaseModelCore$count_parameter(), aifeducation::BaseModelCore$create_from_hf(), aifeducation::BaseModelCore$estimate_sustainability_inference_fill_mask(), aifeducation::BaseModelCore$fill_mask(), aifeducation::BaseModelCore$get_final_size(), aifeducation::BaseModelCore$get_flops_estimates(), aifeducation::BaseModelCore$get_model(), aifeducation::BaseModelCore$get_model_type(), aifeducation::BaseModelCore$get_n_layers(), aifeducation::BaseModelCore$get_special_tokens(), aifeducation::BaseModelCore$get_tokenizer_statistics(), aifeducation::BaseModelCore$load_from_disk(), aifeducation::BaseModelCore$plot_training_history(), aifeducation::BaseModelCore$save(), aifeducation::BaseModelCore$set_publication_info(), aifeducation::BaseModelCore$train()
configure()
Configures a new object of this class. Please ensure that your chosen configuration complies with the following guidelines:
hidden_size is a multiple of num_attention_heads.
hidden_size/num_attention_heads is a multiple of 2.
global_attn_every_n_layers is less than or equal to num_hidden_layers.
BaseModelModernBert$configure( tokenizer, max_position_embeddings = 512L, hidden_size = 768L, num_hidden_layers = 12L, num_attention_heads = 12L, global_attn_every_n_layers = 3L, intermediate_size = 3072L, hidden_activation = "GELU", embedding_dropout = 0.1, mlp_dropout = 0.1, attention_dropout = 0.1 )
tokenizer (TokenizerBase): Tokenizer for the model.
max_position_embeddings (int): Number of maximum position embeddings. This parameter also determines the maximum length of a sequence which can be processed with the model. Allowed values:
hidden_size (int): Number of neurons in each layer. This parameter determines the dimensionality of the resulting text embedding. Allowed values:
num_hidden_layers (int): Number of hidden layers. Allowed values:
num_attention_heads (int): Number of attention heads for a self-attention layer. Only relevant if attention_type='multihead'. Allowed values:
global_attn_every_n_layers (int): Number determining that global attention is used in every n-th layer. Allowed values:
intermediate_size (int): Size of the projection layer within each transformer encoder. Allowed values:
hidden_activation (string): Name of the activation function. Allowed values: 'GELU', 'relu', 'silu', 'gelu_new'
embedding_dropout (double): Dropout chance for the embeddings. Allowed values:
mlp_dropout (double): Dropout rate for the mlp layer. Allowed values:
attention_dropout (double): Ratio of dropout for attention probabilities. Allowed values:
Returns nothing.
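The three configuration guidelines can be checked before calling configure(). The following is a minimal sketch in base R. The helper check_modernbert_config() is hypothetical and not part of the package; it only restates the guidelines as logical tests.

```r
# Hypothetical helper (not part of aifeducation): restates the three
# configuration guidelines for BaseModelModernBert$configure() as tests.
check_modernbert_config <- function(hidden_size,
                                    num_attention_heads,
                                    num_hidden_layers,
                                    global_attn_every_n_layers) {
  c(
    # hidden_size is a multiple of num_attention_heads
    hidden_is_multiple_of_heads = hidden_size %% num_attention_heads == 0,
    # hidden_size / num_attention_heads is a multiple of 2
    head_size_is_even = (hidden_size / num_attention_heads) %% 2 == 0,
    # global_attn_every_n_layers does not exceed num_hidden_layers
    global_attn_within_layers = global_attn_every_n_layers <= num_hidden_layers
  )
}

# Check the default values shown in the usage of configure()
default_check <- check_modernbert_config(
  hidden_size = 768L,
  num_attention_heads = 12L,
  num_hidden_layers = 12L,
  global_attn_every_n_layers = 3L
)
print(default_check)
```

A configuration should only be passed to configure() if all three entries are TRUE.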
clone()
The objects of this class are cloneable with this method.
BaseModelModernBert$clone(deep = FALSE)
deep: Whether to make a deep clone.
Devlin, J., Chang, M.‑W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In J. Burstein, C. Doran, & T. Solorio (Eds.), Proceedings of the 2019 Conference of the North (pp. 4171–4186). Association for Computational Linguistics. doi:10.18653/v1/N19-1423
Other Base Model:
BaseModelBert,
BaseModelDebertaV2,
BaseModelFunnel,
BaseModelMPNet,
BaseModelRoberta
Represents models based on MPNet.
Returns a new object of this class.
aifeducation::AIFEMaster -> aifeducation::AIFEBaseModel -> aifeducation::BaseModelCore -> BaseModelMPNet
Inherited methods: aifeducation::AIFEMaster$get_all_fields(), aifeducation::AIFEMaster$get_documentation_license(), aifeducation::AIFEMaster$get_ml_framework(), aifeducation::AIFEMaster$get_model_config(), aifeducation::AIFEMaster$get_model_description(), aifeducation::AIFEMaster$get_model_info(), aifeducation::AIFEMaster$get_model_license(), aifeducation::AIFEMaster$get_package_versions(), aifeducation::AIFEMaster$get_private(), aifeducation::AIFEMaster$get_publication_info(), aifeducation::AIFEMaster$get_sustainability_data(), aifeducation::AIFEMaster$is_configured(), aifeducation::AIFEMaster$is_trained(), aifeducation::AIFEMaster$set_documentation_license(), aifeducation::AIFEMaster$set_model_description(), aifeducation::AIFEMaster$set_model_license(), aifeducation::BaseModelCore$calc_flops_architecture_based(), aifeducation::BaseModelCore$count_parameter(), aifeducation::BaseModelCore$create_from_hf(), aifeducation::BaseModelCore$estimate_sustainability_inference_fill_mask(), aifeducation::BaseModelCore$fill_mask(), aifeducation::BaseModelCore$get_final_size(), aifeducation::BaseModelCore$get_flops_estimates(), aifeducation::BaseModelCore$get_model(), aifeducation::BaseModelCore$get_model_type(), aifeducation::BaseModelCore$get_n_layers(), aifeducation::BaseModelCore$get_special_tokens(), aifeducation::BaseModelCore$get_tokenizer_statistics(), aifeducation::BaseModelCore$load_from_disk(), aifeducation::BaseModelCore$plot_training_history(), aifeducation::BaseModelCore$save(), aifeducation::BaseModelCore$set_publication_info()
configure()
Configures a new object of this class. Please ensure that your chosen configuration complies with the following guidelines:
hidden_size is a multiple of num_attention_heads.
BaseModelMPNet$configure( tokenizer, max_position_embeddings = 512L, hidden_size = 768L, num_hidden_layers = 12L, num_attention_heads = 12L, intermediate_size = 3072L, hidden_act = "GELU", hidden_dropout_prob = 0.1, attention_probs_dropout_prob = 0.1 )
tokenizer (TokenizerBase): Tokenizer for the model.
max_position_embeddings (int): Number of maximum position embeddings. This parameter also determines the maximum length of a sequence which can be processed with the model. Allowed values:
hidden_size (int): Number of neurons in each layer. This parameter determines the dimensionality of the resulting text embedding. Allowed values:
num_hidden_layers (int): Number of hidden layers. Allowed values:
num_attention_heads (int): Number of attention heads for a self-attention layer. Only relevant if attention_type='multihead'. Allowed values:
intermediate_size (int): Size of the projection layer within each transformer encoder. Allowed values:
hidden_act (string): Name of the activation function. Allowed values: 'GELU', 'relu', 'silu', 'gelu_new'
hidden_dropout_prob (double): Ratio of dropout. Allowed values:
attention_probs_dropout_prob (double): Ratio of dropout for attention probabilities. Allowed values:
Returns nothing.
train()
Trains a BaseModel.
BaseModelMPNet$train( text_dataset, p_mask = 0.15, p_perm = 0.15, whole_word = TRUE, val_size = 0.1, n_epoch = 1L, batch_size = 12L, max_sequence_length = 250L, full_sequences_only = FALSE, min_seq_len = 50L, learning_rate = 0.003, sustain_track = FALSE, sustain_iso_code = NULL, sustain_region = NULL, sustain_interval = 15L, sustain_log_level = "warning", trace = TRUE, pytorch_trace = 1L, log_dir = NULL, log_write_interval = 2L )
text_dataset (LargeDataSetForText): LargeDataSetForText object storing textual data.
p_mask (double): Ratio that determines the number of tokens used for masking. Allowed values:
p_perm (double): Ratio that determines the number of tokens used for permutation. Allowed values:
whole_word (bool): TRUE: whole word masking is applied. Only relevant if a WordPieceTokenizer is used. FALSE: token masking is used.
val_size (double): Value between 0 and 1 indicating the proportion of cases used for the validation sample during the estimation of the model. The remaining cases are part of the training data. Allowed values:
n_epoch (int): Number of training epochs. Allowed values:
batch_size (int): Size of the batches for training. Allowed values:
max_sequence_length (int): Maximal number of tokens for every sequence. Allowed values:
full_sequences_only (bool): TRUE for using only chunks with a sequence length equal to chunk_size.
min_seq_len (int): Only relevant if full_sequences_only = FALSE. Value determines the minimal sequence length included in the training process. Allowed values:
learning_rate (double): Initial learning rate for the training. Sets the maximal learning rate. Allowed values:
sustain_track (bool): If TRUE energy consumption is tracked during training via the python library 'codecarbon'.
sustain_iso_code (string): ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes. Allowed values: any
sustain_region (string): Region within a country. Only available for USA and Canada. See the documentation of codecarbon for more information: https://docs.codecarbon.io/latest/ Allowed values: any
sustain_interval (int): Interval in seconds for measuring power usage. Allowed values:
sustain_log_level (string): Level for printing information to the console. Allowed values: 'debug', 'info', 'warning', 'error', 'critical'
trace (bool): TRUE if information about the estimation phase should be printed to the console.
pytorch_trace (int): pytorch_trace = 0 does not print any information about the training process from pytorch on the console. Allowed values:
log_dir (string): Path to the directory where the log files should be saved. If no logging is desired set this argument to NULL. Allowed values: any
log_write_interval (int): Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_dir is not NULL. Allowed values:
Returns nothing.
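The interaction of full_sequences_only and min_seq_len can be illustrated with a small base R sketch. It reflects one reading of the argument descriptions above, not the package's implementation. The helper keep_chunk() is hypothetical, and treating max_sequence_length as the chunk size is an assumption.

```r
# Hypothetical sketch (not the package's code): decides which text chunks
# would enter training, given their lengths in tokens.
# Assumption: the chunk size equals max_sequence_length.
keep_chunk <- function(chunk_lengths,
                       max_sequence_length = 250L,
                       full_sequences_only = FALSE,
                       min_seq_len = 50L) {
  if (full_sequences_only) {
    # Only chunks that fill the complete sequence length are kept
    chunk_lengths == max_sequence_length
  } else {
    # Shorter chunks are kept as long as they reach the minimal length
    chunk_lengths >= min_seq_len
  }
}

chunk_lengths <- c(250L, 120L, 30L)
kept_default <- keep_chunk(chunk_lengths)
kept_full_only <- keep_chunk(chunk_lengths, full_sequences_only = TRUE)
print(kept_default)
print(kept_full_only)
```

Under these assumptions, min_seq_len has no effect once full_sequences_only = TRUE, which matches the note that it is only relevant for full_sequences_only = FALSE.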
clone()
The objects of this class are cloneable with this method.
BaseModelMPNet$clone(deep = FALSE)
deep: Whether to make a deep clone.
Song, K., Tan, X., Qin, T., Lu, J. & Liu, T.-Y. (2020). MPNet: Masked and Permuted Pre-training for Language Understanding. doi:10.48550/arXiv.2004.09297
Other Base Model:
BaseModelBert,
BaseModelDebertaV2,
BaseModelFunnel,
BaseModelModernBert,
BaseModelRoberta
Represents models based on RoBERTa.
Returns a new object of this class.
aifeducation::AIFEMaster -> aifeducation::AIFEBaseModel -> aifeducation::BaseModelCore -> BaseModelRoberta
Inherited methods: aifeducation::AIFEMaster$get_all_fields(), aifeducation::AIFEMaster$get_documentation_license(), aifeducation::AIFEMaster$get_ml_framework(), aifeducation::AIFEMaster$get_model_config(), aifeducation::AIFEMaster$get_model_description(), aifeducation::AIFEMaster$get_model_info(), aifeducation::AIFEMaster$get_model_license(), aifeducation::AIFEMaster$get_package_versions(), aifeducation::AIFEMaster$get_private(), aifeducation::AIFEMaster$get_publication_info(), aifeducation::AIFEMaster$get_sustainability_data(), aifeducation::AIFEMaster$is_configured(), aifeducation::AIFEMaster$is_trained(), aifeducation::AIFEMaster$set_documentation_license(), aifeducation::AIFEMaster$set_model_description(), aifeducation::AIFEMaster$set_model_license(), aifeducation::BaseModelCore$calc_flops_architecture_based(), aifeducation::BaseModelCore$count_parameter(), aifeducation::BaseModelCore$create_from_hf(), aifeducation::BaseModelCore$estimate_sustainability_inference_fill_mask(), aifeducation::BaseModelCore$fill_mask(), aifeducation::BaseModelCore$get_final_size(), aifeducation::BaseModelCore$get_flops_estimates(), aifeducation::BaseModelCore$get_model(), aifeducation::BaseModelCore$get_model_type(), aifeducation::BaseModelCore$get_n_layers(), aifeducation::BaseModelCore$get_special_tokens(), aifeducation::BaseModelCore$get_tokenizer_statistics(), aifeducation::BaseModelCore$load_from_disk(), aifeducation::BaseModelCore$plot_training_history(), aifeducation::BaseModelCore$save(), aifeducation::BaseModelCore$set_publication_info(), aifeducation::BaseModelCore$train()
configure()
Configures a new object of this class. Please ensure that your chosen configuration complies with the following guidelines:
hidden_size is a multiple of num_attention_heads.
BaseModelRoberta$configure( tokenizer, max_position_embeddings = 512L, hidden_size = 768L, num_hidden_layers = 12L, num_attention_heads = 12L, intermediate_size = 3072L, hidden_act = "GELU", hidden_dropout_prob = 0.1, attention_probs_dropout_prob = 0.1 )
tokenizer (TokenizerBase): Tokenizer for the model.
max_position_embeddings (int): Number of maximum position embeddings. This parameter also determines the maximum length of a sequence which can be processed with the model. Allowed values:
hidden_size (int): Number of neurons in each layer. This parameter determines the dimensionality of the resulting text embedding. Allowed values:
num_hidden_layers (int): Number of hidden layers. Allowed values:
num_attention_heads (int): Number of attention heads for a self-attention layer. Only relevant if attention_type='multihead'. Allowed values:
intermediate_size (int): Size of the projection layer within each transformer encoder. Allowed values:
hidden_act (string): Name of the activation function. Allowed values: 'GELU', 'relu', 'silu', 'gelu_new'
hidden_dropout_prob (double): Ratio of dropout. Allowed values:
attention_probs_dropout_prob (double): Ratio of dropout for attention probabilities. Allowed values:
Returns nothing.
clone()
The objects of this class are cloneable with this method.
BaseModelRoberta$clone(deep = FALSE)
deep: Whether to make a deep clone.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. doi:10.48550/arXiv.1907.11692
Other Base Model:
BaseModelBert,
BaseModelDebertaV2,
BaseModelFunnel,
BaseModelMPNet,
BaseModelModernBert
Named list containing all BaseModels as a string.
BaseModelsIndex
An object of class list of length 6.
Other Parameter Dictionary:
DataSetsIndex,
TokenizerIndex,
doc_formula(),
get_TEClassifiers_class_names(),
get_called_args(),
get_depr_obj_names(),
get_magnitude_values(),
get_param_def(),
get_param_dict(),
get_param_doc_desc()
Function for generating the documentation of a model.
build_documentation_for_model( model_name, cls_type = NULL, core_type = NULL, input_type = "text_embeddings" )
model_name |
|
cls_type |
|
core_type |
|
input_type |
|
Returns a string containing the description written in rmarkdown.
Function is designed to be used with roxygen2 in the regular documentation.
Other Utils Documentation:
build_layer_stack_documentation_for_vignette(),
get_desc_for_core_model_architecture(),
get_dict_cls_type(),
get_dict_core_models(),
get_dict_input_types(),
get_layer_dict(),
get_layer_documentation(),
get_parameter_documentation()
Function for generating the whole documentation for an article used on the package's home page.
build_layer_stack_documentation_for_vignette()
Returns a string containing the description written in rmarkdown.
Function is designed to be used with inline r code in rmarkdown vignettes/articles.
Other Utils Documentation:
build_documentation_for_model(),
get_desc_for_core_model_architecture(),
get_dict_cls_type(),
get_dict_core_models(),
get_dict_input_types(),
get_layer_dict(),
get_layer_documentation(),
get_parameter_documentation()
Function for calculating recall, precision, and f1-scores.
calc_standard_classification_measures(true_values, predicted_values)
true_values |
|
predicted_values |
|
Returns a matrix which contains the categories in the rows and the measures (precision, recall, f1) in the columns.
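To make the three measures concrete, the following base R sketch computes them from a confusion table. It is an illustration under the standard definitions of precision, recall, and f1, not the package's implementation. The helper standard_measures_sketch() and the example data are hypothetical.

```r
# Hypothetical sketch (not the package's code): per-category precision,
# recall, and f1 from two factors that share the same levels.
standard_measures_sketch <- function(true_values, predicted_values) {
  category_levels <- levels(true_values)
  predicted_values <- factor(predicted_values, levels = category_levels)

  # Rows: true categories, columns: predicted categories
  confusion <- table(true = true_values, predicted = predicted_values)
  true_positives <- diag(confusion)

  precision <- true_positives / colSums(confusion)
  recall <- true_positives / rowSums(confusion)
  f1 <- 2 * precision * recall / (precision + recall)

  # Categories in the rows, measures in the columns
  cbind(precision = precision, recall = recall, f1 = f1)
}

true_values <- factor(c("a", "a", "a", "b", "b", "c"), levels = c("a", "b", "c"))
predicted_values <- factor(c("a", "a", "b", "b", "c", "c"), levels = c("a", "b", "c"))
measures <- standard_measures_sketch(true_values, predicted_values)
print(round(measures, 3))
```

Categories that are never predicted or never observed produce NaN in this sketch; how the package handles such cases is not specified here.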
Other performance measures:
cohens_kappa(),
fleiss_kappa(),
get_coder_metrics(),
gwet_ac(),
kendalls_w(),
kripp_alpha()
Function for estimating the tokenizer statistics described by Kaya & Tantuğ (2024).
calc_tokenizer_statistics( dataset, step = "creation", statistics_max_tokens_length = 512L )
dataset |
Object of class datasets.arrow_dataset.Dataset. The data set must contain a column |
step |
|
statistics_max_tokens_length |
|
Returns a list with the following entries:
n_sequences: Number of sequences
n_words: Number of words in the whole corpus
n_tokens: Number of tokens in the whole corpus
mu_t: n_tokens / n_sequences
mu_w: n_words / n_sequences
mu_g: n_tokens / n_words
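The ratios in this list can be reproduced by hand. The following base R sketch uses a tiny hypothetical corpus and a made-up tokenization. Words are counted by splitting on single spaces, which is a simplifying assumption and may differ from how the package counts words.

```r
# Hypothetical toy corpus with a made-up subword tokenization
sequences <- c("the cat sat", "tokenization matters")
tokens <- list(
  c("the", "cat", "sat"),
  c("token", "##ization", "matters")
)

n_sequences <- length(sequences)
# Assumption: words are separated by single spaces
n_words <- sum(lengths(strsplit(sequences, " ", fixed = TRUE)))
n_tokens <- sum(lengths(tokens))

tokenizer_stats <- list(
  n_sequences = n_sequences,
  n_words = n_words,
  n_tokens = n_tokens,
  mu_t = n_tokens / n_sequences, # average tokens per sequence
  mu_w = n_words / n_sequences,  # average words per sequence
  mu_g = n_tokens / n_words      # average tokens per word (granularity)
)
str(tokenizer_stats)
```

A value of mu_g close to 1 indicates that most words are represented by a single token, while larger values indicate that words are split into several tokens.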
Kaya, Y. B., & Tantuğ, A. C. (2024). Effect of tokenization granularity for Turkish large language models. Intelligent Systems with Applications, 21, 200335. https://doi.org/10.1016/j.iswa.2024.200335
Prints the message msg together with the current date using cat() if the trace parameter is TRUE.
cat_message(msg, trace)
msg |
|
trace |
|
This function returns nothing.
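The described behavior can be sketched in a few lines of base R. The function cat_message_sketch() is a hypothetical re-implementation for illustration; the exact date format used by the package is not documented here and is an assumption.

```r
# Hypothetical sketch (not the package's code): prints a time-stamped
# message with cat() only if trace is TRUE.
cat_message_sketch <- function(msg, trace) {
  if (isTRUE(trace)) {
    # Assumption: date and time in ISO-like format precede the message
    cat(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), msg, "\n")
  }
  invisible(NULL)
}

cat_message_sketch("Start training", trace = TRUE)
cat_message_sketch("This message is suppressed", trace = FALSE)
```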
Other Utils Log Developers:
clean_pytorch_log_transformers(),
output_message(),
print_message(),
read_log(),
read_loss_log(),
reset_log(),
reset_loss_log(),
write_log()
Depending on the test environment, the function adjusts the number of samples. For continuous integration, it is limited to a random sample of combinations. The same applies if CUDA is unavailable.
check_adjust_n_samples_on_CI(n_samples_requested, n_CI = 50L)
n_samples_requested |
|
n_CI |
|
Returns an int depending on the test environment.
Other Utils TestThat Developers:
generate_args_for_tests(),
generate_embeddings(),
generate_tensors(),
get_current_args_for_print(),
get_fixed_test_tensor(),
get_test_data_for_classifiers(),
monitor_test_time_on_CI(),
random_bool_on_CI()
This function checks if all python modules necessary for the package 'aifeducation' to work are available.
check_aif_py_modules(trace = TRUE)
trace |
|
The function prints a table with all relevant packages and shows which modules are available or unavailable.
If all relevant modules are available, the function returns TRUE. In all other cases it returns FALSE.
Other Installation and Configuration:
get_recommended_py_versions(),
install_aifeducation(),
install_aifeducation_studio(),
install_py_modules(),
prepare_session(),
set_transformers_logger(),
update_aifeducation()
This function performs checks for every provided argument. It can only check arguments that are defined in the central parameter dictionary. See get_param_dict for more details.
check_all_args(args)
args |
Named |
Function returns nothing. It raises an error if the arguments are not valid.
Other Utils Checks Developers:
check_class_and_type()
Function for checking if an object is of a specific type or class.
check_class_and_type( object, object_name = NULL, type_classes = "bool", allow_NULL = FALSE, min = NULL, max = NULL, allowed_values = NULL )
object |
Any R object. |
object_name |
|
type_classes |
|
allow_NULL |
|
min |
|
max |
|
allowed_values |
|
Function returns nothing. It raises an error if the object is not of the specified type.
The parameters min, max, and allowed_values do not apply if type_classes is a class.
allowed_values only applies if type_classes is string.
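The role of min, max, and allowed_values can be illustrated with a simplified base R sketch. The function check_value_sketch() is hypothetical and much narrower than check_class_and_type(): it only shows range checks for numbers and value checks for strings, and its error messages are invented for illustration.

```r
# Hypothetical sketch (not the package's code): range check for numeric
# values and membership check for strings. Raises an error on violation.
check_value_sketch <- function(object,
                               object_name = "object",
                               min = NULL,
                               max = NULL,
                               allowed_values = NULL) {
  if (is.numeric(object)) {
    if (!is.null(min) && object < min) {
      stop(object_name, " must be at least ", min)
    }
    if (!is.null(max) && object > max) {
      stop(object_name, " must be at most ", max)
    }
  }
  if (is.character(object) && !is.null(allowed_values) &&
    !(object %in% allowed_values)) {
    stop(object_name, " must be one of: ", paste(allowed_values, collapse = ", "))
  }
  invisible(NULL)
}

# Passes silently
check_value_sketch(0.15, object_name = "p_mask", min = 0, max = 1)

# Raises an error, captured here so that the script continues
result <- try(
  check_value_sketch("tanh", object_name = "hidden_act", allowed_values = c("GELU", "relu")),
  silent = TRUE
)
print(inherits(result, "try-error"))
```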
Other Utils Checks Developers:
check_all_args()
Function converts a vector of class indices into an arrow data set.
class_vector_to_py_dataset(vector)
vector |
|
Returns a data set of class datasets.arrow_dataset.Dataset containing the class indices.
Other Utils Python Data Management Developers:
create_py_dataset_cache_file_path(),
data.frame_to_py_dataset(),
extract_column_from_py_dataset(),
get_batches_index(),
prepare_r_array_for_dataset(),
py_dataset_to_embeddings(),
reduce_to_unique(),
tensor_list_to_numpy(),
tensor_to_numpy()
Base class for classifiers relying on EmbeddedText or LargeDataSetForTextEmbeddings generated with a TextEmbeddingModel.
Objects of this class contain fields and methods used in several other classes in 'AI for Education'.
This class is not designed for a direct application and should only be used by developers.
A new object of this class.
aifeducation::AIFEMaster -> aifeducation::AIFEBaseModel -> aifeducation::ModelsBasedOnTextEmbeddings -> ClassifiersBasedOnTextEmbeddings
feature_extractor('list()')
List for storing information and objects about the feature_extractor.
reliability('list()')
List for storing central reliability measures of the last training.
reliability$test_metric: Array containing the reliability measures for the test data for
every fold and step (in case of pseudo-labeling).
reliability$test_metric_mean: Array containing the reliability measures for the test data.
The values represent the mean values for every fold.
reliability$raw_iota_objects: List containing all iota_object generated with the package iotarelr
for every fold at the end of the last training for the test data.
reliability$raw_iota_objects$iota_objects_end: List of objects with class iotarelr_iota2 containing the
estimated iota reliability of the second generation for the final model for every fold for the test data.
reliability$raw_iota_objects$iota_objects_end_free: List of objects with class iotarelr_iota2 containing
the estimated iota reliability of the second generation for the final model for every fold for the test data.
Please note that the model is estimated without forcing the Assignment Error Matrix to be in line with the
assumption of weak superiority.
reliability$iota_object_end: Object of class iotarelr_iota2 as a mean of the individual objects
for every fold for the test data.
reliability$iota_object_end_free: Object of class iotarelr_iota2 as a mean of the individual objects
for every fold. Please note that the model is estimated without forcing the Assignment Error Matrix to be in
line with the assumption of weak superiority.
reliability$standard_measures_end: Object of class list containing the final measures for precision,
recall, and f1 for every fold.
reliability$standard_measures_mean: matrix containing the mean measures for precision, recall, and f1.
Inherited methods: aifeducation::AIFEMaster$get_all_fields(), aifeducation::AIFEMaster$get_documentation_license(), aifeducation::AIFEMaster$get_ml_framework(), aifeducation::AIFEMaster$get_model_config(), aifeducation::AIFEMaster$get_model_description(), aifeducation::AIFEMaster$get_model_info(), aifeducation::AIFEMaster$get_model_license(), aifeducation::AIFEMaster$get_package_versions(), aifeducation::AIFEMaster$get_private(), aifeducation::AIFEMaster$get_publication_info(), aifeducation::AIFEMaster$get_sustainability_data(), aifeducation::AIFEMaster$is_configured(), aifeducation::AIFEMaster$is_trained(), aifeducation::AIFEMaster$set_documentation_license(), aifeducation::AIFEMaster$set_model_description(), aifeducation::AIFEMaster$set_model_license(), aifeducation::AIFEMaster$set_publication_info(), aifeducation::AIFEBaseModel$count_parameter(), aifeducation::ModelsBasedOnTextEmbeddings$get_text_embedding_model(), aifeducation::ModelsBasedOnTextEmbeddings$get_text_embedding_model_name()
predict()
Method for predicting new data with a trained neural net.
ClassifiersBasedOnTextEmbeddings$predict( newdata, batch_size = 32L, ml_trace = 1L )
newdata: Object of class TextEmbeddingModel or LargeDataSetForTextEmbeddings for which predictions should be made. In addition, this method allows the use of objects of class array and datasets.arrow_dataset.Dataset. However, these should be used only by developers.
batch_size (int): Size of batches.
ml_trace (int): ml_trace=0 does not print any information on the process from the machine learning framework.
Returns a data.frame containing the predictions and the probabilities of the different labels for each
case.
check_embedding_model()
Method for checking if the provided text embeddings are created with the same TextEmbeddingModel as the classifier.
ClassifiersBasedOnTextEmbeddings$check_embedding_model( text_embeddings, require_compressed = FALSE )
text_embeddings: Object of class EmbeddedText or LargeDataSetForTextEmbeddings.
require_compressed: TRUE if a compressed version of the embeddings is necessary. Compressed embeddings are created by an object of class TEFeatureExtractor.
Returns TRUE if the underlying TextEmbeddingModel is the same and FALSE if the models differ.
check_feature_extractor_object_type()
Method for checking an object of class TEFeatureExtractor.
ClassifiersBasedOnTextEmbeddings$check_feature_extractor_object_type( feature_extractor )
feature_extractor: Object of class TEFeatureExtractor.
This method returns nothing. It raises an error if
the object is NULL,
the object does not rely on the same machine learning framework as the classifier, or
the object is not trained.
requires_compression()
Method for checking if provided text embeddings must be compressed via a TEFeatureExtractor before processing.
ClassifiersBasedOnTextEmbeddings$requires_compression(text_embeddings)
text_embeddings: Object of class EmbeddedText, LargeDataSetForTextEmbeddings, array or datasets.arrow_dataset.Dataset.
Returns TRUE if a compression is necessary and FALSE if not.
save()
Method for saving a model.
ClassifiersBasedOnTextEmbeddings$save(dir_path, folder_name)
dir_path (string): Path of the directory where the model should be saved.
folder_name (string): Name of the folder that should be created within the directory.
Function does not return a value. It saves the model to disk.
load_from_disk()
Loads an object from disk and updates the object to the current version of the package.
ClassifiersBasedOnTextEmbeddings$load_from_disk(dir_path)
dir_path: Path where the object is stored.
Method does not return anything. It loads an object from disk.
adjust_target_levels()
Method transforms the levels of a factor into numbers corresponding to the model's definition.
ClassifiersBasedOnTextEmbeddings$adjust_target_levels(data_targets)
data_targets (factor): Labels for the cases stored in embeddings. The factor must be named and has to use the same names as used in the embeddings.
Method returns a factor containing the numerical representation of
categories/classes.
plot_training_history()
Method for requesting a plot of the training history. This method requires the R package 'ggplot2' to work.
ClassifiersBasedOnTextEmbeddings$plot_training_history( final_training = FALSE, pl_step = NULL, measure = "loss", ind_best_model = TRUE, ind_selected_model = TRUE, x_min = NULL, x_max = NULL, y_min = NULL, y_max = NULL, add_min_max = TRUE, text_size = 10L )
final_training (bool): If FALSE the values of the performance estimation are used. If TRUE only the epochs of the final training are used.
pl_step (int): Number of the step during pseudo labeling to plot. Only relevant if the model was trained with active pseudo labeling.
measure (string): Measure to plot. Allowed values:
"avg_iota" = Average Iota
"loss" = Loss
"accuracy" = Accuracy
"balanced_accuracy" = Balanced Accuracy
ind_best_model (bool): If TRUE the plot indicates the best states of the model according to the chosen measure.
ind_selected_model (bool): If TRUE the plot indicates the states of the model which are used after training. These are the final states of the fold or the final state of the last training loop.
x_min (int): Minimal value for x-axis. Set to NULL for an automatic adjustment. Allowed values:
x_max (int): Maximal value for x-axis. Set to NULL for an automatic adjustment. Allowed values:
y_min (int): Minimal value for y-axis. Set to NULL for an automatic adjustment. Allowed values:
y_max (int): Maximal value for y-axis. Set to NULL for an automatic adjustment. Allowed values:
add_min_max (bool): If TRUE the minimal and maximal values during performance estimation are part of the plot. If FALSE only the mean values are shown. Parameter is ignored if final_training=TRUE.
text_size (int): Size of text elements. Allowed values:
Returns a plot of class ggplot visualizing the training process.
plot_coding_stream()
Method for requesting a plot of the coding stream. The plot shows how the cases of different categories/classes are assigned to the available classes/categories. The visualization is helpful for analyzing the consequences of coding errors.
ClassifiersBasedOnTextEmbeddings$plot_coding_stream( label_categories_size = 3L, key_size = 0.5, text_size = 10L )
label_categories_size (double): Size of the label for each true and assigned category within the plot.
key_size (double): Size of the legend.
text_size (double): Size of the text within the legend.
Returns a plot of class ggplot visualizing the training process.
clone()
The objects of this class are cloneable with this method.
ClassifiersBasedOnTextEmbeddings$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other R6 Classes for Developers:
AIFEBaseModel,
AIFEMaster,
BaseModelCore,
DataManagerClassifier,
LargeDataSetBase,
ModelsBasedOnTextEmbeddings,
TEClassifiersBasedOnProtoNet,
TEClassifiersBasedOnRegular,
TokenizerBase
Function for preparing and cleaning the log created by an object of class Trainer from the python library 'transformers'.
clean_pytorch_log_transformers(log)
log |
|
Returns a data.frame containing epochs, loss, and val_loss.
Other Utils Log Developers:
cat_message(),
output_message(),
print_message(),
read_log(),
read_loss_log(),
reset_log(),
reset_loss_log(),
write_log()
This function calculates different versions of Cohen's Kappa.
cohens_kappa(rater_one, rater_two)
rater_one |
|
rater_two |
|
Returns a list containing the results for Cohen's Kappa if no weights
are applied (kappa_unweighted), if weights are applied and the weights increase
linearly (kappa_linear), and if weights are applied and the weights increase quadratically
(kappa_squared).
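The three variants differ only in how strongly a disagreement between two categories is penalized. The following base R sketch computes them with the common disagreement-weight formulation, kappa = 1 - sum(w * observed) / sum(w * expected). It is an illustration, not the package's implementation. The helper kappa_sketch() and the example ratings are hypothetical.

```r
# Hypothetical sketch (not the package's code): unweighted, linearly
# weighted, and quadratically weighted Cohen's Kappa for two raters.
kappa_sketch <- function(rater_one, rater_two) {
  category_levels <- levels(rater_one)
  rater_two <- factor(rater_two, levels = category_levels)

  # Observed and chance-expected joint proportions
  observed <- prop.table(table(rater_one, rater_two))
  expected <- outer(rowSums(observed), colSums(observed))

  # Distance between category positions (assumes ordered levels)
  positions <- seq_along(category_levels)
  distance <- abs(outer(positions, positions, "-"))

  kappa_for <- function(weights) {
    1 - sum(weights * observed) / sum(weights * expected)
  }

  list(
    kappa_unweighted = kappa_for((distance > 0) * 1), # every disagreement counts 1
    kappa_linear = kappa_for(distance),               # penalty grows linearly
    kappa_squared = kappa_for(distance^2)             # penalty grows quadratically
  )
}

category_levels <- c("low", "mid", "high")
rater_one <- factor(c("low", "low", "mid", "mid", "high", "high"), levels = category_levels)
rater_two <- factor(c("low", "mid", "mid", "mid", "high", "low"), levels = category_levels)
kappa_values <- kappa_sketch(rater_one, rater_two)
print(kappa_values)
```

In this example the quadratically weighted value is the lowest because one disagreement spans the full range from "high" to "low", which the quadratic weights penalize most.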
Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213–220. https://doi.org/10.1037/h0026256
Cohen, J. (1960). A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1), 37–46. https://doi.org/10.1177/001316446002000104
Other performance measures:
calc_standard_classification_measures(),
fleiss_kappa(),
get_coder_metrics(),
gwet_ac(),
kendalls_w(),
kripp_alpha()
Checks whether the directory dir_path exists. If not, creates the directory and prints the message msg if trace is TRUE.
create_dir(dir_path, trace, msg = "Creating Directory", msg_fun = TRUE)
dir_path |
|
trace |
|
msg |
|
msg_fun |
|
TRUE or FALSE depending on whether the shiny app is active.
Other Utils File Management Developers:
get_file_extension()
Support function for creating objects.
create_object(class)
class |
|
Returns an object of the requested class.
Other Utils Developers:
auto_n_cores(),
create_synthetic_units_from_matrix(),
generate_id(),
get_n_chunks(),
get_synthetic_cases_from_matrix(),
get_time_stamp(),
matrix_to_array_c(),
tensor_to_matrix_c(),
to_categorical_c()
Function creates a valid file path for the argument cache_file_name of classes
"datasets.arrow_dataset.Dataset" from the python library 'datasets'. The aim of the
function is to ensure compatibility between different versions of 'datasets'.
create_py_dataset_cache_file_path(file_path)
file_path |
|
Returns a file path as string.
Other Utils Python Data Management Developers:
class_vector_to_py_dataset(),
data.frame_to_py_dataset(),
extract_column_from_py_dataset(),
get_batches_index(),
prepare_r_array_for_dataset(),
py_dataset_to_embeddings(),
reduce_to_unique(),
tensor_list_to_numpy(),
tensor_to_numpy()
Function for creating synthetic cases in order to balance the data for training with TEClassifierRegular or TEClassifierProtoNet. This is an auxiliary function for use with get_synthetic_cases_from_matrix to allow parallel computations.
create_synthetic_units_from_matrix( matrix_form, target, required_cases, k, method, cat )
matrix_form |
Named |
target |
Named |
required_cases |
|
k |
|
method |
|
cat |
|
Returns a list which contains the text embeddings of the new synthetic cases as a named data.frame and
their labels as a named factor.
Other Utils Developers:
auto_n_cores(),
create_object(),
generate_id(),
get_n_chunks(),
get_synthetic_cases_from_matrix(),
get_time_stamp(),
matrix_to_array_c(),
tensor_to_matrix_c(),
to_categorical_c()
Function for converting a data.frame into a pyarrow data set.
data.frame_to_py_dataset(data_frame)
data_frame |
Object of class |
Returns the data.frame as a pyarrow data set of class datasets.arrow_dataset.Dataset.
Other Utils Python Data Management Developers:
class_vector_to_py_dataset(),
create_py_dataset_cache_file_path(),
extract_column_from_py_dataset(),
get_batches_index(),
prepare_r_array_for_dataset(),
py_dataset_to_embeddings(),
reduce_to_unique(),
tensor_list_to_numpy(),
tensor_to_numpy()
Abstract class for managing the data and samples during the training of a classifier. DataManagerClassifier is used with all classifiers based on text embeddings.
Objects of this class are used for ensuring the correct data management for training different types of classifiers. They are also used for data augmentation by creating synthetic cases with different techniques.
config('list')
Field for storing configuration of the DataManagerClassifier.
state('list')
Field for storing the current state of the DataManagerClassifier.
datasets('list')
Field for storing the data sets used during training. All elements of the list are data sets of class
datasets.arrow_dataset.Dataset. The following data sets are available:
data_labeled: all cases which have a label.
data_unlabeled: all cases which have no label.
data_labeled_synthetic: all synthetic cases with their corresponding labels.
data_labeled_pseudo: subset of data_unlabeled if pseudo labels were estimated by a classifier.
name_idx('named vector')
Field for storing the pairs of indexes and names of every case. The pairs for labeled and unlabeled data are
separated.
samples('list')
Field for storing the assignment of every case to a train, validation, or test data set depending on the
concrete fold. Only the indexes and not the names are stored. In addition, the list contains the assignment for
the final training which excludes a test data set. If the DataManagerClassifier uses i folds, the sample for
the final training can be requested with i+1.
new()
Creating a new instance of this class.
DataManagerClassifier$new( data_embeddings, data_targets, class_levels, folds = 5L, val_size = 0.25, pad_value = -100L, one_hot_encoding = TRUE, add_matrix_map = TRUE, sc_methods = "knnor", sc_min_k = 1L, sc_max_k = 10L, trace = TRUE, n_cores = auto_n_cores() )
data_embeddings: EmbeddedText, LargeDataSetForTextEmbeddings. Object of class EmbeddedText or LargeDataSetForTextEmbeddings.
data_targets: factor containing the labels for cases stored in embeddings. Factor must be
named and has to use the same names as used in the embeddings.
class_levels: vector containing the levels (categories or classes) within the target data. Please
note that order matters. For ordinal data please ensure that the levels are sorted correctly with later levels
indicating a higher category/class. For nominal data the order does not matter.
folds: int determining the number of cross-fold samples. Allowed values:
val_size: double between 0 and 1, indicating the proportion of cases which should be
used for the validation sample during the estimation of the model.
The remaining cases are part of the training data. Allowed values:
pad_value: int Value indicating padding. This value should not be in the range of
regular values for computations. Thus it is not recommended to change this value.
Default is -100. Allowed values:
one_hot_encoding: bool If TRUE all labels are converted to one hot encoding.
add_matrix_map: bool If TRUE all embeddings are transformed into a two dimensional matrix.
The number of rows equals the number of cases. The number of columns equals times*features.
sc_methods: string containing the method for generating synthetic cases. Allowed values: 'knnor'
sc_min_k: int determining the minimal number of k which is used for creating synthetic units. Allowed values:
sc_max_k: int determining the maximal number of k which is used for creating synthetic units. Allowed values:
trace: bool TRUE if information about the estimation phase should be printed to the console.
n_cores: int Number of cores which should be used during the calculation of synthetic cases. Only relevant if use_sc=TRUE. Allowed values:
Method returns an initialized object of class DataManagerClassifier.
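The two transformations controlled by one_hot_encoding and add_matrix_map can be illustrated with a short Python sketch. It only shows the idea on nested lists; the function names are hypothetical, and the row-major column order of the flattened matrix is an assumption, not a statement about the package's internal layout.

```python
def one_hot(labels, class_levels):
    """Encode each label as a 0/1 vector whose positions follow class_levels."""
    index = {level: i for i, level in enumerate(class_levels)}
    return [
        [1 if index[label] == i else 0 for i in range(len(class_levels))]
        for label in labels
    ]


def to_matrix_map(embeddings):
    """Flatten each case of shape (times, features) into one row of length times*features."""
    return [
        [value for time_step in case for value in time_step]
        for case in embeddings
    ]
```

Since the position of the 1 follows the order of class_levels, a different level order yields a different encoding, which is why the order matters for ordinal data.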
get_config()
Method for requesting the configuration of the DataManagerClassifier.
DataManagerClassifier$get_config()
Returns a list storing the configuration of the DataManagerClassifier.
get_labeled_data()
Method for requesting the complete labeled data set.
DataManagerClassifier$get_labeled_data()
Returns an object of class datasets.arrow_dataset.Dataset containing all cases with labels.
get_unlabeled_data()
Method for requesting the complete unlabeled data set.
DataManagerClassifier$get_unlabeled_data()
Returns an object of class datasets.arrow_dataset.Dataset containing all cases without labels.
get_samples()
Method for requesting the assignments to train, validation, and test data sets for every fold and the final training.
DataManagerClassifier$get_samples()
Returns a list storing the assignments to a train, validation, and test data set for every fold. In the
case of the sample for the final training the test data set is always empty (NULL).
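The structure of such a list of samples can be sketched in Python as follows. The sketch assigns shuffled case indexes to folds and appends the sample for the final training as entry i+1 with an empty test set. It ignores any stratification by class, and the function name, the dictionary keys, and the `seed` parameter are assumptions for illustration only.

```python
import random


def build_samples(n_cases, folds, val_size, seed=0):
    """Assign case indexes to train/val/test per fold, plus a final sample without test data."""
    rng = random.Random(seed)
    indexes = list(range(n_cases))
    rng.shuffle(indexes)
    # Split the shuffled indexes into `folds` nearly equal test parts.
    parts = [indexes[i::folds] for i in range(folds)]

    def split_train_val(pool):
        # The first share of the pool becomes validation data, the rest training data.
        n_val = int(round(len(pool) * val_size))
        return {"train": pool[n_val:], "val": pool[:n_val]}

    samples = []
    for i in range(folds):
        pool = [idx for j, part in enumerate(parts) if j != i for idx in part]
        sample = split_train_val(pool)
        sample["test"] = parts[i]
        samples.append(sample)
    # Entry folds + 1: the final training uses all cases and has no test data.
    final = split_train_val(indexes)
    final["test"] = None
    samples.append(final)
    return samples
```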
set_state()
Method for setting the current state of the DataManagerClassifier.
DataManagerClassifier$set_state(iteration, step = NULL)
iteration: int determining the current iteration of the training. That is, iteration determines the fold
to use for training, validation, and testing. If i is the number of folds, i+1 requests the sample for the
final training. Alternatively, iteration can be the string "final" to request the sample for the final training.
step: int determining the step for estimating and using pseudo labels during training. Only relevant if
training is requested with pseudo labels.
Method does not return anything. It is used for setting the internal state of the DataManager.
get_n_folds()
Method for requesting the number of folds the DataManagerClassifier can use with the current data.
DataManagerClassifier$get_n_folds()
Returns the number of folds the DataManagerClassifier uses.
get_n_classes()
Method for requesting the number of classes.
DataManagerClassifier$get_n_classes()
Returns the number of classes.
get_statistics()
Method for requesting descriptive sample statistics.
DataManagerClassifier$get_statistics()
Returns a table describing the absolute frequencies of the labeled and unlabeled data. The rows contain the length of the sequences while the columns contain the labels.
contains_unlabeled_data()
Method for checking if the dataset contains cases without labels.
DataManagerClassifier$contains_unlabeled_data()
Returns TRUE if the dataset contains cases without labels. Returns FALSE
if all cases have labels.
get_dataset()
Method for requesting a data set for training depending on the current state of the DataManagerClassifier.
DataManagerClassifier$get_dataset( inc_labeled = TRUE, inc_unlabeled = FALSE, inc_synthetic = FALSE, inc_pseudo_data = FALSE )
inc_labeled: bool If TRUE the data set includes all cases which have labels.
inc_unlabeled: bool If TRUE the data set includes all cases which have no labels.
inc_synthetic: bool If TRUE the data set includes all synthetic cases with their corresponding labels.
inc_pseudo_data: bool If TRUE the data set includes all cases which have pseudo labels.
Returns an object of class datasets.arrow_dataset.Dataset containing the requested kind of data along
with all requested transformations for training. Please note that this method returns a data set that is
designed for training only. The corresponding validation data set is requested with get_val_dataset and the
corresponding test data set with get_test_dataset.
get_val_dataset()
Method for requesting a data set for validation depending on the current state of the DataManagerClassifier.
DataManagerClassifier$get_val_dataset()
Returns an object of class datasets.arrow_dataset.Dataset containing the requested kind of data along
with all requested transformations for validation. The corresponding data set for training can be requested
with get_dataset and the corresponding data set for testing with get_test_dataset.
get_test_dataset()
Method for requesting a data set for testing depending on the current state of the DataManagerClassifier.
DataManagerClassifier$get_test_dataset()
Returns an object of class datasets.arrow_dataset.Dataset containing the requested kind of data along
with all requested transformations for testing. The corresponding data set for training can be requested
with get_dataset and the corresponding data set for validation with get_val_dataset.
create_synthetic()
Method for generating synthetic data used during training. The process uses all labeled data belonging to the current state of the DataManagerClassifier.
DataManagerClassifier$create_synthetic(trace = TRUE, inc_pseudo_data = FALSE)
trace: bool If TRUE information on the process is printed to the console.
inc_pseudo_data: bool If TRUE data with pseudo labels are used in addition to the labeled data for
generating synthetic cases.
This method does not return anything. It generates a new data set for synthetic cases which is stored as an
object of class datasets.arrow_dataset.Dataset in the field datasets$data_labeled_synthetic. Please note
that a call of this method will overwrite an existing data set in the corresponding field.
add_replace_pseudo_data()
Method for adding data with pseudo labels generated by a classifier.
DataManagerClassifier$add_replace_pseudo_data(inputs, labels)
inputs: array or matrix representing the input data.
labels: factor containing the corresponding pseudo labels.
This method does not return anything. It generates a new data set for cases with pseudo labels which is stored as an
object of class datasets.arrow_dataset.Dataset in the field datasets$data_labeled_pseudo. Please note that
a call of this method will overwrite an existing data set in the corresponding field.
clone()
The objects of this class are cloneable with this method.
DataManagerClassifier$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other R6 Classes for Developers:
AIFEBaseModel,
AIFEMaster,
BaseModelCore,
ClassifiersBasedOnTextEmbeddings,
LargeDataSetBase,
ModelsBasedOnTextEmbeddings,
TEClassifiersBasedOnProtoNet,
TEClassifiersBasedOnRegular,
TokenizerBase
Named list containing all available types of data sets as a string.
DataSetsIndex
An object of class list of length 3.
Other Parameter Dictionary:
BaseModelsIndex,
TokenizerIndex,
doc_formula(),
get_TEClassifiers_class_names(),
get_called_args(),
get_depr_obj_names(),
get_magnitude_values(),
get_param_def(),
get_param_dict(),
get_param_doc_desc()
Object of class R6 which stores the text embeddings generated by an object of class
TextEmbeddingModel. The text embeddings are stored within memory/RAM. In the case of a high number of documents
the data may not fit into memory/RAM. Thus, please use this object only for a small sample of texts. In general, it
is recommended to use an object of class LargeDataSetForTextEmbeddings which can deal with any number of texts.
Returns an object of class EmbeddedText. These objects are used for storing and managing the text embeddings created with objects of class TextEmbeddingModel. Objects of class EmbeddedText serve as input for objects of class TEClassifierRegular, TEClassifierProtoNet, and TEFeatureExtractor. The main aim of this class is to provide a structured link between embedding models and classifiers. Since objects of this class save information on the text embedding model that created the text embeddings, they ensure that only embeddings generated with the same embedding model are combined. Furthermore, the stored information allows objects to check whether embeddings of the correct text embedding model are used for training and predicting.
embeddings('data.frame()')
data.frame containing the text embeddings for all chunks. Documents are in the rows. Embedding dimensions are
in the columns.
configure()
Creates a new object representing text embeddings.
EmbeddedText$configure( embeddings, model_name = NA, model_label = NA, model_date = NA, model_method = NA, model_version = NA, model_language = NA, param_seq_length = NA, param_chunks = NULL, param_features = NULL, param_overlap = NULL, param_emb_layer_min = NULL, param_emb_layer_max = NULL, param_emb_pool_type = NULL, param_aggregation = NULL, param_pad_value = -100L )
embeddings: data.frame containing the text embeddings.
model_name: string Name of the model that generates this embedding.
model_label: string Label of the model that generates this embedding.
model_date: string Date when the embedding generating model was created.
model_method: string Method of the underlying embedding model.
model_version: string Version of the model that generated this embedding.
model_language: string Language of the model that generated this embedding.
param_seq_length: int Maximum number of tokens that the generating model processes per chunk.
param_chunks: int Maximum number of chunks which are supported by the generating model.
param_features: int Number of dimensions of the text embeddings.
param_overlap: int Number of tokens that were added at the beginning of the sequence for the next chunk
by this model.
param_emb_layer_min: int or string determining the first layer to be included in the creation of
embeddings.
param_emb_layer_max: int or string determining the last layer to be included in the creation of
embeddings.
param_emb_pool_type: string determining the method for pooling the token embeddings within each layer.
param_aggregation: string Aggregation method of the hidden states. Deprecated. Only included for backward
compatibility.
param_pad_value: int Value indicating padding. This value should not be in the range of
regular values for computations. Thus it is not recommended to change this value.
Default is -100. Allowed values:
Returns an object of class EmbeddedText which stores the text embeddings produced by an object of class TextEmbeddingModel.
save()
Saves a data set to disk.
EmbeddedText$save(dir_path, folder_name, create_dir = TRUE)
dir_path: Path where to store the data set.
folder_name: string Name of the folder for storing the data set.
create_dir: bool If TRUE the directory will be created if it does not exist.
Method does not return anything. It writes the data set to disk.
is_configured()
Method for checking if the model was successfully configured. An object can only be used if this
value is TRUE.
EmbeddedText$is_configured()
bool TRUE if the model is fully configured. FALSE if not.
load_from_disk()
Loads an object of class EmbeddedText from disk and updates the object to the current version of the package.
EmbeddedText$load_from_disk(dir_path)
dir_path: Path where the data set is stored.
Method does not return anything. It loads an object from disk.
get_model_info()
Method for retrieving information about the model that generated this embedding.
EmbeddedText$get_model_info()
list contains all saved information about the underlying text embedding model.
get_model_label()
Method for retrieving the label of the model that generated this embedding.
EmbeddedText$get_model_label()
string Label of the corresponding text embedding model.
get_times()
Number of chunks/times of the text embeddings.
EmbeddedText$get_times()
Returns an int describing the number of chunks/times of the text embeddings.
get_features()
Number of actual features/dimensions of the text embeddings. If a
feature extractor was used, the number of features is smaller than the original number of
features. To receive the original number of features (the number of features before applying a
feature extractor), use the method get_original_features of this class.
EmbeddedText$get_features()
Returns an int describing the number of features/dimensions of the text embeddings.
get_original_features()
Number of original features/dimensions of the text embeddings.
EmbeddedText$get_original_features()
Returns an int describing the number of features/dimensions if no
feature extractor is used or before a feature extractor is
applied.
get_pad_value()
Value for indicating padding.
EmbeddedText$get_pad_value()
Returns an int describing the value used for padding.
is_compressed()
Checks if the text embeddings were reduced by a feature extractor.
EmbeddedText$is_compressed()
Returns TRUE if the number of dimensions was reduced by a feature extractor. Otherwise
returns FALSE.
add_feature_extractor_info()
Method for setting information on the feature extractor that was used to reduce the number of dimensions of the text embeddings. This method should only be used if a feature extractor was applied.
EmbeddedText$add_feature_extractor_info( model_name, model_label = NA, features = NA, method = NA, noise_factor = NA, optimizer = NA )
model_name: string Name of the underlying TextEmbeddingModel.
model_label: string Label of the underlying TextEmbeddingModel.
features: int Number of dimensions (features) for the compressed text embeddings.
method: string Method that the TEFeatureExtractor applies for generating the compressed text
embeddings.
noise_factor: double Noise factor of the TEFeatureExtractor.
optimizer: string Optimizer used during training the TEFeatureExtractor.
Method does not return anything. It sets information on a feature extractor.
get_feature_extractor_info()
Method for receiving information on the feature extractor that was used to reduce the number of dimensions of the text embeddings.
EmbeddedText$get_feature_extractor_info()
Returns a list with information on the feature extractor. If no
feature extractor was used it returns NULL.
convert_to_LargeDataSetForTextEmbeddings()
Method for converting this object to an object of class LargeDataSetForTextEmbeddings.
EmbeddedText$convert_to_LargeDataSetForTextEmbeddings()
Returns an object of class LargeDataSetForTextEmbeddings which uses memory mapping allowing to work with large data sets.
n_rows()
Number of rows.
EmbeddedText$n_rows()
Returns the number of rows of the text embeddings which represent the number of cases.
get_all_fields()
Return all fields.
EmbeddedText$get_all_fields()
Method returns a list containing all public and private fields
of the object.
set_package_versions()
Method for setting the package version for 'aifeducation', 'reticulate', 'torch', and 'numpy' to the currently used versions.
EmbeddedText$set_package_versions()
Method does not return anything. It is used to set the private fields for package versions.
get_package_versions()
Method for requesting a summary of the R and python packages' versions used for creating the model.
EmbeddedText$get_package_versions()
Returns a list containing the versions of the relevant
R and python packages.
clone()
The objects of this class are cloneable with this method.
EmbeddedText$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other Data Management:
LargeDataSetForText,
LargeDataSetForTextEmbeddings
Function extracts the content of a column from a python data set in order to allow further operations in R.
extract_column_from_py_dataset(py_dataset, column_name, format = "R")
py_dataset |
|
column_name |
|
format |
|
Returns a vector, matrix or array for format="R". In all other
cases the requested format is returned.
Other Utils Python Data Management Developers:
class_vector_to_py_dataset(),
create_py_dataset_cache_file_path(),
data.frame_to_py_dataset(),
get_batches_index(),
prepare_r_array_for_dataset(),
py_dataset_to_embeddings(),
reduce_to_unique(),
tensor_list_to_numpy(),
tensor_to_numpy()
This function calculates Fleiss' Kappa.
fleiss_kappa(rater_one, rater_two, additional_raters = NULL)
rater_one |
|
rater_two |
|
additional_raters |
|
Returns the value for Fleiss' Kappa.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382. https://doi.org/10.1037/h0031619
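The calculation described by Fleiss (1971) can be sketched in Python as follows. The sketch follows the standard formula (mean observed agreement per case corrected by the agreement expected from the overall category proportions); it is not the package's R implementation, and the variadic signature and the NaN return for the degenerate case are assumptions made for illustration.

```python
from collections import Counter


def fleiss_kappa(*raters):
    """Fleiss' kappa for two or more raters who rated the same cases.

    Each argument is a sequence of category labels, one label per case.
    """
    n_raters = len(raters)
    if n_raters < 2:
        raise ValueError("at least two raters are required")
    n_cases = len(raters[0])
    if any(len(r) != n_cases for r in raters):
        raise ValueError("all raters must rate the same number of cases")
    # counts[i][c]: number of raters assigning category c to case i.
    counts = [Counter(r[i] for r in raters) for i in range(n_cases)]
    # Observed agreement per case.
    p_i = [
        (sum(v * v for v in c.values()) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_i) / n_cases
    # Expected agreement from the overall category proportions.
    totals = Counter()
    for c in counts:
        totals.update(c)
    p_e = sum((t / (n_cases * n_raters)) ** 2 for t in totals.values())
    if p_e == 1:
        # Kappa is undefined if all ratings fall into a single category.
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)
```

Two raters with identical codings yield a kappa of 1, while agreement at chance level yields a value of 0.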
Other performance measures:
calc_standard_classification_measures(),
cohens_kappa(),
get_coder_metrics(),
gwet_ac(),
kendalls_w(),
kripp_alpha()
Function generates a specific number of combinations for a method. These are used for automating tests of objects.
generate_args_for_tests( object_name, method, var_objects = list(), necessary_objects = list(), var_override = list() )
object_name |
|
method |
|
var_objects |
|
necessary_objects |
|
var_override |
Named |
Returns a list with combinations of arguments.
For var_objects, necessary_objects, and var_override the names must exactly match
the name of the parameter. Otherwise they are not applied. Names of arguments which are not part
of a method are ignored.
Other Utils TestThat Developers:
check_adjust_n_samples_on_CI(),
generate_embeddings(),
generate_tensors(),
get_current_args_for_print(),
get_fixed_test_tensor(),
get_test_data_for_classifiers(),
monitor_test_time_on_CI(),
random_bool_on_CI()
Function generates a random test embedding that can be used for testing methods and functions. The embeddings have the shape (Batch, Times, Features).
generate_embeddings(times, features, seq_len, pad_value)
times |
|
features |
|
seq_len |
Numeric |
pad_value |
|
Returns an array with dim (length(seq_len),times,features).
To generate a 'PyTorch' object please use generate_tensors.
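The shape of such a padded test embedding can be illustrated with a small Python sketch on nested lists. It assumes that each entry of seq_len gives the number of non-padded time steps of a case and that padding is appended at the end of the sequence; both points, as well as the `seed` parameter, are assumptions for illustration.

```python
import random


def generate_embeddings(times, features, seq_len, pad_value, seed=0):
    """Random nested list of shape (len(seq_len), times, features) with padded tail rows."""
    rng = random.Random(seed)
    batch = []
    for length in seq_len:
        if not 1 <= length <= times:
            raise ValueError("each sequence length must lie between 1 and times")
        # Real (non-padded) time steps filled with random values in [0, 1).
        case = [[rng.random() for _ in range(features)] for _ in range(length)]
        # Fill the remaining time steps with the padding value.
        case += [[pad_value] * features for _ in range(times - length)]
        batch.append(case)
    return batch
```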
Other Utils TestThat Developers:
check_adjust_n_samples_on_CI(),
generate_args_for_tests(),
generate_tensors(),
get_current_args_for_print(),
get_fixed_test_tensor(),
get_test_data_for_classifiers(),
monitor_test_time_on_CI(),
random_bool_on_CI()
Function for generating an ID suffix for objects of class TextEmbeddingModel, TEClassifierRegular, and TEClassifierProtoNet.
generate_id(length = 16L)
length |
|
Returns a string of the requested length.
Other Utils Developers:
auto_n_cores(),
create_object(),
create_synthetic_units_from_matrix(),
get_n_chunks(),
get_synthetic_cases_from_matrix(),
get_time_stamp(),
matrix_to_array_c(),
tensor_to_matrix_c(),
to_categorical_c()
Function generates a random test tensor that can be used for testing methods and functions based on 'PyTorch'. The tensors have the shape (Batch, Times, Features).
generate_tensors(times, features, seq_len, pad_value)
times |
|
features |
|
seq_len |
Numeric |
pad_value |
|
Returns an object of class Tensor from 'PyTorch'.
To request an R array please use generate_embeddings.
Other Utils TestThat Developers:
check_adjust_n_samples_on_CI(),
generate_args_for_tests(),
generate_embeddings(),
get_current_args_for_print(),
get_fixed_test_tensor(),
get_test_data_for_classifiers(),
monitor_test_time_on_CI(),
random_bool_on_CI()
Function for requesting a vector containing the alpha-3 codes for most countries.
get_alpha_3_codes()
Returns a vector containing the alpha-3 codes for most countries.
Other Utils Sustainability Developers:
summarize_tracked_sustainability()
Function groups cases into batches.
get_batches_index(number_rows, batch_size, zero_based = FALSE)
number_rows |
|
batch_size |
|
zero_based |
|
Returns a list of batches. Each entry in the list contains a vector of int representing the cases
belonging to that batch.
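The grouping can be sketched in Python as follows. The sketch assumes that cases are assigned to batches in consecutive order and that the last batch holds the remaining cases if number_rows is not a multiple of batch_size; this behavior is an assumption and not taken from the package source.

```python
def get_batches_index(number_rows, batch_size, zero_based=False):
    """Split row indexes into consecutive batches; the last batch may be smaller."""
    # R counts from 1, Python from 0; the offset switches between both conventions.
    offset = 0 if zero_based else 1
    return [
        list(range(start + offset, min(start + batch_size, number_rows) + offset))
        for start in range(0, number_rows, batch_size)
    ]
```

With zero_based set to TRUE the indexes can be passed directly to Python objects, which start counting at 0.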
Other Utils Python Data Management Developers:
class_vector_to_py_dataset(),
create_py_dataset_cache_file_path(),
data.frame_to_py_dataset(),
extract_column_from_py_dataset(),
prepare_r_array_for_dataset(),
py_dataset_to_embeddings(),
reduce_to_unique(),
tensor_list_to_numpy(),
tensor_to_numpy()
Function for receiving all arguments that were called by a method or function.
get_called_args(n = 1L)
n |
|
Returns a named list of all arguments and their values.
Other Parameter Dictionary:
BaseModelsIndex,
DataSetsIndex,
TokenizerIndex,
doc_formula(),
get_TEClassifiers_class_names(),
get_depr_obj_names(),
get_magnitude_values(),
get_param_def(),
get_param_dict(),
get_param_doc_desc()
This function calculates different reliability measures which are based on the empirical research method of content analysis.
get_coder_metrics( true_values = NULL, predicted_values = NULL, return_names_only = FALSE )
true_values |
|
predicted_values |
|
return_names_only |
|
If return_names_only = FALSE returns a vector with the following reliability measures:
iota_index: Iota Index from the Iota Reliability Concept Version 2.
min_iota2: Minimal Iota from Iota Reliability Concept Version 2.
avg_iota2: Average Iota from Iota Reliability Concept Version 2.
max_iota2: Maximum Iota from Iota Reliability Concept Version 2.
min_alpha: Minimal Alpha Reliability from Iota Reliability Concept Version 2.
avg_alpha: Average Alpha Reliability from Iota Reliability Concept Version 2.
max_alpha: Maximum Alpha Reliability from Iota Reliability Concept Version 2.
static_iota_index: Static Iota Index from Iota Reliability Concept Version 2.
dynamic_iota_index: Dynamic Iota Index from Iota Reliability Concept Version 2.
kalpha_nominal: Krippendorff's Alpha for nominal variables.
kalpha_ordinal: Krippendorff's Alpha for ordinal variables.
kendall: Kendall's coefficient of concordance W with correction for ties.
c_kappa_unweighted: Cohen's Kappa unweighted.
c_kappa_linear: Weighted Cohen's Kappa with linear increasing weights.
c_kappa_squared: Weighted Cohen's Kappa with quadratic increasing weights.
kappa_fleiss: Fleiss' Kappa for multiple raters without exact estimation.
percentage_agreement: Percentage Agreement.
balanced_accuracy: Average accuracy within each class.
gwet_ac1_nominal: Gwet's Agreement Coefficient 1 (AC1) for nominal data which is unweighted.
gwet_ac2_linear: Gwet's Agreement Coefficient 2 (AC2) for ordinal data with linear weights.
gwet_ac2_quadratic: Gwet's Agreement Coefficient 2 (AC2) for ordinal data with quadratic weights.
If return_names_only = TRUE returns only the names of the vector elements.
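Two of the simpler measures from this list can be sketched in Python to show how they differ for imbalanced data. The sketch interprets "accuracy within each class" as the share of correctly predicted cases among all cases that truly belong to that class; this interpretation and the function names are assumptions, not the package's implementation.

```python
def percentage_agreement(true_values, predicted_values):
    """Share of cases where both codings are identical."""
    hits = sum(t == p for t, p in zip(true_values, predicted_values))
    return hits / len(true_values)


def balanced_accuracy(true_values, predicted_values):
    """Mean of the per-class accuracies, computed over the classes in true_values."""
    per_class = []
    for cls in sorted(set(true_values)):
        # Predictions for all cases that truly belong to this class.
        preds = [p for t, p in zip(true_values, predicted_values) if t == cls]
        per_class.append(sum(p == cls for p in preds) / len(preds))
    return sum(per_class) / len(per_class)
```

If four of five cases belong to one class and a classifier always predicts that class, the percentage agreement is 0.8 while the balanced accuracy is only 0.5.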
Other performance measures:
calc_standard_classification_measures(),
cohens_kappa(),
fleiss_kappa(),
gwet_ac(),
kendalls_w(),
kripp_alpha()
Function prints the used arguments. The aim of this function is to print the arguments to the console that resulted in a failed test.
get_current_args_for_print(arg_list)
arg_list |
Named |
Function does not return anything.
Other Utils TestThat Developers:
check_adjust_n_samples_on_CI(),
generate_args_for_tests(),
generate_embeddings(),
generate_tensors(),
get_fixed_test_tensor(),
get_test_data_for_classifiers(),
monitor_test_time_on_CI(),
random_bool_on_CI()
Function returns the names of all objects that are deprecated.
get_depr_obj_names()
Returns a vector containing the names.
Other Parameter Dictionary:
BaseModelsIndex,
DataSetsIndex,
TokenizerIndex,
doc_formula(),
get_TEClassifiers_class_names(),
get_called_args(),
get_magnitude_values(),
get_param_def(),
get_param_dict(),
get_param_doc_desc()
Function for generating the documentation of a specific core model.
get_desc_for_core_model_architecture( name, title_format = "bold", inc_img = FALSE )
name |
|
title_format |
|
inc_img |
|
Returns a string containing the description written in rmarkdown.
Other Utils Documentation:
build_documentation_for_model(),
build_layer_stack_documentation_for_vignette(),
get_dict_cls_type(),
get_dict_core_models(),
get_dict_input_types(),
get_layer_dict(),
get_layer_documentation(),
get_parameter_documentation()
Function for requesting the file extension.
get_file_extension(file_path)
file_path |
|
Returns the extension of a file as a string.
Other Utils File Management Developers:
create_dir()
Function generates a static test tensor which is always the same.
get_fixed_test_tensor(pad_value)
pad_value |
|
Returns an object of class Tensor which is always the same except padding.
Shape (5,3,7).
Other Utils TestThat Developers:
check_adjust_n_samples_on_CI(),
generate_args_for_tests(),
generate_embeddings(),
generate_tensors(),
get_current_args_for_print(),
get_test_data_for_classifiers(),
monitor_test_time_on_CI(),
random_bool_on_CI()
Function for generating the documentation of a specific layer.
get_layer_documentation( layer_name, title_format = "bold", subtitle_format = "italic", inc_img = FALSE, inc_params = FALSE, inc_references = FALSE )
layer_name |
|
title_format |
|
subtitle_format |
|
inc_img |
|
inc_params |
|
inc_references |
|
Returns a string containing the description written in rmarkdown.
Other Utils Documentation:
build_documentation_for_model(),
build_layer_stack_documentation_for_vignette(),
get_desc_for_core_model_architecture(),
get_dict_cls_type(),
get_dict_core_models(),
get_dict_input_types(),
get_layer_dict(),
get_parameter_documentation()
Function calculates different magnitudes for a numeric argument.
get_magnitude_values(magnitude, n_elements = 9L, max = NULL, min = NULL)
magnitude |
|
n_elements |
|
max |
|
min |
|
Returns a numeric vector with the generated values.
The values are calculated with the following formula:
max * magnitude^i for i=1,...,n_elements.
Only values equal to or greater than min are returned.
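The formula can be written out in Python as a minimal sketch. The argument names max_value and min_value are hypothetical stand-ins for max and min, and how the R function treats max = NULL is not covered here; the sketch simply requires a numeric maximum.

```python
def get_magnitude_values(magnitude, n_elements=9, max_value=1.0, min_value=None):
    """Compute max_value * magnitude**i for i = 1, ..., n_elements."""
    values = [max_value * magnitude ** i for i in range(1, n_elements + 1)]
    if min_value is not None:
        # Keep only values equal to or greater than the requested minimum.
        values = [v for v in values if v >= min_value]
    return values
```

For example, a magnitude of 0.5 with a maximum of 8, four elements, and a minimum of 1 produces 4, 2, and 1; the fourth value 0.5 falls below the minimum and is dropped.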
Other Parameter Dictionary:
BaseModelsIndex,
DataSetsIndex,
TokenizerIndex,
doc_formula(),
get_TEClassifiers_class_names(),
get_called_args(),
get_depr_obj_names(),
get_param_def(),
get_param_dict(),
get_param_doc_desc()
Function for calculating the number of chunks/sequences for every case.
get_n_chunks(text_embeddings, features, times, pad_value = -100L)
text_embeddings |
|
features |
|
times |
|
pad_value |
|
Named vector of integers representing the number of chunks/sequences for every case.
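The counting can be sketched in Python as follows. The sketch assumes that a chunk counts as padding only if all of its features equal pad_value, and it represents the named input as a dictionary of nested lists, so the features and times arguments are not needed here. Both choices are assumptions for illustration.

```python
def get_n_chunks(text_embeddings, pad_value=-100):
    """Count, for every named case, the time steps that are not entirely padding."""
    return {
        name: sum(1 for row in case if any(v != pad_value for v in row))
        for name, case in text_embeddings.items()
    }
```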
Other Utils Developers:
auto_n_cores(),
create_object(),
create_synthetic_units_from_matrix(),
generate_id(),
get_synthetic_cases_from_matrix(),
get_time_stamp(),
matrix_to_array_c(),
tensor_to_matrix_c(),
to_categorical_c()
Function returns the definition of an argument. Please note that definitions can only be requested for arguments which are used for transformers or classifier models.
get_param_def(param_name)
param_name |
|
Returns a list with the definition of the argument. See get_param_dict
for more details.
Other Parameter Dictionary:
BaseModelsIndex,
DataSetsIndex,
TokenizerIndex,
doc_formula(),
get_TEClassifiers_class_names(),
get_called_args(),
get_depr_obj_names(),
get_magnitude_values(),
get_param_dict(),
get_param_doc_desc()
Function provides a list containing important characteristics
of the parameters used in the models. The list contains only the definitions of
arguments for transformer models and all classifiers. The arguments of other functions
in this package are documented separately.
The aim of this list is to automate argument checking and widget generation for AI for Education - Studio.
get_param_dict()
Returns a named list. The names correspond to specific arguments.
The list contains a list for every argument with the following components:
type: The type of allowed values.
allow_null: A bool indicating if the argument can be set to NULL.
min: The minimal value the argument can be. Set to NULL if not relevant. Set to -Inf if there is no minimum.
max: The maximal value the argument can be. Set to NULL if not relevant. Set to Inf if there is no maximum.
desc: A string which includes the description of the argument written in markdown. This string is used for documenting the parameter.
values_desc: A named list containing a description of every possible value. The names must exactly match the strings in allowed_values. Descriptions should be written in markdown.
allowed_values: vector of allowed values. This is only relevant if the argument is not numeric. During the checking of the arguments
it is checked if the provided values can be found in this vector. If all values are allowed set to NULL.
default_value: The default value of the argument. If there is no default set to NULL.
default_historic: Historic default value. This can be necessary for backward compatibility.
gui_box: string Name of the box in AI for Education - Studio where the argument appears. If it should not appear set to NULL.
gui_label: string Label of the controlling widget in AI for Education - Studio.
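How such a dictionary can drive automated argument checking is sketched below in Python. The two entries are hypothetical and only modelled on the components listed above (the minimum of 2 for folds, for instance, is an assumption), and check_argument is an illustrative helper, not a function of the package.

```python
import math

# Hypothetical entries, modelled on the components listed above.
param_dict = {
    "folds": {
        "type": "int",
        "allow_null": False,
        "min": 2,  # assumed lower bound, for illustration only
        "max": math.inf,
        "allowed_values": None,
        "default_value": 5,
    },
    "sc_methods": {
        "type": "string",
        "allow_null": False,
        "min": None,
        "max": None,
        "allowed_values": ["knnor"],
        "default_value": "knnor",
    },
}


def check_argument(name, value, definitions=param_dict):
    """Return True if value satisfies the definition stored under name."""
    spec = definitions[name]
    if value is None:
        return spec["allow_null"]
    # Non-numeric arguments are checked against the vector of allowed values.
    if spec["allowed_values"] is not None:
        return value in spec["allowed_values"]
    # Numeric arguments are checked against the stored range.
    if spec["min"] is not None and value < spec["min"]:
        return False
    if spec["max"] is not None and value > spec["max"]:
        return False
    return True
```

Keeping the definitions in one place means that the same entry can feed both the argument check and the widget shown in the graphical user interface.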
Other Parameter Dictionary:
BaseModelsIndex,
DataSetsIndex,
TokenizerIndex,
doc_formula(),
get_TEClassifiers_class_names(),
get_called_args(),
get_depr_obj_names(),
get_magnitude_values(),
get_param_def(),
get_param_doc_desc()
Function provides the description of an argument in markdown. It is intended for documenting the parameters of functions.
get_param_doc_desc(param_name)
param_name |
|
Returns a string which contains the description of the argument in markdown. The concrete format depends on the type of the argument.
Other Parameter Dictionary:
BaseModelsIndex,
DataSetsIndex,
TokenizerIndex,
doc_formula(),
get_TEClassifiers_class_names(),
get_called_args(),
get_depr_obj_names(),
get_magnitude_values(),
get_param_def(),
get_param_dict()
Function for generating the documentation of a specific layer.
get_parameter_documentation( param_name, param_dict, as_list = TRUE, inc_param_name = TRUE )
param_name |
|
param_dict |
|
as_list |
|
inc_param_name |
|
Returns a string containing the description written in rmarkdown.
Other Utils Documentation:
build_documentation_for_model(),
build_layer_stack_documentation_for_vignette(),
get_desc_for_core_model_architecture(),
get_dict_cls_type(),
get_dict_core_models(),
get_dict_input_types(),
get_layer_dict(),
get_layer_documentation()
Function for requesting the version of a specific python package.
get_py_package_version(package_name)
package_name |
|
Returns the version as string or NA if the package does not exist
or no version is available.
Other Utils Python Developers:
get_py_package_versions(),
load_all_py_scripts(),
load_py_scripts(),
run_py_file()
Function for requesting a summary of the versions of all critical python components.
get_py_package_versions()
Returns a list that contains the version number of python and
the versions of critical python packages. If a package is not available
its version is set to NA.
Other Utils Python Developers:
get_py_package_version(),
load_all_py_scripts(),
load_py_scripts(),
run_py_file()
Returns the minimum and maximum versions of the core python packages used in aifeducation. It is recommended to use packages of these versions. Packages of other versions can result in errors or unexpected results.
get_recommended_py_versions(package_name = NULL)
package_name |
|
Returns a data.frame with the packages in the columns and the minimum,
maximum, and recommended version in the rows. If a specific package name is passed, the function returns a
string with a leading '<='.
Other Installation and Configuration:
check_aif_py_modules(),
install_aifeducation(),
install_aifeducation_studio(),
install_py_modules(),
prepare_session(),
set_transformers_logger(),
update_aifeducation()
This function creates synthetic cases for balancing the training with classifier models.
get_synthetic_cases_from_matrix( matrix_form, times, features, target, sequence_length, method = "knnor", min_k = 1L, max_k = 6L, pad_value = -100L )
matrix_form |
Named |
times |
|
features |
|
target |
Named |
sequence_length |
|
method |
|
min_k |
|
max_k |
|
pad_value |
|
list with the following components:
syntetic_embeddings: Named data.frame containing the text embeddings of the synthetic cases.
syntetic_targets: Named factor containing the labels of the corresponding synthetic cases.
n_syntetic_units: table showing the number of synthetic cases for every label/category.
Other Utils Developers:
auto_n_cores(),
create_object(),
create_synthetic_units_from_matrix(),
generate_id(),
get_n_chunks(),
get_time_stamp(),
matrix_to_array_c(),
tensor_to_matrix_c(),
to_categorical_c()
Function returns the names of all classifiers which are child classes of a specific super class.
get_TEClassifiers_class_names(super_class = NULL)
super_class |
|
Returns a vector containing the names of the classifiers.
Other Parameter Dictionary:
BaseModelsIndex,
DataSetsIndex,
TokenizerIndex,
doc_formula(),
get_called_args(),
get_depr_obj_names(),
get_magnitude_values(),
get_param_def(),
get_param_dict(),
get_param_doc_desc()
Function returns example data for testing the package.
get_test_data_for_classifiers(class_range = c(2L, 3L), path_test_embeddings)
class_range |
|
path_test_embeddings |
|
Returns a list with test data.
Other Utils TestThat Developers:
check_adjust_n_samples_on_CI(),
generate_args_for_tests(),
generate_embeddings(),
generate_tensors(),
get_current_args_for_print(),
get_fixed_test_tensor(),
monitor_test_time_on_CI(),
random_bool_on_CI()
Function returns the time on the machine at the moment of calling.
get_time_stamp()
Returns a string with date and time in format "%y-%m-%d %H:%M:%S".
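The format string above can be tried out with base R alone. This is a minimal sketch that does not call get_time_stamp() itself; it only shows what a string in the format "%y-%m-%d %H:%M:%S" looks like.

```r
# Format the current system time with the format string named above:
# two-digit year, month, day, then hours, minutes, and seconds.
time_stamp <- format(Sys.time(), "%y-%m-%d %H:%M:%S")

# The result always has 17 characters, e.g. "yy-mm-dd HH:MM:SS".
n_characters <- nchar(time_stamp)
```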
Other Utils Developers:
auto_n_cores(),
create_object(),
create_synthetic_units_from_matrix(),
generate_id(),
get_n_chunks(),
get_synthetic_cases_from_matrix(),
matrix_to_array_c(),
tensor_to_matrix_c(),
to_categorical_c()
This function calculates Gwet's Agreement Coefficients.
gwet_ac(rater_one, rater_two, additional_raters = NULL)
rater_one |
|
rater_two |
|
additional_raters |
|
Returns a list with the following entries:
ac1: Gwet's Agreement Coefficient 1 (AC1) for nominal data which is unweighted.
ac2_linear: Gwet's Agreement Coefficient 2 (AC2) for ordinal data with linear weights.
ac2_quadratic: Gwet's Agreement Coefficient 2 (AC2) for ordinal data with quadratic weights.
Weights are calculated as described in Gwet (2021).
Missing values are supported.
Gwet, K. L. (2021). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (Fifth edition, volume 1). AgreeStat Analytics.
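To make the unweighted coefficient more concrete, the following base R sketch computes AC1 for the simplest case of two raters and complete data. The helper ac1_two_raters() is hypothetical and is not the implementation used by gwet_ac(); it handles neither missing values nor additional raters and assumes at least two categories.

```r
# Hypothetical helper: unweighted AC1 for two raters without missing values.
ac1_two_raters <- function(rater_one, rater_two) {
  categories <- sort(unique(c(as.character(rater_one), as.character(rater_two))))
  q <- length(categories)
  r1 <- factor(rater_one, levels = categories)
  r2 <- factor(rater_two, levels = categories)

  # Observed agreement: share of cases with identical ratings.
  p_observed <- mean(r1 == r2)

  # Average marginal proportion of every category over both raters.
  pi_k <- (as.numeric(table(r1)) + as.numeric(table(r2))) / (2 * length(r1))

  # Chance agreement as defined for AC1.
  p_chance <- sum(pi_k * (1 - pi_k)) / (q - 1)

  (p_observed - p_chance) / (1 - p_chance)
}

rater_one <- c("a", "a", "b", "b", "a", "b")
rater_two <- c("a", "b", "b", "b", "a", "a")
ac1 <- ac1_two_raters(rater_one, rater_two)
```

In this example the raters agree on 4 of 6 cases and both categories have an average marginal proportion of 0.5, so the chance agreement is 0.5 and AC1 is (4/6 - 0.5) / 0.5 = 1/3.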
Other performance measures:
calc_standard_classification_measures(),
cohens_kappa(),
fleiss_kappa(),
get_coder_metrics(),
kendalls_w(),
kripp_alpha()
Abstract class for all tokenizers used with the 'transformers' library.
Returns a new object of this class.
aifeducation::AIFEMaster -> aifeducation::TokenizerBase -> HuggingFaceTokenizer
aifeducation::AIFEMaster$get_all_fields(),
aifeducation::AIFEMaster$get_documentation_license(),
aifeducation::AIFEMaster$get_ml_framework(),
aifeducation::AIFEMaster$get_model_config(),
aifeducation::AIFEMaster$get_model_description(),
aifeducation::AIFEMaster$get_model_info(),
aifeducation::AIFEMaster$get_model_license(),
aifeducation::AIFEMaster$get_package_versions(),
aifeducation::AIFEMaster$get_private(),
aifeducation::AIFEMaster$get_publication_info(),
aifeducation::AIFEMaster$get_sustainability_data(),
aifeducation::AIFEMaster$is_configured(),
aifeducation::AIFEMaster$is_trained(),
aifeducation::AIFEMaster$set_documentation_license(),
aifeducation::AIFEMaster$set_model_description(),
aifeducation::AIFEMaster$set_model_license(),
aifeducation::AIFEMaster$set_publication_info(),
aifeducation::TokenizerBase$calculate_statistics(),
aifeducation::TokenizerBase$decode(),
aifeducation::TokenizerBase$encode(),
aifeducation::TokenizerBase$get_special_tokens(),
aifeducation::TokenizerBase$get_tokenizer(),
aifeducation::TokenizerBase$get_tokenizer_statistics(),
aifeducation::TokenizerBase$load_from_disk(),
aifeducation::TokenizerBase$n_special_tokens(),
aifeducation::TokenizerBase$save()
create_from_hf()
Creates a tokenizer from a pretrained model
HuggingFaceTokenizer$create_from_hf(model_dir)
model_dir: Path where the model is stored.
Returns a new object of this class.
clone()
The objects of this class are cloneable with this method.
HuggingFaceTokenizer$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other Tokenizer:
WordPieceTokenizer
Function reporting the number of files and the cumulative size of the files in the temporary directory.
inspect_tmp_dir()
Returns a list containing a vector with the paths of all files in
the temporary directory and the cumulative file size in bytes.
Function for installing 'aifeducation' on a machine.
Using a virtual environment (use_conda=FALSE)
If 'python' is already installed, the installed version is used. If the required
version of 'python' differs from the existing version, the required version is
installed. In all other cases 'python' is installed on the system.
Using a conda environment (use_conda=TRUE)
If 'miniconda' already exists on the machine, it is not installed again.
In this case the system checks for updates and updates 'miniconda' to
the newest version. If 'miniconda' is not found on the system, it is installed.
install_aifeducation( install_aifeducation_studio = TRUE, python_version = "3.12", cuda_version = "13.0", use_conda = FALSE )
install_aifeducation_studio |
|
python_version |
|
cuda_version |
|
use_conda |
|
The function does not return anything. It installs python, optional R packages, and the necessary 'python' packages on a machine.
On macOS, torch is installed without support for cuda.
Other Installation and Configuration:
check_aif_py_modules(),
get_recommended_py_versions(),
install_aifeducation_studio(),
install_py_modules(),
prepare_session(),
set_transformers_logger(),
update_aifeducation()
Function installs/updates all relevant R packages necessary to run the shiny app 'AI for Education - Studio'.
install_aifeducation_studio()
The function does not return anything. It installs/updates R packages.
Other Installation and Configuration:
check_aif_py_modules(),
get_recommended_py_versions(),
install_aifeducation(),
install_py_modules(),
prepare_session(),
set_transformers_logger(),
update_aifeducation()
Function for installing the necessary python modules.
install_py_modules( envname = "aifeducation", transformer_version = get_recommended_py_versions("transformers"), tokenizers_version = get_recommended_py_versions("tokenizers"), pandas_version = get_recommended_py_versions("pandas"), datasets_version = get_recommended_py_versions("datasets"), codecarbon_version = get_recommended_py_versions("codecarbon"), safetensors_version = get_recommended_py_versions("safetensors"), torcheval_version = get_recommended_py_versions("torcheval"), accelerate_version = get_recommended_py_versions("accelerate"), calflops_version = get_recommended_py_versions("calflops"), pytorch_cuda_version = "13.0", python_version = "3.12", remove_first = FALSE, use_conda = FALSE )
envname |
|
transformer_version |
|
tokenizers_version |
|
pandas_version |
|
datasets_version |
|
codecarbon_version |
|
safetensors_version |
|
torcheval_version |
|
accelerate_version |
|
calflops_version |
|
pytorch_cuda_version |
|
python_version |
|
remove_first |
|
use_conda |
|
Returns no values or objects. Function is used for installing the necessary python libraries in a conda environment.
The function tries to identify the type of operating system. If macOS is detected, 'PyTorch' is installed without support for cuda.
Supported versions of the packages can be requested with get_recommended_py_versions().
Other Installation and Configuration:
check_aif_py_modules(),
get_recommended_py_versions(),
install_aifeducation(),
install_aifeducation_studio(),
prepare_session(),
set_transformers_logger(),
update_aifeducation()
This function calculates Kendall's coefficient of concordance w with and without correction.
kendalls_w(rater_one, rater_two, additional_raters = NULL)
rater_one |
|
rater_two |
|
additional_raters |
|
Returns a list containing the results for Kendall's coefficient of concordance w
with and without correction.
Other performance measures:
calc_standard_classification_measures(),
cohens_kappa(),
fleiss_kappa(),
get_coder_metrics(),
gwet_ac(),
kripp_alpha()
K-Nearest Neighbor OveRsampling approach (KNNOR)
knnor(dataset, k, aug_num, cycles_number_limit = 100L)
dataset |
|
k |
|
aug_num |
|
cycles_number_limit |
|
Returns artificial points (2-D array (matrix) of size aug_num x times*features).
Islam, A., Belhaouari, S. B., Rehman, A. U. & Bensmail, H. (2022). KNNOR: An oversampling technique for imbalanced datasets. Applied Soft Computing, 115, 108288. https://doi.org/10.1016/j.asoc.2021.108288
Function written in C++ for validating a new point (KNNOR-Validation)
knnor_is_same_class(new_point, dataset, labels, k)
new_point |
|
dataset |
|
labels |
|
k |
|
Returns TRUE if a new point can be added, otherwise FALSE.
This function calculates Krippendorff's Alpha for nominal and ordinal variables.
kripp_alpha(rater_one, rater_two, additional_raters = NULL)
rater_one |
|
rater_two |
|
additional_raters |
|
Returns a list containing the results for Krippendorff's Alpha for
nominal and ordinal data.
Missing values are supported.
Krippendorff, K. (2019). Content Analysis: An Introduction to Its Methodology (4th Ed.). SAGE
Other performance measures:
calc_standard_classification_measures(),
cohens_kappa(),
fleiss_kappa(),
get_coder_metrics(),
gwet_ac(),
kendalls_w()
This object contains public and private methods which may be useful for all large data sets. Objects of this class are not intended to be used directly.
Returns a new object of this class.
n_cols()
Number of columns in the data set.
LargeDataSetBase$n_cols()
int describing the number of columns in the data set.
n_rows()
Number of rows in the data set.
LargeDataSetBase$n_rows()
int describing the number of rows in the data set.
get_colnames()
Get names of the columns in the data set.
LargeDataSetBase$get_colnames()
vector containing the names of the columns as strings.
extract_column()
Extracts the data from a python data set.
LargeDataSetBase$extract_column(col_name, format = "R")
col_name: string Name of the column.
format: string Format of the data.
"R" returns the data as an R object.
"torch" returns the data as PyTorch tensors.
"numpy" returns the data as a numpy array.
Returns a vector, matrix or array for format="R". In
all other cases the requested format is returned.
get_dataset()
Get data set.
LargeDataSetBase$get_dataset()
Returns the data set of this object as an object of class datasets.arrow_dataset.Dataset.
reduce_to_unique_ids()
Reduces the data set to a data set containing only unique ids. In the case an id exists multiple times in the data set the first case remains in the data set. The other cases are dropped.
Attention: Calling this method will change the data set in place.
LargeDataSetBase$reduce_to_unique_ids()
Method does not return anything. It changes the data set of this object in place.
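The keep-first rule of reduce_to_unique_ids() can be mimicked on an ordinary data.frame. The sketch below uses only base R and made-up data; it illustrates the described behavior and is not the method's implementation, which works on a python data set.

```r
# Made-up example data with repeated ids.
cases <- data.frame(
  id = c("t1", "t2", "t1", "t3", "t2"),
  text = c("first t1", "first t2", "second t1", "only t3", "second t2"),
  stringsAsFactors = FALSE
)

# duplicated() flags every repeated occurrence of an id after the first one,
# so negating it keeps the first case per id and drops the others.
unique_cases <- cases[!duplicated(cases$id), , drop = FALSE]
```

After this step unique_cases contains the rows "first t1", "first t2", and "only t3".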
select()
Returns a data set which contains only the cases belonging to the specific indices.
LargeDataSetBase$select(indicies)
indicies: vector of int for selecting rows in the data set. Attention: The indices are zero-based.
Returns a data set of class datasets.arrow_dataset.Dataset with the selected rows.
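Because R counts positions from one while select() expects zero-based indices, row positions have to be shifted before they are passed on. The following base R sketch shows the conversion with made-up ids; the variable names are illustrative.

```r
# Made-up ids standing in for the rows of a data set.
ids <- c("a", "b", "c", "d")

# Positions as an R user would usually write them (one-based).
rows_r <- c(1L, 3L)

# Shift by one to obtain the zero-based indices expected by select().
indices_zero_based <- rows_r - 1L

# Cross-check in base R: shifting back selects the intended rows.
selected_ids <- ids[indices_zero_based + 1L]
```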
get_ids()
Get ids
LargeDataSetBase$get_ids()
Returns a vector containing the ids of every row as strings.
save()
Saves a data set to disk.
LargeDataSetBase$save(dir_path, folder_name, create_dir = TRUE)
dir_path: Path where to store the data set.
folder_name: string Name of the folder for storing the data set.
create_dir: bool If TRUE the directory will be created if it does not exist.
Method does not return anything. It writes the data set to disk.
load_from_disk()
Loads an object of class LargeDataSetBase from disk and updates the object to the current version of the package.
LargeDataSetBase$load_from_disk(dir_path)
dir_path: Path where the data set is stored.
Method does not return anything. It loads an object from disk.
load()
Loads a data set from disk.
LargeDataSetBase$load(dir_path)
dir_path: Path where the data set is stored.
Method does not return anything. It loads a data set from disk.
set_package_versions()
Method for setting the package version for 'aifeducation', 'reticulate', 'torch', and 'numpy' to the currently used versions.
LargeDataSetBase$set_package_versions()
Method does not return anything. It is used to set the private fields for package versions.
get_package_versions()
Method for requesting a summary of the R and python packages' versions used for creating the model.
LargeDataSetBase$get_package_versions()
Returns a list containing the versions of the relevant
R and python packages.
get_all_fields()
Return all fields.
LargeDataSetBase$get_all_fields()
Method returns a list containing all public and private fields of the object.
clone()
The objects of this class are cloneable with this method.
LargeDataSetBase$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other R6 Classes for Developers:
AIFEBaseModel,
AIFEMaster,
BaseModelCore,
ClassifiersBasedOnTextEmbeddings,
DataManagerClassifier,
ModelsBasedOnTextEmbeddings,
TEClassifiersBasedOnProtoNet,
TEClassifiersBasedOnRegular,
TokenizerBase
This object stores raw texts. The data of this objects is not stored in memory directly. By using memory mapping these objects allow to work with data sets which do not fit into memory/RAM.
Returns a new object of this class.
aifeducation::LargeDataSetBase -> LargeDataSetForText
aifeducation::LargeDataSetBase$extract_column(),
aifeducation::LargeDataSetBase$get_all_fields(),
aifeducation::LargeDataSetBase$get_colnames(),
aifeducation::LargeDataSetBase$get_dataset(),
aifeducation::LargeDataSetBase$get_ids(),
aifeducation::LargeDataSetBase$get_package_versions(),
aifeducation::LargeDataSetBase$load(),
aifeducation::LargeDataSetBase$load_from_disk(),
aifeducation::LargeDataSetBase$n_cols(),
aifeducation::LargeDataSetBase$n_rows(),
aifeducation::LargeDataSetBase$reduce_to_unique_ids(),
aifeducation::LargeDataSetBase$save(),
aifeducation::LargeDataSetBase$select(),
aifeducation::LargeDataSetBase$set_package_versions()
new()
Method for creation of LargeDataSetForText instance. It can be initialized with init_data
parameter if passed (Uses add_from_data.frame() method if init_data is data.frame).
LargeDataSetForText$new(init_data = NULL)
init_data: Initial data.frame for dataset.
A new instance of this class initialized with init_data if passed.
add_from_files_txt()
Method for adding raw texts saved within .txt files to the data set. Please note that the directory should contain one folder for each .txt file. In order to create an informative data set every folder can contain the following additional files:
bib_entry.txt: containing a text version of the bibliographic information of the raw text.
license.txt: containing a statement about the license to use the raw text such as "CC BY".
url_license.txt: containing the url/link to the license on the internet.
text_license.txt: containing the license in raw text.
url_source.txt: containing the url/link to the source on the internet.
The id of every .txt file is the file name without file extension. Please make sure to provide unique file names. Id and raw texts are mandatory, bibliographic and license information are optional.
LargeDataSetForText$add_from_files_txt( dir_path, batch_size = 500L, log_file = NULL, log_write_interval = 2L, log_top_value = 0L, log_top_total = 1L, log_top_message = NA, clean_text = TRUE, trace = TRUE )
dir_path: Path to the directory where the files are stored.
batch_size: int determining the number of files to process at once.
log_file: string Path to the file where the log should be saved. If no logging is desired set this
argument to NULL.
log_write_interval: int Time in seconds determining the interval in which the logger should try to update
the log files. Only relevant if log_file is not NULL.
log_top_value: int indicating the current iteration of the process.
log_top_total: int determining the maximal number of iterations.
log_top_message: string providing additional information of the process.
clean_text: bool If TRUE the text is modified to improve the quality of the following analysis:
Some special symbols are removed.
All spaces at the beginning and the end of a row are removed.
Multiple spaces are reduced to a single space.
All rows with a number from 1 to 999 at the beginning or at the end are removed (header and footer).
The table of contents is removed.
Hyphenation is undone.
Line breaks within a paragraph are removed.
Multiple line breaks are reduced to a single line break.
trace: bool If TRUE information on the progress is printed to the console.
The method does not return anything. It adds new raw texts to the data set.
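The folder layout expected by add_from_files_txt() can be prepared with base R. The sketch below builds one folder with a raw text and two optional files in a temporary directory. All folder names, file names, and file contents are placeholders, and the way the raw text file is told apart from the metadata files is an assumption made for this illustration.

```r
# Root directory that would be passed as dir_path (placeholder name).
root_dir <- file.path(tempdir(), "raw_texts_example")
doc_dir <- file.path(root_dir, "document_01")
dir.create(doc_dir, recursive = TRUE, showWarnings = FALSE)

# Mandatory: the raw text. The file name without extension becomes the id.
writeLines("This is the raw text of the first document.",
           file.path(doc_dir, "doc_01.txt"))

# Optional metadata files with placeholder content.
writeLines("Placeholder bibliographic entry.", file.path(doc_dir, "bib_entry.txt"))
writeLines("CC BY", file.path(doc_dir, "license.txt"))

# Assumption for this sketch: every .txt file that is not one of the known
# metadata files is treated as the raw text.
metadata_files <- c("bib_entry.txt", "license.txt", "url_license.txt",
                    "text_license.txt", "url_source.txt")
text_file <- setdiff(list.files(doc_dir, pattern = "\\.txt$"), metadata_files)
text_id <- tools::file_path_sans_ext(text_file)
```

Each further text would get its own folder below root_dir, again with a uniquely named .txt file.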
add_from_files_pdf()
Method for adding raw texts saved within .pdf files to the data set. Please note that the directory should contain one folder for each .pdf file. In order to create an informative data set every folder can contain the following additional files:
bib_entry.txt: containing a text version of the bibliographic information of the raw text.
license.txt: containing a statement about the license to use the raw text such as "CC BY".
url_license.txt: containing the url/link to the license on the internet.
text_license.txt: containing the license in raw text.
url_source.txt: containing the url/link to the source on the internet.
The id of every .pdf file is the file name without file extension. Please make sure to provide unique file names. Id and raw texts are mandatory, bibliographic and license information are optional.
LargeDataSetForText$add_from_files_pdf( dir_path, batch_size = 500L, log_file = NULL, log_write_interval = 2L, log_top_value = 0L, log_top_total = 1L, log_top_message = NA, clean_text = TRUE, trace = TRUE )
dir_path: Path to the directory where the files are stored.
batch_size: int determining the number of files to process at once.
log_file: string Path to the file where the log should be saved. If no logging is desired set this
argument to NULL.
log_write_interval: int Time in seconds determining the interval in which the logger should try to update
the log files. Only relevant if log_file is not NULL.
log_top_value: int indicating the current iteration of the process.
log_top_total: int determining the maximal number of iterations.
log_top_message: string providing additional information of the process.
clean_text: bool If TRUE the text is modified to improve the quality of the following analysis:
Some special symbols are removed.
All spaces at the beginning and the end of a row are removed.
Multiple spaces are reduced to a single space.
All rows with a number from 1 to 999 at the beginning or at the end are removed (header and footer).
The table of contents is removed.
Hyphenation is undone.
Line breaks within a paragraph are removed.
Multiple line breaks are reduced to a single line break.
trace: bool If TRUE information on the progress is printed to the console.
The method does not return anything. It adds new raw texts to the data set.
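Three of the clean_text steps listed for this method (trimming rows, collapsing multiple spaces, collapsing multiple line breaks) can be approximated with base R regular expressions. The helper clean_text_sketch() is hypothetical and covers only these three steps; it is not the cleaning routine used by the package.

```r
# Hypothetical helper covering only three of the listed cleaning steps.
clean_text_sketch <- function(text) {
  rows <- strsplit(text, "\n", fixed = TRUE)[[1]]
  # Remove all spaces at the beginning and the end of every row.
  rows <- trimws(rows)
  # Reduce multiple spaces to a single space.
  rows <- gsub(" {2,}", " ", rows)
  cleaned <- paste(rows, collapse = "\n")
  # Reduce multiple line breaks to a single line break.
  gsub("\n{2,}", "\n", cleaned)
}

raw_text <- "  Hello   world  \n\n\nSecond  line "
cleaned_text <- clean_text_sketch(raw_text)
```

For the sample input the result consists of the two rows "Hello world" and "Second line", separated by a single line break.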
add_from_files_xlsx()
Method for adding raw texts saved within .xlsx files to the data set. The method assumes that each row holds one text and that the columns store the id and the raw text. In addition, columns for the bibliographic information and the license can be added. The names of these columns must be specified with the following arguments. They must be the same for all .xlsx files in the chosen directory. Id and raw texts are mandatory; bibliographic information, license, license's url, license's text, and source's url are optional. Additional columns are dropped.
LargeDataSetForText$add_from_files_xlsx( dir_path, trace = TRUE, id_column = "id", text_column = "text", bib_entry_column = "bib_entry", license_column = "license", url_license_column = "url_license", text_license_column = "text_license", url_source_column = "url_source", log_file = NULL, log_write_interval = 2L, log_top_value = 0L, log_top_total = 1L, log_top_message = NA )
dir_path: Path to the directory where the files are stored.
trace: bool If TRUE prints information on the progress to the console.
id_column: string Name of the column storing the ids for the texts.
text_column: string Name of the column storing the raw text.
bib_entry_column: string Name of the column storing the bibliographic information of the texts.
license_column: string Name of the column storing information about the licenses.
url_license_column: string Name of the column storing information about the url to the license on the
internet.
text_license_column: string Name of the column storing the license as text.
url_source_column: string Name of the column storing information about the url to the source on the
internet.
log_file: string Path to the file where the log should be saved. If no logging is desired set this
argument to NULL.
log_write_interval: int Time in seconds determining the interval in which the logger should try to update
the log files. Only relevant if log_file is not NULL.
log_top_value: int indicating the current iteration of the process.
log_top_total: int determining the maximal number of iterations.
log_top_message: string providing additional information of the process.
The method does not return anything. It adds new raw texts to the data set.
add_from_data.frame()
Method for adding raw texts from a data.frame
LargeDataSetForText$add_from_data.frame(data_frame)
data_frame: Object of class data.frame with at least the following columns "id", "text", "bib_entry",
"license", "url_license", "text_license", and "url_source". If "id" and/or "text" is missing an error occurs.
If the other columns are not present in the data.frame they are added with empty values (NA).
Additional columns are dropped.
The method does not return anything. It adds new raw texts to the data set.
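The column rules described for add_from_data.frame() can be mirrored in base R, for example to inspect a data.frame before passing it on. The helper prepare_text_frame() is hypothetical and only reproduces the documented rules (mandatory columns, optional columns filled with NA, additional columns dropped); it is not part of the package.

```r
mandatory_columns <- c("id", "text")
optional_columns <- c("bib_entry", "license", "url_license",
                      "text_license", "url_source")

# Hypothetical helper reproducing the documented column rules.
prepare_text_frame <- function(data_frame) {
  missing_mandatory <- setdiff(mandatory_columns, colnames(data_frame))
  if (length(missing_mandatory) > 0) {
    stop("Missing mandatory column(s): ", paste(missing_mandatory, collapse = ", "))
  }
  # Add absent optional columns with empty values (NA).
  for (column in setdiff(optional_columns, colnames(data_frame))) {
    data_frame[[column]] <- rep(NA_character_, nrow(data_frame))
  }
  # Keep only the known columns; additional columns are dropped.
  data_frame[, c(mandatory_columns, optional_columns), drop = FALSE]
}

# Made-up input with one optional and one additional column.
raw_frame <- data.frame(
  id = c("t1", "t2"),
  text = c("First text.", "Second text."),
  license = c("CC BY", "CC BY"),
  comment = c("dropped", "dropped"),
  stringsAsFactors = FALSE
)
prepared_frame <- prepare_text_frame(raw_frame)
```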
get_private()
Method for requesting all private fields and methods. Used for loading and updating an object.
LargeDataSetForText$get_private()
Returns a list with all private fields and methods.
clone()
The objects of this class are cloneable with this method.
LargeDataSetForText$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other Data Management:
EmbeddedText,
LargeDataSetForTextEmbeddings
This object stores text embeddings which are usually produced by an object of class TextEmbeddingModel. The data of this objects is not stored in memory directly. By using memory mapping these objects allow to work with data sets which do not fit into memory/RAM.
LargeDataSetForTextEmbeddings are used for storing and managing the text embeddings created with objects of class TextEmbeddingModel. Objects of class LargeDataSetForTextEmbeddings serve as input for objects of class ClassifiersBasedOnTextEmbeddings and TEFeatureExtractor. The main aim of this class is to provide a structured link between embedding models and classifiers. Since objects of this class save information on the text embedding model that created the text embedding it ensures that only embeddings generated with same embedding model are combined. Furthermore, the stored information allows objects to check if embeddings of the correct text embedding model are used for training and predicting.
This class is not designed for a direct use.
Returns a new object of this class.
aifeducation::LargeDataSetBase -> LargeDataSetForTextEmbeddings
LargeDataSetForTextEmbeddings$get_text_embedding_model_name()
LargeDataSetForTextEmbeddings$add_embeddings_from_EmbeddedText()
LargeDataSetForTextEmbeddings$add_embeddings_from_LargeDataSetForTextEmbeddings()
aifeducation::LargeDataSetBase$extract_column(),
aifeducation::LargeDataSetBase$get_all_fields(),
aifeducation::LargeDataSetBase$get_colnames(),
aifeducation::LargeDataSetBase$get_dataset(),
aifeducation::LargeDataSetBase$get_ids(),
aifeducation::LargeDataSetBase$get_package_versions(),
aifeducation::LargeDataSetBase$load(),
aifeducation::LargeDataSetBase$n_cols(),
aifeducation::LargeDataSetBase$n_rows(),
aifeducation::LargeDataSetBase$reduce_to_unique_ids(),
aifeducation::LargeDataSetBase$save(),
aifeducation::LargeDataSetBase$select(),
aifeducation::LargeDataSetBase$set_package_versions()
configure()
Creates a new object representing text embeddings.
LargeDataSetForTextEmbeddings$configure( model_name = NA, model_label = NA, model_date = NA, model_method = NA, model_version = NA, model_language = NA, param_seq_length = NA, param_chunks = NULL, param_features = NULL, param_overlap = NULL, param_emb_layer_min = NULL, param_emb_layer_max = NULL, param_emb_pool_type = NULL, param_pad_value = -100L, param_aggregation = NULL )
model_name: string Name of the model that generates this embedding.
model_label: string Label of the model that generates this embedding.
model_date: string Date when the embedding generating model was created.
model_method: string Method of the underlying embedding model.
model_version: string Version of the model that generated this embedding.
model_language: string Language of the model that generated this embedding.
param_seq_length: int Maximum number of tokens that the generating model processes per chunk.
param_chunks: int Maximum number of chunks which are supported by the generating model.
param_features: int Number of dimensions of the text embeddings.
param_overlap: int Number of tokens that were added at the beginning of the sequence for the next chunk
by this model.
param_emb_layer_min: int or string determining the first layer to be included in the creation of
embeddings.
param_emb_layer_max: int or string determining the last layer to be included in the creation of
embeddings.
param_emb_pool_type: string determining the method for pooling the token embeddings within each layer.
param_pad_value: int Value indicating padding. This value should not be in the range of
regular values for computations. Thus it is not recommended to change this value.
Default is -100. Allowed values:
param_aggregation: string Aggregation method of the hidden states. Deprecated. Only included for backward
compatibility.
The method returns a new object of this class.
is_configured()
Method for checking if the model was successfully configured. An object can only be used if this
value is TRUE.
LargeDataSetForTextEmbeddings$is_configured()
bool TRUE if the model is fully configured. FALSE if not.
get_text_embedding_model_name()
Method for requesting the name (unique id) of the underlying text embedding model.
LargeDataSetForTextEmbeddings$get_text_embedding_model_name()
Returns a string describing name of the text embedding model.
get_model_info()
Method for retrieving information about the model that generated this embedding.
LargeDataSetForTextEmbeddings$get_model_info()
list containing all saved information about the underlying text embedding model.
load_from_disk()
Loads an object of class LargeDataSetForTextEmbeddings from disk and updates the object to the current version of the package.
LargeDataSetForTextEmbeddings$load_from_disk(dir_path)
dir_path: Path where the data set is stored.
Method does not return anything. It loads an object from disk.
get_model_label()
Method for retrieving the label of the model that generated this embedding.
LargeDataSetForTextEmbeddings$get_model_label()
string Label of the corresponding text embedding model
add_feature_extractor_info()
Method setting information on the TEFeatureExtractor that was used to reduce the number of dimensions of the text embeddings. This information should only be used if a TEFeatureExtractor was applied.
LargeDataSetForTextEmbeddings$add_feature_extractor_info( model_name, model_label = NA, features = NA, method = NA, noise_factor = NA, optimizer = NA )
model_name string Name of the underlying TextEmbeddingModel.
model_label string Label of the underlying TextEmbeddingModel.
features int Number of dimensions (features) for the compressed text embeddings.
method string Method that the TEFeatureExtractor applies for generating the compressed text
embeddings.
noise_factor double Noise factor of the TEFeatureExtractor.
optimizer string Optimizer used during training the TEFeatureExtractor.
Method does not return anything. It sets information on a TEFeatureExtractor.
get_feature_extractor_info()
Method for receiving information on the TEFeatureExtractor that was used to reduce the number of dimensions of the text embeddings.
LargeDataSetForTextEmbeddings$get_feature_extractor_info()
Returns a list with information on the TEFeatureExtractor. If no TEFeatureExtractor was used it
returns NULL.
is_compressed()
Checks if the text embeddings were reduced by a TEFeatureExtractor.
LargeDataSetForTextEmbeddings$is_compressed()
Returns TRUE if the number of dimensions was reduced by a TEFeatureExtractor. Otherwise returns FALSE.
get_times()
Number of chunks/times of the text embeddings.
LargeDataSetForTextEmbeddings$get_times()
Returns an int describing the number of chunks/times of the text embeddings.
get_features()
Number of actual features/dimensions of the text embeddings. If a TEFeatureExtractor was
used, the number of features is smaller than the original number of features. To receive the original number of
features (the number of features before applying a TEFeatureExtractor) you can use the method
get_original_features of this class.
LargeDataSetForTextEmbeddings$get_features()
Returns an int describing the number of features/dimensions of the text embeddings.
get_original_features()
Number of original features/dimensions of the text embeddings.
LargeDataSetForTextEmbeddings$get_original_features()
Returns an int describing the number of features/dimensions if no TEFeatureExtractor is used or
before a TEFeatureExtractor is applied.
get_pad_value()
Value for indicating padding.
LargeDataSetForTextEmbeddings$get_pad_value()
Returns an int describing the value used for padding.
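As a rough illustration of what the padding value is for, the following base R sketch marks the unused chunks of a single case with the default value -100 and counts the chunks that carry real data. The matrix layout (chunks in rows, features in columns) and the helper name count_real_chunks are assumptions made for this example; they are not part of the package.

```r
# Hypothetical helper: counts chunks that are not completely filled with the padding value.
count_real_chunks <- function(case_embedding, pad_value = -100) {
  is_padding <- apply(case_embedding, 1, function(chunk) all(chunk == pad_value))
  sum(!is_padding)
}

# One case with 4 chunks (times) and 3 features; the last two chunks are padding.
case_embedding <- rbind(
  c(0.12, -0.40, 0.88),
  c(0.05, 0.31, -0.27),
  rep(-100, 3),
  rep(-100, 3)
)

n_real <- count_real_chunks(case_embedding)
print(n_real)
```

Because -100 lies far outside the usual range of embedding values, a whole chunk equal to the padding value can be told apart from real data.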
add_embeddings_from_array()
Method for adding new data to the data set from an array. Please note that the method does not
check if cases already exist in the data set. To reduce the data set to unique cases call the method
reduce_to_unique_ids.
LargeDataSetForTextEmbeddings$add_embeddings_from_array(embedding_array)
embedding_array array containing the text embeddings.
The method does not return anything. It adds new data to the data set.
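Since duplicates are not checked when data is added, it can help to see what a reduction to unique cases amounts to. The sketch below works on a plain R array and keeps the first occurrence of every case id. It assumes an array of shape (cases, times, features) with the case ids as names of the first dimension. How reduce_to_unique_ids resolves duplicates internally is not stated here, so this is only an illustration.

```r
# Assumption: embeddings are an array (cases, times, features) with case ids as row names.
set.seed(1)
embeddings <- array(
  rnorm(3 * 2 * 4),
  dim = c(3, 2, 4),
  dimnames = list(c("case_a", "case_b", "case_a"), NULL, NULL)
)

# Keep only the first occurrence of every case id.
keep <- !duplicated(dimnames(embeddings)[[1]])
unique_embeddings <- embeddings[keep, , , drop = FALSE]
print(dim(unique_embeddings))
```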
add_embeddings_from_EmbeddedText()
Method for adding new data to the data set from an EmbeddedText. Please note that the method does
not check if cases already exist in the data set. To reduce the data set to unique cases call the method
reduce_to_unique_ids.
LargeDataSetForTextEmbeddings$add_embeddings_from_EmbeddedText(EmbeddedText)
EmbeddedText Object of class EmbeddedText.
The method does not return anything. It adds new data to the data set.
add_embeddings_from_LargeDataSetForTextEmbeddings()
Method for adding new data to the data set from a LargeDataSetForTextEmbeddings. Please note that
the method does not check if cases already exist in the data set. To reduce the data set to unique cases call
the method reduce_to_unique_ids.
LargeDataSetForTextEmbeddings$add_embeddings_from_LargeDataSetForTextEmbeddings( dataset )
dataset Object of class LargeDataSetForTextEmbeddings.
The method does not return anything. It adds new data to the data set.
convert_to_EmbeddedText()
Method for converting this object to an object of class EmbeddedText.
Attention: This object uses memory mapping to allow the usage of data sets that do not fit into memory. Calling this method loads the complete data set into memory/RAM. This may lead to an out-of-memory error.
LargeDataSetForTextEmbeddings$convert_to_EmbeddedText()
Returns an object of class EmbeddedText which is stored in memory/RAM.
clone()
The objects of this class are cloneable with this method.
LargeDataSetForTextEmbeddings$clone(deep = FALSE)
deep Whether to make a deep clone.
Other Data Management:
EmbeddedText,
LargeDataSetForText
Function loads or re-loads all python scripts within the package 'aifeducation'.
load_all_py_scripts()
Function does not return anything. It loads the requested scripts.
Other Utils Python Developers:
get_py_package_version(),
get_py_package_versions(),
load_py_scripts(),
run_py_file()
Function for loading objects created with 'aifeducation'.
load_from_disk(dir_path)
dir_path |
|
Returns an object of class TEClassifierRegular, TEClassifierProtoNet, TEFeatureExtractor, TextEmbeddingModel, LargeDataSetForTextEmbeddings, LargeDataSetForText or EmbeddedText.
Other Saving and Loading:
save_to_disk()
Function loads or re-loads python scripts within the package 'aifeducation'.
load_py_scripts(files)
files |
|
Function does not return anything. It loads the requested scripts.
Other Utils Python Developers:
get_py_package_version(),
get_py_package_versions(),
load_all_py_scripts(),
run_py_file()
Function loads the target data for a long running task.
long_load_target_data(file_path, selectet_column)
file_path |
|
selectet_column |
|
This function assumes that the target data is stored as a table with the cases in the rows and the categories in the columns. The ids of the cases must be stored in a column called "id".
Returns a named factor containing the target data.
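The expected layout and the returned named factor can be sketched in base R as follows. The example table, the helper to_named_factor, and its argument names are hypothetical; only the requirement of an "id" column and the named-factor result are taken from the description above.

```r
# Hypothetical target table: one row per case, ids in a column called "id".
target_table <- data.frame(
  id = c("case_1", "case_2", "case_3"),
  grade = c("low", "high", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turns one column into a factor named by the case ids.
to_named_factor <- function(tbl, selected_column) {
  stopifnot("id" %in% colnames(tbl), selected_column %in% colnames(tbl))
  target <- factor(tbl[[selected_column]])
  names(target) <- tbl$id
  target
}

target <- to_named_factor(target_table, "grade")
print(target)
```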
Other Utils Studio Developers:
add_missing_args(),
create_data_embeddings_description(),
summarize_args_for_long_task()
Function written in C++ for reshaping a matrix containing sequential data into an array for use with keras.
matrix_to_array_c(matrix, times, features)
matrix |
|
times |
|
features |
|
Returns an array. The first dimension corresponds to the cases, the second to the times, and the third to the features.
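For orientation, the same kind of reshaping can be written in plain R. The sketch assumes that every row of the matrix stores the features of the first chunk, followed by the features of the second chunk, and so on. This column order is an assumption for the example, and the function name is hypothetical; the C++ implementation may organize the data differently.

```r
# Hypothetical pure-R counterpart: matrix (cases x times*features) -> array (cases, times, features).
matrix_to_array_sketch <- function(m, times, features) {
  stopifnot(ncol(m) == times * features)
  out <- array(0, dim = c(nrow(m), times, features))
  for (t in seq_len(times)) {
    # Columns that hold the features of chunk t.
    cols <- ((t - 1) * features + 1):(t * features)
    out[, t, ] <- m[, cols, drop = FALSE]
  }
  out
}

# Two cases, three chunks, two features per chunk.
m <- matrix(1:12, nrow = 2, byrow = TRUE)
arr <- matrix_to_array_sketch(m, times = 3, features = 2)
print(dim(arr))
```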
Other Utils Developers:
auto_n_cores(),
create_object(),
create_synthetic_units_from_matrix(),
generate_id(),
get_n_chunks(),
get_synthetic_cases_from_matrix(),
get_time_stamp(),
tensor_to_matrix_c(),
to_categorical_c()
Abstract class for all models that do not rely on the python library 'transformers'. All models of this class require text embeddings as input. These are provided as objects of class EmbeddedText or LargeDataSetForTextEmbeddings.
Objects of this class contain fields and methods used in several other classes in 'AI for Education'.
This class is not designed for a direct application and should only be used by developers.
A new object of this class.
aifeducation::AIFEMaster -> aifeducation::AIFEBaseModel -> ModelsBasedOnTextEmbeddings
Inherited methods:
aifeducation::AIFEMaster$get_all_fields(),
aifeducation::AIFEMaster$get_documentation_license(),
aifeducation::AIFEMaster$get_ml_framework(),
aifeducation::AIFEMaster$get_model_config(),
aifeducation::AIFEMaster$get_model_description(),
aifeducation::AIFEMaster$get_model_info(),
aifeducation::AIFEMaster$get_model_license(),
aifeducation::AIFEMaster$get_package_versions(),
aifeducation::AIFEMaster$get_private(),
aifeducation::AIFEMaster$get_publication_info(),
aifeducation::AIFEMaster$get_sustainability_data(),
aifeducation::AIFEMaster$is_configured(),
aifeducation::AIFEMaster$is_trained(),
aifeducation::AIFEMaster$set_documentation_license(),
aifeducation::AIFEMaster$set_model_description(),
aifeducation::AIFEMaster$set_model_license(),
aifeducation::AIFEMaster$set_publication_info(),
aifeducation::AIFEBaseModel$count_parameter()
get_text_embedding_model()
Method for requesting the text embedding model information.
ModelsBasedOnTextEmbeddings$get_text_embedding_model()
list of all relevant model information on the text embedding model underlying the model.
get_text_embedding_model_name()
Method for requesting the name (unique id) of the underlying text embedding model.
ModelsBasedOnTextEmbeddings$get_text_embedding_model_name()
Returns a string describing the name of the text embedding model.
check_embedding_model()
Method for checking if the provided text embeddings are created with the same TextEmbeddingModel as the model.
ModelsBasedOnTextEmbeddings$check_embedding_model(text_embeddings)
text_embeddingsObject of class EmbeddedText or LargeDataSetForTextEmbeddings.
TRUE if the underlying TextEmbeddingModels are the same. FALSE if the models differ.
save()
Method for saving a model.
ModelsBasedOnTextEmbeddings$save(dir_path, folder_name)
dir_path string Path of the directory where the model should be saved.
folder_name string Name of the folder that should be created within the directory.
Function does not return a value. It saves the model to disk.
load_from_disk()
Loads an object from disk and updates the object to the current version of the package.
ModelsBasedOnTextEmbeddings$load_from_disk(dir_path)
dir_path Path where the object is stored.
Method does not return anything. It loads an object from disk.
plot_training_history()
Method for requesting a plot of the training history. This method requires the R package 'ggplot2' to work.
ModelsBasedOnTextEmbeddings$plot_training_history( final_training = FALSE, pl_step = NULL, measure = "loss", ind_best_model = TRUE, ind_selected_model = TRUE, x_min = NULL, x_max = NULL, y_min = NULL, y_max = NULL, add_min_max = TRUE, text_size = 10L )
final_training bool If FALSE the values of the performance estimation are used. If TRUE only
the epochs of the final training are used.
pl_step int Number of the step during pseudo labeling to plot. Only relevant if the model was trained
with active pseudo labeling.
measure Measure to plot.
ind_best_model bool If TRUE the plot indicates the best states of the model according to the chosen measure.
ind_selected_model bool If TRUE the plot indicates the states of the model which are used after training. These are the final states of the fold or the final state of the last training loop.
x_min int Minimal value for x-axis. Set to NULL for an automatic adjustment. Allowed values:
x_max int Maximal value for x-axis. Set to NULL for an automatic adjustment. Allowed values:
y_min int Minimal value for y-axis. Set to NULL for an automatic adjustment. Allowed values:
y_max int Maximal value for y-axis. Set to NULL for an automatic adjustment. Allowed values:
add_min_max bool If TRUE the minimal and maximal values during performance estimation are part of the plot. If FALSE only the mean values are shown. Parameter is ignored if final_training=TRUE.
text_size int Size of text elements. Allowed values:
Returns a plot of class ggplot visualizing the training process.
Prepare history data of objects
Function for preparing the history data of a model in order to be plotted in AI for Education - Studio.
final bool If TRUE the history data of the final training is used for the data set.
pl_step int If use_pl=TRUE select the step within pseudo labeling for which the data should be prepared.
Returns a named list with the training history data of the model. The
reported measures depend on the provided model.
Utils Studio Developers internal
clone()
The objects of this class are cloneable with this method.
ModelsBasedOnTextEmbeddings$clone(deep = FALSE)
deep Whether to make a deep clone.
Other R6 Classes for Developers:
AIFEBaseModel,
AIFEMaster,
BaseModelCore,
ClassifiersBasedOnTextEmbeddings,
DataManagerClassifier,
LargeDataSetBase,
TEClassifiersBasedOnProtoNet,
TEClassifiersBasedOnRegular,
TokenizerBase
Function prints the duration of a test to the console if the test is running on CI. Otherwise no output appears in the console.
monitor_test_time_on_CI(start_time, test_name)
start_time |
|
test_name |
|
Returns nothing.
Other Utils TestThat Developers:
check_adjust_n_samples_on_CI(),
generate_args_for_tests(),
generate_embeddings(),
generate_tensors(),
get_current_args_for_print(),
get_fixed_test_tensor(),
get_test_data_for_classifiers(),
random_bool_on_CI()
Prints the message msg together with the current date if the parameter trace is TRUE. The output is
produced with the message() or the cat() function.
output_message(msg, trace, msg_fun)
msg |
|
trace |
|
msg_fun |
|
This function returns nothing.
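The behavior described above can be approximated in a few lines of base R. The sketch assumes that msg_fun is a logical switch that selects message() when TRUE and cat() when FALSE. The source does not state this, and the function name below is hypothetical.

```r
# Hypothetical re-implementation of a trace-controlled, time-stamped message.
output_message_sketch <- function(msg, trace, msg_fun) {
  if (isTRUE(trace)) {
    text <- paste(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), msg)
    if (isTRUE(msg_fun)) {
      message(text)            # written to stderr
    } else {
      cat(text, "\n", sep = "") # written to stdout
    }
  }
  invisible(NULL)
}

output_message_sketch("Start training", trace = TRUE, msg_fun = FALSE)
output_message_sketch("This is not shown", trace = FALSE, msg_fun = FALSE)
```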
Other Utils Log Developers:
cat_message(),
clean_pytorch_log_transformers(),
print_message(),
read_log(),
read_loss_log(),
reset_log(),
reset_loss_log(),
write_log()
Function converts a R array into a numpy array that can be added to an arrow data set. The array should represent embeddings.
prepare_r_array_for_dataset(r_array)
r_array |
|
Returns a numpy array.
Other Utils Python Data Management Developers:
class_vector_to_py_dataset(),
create_py_dataset_cache_file_path(),
data.frame_to_py_dataset(),
extract_column_from_py_dataset(),
get_batches_index(),
py_dataset_to_embeddings(),
reduce_to_unique(),
tensor_list_to_numpy(),
tensor_to_numpy()
This function checks for python and a specified environment. If the environment exists it will be activated. If python is already initialized it uses the current environment.
prepare_session(env_type = "auto", envname = "aifeducation", check_session = TRUE)
env_type |
|
envname |
|
check_session |
|
Function does not return anything. It is used for preparing python and R.
Other Installation and Configuration:
check_aif_py_modules(),
get_recommended_py_versions(),
install_aifeducation(),
install_aifeducation_studio(),
install_py_modules(),
set_transformers_logger(),
update_aifeducation()
Prints the message msg together with the current date with the message() function if the parameter trace is TRUE.
print_message(msg, trace)
msg |
|
trace |
|
This function returns nothing.
Other Utils Log Developers:
cat_message(),
clean_pytorch_log_transformers(),
output_message(),
read_log(),
read_loss_log(),
reset_log(),
reset_loss_log(),
write_log()
Function for converting an arrow data set into a data set that can be used to store and process embeddings.
py_dataset_to_embeddings(py_dataset)
py_dataset |
Object of class |
Returns the data set of class datasets.arrow_dataset.Dataset with only two columns ("id","input"). "id"
stores the name of the cases while "input" stores the embeddings.
Other Utils Python Data Management Developers:
class_vector_to_py_dataset(),
create_py_dataset_cache_file_path(),
data.frame_to_py_dataset(),
extract_column_from_py_dataset(),
get_batches_index(),
prepare_r_array_for_dataset(),
reduce_to_unique(),
tensor_list_to_numpy(),
tensor_to_numpy()
Function randomly returns TRUE or FALSE if it is running on CI. It returns FALSE if it is
not running on CI.
random_bool_on_CI()
Returns a bool.
Other Utils TestThat Developers:
check_adjust_n_samples_on_CI(),
generate_args_for_tests(),
generate_embeddings(),
generate_tensors(),
get_current_args_for_print(),
get_fixed_test_tensor(),
get_test_data_for_classifiers(),
monitor_test_time_on_CI()
This function reads a log file at the given location. The log file should be created with write_log.
read_log(file_path)
file_path |
|
Returns a matrix containing the log file.
Other Utils Log Developers:
cat_message(),
clean_pytorch_log_transformers(),
output_message(),
print_message(),
read_loss_log(),
reset_log(),
reset_loss_log(),
write_log()
This function reads a log file that contains values for every epoch for the loss. The values are grouped for training and validation data. The log contains values for test data if test data was available during training.
read_loss_log(path_loss)
path_loss |
|
In general the loss is written by a python function during model training.
Function returns a matrix that contains two or three rows depending on
the data inside the loss log. In the case of two rows the first represents the
training data and the second the validation data. In the case of three rows
the third row represents the values for test data. The columns represent the
epochs.
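To make the shape of this matrix concrete, the following base R sketch writes a small loss log to a temporary file, reads it back, and looks up the epoch with the lowest validation loss. The file layout (comma-separated values without a header), the loss values, and the row and column names are assumptions for the example; the files written by the package may look different.

```r
# Hypothetical loss values: rows are data splits, columns are epochs.
loss <- rbind(
  train = c(0.9, 0.6, 0.4),
  validation = c(1.0, 0.7, 0.8)
)

# Write and re-read the log as a plain .csv file without header.
log_file <- tempfile(fileext = ".csv")
write.table(loss, log_file, sep = ",", row.names = FALSE, col.names = FALSE)
loss_read <- as.matrix(read.csv(log_file, header = FALSE))

# Two rows: training and validation. A third row would hold the test data.
rownames(loss_read) <- c("train", "validation", "test")[seq_len(nrow(loss_read))]
colnames(loss_read) <- paste0("epoch_", seq_len(ncol(loss_read)))

best_epoch <- which.min(loss_read["validation", ])
print(loss_read)
print(best_epoch)
```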
Other Utils Log Developers:
cat_message(),
clean_pytorch_log_transformers(),
output_message(),
print_message(),
read_log(),
reset_log(),
reset_loss_log(),
write_log()
Function creates an arrow data set that contains only unique cases. That is, duplicates are removed.
reduce_to_unique(dataset_to_reduce, column_name)
dataset_to_reduce |
Object of class |
column_name |
|
Returns a data set of class datasets.arrow_dataset.Dataset where the duplicates are removed according to
the given column.
Other Utils Python Data Management Developers:
class_vector_to_py_dataset(),
create_py_dataset_cache_file_path(),
data.frame_to_py_dataset(),
extract_column_from_py_dataset(),
get_batches_index(),
prepare_r_array_for_dataset(),
py_dataset_to_embeddings(),
tensor_list_to_numpy(),
tensor_to_numpy()
This function writes a log file with default values. The file can be read with read_log.
reset_log(log_path)
log_path |
|
Function does not return anything. It is used to write an "empty" log file.
Other Utils Log Developers:
cat_message(),
clean_pytorch_log_transformers(),
output_message(),
print_message(),
read_log(),
read_loss_log(),
reset_loss_log(),
write_log()
This function writes an empty log file for loss information.
reset_loss_log(log_path, epochs)
log_path |
|
epochs |
|
Function does not return anything. It writes a log file at the given location. The file is a .csv file that contains three rows. The first row takes the values for the training data, the second for the validation data, and the third for the test data. The columns represent epochs.
Other Utils Log Developers:
cat_message(),
clean_pytorch_log_transformers(),
output_message(),
print_message(),
read_log(),
read_loss_log(),
reset_log(),
write_log()
Used to run python files with reticulate::py_run_file() from folder python.
run_py_file(py_file_name)
py_file_name |
|
This function returns nothing.
Other Utils Python Developers:
get_py_package_version(),
get_py_package_versions(),
load_all_py_scripts(),
load_py_scripts()
Function for saving objects created with 'aifeducation'.
save_to_disk(object, dir_path, folder_name)
object |
Object of class TEClassifierRegular, TEClassifierProtoNet, TEFeatureExtractor, TextEmbeddingModel, LargeDataSetForTextEmbeddings, LargeDataSetForText or EmbeddedText which should be saved. |
dir_path |
|
folder_name |
|
Function does not return a value. It is called for its side effect of saving the object to disk.
Other Saving and Loading:
load_from_disk()
This function changes the level for logging information of the 'transformers' library. It influences the output printed to console for creating and training transformer models as well as TextEmbeddingModels.
set_transformers_logger(level = "ERROR")
level |
|
This function does not return anything. It is used for its side effects.
Other Installation and Configuration:
check_aif_py_modules(),
get_recommended_py_versions(),
install_aifeducation(),
install_aifeducation_studio(),
install_py_modules(),
prepare_session(),
update_aifeducation()
Function starts a shiny app that represents Aifeducation Studio.
start_aifeducation_studio(launch_browser = TRUE)
launch_browser |
|
This function does not return anything. It is used to start a shiny app.
This function extracts the input relevant for a specific method of a specific class from shiny input.
In addition, it adds the paths
to all objects which cannot be exported to another R session. These objects
must be loaded separately in the new session with the function add_missing_args.
The paths are intended to be used with shiny::ExtendedTask. The final preparation of the arguments
should be done with
The function can also be used to override the default values of
a method or to add values for arguments which are not part of shiny input
(use parameter override_args).
summarize_args_for_long_task( input, object_class, method = "configure", path_args = list(path_to_embeddings = NULL, path_to_textual_dataset = NULL, path_to_target_data = NULL, path_to_feature_extractor = NULL, destination_path = NULL, folder_name = NULL), override_args = list(), meta_args = list(py_environment_type = get_py_env_type(), py_env_name = get_py_env_name(), target_data_column = input$data_target_column, object_class = input$classifier_type) )
input |
Shiny input. |
object_class |
|
method |
|
path_args |
|
override_args |
|
meta_args |
|
Returns a named list with the following entries:
args: Named list of all arguments necessary for the method of the class.
path_args: Named list of all paths for loading the objects missing in args.
meta_args: Named list of all arguments that are not part of the arguments of
the method but which are necessary to set up the shiny::ExtendedTask correctly.
Please note that all lists are named lists of the format (argument_name = values).
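The structure of the returned list can be pictured with a small hand-built example. Only the three top-level entry names (args, path_args, meta_args) and the argument names are taken from this page. All values, including the environment type "conda" and the paths, are hypothetical placeholders.

```r
# Hypothetical values; only the entry names follow the documentation.
task_args <- list(
  args = list(name = "my_classifier", dense_n_layers = 1L),
  path_args = list(
    path_to_embeddings = "path/to/embeddings",
    destination_path = "path/to/output"
  ),
  meta_args = list(
    py_environment_type = "conda",
    py_env_name = "aifeducation",
    object_class = "TEClassifierParallel"
  )
)

# Every entry is itself a named list of the format (argument_name = values).
str(task_args, max.level = 1)
```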
Other Utils Studio Developers:
add_missing_args(),
create_data_embeddings_description(),
long_load_target_data()
Classification Type
This is a probability classifier that predicts a probability distribution for different classes/categories. This is the standard case and the most common one in the literature.
Parallel Core Architecture
This model is based on a parallel architecture. An input is passed to different types of layers separately. At the end the outputs are combined to create the final output of the whole model.
Transformer Encoder Layers
Description
The transformer encoder layers follow the structure of the encoder layers used in transformer models. A single layer is designed as described by Chollet, Kalinowski, and Allaire (2022, p. 373) with the exception that single components of the layers (such as the activation function, the kind of residual connection, the kind of normalization or the kind of attention) can be customized. All parameters with the prefix tf_ can be used to configure this layer.
Feature Layer
Description
The feature layer is a dense layer that can be used to increase or decrease the number of features of the input data before passing the data into your model. The aim of this layer is to increase or reduce the complexity of the data for your model. The output size of this layer determines the number of features for all following layers. In the special case that the requested number of features equals the number of features of the text embeddings this layer is reduced to a dropout layer with masking capabilities. All parameters with the prefix feat_ can be used to configure this layer.
Dense Layers
Description
A fully connected layer. The layer is applied to every step of a sequence. All parameters with the prefix dense_ can be used to configure this layer.
Multiple N-Gram Layers
Description
This type of layer focuses on sub-sequences and performs a 1D convolution. On a word and token level these sub-sequences can be interpreted as n-grams (Jacovi, Shalom & Goldberg 2018). The convolution is done across all features. The number of filters equals the number of features of the input tensor. Thus, the shape of the tensor is retained (Pham, Kruszewski & Boleda 2016).
The layer is able to consider multiple n-grams at the same time. In this case the convolution of the n-grams is done separately and the resulting tensors are concatenated along the feature dimension. The number of filters for each n-gram is set to the next smallest natural number of num_features/num_n-grams. A residual is added to the first n-gram. Thus, the resulting tensor has the same shape as the input tensor.
Sub-sequences that are masked in the input are also masked in the output.
The output of this layer can be understood as the results of the n-gram filters. Stacking this layer allows the model to perform n-gram detection of n-grams (meta perspective). All parameters with the prefix ng_conv_ can be used to configure this layer.
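The filter arithmetic of this layer can be sketched in base R. The sketch assumes that "next smallest natural number" means rounding num_features/num_n-grams down, that the "residual" added to the first n-gram is the remainder of that division, and that one n-gram is used per window size from ng_conv_ks_min to ng_conv_ks_max. The function name is hypothetical and the package may distribute the filters differently.

```r
# Hypothetical helper: number of filters per n-gram so that the sum equals num_features.
ngram_filter_sizes <- function(num_features, ks_min, ks_max) {
  kernel_sizes <- seq(ks_min, ks_max)
  n_grams <- length(kernel_sizes)
  # Integer division gives the base number of filters per n-gram (assumption).
  filters <- rep(num_features %/% n_grams, n_grams)
  # The remainder is added to the first n-gram (assumption).
  filters[1] <- filters[1] + num_features %% n_grams
  names(filters) <- paste0(kernel_sizes, "-gram")
  filters
}

filters <- ngram_filter_sizes(num_features = 50, ks_min = 2, ks_max = 4)
print(filters)
print(sum(filters))
```

With 50 features and window sizes 2 to 4, the concatenated output again has 50 features, so the shape of the input tensor is retained.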
Recurrent Layers
Description
A regular recurrent layer, either a Gated Recurrent Unit (GRU) or a Long Short-Term Memory (LSTM) layer. Uses PyTorch's implementation. All parameters with the prefix rec_ can be used to configure this layer.
Merge Layer
Description
Layer for combining the output of different layers. All inputs must be sequential data of shape (Batch, Times, Features). First, pooling over time is applied, extracting the minimal and/or maximal features. Second, the pooled tensors are combined by calculating their weighted sum. Different attention mechanisms can be used to dynamically calculate the corresponding weights. This allows the model to decide which part of the data is most useful. Finally, pooling over features is applied, extracting a specific number of maximal and/or minimal features. A normalization of all inputs at the beginning of the layer is possible. All parameters with the prefix merge_ can be used to configure this layer.
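The first two steps of the merge layer (pooling over time and the weighted sum) can be illustrated with plain R arrays. This is a strongly simplified sketch: it uses max pooling only, replaces the attention mechanism with softmax weights computed from fixed scores, and leaves out the final pooling over features and the optional normalization. All function names are hypothetical.

```r
# Max pooling over time: (Batch, Times, Features) -> (Batch, Features).
pool_over_time <- function(x) apply(x, c(1, 3), max)

# Softmax turns arbitrary scores into weights that sum to one.
softmax <- function(z) {
  e <- exp(z - max(z))
  e / sum(e)
}

# Weighted sum of the pooled streams; fixed scores stand in for attention.
merge_streams <- function(streams, scores) {
  weights <- softmax(scores)
  pooled <- lapply(streams, pool_over_time)
  weighted <- Map(function(p, w) w * p, pooled, weights)
  Reduce(`+`, weighted)
}

# Two streams with 2 cases, 3 chunks, and 4 features each.
stream_a <- array(1, dim = c(2, 3, 4))
stream_b <- array(3, dim = c(2, 3, 4))
merged <- merge_streams(list(stream_a, stream_b), scores = c(0, 0))
print(dim(merged))
```

Equal scores give equal weights, so the merged tensor is the plain average of the pooled streams. Learned attention would shift the weights towards the more useful stream.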
Training and Prediction
For the creation and training of a classifier an object of class EmbeddedText or LargeDataSetForTextEmbeddings on the one hand and a factor on the other hand are necessary.
The object of class EmbeddedText or LargeDataSetForTextEmbeddings contains the numerical text representations (text embeddings) of the raw texts generated by an object of class TextEmbeddingModel. For supporting large data sets it is recommended to use LargeDataSetForTextEmbeddings instead of EmbeddedText.
The factor contains the classes/categories for every text. Missing values (unlabeled cases) are supported and can
be used for pseudo labeling.
For predictions an object of class EmbeddedText or LargeDataSetForTextEmbeddings has to be used which was created with the same TextEmbeddingModel as for training.
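A target factor of the kind described above can be prepared in base R as sketched below. The case ids and categories are hypothetical. Missing values mark unlabeled cases, and the levels are given explicitly so that they are ordered from the lowest to the highest category, which matters for ordinal data.

```r
# Hypothetical target data: NA marks unlabeled cases.
targets <- factor(
  c("low", "high", NA, "medium", NA, "low"),
  levels = c("low", "medium", "high") # ordered from lowest to highest category
)
names(targets) <- paste0("case_", seq_along(targets))

# Labeled cases are used for training, unlabeled cases can enter pseudo labeling.
labeled <- targets[!is.na(targets)]
unlabeled_ids <- names(targets)[is.na(targets)]

print(table(labeled))
print(unlabeled_ids)
```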
Returns a new object of this class ready for configuration or for loading a saved classifier.
aifeducation::AIFEMaster -> aifeducation::AIFEBaseModel -> aifeducation::ModelsBasedOnTextEmbeddings -> aifeducation::ClassifiersBasedOnTextEmbeddings -> aifeducation::TEClassifiersBasedOnRegular -> TEClassifierParallel
Inherited methods:
aifeducation::AIFEMaster$get_all_fields(),
aifeducation::AIFEMaster$get_documentation_license(),
aifeducation::AIFEMaster$get_ml_framework(),
aifeducation::AIFEMaster$get_model_config(),
aifeducation::AIFEMaster$get_model_description(),
aifeducation::AIFEMaster$get_model_info(),
aifeducation::AIFEMaster$get_model_license(),
aifeducation::AIFEMaster$get_package_versions(),
aifeducation::AIFEMaster$get_private(),
aifeducation::AIFEMaster$get_publication_info(),
aifeducation::AIFEMaster$get_sustainability_data(),
aifeducation::AIFEMaster$is_configured(),
aifeducation::AIFEMaster$is_trained(),
aifeducation::AIFEMaster$set_documentation_license(),
aifeducation::AIFEMaster$set_model_description(),
aifeducation::AIFEMaster$set_model_license(),
aifeducation::AIFEMaster$set_publication_info(),
aifeducation::AIFEBaseModel$count_parameter(),
aifeducation::ModelsBasedOnTextEmbeddings$get_text_embedding_model(),
aifeducation::ModelsBasedOnTextEmbeddings$get_text_embedding_model_name(),
aifeducation::ClassifiersBasedOnTextEmbeddings$adjust_target_levels(),
aifeducation::ClassifiersBasedOnTextEmbeddings$check_embedding_model(),
aifeducation::ClassifiersBasedOnTextEmbeddings$check_feature_extractor_object_type(),
aifeducation::ClassifiersBasedOnTextEmbeddings$load_from_disk(),
aifeducation::ClassifiersBasedOnTextEmbeddings$plot_coding_stream(),
aifeducation::ClassifiersBasedOnTextEmbeddings$plot_training_history(),
aifeducation::ClassifiersBasedOnTextEmbeddings$predict(),
aifeducation::ClassifiersBasedOnTextEmbeddings$requires_compression(),
aifeducation::ClassifiersBasedOnTextEmbeddings$save(),
aifeducation::TEClassifiersBasedOnRegular$train()
configure()
Creating a new instance of this class.
TEClassifierParallel$configure( name = NULL, label = NULL, text_embeddings = NULL, feature_extractor = NULL, target_levels = NULL, shared_feat_layer = TRUE, cls_head_type = "Regular", feat_act_fct = "ELU", feat_size = 50L, feat_bias = TRUE, feat_dropout = 0, feat_parametrizations = "None", feat_normalization_type = "LayerNorm", ng_conv_act_fct = "ELU", ng_conv_n_layers = 1L, ng_conv_ks_min = 2L, ng_conv_ks_max = 4L, ng_conv_bias = FALSE, ng_conv_dropout = 0.1, ng_conv_parametrizations = "None", ng_conv_normalization_type = "LayerNorm", ng_conv_residual_type = "ResidualGate", dense_act_fct = "ELU", dense_n_layers = 1L, dense_dropout = 0.5, dense_bias = FALSE, dense_parametrizations = "None", dense_normalization_type = "LayerNorm", dense_residual_type = "ResidualGate", rec_act_fct = "Tanh", rec_n_layers = 1L, rec_type = "GRU", rec_bidirectional = FALSE, rec_dropout = 0.2, rec_bias = FALSE, rec_parametrizations = "None", rec_normalization_type = "LayerNorm", rec_residual_type = "ResidualGate", tf_act_fct = "ELU", tf_dense_dim = 50L, tf_n_layers = 1L, tf_dropout_rate_1 = 0.1, tf_dropout_rate_2 = 0.5, tf_attention_type = "MultiHead", tf_positional_type = "absolute", tf_num_heads = 1L, tf_bias = FALSE, tf_parametrizations = "None", tf_normalization_type = "LayerNorm", tf_normalization_position = "Pre", tf_residual_type = "ResidualGate", merge_attention_type = "multi_head", merge_num_heads = 1L, merge_normalization_type = "LayerNorm", merge_pooling_features = 50L, merge_pooling_type = "MinMax" )
name (string): Name of the new model. Please follow common naming conventions. Free text can be used with the parameter label. If set to NULL, a unique ID is generated automatically. Allowed values: any
label (string): Label for the new model. Free text can be used here. Allowed values: any
text_embeddings (EmbeddedText, LargeDataSetForTextEmbeddings): Object of class EmbeddedText or LargeDataSetForTextEmbeddings.
feature_extractor (TEFeatureExtractor): Object of class TEFeatureExtractor that should be used to reduce the number of dimensions of the text embeddings. Set to NULL if no feature extractor should be applied.
target_levels (vector): Vector containing the levels (categories or classes) within the target data. Please note that order matters. For ordinal data, ensure that the levels are sorted correctly, with later levels indicating a higher category/class. For nominal data the order does not matter.
shared_feat_layer (bool): If TRUE all streams use the same feature layer. If FALSE each stream uses its own feature layer.
cls_head_type (string): Type of classification head. Allowed values: 'Regular', 'PairwiseOrthogonal', 'PairwiseOrthogonalDense'
feat_act_fct (string): Activation function for all layers. Allowed values: 'ELU', 'LeakyReLU', 'ReLU', 'GELU', 'Sigmoid', 'Tanh', 'PReLU'
feat_size (int): Number of neurons for each dense layer. Allowed values:
feat_bias (bool): If TRUE a bias term is added to all layers. If FALSE no bias term is added to the layers.
feat_dropout (double): Dropout for the dense projection of the feature layer. Allowed values:
feat_parametrizations (string): Re-parametrizations of the weights of layers. Allowed values: 'None', 'OrthogonalWeights', 'WeightNorm', 'SpectralNorm'
feat_normalization_type (string): Type of normalization applied to all layers and stacks of layers. Allowed values: 'LayerNorm', 'BatchNorm', 'PowerNorm', 'RMSNorm', 'None'
ng_conv_act_fct (string): Activation function for all layers. Allowed values: 'ELU', 'LeakyReLU', 'ReLU', 'GELU', 'Sigmoid', 'Tanh', 'PReLU'
ng_conv_n_layers (int): Number of times the n-gram layers are added to the network. Allowed values:
ng_conv_ks_min (int): Minimal window size for n-grams. Allowed values:
ng_conv_ks_max (int): Maximal window size for n-grams. Allowed values:
ng_conv_bias (bool): If TRUE a bias term is added to all layers. If FALSE no bias term is added to the layers.
ng_conv_dropout (double): Dropout for the n-gram convolution layers. Allowed values:
ng_conv_parametrizations (string): Re-parametrizations of the weights of layers. Allowed values: 'None', 'OrthogonalWeights', 'WeightNorm', 'SpectralNorm'
ng_conv_normalization_type (string): Type of normalization applied to all layers and stacks of layers. Allowed values: 'LayerNorm', 'BatchNorm', 'PowerNorm', 'RMSNorm', 'None'
ng_conv_residual_type (string): Type of residual connection for all layers and stacks of layers. Allowed values: 'ResidualGate', 'Addition', 'None'
dense_act_fct (string): Activation function for all layers. Allowed values: 'ELU', 'LeakyReLU', 'ReLU', 'GELU', 'Sigmoid', 'Tanh', 'PReLU'
dense_n_layers (int): Number of dense layers. Allowed values:
dense_dropout (double): Dropout between dense layers. Allowed values:
dense_bias (bool): If TRUE a bias term is added to all layers. If FALSE no bias term is added to the layers.
dense_parametrizations (string): Re-parametrizations of the weights of layers. Allowed values: 'None', 'OrthogonalWeights', 'WeightNorm', 'SpectralNorm'
dense_normalization_type (string): Type of normalization applied to all layers and stacks of layers. Allowed values: 'LayerNorm', 'BatchNorm', 'PowerNorm', 'RMSNorm', 'None'
dense_residual_type (string): Type of residual connection for all layers and stacks of layers. Allowed values: 'ResidualGate', 'Addition', 'None'
rec_act_fct (string): Activation function for all layers. Allowed values: 'Tanh'
rec_n_layers (int): Number of recurrent layers. Allowed values:
rec_type (string): Type of the recurrent layers. rec_type='GRU' for Gated Recurrent Unit and rec_type='LSTM' for Long Short-Term Memory. Allowed values: 'GRU', 'LSTM'
rec_bidirectional (bool): If TRUE a bidirectional version of the recurrent layers is used.
rec_dropout (double): Dropout between recurrent layers. Allowed values:
rec_bias (bool): If TRUE a bias term is added to all layers. If FALSE no bias term is added to the layers.
rec_parametrizations (string): Re-parametrizations of the weights of layers. Allowed values: 'None'
rec_normalization_type (string): Type of normalization applied to all layers and stacks of layers. Allowed values: 'LayerNorm', 'BatchNorm', 'PowerNorm', 'RMSNorm', 'None'
rec_residual_type (string): Type of residual connection for all layers and stacks of layers. Allowed values: 'ResidualGate', 'Addition', 'None'
tf_act_fct (string): Activation function for all layers. Allowed values: 'ELU', 'LeakyReLU', 'ReLU', 'GELU', 'Sigmoid', 'Tanh', 'PReLU'
tf_dense_dim (int): Size of the projection layer within each transformer encoder. Allowed values:
tf_n_layers (int): Number of times the encoder is added to the network. Allowed values:
tf_dropout_rate_1 (double): Dropout after the attention mechanism within the transformer encoder layers. Allowed values:
tf_dropout_rate_2 (double): Dropout for the dense projection within the transformer encoder layers. Allowed values:
tf_attention_type (string): Attention type. Allowed values: 'Fourier', 'MultiHead'
tf_positional_type (string): Type of processing positional information. Allowed values: 'None', 'absolute'
tf_num_heads (int): Number of attention heads for a self-attention layer. Only relevant if tf_attention_type='MultiHead'. Allowed values:
tf_bias (bool): If TRUE a bias term is added to all layers. If FALSE no bias term is added to the layers.
tf_parametrizations (string): Re-parametrizations of the weights of layers. Allowed values: 'None', 'OrthogonalWeights', 'WeightNorm', 'SpectralNorm'
tf_normalization_type (string): Type of normalization applied to all layers and stacks of layers. Allowed values: 'LayerNorm', 'BatchNorm', 'PowerNorm', 'RMSNorm', 'None'
tf_normalization_position (string): Position where the normalization should be applied. Allowed values: 'Pre', 'Post'
tf_residual_type (string): Type of residual connection for all layers and stacks of layers. Allowed values: 'ResidualGate', 'Addition', 'None'
merge_attention_type (string): Attention type. Allowed values: 'Fourier', 'MultiHead'
merge_num_heads (int): Number of attention heads for a self-attention layer. Only relevant if multi-head attention is selected with merge_attention_type. Allowed values:
merge_normalization_type (string): Type of normalization applied to all layers and stacks of layers. Allowed values: 'LayerNorm', 'BatchNorm', 'PowerNorm', 'RMSNorm', 'None'
merge_pooling_features (int): Number of features to be extracted at the end of the model. Allowed values:
merge_pooling_type (string): Type of extracting intermediate features. Allowed values: 'Max', 'Min', 'MinMax'
The function returns nothing. It modifies the current object.
clone()
The objects of this class are cloneable with this method.
TEClassifierParallel$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other Classification:
TEClassifierParallelPrototype,
TEClassifierProtoNet,
TEClassifierRegular,
TEClassifierSequential,
TEClassifierSequentialPrototype
Classification Type
This object is a metric-based classifier and represents an implementation of a prototypical network for few-shot learning as described by Snell, Swersky, and Zemel (2017). The network uses a multi-way contrastive loss described by Zhang et al. (2019). The network learns to scale the metric as described by Oreshkin, Rodriguez, and Lacoste (2018).
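As a rough illustration of the metric-based idea, the following Python sketch turns the distances between one case and the class prototypes into class probabilities. The prototype coordinates, the fixed scale value, and the function name are hypothetical. In the model the prototypes and the scaling are learned during training, and the actual loss and implementation may differ.

```python
import math


def class_probabilities(query, prototypes, scale=1.0):
    """Softmax over negative, scaled Euclidean distances to each prototype.

    `prototypes` maps a class label to its (hypothetical) prototype vector.
    `scale` stands in for the learned metric scaling factor.
    """
    labels = list(prototypes)
    logits = [-scale * math.dist(query, prototypes[label]) for label in labels]
    # Subtract the maximum for numerical stability before exponentiating.
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return {label: value / total for label, value in zip(labels, exps)}


# Two classes embedded in a 2-dimensional classification space.
prototypes = {"low": [0.0, 0.0], "high": [2.0, 2.0]}
probs = class_probabilities([0.5, 0.4], prototypes, scale=2.0)
predicted = max(probs, key=probs.get)
```

A larger scale makes the probabilities more peaked around the nearest prototype, which is why learning the scale can help training.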
Parallel Core Architecture
This model is based on a parallel architecture. An input is passed to different types of layers separately. At the end the outputs are combined to create the final output of the whole model.
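The data flow can be pictured with a small Python sketch. The two toy streams and the averaging merge below are hypothetical stand-ins for the real layer types and the merge layer. They only show that every stream receives the same input and that the outputs are combined at the end.

```python
def scale_stream(seq):
    """Toy stand-in for one stream: scales every feature."""
    return [[2.0 * value for value in step] for step in seq]


def shift_stream(seq):
    """Toy stand-in for another stream: adds the previous time step."""
    out = []
    previous = [0.0] * len(seq[0])
    for step in seq:
        out.append([value + prev for value, prev in zip(step, previous)])
        previous = step
    return out


def merge_mean(outputs):
    """Toy merge: element-wise mean over the stream outputs."""
    n = len(outputs)
    return [
        [sum(values) / n for values in zip(*steps)]
        for steps in zip(*outputs)
    ]


def run_parallel(seq, streams, merge):
    # Every stream sees the same input; only the results are combined.
    return merge([stream(seq) for stream in streams])


seq = [[1.0, 2.0], [3.0, 4.0]]  # shape (Times, Features)
result = run_parallel(seq, [scale_stream, shift_stream], merge_mean)
```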
Transformer Encoder Layers
Description
The transformer encoder layers follow the structure of the encoder layers used in transformer models. A single layer is designed as described by Chollet, Kalinowski, and Allaire (2022, p. 373) with the exception that single components of the layers (such as the activation function, the kind of residual connection, the kind of normalization or the kind of attention) can be customized. All parameters with the prefix tf_ can be used to configure this layer.
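To make the difference between the normalization positions 'Pre' and 'Post' concrete, here is a minimal Python sketch of one sub-layer with an additive residual connection. It follows the common pre-norm and post-norm definitions. Whether the package wires its layers exactly like this is an assumption, and the function names are illustrative only.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((value - mean) ** 2 for value in x) / len(x)
    return [(value - mean) / math.sqrt(var + eps) for value in x]


def sublayer_block(x, sublayer, position="Pre"):
    """Apply `sublayer` with an additive residual and layer normalization."""
    if position == "Pre":
        # Normalize first, then add the sub-layer output to the raw input.
        return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]
    if position == "Post":
        # Add the sub-layer output first, then normalize the sum.
        return layer_norm([a + b for a, b in zip(x, sublayer(x))])
    raise ValueError("position must be 'Pre' or 'Post'")


def zero_sublayer(x):
    """Toy sub-layer that contributes nothing (for demonstration only)."""
    return [0.0] * len(x)


x = [1.0, 2.0, 3.0]
pre = sublayer_block(x, zero_sublayer, "Pre")    # input passes through unchanged
post = sublayer_block(x, zero_sublayer, "Post")  # output is normalized
```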
Feature Layer
Description
The feature layer is a dense layer that can be used to increase or decrease the number of features of the input data before passing the data into your model. The aim of this layer is to increase or reduce the complexity of the data for your model. The output size of this layer determines the number of features for all following layers. In the special case that the requested number of features equals the number of features of the text embeddings this layer is reduced to a dropout layer with masking capabilities. All parameters with the prefix feat_ can be used to configure this layer.
Dense Layers
Description
A fully connected layer. The layer is applied to every step of a sequence. All parameters with the prefix dense_ can be used to configure this layer.
Multiple N-Gram Layers
Description
This type of layer focuses on sub-sequences and performs a 1D convolution. On a word and token level these sub-sequences can be interpreted as n-grams (Jacovi, Shalom & Goldberg 2018). The convolution is done across all features. The number of filters equals the number of features of the input tensor. Thus, the shape of the tensor is retained (Pham, Kruszewski & Boleda 2016).
The layer is able to consider multiple n-grams at the same time. In this case the convolution of the n-grams is done separately and the resulting tensors are concatenated along the feature dimension. The number of filters for each n-gram is set to num_features/num_n-grams, rounded down to the next smaller natural number. The remainder of this division is added to the filters of the first n-gram. Thus, the resulting tensor has the same shape as the input tensor.
Sub-sequences that are masked in the input are also masked in the output.
The output of this layer can be understood as the results of the n-gram filters. Stacking this layer allows the model to perform n-gram detection of n-grams (meta perspective). All parameters with the prefix ng_conv_ can be used to configure this layer.
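The filter allocation described above can be checked with a few lines of Python. This sketch assumes one n-gram per integer window size from ng_conv_ks_min to ng_conv_ks_max and assigns the remainder to the smallest window size. The function name is hypothetical and not part of the package.

```python
def ngram_filter_counts(num_features, ks_min, ks_max):
    """Split `num_features` filters across the n-gram window sizes."""
    if ks_min > ks_max:
        raise ValueError("ks_min must not exceed ks_max")
    sizes = list(range(ks_min, ks_max + 1))
    base = num_features // len(sizes)            # rounded down
    remainder = num_features - base * len(sizes)
    counts = {size: base for size in sizes}
    counts[sizes[0]] += remainder                # first n-gram takes the rest
    return counts


# Defaults from the signature: 50 features, window sizes 2 to 4.
counts = ngram_filter_counts(50, 2, 4)
```

With these values the three window sizes receive 18, 16, and 16 filters, which sum to the 50 input features, so the concatenated output keeps the input shape.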
Recurrent Layers
Description
A regular recurrent layer, either as a Gated Recurrent Unit (GRU) or a Long Short-Term Memory (LSTM) layer. It uses PyTorch's implementation. All parameters with the prefix rec_ can be used to configure this layer.
Merge Layer
Description
Layer for combining the output of different layers. All inputs must be sequential data of shape (Batch, Times, Features). First, pooling over time is applied, extracting the minimal and/or maximal features. Second, the pooled tensors are combined by calculating their weighted sum. Different attention mechanisms can be used to dynamically calculate the corresponding weights. This allows the model to decide which part of the data is most useful. Finally, pooling over features is applied, extracting a specific number of maximal and/or minimal features. A normalization of all inputs at the beginning of the layer is possible. All parameters with the prefix merge_ can be used to configure this layer.
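The three steps can be traced on a tiny example in plain Python. This is only a sketch under stated assumptions: fixed scores replace the attention mechanism, the final feature pooling keeps only the largest values, and all function names are hypothetical.

```python
import math


def pool_over_time(seq, pooling_type="MinMax"):
    """Pool a (Times, Features) sequence to one vector of extreme values."""
    columns = list(zip(*seq))
    pooled = []
    if pooling_type in ("Max", "MinMax"):
        pooled += [max(column) for column in columns]
    if pooling_type in ("Min", "MinMax"):
        pooled += [min(column) for column in columns]
    return pooled


def softmax(scores):
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def weighted_sum(vectors, scores):
    """Combine the pooled stream vectors with softmax weights."""
    weights = softmax(scores)
    return [
        sum(weight * vector[j] for weight, vector in zip(weights, vectors))
        for j in range(len(vectors[0]))
    ]


def pool_over_features(vector, n_features):
    """Keep the `n_features` largest values (max variant only)."""
    return sorted(vector, reverse=True)[:n_features]


stream_a = [[1.0, 5.0], [3.0, 2.0]]
stream_b = [[0.0, 4.0], [2.0, 6.0]]
pooled = [pool_over_time(stream_a), pool_over_time(stream_b)]
combined = weighted_sum(pooled, scores=[0.0, 0.0])  # equal weights
final = pool_over_features(combined, n_features=2)
```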
Training and Prediction
For the creation and training of a classifier an object of class EmbeddedText or LargeDataSetForTextEmbeddings on the one hand and a factor on the other hand are necessary.
The object of class EmbeddedText or LargeDataSetForTextEmbeddings contains the numerical text representations (text embeddings) of the raw texts generated by an object of class TextEmbeddingModel. For supporting large data sets it is recommended to use LargeDataSetForTextEmbeddings instead of EmbeddedText.
The factor contains the classes/categories for every text. Missing values (unlabeled cases) are supported and can
be used for pseudo labeling.
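The general idea of pseudo labeling can be sketched in Python as follows. The procedure the package actually applies is not described in this section, so the threshold, the data layout, and the function name are assumptions made for illustration.

```python
def select_pseudo_labels(labels, predicted_probs, min_certainty=0.8):
    """Assign predicted classes to unlabeled cases with confident predictions.

    `labels` maps case ids to a class or None (unlabeled case).
    `predicted_probs` maps case ids to {class: probability}.
    """
    updated = dict(labels)
    for case_id, probs in predicted_probs.items():
        if labels.get(case_id) is not None:
            continue  # never overwrite a label assigned by a human coder
        best = max(probs, key=probs.get)
        if probs[best] >= min_certainty:
            updated[case_id] = best
    return updated


labels = {"a": "pos", "b": None, "c": None}
predicted_probs = {
    "b": {"pos": 0.90, "neg": 0.10},
    "c": {"pos": 0.55, "neg": 0.45},
}
new_labels = select_pseudo_labels(labels, predicted_probs)
```

Case "b" receives a pseudo label because the prediction is confident, while case "c" stays unlabeled.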
For predictions an object of class EmbeddedText or LargeDataSetForTextEmbeddings has to be used which was created with the same TextEmbeddingModel as for training.
Returns a new object of this class ready for configuration or for loading a saved classifier.
aifeducation::AIFEMaster -> aifeducation::AIFEBaseModel -> aifeducation::ModelsBasedOnTextEmbeddings -> aifeducation::ClassifiersBasedOnTextEmbeddings -> aifeducation::TEClassifiersBasedOnProtoNet -> TEClassifierParallelPrototype
aifeducation::AIFEMaster$get_all_fields()
aifeducation::AIFEMaster$get_documentation_license()
aifeducation::AIFEMaster$get_ml_framework()
aifeducation::AIFEMaster$get_model_config()
aifeducation::AIFEMaster$get_model_description()
aifeducation::AIFEMaster$get_model_info()
aifeducation::AIFEMaster$get_model_license()
aifeducation::AIFEMaster$get_package_versions()
aifeducation::AIFEMaster$get_private()
aifeducation::AIFEMaster$get_publication_info()
aifeducation::AIFEMaster$get_sustainability_data()
aifeducation::AIFEMaster$is_configured()
aifeducation::AIFEMaster$is_trained()
aifeducation::AIFEMaster$set_documentation_license()
aifeducation::AIFEMaster$set_model_description()
aifeducation::AIFEMaster$set_model_license()
aifeducation::AIFEMaster$set_publication_info()
aifeducation::AIFEBaseModel$count_parameter()
aifeducation::ModelsBasedOnTextEmbeddings$get_text_embedding_model()
aifeducation::ModelsBasedOnTextEmbeddings$get_text_embedding_model_name()
aifeducation::ClassifiersBasedOnTextEmbeddings$adjust_target_levels()
aifeducation::ClassifiersBasedOnTextEmbeddings$check_embedding_model()
aifeducation::ClassifiersBasedOnTextEmbeddings$check_feature_extractor_object_type()
aifeducation::ClassifiersBasedOnTextEmbeddings$load_from_disk()
aifeducation::ClassifiersBasedOnTextEmbeddings$plot_coding_stream()
aifeducation::ClassifiersBasedOnTextEmbeddings$plot_training_history()
aifeducation::ClassifiersBasedOnTextEmbeddings$predict()
aifeducation::ClassifiersBasedOnTextEmbeddings$requires_compression()
aifeducation::ClassifiersBasedOnTextEmbeddings$save()
aifeducation::TEClassifiersBasedOnProtoNet$embed()
aifeducation::TEClassifiersBasedOnProtoNet$get_metric_scale_factor()
aifeducation::TEClassifiersBasedOnProtoNet$plot_embeddings()
aifeducation::TEClassifiersBasedOnProtoNet$predict_with_samples()
aifeducation::TEClassifiersBasedOnProtoNet$train()
configure()
Creating a new instance of this class.
TEClassifierParallelPrototype$configure( name = NULL, label = NULL, text_embeddings = NULL, feature_extractor = NULL, target_levels = NULL, metric_type = "Euclidean", shared_feat_layer = TRUE, projection_type = "Regular", feat_act_fct = "ELU", feat_size = 50L, feat_bias = TRUE, feat_dropout = 0, feat_parametrizations = "None", feat_normalization_type = "LayerNorm", ng_conv_act_fct = "ELU", ng_conv_n_layers = 1L, ng_conv_ks_min = 2L, ng_conv_ks_max = 4L, ng_conv_bias = FALSE, ng_conv_dropout = 0.1, ng_conv_parametrizations = "None", ng_conv_normalization_type = "LayerNorm", ng_conv_residual_type = "ResidualGate", dense_act_fct = "ELU", dense_n_layers = 1L, dense_dropout = 0.5, dense_bias = FALSE, dense_parametrizations = "None", dense_normalization_type = "LayerNorm", dense_residual_type = "ResidualGate", rec_act_fct = "Tanh", rec_n_layers = 1L, rec_type = "GRU", rec_bidirectional = FALSE, rec_dropout = 0.2, rec_bias = FALSE, rec_parametrizations = "None", rec_normalization_type = "LayerNorm", rec_residual_type = "ResidualGate", tf_act_fct = "ELU", tf_dense_dim = 50L, tf_n_layers = 1L, tf_dropout_rate_1 = 0.1, tf_dropout_rate_2 = 0.5, tf_attention_type = "MultiHead", tf_positional_type = "absolute", tf_num_heads = 1L, tf_bias = FALSE, tf_parametrizations = "None", tf_normalization_type = "LayerNorm", tf_normalization_position = "Pre", tf_residual_type = "ResidualGate", merge_attention_type = "multi_head", merge_num_heads = 1L, merge_normalization_type = "LayerNorm", merge_pooling_features = 50L, merge_pooling_type = "MinMax", embedding_dim = 2L )
name (string): Name of the new model. Please follow common naming conventions. Free text can be used with the parameter label. If set to NULL, a unique ID is generated automatically. Allowed values: any
label (string): Label for the new model. Free text can be used here. Allowed values: any
text_embeddings (EmbeddedText, LargeDataSetForTextEmbeddings): Object of class EmbeddedText or LargeDataSetForTextEmbeddings.
feature_extractor (TEFeatureExtractor): Object of class TEFeatureExtractor that should be used to reduce the number of dimensions of the text embeddings. Set to NULL if no feature extractor should be applied.
target_levels (vector): Vector containing the levels (categories or classes) within the target data. Please note that order matters. For ordinal data, ensure that the levels are sorted correctly, with later levels indicating a higher category/class. For nominal data the order does not matter.
metric_type (string): Type of metric used for calculating the distance. Allowed values: 'Euclidean', 'CosineDistance'
shared_feat_layer (bool): If TRUE all streams use the same feature layer. If FALSE each stream uses its own feature layer.
projection_type (string): Type of projection. Allowed values: 'Regular', 'PairwiseOrthogonal', 'PairwiseOrthogonalDense'
feat_act_fct (string): Activation function for all layers. Allowed values: 'ELU', 'LeakyReLU', 'ReLU', 'GELU', 'Sigmoid', 'Tanh', 'PReLU'
feat_size (int): Number of neurons for each dense layer. Allowed values:
feat_bias (bool): If TRUE a bias term is added to all layers. If FALSE no bias term is added to the layers.
feat_dropout (double): Dropout for the dense projection of the feature layer. Allowed values:
feat_parametrizations (string): Re-parametrizations of the weights of layers. Allowed values: 'None', 'OrthogonalWeights', 'WeightNorm', 'SpectralNorm'
feat_normalization_type (string): Type of normalization applied to all layers and stacks of layers. Allowed values: 'LayerNorm', 'BatchNorm', 'PowerNorm', 'RMSNorm', 'None'
ng_conv_act_fct (string): Activation function for all layers. Allowed values: 'ELU', 'LeakyReLU', 'ReLU', 'GELU', 'Sigmoid', 'Tanh', 'PReLU'
ng_conv_n_layers (int): Number of times the n-gram layers are added to the network. Allowed values:
ng_conv_ks_min (int): Minimal window size for n-grams. Allowed values:
ng_conv_ks_max (int): Maximal window size for n-grams. Allowed values:
ng_conv_bias (bool): If TRUE a bias term is added to all layers. If FALSE no bias term is added to the layers.
ng_conv_dropout (double): Dropout for the n-gram convolution layers. Allowed values:
ng_conv_parametrizations (string): Re-parametrizations of the weights of layers. Allowed values: 'None', 'OrthogonalWeights', 'WeightNorm', 'SpectralNorm'
ng_conv_normalization_type (string): Type of normalization applied to all layers and stacks of layers. Allowed values: 'LayerNorm', 'BatchNorm', 'PowerNorm', 'RMSNorm', 'None'
ng_conv_residual_type (string): Type of residual connection for all layers and stacks of layers. Allowed values: 'ResidualGate', 'Addition', 'None'
dense_act_fct (string): Activation function for all layers. Allowed values: 'ELU', 'LeakyReLU', 'ReLU', 'GELU', 'Sigmoid', 'Tanh', 'PReLU'
dense_n_layers (int): Number of dense layers. Allowed values:
dense_dropout (double): Dropout between dense layers. Allowed values:
dense_bias (bool): If TRUE a bias term is added to all layers. If FALSE no bias term is added to the layers.
dense_parametrizations (string): Re-parametrizations of the weights of layers. Allowed values: 'None', 'OrthogonalWeights', 'WeightNorm', 'SpectralNorm'
dense_normalization_type (string): Type of normalization applied to all layers and stacks of layers. Allowed values: 'LayerNorm', 'BatchNorm', 'PowerNorm', 'RMSNorm', 'None'
dense_residual_type (string): Type of residual connection for all layers and stacks of layers. Allowed values: 'ResidualGate', 'Addition', 'None'
rec_act_fct (string): Activation function for all layers. Allowed values: 'Tanh'
rec_n_layers (int): Number of recurrent layers. Allowed values:
rec_type (string): Type of the recurrent layers. rec_type='GRU' for Gated Recurrent Unit and rec_type='LSTM' for Long Short-Term Memory. Allowed values: 'GRU', 'LSTM'
rec_bidirectional (bool): If TRUE a bidirectional version of the recurrent layers is used.
rec_dropout (double): Dropout between recurrent layers. Allowed values:
rec_bias (bool): If TRUE a bias term is added to all layers. If FALSE no bias term is added to the layers.
rec_parametrizations (string): Re-parametrizations of the weights of layers. Allowed values: 'None'
rec_normalization_type (string): Type of normalization applied to all layers and stacks of layers. Allowed values: 'LayerNorm', 'BatchNorm', 'PowerNorm', 'RMSNorm', 'None'
rec_residual_type (string): Type of residual connection for all layers and stacks of layers. Allowed values: 'ResidualGate', 'Addition', 'None'
tf_act_fct (string): Activation function for all layers. Allowed values: 'ELU', 'LeakyReLU', 'ReLU', 'GELU', 'Sigmoid', 'Tanh', 'PReLU'
tf_dense_dim (int): Size of the projection layer within each transformer encoder. Allowed values:
tf_n_layers (int): Number of times the encoder is added to the network. Allowed values:
tf_dropout_rate_1 (double): Dropout after the attention mechanism within the transformer encoder layers. Allowed values:
tf_dropout_rate_2 (double): Dropout for the dense projection within the transformer encoder layers. Allowed values:
tf_attention_type (string): Attention type. Allowed values: 'Fourier', 'MultiHead'
tf_positional_type (string): Type of processing positional information. Allowed values: 'None', 'absolute'
tf_num_heads (int): Number of attention heads for a self-attention layer. Only relevant if tf_attention_type='MultiHead'. Allowed values:
tf_bias (bool): If TRUE a bias term is added to all layers. If FALSE no bias term is added to the layers.
tf_parametrizations (string): Re-parametrizations of the weights of layers. Allowed values: 'None', 'OrthogonalWeights', 'WeightNorm', 'SpectralNorm'
tf_normalization_type (string): Type of normalization applied to all layers and stacks of layers. Allowed values: 'LayerNorm', 'BatchNorm', 'PowerNorm', 'RMSNorm', 'None'
tf_normalization_position (string): Position where the normalization should be applied. Allowed values: 'Pre', 'Post'
tf_residual_type (string): Type of residual connection for all layers and stacks of layers. Allowed values: 'ResidualGate', 'Addition', 'None'
merge_attention_type (string): Attention type. Allowed values: 'Fourier', 'MultiHead'
merge_num_heads (int): Number of attention heads for a self-attention layer. Only relevant if multi-head attention is selected with merge_attention_type. Allowed values:
merge_normalization_type (string): Type of normalization applied to all layers and stacks of layers. Allowed values: 'LayerNorm', 'BatchNorm', 'PowerNorm', 'RMSNorm', 'None'
merge_pooling_features (int): Number of features to be extracted at the end of the model. Allowed values:
merge_pooling_type (string): Type of extracting intermediate features. Allowed values: 'Max', 'Min', 'MinMax'
embedding_dim (int): Number of dimensions for the embedding. Allowed values:
The function returns nothing. It modifies the current object.
clone()
The objects of this class are cloneable with this method.
TEClassifierParallelPrototype$clone(deep = FALSE)
deep: Whether to make a deep clone.
Oreshkin, B. N., Rodriguez, P. & Lacoste, A. (2018). TADAM: Task dependent adaptive metric for improved few-shot learning. https://doi.org/10.48550/arXiv.1805.10123
Snell, J., Swersky, K. & Zemel, R. S. (2017). Prototypical Networks for Few-shot Learning. https://doi.org/10.48550/arXiv.1703.05175
Zhang, X., Nie, J., Zong, L., Yu, H. & Liang, W. (2019). One Shot Learning with Margin. In Q. Yang, Z.-H. Zhou, Z. Gong, M.-L. Zhang & S.-J. Huang (Eds.), Lecture Notes in Computer Science. Advances in Knowledge Discovery and Data Mining (Vol. 11440, pp. 305–317). Springer International Publishing. https://doi.org/10.1007/978-3-030-16145-3_24
Other Classification:
TEClassifierParallel,
TEClassifierProtoNet,
TEClassifierRegular,
TEClassifierSequential,
TEClassifierSequentialPrototype
Abstract class for neural nets with 'pytorch'.
This class is deprecated. Please use an object of class TEClassifierSequentialPrototype instead.
This object represents an implementation of a prototypical network for few-shot learning as described by Snell, Swersky, and Zemel (2017). The network uses a multi-way contrastive loss described by Zhang et al. (2019). The network learns to scale the metric as described by Oreshkin, Rodriguez, and Lacoste (2018).
Objects of this class are used for assigning texts to classes/categories. For the creation and training of a
classifier an object of class EmbeddedText or LargeDataSetForTextEmbeddings and a factor are necessary. The
object of class EmbeddedText or LargeDataSetForTextEmbeddings contains the numerical text representations (text
embeddings) of the raw texts generated by an object of class TextEmbeddingModel. The factor contains the
classes/categories for every text. Missing values (unlabeled cases) are supported. For predictions an object of
class EmbeddedText or LargeDataSetForTextEmbeddings has to be used which was created with the same
TextEmbeddingModel as for training.
aifeducation::AIFEMaster -> aifeducation::AIFEBaseModel -> aifeducation::ModelsBasedOnTextEmbeddings -> aifeducation::ClassifiersBasedOnTextEmbeddings -> aifeducation::TEClassifiersBasedOnProtoNet -> TEClassifierProtoNet
aifeducation::AIFEMaster$get_all_fields()
aifeducation::AIFEMaster$get_documentation_license()
aifeducation::AIFEMaster$get_ml_framework()
aifeducation::AIFEMaster$get_model_config()
aifeducation::AIFEMaster$get_model_description()
aifeducation::AIFEMaster$get_model_info()
aifeducation::AIFEMaster$get_model_license()
aifeducation::AIFEMaster$get_package_versions()
aifeducation::AIFEMaster$get_private()
aifeducation::AIFEMaster$get_publication_info()
aifeducation::AIFEMaster$get_sustainability_data()
aifeducation::AIFEMaster$is_configured()
aifeducation::AIFEMaster$is_trained()
aifeducation::AIFEMaster$set_documentation_license()
aifeducation::AIFEMaster$set_model_description()
aifeducation::AIFEMaster$set_model_license()
aifeducation::AIFEMaster$set_publication_info()
aifeducation::AIFEBaseModel$count_parameter()
aifeducation::ModelsBasedOnTextEmbeddings$get_text_embedding_model()
aifeducation::ModelsBasedOnTextEmbeddings$get_text_embedding_model_name()
aifeducation::ClassifiersBasedOnTextEmbeddings$adjust_target_levels()
aifeducation::ClassifiersBasedOnTextEmbeddings$check_embedding_model()
aifeducation::ClassifiersBasedOnTextEmbeddings$check_feature_extractor_object_type()
aifeducation::ClassifiersBasedOnTextEmbeddings$load_from_disk()
aifeducation::ClassifiersBasedOnTextEmbeddings$plot_coding_stream()
aifeducation::ClassifiersBasedOnTextEmbeddings$plot_training_history()
aifeducation::ClassifiersBasedOnTextEmbeddings$predict()
aifeducation::ClassifiersBasedOnTextEmbeddings$requires_compression()
aifeducation::ClassifiersBasedOnTextEmbeddings$save()
aifeducation::TEClassifiersBasedOnProtoNet$get_metric_scale_factor()
aifeducation::TEClassifiersBasedOnProtoNet$predict_with_samples()
aifeducation::TEClassifiersBasedOnProtoNet$train()
new()
Creating a new instance of this class.
TEClassifierProtoNet$new()
Returns an object of class TEClassifierProtoNet which is ready for configuration.
configure()
Creating a new instance of this class.
TEClassifierProtoNet$configure( name = NULL, label = NULL, text_embeddings = NULL, feature_extractor = NULL, target_levels = NULL, dense_size = 4L, dense_layers = 0L, rec_size = 4L, rec_layers = 2L, rec_type = "GRU", rec_bidirectional = FALSE, embedding_dim = 2L, self_attention_heads = 0L, intermediate_size = NULL, attention_type = "Fourier", add_pos_embedding = TRUE, act_fct = "ELU", parametrizations = "None", rec_dropout = 0.1, repeat_encoder = 1L, dense_dropout = 0.4, encoder_dropout = 0.1 )
name (string): Name of the new model. Please follow common naming conventions. Free text can be used with the parameter label. If set to NULL, a unique ID is generated automatically. Allowed values: any
label (string): Label for the new model. Free text can be used here. Allowed values: any
text_embeddings (EmbeddedText, LargeDataSetForTextEmbeddings): Object of class EmbeddedText or LargeDataSetForTextEmbeddings.
feature_extractor (TEFeatureExtractor): Object of class TEFeatureExtractor that should be used to reduce the number of dimensions of the text embeddings. Set to NULL if no feature extractor should be applied.
target_levels (vector): Vector containing the levels (categories or classes) within the target data. Please note that order matters. For ordinal data, ensure that the levels are sorted correctly, with later levels indicating a higher category/class. For nominal data the order does not matter.
dense_size (int): Number of neurons for each dense layer. Allowed values:
dense_layers (int): Number of dense layers. Allowed values:
rec_size (int): Number of neurons for each recurrent layer. Allowed values:
rec_layers (int): Number of recurrent layers. Allowed values:
rec_type (string): Type of the recurrent layers. rec_type='GRU' for Gated Recurrent Unit and rec_type='LSTM' for Long Short-Term Memory. Allowed values: 'GRU', 'LSTM'
rec_bidirectional (bool): If TRUE a bidirectional version of the recurrent layers is used.
embedding_dim (int): Number of dimensions for the embedding. Allowed values:
self_attention_heads (int): Number of attention heads for a self-attention layer. Only relevant if attention_type='MultiHead'. Allowed values:
intermediate_size (int): Size of the projection layer within each transformer encoder. Allowed values:
attention_type (string): Attention type. Allowed values: 'Fourier', 'MultiHead'
add_pos_embedding (bool): TRUE if positional embedding should be used.
act_fct (string): Activation function for all layers. Allowed values: 'ELU', 'LeakyReLU', 'ReLU', 'GELU', 'Sigmoid', 'Tanh', 'PReLU'
parametrizations (string): Re-parametrizations of the weights of layers. Allowed values: 'None', 'OrthogonalWeights', 'WeightNorm', 'SpectralNorm'
rec_dropout (double): Dropout between recurrent layers. Allowed values:
repeat_encoder (int): Number of times the encoder is added to the network. Allowed values:
dense_dropout (double): Dropout between dense layers. Allowed values:
encoder_dropout (double): Dropout for the dense projection within the transformer encoder layers. Allowed values:
bias (bool): If TRUE a bias term is added to all layers. If FALSE no bias term is added to the layers.
embed()
Method for embedding documents. Please do not confuse this type of embeddings with the embeddings of texts created by an object of class TextEmbeddingModel. These embeddings embed documents according to their similarity to specific classes.
TEClassifierProtoNet$embed(embeddings_q = NULL, batch_size = 32L)
embeddings_q: Object of class EmbeddedText or LargeDataSetForTextEmbeddings containing the text embeddings for all cases which should be embedded into the classification space.
batch_size (int): Batch size.
Returns a list containing the following elements:
embeddings_q: embeddings for the cases (query sample).
embeddings_prototypes: embeddings of the prototypes which were learned during training. They represent the centers of the different classes.
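To show how the two returned elements relate, the following Python sketch assigns every embedded case to its nearest prototype. The dictionary layout, the case ids, and the coordinates are hypothetical stand-ins for the returned list, and plain Euclidean distance is assumed.

```python
import math


def nearest_prototype(embeddings_q, embeddings_prototypes):
    """Map every case id to the label of the closest prototype."""
    assignments = {}
    for case_id, point in embeddings_q.items():
        assignments[case_id] = min(
            embeddings_prototypes,
            key=lambda label: math.dist(point, embeddings_prototypes[label]),
        )
    return assignments


# Hypothetical result with embedding_dim = 2.
result = {
    "embeddings_q": {"case_1": [0.1, 0.2], "case_2": [1.9, 2.2]},
    "embeddings_prototypes": {"low": [0.0, 0.0], "high": [2.0, 2.0]},
}
assignments = nearest_prototype(
    result["embeddings_q"], result["embeddings_prototypes"]
)
```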
plot_embeddings()
Method for creating a plot to visualize embeddings and their corresponding centers (prototypes).
TEClassifierProtoNet$plot_embeddings( embeddings_q, classes_q = NULL, batch_size = 12L, alpha = 0.5, size_points = 3L, size_points_prototypes = 8L, inc_unlabeled = TRUE )
embeddings_q: Object of class EmbeddedText or LargeDataSetForTextEmbeddings containing the text embeddings for all cases which should be embedded into the classification space.
classes_q: Named factor containing the true classes for every case. Please note that the names must match the names/ids in embeddings_q.
batch_size (int): Batch size.
alpha (float): Value indicating how transparent the points should be (important if many points overlap). Does not apply to points representing prototypes.
size_points (int): Size of the points excluding the points for prototypes.
size_points_prototypes (int): Size of points representing prototypes.
inc_unlabeled (bool): If TRUE the plot includes unlabeled cases as data points.
Returns a plot of class ggplot visualizing the embeddings.
clone()
The objects of this class are cloneable with this method.
TEClassifierProtoNet$clone(deep = FALSE)
deep: Whether to make a deep clone.
This model requires pad_value=0. If this condition is not met the
padding value is switched automatically.
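What switching the padding value could look like is sketched below in Python. The sketch assumes that a padded time step carries the old padding value in every feature. The function name and the example value -100 are hypothetical, and the package's internal handling may differ.

```python
def switch_pad_value(seq, old_pad, new_pad=0.0):
    """Replace fully padded time steps of a (Times, Features) sequence."""
    switched = []
    for step in seq:
        if all(value == old_pad for value in step):
            # The whole time step is padding: rewrite it with the new value.
            switched.append([new_pad] * len(step))
        else:
            switched.append(list(step))
    return switched


padded = [[1.0, 2.0], [-100.0, 3.0], [-100.0, -100.0]]
zero_padded = switch_pad_value(padded, old_pad=-100.0)
```

Only the last time step is rewritten, because the second step contains real data next to a value that merely coincides with the old padding value.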
Oreshkin, B. N., Rodriguez, P. & Lacoste, A. (2018). TADAM: Task dependent adaptive metric for improved few-shot learning. https://doi.org/10.48550/arXiv.1805.10123
Snell, J., Swersky, K. & Zemel, R. S. (2017). Prototypical Networks for Few-shot Learning. https://doi.org/10.48550/arXiv.1703.05175
Zhang, X., Nie, J., Zong, L., Yu, H. & Liang, W. (2019). One Shot Learning with Margin. In Q. Yang, Z.-H. Zhou, Z. Gong, M.-L. Zhang & S.-J. Huang (Eds.), Lecture Notes in Computer Science. Advances in Knowledge Discovery and Data Mining (Vol. 11440, pp. 305–317). Springer International Publishing. https://doi.org/10.1007/978-3-030-16145-3_24
Other Classification:
TEClassifierParallel,
TEClassifierParallelPrototype,
TEClassifierRegular,
TEClassifierSequential,
TEClassifierSequentialPrototype
Abstract class for neural nets with 'pytorch'.
This class is deprecated. Please use an object of class TEClassifierSequential instead.
Objects of this class are used for assigning texts to classes/categories. For the creation and training of a classifier an object of class EmbeddedText or LargeDataSetForTextEmbeddings on the one hand and a factor on the other hand are necessary.
The object of class EmbeddedText or LargeDataSetForTextEmbeddings contains the numerical text representations (text embeddings) of the raw texts generated by an object of class TextEmbeddingModel. For supporting large data sets it is recommended to use LargeDataSetForTextEmbeddings instead of EmbeddedText.
The factor contains the classes/categories for every text. Missing values (unlabeled cases) are supported and can
be used for pseudo labeling.
For predictions an object of class EmbeddedText or LargeDataSetForTextEmbeddings has to be used which was created with the same TextEmbeddingModel as for training.
aifeducation::AIFEMaster -> aifeducation::AIFEBaseModel -> aifeducation::ModelsBasedOnTextEmbeddings -> aifeducation::ClassifiersBasedOnTextEmbeddings -> aifeducation::TEClassifiersBasedOnRegular -> TEClassifierRegular
aifeducation::AIFEMaster$get_all_fields(),
aifeducation::AIFEMaster$get_documentation_license(),
aifeducation::AIFEMaster$get_ml_framework(),
aifeducation::AIFEMaster$get_model_config(),
aifeducation::AIFEMaster$get_model_description(),
aifeducation::AIFEMaster$get_model_info(),
aifeducation::AIFEMaster$get_model_license(),
aifeducation::AIFEMaster$get_package_versions(),
aifeducation::AIFEMaster$get_private(),
aifeducation::AIFEMaster$get_publication_info(),
aifeducation::AIFEMaster$get_sustainability_data(),
aifeducation::AIFEMaster$is_configured(),
aifeducation::AIFEMaster$is_trained(),
aifeducation::AIFEMaster$set_documentation_license(),
aifeducation::AIFEMaster$set_model_description(),
aifeducation::AIFEMaster$set_model_license(),
aifeducation::AIFEMaster$set_publication_info(),
aifeducation::AIFEBaseModel$count_parameter(),
aifeducation::ModelsBasedOnTextEmbeddings$get_text_embedding_model(),
aifeducation::ModelsBasedOnTextEmbeddings$get_text_embedding_model_name(),
aifeducation::ClassifiersBasedOnTextEmbeddings$adjust_target_levels(),
aifeducation::ClassifiersBasedOnTextEmbeddings$check_embedding_model(),
aifeducation::ClassifiersBasedOnTextEmbeddings$check_feature_extractor_object_type(),
aifeducation::ClassifiersBasedOnTextEmbeddings$load_from_disk(),
aifeducation::ClassifiersBasedOnTextEmbeddings$plot_coding_stream(),
aifeducation::ClassifiersBasedOnTextEmbeddings$plot_training_history(),
aifeducation::ClassifiersBasedOnTextEmbeddings$predict(),
aifeducation::ClassifiersBasedOnTextEmbeddings$requires_compression(),
aifeducation::ClassifiersBasedOnTextEmbeddings$save(),
aifeducation::TEClassifiersBasedOnRegular$train()
new()
Creating a new instance of this class.
TEClassifierRegular$new()
Returns an object of class TEClassifierRegular which is ready for configuration.
configure()
Configuring a new instance of this class.
TEClassifierRegular$configure( name = NULL, label = NULL, text_embeddings = NULL, feature_extractor = NULL, target_levels = NULL, bias = TRUE, dense_size = 4L, dense_layers = 0L, rec_size = 4L, rec_layers = 2L, rec_type = "GRU", rec_bidirectional = FALSE, self_attention_heads = 0L, intermediate_size = NULL, attention_type = "Fourier", add_pos_embedding = TRUE, act_fct = "ELU", parametrizations = "None", rec_dropout = 0.1, repeat_encoder = 1L, dense_dropout = 0.4, encoder_dropout = 0.1 )
name: string Name of the new model. Please refer to common name conventions.
Free text can be used with parameter label. If set to NULL a unique ID
is generated automatically. Allowed values: any
label: string Label for the new model. Here you can use free text. Allowed values: any
text_embeddings: EmbeddedText, LargeDataSetForTextEmbeddings. Object of class EmbeddedText or LargeDataSetForTextEmbeddings.
feature_extractor: TEFeatureExtractor. Object of class TEFeatureExtractor which should be used in order to reduce
the number of dimensions of the text embeddings. If no feature extractor should be applied set NULL.
target_levels: vector containing the levels (categories or classes) within the target data. Please
note that order matters. For ordinal data please ensure that the levels are sorted correctly with later levels
indicating a higher category/class. For nominal data the order does not matter.
bias: bool If TRUE a bias term is added to all layers. If FALSE no bias term is added to the layers.
dense_size: int Number of neurons for each dense layer. Allowed values:
dense_layers: int Number of dense layers. Allowed values:
rec_size: int Number of neurons for each recurrent layer. Allowed values:
rec_layers: int Number of recurrent layers. Allowed values:
rec_type: string Type of the recurrent layers. rec_type='GRU' for Gated Recurrent Unit and rec_type='LSTM' for Long Short-Term Memory. Allowed values: 'GRU', 'LSTM'
rec_bidirectional: bool If TRUE a bidirectional version of the recurrent layers is used.
self_attention_heads: int determining the number of attention heads for a self-attention layer. Only relevant if attention_type='MultiHead'. Allowed values:
intermediate_size: int determining the size of the projection layer within each transformer encoder. Allowed values:
attention_type: string Choose the attention type. Allowed values: 'Fourier', 'MultiHead'
add_pos_embedding: bool TRUE if positional embedding should be used.
act_fct: string Activation function for all layers. Allowed values: 'ELU', 'LeakyReLU', 'ReLU', 'GELU', 'Sigmoid', 'Tanh', 'PReLU'
parametrizations: string Re-Parametrizations of the weights of layers. Allowed values: 'None', 'OrthogonalWeights', 'WeightNorm', 'SpectralNorm'
rec_dropout: double determining the dropout between recurrent layers. Allowed values:
repeat_encoder: int determining how many times the encoder should be added to the network. Allowed values:
dense_dropout: double determining the dropout between dense layers. Allowed values:
encoder_dropout: double determining the dropout for the dense projection within the transformer encoder layers. Allowed values:
Returns an object of class TEClassifierRegular which is ready for training.
clone()
The objects of this class are cloneable with this method.
TEClassifierRegular$clone(deep = FALSE)
deep: Whether to make a deep clone.
This model requires pad_value=0. If this condition is not met the
padding value is switched automatically.
Other Classification:
TEClassifierParallel,
TEClassifierParallelPrototype,
TEClassifierProtoNet,
TEClassifierSequential,
TEClassifierSequentialPrototype
Base class for classifiers relying on EmbeddedText or LargeDataSetForTextEmbeddings as input which use the architecture of Protonets and its corresponding training techniques.
Objects of this class contain fields and methods used in several other classes in 'AI for Education'.
This class is not designed for a direct application and should only be used by developers.
A new object of this class.
aifeducation::AIFEMaster -> aifeducation::AIFEBaseModel -> aifeducation::ModelsBasedOnTextEmbeddings -> aifeducation::ClassifiersBasedOnTextEmbeddings -> TEClassifiersBasedOnProtoNet
aifeducation::AIFEMaster$get_all_fields(),
aifeducation::AIFEMaster$get_documentation_license(),
aifeducation::AIFEMaster$get_ml_framework(),
aifeducation::AIFEMaster$get_model_config(),
aifeducation::AIFEMaster$get_model_description(),
aifeducation::AIFEMaster$get_model_info(),
aifeducation::AIFEMaster$get_model_license(),
aifeducation::AIFEMaster$get_package_versions(),
aifeducation::AIFEMaster$get_private(),
aifeducation::AIFEMaster$get_publication_info(),
aifeducation::AIFEMaster$get_sustainability_data(),
aifeducation::AIFEMaster$is_configured(),
aifeducation::AIFEMaster$is_trained(),
aifeducation::AIFEMaster$set_documentation_license(),
aifeducation::AIFEMaster$set_model_description(),
aifeducation::AIFEMaster$set_model_license(),
aifeducation::AIFEMaster$set_publication_info(),
aifeducation::AIFEBaseModel$count_parameter(),
aifeducation::ModelsBasedOnTextEmbeddings$get_text_embedding_model(),
aifeducation::ModelsBasedOnTextEmbeddings$get_text_embedding_model_name(),
aifeducation::ClassifiersBasedOnTextEmbeddings$adjust_target_levels(),
aifeducation::ClassifiersBasedOnTextEmbeddings$check_embedding_model(),
aifeducation::ClassifiersBasedOnTextEmbeddings$check_feature_extractor_object_type(),
aifeducation::ClassifiersBasedOnTextEmbeddings$load_from_disk(),
aifeducation::ClassifiersBasedOnTextEmbeddings$plot_coding_stream(),
aifeducation::ClassifiersBasedOnTextEmbeddings$plot_training_history(),
aifeducation::ClassifiersBasedOnTextEmbeddings$predict(),
aifeducation::ClassifiersBasedOnTextEmbeddings$requires_compression(),
aifeducation::ClassifiersBasedOnTextEmbeddings$save()
train()
Method for training a neural net.
Training includes a routine for early stopping. In the case that loss<0.0001 and Accuracy=1.00 and Average Iota=1.00 training stops. The history uses the values of the last trained epoch for the remaining epochs.
After training the model with the best values for Average Iota, Accuracy, and Loss on the validation data set is used as the final model.
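The stopping rule described above can be sketched in base R. This is only an illustration: the helper should_stop(), the data frame history and all numbers are hypothetical and not part of the package.

```r
# Hypothetical helper: TRUE if the stopping rule described above is met.
should_stop <- function(loss, accuracy, avg_iota, tol = 1e-4) {
  loss < tol && accuracy >= 1 && avg_iota >= 1
}

# Made-up validation history for five epochs (illustrative values only).
history <- data.frame(
  epoch    = 1:5,
  loss     = c(0.90, 0.20, 0.01, 0.00005, 0.00004),
  accuracy = c(0.60, 0.85, 0.97, 1.00, 1.00),
  avg_iota = c(0.40, 0.70, 0.95, 1.00, 1.00)
)

# Evaluate the rule for every epoch and find the first epoch that meets it.
stop_flags <- mapply(should_stop, history$loss, history$accuracy, history$avg_iota)
first_stop <- which(stop_flags)[1]

# Epochs after the stop reuse the values of the last trained epoch.
metric_cols <- c("loss", "accuracy", "avg_iota")
if (!is.na(first_stop) && first_stop < nrow(history)) {
  later <- (first_stop + 1):nrow(history)
  history[later, metric_cols] <- history[rep(first_stop, length(later)), metric_cols]
}
```

In this toy history the rule is first met in epoch 4, so epoch 5 carries the values of epoch 4 forward.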
TEClassifiersBasedOnProtoNet$train( data_embeddings = NULL, data_targets = NULL, data_folds = 5L, data_val_size = 0.25, loss_pt_fct_name = "MultiWayContrastiveLoss", use_sc = FALSE, sc_method = "knnor", sc_min_k = 1L, sc_max_k = 10L, use_pl = FALSE, pl_max_steps = 3L, pl_max = 1, pl_anchor = 1, pl_min = 0, sustain_track = TRUE, sustain_iso_code = NULL, sustain_region = NULL, sustain_interval = 15L, sustain_log_level = "warning", epochs = 40L, batch_size = 35L, Ns = 5L, Nq = 3L, loss_alpha = 0.5, loss_margin = 0.05, sampling_separate = FALSE, sampling_shuffle = TRUE, trace = TRUE, ml_trace = 1L, log_dir = NULL, log_write_interval = 10L, n_cores = auto_n_cores(), lr_rate = 0.001, lr_min = 1e-04, lr_scheduler = "None", lr_warm_up_ratio = 0.02, optimizer = "AdamW", amp = FALSE )
data_embeddings: EmbeddedText, LargeDataSetForTextEmbeddings. Object of class EmbeddedText or LargeDataSetForTextEmbeddings.
data_targets: factor containing the labels for cases stored in embeddings. The factor must be
named and has to use the same names as used in the embeddings.
data_folds: int determining the number of cross-fold samples. Allowed values:
data_val_size: double between 0 and 1, indicating the proportion of cases which should be
used for the validation sample during the estimation of the model.
The remaining cases are part of the training data. Allowed values:
loss_pt_fct_name: string Name of the loss function to use during training. Allowed values: 'MultiWayContrastiveLoss', 'MultiWayContrastiveLossFC'
use_sc: bool TRUE if the estimation should integrate synthetic cases. FALSE if not.
sc_method: string containing the method for generating synthetic cases. Allowed values: 'knnor'
sc_min_k: int determining the minimal number of k which is used for creating synthetic units. Allowed values:
sc_max_k: int determining the maximal number of k which is used for creating synthetic units. Allowed values:
use_pl: bool TRUE if the estimation should integrate pseudo-labeling. FALSE if not.
pl_max_steps: int determining the maximum number of steps during pseudo-labeling. Allowed values:
pl_max: double setting the maximal level of confidence for considering a case for pseudo-labeling. Allowed values:
pl_anchor: double indicating the reference point for sorting the new cases of every label. Allowed values:
pl_min: double setting the minimal level of confidence for considering a case for pseudo-labeling. Allowed values:
sustain_track: bool If TRUE energy consumption is tracked during training via the python library 'codecarbon'.
sustain_iso_code: string ISO code (Alpha-3-Code) for the country. This variable must be set if
sustainability should be tracked. A list can be found on Wikipedia:
https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes. Allowed values: any
sustain_region: string Region within a country. Only available for USA and Canada. See the documentation of
codecarbon for more information. https://docs.codecarbon.io/latest/ Allowed values: any
sustain_interval: int Interval in seconds for measuring power usage. Allowed values:
sustain_log_level: string Level for printing information to the console. Allowed values: 'debug', 'info', 'warning', 'error', 'critical'
epochs: int Number of training epochs. Allowed values:
batch_size: int Size of the batches for training. Allowed values:
Ns: int Number of cases for every class in the sample. Allowed values:
Nq: int Number of cases for every class in the query. Allowed values:
loss_alpha: double Value between 0 and 1 indicating how strongly the loss should focus on pulling cases to
their corresponding prototypes or pushing cases away from other prototypes. The higher the value the more the
loss concentrates on pulling cases to their corresponding prototypes. Allowed values:
loss_margin: double Value greater than 0 indicating the minimal distance of every case from prototypes of other classes. Please note that
in contrast to the original work by Zhang et al. (2019) this implementation
reaches better performance if the margin is an order of magnitude lower (e.g. 0.05 instead of 0.5). Allowed values:
sampling_separate: bool If TRUE the cases for every class are divided into a data set for sample and
for query. These are never mixed. If FALSE sample and query cases are drawn from the same data pool. That is,
a case can be part of sample in one epoch and in another epoch it can be part of query. It is ensured that a
case is never part of sample and query at the same time. In addition, it is ensured that every case exists
only once during a training step.
sampling_shuffle: bool If TRUE cases are randomly drawn from the data during every step. If FALSE the
cases are not shuffled.
trace: bool TRUE if information about the estimation phase should be printed to the console.
ml_trace: int ml_trace=0 does not print any information about the training process from pytorch on the console. Allowed values:
log_dir: string Path to the directory where the log files should be saved.
If no logging is desired set this argument to NULL. Allowed values: any
log_write_interval: int Time in seconds determining the interval in which the logger should try to update
the log files. Only relevant if log_dir is not NULL. Allowed values:
n_cores: int Number of cores which should be used during the calculation of synthetic cases. Only relevant if use_sc=TRUE. Allowed values:
lr_rate: double Initial learning rate for the training. Sets the maximal learning rate. Allowed values:
lr_min: double Minimal learning rate during training. Allowed values:
lr_scheduler: string Learning rate scheduler. To use a constant learning rate for the whole training set this parameter to 'None'. Allowed values: 'None', 'Linear', 'Cyclic'
lr_warm_up_ratio: double Number of epochs used for warm up. To disable warm up set this value to 0.0. Allowed values:
optimizer: string determining the optimizer used for training. Allowed values: 'Adam', 'RMSprop', 'AdamW', 'SGD'
amp: bool Apply automatic mixed precision to speed up computations. It is generally
recommended to set this parameter to TRUE. If you encounter problems set to FALSE.
* FALSE: Use full precision.
* TRUE: Use automatic mixed precision (amp) with gradient scaling.
loss_balance_class_weights: bool If TRUE class weights are generated based on the frequencies of the
training data with the method Inverse Class Frequency. If FALSE each class has the weight 1.
loss_balance_sequence_length: bool If TRUE sample weights are generated for the length of sequences based on
the frequencies of the training data with the method Inverse Class Frequency.
If FALSE each sequence length has the weight 1.
sc_max_k: All values from sc_min_k up to sc_max_k are successively used. If
the number of sc_max_k is too high, the value is reduced to a number that allows the calculating of synthetic
units.
pl_anchor: With the help of this value, the new cases are sorted. For
this aim, the distance from the anchor is calculated and all cases are arranged into an ascending order.
Function does not return a value. It changes the object into a trained classifier.
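The roles of pl_min, pl_max and pl_anchor can be illustrated with a small base R sketch. It assumes that every unlabeled case has one confidence value for its predicted class. The helper order_pseudo_label_candidates() and the example values are hypothetical and do not reproduce the package's internal routine.

```r
# Hypothetical helper, not part of the package API.
# Keeps cases whose confidence lies in [pl_min, pl_max] and sorts them
# in ascending order of their distance from pl_anchor.
order_pseudo_label_candidates <- function(confidence,
                                          pl_min = 0,
                                          pl_max = 1,
                                          pl_anchor = 1) {
  keep <- confidence >= pl_min & confidence <= pl_max
  candidates <- confidence[keep]
  candidates[order(abs(candidates - pl_anchor))]
}

# Made-up confidence values for four unlabeled cases.
confidence <- c(case_a = 0.55, case_b = 0.98, case_c = 0.80, case_d = 0.30)

ranked <- order_pseudo_label_candidates(
  confidence,
  pl_min = 0.5, pl_max = 1, pl_anchor = 1
)
names(ranked)
```

With pl_anchor = 1 the most confident cases come first. A lower anchor would move less certain cases to the front.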
predict_with_samples()
Method for predicting the class of given data (query) based on provided examples (sample).
TEClassifiersBasedOnProtoNet$predict_with_samples( newdata, batch_size = 32L, ml_trace = 1L, embeddings_s = NULL, classes_s = NULL )
newdata: Object of class EmbeddedText or LargeDataSetForTextEmbeddings containing the text embeddings for all cases which should be predicted. They form the query set.
batch_size: int batch size.
ml_trace: int ml_trace=0 does not print any information about the training process from pytorch on the console. Allowed values:
embeddings_s: Object of class EmbeddedText or LargeDataSetForTextEmbeddings containing the text embeddings for all reference examples. They form the sample set.
classes_s: Named factor containing the classes for every case within embeddings_s.
Returns a data.frame containing the predictions and the probabilities of the different labels for each
case.
embed()
Method for embedding documents. Please do not confuse this type of embeddings with the embeddings of texts created by an object of class TextEmbeddingModel. These embeddings embed documents according to their similarity to specific classes.
TEClassifiersBasedOnProtoNet$embed( embeddings_q = NULL, embeddings_s = NULL, classes_s = NULL, batch_size = 32L, ml_trace = 1L )
embeddings_q: Object of class EmbeddedText or LargeDataSetForTextEmbeddings containing the text embeddings for all cases which should be embedded into the classification space.
embeddings_s: Object of class EmbeddedText or LargeDataSetForTextEmbeddings containing the text
embeddings for all reference examples. They form the sample set. If set to NULL the trained prototypes are used.
classes_s: Named factor containing the classes for every case within embeddings_s.
If set to NULL the trained prototypes are used.
batch_size: int batch size.
ml_trace: int ml_trace=0 does not print any information about the training process from pytorch on the console. Allowed values:
Returns a list containing the following elements:
embeddings_q: embeddings for the cases (query sample).
distances_q: matrix containing the distance of every query case to every prototype.
embeddings_prototypes: embeddings of the prototypes which were learned during training. They represent the
centers of the different classes.
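How prototypes and distances_q relate can be sketched in base R, following the idea of prototypical networks (Snell et al., 2017) that a prototype is the mean of the sample cases of a class. The sketch assumes that all cases are already embedded in a two-dimensional classification space and that the Euclidean distance is used. The trained network, the actual metric and its scale factor are not covered, and all values are made up.

```r
# Toy embeddings for the sample (support) set; values are made up.
embeddings_s <- rbind(
  s1 = c(0, 0),
  s2 = c(0, 2),
  s3 = c(4, 4),
  s4 = c(6, 4)
)
classes_s <- factor(
  c(s1 = "low", s2 = "low", s3 = "high", s4 = "high"),
  levels = c("low", "high")
)

# Prototype of a class = mean of the sample embeddings of that class.
prototypes <- t(sapply(levels(classes_s), function(cl) {
  colMeans(embeddings_s[classes_s == cl, , drop = FALSE])
}))

# Toy embeddings for the query set.
embeddings_q <- rbind(q1 = c(0, 1), q2 = c(5, 5))

# Euclidean distance of every query case to every prototype.
distances_q <- sapply(rownames(prototypes), function(cl) {
  sqrt(rowSums(sweep(embeddings_q, 2, prototypes[cl, ])^2))
})

# Every query case is assigned to the class of its nearest prototype.
predicted <- colnames(distances_q)[apply(distances_q, 1, which.min)]
```

Here distances_q has one row per query case and one column per class, which mirrors the element distances_q described above.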
get_metric_scale_factor()
Method returns the scaling factor of the metric.
TEClassifiersBasedOnProtoNet$get_metric_scale_factor()
Returns the scaling factor of the metric as float.
plot_embeddings()
Method for creating a plot to visualize embeddings and their corresponding centers (prototypes).
TEClassifiersBasedOnProtoNet$plot_embeddings( embeddings_q, classes_q = NULL, embeddings_s = NULL, classes_s = NULL, batch_size = 12L, alpha = 0.5, size_points = 3L, size_points_prototypes = 8L, inc_unlabeled = TRUE, inc_margin = TRUE )
embeddings_q: Object of class EmbeddedText or LargeDataSetForTextEmbeddings containing the text embeddings for all cases which should be embedded into the classification space.
classes_q: Named factor containing the true classes for every case. Please note that the names must match
the names/ids in embeddings_q.
embeddings_s: Object of class EmbeddedText or LargeDataSetForTextEmbeddings containing the text
embeddings for all reference examples. They form the sample set. If set to NULL the trained prototypes are used.
classes_s: Named factor containing the classes for every case within embeddings_s.
If set to NULL the trained prototypes are used.
batch_size: int batch size.
alpha: float Value indicating how transparent the points should be (important
if many points overlap). Does not apply to points representing prototypes.
size_points: int Size of the points excluding the points for prototypes.
size_points_prototypes: int Size of points representing prototypes.
inc_unlabeled: bool If TRUE plot includes unlabeled cases as data points.
inc_margin: bool If TRUE plot includes the margin around every prototype. Adding margin
requires a trained model. If the model is not trained this argument is treated as set to FALSE.
Returns a plot of class ggplot visualizing embeddings.
clone()
The objects of this class are cloneable with this method.
TEClassifiersBasedOnProtoNet$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other R6 Classes for Developers:
AIFEBaseModel,
AIFEMaster,
BaseModelCore,
ClassifiersBasedOnTextEmbeddings,
DataManagerClassifier,
LargeDataSetBase,
ModelsBasedOnTextEmbeddings,
TEClassifiersBasedOnRegular,
TokenizerBase
Abstract class for all regular classifiers that use numerical representations of texts instead of words.
Objects of this class contain fields and methods used in several other classes in 'AI for Education'.
This class is not designed for a direct application and should only be used by developers.
A new object of this class.
aifeducation::AIFEMaster -> aifeducation::AIFEBaseModel -> aifeducation::ModelsBasedOnTextEmbeddings -> aifeducation::ClassifiersBasedOnTextEmbeddings -> TEClassifiersBasedOnRegular
aifeducation::AIFEMaster$get_all_fields(),
aifeducation::AIFEMaster$get_documentation_license(),
aifeducation::AIFEMaster$get_ml_framework(),
aifeducation::AIFEMaster$get_model_config(),
aifeducation::AIFEMaster$get_model_description(),
aifeducation::AIFEMaster$get_model_info(),
aifeducation::AIFEMaster$get_model_license(),
aifeducation::AIFEMaster$get_package_versions(),
aifeducation::AIFEMaster$get_private(),
aifeducation::AIFEMaster$get_publication_info(),
aifeducation::AIFEMaster$get_sustainability_data(),
aifeducation::AIFEMaster$is_configured(),
aifeducation::AIFEMaster$is_trained(),
aifeducation::AIFEMaster$set_documentation_license(),
aifeducation::AIFEMaster$set_model_description(),
aifeducation::AIFEMaster$set_model_license(),
aifeducation::AIFEMaster$set_publication_info(),
aifeducation::AIFEBaseModel$count_parameter(),
aifeducation::ModelsBasedOnTextEmbeddings$get_text_embedding_model(),
aifeducation::ModelsBasedOnTextEmbeddings$get_text_embedding_model_name(),
aifeducation::ClassifiersBasedOnTextEmbeddings$adjust_target_levels(),
aifeducation::ClassifiersBasedOnTextEmbeddings$check_embedding_model(),
aifeducation::ClassifiersBasedOnTextEmbeddings$check_feature_extractor_object_type(),
aifeducation::ClassifiersBasedOnTextEmbeddings$load_from_disk(),
aifeducation::ClassifiersBasedOnTextEmbeddings$plot_coding_stream(),
aifeducation::ClassifiersBasedOnTextEmbeddings$plot_training_history(),
aifeducation::ClassifiersBasedOnTextEmbeddings$predict(),
aifeducation::ClassifiersBasedOnTextEmbeddings$requires_compression(),
aifeducation::ClassifiersBasedOnTextEmbeddings$save()
train()
Method for training a neural net.
Training includes a routine for early stopping. In the case that loss<0.0001 and Accuracy=1.00 and Average Iota=1.00 training stops. The history uses the values of the last trained epoch for the remaining epochs.
After training the model with the best values for Average Iota, Accuracy, and Loss on the validation data set is used as the final model.
TEClassifiersBasedOnRegular$train( data_embeddings = NULL, data_targets = NULL, data_folds = 5L, data_val_size = 0.25, loss_balance_class_weights = TRUE, loss_balance_sequence_length = TRUE, loss_cls_fct_name = "FocalLoss", use_sc = FALSE, sc_method = "knnor", sc_min_k = 1L, sc_max_k = 10L, use_pl = FALSE, pl_max_steps = 3L, pl_max = 1, pl_anchor = 1, pl_min = 0, sustain_track = TRUE, sustain_iso_code = NULL, sustain_region = NULL, sustain_interval = 15L, sustain_log_level = "warning", epochs = 40L, batch_size = 32L, trace = TRUE, ml_trace = 1L, log_dir = NULL, log_write_interval = 10L, n_cores = auto_n_cores(), lr_rate = 0.001, lr_min = 1e-04, lr_warm_up_ratio = 0.02, lr_scheduler = "None", optimizer = "AdamW", amp = FALSE )
data_embeddings: EmbeddedText, LargeDataSetForTextEmbeddings. Object of class EmbeddedText or LargeDataSetForTextEmbeddings.
data_targets: factor containing the labels for cases stored in embeddings. The factor must be
named and has to use the same names as used in the embeddings.
data_folds: int determining the number of cross-fold samples. Allowed values:
data_val_size: double between 0 and 1, indicating the proportion of cases which should be
used for the validation sample during the estimation of the model.
The remaining cases are part of the training data. Allowed values:
loss_balance_class_weights: bool If TRUE class weights are generated based on the frequencies of the
training data with the method Inverse Class Frequency. If FALSE each class has the weight 1.
loss_balance_sequence_length: bool If TRUE sample weights are generated for the length of sequences based on
the frequencies of the training data with the method Inverse Class Frequency.
If FALSE each sequence length has the weight 1.
loss_cls_fct_name: string Name of the loss function to use during training. Allowed values: 'FocalLoss', 'CrossEntropyLoss'
use_sc: bool TRUE if the estimation should integrate synthetic cases. FALSE if not.
sc_method: string containing the method for generating synthetic cases. Allowed values: 'knnor'
sc_min_k: int determining the minimal number of k which is used for creating synthetic units. Allowed values:
sc_max_k: int determining the maximal number of k which is used for creating synthetic units. Allowed values:
use_pl: bool TRUE if the estimation should integrate pseudo-labeling. FALSE if not.
pl_max_steps: int determining the maximum number of steps during pseudo-labeling. Allowed values:
pl_max: double setting the maximal level of confidence for considering a case for pseudo-labeling. Allowed values:
pl_anchor: double indicating the reference point for sorting the new cases of every label. Allowed values:
pl_min: double setting the minimal level of confidence for considering a case for pseudo-labeling. Allowed values:
sustain_track: bool If TRUE energy consumption is tracked during training via the python library 'codecarbon'.
sustain_iso_code: string ISO code (Alpha-3-Code) for the country. This variable must be set if
sustainability should be tracked. A list can be found on Wikipedia:
https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes. Allowed values: any
sustain_region: string Region within a country. Only available for USA and Canada. See the documentation of
codecarbon for more information. https://docs.codecarbon.io/latest/ Allowed values: any
sustain_interval: int Interval in seconds for measuring power usage. Allowed values:
sustain_log_level: string Level for printing information to the console. Allowed values: 'debug', 'info', 'warning', 'error', 'critical'
epochs: int Number of training epochs. Allowed values:
batch_size: int Size of the batches for training. Allowed values:
trace: bool TRUE if information about the estimation phase should be printed to the console.
ml_trace: int ml_trace=0 does not print any information about the training process from pytorch on the console. Allowed values:
log_dir: string Path to the directory where the log files should be saved.
If no logging is desired set this argument to NULL. Allowed values: any
log_write_interval: int Time in seconds determining the interval in which the logger should try to update
the log files. Only relevant if log_dir is not NULL. Allowed values:
n_cores: int Number of cores which should be used during the calculation of synthetic cases. Only relevant if use_sc=TRUE. Allowed values:
lr_rate: double Initial learning rate for the training. Sets the maximal learning rate. Allowed values:
lr_min: double Minimal learning rate during training. Allowed values:
lr_warm_up_ratio: double Number of epochs used for warm up. To disable warm up set this value to 0.0. Allowed values:
lr_scheduler: string Learning rate scheduler. To use a constant learning rate for the whole training set this parameter to 'None'. Allowed values: 'None', 'Linear', 'Cyclic'
optimizer: string determining the optimizer used for training. Allowed values: 'Adam', 'RMSprop', 'AdamW', 'SGD'
amp: bool Apply automatic mixed precision to speed up computations. It is generally
recommended to set this parameter to TRUE. If you encounter problems set to FALSE.
* FALSE: Use full precision.
* TRUE: Use automatic mixed precision (amp) with gradient scaling.
sc_max_k: All values from sc_min_k up to sc_max_k are successively used. If
the number of sc_max_k is too high, the value is reduced to a number that allows the calculating of synthetic
units.
pl_anchor: With the help of this value, the new cases are sorted. For
this aim, the distance from the anchor is calculated and all cases are arranged into an ascending order.
Function does not return a value. It changes the object into a trained classifier.
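The idea behind loss_balance_class_weights can be sketched in base R. Inverse class frequency weighting exists in several variants. The formula below (total number of cases divided by the number of classes times the class count) is one common formulation and is an assumption here, not necessarily the exact formula used by the package. The labels are made up.

```r
# Made-up, imbalanced training labels: 6 x "a", 3 x "b", 1 x "c".
targets <- factor(c(rep("a", 6), rep("b", 3), "c"))

# Class frequencies (missing values are ignored by table()).
class_counts <- table(targets)

# Assumed formulation: weight = n_total / (n_classes * n_class).
class_weights <- setNames(
  as.numeric(sum(class_counts) / (length(class_counts) * class_counts)),
  names(class_counts)
)

round(class_weights, 3)
```

Rare classes receive larger weights, so errors on them contribute more to the loss. With loss_balance_class_weights = FALSE every class would have the weight 1 instead.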
clone()
The objects of this class are cloneable with this method.
TEClassifiersBasedOnRegular$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other R6 Classes for Developers:
AIFEBaseModel,
AIFEMaster,
BaseModelCore,
ClassifiersBasedOnTextEmbeddings,
DataManagerClassifier,
LargeDataSetBase,
ModelsBasedOnTextEmbeddings,
TEClassifiersBasedOnProtoNet,
TokenizerBase
Classification Type
This is a probability classifier that predicts a probability distribution for different classes/categories. This is the standard case most common in literature.
Sequential Core Architecture
This model is based on a sequential architecture. The input is passed to a specific number of layers step by step. All layers are grouped by their kind into stacks.
Transformer Encoder Layers
Description
The transformer encoder layers follow the structure of the encoder layers used in transformer models. A single layer is designed as described by Chollet, Kalinowski, and Allaire (2022, p. 373) with the exception that single components of the layers (such as the activation function, the kind of residual connection, the kind of normalization or the kind of attention) can be customized. All parameters with the prefix tf_ can be used to configure this layer.
Feature Layer
Description
The feature layer is a dense layer that can be used to increase or decrease the number of features of the input data before passing the data into your model. The aim of this layer is to increase or reduce the complexity of the data for your model. The output size of this layer determines the number of features for all following layers. In the special case that the requested number of features equals the number of features of the text embeddings this layer is reduced to a dropout layer with masking capabilities. All parameters with the prefix feat_ can be used to configure this layer.
Dense Layers
Description
A fully connected layer. The layer is applied to every step of a sequence. All parameters with the prefix dense_ can be used to configure this layer.
Multiple N-Gram Layers
Description
This type of layer focuses on sub-sequences and performs a 1d convolutional operation. On a word and token level these sub-sequences can be interpreted as n-grams (Jacovi, Shalom & Goldberg 2018). The convolution is done across all features. The number of filters equals the number of features of the input tensor. Thus, the shape of the tensor is retained (Pham, Kruszewski & Boleda 2016).
The layer is able to consider multiple n-grams at the same time. In this case the convolution of the n-grams is done separately and the resulting tensors are concatenated along the feature dimension. The number of filters for each n-gram is set to the next smallest natural number of num_features/num_n-grams. A residual is added to the first n-gram. Thus, the resulting tensor has the same shape as the input tensor.
Sub-sequences that are masked in the input are also masked in the output.
The output of this layer can be understood as the results of the n-gram filters. Stacking this layer allows the model to perform n-gram detection of n-grams (meta perspective). All parameters with the prefix ng_conv_ can be used to configure this layer.
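Two aspects of this description can be made concrete with a short base R sketch: how the filters are split among several n-grams, and what a single n-gram filter computes over a sequence. The helpers filters_per_n_gram() and conv1d_single_filter() are hypothetical. Padding, masking, the residual and the concatenation of filters are left out, so the output here is shorter than the input sequence.

```r
# Hypothetical helper: number of filters assigned to each n-gram,
# i.e. num_features / num_n_grams rounded down.
filters_per_n_gram <- function(num_features, n_grams) {
  floor(num_features / length(n_grams))
}

# Hypothetical helper: one filter of a 1d convolution across all features.
# x:      matrix (time steps x features)
# kernel: matrix (n x features), where n is the size of the n-gram
conv1d_single_filter <- function(x, kernel) {
  n <- nrow(kernel)
  steps <- nrow(x) - n + 1
  vapply(
    seq_len(steps),
    function(t) sum(x[t:(t + n - 1), , drop = FALSE] * kernel),
    numeric(1)
  )
}

# 10 features shared by uni-, bi- and tri-gram filters.
n_filters <- filters_per_n_gram(num_features = 10, n_grams = c(1, 2, 3))

# Toy sequence with 4 time steps and 3 features.
x <- matrix(1:12, nrow = 4, ncol = 3)
# A bigram filter (n = 2) with all weights set to 1.
kernel <- matrix(1, nrow = 2, ncol = 3)
out <- conv1d_single_filter(x, kernel)
```

Each element of out summarizes one bigram, that is two neighboring time steps across all features.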
Recurrent Layers
Description
A regular recurrent layer either as Gated Recurrent Unit (GRU) or Long Short-Term Memory (LSTM) layer. Uses PyTorch's implementation. All parameters with the prefix rec_ can be used to configure this layer.
Classification Pooling Layer
Description
Layer transforms sequences into a lower dimensional space that can be passed to dense layers. It performs two types of pooling. First, it extracts features across the time dimension selecting the maximal and/or minimal features. Second, it performs pooling over the remaining features selecting a specific number of the highest and/or lowest features.
In the case of selecting the minimal and maximal features at the same time the minimal features are concatenated to the tensor of the maximal features resulting in the shape $(Batch, Times, 2*Features)$ at the end of the first step. In the second step the number of requested features is halved. The first half is used for the maximal features and the second for the minimal features. All parameters with the prefix cls_pooling_ can be used to configure this layer.
Training and Prediction
Creating and training a classifier requires two inputs: an object of class EmbeddedText or LargeDataSetForTextEmbeddings, and a factor.
The object of class EmbeddedText or LargeDataSetForTextEmbeddings contains the numerical text representations (text embeddings) of the raw texts generated by an object of class TextEmbeddingModel. For supporting large data sets it is recommended to use LargeDataSetForTextEmbeddings instead of EmbeddedText.
The factor contains the classes/categories for every text. Missing values (unlabeled cases) are supported and can be used for pseudo labeling.
For predictions an object of class EmbeddedText or LargeDataSetForTextEmbeddings has to be used which was created with the same TextEmbeddingModel as for training.
Returns a new object of this class ready for configuration or for loading a saved classifier.
aifeducation::AIFEMaster -> aifeducation::AIFEBaseModel -> aifeducation::ModelsBasedOnTextEmbeddings -> aifeducation::ClassifiersBasedOnTextEmbeddings -> aifeducation::TEClassifiersBasedOnRegular -> TEClassifierSequential
aifeducation::AIFEMaster$get_all_fields(),
aifeducation::AIFEMaster$get_documentation_license(),
aifeducation::AIFEMaster$get_ml_framework(),
aifeducation::AIFEMaster$get_model_config(),
aifeducation::AIFEMaster$get_model_description(),
aifeducation::AIFEMaster$get_model_info(),
aifeducation::AIFEMaster$get_model_license(),
aifeducation::AIFEMaster$get_package_versions(),
aifeducation::AIFEMaster$get_private(),
aifeducation::AIFEMaster$get_publication_info(),
aifeducation::AIFEMaster$get_sustainability_data(),
aifeducation::AIFEMaster$is_configured(),
aifeducation::AIFEMaster$is_trained(),
aifeducation::AIFEMaster$set_documentation_license(),
aifeducation::AIFEMaster$set_model_description(),
aifeducation::AIFEMaster$set_model_license(),
aifeducation::AIFEMaster$set_publication_info(),
aifeducation::AIFEBaseModel$count_parameter(),
aifeducation::ModelsBasedOnTextEmbeddings$get_text_embedding_model(),
aifeducation::ModelsBasedOnTextEmbeddings$get_text_embedding_model_name(),
aifeducation::ClassifiersBasedOnTextEmbeddings$adjust_target_levels(),
aifeducation::ClassifiersBasedOnTextEmbeddings$check_embedding_model(),
aifeducation::ClassifiersBasedOnTextEmbeddings$check_feature_extractor_object_type(),
aifeducation::ClassifiersBasedOnTextEmbeddings$load_from_disk(),
aifeducation::ClassifiersBasedOnTextEmbeddings$plot_coding_stream(),
aifeducation::ClassifiersBasedOnTextEmbeddings$plot_training_history(),
aifeducation::ClassifiersBasedOnTextEmbeddings$predict(),
aifeducation::ClassifiersBasedOnTextEmbeddings$requires_compression(),
aifeducation::ClassifiersBasedOnTextEmbeddings$save(),
aifeducation::TEClassifiersBasedOnRegular$train()
configure()
Creating a new instance of this class.
TEClassifierSequential$configure( name = NULL, label = NULL, text_embeddings = NULL, feature_extractor = NULL, target_levels = NULL, skip_connection_type = "ResidualGate", cls_pooling_features = NULL, cls_pooling_type = "MinMax", cls_head_type = "Regular", feat_act_fct = "ELU", feat_size = 50L, feat_bias = TRUE, feat_dropout = 0, feat_parametrizations = "None", feat_normalization_type = "LayerNorm", ng_conv_act_fct = "ELU", ng_conv_n_layers = 1L, ng_conv_ks_min = 2L, ng_conv_ks_max = 4L, ng_conv_bias = FALSE, ng_conv_dropout = 0.1, ng_conv_parametrizations = "None", ng_conv_normalization_type = "LayerNorm", ng_conv_residual_type = "ResidualGate", dense_act_fct = "ELU", dense_n_layers = 1, dense_dropout = 0.5, dense_bias = FALSE, dense_parametrizations = "None", dense_normalization_type = "LayerNorm", dense_residual_type = "ResidualGate", rec_act_fct = "Tanh", rec_n_layers = 1L, rec_type = "GRU", rec_bidirectional = FALSE, rec_dropout = 0.2, rec_bias = FALSE, rec_parametrizations = "None", rec_normalization_type = "LayerNorm", rec_residual_type = "ResidualGate", tf_act_fct = "ELU", tf_dense_dim = 50L, tf_n_layers = 1L, tf_dropout_rate_1 = 0.1, tf_dropout_rate_2 = 0.5, tf_attention_type = "MultiHead", tf_positional_type = "absolute", tf_num_heads = 1, tf_bias = FALSE, tf_parametrizations = "None", tf_normalization_type = "LayerNorm", tf_normalization_position = "Pre", tf_residual_type = "ResidualGate" )
name: string. Name of the new model. Please refer to common name conventions. Free text can be used with parameter label. If set to NULL a unique ID is generated automatically. Allowed values: any
label: string. Label for the new model. Here you can use free text. Allowed values: any
text_embeddings: EmbeddedText, LargeDataSetForTextEmbeddings. Object of class EmbeddedText or LargeDataSetForTextEmbeddings.
feature_extractor: TEFeatureExtractor. Object of class TEFeatureExtractor which should be used in order to reduce the number of dimensions of the text embeddings. If no feature extractor should be applied set NULL.
target_levels: vector containing the levels (categories or classes) within the target data. Please note that order matters. For ordinal data please ensure that the levels are sorted correctly with later levels indicating a higher category/class. For nominal data the order does not matter.
skip_connection_type: string. Type of residual connection for all layers and stacks of layers. Allowed values: 'ResidualGate', 'Addition', 'None'
cls_pooling_features: int. Number of features to be extracted at the end of the model. Allowed values:
cls_pooling_type: string. Type of extracting intermediate features. Allowed values: 'Max', 'Min', 'MinMax'
cls_head_type: string. Type of classification head. Allowed values: 'Regular', 'PairwiseOrthogonal', 'PairwiseOrthogonalDense'
feat_act_fct: string. Activation function for all layers. Allowed values: 'ELU', 'LeakyReLU', 'ReLU', 'GELU', 'Sigmoid', 'Tanh', 'PReLU'
feat_size: int. Number of neurons for each dense layer. Allowed values:
feat_bias: bool. If TRUE a bias term is added to all layers. If FALSE no bias term is added to the layers.
feat_dropout: double determining the dropout for the dense projection of the feature layer. Allowed values:
feat_parametrizations: string. Re-parametrizations of the weights of layers. Allowed values: 'None', 'OrthogonalWeights', 'WeightNorm', 'SpectralNorm'
feat_normalization_type: string. Type of normalization applied to all layers and stack layers. Allowed values: 'LayerNorm', 'BatchNorm', 'PowerNorm', 'RMSNorm', 'None'
ng_conv_act_fct: string. Activation function for all layers. Allowed values: 'ELU', 'LeakyReLU', 'ReLU', 'GELU', 'Sigmoid', 'Tanh', 'PReLU'
ng_conv_n_layers: int determining how many times the n-gram layers should be added to the network. Allowed values:
ng_conv_ks_min: int determining the minimal window size for n-grams. Allowed values:
ng_conv_ks_max: int determining the maximal window size for n-grams. Allowed values:
ng_conv_bias: bool. If TRUE a bias term is added to all layers. If FALSE no bias term is added to the layers.
ng_conv_dropout: double determining the dropout for n-gram convolution layers. Allowed values:
ng_conv_parametrizations: string. Re-parametrizations of the weights of layers. Allowed values: 'None', 'OrthogonalWeights', 'WeightNorm', 'SpectralNorm'
ng_conv_normalization_type: string. Type of normalization applied to all layers and stack layers. Allowed values: 'LayerNorm', 'BatchNorm', 'PowerNorm', 'RMSNorm', 'None'
ng_conv_residual_type: string. Type of residual connection for all layers and stacks of layers. Allowed values: 'ResidualGate', 'Addition', 'None'
dense_act_fct: string. Activation function for all layers. Allowed values: 'ELU', 'LeakyReLU', 'ReLU', 'GELU', 'Sigmoid', 'Tanh', 'PReLU'
dense_n_layers: int. Number of dense layers. Allowed values:
dense_dropout: double determining the dropout between dense layers. Allowed values:
dense_bias: bool. If TRUE a bias term is added to all layers. If FALSE no bias term is added to the layers.
dense_parametrizations: string. Re-parametrizations of the weights of layers. Allowed values: 'None', 'OrthogonalWeights', 'WeightNorm', 'SpectralNorm'
dense_normalization_type: string. Type of normalization applied to all layers and stack layers. Allowed values: 'LayerNorm', 'BatchNorm', 'PowerNorm', 'RMSNorm', 'None'
dense_residual_type: string. Type of residual connection for all layers and stacks of layers. Allowed values: 'ResidualGate', 'Addition', 'None'
rec_act_fct: string. Activation function for all layers. Allowed values: 'Tanh'
rec_n_layers: int. Number of recurrent layers. Allowed values:
rec_type: string. Type of the recurrent layers. rec_type='GRU' for Gated Recurrent Unit and rec_type='LSTM' for Long Short-Term Memory. Allowed values: 'GRU', 'LSTM'
rec_bidirectional: bool. If TRUE a bidirectional version of the recurrent layers is used.
rec_dropout: double determining the dropout between recurrent layers. Allowed values:
rec_bias: bool. If TRUE a bias term is added to all layers. If FALSE no bias term is added to the layers.
rec_parametrizations: string. Re-parametrizations of the weights of layers. Allowed values: 'None'
rec_normalization_type: string. Type of normalization applied to all layers and stack layers. Allowed values: 'LayerNorm', 'BatchNorm', 'PowerNorm', 'RMSNorm', 'None'
rec_residual_type: string. Type of residual connection for all layers and stacks of layers. Allowed values: 'ResidualGate', 'Addition', 'None'
tf_act_fct: string. Activation function for all layers. Allowed values: 'ELU', 'LeakyReLU', 'ReLU', 'GELU', 'Sigmoid', 'Tanh', 'PReLU'
tf_dense_dim: int determining the size of the projection layer within each transformer encoder. Allowed values:
tf_n_layers: int determining how many times the encoder should be added to the network. Allowed values:
tf_dropout_rate_1: double determining the dropout after the attention mechanism within the transformer encoder layers. Allowed values:
tf_dropout_rate_2: double determining the dropout for the dense projection within the transformer encoder layers. Allowed values:
tf_attention_type: string. Choose the attention type. Allowed values: 'Fourier', 'MultiHead'
tf_positional_type: string. Type of processing positional information. Allowed values: 'None', 'absolute'
tf_num_heads: int determining the number of attention heads for a self-attention layer. Only relevant if tf_attention_type='MultiHead'. Allowed values:
tf_bias: bool. If TRUE a bias term is added to all layers. If FALSE no bias term is added to the layers.
tf_parametrizations: string. Re-parametrizations of the weights of layers. Allowed values: 'None', 'OrthogonalWeights', 'WeightNorm', 'SpectralNorm'
tf_normalization_type: string. Type of normalization applied to all layers and stack layers. Allowed values: 'LayerNorm', 'BatchNorm', 'PowerNorm', 'RMSNorm', 'None'
tf_normalization_position: string. Position where the normalization should be applied. Allowed values: 'Pre', 'Post'
tf_residual_type: string. Type of residual connection for all layers and stacks of layers. Allowed values: 'ResidualGate', 'Addition', 'None'
The function returns nothing. It modifies the current object.
clone()
The objects of this class are cloneable with this method.
TEClassifierSequential$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other Classification:
TEClassifierParallel,
TEClassifierParallelPrototype,
TEClassifierProtoNet,
TEClassifierRegular,
TEClassifierSequentialPrototype
Classification Type
This object is a metric-based classifier and represents an implementation of a prototypical network for few-shot learning as described by Snell, Swersky, and Zemel (2017). The network uses a multi-way contrastive loss described by Zhang et al. (2019). The network learns to scale the metric as described by Oreshkin, Rodriguez, and Lacoste (2018).
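The core idea of a prototypical network can be illustrated with a small plain-Python sketch: each class is represented by the mean of its embedded sample cases (the prototype), and a new case is assigned to the class with the nearest prototype. All function names below are hypothetical. The softmax over negative, scaled distances follows the general formulation of prototypical networks and only illustrates prediction; it is not the contrastive loss used for training, and the fixed scale argument merely stands in for the learned metric scaling.

```python
import math


def class_prototypes(embeddings, labels):
    """Mean embedding per class, computed from labeled sample cases."""
    sums, counts = {}, {}
    for vector, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def class_probabilities(query, prototypes, distance=euclidean, scale=1.0):
    """Softmax over negative, scaled distances to each prototype."""
    logits = {label: -scale * distance(query, proto) for label, proto in prototypes.items()}
    top = max(logits.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(value - top) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


embeddings = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
labels = ["low", "low", "high", "high"]
prototypes = class_prototypes(embeddings, labels)
print(prototypes)  # {'low': [0.0, 1.0], 'high': [5.0, 4.0]}
print(class_probabilities([1.0, 1.0], prototypes))
```

A larger scale factor sharpens the probabilities around the nearest prototype, while a smaller one flattens them. Passing distance=cosine_distance corresponds to comparing directions instead of positions in the embedding space.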
Sequential Core Architecture
This model is based on a sequential architecture. The input is passed to a specific number of layers step by step. All layers are grouped by their kind into stacks.
Transformer Encoder Layers
Description
The transformer encoder layers follow the structure of the encoder layers used in transformer models. A single layer is designed as described by Chollet, Kalinowski, and Allaire (2022, p. 373) with the exception that single components of the layers (such as the activation function, the kind of residual connection, the kind of normalization or the kind of attention) can be customized. All parameters with the prefix tf_ can be used to configure this layer.
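One of the customizable components is the position of the normalization. The following plain-Python sketch shows the usual meaning of the two conventions for a single sub-layer with an additive residual connection. It is an illustration under assumptions, not the package's code: the function names are hypothetical, the layer normalization has no learnable parameters, and the exact placement inside the package is assumed to follow the common pre-/post-normalization definitions.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) for v in x]


def residual_block(x, sublayer, position="Pre"):
    """Apply a sub-layer with an additive residual and Pre/Post normalization."""
    if position == "Pre":
        # Pre: normalize the input of the sub-layer, then add the residual.
        out = sublayer(layer_norm(x))
        return [a + b for a, b in zip(x, out)]
    if position == "Post":
        # Post: add the residual first, then normalize the sum.
        out = sublayer(x)
        return layer_norm([a + b for a, b in zip(x, out)])
    raise ValueError("position must be 'Pre' or 'Post'")


def halve(vector):
    """Stand-in for an attention or dense sub-layer."""
    return [0.5 * v for v in vector]


x = [1.0, 2.0, 3.0, 4.0]
print(residual_block(x, halve, position="Pre"))
print(residual_block(x, halve, position="Post"))
```

With 'Pre', the residual path carries the unnormalized input through the block, whereas with 'Post' the output of every block is normalized.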
Feature Layer
Description
The feature layer is a dense layer that can be used to increase or decrease the number of features of the input data before passing the data into your model. The aim of this layer is to increase or reduce the complexity of the data for your model. The output size of this layer determines the number of features for all following layers. In the special case that the requested number of features equals the number of features of the text embeddings this layer is reduced to a dropout layer with masking capabilities. All parameters with the prefix feat_ can be used to configure this layer.
Dense Layers
Description
A fully connected layer. The layer is applied to every step of a sequence. All parameters with the prefix dense_ can be used to configure this layer.
Multiple N-Gram Layers
Description
This type of layer focuses on sub-sequences and performs a 1D convolution. On the word and token level, these sub-sequences can be interpreted as n-grams (Jacovi, Shalom & Goldberg 2018). The convolution is done across all features. The number of filters equals the number of features of the input tensor. Thus, the shape of the tensor is retained (Pham, Kruszewski & Boleda 2016).
The layer is able to consider multiple n-grams at the same time. In this case, the convolution for each n-gram is done separately and the resulting tensors are concatenated along the feature dimension. The number of filters for each n-gram is num_features/num_n-grams, rounded down to the next natural number. The remainder of this division is added to the filters of the first n-gram. Thus, the resulting tensor has the same shape as the input tensor.
Sub-sequences that are masked in the input are also masked in the output.
The output of this layer can be understood as the results of the n-gram filters. Stacking this layer allows the model to perform n-gram detection of n-grams (meta perspective). All parameters with the prefix ng_conv_ can be used to configure this layer.
Recurrent Layers
Description
A regular recurrent layer, either a Gated Recurrent Unit (GRU) or a Long Short-Term Memory (LSTM) layer. It uses PyTorch's implementation. All parameters with the prefix rec_ can be used to configure this layer.
Classification Pooling Layer
Description
This layer transforms sequences into a lower-dimensional space that can be passed to dense layers. It performs two types of pooling. First, it extracts features across the time dimension, selecting the maximal and/or minimal features. Second, it performs pooling over the remaining features, selecting a specific number of the highest and/or lowest features.
If the minimal and maximal features are selected at the same time, the minimal features are concatenated to the tensor of the maximal features, resulting in the shape $(Batch, Times, 2*Features)$ at the end of the first step. In the second step the number of requested features is halved. The first half is used for the maximal features and the second for the minimal features. All parameters with the prefix cls_pooling_ can be used to configure this layer.
Training and Prediction
Creating and training a classifier requires two inputs: an object of class EmbeddedText or LargeDataSetForTextEmbeddings, and a factor.
The object of class EmbeddedText or LargeDataSetForTextEmbeddings contains the numerical text representations (text embeddings) of the raw texts generated by an object of class TextEmbeddingModel. For supporting large data sets it is recommended to use LargeDataSetForTextEmbeddings instead of EmbeddedText.
The factor contains the classes/categories for every text. Missing values (unlabeled cases) are supported and can be used for pseudo labeling.
For predictions an object of class EmbeddedText or LargeDataSetForTextEmbeddings has to be used which was created with the same TextEmbeddingModel as for training.
Returns a new object of this class ready for configuration or for loading a saved classifier.
aifeducation::AIFEMaster -> aifeducation::AIFEBaseModel -> aifeducation::ModelsBasedOnTextEmbeddings -> aifeducation::ClassifiersBasedOnTextEmbeddings -> aifeducation::TEClassifiersBasedOnProtoNet -> TEClassifierSequentialPrototype
aifeducation::AIFEMaster$get_all_fields(),
aifeducation::AIFEMaster$get_documentation_license(),
aifeducation::AIFEMaster$get_ml_framework(),
aifeducation::AIFEMaster$get_model_config(),
aifeducation::AIFEMaster$get_model_description(),
aifeducation::AIFEMaster$get_model_info(),
aifeducation::AIFEMaster$get_model_license(),
aifeducation::AIFEMaster$get_package_versions(),
aifeducation::AIFEMaster$get_private(),
aifeducation::AIFEMaster$get_publication_info(),
aifeducation::AIFEMaster$get_sustainability_data(),
aifeducation::AIFEMaster$is_configured(),
aifeducation::AIFEMaster$is_trained(),
aifeducation::AIFEMaster$set_documentation_license(),
aifeducation::AIFEMaster$set_model_description(),
aifeducation::AIFEMaster$set_model_license(),
aifeducation::AIFEMaster$set_publication_info(),
aifeducation::AIFEBaseModel$count_parameter(),
aifeducation::ModelsBasedOnTextEmbeddings$get_text_embedding_model(),
aifeducation::ModelsBasedOnTextEmbeddings$get_text_embedding_model_name(),
aifeducation::ClassifiersBasedOnTextEmbeddings$adjust_target_levels(),
aifeducation::ClassifiersBasedOnTextEmbeddings$check_embedding_model(),
aifeducation::ClassifiersBasedOnTextEmbeddings$check_feature_extractor_object_type(),
aifeducation::ClassifiersBasedOnTextEmbeddings$load_from_disk(),
aifeducation::ClassifiersBasedOnTextEmbeddings$plot_coding_stream(),
aifeducation::ClassifiersBasedOnTextEmbeddings$plot_training_history(),
aifeducation::ClassifiersBasedOnTextEmbeddings$predict(),
aifeducation::ClassifiersBasedOnTextEmbeddings$requires_compression(),
aifeducation::ClassifiersBasedOnTextEmbeddings$save(),
aifeducation::TEClassifiersBasedOnProtoNet$embed(),
aifeducation::TEClassifiersBasedOnProtoNet$get_metric_scale_factor(),
aifeducation::TEClassifiersBasedOnProtoNet$plot_embeddings(),
aifeducation::TEClassifiersBasedOnProtoNet$predict_with_samples(),
aifeducation::TEClassifiersBasedOnProtoNet$train()
configure()
Creating a new instance of this class.
TEClassifierSequentialPrototype$configure( name = NULL, label = NULL, text_embeddings = NULL, feature_extractor = NULL, target_levels = NULL, skip_connection_type = "ResidualGate", cls_pooling_features = 50L, cls_pooling_type = "MinMax", projection_type = "Regular", metric_type = "Euclidean", feat_act_fct = "ELU", feat_size = 50L, feat_bias = TRUE, feat_dropout = 0, feat_parametrizations = "None", feat_normalization_type = "LayerNorm", ng_conv_act_fct = "ELU", ng_conv_n_layers = 1L, ng_conv_ks_min = 2L, ng_conv_ks_max = 4, ng_conv_bias = FALSE, ng_conv_dropout = 0.1, ng_conv_parametrizations = "None", ng_conv_normalization_type = "LayerNorm", ng_conv_residual_type = "ResidualGate", dense_act_fct = "ELU", dense_n_layers = 1L, dense_dropout = 0.5, dense_bias = FALSE, dense_parametrizations = "None", dense_normalization_type = "LayerNorm", dense_residual_type = "ResidualGate", rec_act_fct = "Tanh", rec_n_layers = 1, rec_type = "GRU", rec_bidirectional = FALSE, rec_dropout = 0.2, rec_bias = FALSE, rec_parametrizations = "None", rec_normalization_type = "LayerNorm", rec_residual_type = "ResidualGate", tf_act_fct = "ELU", tf_dense_dim = 50L, tf_n_layers = 1L, tf_dropout_rate_1 = 0.1, tf_dropout_rate_2 = 0.5, tf_attention_type = "MultiHead", tf_positional_type = "absolute", tf_num_heads = 1L, tf_bias = FALSE, tf_parametrizations = "None", tf_normalization_type = "LayerNorm", tf_normalization_position = "Pre", tf_residual_type = "ResidualGate", embedding_dim = 2L )
name: string. Name of the new model. Please refer to common name conventions. Free text can be used with parameter label. If set to NULL a unique ID is generated automatically. Allowed values: any
label: string. Label for the new model. Here you can use free text. Allowed values: any
text_embeddings: EmbeddedText, LargeDataSetForTextEmbeddings. Object of class EmbeddedText or LargeDataSetForTextEmbeddings.
feature_extractor: TEFeatureExtractor. Object of class TEFeatureExtractor which should be used in order to reduce the number of dimensions of the text embeddings. If no feature extractor should be applied set NULL.
target_levels: vector containing the levels (categories or classes) within the target data. Please note that order matters. For ordinal data please ensure that the levels are sorted correctly with later levels indicating a higher category/class. For nominal data the order does not matter.
skip_connection_type: string. Type of residual connection for all layers and stacks of layers. Allowed values: 'ResidualGate', 'Addition', 'None'
cls_pooling_features: int. Number of features to be extracted at the end of the model. Allowed values:
cls_pooling_type: string. Type of extracting intermediate features. Allowed values: 'Max', 'Min', 'MinMax'
projection_type: string. Type of projection. Allowed values: 'Regular', 'PairwiseOrthogonal', 'PairwiseOrthogonalDense'
metric_type: string. Type of metric used for calculating the distance. Allowed values: 'Euclidean', 'CosineDistance'
feat_act_fct: string. Activation function for all layers. Allowed values: 'ELU', 'LeakyReLU', 'ReLU', 'GELU', 'Sigmoid', 'Tanh', 'PReLU'
feat_size: int. Number of neurons for each dense layer. Allowed values:
feat_bias: bool. If TRUE a bias term is added to all layers. If FALSE no bias term is added to the layers.
feat_dropout: double determining the dropout for the dense projection of the feature layer. Allowed values:
feat_parametrizations: string. Re-parametrizations of the weights of layers. Allowed values: 'None', 'OrthogonalWeights', 'WeightNorm', 'SpectralNorm'
feat_normalization_type: string. Type of normalization applied to all layers and stack layers. Allowed values: 'LayerNorm', 'BatchNorm', 'PowerNorm', 'RMSNorm', 'None'
ng_conv_act_fct: string. Activation function for all layers. Allowed values: 'ELU', 'LeakyReLU', 'ReLU', 'GELU', 'Sigmoid', 'Tanh', 'PReLU'
ng_conv_n_layers: int determining how many times the n-gram layers should be added to the network. Allowed values:
ng_conv_ks_min: int determining the minimal window size for n-grams. Allowed values:
ng_conv_ks_max: int determining the maximal window size for n-grams. Allowed values:
ng_conv_bias: bool. If TRUE a bias term is added to all layers. If FALSE no bias term is added to the layers.
ng_conv_dropout: double determining the dropout for n-gram convolution layers. Allowed values:
ng_conv_parametrizations: string. Re-parametrizations of the weights of layers. Allowed values: 'None', 'OrthogonalWeights', 'WeightNorm', 'SpectralNorm'
ng_conv_normalization_type: string. Type of normalization applied to all layers and stack layers. Allowed values: 'LayerNorm', 'BatchNorm', 'PowerNorm', 'RMSNorm', 'None'
ng_conv_residual_type: string. Type of residual connection for all layers and stacks of layers. Allowed values: 'ResidualGate', 'Addition', 'None'
dense_act_fct: string. Activation function for all layers. Allowed values: 'ELU', 'LeakyReLU', 'ReLU', 'GELU', 'Sigmoid', 'Tanh', 'PReLU'
dense_n_layers: int. Number of dense layers. Allowed values:
dense_dropout: double determining the dropout between dense layers. Allowed values:
dense_bias: bool. If TRUE a bias term is added to all layers. If FALSE no bias term is added to the layers.
dense_parametrizations: string. Re-parametrizations of the weights of layers. Allowed values: 'None', 'OrthogonalWeights', 'WeightNorm', 'SpectralNorm'
dense_normalization_type: string. Type of normalization applied to all layers and stack layers. Allowed values: 'LayerNorm', 'BatchNorm', 'PowerNorm', 'RMSNorm', 'None'
dense_residual_type: string. Type of residual connection for all layers and stacks of layers. Allowed values: 'ResidualGate', 'Addition', 'None'
rec_act_fct: string. Activation function for all layers. Allowed values: 'Tanh'
rec_n_layers: int. Number of recurrent layers. Allowed values:
rec_type: string. Type of the recurrent layers. rec_type='GRU' for Gated Recurrent Unit and rec_type='LSTM' for Long Short-Term Memory. Allowed values: 'GRU', 'LSTM'
rec_bidirectional: bool. If TRUE a bidirectional version of the recurrent layers is used.
rec_dropout: double determining the dropout between recurrent layers. Allowed values:
rec_bias: bool. If TRUE a bias term is added to all layers. If FALSE no bias term is added to the layers.
rec_parametrizations: string. Re-parametrizations of the weights of layers. Allowed values: 'None'
rec_normalization_type: string. Type of normalization applied to all layers and stack layers. Allowed values: 'LayerNorm', 'BatchNorm', 'PowerNorm', 'RMSNorm', 'None'
rec_residual_type: string. Type of residual connection for all layers and stacks of layers. Allowed values: 'ResidualGate', 'Addition', 'None'
tf_act_fct: string. Activation function for all layers. Allowed values: 'ELU', 'LeakyReLU', 'ReLU', 'GELU', 'Sigmoid', 'Tanh', 'PReLU'
tf_dense_dim: int determining the size of the projection layer within each transformer encoder. Allowed values:
tf_n_layers: int determining how many times the encoder should be added to the network. Allowed values:
tf_dropout_rate_1: double determining the dropout after the attention mechanism within the transformer encoder layers. Allowed values:
tf_dropout_rate_2: double determining the dropout for the dense projection within the transformer encoder layers. Allowed values:
tf_attention_type: string. Choose the attention type. Allowed values: 'Fourier', 'MultiHead'
tf_positional_type: string. Type of processing positional information. Allowed values: 'None', 'absolute'
tf_num_heads: int determining the number of attention heads for a self-attention layer. Only relevant if tf_attention_type='MultiHead'. Allowed values:
tf_bias: bool. If TRUE a bias term is added to all layers. If FALSE no bias term is added to the layers.
tf_parametrizations: string. Re-parametrizations of the weights of layers. Allowed values: 'None', 'OrthogonalWeights', 'WeightNorm', 'SpectralNorm'
tf_normalization_type: string. Type of normalization applied to all layers and stack layers. Allowed values: 'LayerNorm', 'BatchNorm', 'PowerNorm', 'RMSNorm', 'None'
tf_normalization_position: string. Position where the normalization should be applied. Allowed values: 'Pre', 'Post'
tf_residual_type: string. Type of residual connection for all layers and stacks of layers. Allowed values: 'ResidualGate', 'Addition', 'None'
embedding_dim: int determining the number of dimensions for the embedding. Allowed values:
The function returns nothing. It modifies the current object.
clone()
The objects of this class are cloneable with this method.
TEClassifierSequentialPrototype$clone(deep = FALSE)
deep: Whether to make a deep clone.
Oreshkin, B. N., Rodriguez, P. & Lacoste, A. (2018). TADAM: Task dependent adaptive metric for improved few-shot learning. https://doi.org/10.48550/arXiv.1805.10123
Snell, J., Swersky, K. & Zemel, R. S. (2017). Prototypical Networks for Few-shot Learning. https://doi.org/10.48550/arXiv.1703.05175
Zhang, X., Nie, J., Zong, L., Yu, H. & Liang, W. (2019). One Shot Learning with Margin. In Q. Yang, Z.-H. Zhou, Z. Gong, M.-L. Zhang & S.-J. Huang (Eds.), Lecture Notes in Computer Science. Advances in Knowledge Discovery and Data Mining (Vol. 11440, pp. 305–317). Springer International Publishing. https://doi.org/10.1007/978-3-030-16145-3_24
Other Classification:
TEClassifierParallel,
TEClassifierParallelPrototype,
TEClassifierProtoNet,
TEClassifierRegular,
TEClassifierSequential
Abstract class for auto encoders with 'pytorch'.
Objects of this class are used for reducing the number of dimensions of text embeddings created by an object of class TextEmbeddingModel.
For training a feature extractor of this class an object of class EmbeddedText or LargeDataSetForTextEmbeddings generated by an object of class TextEmbeddingModel is necessary. Passing raw texts is not supported.
For prediction, an object of class EmbeddedText or LargeDataSetForTextEmbeddings is necessary that was generated with the same TextEmbeddingModel as during training. Prediction outputs a new object of class EmbeddedText or LargeDataSetForTextEmbeddings which contains a text embedding with a lower number of dimensions.
All models use tied weights for the encoder and decoder layers and can apply the estimation of orthogonal weights (except method="LSTM"). In addition, training tries to achieve uncorrelated features.
Objects of class TEFeatureExtractor are designed to be used with any ClassifiersBasedOnTextEmbeddings.
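The idea of tied weights can be sketched with a tiny linear auto encoder in plain Python: the decoder does not own a weight matrix but reuses the transposed encoder matrix. The weights below are hypothetical and chosen by hand so that the rows are orthonormal; activation functions, noise, and training are omitted, so this is an illustration of the principle rather than the package's implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


# Hypothetical encoder weights: 3 input dimensions -> 2 features.
# Both rows have unit length and are orthogonal to each other.
encoder_weights = [
    [0.6, 0.8, 0.0],
    [-0.8, 0.6, 0.0],
]


def encode(embedding):
    """Compress an embedding to a lower number of dimensions."""
    return matvec(encoder_weights, embedding)


def decode(features):
    """Tied weights: reuse the encoder matrix, transposed."""
    return matvec(transpose(encoder_weights), features)


embedding = [1.0, 2.0, 3.0]
features = encode(embedding)
reconstruction = decode(features)
print(features)        # approximately [2.2, 0.4]
print(reconstruction)  # approximately [1.0, 2.0, 0.0]
```

With orthonormal rows, decoding returns the part of the input that lies in the retained subspace, so only the information in the dropped direction (here the third dimension) is lost. This is one reason why orthogonal weights are a natural companion to tied weights.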
A new instance of this class.
aifeducation::AIFEMaster -> aifeducation::AIFEBaseModel -> aifeducation::ModelsBasedOnTextEmbeddings -> TEFeatureExtractor
aifeducation::AIFEMaster$get_all_fields(),
aifeducation::AIFEMaster$get_documentation_license(),
aifeducation::AIFEMaster$get_ml_framework(),
aifeducation::AIFEMaster$get_model_config(),
aifeducation::AIFEMaster$get_model_description(),
aifeducation::AIFEMaster$get_model_info(),
aifeducation::AIFEMaster$get_model_license(),
aifeducation::AIFEMaster$get_package_versions(),
aifeducation::AIFEMaster$get_private(),
aifeducation::AIFEMaster$get_publication_info(),
aifeducation::AIFEMaster$get_sustainability_data(),
aifeducation::AIFEMaster$is_configured(),
aifeducation::AIFEMaster$is_trained(),
aifeducation::AIFEMaster$set_documentation_license(),
aifeducation::AIFEMaster$set_model_description(),
aifeducation::AIFEMaster$set_model_license(),
aifeducation::AIFEMaster$set_publication_info(),
aifeducation::AIFEBaseModel$count_parameter(),
aifeducation::ModelsBasedOnTextEmbeddings$check_embedding_model(),
aifeducation::ModelsBasedOnTextEmbeddings$get_text_embedding_model(),
aifeducation::ModelsBasedOnTextEmbeddings$get_text_embedding_model_name(),
aifeducation::ModelsBasedOnTextEmbeddings$load_from_disk(),
aifeducation::ModelsBasedOnTextEmbeddings$save()
configure()
Creating a new instance of this class.
TEFeatureExtractor$configure( name = NULL, label = NULL, text_embeddings = NULL, features = 128L, method = "dense", orthogonal_method = "matrix_exp", noise_factor = 0.2 )
name: string. Name of the new model. Please refer to common name conventions. Free text can be used with parameter label. If set to NULL a unique ID is generated automatically. Allowed values: any
label: string. Label for the new model. Here you can use free text. Allowed values: any
text_embeddings: EmbeddedText, LargeDataSetForTextEmbeddings. Object of class EmbeddedText or LargeDataSetForTextEmbeddings.
features: int. Number of features the model should use. Allowed values:
method: string. Method to use for the feature extraction. 'LSTM' for an extractor based on LSTM layers or 'Dense' for dense layers. Allowed values: 'Dense', 'LSTM'
orthogonal_method: string. Method for ensuring orthogonality of weights. Allowed values: 'matrix_exp', 'cayley', 'householder', 'None'
noise_factor: double. Value of at least 0 and lower than 1 indicating how much noise should be added to the input during training. Allowed values:
Returns an object of class TEFeatureExtractor which is ready for training.
train()
Method for training a neural net.
TEFeatureExtractor$train( data_embeddings = NULL, data_val_size = 0.25, sustain_track = TRUE, sustain_iso_code = NULL, sustain_region = NULL, sustain_interval = 15L, sustain_log_level = "warning", epochs = 40L, batch_size = 32L, trace = TRUE, ml_trace = 1L, log_dir = NULL, log_write_interval = 10L, lr_rate = 0.001, lr_min = 1e-04, lr_warm_up_ratio = 0.02, lr_scheduler = "None", optimizer = "AdamW", amp = FALSE )
data_embeddingsEmbeddedText, LargeDataSetForTextEmbeddings Object of class EmbeddedText or LargeDataSetForTextEmbeddings.
data_val_sizedouble between 0 and 1, indicating the proportion of cases which should be
used for the validation sample during the estimation of the model.
The remaining cases are part of the training data. Allowed values:
sustain_trackbool If TRUE energy consumption is tracked during training via the python library 'codecarbon'.
sustain_iso_codestring ISO code (Alpha-3-Code) for the country. This variable must be set if
sustainability should be tracked. A list can be found on Wikipedia:
https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes. Allowed values: any
sustain_regionstring Region within a country. Only available for USA and Canada. See the documentation of
codecarbon for more information: https://docs.codecarbon.io/latest/ Allowed values: any
sustain_intervalint Interval in seconds for measuring power usage. Allowed values:
sustain_log_levelstring Level for printing information to the console. Allowed values: 'debug', 'info', 'warning', 'error', 'critical'
epochsint Number of training epochs. Allowed values:
batch_sizeint Size of the batches for training. Allowed values:
tracebool TRUE if information about the estimation phase should be printed to the console.
ml_traceint ml_trace=0 does not print any information about the training process from pytorch on the console. Allowed values:
log_dirstring Path to the directory where the log files should be saved.
If no logging is desired set this argument to NULL. Allowed values: any
log_write_intervalint Time in seconds determining the interval in which the logger should try to update
the log files. Only relevant if log_dir is not NULL. Allowed values:
lr_ratedouble Initial learning rate for the training. Sets the maximal learning rate. Allowed values:
lr_mindouble Minimal learning rate during training. Allowed values:
lr_warm_up_ratiodouble Proportion of epochs used for warm up. To disable warm up set this value to 0.0. Allowed values:
lr_schedulerstring Learning rate scheduler. To use a constant learning rate for the whole training set this parameter to 'None'. Allowed values: 'None', 'Linear', 'Cyclic'
optimizerstring determining the optimizer used for training. Allowed values: 'Adam', 'RMSprop', 'AdamW', 'SGD'
ampbool Apply automatic mixed precision to speed up computations. It is generally
recommended to set this parameter to TRUE. If you encounter problems, set it to FALSE.
* FALSE: Use full precision.
* TRUE: Use automatic mixed precision (amp) with gradient scaling.
Function does not return a value. It changes the object into a trained feature extractor.
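The interplay of lr_rate, lr_min, and lr_warm_up_ratio can be pictured with a small base R sketch. It assumes that lr_warm_up_ratio is a share of the epochs, that the learning rate changes once per epoch, and that the 'Linear' scheduler falls linearly from lr_rate to lr_min after warm up. The function lr_schedule_sketch is hypothetical and not part of the package. Training runs in 'PyTorch', so the real schedule may differ, for example by updating per step.

```r
# Hypothetical helper: linear warm up followed by linear decay.
# Assumption: one learning rate value per epoch.
lr_schedule_sketch <- function(epochs, lr_rate, lr_min, lr_warm_up_ratio) {
  # Number of epochs spent on warm up (0 disables warm up).
  n_warm <- ceiling(epochs * lr_warm_up_ratio)
  n_decay <- epochs - n_warm
  # Warm up rises from lr_min towards the maximal rate lr_rate.
  warm <- if (n_warm > 0) {
    seq(lr_min, lr_rate, length.out = n_warm + 1)[-1]
  } else {
    numeric(0)
  }
  # Afterwards the rate falls linearly back to lr_min.
  decay <- if (n_decay > 0) {
    seq(lr_rate, lr_min, length.out = n_decay)
  } else {
    numeric(0)
  }
  c(warm, decay)
}

# Defaults taken from the train() signature above.
lrs <- lr_schedule_sketch(
  epochs = 40L, lr_rate = 0.001, lr_min = 1e-04, lr_warm_up_ratio = 0.02
)
range(lrs)
```

With lr_scheduler = "None" the rate would simply stay at lr_rate for every epoch.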
extract_features()
Method for extracting features. Applying this method reduces the number of dimensions of the text
embeddings. Please note that this method should only be used if a small number of cases should be compressed
since the data is loaded completely into memory. For a high number of cases please use the method
extract_features_large.
TEFeatureExtractor$extract_features(data_embeddings, batch_size)
data_embeddingsObject of class EmbeddedText, LargeDataSetForTextEmbeddings,
datasets.arrow_dataset.Dataset or array containing the text embeddings which should be reduced in their
dimensions.
batch_sizeint batch size.
Returns an object of class EmbeddedText containing the compressed embeddings.
extract_features_large()
Method for extracting features from a large number of cases. Applying this method reduces the number of dimensions of the text embeddings.
TEFeatureExtractor$extract_features_large( data_embeddings, batch_size, trace = FALSE )
data_embeddingsObject of class EmbeddedText or LargeDataSetForTextEmbeddings containing the text embeddings which should be reduced in their dimensions.
batch_sizeint batch size.
tracebool If TRUE information about the progress is printed to the console.
Returns an object of class LargeDataSetForTextEmbeddings containing the compressed embeddings.
plot_training_history()
Method for requesting a plot of the training history. This method requires the R package 'ggplot2' to work.
TEFeatureExtractor$plot_training_history( x_min = NULL, x_max = NULL, y_min = NULL, y_max = NULL, ind_best_model = TRUE, text_size = 10L )
x_minint Minimal value for x-axis. Set to NULL for an automatic adjustment. Allowed values:
x_maxint Maximal value for x-axis. Set to NULL for an automatic adjustment. Allowed values:
y_minint Minimal value for y-axis. Set to NULL for an automatic adjustment. Allowed values:
y_maxint Maximal value for y-axis. Set to NULL for an automatic adjustment. Allowed values:
ind_best_modelbool If TRUE the plot indicates the best states of the model according to the chosen measure.
text_sizeint Size of text elements. Allowed values:
Returns a plot of class ggplot visualizing the training process.
clone()
The objects of this class are cloneable with this method.
TEFeatureExtractor$clone(deep = FALSE)
deepWhether to make a deep clone.
features refers to the number of features for the compressed text embeddings.
This model requires pad_value=0. If this condition is not met the
padding value is switched automatically.
This model requires that the underlying TextEmbeddingModel uses pad_value=0. If
this condition is not met the pad value is switched before training.
Other Text Embedding:
TextEmbeddingModel
Function converts tensors within a list into numpy arrays in order to allow further operations in R.
tensor_list_to_numpy(tensor_list)
tensor_list |
|
Returns the same list with the exception that objects of class torch.Tensor are transformed into numpy arrays.
If the tensor requires a gradient and/or is on gpu it is detached and converted.
If the object in a list is not of this class the original object is returned.
Other Utils Python Data Management Developers:
class_vector_to_py_dataset(),
create_py_dataset_cache_file_path(),
data.frame_to_py_dataset(),
extract_column_from_py_dataset(),
get_batches_index(),
prepare_r_array_for_dataset(),
py_dataset_to_embeddings(),
reduce_to_unique(),
tensor_to_numpy()
Function written in C++ that transforms a tensor (with size batch x times x features) into a matrix (with size batch x times*features).
tensor_to_matrix_c(tensor, times, features)
tensor |
|
times |
|
features |
|
Returns a matrix (with size batch x times*features).
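The reshaping can be pictured in base R as follows. This is only a sketch: the helper tensor_to_matrix_sketch is hypothetical, and the column order (all features of the first time step, then all features of the second, and so on) is an assumption that is not confirmed by the documentation of tensor_to_matrix_c.

```r
# Hypothetical base R counterpart, for illustration only.
tensor_to_matrix_sketch <- function(tensor, times, features) {
  batch <- dim(tensor)[1]
  # Swap the last two dimensions so that column-major flattening keeps
  # the features of one time step next to each other.
  matrix(
    aperm(tensor, c(1, 3, 2)),
    nrow = batch,
    ncol = times * features
  )
}

# Two cases, three time steps, four features.
tensor <- array(seq_len(24), dim = c(2, 3, 4))
m <- tensor_to_matrix_sketch(tensor, times = 3, features = 4)
dim(m)
```

Under this assumed layout the value tensor[b, t, f] ends up in column (t - 1) * features + f of row b.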
Other Utils Developers:
auto_n_cores(),
create_object(),
create_synthetic_units_from_matrix(),
generate_id(),
get_n_chunks(),
get_synthetic_cases_from_matrix(),
get_time_stamp(),
matrix_to_array_c(),
to_categorical_c()
Function converts a tensor into a numpy array in order to allow further operations in R.
tensor_to_numpy(object)
object |
Object of any class. |
If the object is of class torch.Tensor it returns a numpy array.
If the tensor requires a gradient and/or is on gpu it is detached and converted.
If the object is not of class torch.Tensor the original object is returned.
Other Utils Python Data Management Developers:
class_vector_to_py_dataset(),
create_py_dataset_cache_file_path(),
data.frame_to_py_dataset(),
extract_column_from_py_dataset(),
get_batches_index(),
prepare_r_array_for_dataset(),
py_dataset_to_embeddings(),
reduce_to_unique(),
tensor_list_to_numpy()
This R6 class stores a text embedding model which can be used to tokenize, encode, decode, and embed
raw texts. The object provides a unified interface for different text processing methods.
Objects of class TextEmbeddingModel transform raw texts into numerical representations which can be used for downstream tasks. To this end, objects of this class can tokenize raw texts, encode tokens as sequences of integers, and decode sequences of integers back to tokens.
aifeducation::AIFEMaster -> aifeducation::AIFEBaseModel -> TextEmbeddingModel
BaseModel('BaseModelCore')
Object of class BaseModelCore.
aifeducation::AIFEMaster$get_all_fields(),
aifeducation::AIFEMaster$get_documentation_license(),
aifeducation::AIFEMaster$get_ml_framework(),
aifeducation::AIFEMaster$get_model_config(),
aifeducation::AIFEMaster$get_model_description(),
aifeducation::AIFEMaster$get_model_info(),
aifeducation::AIFEMaster$get_model_license(),
aifeducation::AIFEMaster$get_package_versions(),
aifeducation::AIFEMaster$get_private(),
aifeducation::AIFEMaster$get_publication_info(),
aifeducation::AIFEMaster$is_configured(),
aifeducation::AIFEMaster$is_trained(),
aifeducation::AIFEMaster$set_documentation_license(),
aifeducation::AIFEMaster$set_model_description(),
aifeducation::AIFEMaster$set_model_license(),
aifeducation::AIFEBaseModel$count_parameter()
configure()
Method for creating a new text embedding model
TextEmbeddingModel$configure( model_name = NULL, model_label = NULL, model_language = NULL, max_length = 512L, chunks = 2L, overlap = 0L, emb_layer_min = 1L, emb_layer_max = 2L, emb_pool_type = "Average", pad_value = -100L, base_model = NULL )
model_namestring Name of the new model. Please refer to common name conventions.
Free text can be used with parameter label. If set to NULL a unique ID
is generated automatically. Allowed values: any
model_labelstring Label for the new model. Here you can use free text. Allowed values: any
model_languagestring Languages that the models can work with. Allowed values: any
max_lengthint Maximal number of tokens per chunk. Must be equal to or lower
than the maximal number of positional embeddings of the model. Allowed values:
chunksint Maximal number of chunks. Allowed values:
overlapint Number of tokens from the previous chunk that should be added at the beginning of the next chunk. Allowed values:
emb_layer_minint Minimal layer from which the embeddings should be calculated. Allowed values:
emb_layer_maxint Maximal layer from which the embeddings should be calculated. Allowed values:
emb_pool_typestring Method to summarize the embedding of single tokens into a text embedding.
In the case of 'CLS' all cls-tokens between emb_layer_min and emb_layer_max are averaged.
In the case of 'Average' the embeddings of all tokens are averaged.
Please note that BaseModelFunnel allows only 'CLS'. Allowed values: 'CLS', 'Average'
pad_valueint Value indicating padding. This value should not be in the range of
regular values for computations. Thus it is not recommended to change this value.
Default is -100. Allowed values:
base_modelBaseModelCore BaseModels for processing raw texts.
tracebool TRUE if information about the estimation phase should be printed to the console.
Does not return a value.
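How max_length, chunks, and overlap relate to each other can be sketched in base R. The helper chunk_tokens is hypothetical. It ignores special tokens and padding, which the real tokenizer handles, and it assumes that every chunk after the first starts with the last overlap tokens of its predecessor.

```r
# Hypothetical helper: split a token sequence into overlapping chunks.
chunk_tokens <- function(tokens, max_length, chunks, overlap = 0L) {
  stopifnot(overlap < max_length)
  n <- length(tokens)
  if (n == 0L) {
    return(list(chunks = list(), n_potential = 0L))
  }
  # Each new chunk starts (max_length - overlap) tokens after the last one.
  step <- max_length - overlap
  starts <- seq(1L, max(1L, n - overlap), by = step)
  all_chunks <- lapply(starts, function(s) {
    tokens[s:min(s + max_length - 1L, n)]
  })
  # Keep at most `chunks` chunks but report how many the text would need.
  list(
    chunks = all_chunks[seq_len(min(chunks, length(all_chunks)))],
    n_potential = length(all_chunks)
  )
}

res <- chunk_tokens(1:10, max_length = 4L, chunks = 2L, overlap = 1L)
res$chunks
res$n_potential
```

In this toy case the text would need three chunks, so with chunks = 2L the tokens after position 7 are dropped. Comparing the kept and the potential number of chunks is one way to spot texts that are truncated by the configuration.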
load_from_disk()
Loads an object from disk and updates the object to the current version of the package.
TextEmbeddingModel$load_from_disk(dir_path)
dir_pathPath where the object set is stored.
Function does not return a value. It loads an object from disk.
save()
Method for saving a model on disk.
TextEmbeddingModel$save(dir_path, folder_name)
dir_pathPath to the directory where to save the object.
folder_namestring Name of the folder where the model should be saved. Allowed values: any
Function does not return a value. It is used to save an object on disk.
encode()
Method for encoding words of raw texts into integers.
TextEmbeddingModel$encode( raw_text, token_encodings_only = FALSE, token_to_int = TRUE, trace = FALSE )
raw_textvector Raw text.
token_encodings_onlybool
TRUE: Returns a list containing only the tokens.
FALSE: Returns a list containing a list for the tokens, the number of chunks, and
the potential number of chunks for each document/text.
token_to_intbool
TRUE: Returns the tokens as int index.
FALSE: Returns the tokens as strings.
tracebool TRUE if information about the estimation phase should be printed to the console.
list containing the integer or token sequences of the raw texts with
special tokens.
decode()
Method for decoding a sequence of integers into tokens
TextEmbeddingModel$decode(int_seqence, to_token = FALSE)
int_seqencelist List of integer sequences that should be converted to tokens.
to_tokenbool
FALSE: Transforms the integers to plain text.
TRUE: Transforms the integers to a sequence of tokens.
list of token sequences
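The round trip between tokens and integers can be illustrated with a toy vocabulary in base R. Everything in this sketch is made up for illustration: the vocabulary, the two special tokens, and the helpers encode_sketch and decode_sketch. The '##' prefix for word continuations follows the WordPiece convention. Whether the package keeps special tokens in the plain text output is not stated here, so dropping them is an assumption.

```r
# Toy vocabulary mapping tokens to integer ids (hypothetical).
vocab <- c("[CLS]" = 0L, "[SEP]" = 1L, "learn" = 2L, "##ing" = 3L)
special_tokens <- c("[CLS]", "[SEP]")

# Encode tokens as integers, or leave them as strings.
encode_sketch <- function(tokens, token_to_int = TRUE) {
  if (token_to_int) unname(vocab[tokens]) else tokens
}

# Decode integers into a token sequence or into plain text.
decode_sketch <- function(int_sequence, to_token = FALSE) {
  tokens <- names(vocab)[match(int_sequence, vocab)]
  if (to_token) {
    return(tokens)
  }
  # Plain text: drop special tokens and merge word pieces.
  words <- tokens[!tokens %in% special_tokens]
  gsub(" ##", "", paste(words, collapse = " "), fixed = TRUE)
}

ids <- encode_sketch(c("[CLS]", "learn", "##ing", "[SEP]"))
decode_sketch(ids, to_token = TRUE)
decode_sketch(ids, to_token = FALSE)
```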
embed()
Method for creating text embeddings from raw texts.
This method should only be used if a small number of texts should be transformed
into text embeddings. For a large number of texts please use the method embed_large.
TextEmbeddingModel$embed( raw_text = NULL, doc_id = NULL, batch_size = 8L, trace = FALSE, return_large_dataset = FALSE )
raw_textvector Raw text.
doc_idvector Id for every text.
batch_sizeint Size of the batches for training. Allowed values:
tracebool TRUE if information about the estimation phase should be printed to the console.
return_large_datasetbool If TRUE a LargeDataSetForTextEmbeddings is returned. If FALSE an object of class EmbeddedText is returned.
Method returns an object of class EmbeddedText or LargeDataSetForTextEmbeddings. This object contains the embeddings as a data.frame and information about the model creating the embeddings.
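The two pooling types set with emb_pool_type during configuration can be sketched for a single chunk in base R. The helper pool_embeddings is hypothetical. It assumes that the hidden states of each layer arrive as a matrix with one row per token, that the cls-token sits in the first row, and that padding tokens were removed beforehand.

```r
# Hypothetical helper: pool token embeddings of several layers.
pool_embeddings <- function(hidden_states, emb_layer_min, emb_layer_max,
                            emb_pool_type = "Average") {
  layers <- hidden_states[emb_layer_min:emb_layer_max]
  per_layer <- lapply(layers, function(h) {
    if (emb_pool_type == "CLS") {
      h[1, ] # embedding of the cls-token only
    } else {
      colMeans(h) # mean over all tokens
    }
  })
  # Average the pooled vectors across the selected layers.
  Reduce(`+`, per_layer) / length(per_layer)
}

# Two layers, three tokens, two features.
hidden_states <- list(
  matrix(1:6, nrow = 3),
  matrix(7:12, nrow = 3)
)
pool_embeddings(hidden_states, 1L, 2L, "Average")
pool_embeddings(hidden_states, 1L, 2L, "CLS")
```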
embed_large()
Method for creating text embeddings from raw texts.
TextEmbeddingModel$embed_large( text_dataset, batch_size = 32L, trace = FALSE, log_file = NULL, log_write_interval = 2L )
text_datasetLargeDataSetForText LargeDataSetForText Object storing textual data.
batch_sizeint Size of the batches for training. Allowed values:
tracebool TRUE if information about the estimation phase should be printed to the console.
log_filestring Path to the file where the log files should be saved.
If no logging is desired set this argument to NULL. Allowed values: any
log_write_intervalint Time in seconds determining the interval in which the logger should try to update
the log files. Only relevant if log_file is not NULL. Allowed values:
Method returns an object of class LargeDataSetForTextEmbeddings.
get_n_features()
Method for requesting the number of features.
TextEmbeddingModel$get_n_features()
Returns a double which represents the number of features. This number represents the
hidden size of the embeddings for every chunk or time.
get_pad_value()
Value for indicating padding.
TextEmbeddingModel$get_pad_value()
Returns an int describing the value used for padding.
set_publication_info()
Method for setting the bibliographic information of the model.
TextEmbeddingModel$set_publication_info(type, authors, citation, url = NULL)
typestring Type of information which should be changed/added.
'developer' and 'modifier' are possible.
authorsList of people.
citationstring Citation in free text.
urlstring Corresponding URL if applicable.
Function does not return a value. It is used to set the private members for publication information of the model.
get_sustainability_data()
Method for requesting a summary of tracked energy consumption during training and an estimate of the resulting CO2 equivalents in kg.
TextEmbeddingModel$get_sustainability_data(track_mode = "training")
track_modestring Determines the step to which the data refer. Allowed values: 'training', 'inference'
Returns a list containing the tracked energy consumption, CO2 equivalents in kg, information on the
tracker used, and technical information on the training infrastructure.
estimate_sustainability_inference_embed()
Calculates the energy consumption for inference of the given task.
TextEmbeddingModel$estimate_sustainability_inference_embed( text_dataset = NULL, batch_size = 32L, sustain_iso_code = NULL, sustain_region = NULL, sustain_interval = 10L, sustain_log_level = "warning", trace = TRUE )
text_datasetLargeDataSetForText LargeDataSetForText Object storing textual data.
batch_sizeint Size of the batches for training. Allowed values:
sustain_iso_codestring ISO code (Alpha-3-Code) for the country. This variable must be set if
sustainability should be tracked. A list can be found on Wikipedia:
https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes. Allowed values: any
sustain_regionstring Region within a country. Only available for USA and Canada. See the documentation of
codecarbon for more information: https://docs.codecarbon.io/latest/ Allowed values: any
sustain_intervalint Interval in seconds for measuring power usage. Allowed values:
sustain_log_levelstring Level for printing information to the console. Allowed values: 'debug', 'info', 'warning', 'error', 'critical'
tracebool TRUE if information about the estimation phase should be printed to the console.
Returns nothing. Method saves the statistics internally.
The statistics can be accessed with the method get_sustainability_data("inference")
clone()
The objects of this class are cloneable with this method.
TextEmbeddingModel$clone(deep = FALSE)
deepWhether to make a deep clone.
Other Text Embedding:
TEFeatureExtractor
Function transforming a vector of classes (int) into a binary class matrix.
to_categorical_c(class_vector, n_classes)
class_vector |
|
n_classes |
|
Returns a matrix containing the binary representation for
every class.
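A binary class matrix of this kind can be reproduced in base R for illustration. The helper to_categorical_sketch is hypothetical, and the coding of the classes as integers from 0 to n_classes - 1 is an assumption that the documentation of to_categorical_c does not confirm.

```r
# Hypothetical base R counterpart, for illustration only.
# Assumption: classes are coded as 0, 1, ..., n_classes - 1.
to_categorical_sketch <- function(class_vector, n_classes) {
  m <- matrix(0L, nrow = length(class_vector), ncol = n_classes)
  # Set one cell per row: row i, column class + 1.
  m[cbind(seq_along(class_vector), class_vector + 1L)] <- 1L
  m
}

one_hot <- to_categorical_sketch(c(0L, 2L, 1L), n_classes = 3L)
one_hot
```

Each row contains exactly one 1, located in the column that belongs to the class of that case.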
Other Utils Developers:
auto_n_cores(),
create_object(),
create_synthetic_units_from_matrix(),
generate_id(),
get_n_chunks(),
get_synthetic_cases_from_matrix(),
get_time_stamp(),
matrix_to_array_c(),
tensor_to_matrix_c()
Base class for tokenizers containing all methods shared by the sub-classes.
Returns a new object of this class.
Returns a data.frame containing the estimates.
aifeducation::AIFEMaster -> TokenizerBase
aifeducation::AIFEMaster$get_all_fields(),
aifeducation::AIFEMaster$get_documentation_license(),
aifeducation::AIFEMaster$get_ml_framework(),
aifeducation::AIFEMaster$get_model_config(),
aifeducation::AIFEMaster$get_model_description(),
aifeducation::AIFEMaster$get_model_info(),
aifeducation::AIFEMaster$get_model_license(),
aifeducation::AIFEMaster$get_package_versions(),
aifeducation::AIFEMaster$get_private(),
aifeducation::AIFEMaster$get_publication_info(),
aifeducation::AIFEMaster$get_sustainability_data(),
aifeducation::AIFEMaster$is_configured(),
aifeducation::AIFEMaster$is_trained(),
aifeducation::AIFEMaster$set_documentation_license(),
aifeducation::AIFEMaster$set_model_description(),
aifeducation::AIFEMaster$set_model_license(),
aifeducation::AIFEMaster$set_publication_info()
save()
Method for saving a model on disk.
TokenizerBase$save(dir_path, folder_name)
dir_pathPath to the directory where to save the object.
folder_namestring Name of the folder where the model should be saved. Allowed values: any
Function does not return a value. It is used to save an object on disk.
load_from_disk()
Loads an object from disk and updates the object to the current version of the package.
TokenizerBase$load_from_disk(dir_path)
dir_pathPath where the object set is stored.
Function does not return a value. It loads an object from disk.
get_tokenizer_statistics()
Tokenizer statistics
TokenizerBase$get_tokenizer_statistics()
Returns a data.frame containing the tokenizer's statistics.
get_tokenizer()
Python tokenizer
TokenizerBase$get_tokenizer()
Returns the python tokenizer within the model.
encode()
Method for encoding words of raw texts into integers.
TokenizerBase$encode( raw_text, token_overlap = 0L, max_token_sequence_length = 512L, n_chunks = 1L, token_encodings_only = FALSE, token_to_int = TRUE, return_token_type_ids = TRUE, trace = FALSE )
raw_textvector Raw text.
token_overlapint Number of tokens from the previous chunk that should be added at the beginning of the next chunk. Allowed values:
max_token_sequence_lengthint Maximal number of tokens per chunk. Allowed values:
n_chunksint Maximal number of chunks. Allowed values:
token_encodings_onlybool
TRUE: Returns a list containing only the tokens.
FALSE: Returns a list containing a list for the tokens, the number of chunks, and
the potential number of chunks for each document/text.
token_to_intbool
TRUE: Returns the tokens as int index.
FALSE: Returns the tokens as strings.
return_token_type_idsbool If TRUE additionally returns the token type ids.
tracebool TRUE if information about the estimation phase should be printed to the console.
list containing the integer or token sequences of the raw texts with
special tokens.
decode()
Method for decoding a sequence of integers into tokens
TokenizerBase$decode(int_seqence, to_token = FALSE)
int_seqencelist List of integer sequences that should be converted to tokens.
to_tokenbool
FALSE: Transforms the integers to plain text.
TRUE: Transforms the integers to a sequence of tokens.
list of token sequences
get_special_tokens()
Method for receiving the special tokens of the model
TokenizerBase$get_special_tokens()
Returns a matrix containing the special tokens in the rows
and their type, token, and id in the columns.
n_special_tokens()
Method for receiving the number of special tokens of the model
TokenizerBase$n_special_tokens()
Returns an 'int' counting the number of special tokens.
calculate_statistics()
Method for calculating tokenizer statistics as suggested by Kaya and Tantuğ (2024).
Kaya, Y. B., & Tantuğ, A. C. (2024). Effect of tokenization granularity for Turkish large language models. Intelligent Systems with Applications, 21, 200335. <https://doi.org/10.1016/j.iswa.2024.200335>
TokenizerBase$calculate_statistics( text_dataset, statistics_max_tokens_length, step = "creation" )
text_datasetLargeDataSetForText LargeDataSetForText Object storing textual data.
statistics_max_tokens_lengthint Maximum sequence length for calculating the statistics. Allowed values:
stepstring describing the context of the estimation.
Returns an 'int' counting the number of special tokens.
clone()
The objects of this class are cloneable with this method.
TokenizerBase$clone(deep = FALSE)
deepWhether to make a deep clone.
Other R6 Classes for Developers:
AIFEBaseModel,
AIFEMaster,
BaseModelCore,
ClassifiersBasedOnTextEmbeddings,
DataManagerClassifier,
LargeDataSetBase,
ModelsBasedOnTextEmbeddings,
TEClassifiersBasedOnProtoNet,
TEClassifiersBasedOnRegular
Named list containing all tokenizers as a string.
TokenizerIndex
An object of class list of length 2.
Other Parameter Dictionary:
BaseModelsIndex,
DataSetsIndex,
doc_formula(),
get_TEClassifiers_class_names(),
get_called_args(),
get_depr_obj_names(),
get_magnitude_values(),
get_param_def(),
get_param_dict(),
get_param_doc_desc()
Function for updating 'aifeducation' on a machine.
The function tries to find an existing environment on the machine, removes it, and reinstalls the environment with the new python modules.
In the case env_type = "auto" the function tries to update an existing virtual environment.
If no virtual environment exists it tries to update a conda environment.
update_aifeducation( update_aifeducation_studio = TRUE, env_type = "auto", cuda_version = "13.0", envname = "aifeducation" )
update_aifeducation_studio |
|
env_type |
|
cuda_version |
|
envname |
|
Function does not return a value. It installs python, optional R packages, and necessary 'python' packages on a machine.
On macOS, torch will be installed without support for CUDA.
Other Installation and Configuration:
check_aif_py_modules(),
get_recommended_py_versions(),
install_aifeducation(),
install_aifeducation_studio(),
install_py_modules(),
prepare_session(),
set_transformers_logger()
Tokenizer based on the WordPiece model (Wu et al. 2016).
Returns a new object of this class.
aifeducation::AIFEMaster -> aifeducation::TokenizerBase -> WordPieceTokenizer
aifeducation::AIFEMaster$get_all_fields(),
aifeducation::AIFEMaster$get_documentation_license(),
aifeducation::AIFEMaster$get_ml_framework(),
aifeducation::AIFEMaster$get_model_config(),
aifeducation::AIFEMaster$get_model_description(),
aifeducation::AIFEMaster$get_model_info(),
aifeducation::AIFEMaster$get_model_license(),
aifeducation::AIFEMaster$get_package_versions(),
aifeducation::AIFEMaster$get_private(),
aifeducation::AIFEMaster$get_publication_info(),
aifeducation::AIFEMaster$get_sustainability_data(),
aifeducation::AIFEMaster$is_configured(),
aifeducation::AIFEMaster$is_trained(),
aifeducation::AIFEMaster$set_documentation_license(),
aifeducation::AIFEMaster$set_model_description(),
aifeducation::AIFEMaster$set_model_license(),
aifeducation::AIFEMaster$set_publication_info(),
aifeducation::TokenizerBase$calculate_statistics(),
aifeducation::TokenizerBase$decode(),
aifeducation::TokenizerBase$encode(),
aifeducation::TokenizerBase$get_special_tokens(),
aifeducation::TokenizerBase$get_tokenizer(),
aifeducation::TokenizerBase$get_tokenizer_statistics(),
aifeducation::TokenizerBase$load_from_disk(),
aifeducation::TokenizerBase$n_special_tokens(),
aifeducation::TokenizerBase$save()
configure()
Configures a new object of this class.
WordPieceTokenizer$configure(vocab_size = 10000L, vocab_do_lower_case = FALSE)
vocab_sizeint Size of the vocabulary. Allowed values:
vocab_do_lower_casebool TRUE if all tokens should be lower case.
Does not return a value.
train()
Trains a new object of this class
WordPieceTokenizer$train( text_dataset, statistics_max_tokens_length = 512L, sustain_track = FALSE, sustain_iso_code = NULL, sustain_region = NULL, sustain_interval = 15L, sustain_log_level = "warning", trace = FALSE )
text_datasetLargeDataSetForText LargeDataSetForText Object storing textual data.
statistics_max_tokens_lengthint Maximum sequence length for calculating the statistics. Allowed values:
sustain_trackbool If TRUE energy consumption is tracked during training via the python library 'codecarbon'.
sustain_iso_codestring ISO code (Alpha-3-Code) for the country. This variable must be set if
sustainability should be tracked. A list can be found on Wikipedia:
https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes. Allowed values: any
sustain_regionstring Region within a country. Only available for USA and Canada. See the documentation of
codecarbon for more information: https://docs.codecarbon.io/latest/ Allowed values: any
sustain_intervalint Interval in seconds for measuring power usage. Allowed values:
sustain_log_levelstring Level for printing information to the console. Allowed values: 'debug', 'info', 'warning', 'error', 'critical'
tracebool TRUE if information about the estimation phase should be printed to the console.
Does not return a value.
clone()
The objects of this class are cloneable with this method.
WordPieceTokenizer$clone(deep = FALSE)
deepWhether to make a deep clone.
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., . . . Dean, J. (2016). Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. <https://doi.org/10.48550/arXiv.1609.08144>
Other Tokenizer:
HuggingFaceTokenizer
Function for writing a log file from R containing three rows and three columns. The log file can report the current status of up to three processes. The first row describes the top process. The second row describes the status of the process within the top process. The third row can be used to describe the status of a process within the middle process.
The log can be read with read_log.
write_log( log_file, value_top = 0L, total_top = 1L, message_top = NA, value_middle = 0L, total_middle = 1L, message_middle = NA, value_bottom = 0L, total_bottom = 1L, message_bottom = NA, last_log = NULL, write_interval = 2L )
log_file |
|
value_top |
|
total_top |
|
message_top |
|
value_middle |
|
total_middle |
|
message_middle |
|
value_bottom |
|
total_bottom |
|
message_bottom |
|
last_log |
|
write_interval |
|
This function writes a log file to the given location. If log_file is NULL the function will not try to
write a log file.
If log_file is a valid path to a file the function will write a log if the time specified by
write_interval has passed. In addition, the function will return an object of class POSIXct describing the time
when the log file was successfully updated. If the attempt to write the log fails, the function returns the
value of last_log, which is NULL by default.
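The return behaviour described above can be mimicked with a small base R sketch. The helper write_log_sketch and the CSV file format are assumptions made for illustration. The actual file layout used by write_log and read_log is not described here.

```r
# Hypothetical helper mirroring the documented return behaviour.
write_log_sketch <- function(log_file, status, last_log = NULL,
                             write_interval = 2L) {
  # No log file requested: nothing to do.
  if (is.null(log_file)) {
    return(last_log)
  }
  now <- Sys.time()
  # Write on the first call or once write_interval seconds have passed.
  due <- is.null(last_log) ||
    as.numeric(difftime(now, last_log, units = "secs")) >= write_interval
  if (!due) {
    return(last_log)
  }
  written <- tryCatch(
    {
      utils::write.csv(status, file = log_file, row.names = FALSE)
      TRUE
    },
    error = function(e) FALSE
  )
  # Report the new time stamp only if the file was actually updated.
  if (written) now else last_log
}

# Three rows (top, middle, bottom) and three columns.
status <- data.frame(
  value = c(1L, 3L, 0L),
  total = c(2L, 10L, 1L),
  message = c("Training", "Epoch", NA)
)
log_file <- tempfile(fileext = ".csv")
last_log <- write_log_sketch(log_file, status)
```

Passing the returned time stamp back in as last_log on the next call is what keeps frequent calls from rewriting the file more often than write_interval allows.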
Other Utils Log Developers:
cat_message(),
clean_pytorch_log_transformers(),
output_message(),
print_message(),
read_log(),
read_loss_log(),
reset_log(),
reset_loss_log()