datafed_torchflow package#

Submodules#

datafed_torchflow.JSON module#

class datafed_torchflow.JSON.UniversalEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]#

Bases: JSONEncoder

A custom JSON encoder that can handle numpy data types, sets, and objects with __dict__ attributes.

default(obj)[source]#

Override the default method to provide custom serialization for unsupported data types.

Parameters:

obj (any) – The object to serialize.

Returns:

The serialized form of the object.

Return type:

any
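
A minimal usage sketch, assuming the encoder converts the numpy scalars, arrays, and sets described above; the payload values are illustrative only:

import json

import numpy as np

from datafed_torchflow.JSON import UniversalEncoder

payload = {
    "accuracy": np.float32(0.93),            # numpy scalar
    "layer_sizes": np.array([64, 32, 16]),   # numpy array
    "tags": {"vae", "checkpoint"},           # set
}

# UniversalEncoder is passed to json.dumps via the `cls` argument.
print(json.dumps(payload, cls=UniversalEncoder, indent=2))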

datafed_torchflow.computer module#

datafed_torchflow.computer.get_cpu_info()[source]#

Retrieves CPU information.

Returns:

CPU details including physical cores, total cores, frequency, and usage.

Return type:

dict

datafed_torchflow.computer.get_gpu_info()[source]#

Retrieves GPU information using GPUtil.

Returns:

GPU details such as model, memory, and load.

Return type:

dict

datafed_torchflow.computer.get_memory_info()[source]#

Retrieves memory information.

Returns:

Memory details including total, available, used, and percentage used.

Return type:

dict

datafed_torchflow.computer.get_python_info()[source]#

Retrieves Python environment details, including version and installed packages.

Returns:

Python details including version, implementation, and installed packages.

Return type:

dict

datafed_torchflow.computer.get_system_info()[source]#

Extracts CPU, memory, GPU, and Python environment details.

Returns:

A dictionary containing CPU, memory, GPU, and Python details.

Return type:

dict

datafed_torchflow.computer.save_to_json(data, filename='system_info.json')[source]#

Saves the given data to a JSON file.

Parameters:
  • data (dict) – The data to be saved.

  • filename (str) – The filename for the JSON file.
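
A minimal sketch combining these helpers to capture a machine snapshot and write it to disk (the filename is the default shown above):

from datafed_torchflow.computer import get_system_info, save_to_json

# Gather CPU, memory, GPU, and Python environment details.
info = get_system_info()

# Write the snapshot to a JSON file.
save_to_json(info, filename="system_info.json")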

datafed_torchflow.datafed module#

class datafed_torchflow.datafed.DataFed(datafed_path, local_model_path='./Trained Models', log_file_path='log.txt', dataset_id_or_path=None, download_kwargs={'orig_fname': True, 'wait': True}, upload_kwargs={'wait': True}, logging=False)[source]#

Bases: API

A class to interact with DataFed API.

Inherits from:

API: The base class for interacting with the DataFed API.

datafed_path#

DataFed path to store model script and checkpoints.

Type:

str

local_model_path#

Local directory to store model files.

Type:

str

log_file_path#

Local file to store a log of the code evaluation.

Type:

str

logging#

Flag to enable logging.

Type:

bool

project_id#

The ID of the project.

Type:

str

dataset_id#

The ID of the dataset.

Type:

str
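
A minimal instantiation sketch; the collection path and local directory below are hypothetical placeholders:

from datafed_torchflow.datafed import DataFed

# "my_project/model_checkpoints" is a hypothetical DataFed collection path.
df_api = DataFed(
    "my_project/model_checkpoints",
    local_model_path="./Trained Models",
    log_file_path="log.txt",
    logging=True,
)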

static addDerivedFrom(deps=None)[source]#

Adds derived from information to the data record, skipping any None values.

Parameters:

deps (list or str, optional) – A list of dependencies or a single dependency to add. Defaults to None.

Returns:

A list of lists containing the “derived from” information, excluding None entries.

Return type:

list

check_if_endpoint_set()[source]#

Checks if the Globus endpoint is set up.

Raises:

Exception – If the Globus endpoint is not set up.

check_if_file_data(file_name, path_name=None)[source]#

Check if a file exists in the specified data path.

Parameters:

file_name (str) – The name of the file to check.

Returns:

True if the file exists in the data path, False otherwise.

Return type:

bool

check_if_logged_in()[source]#

Checks if the user is authenticated with DataFed.

Raises:

Exception – If the user is not authenticated.

check_no_files(record_ids)[source]#

Checks if any of the specified DataFed records have no associated files.

Parameters:

record_ids (list) – A list of DataFed record IDs to check.

Returns:

A list of record IDs that have no associated files, or None if all records have files.

Return type:

list or None

static check_string_for_dot_or_slash(s)[source]#

Checks if a string starts with a ‘.’ or ‘/’ and raises an exception if it does.

Parameters:

s (str) – The string to check.

Raises:

ValueError – If the string starts with either ‘.’ or ‘/’.

create_subfolder_if_not_exists()[source]#

Creates sub-folders (collections) if they do not already exist.

Iterates through the sub-collections specified in datafed_collection, creating any that are missing. Updates collection_id with the ID of the last created or found sub-collection.

data_record_create(metadata=None, record_title=None, parent_collection=None, deps=None, **kwargs)[source]#

Creates the DataFed record for the saved checkpoint and uploads the relevant metadata.

Parameters:
  • metadata (dict) – The relevant model and system metadata for the checkpoint.

  • record_title (str) – The title of the DataFed record.

  • deps (list or str, optional) – A list of dependencies or a single dependency to add. Defaults to None.

Raises:

Exception – If the user is not authenticated or must re-authenticate.

data_record_update(record_id=None, record_title=None, metadata=None, deps=None, overwrite_metadata=False, **kwargs)[source]#

Updates the DataFed record for the saved checkpoint, including the relevant metadata if it has changed.

Parameters:
  • metadata (dict) – The relevant model and system metadata for the checkpoint.

  • record_title (str) – The title of the DataFed record.

  • deps (list or str, optional) – A list of dependencies or a single dependency to add. Defaults to None.

  • overwrite_metadata (bool, default=False) – Whether to overwrite the record metadata. If False, the new metadata is merged with the existing metadata; if True, it overwrites it.

Raises:

Exception – If the user is not authenticated or must re-authenticate.

exclude_keys(dict_list, excluded_keys)[source]#

Filters a list of dictionaries to exclude those that contain any of the specified excluded keys.

Parameters:
  • dict_list (list) – A list of dictionaries to filter.

  • excluded_keys (str, list, or set) – The keys that, if present in a dictionary, will exclude it from the result. Can be a single string, a list of strings, or a set of strings.

Returns:

A list of dictionaries that do not contain any of the specified excluded keys.

Return type:

list

Raises:

ValueError – If the excluded_keys parameter is not a string, list of strings, or set of strings.

static find_id_by_title(listing_reply, title_to_find)[source]#

Finds the ID of an item with a specific title from a listing response.

Parameters:
  • listing_reply (object) – The response object containing a list of items.

  • title_to_find (str) – The title of the item to find.

Returns:

The DataFed ID of the item with the specified title.

Return type:

str

Raises:

ValueError – If no item with the specified title is found.

getCollList(collection_id)[source]#

Retrieves a list of sub-collections within a specified collection.

Parameters:

collection_id (str) – The ID of the collection to query.

Returns:

A tuple containing:
  • list: The list of sub-collection titles.

  • ls_resp: The full response object from the API call.

Return type:

tuple

getCollectionProjectID()[source]#

Retrieves the project ID associated with a specific collection.

This method fetches the parent collection of the given collection ID and extracts the project ID from it.

Returns:

The project ID associated with the specified collection.

Return type:

str

getData(dataset_id=None)[source]#

Downloads the data from the dataset record.

getFileExtension()[source]#

Retrieves the file extension of the dataset file.

Returns:

The file extension of the dataset file, including the leading dot.

Return type:

str

getFileName(record_id)[source]#

Retrieves the file name (without extension) associated with a record ID.

Parameters:

record_id (str) – The ID of the record to retrieve the file name for.

Returns:

The file name without the extension.

Return type:

str

getIDs(listing_reply)[source]#

Gets the IDs of items from a listing response.

Parameters:

listing_reply (object) – The response object containing a list of items.

Returns:

A list of item IDs.

Return type:

list

getIDsInCollection(collection_id=None)[source]#

Gets the IDs of items in a collection.

Parameters:

collection_id (str) – The ID of the collection to query.

Returns:

A list of item IDs in the collection.

Return type:

list

getRecordTitle(record_id)[source]#

Retrieves the title of a record from its ID.

Parameters:

record_id (str) – The ID of the record to retrieve the title for.

Returns:

The title of the record.

Return type:

str

property getRootColl#

Gets the root collection identifier for the current project.

Returns:

The root collection identifier formatted with the project ID.

Return type:

str

get_metadata(collection_id=None, exclude_metadata=None, excluded_keys=None, non_unique=None, format='pandas')[source]#

Retrieves the metadata record for a specified record ID.

Parameters:
  • collection_id (str) – The ID of the collection to retrieve metadata from.

  • exclude_metadata (str, list, or None, optional) – Metadata fields to exclude from the extraction record.

  • excluded_keys (str, list, or None, optional) – Keys that, if present in a metadata record, cause that record to be excluded.

  • non_unique (str, list, or None, optional) – Keys whose values are expected to differ between records even when the records are otherwise identical (e.g. record IDs or timestamps); these keys are not considered when finding unique records.

  • format (str, optional) – The format to return the metadata in. Defaults to “pandas”.

Returns:

The metadata record.

Return type:

dict
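
A usage sketch; the collection path and collection ID below are hypothetical placeholders:

from datafed_torchflow.datafed import DataFed

df_api = DataFed("my_project/model_checkpoints")  # hypothetical collection path

# Pull the checkpoint metadata in the default pandas format, dropping computing
# details and any records that contain a "script" key.
metadata = df_api.get_metadata(
    collection_id="c/12345678",  # hypothetical collection ID
    exclude_metadata="computing",
    excluded_keys="script",
    non_unique=["id", "timestamp"],
    format="pandas",
)
print(metadata)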

get_notebook_DataFed_ID_from_path_and_title(notebook_filename, path_id=None)[source]#

Gets the DataFed ID for the Jupyter notebook from its file name and DataFed path.

Parameters:

notebook_filename (str) – The filename of the notebook. Can be the local filepath or just the filename.

Returns:

The DataFed ID of the specified notebook

Return type:

str

Raises:

ValueError – If no item with the specified title is found.

property get_projects#

Retrieves a list of projects from DataFed.

Parameters:

count (int, optional) – The number of projects to retrieve. Defaults to 500.

Returns:

A tuple containing:
  • list: The list of projects.

  • response: The full response object from the API call.

Return type:

tuple

static get_unique_dicts(dict_list, exclude_keys=None)[source]#

Filters a list of dictionaries to include only unique dictionaries, excluding specified keys.

Parameters:
  • dict_list (list) – A list of dictionaries to filter for uniqueness.

  • exclude_keys (list or None, optional) – Keys to exclude when determining uniqueness. Defaults to None.

Returns:

A list of unique dictionaries, excluding specified keys from the uniqueness check.

Return type:

list

identify_collection_id()[source]#
joinPath(file_name, path_name=None)[source]#

Joins the data path and the file name to create a full file path.

Parameters:
  • file_name (str) – The name of the file.

  • path_name (str, optional) – The name of the file path. Defaults to self.local_model_path.

Returns:

The full file path.

Return type:

str

replace_missing_records(collection_id=None, file_path=None, upload_kwargs=None, logging=True)[source]#
static required_keys(self, dict_list, required_keys)[source]#

Filters a list of dictionaries to include only those that contain all specified required keys.

Parameters:
  • dict_list (list) – A list of dictionaries to filter.

  • required_keys (str, list, or set) – The keys that each dictionary must contain. Can be a single string, a list of strings, or a set of strings.

Returns:

A list of dictionaries that contain all the specified required keys.

Return type:

list

Raises:

ValueError – If the required_keys parameter is not a string, list of strings, or set of strings.

upload_dataset_to_DataFed()[source]#

Checks whether the dataset record already exists on DataFed and uploads it to a collection called “dataset” (created if necessary) whose parent collection is self.collection_id (where the checkpoints are stored). Works with any number of dataset files, specified by torchlogger.dataset_id_or_path when the torchlogger is instantiated. The dataset files can be specified either by file name or by DataFed ID (a string for a single dataset file, or a list of strings for multiple dataset files).

Parameters:

None

Returns:

The DataFed record ID for the dataset files, as a string for a single dataset file and a list of strings for multiple dataset files.
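
A minimal sketch, assuming the dataset file is named when the DataFed helper is created; the collection path and file path are hypothetical:

from datafed_torchflow.datafed import DataFed

df_api = DataFed(
    "my_project/model_checkpoints",               # hypothetical collection path
    dataset_id_or_path="./data/training_set.h5",  # hypothetical dataset file
)

# Returns the DataFed record ID (a string here; a list for multiple files).
dataset_record_id = df_api.upload_dataset_to_DataFed()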

upload_file(DataFed_ID, file_path, wait=False)[source]#

Uploads the file to the DataFed record.

Parameters:
  • DataFed_ID (str) – The DataFed ID of the data record to upload the file to.

  • file_path (str) – The local filepath of the file to upload to DataFed

  • wait (bool, optional) – Whether to pause the script until the file has been uploaded. Defaults to False.

property user_id#

Gets the user ID from the authenticated user’s information.

Returns:

The user ID extracted from the authenticated user information.

Return type:

str

datafed_torchflow.pytorch module#

class datafed_torchflow.pytorch.InferenceEvaluation(dataframe, dataset, df_api, root_directory=None, save_directory='./tmp/', skip=None, **Kwargs)[source]#

Bases: object

build_model()[source]#

Builds and returns the model to be used for inference.

This method should be implemented by the child class to define the specific model architecture and any necessary configurations.

Returns:

The model object to be used for inference.

Return type:

torch.nn.Module

evaluate(row, file_path)[source]#

Evaluates the model on the given data. This method must be implemented by a child class; the parent class does not implement it.

Parameters:
  • row (pd.Series) – A row from the dataframe containing metadata and other information.

  • file_path (str) – The path to the file to be used for evaluation.

Returns:

The evaluation results as a dictionary.

Return type:

dict

file_not_found(filename, row)[source]#
static get_first_entry_if_list(data)[source]#
run()[source]#
run_inference(row)[source]#
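
The parent class leaves build_model() and evaluate() to the child class. A subclassing sketch under that assumption; the architecture, checkpoint layout, and loss are hypothetical stand-ins:

import torch
import torch.nn as nn

from datafed_torchflow.pytorch import InferenceEvaluation

class MyEvaluation(InferenceEvaluation):
    def build_model(self):
        # Return the architecture the saved checkpoints will be loaded into.
        return nn.Sequential(nn.Linear(8, 4), nn.Linear(4, 8))

    def evaluate(self, row, file_path):
        # Load the checkpoint weights and score a dummy batch
        # (the "model" key reflects a hypothetical checkpoint layout).
        model = self.build_model()
        model.load_state_dict(torch.load(file_path)["model"])
        with torch.no_grad():
            loss = nn.functional.mse_loss(model(torch.randn(4, 8)), torch.randn(4, 8))
        return {"mse": float(loss)}

A concrete evaluator would then be constructed with the checkpoint dataframe, the dataset, and a DataFed instance, as in the class signature above, and driven with run().
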
class datafed_torchflow.pytorch.TorchLogger(model_dict, DataFed_path, script_path=None, local_model_path='/.', log_file_path='log.txt', input_data_shape=None, dataset_id_or_path=None, logging=False, download_kwargs={'orig_fname': True, 'wait': True})[source]#

Bases: object

TorchLogger is a class designed to log PyTorch model training details, including model architecture, optimizer state, and system information. It also integrates with the DataFed API for file and metadata management.

model_dict#

A dictionary containing the PyTorch model architecture to be logged, with the name of each block as the key and the block as the value. For example: {"vae": vae, "encoder": encoder, "decoder": decoder, "optimizer": optimizer}.

Type:

dict

DataFed_path#

The path to the DataFed configuration or API.

Type:

str

script_path#

Path to the script or notebook for checksum calculation.

Type:

str

local_model_path#

Local directory to store model files.

Type:

str

input_data_shape#

Shape of the input training data for the model.

Type:

tuple

logging#

Whether to display logging output.

Type:

bool
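
An instantiation sketch; the model blocks and DataFed path are hypothetical placeholders:

import torch
import torch.nn as nn

from datafed_torchflow.pytorch import TorchLogger

# Hypothetical two-block model; replace with your own architecture.
encoder = nn.Linear(8, 4)
decoder = nn.Linear(4, 8)
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)

logger = TorchLogger(
    model_dict={"encoder": encoder, "decoder": decoder, "optimizer": optimizer},
    DataFed_path="my_project/model_checkpoints",  # hypothetical DataFed path
    local_model_path="./Trained Models",
    logging=True,
)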

getMetadata(local_vars=None, model_hyperparameters=None, **kwargs)[source]#

Gathers metadata including the serialized model, optimizer, system info, and user details.

Parameters:
  • local_vars (list) – A list containing the local variables of the model training code, from list(locals().items()). Used to determine the metadata.

  • **kwargs – Additional key-value pairs to be added to the metadata.

Returns:

A dictionary containing the metadata, including the model, optimizer, system information, user, timestamp, and optional script checksum.

Return type:

dict

getModelArchitectureStateDict()[source]#

Generates a dictionary where each key is a model architecture block name and each value is the corresponding state dictionary to include in the saved checkpoint.

Returns:

A dictionary containing the model architecture state dictionaries

Return type:

dict

getUserClock()[source]#

Gathers system information including CPU, memory, and GPU details.

Returns:

A dictionary containing system information.

Return type:

dict

property optimizer#

Returns the optimizer used for training.

Returns:

The optimizer instance.

Return type:

torch.optim.Optimizer

reset()[source]#
save(record_file_name, datafed=True, local_file_path=None, local_vars=None, model_hyperparameters=None, **kwargs)[source]#

Saves the model’s state dictionary locally (unless one has already been saved) and optionally uploads it to DataFed along with the model’s metadata. To upload multiple files to the same DataFed data record, zip them together and pass the local path to the zip file as local_file_path.

Parameters:
  • record_file_name (str) – The name of the file to save the model locally.

  • datafed (bool, optional) – If True, the record is uploaded to DataFed. Default is True.

  • local_file_path (str or Path.PosixPath, optional) – The local file path to the directory to save the weights or to the presaved file to upload to DataFed.

  • local_vars (list) – A list containing the local variables of the model training code, from list(locals().items()). Used to determine the metadata.

  • model_hyperparameters (dict) – A dictionary where the keys are the model hyperparameter names and the values are the hyperparameter values. Used in the saved checkpoint.

  • **kwargs – Additional metadata or attributes to include in the record.
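
A training-loop sketch that saves a checkpoint each epoch. It continues the instantiation sketch above (encoder, decoder, optimizer, logger); the hyperparameter values and extra metadata keys are illustrative only:

# Continues the TorchLogger sketch above.
x = torch.randn(32, 8)
for epoch in range(2):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(decoder(encoder(x)), x)
    loss.backward()
    optimizer.step()

    # Save the checkpoint locally and push it (with metadata) to DataFed.
    logger.save(
        f"checkpoint_epoch_{epoch}.pth",
        datafed=True,
        local_vars=list(locals().items()),
        model_hyperparameters={"learning_rate": 1e-3, "batch_size": 32},
        epoch=epoch,
        train_loss=float(loss),
    )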

save_notebook()[source]#

Saves the Jupyter notebook that runs the model training code.

class datafed_torchflow.pytorch.TorchViewer(DataFed_path, **kwargs)[source]#

Bases: Module

getModelCheckpoints(exclude_metadata='computing', excluded_keys='script', non_unique=['id', 'timestamp', 'total_time'], format='pandas')[source]#

Retrieves the metadata records for the saved model checkpoints.

Parameters:
  • exclude_metadata (str, list, or None, optional) – Metadata fields to exclude from the extraction record.

  • excluded_keys (str, list, or None, optional) – Keys that, if present in a metadata record, cause that record to be excluded.

  • non_unique (str, list, or None, optional) – Keys whose values are expected to differ between records even when the records are otherwise identical (e.g. record IDs or timestamps); these keys are not considered when finding unique records.

  • format (str, optional) – The format to return the metadata in. Defaults to “pandas”.

Returns:

The metadata record.

Return type:

dict
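
A usage sketch with a hypothetical DataFed path:

from datafed_torchflow.pytorch import TorchViewer

viewer = TorchViewer("my_project/model_checkpoints")  # hypothetical DataFed path

# Retrieve checkpoint metadata in the default pandas format, ignoring computing
# details and the record ID / timestamp fields when de-duplicating.
checkpoints = viewer.getModelCheckpoints(
    exclude_metadata="computing",
    excluded_keys="script",
    non_unique=["id", "timestamp", "total_time"],
    format="pandas",
)
print(checkpoints)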

datafed_torchflow.utils module#

datafed_torchflow.utils.extract_instance_attributes(obj={})[source]#

Recursively extracts attributes from class instances, converting NumPy integers to Python int, NumPy arrays and Torch tensors to lists, while ignoring keys that start with ‘_’.

This helper function traverses the attributes of a given object and returns a dictionary representation of those attributes. If the object has a __dict__ attribute, it means the object is likely an instance of a class, and its attributes are stored in __dict__. The function will recursively call itself to extract attributes from nested objects, convert any NumPy integers to Python int, and convert NumPy arrays and Torch tensors to lists.

Parameters:

obj (object) – The object from which to extract attributes. Defaults to an empty dictionary.

Returns:

A dictionary containing the extracted attributes, excluding those whose keys start with ‘_’.

Return type:

dict
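
A small sketch of the conversion behavior on a hypothetical class instance:

import numpy as np

from datafed_torchflow.utils import extract_instance_attributes

class HypotheticalConfig:
    def __init__(self):
        self.epochs = np.int64(10)             # numpy integer -> Python int
        self.layer_sizes = np.array([64, 32])  # numpy array -> list
        self._cache = {}                       # keys starting with "_" are ignored

print(extract_instance_attributes(HypotheticalConfig()))
# Expected shape of the output: {'epochs': 10, 'layer_sizes': [64, 32]}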

datafed_torchflow.utils.getNotebookMetadata(file)[source]#

Calculates the checksum of the script or notebook file and includes it in the metadata.

Returns:

A dictionary containing the path and checksum of the script or notebook file.

Return type:

dict

datafed_torchflow.utils.get_return_variables(func)[source]#
datafed_torchflow.utils.is_jsonable(x)[source]#
datafed_torchflow.utils.serialize_model(model_block)[source]#

Serializes the model architecture into a dictionary format with detailed layer information.

Returns:

A dictionary containing the model’s architecture with layer types, names, and configurations.

Return type:

dict

datafed_torchflow.utils.serialize_pytorch_optimizer(optimizer)[source]#

Serializes the optimizer’s state dictionary, converting tensors to lists for JSON compatibility.

Returns:

A dictionary containing the optimizer’s serialized parameters.

Return type:

dict
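
A sketch serializing a small model block and its optimizer into JSON-friendly metadata; the block itself is a hypothetical stand-in:

import torch
import torch.nn as nn

from datafed_torchflow.utils import serialize_model, serialize_pytorch_optimizer

block = nn.Sequential(nn.Linear(8, 4), nn.ReLU())
optimizer = torch.optim.Adam(block.parameters(), lr=1e-3)

model_metadata = serialize_model(block)                       # layer types, names, configurations
optimizer_metadata = serialize_pytorch_optimizer(optimizer)   # tensors converted to lists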

Module contents#