datafed_torchflow package#
Submodules#
datafed_torchflow.JSON module#
- class datafed_torchflow.JSON.UniversalEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]#
Bases:
JSONEncoder
A custom JSON encoder that can handle numpy data types, sets, and objects with __dict__ attributes.
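A minimal usage sketch: pass the encoder to json.dumps via cls so that NumPy values and sets, which the standard encoder rejects, serialize cleanly. The payload below is illustrative.

```python
import json

import numpy as np

from datafed_torchflow.JSON import UniversalEncoder

# NumPy scalars, arrays, and sets are not JSON-serializable by default;
# UniversalEncoder converts them to plain Python types.
payload = {
    "learning_rate": np.float32(1e-3),
    "epochs": np.int64(50),
    "layer_sizes": np.array([128, 64, 32]),
    "tags": {"vae", "baseline"},
}
print(json.dumps(payload, cls=UniversalEncoder, indent=2))
```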
datafed_torchflow.computer module#
- datafed_torchflow.computer.get_cpu_info()[source]#
Retrieves CPU information.
- Returns:
CPU details including physical cores, total cores, frequency, and usage.
- Return type:
dict
- datafed_torchflow.computer.get_gpu_info()[source]#
Retrieves GPU information using GPUtil.
- Returns:
GPU details such as model, memory, and load.
- Return type:
dict
- datafed_torchflow.computer.get_memory_info()[source]#
Retrieves memory information.
- Returns:
Memory details including total, available, used, and percentage used.
- Return type:
dict
- datafed_torchflow.computer.get_python_info()[source]#
Retrieves Python environment details, including version and installed packages.
- Returns:
Python details including version, implementation, and installed packages.
- Return type:
dict
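A short sketch combining the four helpers into a single system snapshot, as one might attach to a training-run record (the grouping into one dictionary is illustrative):

```python
from datafed_torchflow import computer

# Collect a snapshot of the current machine's hardware and Python
# environment, useful as system metadata for a training run.
system_info = {
    "cpu": computer.get_cpu_info(),
    "gpu": computer.get_gpu_info(),  # uses GPUtil to query available GPUs
    "memory": computer.get_memory_info(),
    "python": computer.get_python_info(),
}
print(system_info["cpu"])
```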
datafed_torchflow.datafed module#
- class datafed_torchflow.datafed.DataFed(datafed_path, local_model_path='./Trained Models', log_file_path='log.txt', dataset_id_or_path=None, download_kwargs={'orig_fname': True, 'wait': True}, upload_kwargs={'wait': True}, logging=False)[source]#
Bases:
API
A class to interact with the DataFed API.
- Inherits from:
API: The base class for interacting with the DataFed API.
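A hedged construction sketch; the DataFed path is a placeholder for your own project/collection hierarchy:

```python
from datafed_torchflow.datafed import DataFed

# "project_name/model_checkpoints" is a hypothetical DataFed path;
# replace it with your own project and collection names.
df_api = DataFed(
    "project_name/model_checkpoints",
    local_model_path="./Trained Models",
    log_file_path="log.txt",
    logging=True,
)

df_api.check_if_logged_in()     # raises if the user is not authenticated
df_api.check_if_endpoint_set()  # raises if no Globus endpoint is configured
```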
- static addDerivedFrom(deps=None)[source]#
Adds “derived from” dependency information to the data record, skipping any None values.
- check_if_endpoint_set()[source]#
Checks if the Globus endpoint is set up.
- Raises:
Exception – If the Globus endpoint is not set up.
- check_if_file_data(file_name, path_name=None)[source]#
Check if a file exists in the specified data path.
- check_if_logged_in()[source]#
Checks if the user is authenticated with DataFed.
- Raises:
Exception – If the user is not authenticated.
- check_no_files(record_ids)[source]#
Checks if any of the specified DataFed records have no associated files.
- static check_string_for_dot_or_slash(s)[source]#
Checks if a string starts with a ‘.’ or ‘/’ and raises an exception if it does.
- Parameters:
s (str) – The string to check.
- Raises:
ValueError – If the string starts with either ‘.’ or ‘/’.
- create_subfolder_if_not_exists()[source]#
Creates sub-folders (collections) if they do not already exist.
Iterates through the sub-collections specified in datafed_collection, creating any that are missing. Updates collection_id with the ID of the last created or found sub-collection.
- data_record_create(metadata=None, record_title=None, parent_collection=None, deps=None, **kwargs)[source]#
Creates the DataFed record for the saved checkpoint and uploads the relevant metadata.
- Parameters:
metadata (dict) – The relevant model and system metadata for the checkpoint.
record_title (str) – The title of the DataFed record.
parent_collection (str, optional) – The collection in which to create the record. Defaults to None.
deps (list or str, optional) – A list of dependencies or a single dependency to add. Defaults to None.
- Raises:
Exception – If the user is not authenticated or must re-authenticate.
- data_record_update(record_id=None, record_title=None, metadata=None, deps=None, overwrite_metadata=False, **kwargs)[source]#
Updates the DataFed record for the saved checkpoint, including the relevant metadata if it changed.
- Parameters:
record_id (str, optional) – The ID of the record to update. Defaults to None.
metadata (dict) – The relevant model and system metadata for the checkpoint.
record_title (str) – The title of the DataFed record.
deps (list or str, optional) – A list of dependencies or a single dependency to add. Defaults to None.
overwrite_metadata (bool, default=False) – Whether to overwrite the record metadata. If False, merges with the existing metadata; if True, overwrites it.
- Raises:
Exception – If the user is not authenticated or must re-authenticate.
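A sketch of the create-then-update flow, assuming data_record_create returns the new record's ID; the path, title, and metadata values are illustrative:

```python
from datafed_torchflow.datafed import DataFed

df_api = DataFed("project_name/model_checkpoints")  # placeholder path

# Create a record for a checkpoint with its initial metadata.
metadata = {"epoch": 10, "train_loss": 0.042}
record_id = df_api.data_record_create(
    metadata=metadata,
    record_title="model_epoch_10",
)

# Later, merge additional metadata into the same record.
df_api.data_record_update(
    record_id=record_id,
    metadata={"val_loss": 0.051},
    overwrite_metadata=False,  # merge rather than overwrite
)
```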
- exclude_keys(dict_list, excluded_keys)[source]#
Filters a list of dictionaries to exclude those that contain any of the specified excluded keys.
- Parameters:
dict_list (list of dict) – The list of dictionaries to filter.
excluded_keys (str, list of str, or set of str) – The key or keys whose presence excludes a dictionary.
- Returns:
A list of dictionaries that do not contain any of the specified excluded keys.
- Return type:
list
- Raises:
ValueError – If the excluded_keys parameter is not a string, list of strings, or set of strings.
- static find_id_by_title(listing_reply, title_to_find)[source]#
Finds the ID of an item with a specific title from a listing response.
- Parameters:
listing_reply – The listing response returned by the DataFed API.
title_to_find (str) – The title to search for.
- Returns:
The DataFed ID of the item with the specified title.
- Return type:
str
- Raises:
ValueError – If no item with the specified title is found.
- getCollList(collection_id)[source]#
Retrieves a list of sub-collections within a specified collection.
- getCollectionProjectID()[source]#
Retrieves the project ID associated with a specific collection.
This method fetches the parent collection of the given collection ID and extracts the project ID from it.
- Returns:
The project ID associated with the specified collection.
- Return type:
str
- getFileExtension()[source]#
Retrieves the file extension of the dataset file.
- Returns:
The file extension of the dataset file, including the leading dot.
- Return type:
str
- getFileName(record_id)[source]#
Retrieves the file name (without extension) associated with a record ID.
- getIDsInCollection(collection_id=None)[source]#
Gets the IDs of items in a collection.
- Parameters:
collection_id (str) – The ID of the collection to query.
- Returns:
A list of item IDs in the collection.
- Return type:
list
- property getRootColl#
Gets the root collection identifier for the current project.
- Returns:
The root collection identifier formatted with the project ID.
- Return type:
str
- get_metadata(collection_id=None, exclude_metadata=None, excluded_keys=None, non_unique=None, format='pandas')[source]#
Retrieves the metadata records from a specified collection.
- Parameters:
collection_id (str) – The ID of the collection to retrieve metadata from.
exclude_metadata (str, list, or None, optional) – Metadata fields to exclude from the extracted records.
excluded_keys (str, list, or None, optional) – Keys whose presence in a metadata record causes that record to be excluded.
non_unique (str, list, or None, optional) – Keys expected to differ between records regardless of content (e.g., IDs or timestamps); these are ignored when identifying unique records.
format (str, optional) – The format to return the metadata in. Defaults to “pandas”.
- Returns:
The metadata record.
- Return type:
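A hedged sketch; the DataFed path and collection ID are placeholders, and the comment notes an assumption about the default return format:

```python
from datafed_torchflow.datafed import DataFed

df_api = DataFed("project_name/model_checkpoints")  # placeholder path

# Pull metadata for every record in a collection; "c/12345678" is a
# placeholder collection ID.
metadata = df_api.get_metadata(
    collection_id="c/12345678",
    excluded_keys="script",
    non_unique=["id", "timestamp"],
    format="pandas",
)
# Assuming the default "pandas" format returns a DataFrame:
print(metadata.head())
```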
- get_notebook_DataFed_ID_from_path_and_title(notebook_filename, path_id=None)[source]#
Gets the DataFed ID for the Jupyter notebook from the file name and DataFed path.
- Parameters:
notebook_filename (str) – The filename of the notebook. Can be the local filepath or just the filename.
path_id (str, optional) – The DataFed ID of the collection to search. Defaults to None.
- Returns:
The DataFed ID of the specified notebook
- Return type:
str
- Raises:
ValueError – If no item with the specified title is found.
- property get_projects#
Retrieves a list of projects from DataFed.
- static get_unique_dicts(dict_list, exclude_keys=None)[source]#
Filters a list of dictionaries to include only unique dictionaries, excluding specified keys.
- joinPath(file_name, path_name=None)[source]#
Joins the data path and the file name to create a full file path.
- replace_missing_records(collection_id=None, file_path=None, upload_kwargs=None, logging=True)[source]#
Re-uploads local files for records in the specified collection that have no associated file data.
- static required_keys(self, dict_list, required_keys)[source]#
Filters a list of dictionaries to include only those that contain all specified required keys.
- Parameters:
dict_list (list of dict) – The list of dictionaries to filter.
required_keys (str, list of str, or set of str) – The key or keys that each dictionary must contain.
- Returns:
A list of dictionaries that contain all the specified required keys.
- Return type:
list
- Raises:
ValueError – If the required_keys parameter is not a string, list of strings, or set of strings.
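A hedged sketch of the dictionary-filtering helpers. The record dictionaries and path are illustrative; note that required_keys is documented as a static method whose signature still takes self, so a placeholder is passed in its place:

```python
from datafed_torchflow.datafed import DataFed

# Illustrative checkpoint metadata records.
records = [
    {"epoch": 1, "loss": 0.9, "script": "train.py"},
    {"epoch": 2, "loss": 0.7},
    {"epoch": 2, "loss": 0.7, "id": "d/001"},
]

# Keep only records that document both "epoch" and "loss".
# None stands in for self, per the documented signature.
complete = DataFed.required_keys(None, records, {"epoch", "loss"})

# Deduplicate, ignoring the per-record "id" when comparing.
unique = DataFed.get_unique_dicts(records, exclude_keys=["id"])

# Drop records that carry a "script" key (instance method).
df_api = DataFed("project_name/model_checkpoints")  # placeholder path
no_script = df_api.exclude_keys(records, "script")
```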
- upload_dataset_to_DataFed()[source]#
Checks whether the dataset record already exists on DataFed and, if not, uploads it to a collection called “dataset” (created if necessary) whose parent collection is self.collection_id (where the checkpoints are stored). Works with any number of dataset files, specified via torchlogger.dataset_id_or_path when the torchlogger is instantiated. The dataset files can be given as file names or DataFed IDs (a string for a single dataset file, a list of strings for multiple dataset files).
- Parameters:
None
- Returns:
The DataFed record ID for the dataset files, as a string for a single dataset file and a list of strings for multiple dataset files.
datafed_torchflow.pytorch module#
- class datafed_torchflow.pytorch.InferenceEvaluation(dataframe, dataset, df_api, root_directory=None, save_directory='./tmp/', skip=None, **Kwargs)[source]#
Bases:
object
- build_model()[source]#
Builds and returns the model to be used for inference.
This method should be implemented by the child class to define the specific model architecture and any necessary configurations.
- Returns:
The model object to be used for inference.
- Return type:
torch.nn.Module
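A sketch of a hypothetical subclass overriding build_model; the architecture below is illustrative and not part of the package:

```python
import torch.nn as nn

from datafed_torchflow.pytorch import InferenceEvaluation


class MyEvaluation(InferenceEvaluation):
    """Hypothetical subclass used only for illustration."""

    def build_model(self):
        # Return the torch.nn.Module to run inference with.
        return nn.Sequential(
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 10),
        )
```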
- class datafed_torchflow.pytorch.TorchLogger(model_dict, DataFed_path, script_path=None, local_model_path='/.', log_file_path='log.txt', input_data_shape=None, dataset_id_or_path=None, logging=False, download_kwargs={'orig_fname': True, 'wait': True})[source]#
Bases:
object
TorchLogger is a class designed to log PyTorch model training details, including model architecture, optimizer state, and system information. It also integrates with the DataFed API for file and metadata management.
- model_dict#
A dictionary containing the PyTorch model architecture to be logged, with the name of each block as the key and the block as the value. For example: {"vae": vae, "encoder": encoder, "decoder": decoder, "optimizer": optimizer}.
- Type:
dict
- getMetadata(local_vars=None, model_hyperparameters=None, **kwargs)[source]#
Gathers metadata including the serialized model, optimizer, system info, and user details.
- Parameters:
local_vars (list) – A list containing the local variables from the model training code, from list(locals().items()); used to determine the metadata.
model_hyperparameters (dict) – A dictionary mapping model hyperparameter names to their values.
**kwargs – Additional key-value pairs to be added to the metadata.
- Returns:
A dictionary containing the metadata, including model, optimizer, system information, user, timestamp, and an optional script checksum.
- Return type:
dict
- getModelArchitectureStateDict()[source]#
Generates a dictionary where each key is a model architecture block name and each value is the corresponding state dictionary to be included in the saved checkpoint.
- Returns:
A dictionary containing the model architecture state dictionaries
- Return type:
dict
- getUserClock()[source]#
Gathers system information including CPU, memory, and GPU details.
- Returns:
A dictionary containing system information.
- Return type:
dict
- property optimizer#
Returns the optimizer used for training.
- Returns:
The optimizer instance.
- Return type:
torch.optim.Optimizer
- save(record_file_name, datafed=True, local_file_path=None, local_vars=None, model_hyperparameters=None, **kwargs)[source]#
Saves the model’s state dictionary locally (unless one has already been saved) and optionally uploads it to DataFed along with the model’s metadata. To upload multiple files to the same DataFed data record, zip them together and pass the local path of the zip file as local_file_path.
- Parameters:
record_file_name (str) – The name of the file to save the model locally.
datafed (bool, optional) – If True, the record is uploaded to DataFed. Default is True.
local_file_path (str or pathlib.PosixPath, optional) – The local path to the directory in which to save the weights, or to a pre-saved file to upload to DataFed.
local_vars (list) – A list containing the local variables from the model training code, from list(locals().items()); used to determine the metadata.
model_hyperparameters (dict) – A dictionary where the keys are the model hyperparameter names and the values are the corresponding hyperparameter values. Used in the saved checkpoint.
**kwargs – Additional metadata or attributes to include in the record.
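A hedged end-to-end sketch: build a TorchLogger around an illustrative model and optimizer, then save a checkpoint from inside a training loop. The DataFed path, file name, and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn

from datafed_torchflow.pytorch import TorchLogger

# Illustrative model blocks and optimizer.
encoder = nn.Linear(64, 16)
decoder = nn.Linear(16, 64)
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)

logger = TorchLogger(
    model_dict={"encoder": encoder, "decoder": decoder, "optimizer": optimizer},
    DataFed_path="project_name/model_checkpoints",  # placeholder path
    local_model_path="./Trained Models",
)

# Inside a training loop: save a checkpoint and log it to DataFed.
logger.save(
    "model_epoch_10.pkl",
    datafed=True,
    local_vars=list(locals().items()),
    model_hyperparameters={"lr": 1e-3, "latent_dim": 16},
    epoch=10,  # extra kwargs are added to the record metadata
)
```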
- class datafed_torchflow.pytorch.TorchViewer(DataFed_path, **kwargs)[source]#
Bases:
Module
- getModelCheckpoints(exclude_metadata='computing', excluded_keys='script', non_unique=['id', 'timestamp', 'total_time'], format='pandas')[source]#
Retrieves the metadata records for the stored model checkpoints.
- Parameters:
exclude_metadata (str, list, or None, optional) – Metadata fields to exclude from the extracted records.
excluded_keys (str, list, or None, optional) – Keys whose presence in a metadata record causes that record to be excluded.
non_unique (str, list, or None, optional) – Keys expected to differ between records regardless of content (e.g., IDs or timestamps); these are ignored when identifying unique records.
format (str, optional) – The format to return the metadata in. Defaults to “pandas”.
- Returns:
The metadata record.
- Return type:
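A short sketch; the DataFed path is a placeholder, and the defaults drop computing info, skip records carrying a "script" key, and ignore id/timestamp/total_time when deduplicating:

```python
from datafed_torchflow.pytorch import TorchViewer

viewer = TorchViewer("project_name/model_checkpoints")  # placeholder path

checkpoints = viewer.getModelCheckpoints()
# Assuming the default "pandas" format returns a DataFrame:
print(checkpoints.head())
```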
datafed_torchflow.utils module#
- datafed_torchflow.utils.extract_instance_attributes(obj={})[source]#
Recursively extracts attributes from class instances, converting NumPy integers to Python int, NumPy arrays and Torch tensors to lists, while ignoring keys that start with ‘_’.
This helper function traverses the attributes of a given object and returns a dictionary representation of those attributes. If the object has a __dict__ attribute, it means the object is likely an instance of a class, and its attributes are stored in __dict__. The function will recursively call itself to extract attributes from nested objects, convert any NumPy integers to Python int, and convert NumPy arrays and Torch tensors to lists.
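A minimal sketch with a hypothetical class, showing the documented conversions and the skipping of underscore-prefixed attributes:

```python
import numpy as np

from datafed_torchflow.utils import extract_instance_attributes


class TrainingConfig:
    """Hypothetical class used only for illustration."""

    def __init__(self):
        self.epochs = np.int64(50)              # NumPy int -> Python int
        self.layer_sizes = np.array([64, 32])   # NumPy array -> list
        self._cache = {}                        # skipped: starts with "_"


print(extract_instance_attributes(TrainingConfig()))
# e.g. {'epochs': 50, 'layer_sizes': [64, 32]}
```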
- datafed_torchflow.utils.getNotebookMetadata(file)[source]#
Calculates the checksum of the script or notebook file and includes it in the metadata.
- Returns:
A dictionary containing the path and checksum of the script or notebook file.
- Return type:
dict
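A one-line usage sketch; the notebook filename is a placeholder for your own script or notebook:

```python
from datafed_torchflow.utils import getNotebookMetadata

# "train_model.ipynb" is a placeholder path.
metadata = getNotebookMetadata("train_model.ipynb")
print(metadata)  # contains the file path and its checksum
```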