pantheonrl.algos.modular.policies

Implementation of the policy for the ModularAlgorithm

Classes

ModularPolicy

Policy class for actor-critic algorithms (has both policy and value prediction). Used by A2C, PPO and the likes. :param observation_space: (gym.spaces.Space) Observation space :param action_space: (gym.spaces.Space) Action space :param lr_schedule: (Callable) Learning rate schedule (could be constant) :param net_arch: ([int or dict]) The specification of the policy and value networks. :param device: (str or torch.device) Device on which the code should run. :param activation_fn: (Type[nn.Module]) Activation function :param ortho_init: (bool) Whether to use or not orthogonal initialization :param use_sde: (bool) Whether to use State Dependent Exploration or not :param log_std_init: (float) Initial value for the log standard deviation :param full_std: (bool) Whether to use (n_features x n_actions) parameters for the std instead of only (n_features,) when using gSDE :param sde_net_arch: ([int]) Network architecture for extracting features when using gSDE. If None, the latent features from the policy will be used. Pass an empty list to use the states as features. :param use_expln: (bool) Use expln() function instead of exp() to ensure a positive standard deviation (cf paper). It allows to keep variance above zero and prevent it from growing too fast. In practice, exp() is usually enough. :param squash_output: (bool) Whether to squash the output using a tanh function, this allows to ensure boundaries when using gSDE. :param features_extractor_class: (Type[BaseFeaturesExtractor]) Features extractor to use. :param features_extractor_kwargs: (Optional[Dict[str, Any]]) Keyword arguments to pass to the feature extractor. :param normalize_images: (bool) Whether to normalize images or not, dividing by 255.0 (True by default) :param optimizer_class: (Type[torch.optim.Optimizer]) The optimizer to use, torch.optim.Adam by default :param optimizer_kwargs: (Optional[Dict[str, Any]]) Additional keyword arguments, excluding the learning rate, to pass to the optimizer.