POMDPPlanners.environments.light_dark_pomdp package

Submodules

POMDPPlanners.environments.light_dark_pomdp.continuous_light_dark_pomdp module

Continuous Light-Dark POMDP Environment Implementation.

This module implements the continuous Light-Dark domain, a classic POMDP benchmark where an agent must navigate to a goal position in a continuous 2D space while dealing with position-dependent observation noise.

The Continuous Light-Dark POMDP features:

- Continuous 2D state space representing agent position
- Discrete or continuous action space for movement
- Light source at a specific location that affects observation quality
- Observation noise that decreases closer to the light source
- Goal region that the agent must reach to maximize reward
- Optional obstacles that cause negative rewards when hit

Key characteristics:

- State: [x, y] position in continuous 2D space
- Actions: Movement vectors or discrete directions
- Observations: Noisy position estimates (noise depends on distance from light)
- Rewards: Goal-reaching bonus, movement costs, obstacle penalties
- Multiple reward model variants available

Classes:

RewardModelType: Enumeration of available reward model types
ObservationModelType: Enumeration of available observation model types
ContinuousLightDarkStateTransitionModel: Continuous movement with Gaussian noise
ContinuousLightDarkPOMDP: Main environment class
ContinuousLightDarkPOMDPDiscreteActions: Discrete action variant
ContinuousLightDarkPOMDPMetrics: Metric names for the environment

class POMDPPlanners.environments.light_dark_pomdp.continuous_light_dark_pomdp.ContinuousLightDarkPOMDP(discount_factor, name='ContinuousLightDarkPOMDP', state_transition_cov_matrix=array([[0.05, 0.], [0., 0.05]]), observation_cov_matrix=array([[0.05, 0.], [0., 0.05]]), beacons=[(0, 0), (0, 5), (0, 10), (5, 0), (5, 5), (5, 10), (10, 0), (10, 5), (10, 10)], goal_state=array([10, 5]), start_state=array([0, 5]), obstacles=[(3, 7), (5, 5)], obstacle_hit_probability=0.2, obstacle_reward=-10.0, goal_reward=10.0, fuel_cost=2.0, grid_size=11, goal_state_radius=1.5, beacon_radius=1.0, obstacle_radius=1.5, reward_model_type=RewardModelType.STANDARD, observation_model_type=ObservationModelType.NORMAL_NOISE, penalty_decay=1.0, is_obstacle_hit_terminal=True)[source]

Bases: BaseLightDarkPOMDP

Continuous Light-Dark POMDP environment with continuous actions.

This environment extends the base Light-Dark problem to continuous 2D space with continuous action vectors. The agent navigates toward a goal while dealing with position-dependent observation noise and optional obstacles.

Key features:

- Continuous 2D state and action spaces
- Light beacons reduce observation noise when nearby
- Multiple observation models available (normal noise, normal noise with no observation in dark)
- Multiple reward models available (standard, decaying hit probability, dangerous states)
- Optional obstacles with configurable hit penalties
- Terminal conditions for goal reaching, obstacle hits, and boundary violations

Example

>>> import numpy as np
>>> np.random.seed(42)  # For reproducible results
>>>
>>> # Initialize environment
>>> env = ContinuousLightDarkPOMDP(
...     discount_factor=0.95,
...     goal_state=np.array([10, 5]),
...     start_state=np.array([0, 5])
... )
>>>
>>> # Get initial state
>>> initial_state = env.initial_state_dist().sample()[0]
>>>
>>> # Sample complete step (action must be provided based on environment type)
>>> action = np.array([1.0, 0.0])  # Move right
>>> next_state, observation, reward = env.sample_next_step(initial_state, action)
>>>
>>> # Check terminal condition
>>> env.is_terminal(initial_state)
False
compute_metrics(histories)[source]

Compute environment-specific metrics from episode histories.

This method can be overridden by subclasses to provide custom metric calculations beyond standard return and episode length.

Parameters:

histories (List[History]) – List of episode histories to analyze

Return type:

List[MetricValue]

Returns:

List of computed metrics with confidence intervals
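A minimal sketch of the intended call pattern; here histories is a hypothetical List[History] collected by an external evaluation loop, which is not part of this class:

>>> env = ContinuousLightDarkPOMDP(discount_factor=0.95)
>>> # Hypothetical: `histories` gathered from completed evaluation episodes
>>> metrics = env.compute_metrics(histories)  # List[MetricValue]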

get_metric_names()[source]

Get names of Continuous Light-Dark POMDP specific metrics.

Returns:

goal_reaching_rate, obstacle_hit_rate, avg_obstacle_hit_counter, out_of_grid_rate, and avg_dangerous_states_counter

Return type:

List[str]

is_terminal(state)[source]

Check if a state is terminal.

Parameters:

state (ndarray) – State to check for terminal condition

Return type:

bool

Returns:

True if the state is terminal, False otherwise

Note

Subclasses must implement this method to define terminal conditions.

observation_model(next_state, action)[source]

Get the observation model for a given next state and action.

Parameters:
  • next_state (ndarray) – The resulting state after taking an action

  • action (ndarray) – The action that was executed

Return type:

ObservationModel

Returns:

Observation model that can sample observations

Note

Subclasses must implement this method to define observation generation.
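A short sketch of querying the model; note that sampling via a sample() method is an assumption here, made by analogy with StateTransitionModel.sample():

>>> import numpy as np
>>> env = ContinuousLightDarkPOMDP(discount_factor=0.95)
>>> obs_model = env.observation_model(np.array([1.0, 5.0]), np.array([1.0, 0.0]))
>>> observation = obs_model.sample()[0]  # assumed API; noise depends on beacon proximity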

reward(state, action)[source]

Calculate the immediate reward for a state-action pair.

Parameters:
  • state (ndarray) – Current state

  • action (ndarray) – Action executed from the state

Return type:

float

Returns:

Immediate reward value

Note

Subclasses must implement this method to define reward structure.

reward_batch(states, action)[source]

Calculate rewards for a batch of states given a single action.

Provides a loop-based default that subclasses can override with vectorized numpy implementations for better performance.

Parameters:
  • states (Union[ndarray, Sequence[Any]]) – Sequence of states of length N.

  • action (ndarray) – Action executed from each state.

Return type:

ndarray

Returns:

1-D array of reward values with shape (N,).
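The loop-based default is semantically a stack of per-state reward() calls, roughly np.array([self.reward(s, action) for s in states]). A minimal sketch:

>>> import numpy as np
>>> env = ContinuousLightDarkPOMDP(discount_factor=0.95)
>>> states = [np.array([1.0, 5.0]), np.array([2.0, 5.0]), np.array([3.0, 5.0])]
>>> rewards = env.reward_batch(states, np.array([1.0, 0.0]))
>>> rewards.shape
(3,)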

state_transition_model(state, action)[source]

Get the state transition model for a given state-action pair.

Parameters:
  • state (ndarray) – Current state

  • action (ndarray) – Action to be executed

Return type:

StateTransitionModel

Returns:

State transition model that can sample next states

Note

Subclasses must implement this method to define state dynamics.
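A short sketch of drawing several candidate successors from the returned model (sample() semantics as documented on ContinuousLightDarkStateTransitionModel below):

>>> import numpy as np
>>> env = ContinuousLightDarkPOMDP(discount_factor=0.95)
>>> model = env.state_transition_model(np.array([0.0, 5.0]), np.array([1.0, 0.0]))
>>> next_states = model.sample(n_samples=3)  # three noisy successor positions
>>> len(next_states)
3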

class POMDPPlanners.environments.light_dark_pomdp.continuous_light_dark_pomdp.ContinuousLightDarkPOMDPDiscreteActions(discount_factor, state_transition_cov_matrix=array([[1., 0.], [0., 1.]]), observation_cov_matrix=array([[1., 0.], [0., 1.]]), obstacle_hit_probability=0.2, obstacle_reward=-10.0, goal_reward=10.0, fuel_cost=2.0, grid_size=11, goal_state_radius=1.5, beacon_radius=1.0, obstacle_radius=1.5, name='ContinuousLightDarkPOMDPDiscreteActions', beacons=[(0, 0), (0, 5), (0, 10), (5, 0), (5, 5), (5, 10), (10, 0), (10, 5), (10, 10)], goal_state=array([10, 5]), start_state=array([0, 5]), obstacles=[(3, 7), (5, 5)], reward_model_type=RewardModelType.STANDARD, observation_model_type=ObservationModelType.NORMAL_NOISE, penalty_decay=1.0, is_obstacle_hit_terminal=True)[source]

Bases: ContinuousLightDarkPOMDP, DiscreteActionsEnvironment

Continuous Light-Dark POMDP environment with discrete actions.

This variant of the Continuous Light-Dark POMDP uses discrete directional actions (up, down, left, right) instead of continuous action vectors. The continuous state space and observation model are preserved.

Actions are mapped to unit vectors:

- “up”: [0, 1]
- “down”: [0, -1]
- “right”: [1, 0]
- “left”: [-1, 0]

Example

>>> import numpy as np
>>> np.random.seed(42)  # For reproducible results
>>>
>>> # Initialize environment
>>> env = ContinuousLightDarkPOMDPDiscreteActions(
...     discount_factor=0.95,
...     goal_state=np.array([10, 5]),
...     start_state=np.array([0, 5])
... )
>>>
>>> # Get initial state and actions
>>> initial_state = env.initial_state_dist().sample()[0]
>>> actions = env.get_actions()
>>>
>>> # Sample complete step using convenience method
>>> action = actions[0]
>>> next_state, observation, reward = env.sample_next_step(initial_state, action)
>>>
>>> # Check terminal condition
>>> env.is_terminal(initial_state)
False
get_actions()[source]

Get all possible actions in the discrete action space.

Return type:

List[Any]

Returns:

List containing all valid actions that can be executed

Note

Subclasses must implement this method to enumerate all possible actions. This is used by planning algorithms that need to iterate over actions.
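A sketch of the enumeration pattern planners rely on; the one-step greedy criterion below is illustrative only:

>>> import numpy as np
>>> np.random.seed(42)
>>> env = ContinuousLightDarkPOMDPDiscreteActions(discount_factor=0.95)
>>> state = env.initial_state_dist().sample()[0]
>>> best_action = max(env.get_actions(), key=lambda a: env.reward(state, a))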

observation_model(next_state, action)[source]

Get the observation model for a given next state and action.

Parameters:
  • next_state (ndarray) – The resulting state after taking an action

  • action (Any) – The action that was executed

Return type:

ObservationModel

Returns:

Observation model that can sample observations

Note

Subclasses must implement this method to define observation generation.

reward(state, action)[source]

Calculate the immediate reward for a state-action pair.

Parameters:
  • state (ndarray) – Current state

  • action (Any) – Action executed from the state

Return type:

float

Returns:

Immediate reward value

Note

Subclasses must implement this method to define reward structure.

reward_batch(states, action)[source]

Calculate rewards for a batch of states given a single action.

Provides a loop-based default that subclasses can override with vectorized numpy implementations for better performance.

Parameters:
  • states (Union[ndarray, Sequence[Any]]) – Sequence of states of length N.

  • action (Any) – Action executed from each state.

Return type:

ndarray

Returns:

1-D array of reward values with shape (N,).

state_transition_model(state, action)[source]

Get the state transition model for a given state-action pair.

Parameters:
  • state (ndarray) – Current state

  • action (Any) – Action to be executed

Return type:

StateTransitionModel

Returns:

State transition model that can sample next states

Note

Subclasses must implement this method to define state dynamics.

class POMDPPlanners.environments.light_dark_pomdp.continuous_light_dark_pomdp.ContinuousLightDarkPOMDPMetrics(*values)[source]

Bases: Enum

Metric names for Continuous Light-Dark POMDP environment.

AVG_DANGEROUS_STATES_COUNTER = 'avg_dangerous_states_counter'
AVG_OBSTACLE_HIT_COUNTER = 'avg_obstacle_hit_counter'
GOAL_REACHING_RATE = 'goal_reaching_rate'
OBSTACLE_HIT_RATE = 'obstacle_hit_rate'
OUT_OF_GRID_RATE = 'out_of_grid_rate'
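The member values are the same strings returned by get_metric_names(), so metric results can be looked up without hard-coding string literals:

>>> ContinuousLightDarkPOMDPMetrics.GOAL_REACHING_RATE.value
'goal_reaching_rate'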
class POMDPPlanners.environments.light_dark_pomdp.continuous_light_dark_pomdp.ContinuousLightDarkStateTransitionModel(state, action, state_dist)[source]

Bases: StateTransitionModel

State transition model for Continuous Light-Dark POMDP.

This model implements continuous movement in 2D space with Gaussian noise. The agent’s next position is determined by adding the action vector to the current position, with additional Gaussian noise to model uncertainty.

Parameters:
state

Current 2D position [x, y]

action

Movement vector [dx, dy]

mean

Expected next position (state + action)

Example

>>> import numpy as np
>>> np.random.seed(42)  # For reproducible results
>>> # Define current position and movement action
>>> state = np.array([3.0, 4.0])  # Current position
>>> action = np.array([1.0, 0.5])  # Move right and slightly up
>>>
>>> # Define movement noise
>>> cov_matrix = np.eye(2) * 0.1  # Small movement noise
>>> state_dist = CovarianceParameterizedMultivariateNormal(cov_matrix)
>>>
>>> # Create transition model
>>> transition = ContinuousLightDarkStateTransitionModel(
...     state=state,
...     action=action,
...     state_dist=state_dist
... )
>>>
>>> # Sample next position with noise
>>> next_position = transition.sample()[0]
>>> # Returns position around [4.0, 4.5] ± noise
>>>
>>> # Calculate probability of specific next position
>>> prob = transition.probability([next_position])
probability(values)[source]

Calculate transition probabilities for given next states.

Parameters:

values (List[ndarray]) – List of next state values to calculate probabilities for

Return type:

ndarray

Returns:

Array of transition probabilities corresponding to the input values

Raises:

NotImplementedError – This method is not implemented by default. Subclasses should override if probability calculation is needed.

sample(n_samples=1)[source]

Sample next states from the transition model.

Parameters:

n_samples (int) – Number of next state samples to generate. Defaults to 1.

Return type:

List[ndarray]

Returns:

List of sampled next states of length n_samples.

Note

Subclasses must implement this method according to their specific state transition dynamics.

class POMDPPlanners.environments.light_dark_pomdp.continuous_light_dark_pomdp.ObservationModelType(*values)[source]

Bases: Enum

Observation model types for the Continuous Light-Dark POMDP.

DISTANCE_BASED = 'distance_based'
NORMAL_NOISE = 'normal_noise'
NORMAL_NOISE_NO_OBS_IN_DARK = 'normal_noise_no_obs_in_dark'
class POMDPPlanners.environments.light_dark_pomdp.continuous_light_dark_pomdp.RewardModelType(*values)[source]

Bases: Enum

Reward model types for the Continuous Light-Dark POMDP.

DANGEROUS_STATES = 'dangerous_states'
DECAYING_HIT_PROBABILITY = 'decaying_hit_probability'
STANDARD = 'standard'
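Both enums select model variants at construction time via the reward_model_type and observation_model_type parameters of ContinuousLightDarkPOMDP (see the class signature above). For example:

>>> env = ContinuousLightDarkPOMDP(
...     discount_factor=0.95,
...     reward_model_type=RewardModelType.DECAYING_HIT_PROBABILITY,
...     observation_model_type=ObservationModelType.NORMAL_NOISE_NO_OBS_IN_DARK,
... )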

POMDPPlanners.environments.light_dark_pomdp.discrete_light_dark_pomdp module

class POMDPPlanners.environments.light_dark_pomdp.discrete_light_dark_pomdp.DiscreteLightDarkPOMDP(discount_factor, name='DiscreteLightDarkPOMDP', transition_error_prob=0.05, observation_error_prob=0.05, beacons=[(0, 0), (0, 5), (0, 10), (5, 0), (5, 5), (5, 10), (10, 0), (10, 5), (10, 10)], goal_state=array([10, 5]), start_state=array([0, 5]), obstacles=[(3, 7), (5, 5)], obstacle_hit_probability=0.2, obstacle_reward=-10.0, goal_reward=10.0, beacon_radius=1.0, fuel_cost=2.0, grid_size=11, is_stochastic_reward=True, observation_model_type=ObservationModelType.NORMAL)[source]

Bases: BaseLightDarkPOMDPDiscreteActions, DiscreteActionsEnvironment

Discrete Light-Dark POMDP Environment for Robot Navigation with Observation Uncertainty.

This environment implements a discretized version of the classic Light-Dark POMDP problem, where a robot must navigate from a start position to a goal position in a grid world with beacons and obstacles. The key challenge is that the robot’s observation quality depends on its distance from beacons - closer to beacons means more accurate observations.

Problem Description: The robot operates in a discrete grid world where it can move in four cardinal directions. The environment includes:

- Beacons: Fixed positions that provide location reference with varying accuracy
- Obstacles: Grid cells that incur penalties when hit
- Goal: Target position that provides high reward when reached
- Observation uncertainty: Decreases with proximity to beacons (light areas)

Key Features:

- Discrete state space: Robot positions are restricted to grid cells
- Discrete action space: North, South, East, West movements
- Multiple observation models available (normal, no observation in dark)
- Distance-dependent observation accuracy: Closer to beacons = better observations
- Stochastic transitions: Actions may fail with configurable probability
- Obstacle avoidance: Penalties for hitting obstacles during navigation
- Configurable environment parameters: Grid size, beacon positions, obstacles

State Space:

- 2D grid coordinates (x, y) representing robot position
- Bounded by grid_size parameter (default: 11x11 grid)

Action Space:

- Discrete actions: [‘North’, ‘South’, ‘East’, ‘West’]
- Each action moves the robot one grid cell in the corresponding direction
- Boundary conditions: Actions that would move outside the grid are blocked

Observation Space:

- Discrete observations based on beacon proximity and noise
- Observation accuracy improves with proximity to beacons
- Stochastic observation errors controlled by observation_error_prob

Reward Structure:

- Goal reward: Large positive reward for reaching the goal state
- Obstacle penalty: Negative reward for hitting obstacles
- Fuel cost: Small negative reward for each movement action
- Distance-based penalties: Encourage efficient navigation

Parameters:
transition_error_prob

Probability that an action fails (results in different movement)

observation_error_prob

Probability of observation noise/error

is_stochastic_reward

Whether rewards include stochastic components

beacons

List of (x, y) beacon positions that provide navigation references

goal_state

Target position (x, y) that robot should reach

start_state

Initial robot position (x, y)

obstacles

List of (x, y) obstacle positions to avoid

grid_size

Dimension of the square grid world

Example

>>> import numpy as np
>>> np.random.seed(42)  # For reproducible results
>>>
>>> # Initialize environment
>>> env = DiscreteLightDarkPOMDP(
...     discount_factor=0.95,
...     transition_error_prob=0.1,
...     observation_error_prob=0.15,
...     beacons=[(1, 1), (2, 2)],
...     grid_size=11
... )
>>>
>>> # Get initial state and actions
>>> initial_state = env.initial_state_dist().sample()[0]
>>> actions = env.get_actions()
>>>
>>> # Sample complete step using convenience method
>>> action = actions[0]
>>> next_state, observation, reward = env.sample_next_step(initial_state, action)
>>>
>>> # Check terminal condition
>>> env.is_terminal(initial_state)
False

References:

- Platt, R., et al. “Belief space planning assuming maximum likelihood observations.” (2010)
- Kurniawati, H., et al. “SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces.” (2008)
- Light-Dark domain: Classic POMDP benchmark for testing observation uncertainty

compute_metrics(histories)[source]

Compute environment-specific metrics from episode histories.

This method can be overridden by subclasses to provide custom metric calculations beyond standard return and episode length.

Parameters:

histories (List[History]) – List of episode histories to analyze

Return type:

List[MetricValue]

Returns:

List of computed metrics with confidence intervals

get_metric_names()[source]

Get names of Discrete Light-Dark POMDP specific metrics.

Returns:

goal_reaching_rate, obstacle_hit_rate, avg_obstacle_hit_counter, out_of_grid_rate, and avg_dangerous_states_counter

Return type:

List[str]

is_terminal(state)[source]

Check if a state is terminal.

Parameters:

state (ndarray) – State to check for terminal condition

Return type:

bool

Returns:

True if the state is terminal, False otherwise

Note

Subclasses must implement this method to define terminal conditions.

observation_model(next_state, action)[source]

Get the observation model for a given next state and action.

Parameters:
  • next_state (ndarray) – The resulting state after taking an action

  • action (Any) – The action that was executed

Return type:

ObservationModel

Returns:

Observation model that can sample observations

Note

Subclasses must implement this method to define observation generation.

reward(state, action)[source]

Calculate the immediate reward for a state-action pair.

Parameters:
  • state (ndarray) – Current state

  • action (Any) – Action executed from the state

Return type:

float

Returns:

Immediate reward value

Note

Subclasses must implement this method to define reward structure.

reward_batch(states, action)[source]

Calculate rewards for a batch of states given a single action.

Provides a loop-based default that subclasses can override with vectorized numpy implementations for better performance.

Parameters:
  • states (Union[ndarray, Sequence[Any]]) – Sequence of states of length N.

  • action (str) – Action executed from each state.

Return type:

ndarray

Returns:

1-D array of reward values with shape (N,).

sample_next_step(state, action)[source]

Sample a complete state transition step.

This convenience method combines state transition, observation generation, and reward calculation in a single operation.

Parameters:
  • state (ndarray) – Current state

  • action (Any) – Action to execute

Returns:

  • next_state: Sampled next state

  • next_observation: Sampled observation

  • reward: Immediate reward

Return type:

Tuple[Any, Any, float]
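Because sample_next_step bundles transition, observation, and reward, a full episode rollout needs only this method plus is_terminal. A minimal sketch under a uniform random policy (the policy and the 50-step cap are illustrative only):

>>> import numpy as np
>>> np.random.seed(42)
>>> env = DiscreteLightDarkPOMDP(discount_factor=0.95)
>>> state = env.initial_state_dist().sample()[0]
>>> actions = env.get_actions()
>>> total_reward, steps = 0.0, 0
>>> while not env.is_terminal(state) and steps < 50:
...     action = actions[np.random.randint(len(actions))]  # uniform random policy
...     state, observation, reward = env.sample_next_step(state, action)
...     total_reward += reward
...     steps += 1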

state_transition_model(state, action)[source]

Get the state transition model for a given state-action pair.

Parameters:
  • state (ndarray) – Current state

  • action (Any) – Action to be executed

Return type:

StateTransitionModel

Returns:

State transition model that can sample next states

Note

Subclasses must implement this method to define state dynamics.

class POMDPPlanners.environments.light_dark_pomdp.discrete_light_dark_pomdp.DiscreteLightDarkPOMDPMetrics(*values)[source]

Bases: Enum

Metric names for Discrete Light-Dark POMDP environment.

AVG_DANGEROUS_STATES_COUNTER = 'avg_dangerous_states_counter'
AVG_OBSTACLE_HIT_COUNTER = 'avg_obstacle_hit_counter'
GOAL_REACHING_RATE = 'goal_reaching_rate'
OBSTACLE_HIT_RATE = 'obstacle_hit_rate'
OUT_OF_GRID_RATE = 'out_of_grid_rate'
class POMDPPlanners.environments.light_dark_pomdp.discrete_light_dark_pomdp.ObservationModelType(*values)[source]

Bases: Enum

Observation model types for the Discrete Light-Dark POMDP.

DISTANCE_BASED = 'distance_based'
NORMAL = 'normal'
NO_OBS_IN_DARK = 'no_obs_in_dark'
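As in the continuous module, the observation model is chosen at construction time via the observation_model_type parameter of DiscreteLightDarkPOMDP. For example:

>>> env = DiscreteLightDarkPOMDP(
...     discount_factor=0.95,
...     observation_model_type=ObservationModelType.NO_OBS_IN_DARK,
... )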