
Scene 2: Art Table

A: suitcase [blue crate]

B: stool

C: hair dryer [mirror]

D: chair [chair that the mirror is on]

E: dishwasher [metal cabinet in top right]

F: chair [blue chair]

G: bottle [Elmer glue container]

H: bottle [Mod Podge container]

I: container [paint thinner container]

J: desk

K: mug [mug with paintbrushes]

L: facial tissue holder [container with glitter]

M: pencil

Here we show one scene from our real-scene planning evaluation, together with one example task for it. We list the object detections and labels from OWL-ViT, which are all the planner has access to, and give more precise labels in square brackets. Detections relevant to the example task are marked with color-coded bounding boxes. For each version of the LLM planner, we show its output, including any interaction with a VLM, and the generated plan. Outputs generated by the VLMs are in orange, and correct and incorrect plans are marked in green and red, respectively.

Task 10: Bring me the clear container with art supplies.

In this task, we want to determine whether the VLM retains the ability to reason about "art supplies." The only container that should be brought is L.
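One way to check a generated plan against this task's target is sketched below. The helper name `brings_object` and the step-string parsing are illustrative assumptions based on the plan format used in this example; this is not the evaluation code behind the results.

```python
import re


def brings_object(plan_steps, target):
    """Return True if the plan brings exactly `target` to the human.

    Hypothetical helper: assumes each step is a string such as
    "Bring to human object L", matching the plan format in this example.
    """
    brought = [
        m.group(1)
        for step in plan_steps
        if (m := re.fullmatch(r"Bring to human object (\w+)", step.strip()))
    ]
    return brought == [target]


plan = ["Go to object L", "Pick up object L", "Bring to human object L", "Done"]
print(brings_object(plan, "L"))
```

A plan that brings any other object, or more than one object, fails this check.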

Generated Plans


Thought: There is no clear indication that the container (I) has art supplies, but it is the only container mentioned. I will assume it has the art supplies and bring it to the human.


1. Go to object I
2. Pick up object I
3. Bring to human object I
4. Done


Thought: I can first ask which objects are clear containers out of all the objects I can carry.

Question about [A, B, C, D, E, F, G, H, I, J, K, L, M]: Is this object a clear container?

A: No (0.94), Yes (0.05), Unknown (0.0)
B: No (0.73), Yes (0.22), Unknown (0.03)
C: No (0.74), Yes (0.22), Unknown (0.02)
D: No (0.64), Yes (0.28), Unknown (0.05)
E: No (0.73), Yes (0.23), Unknown (0.01)
F: No (0.54), Yes (0.42), Unknown (0.02)
G: Yes (0.58), No (0.39), Unknown (0.01)
H: Yes (0.5), No (0.47), Unknown (0.01)
I: No (0.82), Yes (0.16), Unknown (0.01)
J: No (0.58), Yes (0.35), Unknown (0.03)
K: No (0.83), Yes (0.15), Unknown (0.01)
L: Yes (0.95), No (0.04), Clear (0.0)
M: No (0.8), Yes (0.15), Unknown (0.03)

Thought: The clear containers are G, H, and L. I should ask which of these contains art supplies next.

Question about [G, H, L]: Does this object contain art supplies?

G: Yes (0.69), No (0.29), Unknown (0.01)
H: Yes (0.95), No (0.04), Unknown (0.0)
L: Yes (0.97), No (0.02), Unknown (0.0)

Thought: The clear container with art supplies is the one with the highest Yes value. The answer is L.


1. Go to object L
2. Pick up object L
3. Bring to human object L
4. Done
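The two-step selection in the transcript above can be sketched as follows: keep the objects whose most likely answer to the first question is "Yes", then pick the one with the highest "Yes" probability on the second question. The probabilities are copied from the transcript (the third-ranked answer is omitted); the variable and function names are assumptions for illustration, and in the actual system this reasoning is carried out by the LLM planner rather than by fixed code.

```python
# Answer probabilities for "Is this object a clear container?"
clear_container = {
    "A": {"No": 0.94, "Yes": 0.05}, "B": {"No": 0.73, "Yes": 0.22},
    "C": {"No": 0.74, "Yes": 0.22}, "D": {"No": 0.64, "Yes": 0.28},
    "E": {"No": 0.73, "Yes": 0.23}, "F": {"No": 0.54, "Yes": 0.42},
    "G": {"Yes": 0.58, "No": 0.39}, "H": {"Yes": 0.5, "No": 0.47},
    "I": {"No": 0.82, "Yes": 0.16}, "J": {"No": 0.58, "Yes": 0.35},
    "K": {"No": 0.83, "Yes": 0.15}, "L": {"Yes": 0.95, "No": 0.04},
    "M": {"No": 0.8, "Yes": 0.15},
}

# Answer probabilities for "Does this object contain art supplies?"
art_supplies = {
    "G": {"Yes": 0.69, "No": 0.29},
    "H": {"Yes": 0.95, "No": 0.04},
    "L": {"Yes": 0.97, "No": 0.02},
}


def top_answer(dist):
    """Return the answer with the highest probability."""
    return max(dist, key=dist.get)


# Step 1: objects whose top answer to the first question is "Yes".
candidates = [obj for obj, dist in clear_container.items() if top_answer(dist) == "Yes"]

# Step 2: among those, the object with the highest "Yes" on the second question.
best = max(candidates, key=lambda obj: art_supplies[obj]["Yes"])

print(candidates, best)
```

Note that H passes the first step with a "Yes" probability of only 0.5, and the final choice between H (0.95) and L (0.97) rests on a margin of 0.02.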

PG-InstructBLIP (ours)

Thought: I can first ask which objects are containers out of all the objects I can carry. Of these containers, I can ask which is clear and has art supplies.

Question about [A, B, C, D, E, F, G, H, I, J, K, L, M]: Is this object a container?

A: Yes (0.97), No (0.02)
B: No (0.78), Yes (0.19), Unknown (0.01)
C: No (0.51), Yes (0.47), Unknown (0.01)
D: No (0.77), Yes (0.21), Unknown (0.0)
E: Yes (0.95), No (0.05), Container (0.0)
F: No (0.67), Yes (0.31), Unknown (0.01)
G: Yes (0.98), No (0.02), Container (0.0)
H: Yes (0.98), No (0.02), Container (0.0)
I: Yes (0.98), No (0.02)
J: Yes (0.61), No (0.34), Unknown (0.02)
K: Yes (0.98), No (0.02), Container (0.0)
L: Yes (0.97), No (0.03), Container (0.0)
M: No (0.95), Yes (0.04), Nothing (0.0)

Thought: The objects that are confidently containers are A, E, G, H, I, K, and L. I should ask which of these is clear and has art supplies.

Question about [A, E, G, H, I, K, L]: Is this object clear and contains art supplies?

A: No (0.87), Yes (0.09), Unknown (0.03)
E: No (0.89), Yes (0.07), Unknown (0.03)
G: No (0.7), Yes (0.27), Unknown (0.01)
H: Yes (0.64), No (0.35), Unknown (0.0)
I: No (0.82), Yes (0.16), Unknown (0.01)
K: No (0.53), Yes (0.46), Unknown (0.0)
L: Yes (0.94), No (0.05), Unknown (0.0)

Thought: The clear container with art supplies is the one with the highest Yes value. The answer is L.


1. Go to object L
2. Pick up object L
3. Bring to human object L
4. Done
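The planner's notion of "confidently containers" in the transcript above is consistent with a cutoff on the "Yes" probability, which the sketch below makes explicit. The cutoff value of 0.9 is an assumption chosen to reproduce the planner's list (J, at 0.61, is excluded); the planner itself does not state a threshold. The probabilities are the "Yes" values from the transcript, and all names are illustrative.

```python
# "Yes" probabilities for "Is this object a container?"
container_yes = {
    "A": 0.97, "B": 0.19, "C": 0.47, "D": 0.21, "E": 0.95, "F": 0.31,
    "G": 0.98, "H": 0.98, "I": 0.98, "J": 0.61, "K": 0.98, "L": 0.97,
    "M": 0.04,
}

# "Yes" probabilities for "Is this object clear and contains art supplies?"
clear_art_yes = {
    "A": 0.09, "E": 0.07, "G": 0.27, "H": 0.64,
    "I": 0.16, "K": 0.46, "L": 0.94,
}

CONFIDENCE_THRESHOLD = 0.9  # assumed cutoff, not stated in the transcript

# Keep only objects the VLM confidently labels as containers.
confident = [obj for obj, p in container_yes.items() if p >= CONFIDENCE_THRESHOLD]

# Rank the confident containers by the second question's "Yes" probability.
ranked = sorted(confident, key=lambda obj: clear_art_yes[obj], reverse=True)
best, runner_up = ranked[0], ranked[1]
margin = clear_art_yes[best] - clear_art_yes[runner_up]

print(confident, best, runner_up)
```

Here the top-ranked object L leads the runner-up H by roughly 0.30 in "Yes" probability.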