Scene 4: Kitchen A

A: facial tissue holder [paper towel dispenser]

B: light switch [left electric outlet]

C: light switch [right electric outlet]

D: mixer

E: toaster

F: kettle

G: paper towel

H: water glass [plastic cup]

I: salt and pepper shakers [salt]

J: bottle [jam container]

K: frying pan [baking pan]

L: container [salmon-colored container]

M: salt and pepper shakers [pepper]

N: countertop


Here we show one scene from our real scene planning evaluation, along with one example task for it. The object detections and labels come from OWL-ViT, which is what our planner has access to; more precise labels are given in square brackets. Color-coded bounding boxes mark the object detections that are relevant to the example task. For each version of the LLM planner, we show its output, including any interaction with a VLM, and the generated plan. Outputs generated by the VLMs are in orange, and correct and incorrect plans are marked in green and red, respectively.

Task 25: Bring me the most deformable container with a lid.

In this task, we want to determine whether the VLM can understand deformability and whether containers have lids. For instance, objects H and L are deformable but do not have lids, while object F has a lid but is not easily deformable. Only J should be brought, since it has a lid and is deformable.

Generated Plans

No VLM

Thought: Since there is no clear information about the deformability of the containers, I will choose the container L as it has a lid.

Plan:

1. Go to object L
2. Pick up object L
3. Bring to human object L
4. Done

InstructBLIP

Thought: I can first ask which objects are containers with lids out of all the objects I can carry. Of these containers, I can ask which is deformable.

Question about [A, B, C, D, E, F, G, H, I, J, K, L, M, N]: Is this object a container with a lid?

Answer
A: Yes (0.82), No (0.16), Unknown (0.0)
B: No (0.86), Yes (0.11), Unknown (0.01)
C: No (0.83), Yes (0.14), Unknown (0.01)
D: Yes (0.64), No (0.3), Unknown (0.02)
E: Yes (0.6), No (0.36), Unknown (0.02)
F: Yes (0.79), No (0.19), Unknown (0.0)
G: No (0.59), Yes (0.35), Unknown (0.03)
H: Yes (0.97), No (0.02)
I: Yes (0.97), No (0.02), Jar (0.0)
J: Yes (0.85), No (0.14), Unknown (0.0)
K: No (0.5), Yes (0.47), Unknown (0.01)
L: Yes (0.93), No (0.06), Unknown (0.0)
M: Yes (0.98), No (0.02)
N: Yes (0.82), No (0.15), Unknown (0.01)

Thought: The objects that are confidently Yes are A, D, E, F, H, I, J, L, and M. I should ask which is deformable next.

Question about [A, D, E, F, H, I, J, L, M]: Is this object deformable?

Answer
A: No (0.49), Yes (0.46), Unknown (0.03)
D: Yes (0.83), No (0.12), Unknown (0.02)
E: Yes (0.55), No (0.38), Unknown (0.03)
F: No (0.55), Yes (0.42), Unknown (0.02)
H: No (0.52), Yes (0.44), Unknown (0.02)
I: No (0.59), Yes (0.36), Unknown (0.02)
J: Yes (0.83), No (0.15), Unknown (0.01)
L: Yes (0.85), No (0.13), Unknown (0.01)
M: No (0.63), Yes (0.32), Unknown (0.02)

Thought: The most deformable object is the one with the highest Yes value or the lowest No value (if none are Yes). Since some of the answers are Yes, the answer is L (the highest yes value).

Plan:

1. Go to object L
2. Pick up object L
3. Bring to human object L
4. Done
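
The selection rule stated in the planner's final thought (take the highest Yes value, or the lowest No value if no answer is Yes) can be sketched in a few lines. This is only an illustration, not the planner's implementation: in the evaluation the LLM applies the rule in natural language. The function name `most_likely` and the dictionary layout are assumptions made for this sketch; the scores are the InstructBLIP deformability answers listed above.

```python
# Hypothetical sketch of the rule "highest Yes, else lowest No".
# Scores are copied from the InstructBLIP deformability answers above.
deformable = {
    "A": {"Yes": 0.46, "No": 0.49},
    "D": {"Yes": 0.83, "No": 0.12},
    "E": {"Yes": 0.55, "No": 0.38},
    "F": {"Yes": 0.42, "No": 0.55},
    "H": {"Yes": 0.44, "No": 0.52},
    "I": {"Yes": 0.36, "No": 0.59},
    "J": {"Yes": 0.83, "No": 0.15},
    "L": {"Yes": 0.85, "No": 0.13},
    "M": {"Yes": 0.32, "No": 0.63},
}


def most_likely(scores):
    """Pick the object whose answer is most strongly Yes.

    If at least one object's top answer is Yes, return the one with the
    highest Yes value; otherwise return the one with the lowest No value.
    """
    yes_objects = [k for k, s in scores.items() if s["Yes"] > s["No"]]
    if yes_objects:
        return max(yes_objects, key=lambda k: scores[k]["Yes"])
    return min(scores, key=lambda k: scores[k]["No"])


print(most_likely(deformable))
```

With these scores the rule returns L, matching the plan above. Since L is deformable but has no lid, the mistake comes from the first question, where InstructBLIP answered Yes (0.93) for L, rather than from the selection rule itself.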

PG-InstructBLIP (ours)

Thought: I can first ask which objects are containers with a lid out of all the objects I can carry. Of these containers, I can ask which is deformable.

Question about [A, B, C, D, E, F, G, H, I, J, K, L, M, N]: Is this object a container with a lid?

Answer
A: Yes (0.51), No (0.46), Unknown (0.02)
B: No (0.95), Yes (0.04), Unknown (0.0)
C: No (0.93), Yes (0.06), Unknown (0.0)
D: No (0.79), Yes (0.2), Unknown (0.0)
E: No (0.87), Yes (0.12), Unknown (0.0)
F: No (0.6), Yes (0.39), Unknown (0.0)
G: No (0.94), Yes (0.05), Unknown (0.01)
H: Yes (0.68), No (0.3), Unknown (0.0)
I: Yes (0.96), No (0.04)
J: Yes (0.88), No (0.11), Unknown (0.0)
K: No (0.86), Yes (0.13), Unknown (0.0)
L: No (0.7), Yes (0.28), Unknown (0.01)
M: Yes (0.97), No (0.03)
N: No (0.64), Yes (0.31), Unknown (0.02)

Thought: The objects that are confidently Yes are A, H, I, J, and M. I should ask which is deformable next.

Question about [A, H, I, J, M]: Is this object deformable?

Answer
A: No (0.75), Yes (0.22), Unknown (0.01)
H: No (0.7), Yes (0.28), Unknown (0.01)
I: No (0.85), Yes (0.13), Unknown (0.01)
J: Yes (0.56), No (0.42), Unknown (0.01)
M: No (0.88), Yes (0.1), Unknown (0.01)

Thought: The most deformable object is the one with the highest Yes value or the lowest No value (if none are Yes). Since some of the answers are Yes, the answer is J (the highest yes value).

Plan:

1. Go to object J
2. Pick up object J
3. Bring to human object J
4. Done
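
The two-question procedure in this transcript (filter by the lid question, then rank by deformability) can be sketched as below. This is a rough illustration under stated assumptions, not the planner's actual code: "confidently Yes" is read here as "the Yes score exceeds the No score", which happens to reproduce the list A, H, I, J, M, but the criterion the LLM actually applies is not stated. The names `yes_candidates` and `make_plan` are hypothetical. Scores are the PG-InstructBLIP answers listed above, stored as (Yes, No) pairs.

```python
# Hypothetical sketch: first-stage filter, ranking, and plan formatting.
# (Yes, No) scores for "Is this object a container with a lid?"
lid_scores = {
    "A": (0.51, 0.46), "B": (0.04, 0.95), "C": (0.06, 0.93),
    "D": (0.20, 0.79), "E": (0.12, 0.87), "F": (0.39, 0.60),
    "G": (0.05, 0.94), "H": (0.68, 0.30), "I": (0.96, 0.04),
    "J": (0.88, 0.11), "K": (0.13, 0.86), "L": (0.28, 0.70),
    "M": (0.97, 0.03), "N": (0.31, 0.64),
}
# (Yes, No) scores for "Is this object deformable?" on the filtered objects.
deformable_scores = {
    "A": (0.22, 0.75), "H": (0.28, 0.70), "I": (0.13, 0.85),
    "J": (0.56, 0.42), "M": (0.10, 0.88),
}


def yes_candidates(scores):
    """Keep objects whose Yes score exceeds their No score
    (an assumed reading of "confidently Yes")."""
    return [obj for obj, (yes, no) in scores.items() if yes > no]


def make_plan(obj):
    """Format a plan in the same four-step layout used on this page."""
    return [
        f"1. Go to object {obj}",
        f"2. Pick up object {obj}",
        f"3. Bring to human object {obj}",
        "4. Done",
    ]


candidates = yes_candidates(lid_scores)
# Simplification: rank by the Yes score only, since at least one
# candidate (J) has Yes as its top answer for deformability.
target = max(candidates, key=lambda obj: deformable_scores[obj][0])
print(candidates)
print("\n".join(make_plan(target)))
```

Under these assumptions the sketch keeps A, H, I, J, and M after the first question and selects J after the second, which is the object the task description identifies as correct.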