
Scene 7: Living Room

A: whiteboard

B: door [leftmost door]

C: paper

D: window [window of left door of rightmost pair]

E: door [left door of rightmost pair]

F: table [taller table]

G: chair [leftmost short chair facing towards the camera]

H: chair [tall chair]

I: chair [short chair behind pillar]

J: chair [rightmost short chair, facing towards the camera]

K: table [long wooden table]

L: door [rightmost door]

M: couch

N: chair [left side, facing away from camera]

O: coffee table


Here we show one scene from our real-scene planning evaluation, together with one example task for it. The object detections and labels listed above come from OWL-ViT and are what our planner has access to; we add more precise labels in square brackets. Color-coded bounding boxes mark the object detections that are relevant to the example task. For each version of the LLM planner, we show its output, including any interaction with a VLM, and the generated plan. Outputs generated by the VLMs are in orange, and correct and incorrect plans are marked in green and red respectively.

Task 44: Go to the table that does not have a wooden surface.

This task tests whether the VLM can tell which tables have a wooden surface. Of the tables in the scene (F, K, and O), only O is not made of wood.

Generated Plans

No VLM

Thought: I need to identify the tables and determine which one does not have a wooden surface.

Plan:

1. Go to object F
2. Go to object K
3. Done

InstructBLIP

Thought: I can first ask which objects are tables out of all the objects I can carry. Of these tables, I can ask which does not have a wooden surface.

Question about [A, B, C, D, E, F, G, H, I, J, K, L, M, N, O]: Is this object a table?

Answer
A: No (0.86), Yes (0.11), Unknown (0.01)
B: Yes (0.64), No (0.33), Table (0.01)
C: No (0.81), Yes (0.11), Unknown (0.04)
D: No (0.64), Yes (0.28), Unknown (0.04)
E: No (0.64), Yes (0.3), Unknown (0.03)
F: Yes (0.78), No (0.2), Table (0.01)
G: Yes (0.97), No (0.03), Table (0.0)
H: Yes (0.97), No (0.03), Table (0.0)
I: No (0.91), Yes (0.06), Chair (0.01)
J: Yes (0.8), No (0.18), Table (0.0)
K: Yes (0.97), No (0.03), Table (0.0)
L: No (0.74), Yes (0.19), Unknown (0.03)
M: No (0.82), Yes (0.15), Unknown (0.01)
N: No (0.69), Yes (0.23), Chair (0.02)
O: Yes (0.75), No (0.22), Table (0.01)

Thought: The objects that are confidently Yes are F, G, H, J, K, and O. I should ask which of these does not have a wooden surface next.
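The filtering step in the Thought above can be approximated in code. The following is a minimal sketch, not the planner's actual mechanism: in the transcript the LLM makes this judgment in natural language, and no numeric cutoff is stated. The helper names `parse_answers` and `confident_yes`, and the 0.7 threshold, are assumptions chosen for illustration; 0.7 happens to reproduce the same set of objects for the InstructBLIP answers listed above.

```python
import re

# Answer lines copied from the InstructBLIP "Is this object a table?" query above.
RAW_ANSWERS = """\
A: No (0.86), Yes (0.11), Unknown (0.01)
B: Yes (0.64), No (0.33), Table (0.01)
C: No (0.81), Yes (0.11), Unknown (0.04)
D: No (0.64), Yes (0.28), Unknown (0.04)
E: No (0.64), Yes (0.3), Unknown (0.03)
F: Yes (0.78), No (0.2), Table (0.01)
G: Yes (0.97), No (0.03), Table (0.0)
H: Yes (0.97), No (0.03), Table (0.0)
I: No (0.91), Yes (0.06), Chair (0.01)
J: Yes (0.8), No (0.18), Table (0.0)
K: Yes (0.97), No (0.03), Table (0.0)
L: No (0.74), Yes (0.19), Unknown (0.03)
M: No (0.82), Yes (0.15), Unknown (0.01)
N: No (0.69), Yes (0.23), Chair (0.02)
O: Yes (0.75), No (0.22), Table (0.01)
"""

# Matches one "Answer (probability)" pair, e.g. "Yes (0.64)".
ANSWER_RE = re.compile(r"(\w+) \(([\d.]+)\)")


def parse_answers(raw):
    """Map each object label to a dict of {answer: probability}."""
    parsed = {}
    for line in raw.strip().splitlines():
        label, _, rest = line.partition(":")
        parsed[label.strip()] = {
            answer: float(prob) for answer, prob in ANSWER_RE.findall(rest)
        }
    return parsed


def confident_yes(parsed, threshold=0.7):
    """Return labels whose "Yes" probability meets the (assumed) threshold."""
    return [
        label
        for label, probs in parsed.items()
        if probs.get("Yes", 0.0) >= threshold
    ]


print(confident_yes(parse_answers(RAW_ANSWERS)))
```

With this cutoff, B (Yes at 0.64) is dropped while O (Yes at 0.75) is kept, which matches the set the planner names. A different threshold would give a different set, so the value should be read as illustrative only.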

Question about [F, G, H, J, K, O]: Does this object have a wooden surface?

Answer
F: Yes (0.96), No (0.03), Wood (0.0)
G: Yes (0.97), No (0.03), Table (0.0)
H: Yes (0.98), No (0.02), Wood (0.0)
J: Yes (0.88), No (0.1), Table (0.01)
K: Yes (0.98), No (0.01), Table (0.0)
O: No (0.82), Yes (0.14), Unknown (0.02)

Thought: The table that does not have a wooden surface is the one with the highest No value or the lowest Yes value (if none are No). The answer is O (the highest No value).
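The selection rule stated in this Thought (take the highest No value, or fall back to the lowest Yes value if no object is answered No) can be written out as a short sketch. The function name `pick_non_matching` is hypothetical, and the planner itself applies this rule in natural language rather than in code; the probabilities are copied from the wooden-surface answers above, keeping only the Yes and No entries.

```python
# Yes/No probabilities copied from the "wooden surface" answers above.
WOODEN = {
    "F": {"Yes": 0.96, "No": 0.03},
    "G": {"Yes": 0.97, "No": 0.03},
    "H": {"Yes": 0.98, "No": 0.02},
    "J": {"Yes": 0.88, "No": 0.10},
    "K": {"Yes": 0.98, "No": 0.01},
    "O": {"Yes": 0.14, "No": 0.82},
}


def pick_non_matching(answers):
    """Pick the object least likely to have the queried property.

    If any object's top answer is "No", return the one with the highest
    "No" probability. Otherwise fall back to the lowest "Yes" probability.
    """
    answered_no = [
        label for label, probs in answers.items() if probs["No"] > probs["Yes"]
    ]
    if answered_no:
        return max(answered_no, key=lambda label: answers[label]["No"])
    return min(answers, key=lambda label: answers[label]["Yes"])


print(pick_non_matching(WOODEN))
```

Here O is the only object answered No, so it is selected directly. If O were absent, the fallback branch would return J, the object with the lowest Yes probability.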

Plan:

1. Go to object O
2. Done

PG-InstructBLIP (ours)

Thought: I can first ask which objects are tables out of all the objects I can carry. Of these tables, I can ask which does not have a wooden surface.

Question about [A, B, C, D, E, F, G, H, I, J, K, L, M, N, O]: Is this object a table?

Answer
A: No (0.94), Yes (0.05), Unknown (0.0)
B: No (0.75), Yes (0.22), Unknown (0.02)
C: No (0.95), Yes (0.03), Unknown (0.01)
D: No (0.8), Yes (0.16), Unknown (0.02)
E: No (0.73), Yes (0.23), Unknown (0.02)
F: Yes (0.9), No (0.1), Table (0.0)
G: Yes (0.98), No (0.01), Table (0.0)
H: Yes (0.92), No (0.07), Table (0.0)
I: No (0.97), Yes (0.02), Chair (0.0)
J: No (0.73), Yes (0.2), Chair (0.05)
K: Yes (0.98), No (0.01), Table (0.0)
L: No (0.84), Yes (0.13), Unknown (0.01)
M: No (0.95), Yes (0.03), Couch (0.01)
N: No (0.92), Yes (0.05), Chair (0.02)
O: Yes (0.77), No (0.21), Table (0.01)

Thought: The objects that are confidently Yes are F, K, and O. I should ask which of these does not have a wooden surface next.

Question about [F, K, O]: Does this object have a wooden surface?

Answer
F: Yes (0.97), No (0.02), Wood (0.0)
K: Yes (0.99), No (0.01), Wood (0.0)
O: No (0.72), Yes (0.25), Unknown (0.01)

Thought: The table that does not have a wooden surface is the one with the highest No value or the lowest Yes value (if none are No). Since some of the answers are No, the answer is O (the highest no value).

Plan:

1. Go to object O
2. Done