Scene 7: Living Room
A: whiteboard
B: door [leftmost door]
C: paper
D: window [window of left door of rightmost pair]
E: door [left door of rightmost pair]
F: table [taller table]
G: chair [leftmost short chair facing towards the camera]
H: chair [tall chair]
I: chair [short chair behind pillar]
J: chair [rightmost short chair, facing towards the camera]
K: table [long wooden table]
L: door [rightmost door]
M: couch
N: chair [left side, facing away from camera]
O: coffee table
We show a scene from our real-scene planning evaluation, together with one example task for it. We list the object detections and labels from OWL-ViT, which are what our planner has access to, and give more precise labels in square brackets. Color-coded bounding boxes mark the object detections that are relevant to the example task. For each version of the LLM planner, we show its output, including any interaction with a VLM, and the generated plan. Outputs generated by the VLMs are in orange, and correct and incorrect plans are marked in green and red, respectively.
Task 44: Go to the table that does not have a wooden surface.
This task tests whether the VLM can identify which tables have a wooden surface. Of the three tables (F, K, and O), only O is not made of wood.
Generated Plans
No VLM
Thought: I need to identify the tables and determine which one does not have a wooden surface.
Plan:
1. Go to object F
2. Go to object K
3. Done
InstructBLIP
Thought: I can first ask which objects are tables out of all the objects I can carry. Of these tables, I can ask which does not have a wooden surface.
Question about [A, B, C, D, E, F, G, H, I, J, K, L, M, N, O]: Is this object a table?
Answer
A: No (0.86), Yes (0.11), Unknown (0.01)
B: Yes (0.64), No (0.33), Table (0.01)
C: No (0.81), Yes (0.11), Unknown (0.04)
D: No (0.64), Yes (0.28), Unknown (0.04)
E: No (0.64), Yes (0.3), Unknown (0.03)
F: Yes (0.78), No (0.2), Table (0.01)
G: Yes (0.97), No (0.03), Table (0.0)
H: Yes (0.97), No (0.03), Table (0.0)
I: No (0.91), Yes (0.06), Chair (0.01)
J: Yes (0.8), No (0.18), Table (0.0)
K: Yes (0.97), No (0.03), Table (0.0)
L: No (0.74), Yes (0.19), Unknown (0.03)
M: No (0.82), Yes (0.15), Unknown (0.01)
N: No (0.69), Yes (0.23), Chair (0.02)
O: Yes (0.75), No (0.22), Table (0.01)
Thought: The objects that are confidently Yes are F, G, H, J, K, and O. I should ask which of these does not have a wooden surface next.
Question about [F, G, H, J, K, O]: Does this object have a wooden surface?
Answer
F: Yes (0.96), No (0.03), Wood (0.0)
G: Yes (0.97), No (0.03), Table (0.0)
H: Yes (0.98), No (0.02), Wood (0.0)
J: Yes (0.88), No (0.1), Table (0.01)
K: Yes (0.98), No (0.01), Table (0.0)
O: No (0.82), Yes (0.14), Unknown (0.02)
Thought: The table that does not have a wooden surface is the one with the highest No value or the lowest Yes value (if none are No). The answer is O (the highest No value).
Plan:
1. Go to object O
2. Done
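The selection rule stated in the planner's last thought (take the object with the highest No value, or the lowest Yes value if no object is answered No) can be sketched in a few lines of Python. This is only an illustration of the rule as worded in the transcript, not the planner's actual implementation: the function name `pick_non_matching` and the dictionary layout are assumptions, and the probabilities are copied from the InstructBLIP answers to the second question above.

```python
from typing import Dict

def pick_non_matching(answers: Dict[str, Dict[str, float]]) -> str:
    """Return the object least likely to have the queried property.

    Hypothetical helper: `answers` maps an object label to the VLM's
    answer probabilities for that object.
    """
    # Keep only the objects whose most likely answer is "No".
    no_top = {
        obj: probs
        for obj, probs in answers.items()
        if max(probs, key=probs.get) == "No"
    }
    if no_top:
        # Some answers are No: take the highest No value.
        return max(no_top, key=lambda obj: no_top[obj]["No"])
    # No answer is No: fall back to the lowest Yes value.
    return min(answers, key=lambda obj: answers[obj].get("Yes", 0.0))

# Probabilities from the InstructBLIP wooden-surface question above.
answers = {
    "F": {"Yes": 0.96, "No": 0.03, "Wood": 0.0},
    "G": {"Yes": 0.97, "No": 0.03, "Table": 0.0},
    "H": {"Yes": 0.98, "No": 0.02, "Wood": 0.0},
    "J": {"Yes": 0.88, "No": 0.10, "Table": 0.01},
    "K": {"Yes": 0.98, "No": 0.01, "Table": 0.0},
    "O": {"No": 0.82, "Yes": 0.14, "Unknown": 0.02},
}
print(pick_non_matching(answers))  # prints "O"
```

With O removed from the input, no object has No as its top answer, so the fallback branch applies and the sketch returns J, the object with the lowest Yes value.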
PG-InstructBLIP (ours)
Thought: I can first ask which objects are tables out of all the objects I can carry. Of these tables, I can ask which does not have a wooden surface.
Question about [A, B, C, D, E, F, G, H, I, J, K, L, M, N, O]: Is this object a table?
Answer
A: No (0.94), Yes (0.05), Unknown (0.0)
B: No (0.75), Yes (0.22), Unknown (0.02)
C: No (0.95), Yes (0.03), Unknown (0.01)
D: No (0.8), Yes (0.16), Unknown (0.02)
E: No (0.73), Yes (0.23), Unknown (0.02)
F: Yes (0.9), No (0.1), Table (0.0)
G: Yes (0.98), No (0.01), Table (0.0)
H: Yes (0.92), No (0.07), Table (0.0)
I: No (0.97), Yes (0.02), Chair (0.0)
J: No (0.73), Yes (0.2), Chair (0.05)
K: Yes (0.98), No (0.01), Table (0.0)
L: No (0.84), Yes (0.13), Unknown (0.01)
M: No (0.95), Yes (0.03), Couch (0.01)
N: No (0.92), Yes (0.05), Chair (0.02)
O: Yes (0.77), No (0.21), Table (0.01)
Thought: The objects that are confidently Yes are F, K, and O. I should ask which of these does not have a wooden surface next.
Question about [F, K, O]: Does this object have a wooden surface?
Answer
F: Yes (0.97), No (0.02), Wood (0.0)
K: Yes (0.99), No (0.01), Wood (0.0)
O: No (0.72), Yes (0.25), Unknown (0.01)
Thought: The table that does not have a wooden surface is the one with the highest No value or the lowest Yes value (if none are No). Since some of the answers are No, the answer is O (the highest no value).
Plan:
1. Go to object O
2. Done
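For readers who want to process these transcripts programmatically, the answer lines follow a regular pattern (object label, then up to three answers with probabilities in parentheses). The sketch below parses that pattern into a dictionary. It assumes single-word answer labels, as in the lines shown here; the names `ANSWER_RE` and `parse_answer_line` are hypothetical and not part of our pipeline.

```python
import re
from typing import Dict, Tuple

# Matches one "Label (probability)" pair, e.g. "Yes (0.97)".
# Assumption: answer labels are single words, as in the transcript above.
ANSWER_RE = re.compile(r"(\w+) \(([0-9.]+)\)")

def parse_answer_line(line: str) -> Tuple[str, Dict[str, float]]:
    """Split 'F: Yes (0.97), No (0.02), Wood (0.0)' into ('F', {...})."""
    obj, _, rest = line.partition(":")
    probs = {label: float(p) for label, p in ANSWER_RE.findall(rest)}
    return obj.strip(), probs

# Answer lines from the PG-InstructBLIP wooden-surface question above.
lines = [
    "F: Yes (0.97), No (0.02), Wood (0.0)",
    "K: Yes (0.99), No (0.01), Wood (0.0)",
    "O: No (0.72), Yes (0.25), Unknown (0.01)",
]
parsed = dict(parse_answer_line(line) for line in lines)

# The object with the largest No probability among the parsed answers.
best = max(parsed, key=lambda obj: parsed[obj].get("No", 0.0))
print(best)  # prints "O"
```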