Scene 5: Kitchen B

A: toaster
B: light switch [electric outlet]
C: envelope [napkin on microwave]
D: light switch
E: microwave oven [microwave]
F: door
G: bottle [glass sauce bottle]
H: picnic basket [drying rack]
I: soap dispenser
J: bottle [plastic bottle with blue vanilla flavor]
K: mug [dry mug]
L: sink
M: frying pan [dirty pan in sink]
N: mug [dirty mug in sink]
O: countertop
P: waste container
Q: cupboard
R: plastic bag [trashbag]
Here we show one scene from our real-scene planning evaluation, together with one example task. The object detections and labels come from OWL-ViT and are what our planner has access to; more precise labels are given in square brackets. Color-coded bounding boxes mark the object detections that are relevant to the example task. For each version of the LLM planner, we give its output, including any interaction with a VLM, and the generated plan. Outputs generated by the VLMs are in orange, and correct and incorrect plans are marked in green and red, respectively.
Task 31: Bring me the glass object.
This task tests whether the VLM can recognize that G is made of glass.
Generated Plans
No VLM
Thought: There is no specific glass object mentioned in the list, but bottles are commonly made of glass. I will assume that one of the bottles (G or J) is the glass object.
Plan:
1. Go to object G
2. Pick up object G
3. Bring to human object G
4. Done
InstructBLIP
Thought: I can ask which objects are made of glass out of all the objects.
Question about [A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R]: Is this object made of glass?
Answer:
A: No (0.67), Yes (0.25), Unknown (0.04)
B: No (0.75), Yes (0.16), Unknown (0.06)
C: No (0.58), Yes (0.32), Unknown (0.06)
D: No (0.7), Yes (0.24), Unknown (0.03)
E: Yes (0.56), No (0.34), Unknown (0.05)
F: Yes (0.75), No (0.2), Bottle (0.01)
G: Yes (0.88), No (0.1), Glass (0.01)
H: No (0.71), Yes (0.22), Unknown (0.04)
I: No (0.87), Yes (0.1), Plastic (0.01)
J: Yes (0.48), No (0.43), Unknown (0.04)
K: No (0.64), Yes (0.33), Unknown (0.01)
L: Yes (0.6), No (0.32), Unknown (0.04)
M: Yes (0.5), No (0.45), Unknown (0.02)
N: No (0.49), Yes (0.46), Unknown (0.02)
O: Yes (0.58), No (0.32), Unknown (0.05)
P: No (0.84), Yes (0.12), Unknown (0.03)
Q: Yes (0.66), No (0.29), Unknown (0.02)
R: No (0.69), Yes (0.24), Unknown (0.04)
Thought: The object with the highest Yes value for being made of glass is G. I will bring object G to the human.
Plan:
1. Go to object G
2. Pick up object G
3. Bring to human object G
4. Done
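The selection step in the Thought above (take the object with the highest Yes value) can be sketched as follows. This is only an illustration, not the planner's actual implementation: the helper names `parse_answers` and `most_likely_yes` are hypothetical, the planner itself is an LLM reasoning over this text, and only a subset of the InstructBLIP answer lines is copied in for brevity.

```python
import re

# A subset of the InstructBLIP answer lines, copied from the output above.
ANSWERS = """\
E: Yes (0.56), No (0.34), Unknown (0.05)
F: Yes (0.75), No (0.2), Bottle (0.01)
G: Yes (0.88), No (0.1), Glass (0.01)
J: Yes (0.48), No (0.43), Unknown (0.04)
Q: Yes (0.66), No (0.29), Unknown (0.02)
"""


def parse_answers(text):
    """Map each object letter to a {response: probability} dict (hypothetical helper)."""
    parsed = {}
    for line in text.strip().splitlines():
        label, rest = line.split(":", 1)
        # Each entry looks like "Yes (0.88)": a response word and its probability.
        pairs = re.findall(r"(\w+) \(([\d.]+)\)", rest)
        parsed[label.strip()] = {resp: float(prob) for resp, prob in pairs}
    return parsed


def most_likely_yes(parsed):
    """Return the object whose "Yes" probability is highest (hypothetical helper)."""
    return max(parsed, key=lambda obj: parsed[obj].get("Yes", 0.0))


scores = parse_answers(ANSWERS)
best = most_likely_yes(scores)
print(best, scores[best]["Yes"])  # prints: G 0.88
```

Among the copied lines, F (the door) has the second-highest Yes value at 0.75, so an argmax of this kind still selects G.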
PG-InstructBLIP (ours)
Thought: I can ask which object is made of glass out of all the objects.
Question about [A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R]: Is this object made of glass?
Answer:
A: No (0.84), Yes (0.12), Unknown (0.02)
B: No (0.96), Yes (0.02), Unknown (0.0)
C: No (0.97), Yes (0.01), Nothing (0.0)
D: No (0.94), Yes (0.04), Unknown (0.01)
E: No (0.74), Yes (0.21), Glass (0.01)
F: Yes (0.6), No (0.34), Glass (0.03)
G: Yes (0.95), No (0.04), Glass (0.0)
H: No (0.93), Yes (0.05), Unknown (0.01)
I: No (0.93), Yes (0.05), Plastic (0.01)
J: No (0.68), Yes (0.24), Plastic (0.05)
K: Yes (0.89), No (0.11), Glass (0.0)
L: No (0.94), Yes (0.04), Metal (0.01)
M: No (0.9), Yes (0.08), Glass (0.0)
N: Yes (0.73), No (0.25), Glass (0.01)
O: No (0.88), Yes (0.08), Unknown (0.01)
P: No (0.95), Yes (0.03), Unknown (0.0)
Q: No (0.94), Yes (0.04), Unknown (0.0)
R: No (0.87), Yes (0.09), Unknown (0.01)
Thought: The glass object is the one with the highest Yes value. The glass object is G.
Plan:
1. Go to object G
2. Pick up object G
3. Bring to human object G
4. Done
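As a rough sketch of how the Yes values above lead to this plan, the snippet below ranks the objects by their Yes probability and fills the four-step plan format for the top-ranked one. The functions `rank_by_yes` and `fetch_plan` are hypothetical names introduced for illustration; the actual planner produces the plan through LLM reasoning rather than a fixed template.

```python
# "Yes" probabilities copied from the PG-InstructBLIP answer above.
YES_PROB = {
    "A": 0.12, "B": 0.02, "C": 0.01, "D": 0.04, "E": 0.21, "F": 0.6,
    "G": 0.95, "H": 0.05, "I": 0.05, "J": 0.24, "K": 0.89, "L": 0.04,
    "M": 0.08, "N": 0.73, "O": 0.08, "P": 0.03, "Q": 0.04, "R": 0.09,
}


def rank_by_yes(yes_prob):
    """Return object letters sorted from highest to lowest "Yes" probability."""
    return sorted(yes_prob, key=yes_prob.get, reverse=True)


def fetch_plan(obj):
    """Fill the four-step plan format used in this example (assumed template)."""
    return [
        f"Go to object {obj}",
        f"Pick up object {obj}",
        f"Bring to human object {obj}",
        "Done",
    ]


ranking = rank_by_yes(YES_PROB)
plan = fetch_plan(ranking[0])
for i, step in enumerate(plan, start=1):
    print(f"{i}. {step}")
```

With these values the ranking begins G, K, N, F, so the top-ranked object is G and the printed steps match the plan above.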