
Scene 5: Kitchen B

A: toaster

B: light switch [electric outlet]

C: envelope [napkin on microwave]

D: light switch

E: microwave oven [microwave]

F: door

G: bottle [glass sauce bottle]

H: picnic basket [drying rack]

I: soap dispenser

J: bottle [plastic bottle with blue vanilla flavor]

K: mug [dry mug]

L: sink

M: frying pan [dirty pan in sink]

N: mug [dirty mug in sink]

O: countertop

P: waste container

Q: cupboard

R: plastic bag [trashbag]


Here we show one scene from our real-scene planning evaluation, along with one example task for it. The object detections and labels come from OWL-ViT, which is what our planner has access to; more precise labels are given in square brackets. Color-coded bounding boxes mark the object detections that are relevant to the example task. For each version of the LLM planner, we show its output, including any interaction with a VLM, and the generated plan. Outputs generated by the VLMs are shown in orange, and correct and incorrect plans are marked in green and red, respectively.

Task 31: Bring me the glass object.

This task tests whether the VLM can recognize that object G is made of glass.

Generated Plans

No VLM

Thought: There is no specific glass object mentioned in the list, but bottles are commonly made of glass. I will assume that one of the bottles (G or J) is the glass object.

Plan:

1. Go to object G
2. Pick up object G
3. Bring to human object G
4. Done

InstructBLIP

Thought: I can ask which objects are made of glass out of all the objects.

Question about [A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R]: Is this object made of glass?

Answer:
A: No (0.67), Yes (0.25), Unknown (0.04)
B: No (0.75), Yes (0.16), Unknown (0.06)
C: No (0.58), Yes (0.32), Unknown (0.06)
D: No (0.7), Yes (0.24), Unknown (0.03)
E: Yes (0.56), No (0.34), Unknown (0.05)
F: Yes (0.75), No (0.2), Bottle (0.01)
G: Yes (0.88), No (0.1), Glass (0.01)
H: No (0.71), Yes (0.22), Unknown (0.04)
I: No (0.87), Yes (0.1), Plastic (0.01)
J: Yes (0.48), No (0.43), Unknown (0.04)
K: No (0.64), Yes (0.33), Unknown (0.01)
L: Yes (0.6), No (0.32), Unknown (0.04)
M: Yes (0.5), No (0.45), Unknown (0.02)
N: No (0.49), Yes (0.46), Unknown (0.02)
O: Yes (0.58), No (0.32), Unknown (0.05)
P: No (0.84), Yes (0.12), Unknown (0.03)
Q: Yes (0.66), No (0.29), Unknown (0.02)
R: No (0.69), Yes (0.24), Unknown (0.04)

Thought: The object with the highest Yes value for being made of glass is G. I will bring object G to the human.

Plan:

1. Go to object G
2. Pick up object G
3. Bring to human object G
4. Done
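
The selection step in the Thought above (pick the object with the highest Yes value) can be sketched in a few lines of Python. This is only an illustration, not the planner's actual code: the function names `parse_answer_line` and `best_yes` are hypothetical, and the sketch assumes the answers arrive as text lines in the format shown on this page. The scores are copied from the InstructBLIP answer above.

```python
import re

# Matches one "Answer (score)" pair, e.g. "Yes (0.88)".
PAIR_RE = re.compile(r"(\w+) \(([\d.]+)\)")

def parse_answer_line(line):
    """Split 'G: Yes (0.88), No (0.1), ...' into ('G', {'Yes': 0.88, ...})."""
    label, _, rest = line.partition(":")
    scores = {name: float(value) for name, value in PAIR_RE.findall(rest)}
    return label.strip(), scores

def best_yes(lines):
    """Return (label, yes_score) for the object with the highest Yes score."""
    parsed = dict(parse_answer_line(line) for line in lines)
    label = max(parsed, key=lambda k: parsed[k].get("Yes", 0.0))
    return label, parsed[label].get("Yes", 0.0)

# InstructBLIP answers for "Is this object made of glass?" (from this page).
INSTRUCTBLIP_ANSWERS = """\
A: No (0.67), Yes (0.25), Unknown (0.04)
B: No (0.75), Yes (0.16), Unknown (0.06)
C: No (0.58), Yes (0.32), Unknown (0.06)
D: No (0.7), Yes (0.24), Unknown (0.03)
E: Yes (0.56), No (0.34), Unknown (0.05)
F: Yes (0.75), No (0.2), Bottle (0.01)
G: Yes (0.88), No (0.1), Glass (0.01)
H: No (0.71), Yes (0.22), Unknown (0.04)
I: No (0.87), Yes (0.1), Plastic (0.01)
J: Yes (0.48), No (0.43), Unknown (0.04)
K: No (0.64), Yes (0.33), Unknown (0.01)
L: Yes (0.6), No (0.32), Unknown (0.04)
M: Yes (0.5), No (0.45), Unknown (0.02)
N: No (0.49), Yes (0.46), Unknown (0.02)
O: Yes (0.58), No (0.32), Unknown (0.05)
P: No (0.84), Yes (0.12), Unknown (0.03)
Q: Yes (0.66), No (0.29), Unknown (0.02)
R: No (0.69), Yes (0.24), Unknown (0.04)
""".splitlines()

label, score = best_yes(INSTRUCTBLIP_ANSWERS)
print(label, score)  # prints: G 0.88
```

Taking the maximum over the Yes scores, rather than thresholding each object separately, yields a single candidate even though several other objects here (for example F, Q, and L) also receive Yes as their top answer.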

PG-InstructBLIP (ours)

Thought: I can ask which object is made of glass out of all the objects.

Question about [A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R]: Is this object made of glass?

Answer:
A: No (0.84), Yes (0.12), Unknown (0.02)
B: No (0.96), Yes (0.02), Unknown (0.0)
C: No (0.97), Yes (0.01), Nothing (0.0)
D: No (0.94), Yes (0.04), Unknown (0.01)
E: No (0.74), Yes (0.21), Glass (0.01)
F: Yes (0.6), No (0.34), Glass (0.03)
G: Yes (0.95), No (0.04), Glass (0.0)
H: No (0.93), Yes (0.05), Unknown (0.01)
I: No (0.93), Yes (0.05), Plastic (0.01)
J: No (0.68), Yes (0.24), Plastic (0.05)
K: Yes (0.89), No (0.11), Glass (0.0)
L: No (0.94), Yes (0.04), Metal (0.01)
M: No (0.9), Yes (0.08), Glass (0.0)
N: Yes (0.73), No (0.25), Glass (0.01)
O: No (0.88), Yes (0.08), Unknown (0.01)
P: No (0.95), Yes (0.03), Unknown (0.0)
Q: No (0.94), Yes (0.04), Unknown (0.0)
R: No (0.87), Yes (0.09), Unknown (0.01)

Thought: The glass object is the one with the highest Yes value. The glass object is G.

Plan:

1. Go to object G
2. Pick up object G
3. Bring to human object G
4. Done
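
All three planners on this page emit the same four-step plan once an object has been chosen. As a rough illustration of that plan format, the sketch below builds the steps from an object label. The helper name `make_fetch_plan` is hypothetical and is not part of the planner; the step wording is copied from the plans shown here.

```python
def make_fetch_plan(label: str) -> list[str]:
    """Build the four-step fetch plan shown on this page for one object label."""
    return [
        f"Go to object {label}",
        f"Pick up object {label}",
        f"Bring to human object {label}",
        "Done",
    ]

# Print the plan as a numbered list, matching the layout used on this page.
for number, step in enumerate(make_fetch_plan("G"), start=1):
    print(f"{number}. {step}")
```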