
Scene 6: Salad Bar

A: envelope [sign on napkin dispenser]

B: humidifier [napkin dispenser]

C: ladle [metal tongs]

D: food [two salad containers on the right]

E: bottle [red wine vinegar bottle]

F: frying pan [closer salad tray]

G: paper [napkin coming out of dispenser]

H: countertop

I: bottle [olive oil bottle]

J: bottle [black container on the right]

K: bottle [black container on the left]

L: juice [olive oil inside bottle]

M: cabinetry

N: countertop [more cropped in view of countertop]

O: bowl [paper plate under the counter]


Here we present one scene from our real-scene planning evaluation, together with an example task. We list the object detections and labels from OWL-ViT, which are what our planner has access to, and give more precise labels in square brackets. Color-coded bounding boxes mark the detections relevant to the example task. For each version of the LLM planner, we show its output, including any interaction with a VLM, and the generated plan. Outputs generated by the VLMs are in orange, and correct and incorrect plans are marked in green and red, respectively.

Task 38: Bring me an empty container that you can confidently use to contain liquids, if one exists. Otherwise, tell the user that no suitable containers exist.

In this task, we test whether the VLM understands which containers are empty and can hold liquids. The bottles on the counter (E, I, J, K) are not empty, and the paper plate (O) cannot be used to carry water because it is too flat. Therefore, the correct answer is that no suitable containers exist.

Generated Plans

No VLM

Thought: I will check if any of the bottles are empty and suitable for containing liquids.

Plan:

1. Go to object E
2. Pick up object E
3. Bring to human object E
4. Done

InstructBLIP

Thought: I can first ask which objects are containers out of all the objects I can carry. Of these containers, I can ask which is empty and can confidently contain liquids.

Question about [A, B, C, D, E, F, G, H, I, J, K, L, M, N, O]: Is this object a container?

Answer
A: Yes (0.9), No (0.08), Container (0.0)
B: Yes (0.91), No (0.08), Unknown (0.0)
C: Yes (0.85), No (0.13), Unknown (0.0)
D: Yes (0.93), No (0.06), Unknown (0.0)
E: Yes (0.98), No (0.02), Jar (0.0)
F: Yes (0.91), No (0.08), Unknown (0.0)
G: Yes (0.64), No (0.32), Unknown (0.01)
H: Yes (0.89), No (0.08), Unknown (0.01)
I: Yes (0.97), No (0.02), Jar (0.0)
J: Yes (0.98), No (0.02), Jar (0.0)
K: Yes (0.98), No (0.01), Jar (0.0)
L: Yes (0.95), No (0.04), Unknown (0.0)
M: Yes (0.62), No (0.33), Unknown (0.01)
N: Yes (0.91), No (0.07), Bottle (0.0)
O: Yes (0.96), No (0.04), Container (0.0)

Thought: The objects that are confidently containers are A, B, C, D, E, F, G, H, I, J, K, L, N, and O. I should ask which of these is empty and can confidently contain liquids.

Question about [A, B, C, D, E, F, G, H, I, J, K, L, N, O]: Is this object empty and can confidently contain liquids?

Answer
A: Yes (0.76), No (0.21), Unknown (0.01)
B: Yes (0.57), No (0.39), Unknown (0.01)
C: Yes (0.65), No (0.31), Unknown (0.01)
D: Yes (0.6), No (0.36), Unknown (0.01)
E: Yes (0.79), No (0.19), Unknown (0.0)
F: Yes (0.68), No (0.29), Unknown (0.01)
G: No (0.47), Yes (0.46), Unknown (0.03)
H: Yes (0.75), No (0.22), Unknown (0.01)
I: Yes (0.77), No (0.21), Unknown (0.0)
J: Yes (0.77), No (0.21), Unknown (0.0)
K: Yes (0.73), No (0.25), Unknown (0.0)
L: Yes (0.76), No (0.22), Unknown (0.0)
N: Yes (0.73), No (0.24), Bottle (0.0)
O: Yes (0.79), No (0.19), Unknown (0.01)

Thought: The objects that are empty and can confidently contain liquids are A, B, C, D, E, F, H, I, J, K, L, N, and O. The object with the highest Yes value is E.

Plan:

1. Go to object E
2. Pick up object E
3. Bring to human object E
4. Done
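The selection step in the transcript above can be restated as a small filter over the reported likelihoods. This is only an illustrative sketch: the planner is an LLM that reasons in text, not this code, and the helper names `objects_answered_yes` and `best_candidates` are hypothetical. The numbers are copied from the second InstructBLIP answer list as (Yes, No) pairs.

```python
# (yes, no) likelihoods copied from the second InstructBLIP answer list above.
second_answer = {
    "A": (0.76, 0.21), "B": (0.57, 0.39), "C": (0.65, 0.31), "D": (0.60, 0.36),
    "E": (0.79, 0.19), "F": (0.68, 0.29), "G": (0.46, 0.47), "H": (0.75, 0.22),
    "I": (0.77, 0.21), "J": (0.77, 0.21), "K": (0.73, 0.25), "L": (0.76, 0.22),
    "N": (0.73, 0.24), "O": (0.79, 0.19),
}


def objects_answered_yes(answers):
    """Hypothetical helper: keep labels whose Yes likelihood exceeds No."""
    return [label for label, (yes, no) in answers.items() if yes > no]


def best_candidates(answers, labels):
    """Hypothetical helper: return every label tied for the highest Yes."""
    top = max(answers[label][0] for label in labels)
    return [label for label in labels if answers[label][0] == top]


kept = objects_answered_yes(second_answer)   # every object except G
tied = best_candidates(second_answer, kept)  # labels sharing the top Yes value
print(kept)
print(tied)
```

At the two-decimal precision shown, E and O share the highest Yes value (0.79), so the sketch returns both. Either way, 13 of the 14 queried objects pass the filter, which leaves the planner with no useful signal for this task.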

PG-InstructBLIP (ours)

Thought: I can first ask which objects are empty containers out of all the objects I can carry. Of these containers, I can ask which can confidently contain liquids.

Question about [A, B, C, D, E, F, G, H, I, J, K, L, M, N, O]: Is this object an empty container?

Answer
A: No (0.51), Yes (0.45), Unknown (0.02)
B: No (0.69), Yes (0.29), Unknown (0.01)
C: No (0.7), Yes (0.28), Unknown (0.01)
D: No (0.85), Yes (0.14), Food (0.0)
E: No (0.77), Yes (0.21), Unknown (0.0)
F: No (0.85), Yes (0.14), Food (0.0)
G: No (0.57), Yes (0.39), Unknown (0.02)
H: No (0.64), Yes (0.34), Unknown (0.01)
I: No (0.85), Yes (0.14), Unknown (0.0)
J: No (0.54), Yes (0.45), Unknown (0.01)
K: No (0.79), Yes (0.2), Unknown (0.0)
L: No (0.65), Yes (0.34), Unknown (0.0)
M: No (0.67), Yes (0.31), Unknown (0.0)
N: No (0.7), Yes (0.27), Unknown (0.01)
O: Yes (0.8), No (0.19), Unknown (0.0)

Thought: The only object that is confidently an empty container is O. I should ask if it can confidently contain liquids next.

Question about [O]: Can this object confidently contain liquids?

Answer
O: No (0.76), Yes (0.23), Unknown (0.0)

Thought: Since object O cannot confidently contain liquids, there are no suitable containers.

Plan:

1. Tell the human that no suitable containers exist
2. Done
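The two-question chain above, ending in the fallback plan, can be sketched in the same spirit. Again this is an assumption-laden illustration rather than the planner itself: the function names `answered_yes` and `make_plan` are hypothetical, and the (Yes, No) pairs are copied from the two PG-InstructBLIP answer lists.

```python
# (yes, no) likelihoods copied from the PG-InstructBLIP answers above.
is_empty_container = {
    "A": (0.45, 0.51), "B": (0.29, 0.69), "C": (0.28, 0.70), "D": (0.14, 0.85),
    "E": (0.21, 0.77), "F": (0.14, 0.85), "G": (0.39, 0.57), "H": (0.34, 0.64),
    "I": (0.14, 0.85), "J": (0.45, 0.54), "K": (0.20, 0.79), "L": (0.34, 0.65),
    "M": (0.31, 0.67), "N": (0.27, 0.70), "O": (0.80, 0.19),
}
can_contain_liquids = {"O": (0.23, 0.76)}


def answered_yes(answers, labels=None):
    """Hypothetical helper: labels (optionally restricted) where Yes beats No."""
    labels = answers.keys() if labels is None else labels
    return [label for label in labels if answers[label][0] > answers[label][1]]


def make_plan(candidates):
    """Hypothetical helper: fetch the first candidate, or report that none exist."""
    if not candidates:
        return ["Tell the human that no suitable containers exist", "Done"]
    target = candidates[0]
    return [
        f"Go to object {target}",
        f"Pick up object {target}",
        f"Bring to human object {target}",
        "Done",
    ]


stage_one = answered_yes(is_empty_container)              # empty containers
stage_two = answered_yes(can_contain_liquids, stage_one)  # ... that hold liquids
plan = make_plan(stage_two)
print(stage_one, stage_two)
print(plan)
```

Only O survives the first question, and it is rejected by the second, so the candidate list is empty and the sketch falls through to the "no suitable containers" plan, matching the transcript.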