Scene 3: Floor

A: clothing [green hoodie]

B: towel

C: clothing [striped shirt]

D: bottle [sunscreen bottle]

E: towel [socks]

F: mouse [ear thermometer]

G: suitcase

H: bottle [hand sanitizer]

I: hair dryer [dumbbell]

J: clothing [blue shirt]


Here we show a scene from our real-scene planning evaluation, along with one example task for it. The object detections and labels come from OWL-ViT and are what our planner has access to; more precise labels are given in square brackets. Color-coded bounding boxes mark the object detections that are relevant to the example task. For each version of the LLM planner, we show its output, including any interaction with a VLM, and the generated plan. Outputs generated by the VLMs are in orange, and correct and incorrect plans are marked in green and red, respectively.

Task 17: Bring me the lightest piece of clothing.

In this task, we want to determine whether the VLM can reason about mass while keeping track of which objects count as clothing. The correct plan brings only E, since the socks are lighter than any other piece of clothing in the scene.

Generated Plans

No VLM

Thought: I cannot determine the weight of the objects, but I will bring one piece of clothing to the human.

Plan:

1. Go to object A
2. Pick up object A
3. Bring to human object A
4. Done

InstructBLIP

Thought: I can ask which objects are clothing out of all the objects I can carry. Of these clothing items, I can ask which is the lightest.

Question about [A, B, C, D, E, F, G, H, I, J]: Is this object clothing?

Answer
A: Yes (0.78), No (0.19), Unknown (0.01)
B: Yes (0.85), No (0.13), Unknown (0.0)
C: Yes (0.95), No (0.04), Clothing (0.0)
D: No (0.9), Yes (0.09), Unknown (0.0)
E: Yes (0.96), No (0.03), Clothing (0.0)
F: No (0.8), Yes (0.16), Unknown (0.01)
G: No (0.63), Yes (0.33), Luggage (0.01)
H: No (0.72), Yes (0.23), Unknown (0.01)
I: No (0.48), Yes (0.47), Unknown (0.02)
J: Yes (0.84), No (0.14), Unknown (0.0)

Thought: The objects that are confidently clothing are A, B, C, E, and J. I should ask which is the lightest next.

Question about [A, B, C, E, J]: Is this object light?

Answer
A: Yes (0.65), No (0.23), Unknown (0.09)
B: Yes (0.69), No (0.2), Unknown (0.08)
C: Yes (0.63), No (0.29), Unknown (0.05)
E: Yes (0.76), No (0.18), Unknown (0.04)
J: Yes (0.76), No (0.15), Unknown (0.06)

Thought: The lightest piece of clothing is the one with the highest Yes value or the lowest No value (if none are Yes). Since some of the answers are Yes, the answer is E (the highest yes value).

Plan:

1. Go to object E
2. Pick up object E
3. Bring to human object E
4. Done
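
The selection rule stated in the final Thought above (highest Yes value, falling back to the lowest No value when no object is answered Yes) can be sketched in Python. This is an illustrative sketch only, not the planner's implementation: in the evaluation the LLM applies the rule in natural language. The helper names `top_response` and `pick_lightest` are hypothetical, and the tie-break (first object in listing order) is an assumption.

```python
# Likelihoods copied from the InstructBLIP "Is this object light?" answers above.
light_answers = {
    "A": {"Yes": 0.65, "No": 0.23, "Unknown": 0.09},
    "B": {"Yes": 0.69, "No": 0.20, "Unknown": 0.08},
    "C": {"Yes": 0.63, "No": 0.29, "Unknown": 0.05},
    "E": {"Yes": 0.76, "No": 0.18, "Unknown": 0.04},
    "J": {"Yes": 0.76, "No": 0.15, "Unknown": 0.06},
}


def top_response(scores):
    """Return the response string with the highest likelihood."""
    return max(scores, key=scores.get)


def pick_lightest(answers):
    """Hypothetical helper: highest Yes among objects answered Yes, else lowest No."""
    yes_objects = [obj for obj, s in answers.items() if top_response(s) == "Yes"]
    if yes_objects:
        # Assumption: ties go to the first object in listing order.
        return max(yes_objects, key=lambda obj: answers[obj]["Yes"])
    # No object was answered Yes, so fall back to the lowest No value.
    return min(answers, key=lambda obj: answers[obj]["No"])


print(pick_lightest(light_answers))  # prints "E"
```

On these scores E and J tie at a Yes value of 0.76, so the outcome depends on the tie-break. Breaking the tie by the lower No value would select J (0.15 versus 0.18) instead.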

PG-InstructBLIP (ours)

Thought: I can first ask which objects are clothing out of all the objects I can carry. Of these clothing items, I can ask which is the lightest.

Question about [A, B, C, D, E, F, G, H, I, J]: Is this object clothing?

Answer
A: Yes (0.9), No (0.09), Fabric (0.0)
B: Yes (0.84), No (0.14), Fabric (0.01)
C: Yes (0.9), No (0.08), Clothing (0.01)
D: No (0.93), Yes (0.06), Unknown (0.0)
E: Yes (0.95), No (0.04), Clothing (0.0)
F: No (0.95), Yes (0.04), Nothing (0.0)
G: Yes (0.58), No (0.38), Bag (0.0)
H: No (0.94), Yes (0.05), Unknown (0.0)
I: No (0.75), Yes (0.24), Unknown (0.0)
J: Yes (0.94), No (0.05), Fabric (0.0)

Thought: The objects that are confidently clothing are A, C, and J. I should ask which is the lightest next.

Question about [A, C, J]: Is this object light?

Answer
A: No (0.6), Yes (0.35), Unknown (0.03)
C: No (0.55), Yes (0.38), Unknown (0.05)
J: Yes (0.51), No (0.45), Unknown (0.03)

Thought: The lightest piece of clothing is the one with the highest Yes value or the lowest No value (if none are Yes). Since some of the answers are Yes, the answer is J (the highest yes value).

Plan:

1. Go to object J
2. Pick up object J
3. Bring to human object J
4. Done
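
For comparison, the clothing-filter step can be sketched as a fixed cutoff on the Yes likelihood, using the PG-InstructBLIP answers to "Is this object clothing?" above. This is an assumption made for illustration: the planner's shortlist is written by the LLM and no cutoff is specified in the transcript. The helper name `confident_yes` and both threshold values are hypothetical.

```python
# Yes/No likelihoods copied from the PG-InstructBLIP clothing answers above
# (the third, near-zero response for each object is omitted).
clothing_answers = {
    "A": {"Yes": 0.90, "No": 0.09},
    "B": {"Yes": 0.84, "No": 0.14},
    "C": {"Yes": 0.90, "No": 0.08},
    "D": {"Yes": 0.06, "No": 0.93},
    "E": {"Yes": 0.95, "No": 0.04},
    "F": {"Yes": 0.04, "No": 0.95},
    "G": {"Yes": 0.58, "No": 0.38},
    "H": {"Yes": 0.05, "No": 0.94},
    "I": {"Yes": 0.24, "No": 0.75},
    "J": {"Yes": 0.94, "No": 0.05},
}


def confident_yes(answers, threshold):
    """Hypothetical helper: keep objects whose Yes likelihood meets the cutoff."""
    return [obj for obj, s in answers.items() if s["Yes"] >= threshold]


print(confident_yes(clothing_answers, 0.5))  # ['A', 'B', 'C', 'E', 'G', 'J']
print(confident_yes(clothing_answers, 0.9))  # ['A', 'C', 'E', 'J']
```

Under either cutoff E, which has the highest Yes value (0.95), stays on the shortlist. The shortlist in the Thought above (A, C, and J) comes from the LLM rather than from a fixed cutoff, so it need not match either output.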