Freeform Preference Learning for Robotic Manipulation

Reward design remains a central bottleneck for autonomous robot policy improvement, especially in long-horizon manipulation tasks where sparse success labels provide too little signal and binary preferences collapse many competing notions of quality into one ambiguous signal. We introduce Freeform Preference Learning (FPL), a method for learning robot policies from freeform human preferences. Rather than asking annotators which of two trajectories is better overall, FPL lets them define natural-language preference axes, such as speed, safety, quality of placement, or carefulness, and provide pairwise preferences along each axis. These annotations are used to learn a language-conditioned reward model that maps a trajectory and preference label to an axis-specific reward. We use this model to train a reward-conditioned policy that optimizes across the multiple human-specified dimensions. Across four real-world and two simulated long-horizon manipulation tasks, FPL improves over sparse-reward and binary-preference methods by 38 percentage points. Beyond improved performance, FPL learns dense progress signals without explicit subtask segmentation, shows compositionality of behavior not present in the data, and allows users to steer the policy towards different behaviors at test time without retraining. Blog post with videos available at https://freeform-pl.github.io/fpl.website/
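The abstract does not say which objective is used to fit the reward model. A common choice for pairwise preference data is the Bradley-Terry model, so the sketch below assumes it. Everything in it is illustrative rather than taken from FPL: the fixed axis vectors stand in for a real language encoder, the two-number trajectory features are invented, and the bilinear reward is only the simplest model that can condition on an axis.

```python
import math

# Hypothetical stand-in for a language encoder: one fixed vector per axis name.
AXIS_EMBEDDINGS = {
    "speed": [1.0, 0.0],
    "carefulness": [0.0, 1.0],
}
EMBED_DIM = 2


def reward(weights, features, axis):
    """Axis-conditioned reward: features^T W axis_embedding (assumed bilinear form)."""
    emb = AXIS_EMBEDDINGS[axis]
    return sum(
        features[i] * weights[i][j] * emb[j]
        for i in range(len(features))
        for j in range(EMBED_DIM)
    )


def preference_loss(weights, preferred, rejected, axis):
    """Bradley-Terry negative log-likelihood that `preferred` beats `rejected`."""
    margin = reward(weights, preferred, axis) - reward(weights, rejected, axis)
    return math.log1p(math.exp(-margin))


def train(annotations, num_features, steps=200, lr=0.5):
    """Plain gradient descent over (preferred, rejected, axis) triples."""
    weights = [[0.0] * EMBED_DIM for _ in range(num_features)]
    for _ in range(steps):
        for preferred, rejected, axis in annotations:
            emb = AXIS_EMBEDDINGS[axis]
            margin = reward(weights, preferred, axis) - reward(weights, rejected, axis)
            # Derivative of log(1 + exp(-margin)) with respect to margin.
            grad_margin = -1.0 / (1.0 + math.exp(margin))
            for i in range(num_features):
                diff = preferred[i] - rejected[i]
                for j in range(EMBED_DIM):
                    weights[i][j] -= lr * grad_margin * diff * emb[j]
    return weights


# Invented features: [normalized duration, minimum clearance from obstacles].
fast_sloppy = [0.2, 0.1]
slow_careful = [0.9, 0.8]

# The same pair is ranked in opposite directions on the two axes.
annotations = [
    (fast_sloppy, slow_careful, "speed"),
    (slow_careful, fast_sloppy, "carefulness"),
]
weights = train(annotations, num_features=2)
```

A single overall preference label would have to pick one winner for this pair. With per-axis labels, the trained model ranks `fast_sloppy` higher when queried with "speed" and `slow_careful` higher when queried with "carefulness".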



Useful phrases from this story

competing notions of quality into (Collocation)

Competing ideas of quality.

From the story: Reward design remains a central bottleneck for autonomous robot policy improvement, especially in long-horizon manipulation tasks where sparse success labels provide too little signal and binary preferences collapse many competing notions of quality into one ambiguous signal.

learning robot policies from freeform (Collocation)

Learning robot policies from freeform (human preferences).

From the story: We introduce Freeform Preference Learning (FPL), a method for learning robot policies from freeform human preferences.

asking annotators which of two (Collocation)

Asking the annotators.

From the story: Rather than asking annotators which of two trajectories is better overall, FPL lets them define natural-language preference axes, such as speed, safety, quality of placement, or carefulness, and provide pairwise preferences along each axis.

is better overall (Collocation)

Better overall.

are used to learn a (Collocation)

Used for learning.

From the story: These annotations are used to learn a language-conditioned reward model that maps a trajectory and preference label to an axis-specific reward.
