
HW 1: RL and IRL

Table of contents
  1. Introduction
    1. Collaboration Policy
    2. Code Re-use Policy
  2. Part 1: Understanding Your Data
  3. Part 2: Implementing Algorithms and Recording Feedback
  4. Part 3: Evaluation
  5. Part 4: Report
  6. What to submit

Introduction

The goal of this project is for you to gain experience with:

  1. Loading, visualizing, modifying, and writing demonstration and reward data
  2. Implementing/modifying code for RL and IRL algorithms
  3. Guiding robot behavior by selecting demonstrations and rewards
  4. Evaluating and comparing the performance of RL/IRL algorithms

You’ll submit the following deliverables via Canvas:

  1. Your code and ReadMe file for two RL/IRL algorithms
  2. A series of videos showing the robot behaviors that each algorithm produced
  3. A report containing both (i) your evaluation results and (ii) your answers to the reflection questions

Collaboration Policy

You are welcome (and encouraged!) to collaborate with others. However, you must submit your own code and fully understand how it works. In your report, you must state your collaborators and acknowledge how they assisted.

Code Re-use Policy

You are welcome to directly use existing packages or libraries as “helper code” within your project. You are also welcome to reference papers and pseudocode, and adapt online implementation examples of the algorithms you are using. However, you must write your own algorithm code, fully understand how it works, and acknowledge any resources you have referenced or adapted.

Part 1: Understanding Your Data

  1. Download and expand config.zip and planning.zip. Together, these contain 37 trajectories, each consisting of a series of joint-space poses formatted as the radian value of each joint: j0, j1, j2, j3, j4, j5, j6. (A sketch for loading these files follows this list.)

  2. Now let’s replay one of these trajectories.
    • Launch Gazebo and the planning service:
      ros2 launch xarm_planner xarm7_planner_gazebo.launch.py add_gripper:=true
      
    • Within the planning directory, run the following to simulate the objects in environment 3:
      python3 spawn_goals.py -env 3
      
    • Then run a trajectory. The following command runs a trajectory intended to reach goal 1 in environment 3:
      python3 xarmJointPlanningClient.py -env 3 -g 1 -traj 1
      
    • To change the environment, you’ll need to delete the existing objects:
      python3 delete_goals.py
      
    • To observe the robot’s end-effector pose, you’ll need to get the transform from the robot’s base to its gripper. You can do this from the command line:
      ros2 run tf2_ros tf2_echo world link_eef 
      
    • Or you can do it programmatically, as shown in eef_publisher.py. This file sets up a ROS node that “listens” for the transform data; every time it receives new data, it publishes it to a ROS topic. Check out this tutorial for more information about transforms.
    • Running python3 eef_publisher.py will publish the end-effector pose to the /eef_traj topic. In a separate terminal window, you can watch its output by running:
      ros2 topic echo /eef_traj
      
  3. Try simulating a few of these trajectories. You’ll notice some patterns:
    • There are 4 cubes per environment, each representing a different object that the robot could be trying to pick up.
    • There are 3 trajectories that are intended to reach each goal pose.
    • There are some spheres in the way of some of these trajectories. They don’t mean anything on their own, but you could decide that they are important for whatever behavior you’d like the robot to learn.
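
If you want to load and inspect these trajectory files in your own code, a minimal sketch like the one below may help. The filename and the comma-delimited format are assumptions; check the actual files in the planning directory and adjust as needed.

    # Sketch: load one trajectory as an (N x 7) array of joint angles in radians.
    # The path and delimiter are assumptions -- check the files that ship with
    # planning.zip and adjust accordingly.
    import numpy as np

    def load_trajectory(path):
        """Return an array of shape (num_waypoints, 7), one row per joint-space pose."""
        traj = np.loadtxt(path, delimiter=",")
        assert traj.ndim == 2 and traj.shape[1] == 7, "expected 7 joint values per row"
        return traj

    if __name__ == "__main__":
        traj = load_trajectory("planning/env3_goal1_traj1.csv")  # hypothetical filename
        print(traj.shape)  # e.g. (25, 7)
        print(traj[0])     # first waypoint: [j0, ..., j6] in radians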

Part 2: Implementing Algorithms and Recording Feedback

  1. Choose two algorithms to implement from any RL or IRL paper we’ve discussed on/before Feb 5. Think about how these two algorithms differ from each other in their approach, training data, and output.

  2. Decide on three different behaviors that you want the robot to learn. You’ll teach the robot these three different behaviors based on how you assign feedback to the trajectories. Record your three sets of feedback.

    • Your algorithm choices will dictate what kind of feedback you need to provide: demonstrations, rewards, preferences, etc. It’s up to you to decide how you’d like to record this feedback (one possible format is sketched after this list).
    • For demonstrations, you may wish to indicate a subset of the trajectories that should be used to train the model.
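
One lightweight way to record each behavior’s feedback is a small file that maps each trajectory to a label. The JSON schema below is only an illustrative assumption; adapt it to whatever your two algorithms actually consume (rewards, preference pairs, or a subset of demonstrations).

    # Sketch: one possible feedback file for a single behavior, keyed by
    # (env, goal, traj). The schema and example values are assumptions,
    # not a required format.
    import json

    feedback_behavior1 = {
        "type": "reward",  # could instead be "preference" or "demo"
        "labels": [
            {"env": 3, "goal": 1, "traj": 1, "reward": 1.0},
            {"env": 3, "goal": 2, "traj": 1, "reward": -1.0},
        ],
    }

    with open("feedback_behavior1.json", "w") as f:
        json.dump(feedback_behavior1, f, indent=2)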

Part 3: Evaluation

Now we’ll compare these algorithms based on their sample efficiency and the trained model’s performance. The performance metric you choose should reflect the distance from the optimal policy or weight vector.

  1. Create an evaluation pipeline with the following steps (a skeleton sketch follows this list):
    • Read in the evaluation parameters: # of training datapoints, # of testing datapoints, and filename containing the relevant feedback for the behavior you’d like to train/test.
    • Randomly sample the training and testing datasets. Note: make sure that the test data contains only environment configurations that are unseen in the training data.
    • Train your policy/reward model over the training dataset.
    • Test the trained model over the training dataset first to see how well it reproduces the training data. Obtain your evaluation metrics over this training data and save them to a file.
    • Test the same model over the testing dataset and obtain those evaluation metrics. Save these metrics to a different file.
  2. Run this evaluation pipeline multiple times so you get data over different training and testing data splits.
    • Note: make sure you use different seeds for the random sampler. Otherwise, it’ll just select the same random data split every time you run it.
  3. Repeat this evaluation for each algorithm and with multiple ratios of training/test data. Create a graph showing the relationship between the # of training samples and the algorithms’ performance metrics.

  4. Record some videos showing examples of the behavior resulting from each algorithm when using different amounts of data.
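
A skeleton for the evaluation pipeline might look like the sketch below. The train_model and evaluate functions are placeholders for your own algorithm and metric code, and the flag names and feedback-file schema are assumptions; the key points are the seeded environment-level split (so test environments are unseen at training time) and saving train and test metrics to separate files.

    # Sketch: evaluation pipeline skeleton. train_model() and evaluate() are
    # placeholders for your own RL/IRL code; flag names and the feedback file
    # schema (a JSON "labels" list with an "env" field) are assumptions.
    import argparse
    import json
    import random

    def train_model(train_data):
        """Placeholder: replace with your RL/IRL training code."""
        raise NotImplementedError

    def evaluate(model, data):
        """Placeholder: replace with your performance metric(s)."""
        raise NotImplementedError

    def split_by_environment(samples, n_train, n_test, seed):
        """Split so that test environments never appear in the training set."""
        rng = random.Random(seed)
        envs = sorted({s["env"] for s in samples})
        rng.shuffle(envs)
        train_envs = set(envs[: len(envs) // 2])  # simple half/half environment split
        train_pool = [s for s in samples if s["env"] in train_envs]
        test_pool = [s for s in samples if s["env"] not in train_envs]
        return rng.sample(train_pool, n_train), rng.sample(test_pool, n_test)

    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--n-train", type=int, required=True)
        parser.add_argument("--n-test", type=int, required=True)
        parser.add_argument("--feedback", required=True)    # feedback file for one behavior
        parser.add_argument("--seed", type=int, default=0)  # change per run for a new split
        args = parser.parse_args()

        with open(args.feedback) as f:
            samples = json.load(f)["labels"]
        train, test = split_by_environment(samples, args.n_train, args.n_test, args.seed)

        model = train_model(train)                          # your RL/IRL training code
        for name, data in [("train", train), ("test", test)]:
            with open(f"{name}_metrics_seed{args.seed}.json", "w") as f:
                json.dump(evaluate(model, data), f, indent=2)

    if __name__ == "__main__":
        main()

Running a script like this several times with different --seed values (step 2) and different --n-train values (step 3) gives you the data splits and metrics you need for your graphs.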

Part 4: Report

Write up a report that answers these questions:

  1. If applicable: who were your collaborators? Describe everyone’s role within the collaboration.
  2. What algorithms did you implement? At a high level, how are they similar or different? How did you modify them for this assignment?
  3. Why did you choose these two algorithms?
  4. What were your hypotheses for how these algorithms would perform at this problem?
  5. How did you modify the demonstration data to incorporate rewards, feedback, etc.?
  6. Present the result graphs and describe them. What trends do you see? When do you recommend using one algorithm over the other?
  7. How did these results compare to your hypotheses? Did anything surprise you?

What to submit

On Canvas, upload your:

  1. Code, feedback data, and ReadMe file
  2. Report
  3. Videos