Video

Project Summary

The goal of this project is to create a shepherd AI that can lead sheep, randomly spawned on the map, into a pen. At the current stage we are working in a small, peaceful environment. Eventually the agent should be able to dodge traps, survive in the arena, and bring back as many sheep as it can.

Approach

For our project, we decided to use a tabular Q-learning method to train our shepherd AI, starting from our Assignment 2 code. We formalize our RL update as:

Q(S_t, A_t) ← Q(S_t, A_t) + α · (G_t − Q(S_t, A_t)),  where  G_t = R_{t+1} + γ·R_{t+2} + … + γ^{n−1}·R_{t+n} + γ^n·Q(S_{t+n}, A_{t+n})

Here α is the learning rate and γ the value decay rate, as covered in lecture. In more intuitive terms, the training loop in our code looks like this (a Malmo-level sketch follows the pseudocode):

For each epoch:
	Initialize state S
	While the state is not the end state and the time limit has not been exceeded:
		Choose A from S based on epsilon greedy policy from Q table
		Take action A, observe R, S_t+1
		update Q table and state S
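
Concretely, each epoch corresponds to one Malmo mission. The sketch below shows roughly how an epoch interacts with Malmo; choose_action is our method shown later, while agent, get_observation, num_epochs, mission, mission_record, and max_steps are illustrative placeholders rather than our exact names:

import MalmoPython
import time

agent_host = MalmoPython.AgentHost()
for epoch in range(num_epochs):
    agent_host.startMission(mission, mission_record)     # spawns the arena, pen, and sheep
    world_state = agent_host.getWorldState()
    while not world_state.has_mission_begun:             # wait for the mission to start
        time.sleep(0.1)
        world_state = agent_host.getWorldState()
    steps = 0
    while world_state.is_mission_running and steps < max_steps:
        state = agent.get_observation(world_state)       # (placeholder) parse S from the JSON observation
        action = agent.choose_action(state, list(agent.possible_actions.values()))
        agent_host.sendCommand(action)                   # take action A
        time.sleep(0.2)
        world_state = agent_host.getWorldState()         # observe R and S_t+1 on the next pass
        steps += 1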

To keep the problem simple, we built a small 60x60 arena that holds only two sheep. The herding pen sits just outside the east side of the arena. For now we do not require the agent to do everything a player would do in a real Minecraft environment, such as leading the sheep into the yard and closing the gate. The agent only needs to seek out the sheep, lure them with wheat, and lead them out of the arena into the pen.
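
For reference, the two sheep are placed at random positions inside this arena when the mission starts. The snippet below is only a sketch of how that can be done with DrawEntity elements in the mission XML; the coordinates, the y level, and where the string is spliced into our mission file are illustrative rather than taken from our actual code:

import random

# Sketch: build DrawEntity tags for two randomly placed sheep inside the 60x60 arena.
sheep_xml = ""
for _ in range(2):
    x, z = random.randint(1, 58), random.randint(1, 58)
    sheep_xml += '<DrawEntity x="{}" y="5" z="{}" type="Sheep"/>\n'.format(x + 0.5, z + 0.5)
# sheep_xml is then substituted into the <DrawingDecorator> section of the mission string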

Our action space consists of movement (east, west, north, and south) and whether to hold wheat in hand (by switching the hotbar slot), for six actions in total. In the agent class we implement this as follows:

self.possible_actions = {0: "move 0.5", 1: "move -0.5", 2: "strafe 0.5", 3: "strafe -0.5", 4: "hotbar.2 1", 5: "hotbar.1 1"}
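
Each chosen index is looked up in this dictionary and sent to Malmo as a command string. The snippet below is a rough sketch of that call site (action_idx and agent_host are placeholders); for the hotbar actions we assume the key is released right after it is pressed:

command = agent.possible_actions[action_idx]          # e.g. "move 0.5" or "hotbar.2 1"
agent_host.sendCommand(command)
if command.startswith("hotbar"):
    agent_host.sendCommand(command.split()[0] + " 0")  # release the hotbar key again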

Our observation space is a list of positions covering the agent and all of the sheep. Here is the code:

if world_state.number_of_observations_since_last_state > 0:
    msg = world_state.observations[-1].text
    ob = json.loads(msg)
    sheep_location = []
    for ent in ob["entities"]:
        if ent["name"] == "Jesus":   # "Jesus" is our shepherd agent's name
            self.prev_location = self.location
            self.location = (ent["x"], ent["z"])
        if ent["name"] == "Sheep":
            sheep_location.append((ent["x"], ent["z"]))
    self.sheep = sheep_location
    return (self.location, tuple(sheep_location))
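
The choose_action method shown later keys the Q-table on a state returned by get_q_state(). That helper is not included in this report, so the version below is only a sketch of what it might look like: it buckets the continuous coordinates so that nearby positions share a table entry (the bucket size of 3 blocks is an assumption):

def get_q_state(self, cell=3):
    # Sketch only: discretize the agent's and sheep's positions into a hashable tuple.
    # The real implementation may differ; 'cell' is an assumed bucket size in blocks.
    ax, az = self.location
    agent_cell = (int(ax) // cell, int(az) // cell)
    sheep_cells = tuple(sorted((int(sx) // cell, int(sz) // cell) for sx, sz in self.sheep))
    return (agent_cell, sheep_cells)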

We simply reward the agent for sheep that are herded and penalize it for failing to reach the pen.

rewards_ledger = {
    "sheep are near": 4,
    "some sheep herded": 25,
    "all sheep herded": 10000,
    "no sheep herded": -100,
    "pen not reached": -150
}
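
The "sheep are near" entry is presumably a small bonus granted while the sheep stay close to the agent; the remaining entries score the outcome of a session. The exact trigger conditions are not shown in this report, so the helper below is only a sketch of how the end-of-session entries could be combined (the function name and its arguments are our own illustration):

def final_reward(herded, total_sheep, reached_pen):
    # Illustrative only: combine the end-of-session entries from rewards_ledger.
    reward = 0
    if herded == total_sheep:
        reward += rewards_ledger["all sheep herded"]
    elif herded > 0:
        reward += rewards_ledger["some sheep herded"]
    else:
        reward += rewards_ledger["no sheep herded"]
    if not reached_pen:
        reward += rewards_ledger["pen not reached"]
    return reward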

At each timestep, the Q-table is updated using the queued states, actions, and rewards, together with the terminating timestep T. The code below shows how it works:

def update_q_table(self, tau, S, A, R, T): # got from assignment 2
        """Performs relevant updates for state tau.
        Args
            tau: <int>  state index to update
            S:   <deque>   states queue
            A:   <deque>   actions queue
            R:   <deque>   rewards queue
            T:   <int>      terminating state index
        """
        curr_s, curr_a, curr_r = S.popleft(), A.popleft(), R.popleft()
        G = sum([self.gamma ** i * R[i] for i in range(len(S))])
        if tau + self.n < T:
            G += self.gamma ** self.n * self.q_table[S[-1]][A[-1]]
            
        old_q = self.q_table[curr_s][curr_a]
        self.q_table[curr_s][curr_a] = old_q + self.alpha * (G - old_q)
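
This routine is driven by the same n-step bookkeeping as in Assignment 2: each step pushes a state, action, and reward onto deques, and once enough steps have accumulated the oldest entry is flushed through update_q_table. The function below is only a rough sketch of that call pattern; env_step is a stand-in for executing one action in Malmo, and the structure mirrors the Assignment 2 loop rather than our exact code:

from collections import deque

def run_episode(agent, env_step, s0, max_steps=200):
    # env_step(action) -> (reward, next_state, done) stands in for one Malmo step.
    actions = list(agent.possible_actions.values())
    S, A, R = deque(), deque(), deque()
    S.append(s0)
    A.append(agent.choose_action(s0, actions))
    R.append(0)                                   # dummy reward paired with the start state
    T = max_steps
    for t in range(max_steps + agent.n):
        if t < T:
            r, s_next, done = env_step(A[-1])     # execute the most recently chosen action
            if done:
                T = t + 1
            S.append(s_next)
            A.append(agent.choose_action(s_next, actions))
            R.append(r)
        tau = t - agent.n + 1                     # oldest step whose n-step return is now complete
        if tau >= 0:
            agent.update_q_table(tau, S, A, R, T)
        if tau == T - 1:                          # every visited state has been updated
            break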

The agent follows an ε-greedy policy: with probability ε it takes a random action, and otherwise it chooses the action with the best Q-value from the Q-table. The contribution of long-term rewards decays by γ with each additional step in the return:

    def choose_action(self, state, possible_actions):
        # (uses random and collections.defaultdict, imported at the top of the module)
        # make sure the current state and every action have an entry in the Q-table
        curr_state = self.get_q_state()
        if curr_state not in self.q_table:
            self.q_table[curr_state] = {}
        for action in possible_actions:
            if action not in self.q_table[curr_state]:
                self.q_table[curr_state][action] = 0

        action = ""
        if random.random() < self.epsilon:
            # explore: pick a random action
            action = possible_actions[random.randint(0, len(possible_actions) - 1)]
        else:
            # exploit: group actions by Q-value and pick randomly among the best ones
            temp_r_map = self.q_table[curr_state]
            actions_by_reward = defaultdict(list)
            sorted_acts = []
            for key, val in temp_r_map.items():
                actions_by_reward[val].append(key)
            for reward, acts_list in actions_by_reward.items():
                sorted_acts.append((reward, acts_list))
            sorted_acts.sort(key=lambda tup: tup[0], reverse=True)
            action = random.choice(sorted_acts[0][1])
        return action

Through several test runs, we found that setting the random-action rate to 0.12 helps the agent figure out how to herd the sheep faster: it needed 50 runs with epsilon at 0.2 and only 30 runs with epsilon at 0.12.

def __init__(self, alpha=0.3, gamma=1, n=1):
    """Constructing an RL agent.
    Args
        alpha:  <float>  learning rate      (default = 0.3)
        gamma:  <float>  value decay rate   (default = 1)
        n:      <int>    number of back steps to update (default = 1)
    """
    # q-learning variables
    self.epsilon = 0.12 # chance of taking a random action instead of the best
    self.q_table = {}
    self.n, self.alpha, self.gamma = n, alpha, gamma

Evaluation

Training results

This figure shows the first 100 sessions of one run. The blue line is the reward of each session. The orange line is the average reward, i.e. the total reward accumulated so far divided by the number of sessions run. The average reward increases as we train our model, and in general the agent improves its performance over time. Although the agent's decisions are not always stable, it has reached the goal we set for this status report.
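
For reference, the curves described above can be reproduced from the per-session totals with a short matplotlib script. This is only a sketch; it assumes the per-session rewards have been collected into a list named session_rewards.

import matplotlib.pyplot as plt

# session_rewards: total reward earned in each training session, in order (assumed to exist).
avg_rewards = []
total = 0.0
for i, r in enumerate(session_rewards):
    total += r
    avg_rewards.append(total / (i + 1))                  # total reward so far / number of sessions run

plt.plot(session_rewards, label="reward per session")    # drawn in blue by default
plt.plot(avg_rewards, label="average reward")            # drawn in orange by default
plt.xlabel("session")
plt.ylabel("reward")
plt.legend()
plt.show()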

Remaining Goals and Challenges

Resources Used

Libraries

Documentation

Other Resources