How I Explained Reinforcement Learning to My Clients

Johnny Chan

--

Photo by Crissy Jarvis on Unsplash

Reinforcement Learning (RL) has been a popular topic ever since DeepMind showcased the capabilities of AlphaGo. Personally, I am a big fan of the company, and what they do inspires me to work in this field. True, in my day-to-day work I don’t whip up AlphaGo-level models, but I often find myself explaining RL to clients.

Imagine giving someone a lengthy Wikipedia page when they ask, “What’s RL?” Spoiler alert: it doesn’t work. Most clients might have heard phrases like “AI,” “ML,” and the various “Alphas” out there, but these buzzwords don’t shed much light on the subject.

There is a simple example that trains a taxi agent on a simplified map using OpenAI Gym. It demonstrates how Q-learning works, but for an outsider, the concept is still hard to grasp. You can check out the full code example in the following link.
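If you want a taste of that environment anyway, spinning it up takes only a couple of lines. A minimal sketch using the classic gym package (Taxi-v3 is the environment id; newer gym releases also return an extra info value from reset):

import gym

# Taxi-v3: a small grid world with 500 states and 6 actions
# (four moves, pick up passenger, drop off passenger).
env = gym.make('Taxi-v3')
state = env.reset()   # on newer gym versions: state, info = env.reset()
env.render()          # prints an ASCII picture of the taxi map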

So I decided to go with something simple. How about we teach a machine to do addition?

“Addition? Seriously?” Yes, hear me out.

Breaking Down Reinforcement Learning

Think of Reinforcement Learning as teaching a computer through trial and error, much like teaching a dog new tricks. The dog (our machine) performs an action, and based on the outcome, gets a reward (or a gentle scolding).

Consider this scenario: A major energy company client of mine operates a vast number of wells with a limited team. They wanted to use ML to auto-tune their operations. Interestingly, when asked how they decide on specific parameters, the engineers gave varied answers; they relied heavily on experience. This is a classic case for RL, as there isn’t always a single right answer.

Now, let’s dive into our example. If you’ve got a pair of numbers, let’s say 2 and 3, you know the answer is 5. But how would a machine figure this out through RL? Let’s see.

This is a class I wrote to create the arithmetic environment. All you need is the numpy library. I won’t dwell on the details here; the code shows how the Q-learning environment is constructed.

import numpy as np


class Arithmetic():

    def __init__(self):
        self.done = False
        self.mapp = None               # state id -> (a, b) pair
        self.observation_space = None  # list of state ids
        self.action_space = None       # list of candidate answers (possible sums)
        self.Q = None                  # Q-table: one row per state, one column per answer
        self.score = None              # latest exam score
        self.prob = None               # sampling probability of each state

    def make_addition(self, a_range, b_range):
        # Enumerate every (a, b) pair as a distinct state.
        mapp = {}
        state = 0
        for ai in a_range:
            for bi in b_range:
                mapp[state] = (ai, bi)
                state += 1
        observation_space = list(mapp.keys())
        # The actions are all the sums the machine is allowed to answer with.
        action_space = list(range(min(a_range) + min(b_range),
                                  max(a_range) + max(b_range) + 1))
        # Start from pure noise: every answer looks equally (im)plausible.
        Q = np.random.rand(len(observation_space), len(action_space))
        self.observation_space = observation_space
        self.action_space = action_space
        self.mapp = mapp
        self.Q = Q

    def map_state2action(self, state):
        # The correct answer for a given state.
        (a, b) = self.mapp[state]
        return a + b

    def map_ab2state(self, a, b):
        # Look up the state id of a given (a, b) pair.
        return list(self.mapp.keys())[list(self.mapp.values()).index((a, b))]

    def exam(self):
        # Grade the machine on every state, and make the states it already
        # answers correctly less likely to be asked again.
        checker = 0
        for state in range(0, self.Q.shape[0]):
            (a, b) = self.mapp[state]
            answer = a + b
            # Map the winning column index back to the sum it represents.
            if answer == self.action_space[np.argmax(self.Q[state])]:
                checker += 1
                self.prob[state] = max(self.prob) / 10
        self.prob = self.prob / sum(self.prob)
        score = checker / self.Q.shape[0]
        self.score = score
        return score

    def quiz(self, a, b):
        # Ask the machine for its current best answer to a + b.
        state = self.map_ab2state(a, b)
        return self.action_space[np.argmax(self.Q[state])]

    def reset(self):
        # Start with every question equally likely, and draw the first one.
        prob = np.ones(len(self.observation_space)) / len(self.observation_space)
        state = np.random.choice(a=self.observation_space, p=prob)
        self.prob = prob
        return state

    def next_state(self):
        # Draw the next question, favouring states not yet mastered.
        state = np.random.choice(a=self.observation_space, p=self.prob)
        return state

    def step(self, state, action, counter):
        # Reward a correct answer, penalise a wrong one.
        (a, b) = self.mapp[state]
        if a + b == self.action_space[action]:
            reward = 3
        else:
            reward = -1
        state = self.next_state()
        # Every 5000 steps, sit a full exam; stop once accuracy reaches 98%.
        if counter % 5000 == 0 and self.exam() >= 0.98:
            done = True
        else:
            done = False
        info = None
        return state, reward, done, info

Initialising the environment:

env = Arithmetic()
env.make_addition(range(0,31),range(0,31))

This creates a training space where our machine can practice adding any two numbers from 0 to 30. You can widen the ranges, but that will increase the training time.
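Before any training, it’s worth a quick peek at what we just built (an illustrative check using the class above):

print(len(env.observation_space))                 # 961 states: one per (a, b) pair
print(env.action_space[0], env.action_space[-1])  # candidate answers run from 0 to 60
print(env.Q.shape)                                # (961, 61): a score for every state-answer pair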

Let’s quiz our machine before any training:

print('What’s 5 + 10?')
print('Machine: ', env.quiz(5,10))

Most likely, it’ll guess wrong since it hasn’t learned anything yet.

Now, for the fun part: training!

During training, the machine makes a guess. If it’s right, it gets positive feedback (a reward). If it’s wrong, it gets a penalty. Over time, it learns from this feedback loop:
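The loop below needs a starting state, a step counter, and two hyperparameters. Here’s a minimal setup, assuming a learning rate (alpha) of 0.1 and a discount factor (gamma) of 0.6 as reasonable starting values:

alpha, gamma = 0.1, 0.6   # learning rate and discount factor (assumed values)
state = env.reset()       # draw the first (a, b) question
done, counter = False, 0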

while not done:
    action = np.argmax(env.Q[state])  # column index of the current best guess
    state2, reward, done, info = env.step(state, action, counter)
    # Q-learning update: nudge the score towards reward plus discounted future value.
    env.Q[state, action] += alpha * (reward + gamma * np.max(env.Q[state2]) - env.Q[state, action])
    state = state2
    counter += 1
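To see what a single update does, suppose Q[state, action] is currently 0.4, the guess was right (reward 3), and the best score in the next state’s row is 0.9. With alpha = 0.1 and gamma = 0.6, the new value is 0.4 + 0.1 * (3 + 0.6 * 0.9 - 0.4) = 0.714. Rewarded answers creep upward, penalised ones drift down, and the argmax eventually settles on the right sum for every state.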

After a good amount of practice (iterations), you can quiz the machine again:

print('Okay, smarty pants. What’s 15 + 7?')
print('Machine: ', env.quiz(15,7))

This time, it should get it right!
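A single question is just a spot check; the exam method defined in the class grades the machine across every state. A one-line usage sketch:

print('Accuracy over all 961 states:', env.exam())   # roughly 0.98 or higher after training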

How the Machine Learns

You can find the complete script here. After you run python addition_RL.py in your terminal, the prompt will ask you to give the machine a quiz.

No surprise, the machine got them all wrong! Let’s quit the quiz and kick off the training.

After it is trained, you can quiz it again.

Here you go. It got the quiz correct!

Wrapping Up

The beauty of this example is its simplicity. It strips away the intimidating jargon and complex mechanics of RL and boils it down to the essence: learning from feedback.

If you want to dive deeper into RL or AI, there are tons of resources. But if you just wanted to grasp the basic idea of RL, I hope this kindergarten math lesson hit the mark! Remember, every sophisticated idea can be broken down into simpler terms. Sometimes, it’s all about 1+2.

--


Written by Johnny Chan

Co-founder of Hazl AI -- a platform for your one-stop AI and cloud services. Visit us at hazl.ca
