What’s this post about?

This post focuses on implementations of reinforcement learning and imitation learning techniques for classic OpenAI Gym environments such as CartPole-v0, Breakout, MountainCar, and BipedalWalker-v2. I have implemented several RL algorithms, such as DQN and policy gradient, as well as a generative adversarial approach to imitation learning, GAIL. I would like to sincerely thank my colleague Arun Kumar for his constant help and valuable expertise in this work.

Reinforcement learning techniques.

[1] Deep Q-Networks for Breakout-v0: Maximize the score in the Atari 2600 game Breakout. In this environment, the observation is an RGB image of the screen, which is an array of shape (210, 160, 3). Each action is repeatedly performed for a duration of k frames, where k is uniformly sampled from {2, 3, 4}.

Code
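
To make the raw pixel observations tractable, the standard DQN recipe grayscales and downsamples each frame and stacks the last four frames so that ball velocity is observable. Below is a minimal sketch of that preprocessing and the convolutional Q-network; it is written in PyTorch with the usual Nature-DQN layer sizes, which are standard choices rather than values taken from the linked code.

```python
import numpy as np
import torch
import torch.nn as nn

def preprocess(frame):
    """Grayscale, crop, and downsample a (210, 160, 3) Atari frame to 84x84."""
    gray = frame.mean(axis=2)                 # RGB -> grayscale
    gray = gray[34:194:2, ::2]                # crop score area, downsample to 80x80
    out = np.zeros((84, 84), dtype=np.float32)
    out[2:82, 2:82] = gray / 255.0            # pad to 84x84, scale to [0, 1]
    return out

class QNetwork(nn.Module):
    """Conv net mapping a stack of 4 preprocessed frames to one Q-value per action."""
    def __init__(self, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, x):
        return self.net(x)
```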


[2] DQN & Policy Gradient for CartPole-v1: CartPole, also known as an inverted pendulum, is a pendulum with its center of gravity above its pivot point. It is unstable, but can be controlled by moving the pivot point under the center of mass. The goal is to keep the pole balanced by applying appropriate forces to the pivot point. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.

Code
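
For the policy gradient side, CartPole is small enough that vanilla REINFORCE with a two-layer MLP learns a good policy. The sketch below is a generic REINFORCE loop against the classic Gym API (reset returning an observation, step returning a 4-tuple); the network size, learning rate, and episode count are illustrative guesses, not the settings used in the linked code.

```python
import gym
import torch
import torch.nn as nn

# Hypothetical hyperparameters, not taken from the linked code.
GAMMA, LR, EPISODES = 0.99, 1e-2, 500

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=LR)

for episode in range(EPISODES):
    obs, log_probs, rewards = env.reset(), [], []
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, done, _ = env.step(action.item())
        rewards.append(reward)

    # Discounted returns, computed backwards from the end of the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + GAMMA * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # simple variance reduction

    loss = -(torch.stack(log_probs) * returns).sum()  # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```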


[3] Policy Gradient for a custom CartPole model in Gazebo: A policy gradient method based on the actor-critic learning framework, implemented for a custom CartPole model in the Gazebo simulation environment.

Code
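
Setting the Gazebo/ROS plumbing aside, the actor-critic update itself reduces to a TD-error-weighted policy gradient: the critic learns a state-value function, and its one-step TD error serves as the advantage for the actor. Below is a generic sketch of a single update, assuming hypothetical actor and critic networks and optimizers (the names and interface are mine, not from the repository).

```python
import torch

def actor_critic_step(critic, actor_opt, critic_opt,
                      obs, action_log_prob, reward, next_obs, done, gamma=0.99):
    """One-step actor-critic update: the TD error acts as the advantage estimate."""
    obs_t = torch.as_tensor(obs, dtype=torch.float32)
    next_t = torch.as_tensor(next_obs, dtype=torch.float32)

    value = critic(obs_t)
    with torch.no_grad():
        target = reward + gamma * critic(next_t) * (0.0 if done else 1.0)
    td_error = target - value

    # Critic regresses toward the bootstrapped one-step target.
    critic_loss = td_error.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor increases the log-probability of actions with positive TD error.
    actor_loss = (-action_log_prob * td_error.detach()).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```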


[4] DQN & Policy Gradient for MountainCar-v0: A car is on a one-dimensional track, positioned between two "mountains". The goal is to drive up the mountain on the right; however, the car's engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum.

Code
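
The DQN agent relies on two stabilizing tricks: an experience replay buffer and a periodically synced target network. Below is a condensed sketch of the replay-based update for MountainCar's 2-dimensional state and 3 discrete actions, with hypothetical hyperparameters (buffer size, batch size, learning rate) that are not taken from the linked code.

```python
import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn

GAMMA, BATCH_SIZE = 0.99, 64

buffer = deque(maxlen=50_000)   # experience replay buffer of (s, a, r, s', done) tuples
q_net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 3))
target_net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 3))
target_net.load_state_dict(q_net.state_dict())  # start the frozen target as a copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update():
    """Sample a minibatch and minimize the TD error against the frozen target network."""
    if len(buffer) < BATCH_SIZE:
        return
    batch = random.sample(buffer, BATCH_SIZE)
    s, a, r, s2, d = (torch.as_tensor(np.array(x), dtype=torch.float32)
                      for x in zip(*batch))

    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # Q(s, a) for taken actions
    with torch.no_grad():
        target = r + GAMMA * target_net(s2).max(dim=1).values * (1 - d)

    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```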


[5] DQN for LunarLander-v2: The lander is rewarded for moving toward the landing pad and loses that reward again if it moves away. The episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points respectively. There are four discrete actions: do nothing, fire the left orientation engine, fire the main engine, and fire the right orientation engine.

Code
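
During training the agent trades off exploration and exploitation with an epsilon-greedy rule over these four actions, with epsilon annealed toward a small floor. A small sketch follows; the decay schedule numbers are illustrative rather than those used in the linked code.

```python
import random
import torch

# The four discrete LunarLander-v2 actions, indexed as in the Gym docs.
ACTIONS = ["noop", "fire_left", "fire_main", "fire_right"]

def select_action(q_net, obs, epsilon):
    """Epsilon-greedy action selection over the four discrete actions."""
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))            # explore
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(obs, dtype=torch.float32))
    return int(q_values.argmax().item())                 # exploit

def epsilon_schedule(step, start=1.0, end=0.05, decay_steps=50_000):
    """Linear decay from mostly-random to mostly-greedy behaviour (hypothetical numbers)."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```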


Imitation learning techniques.

[1] GAIL for CartPole-v0: A TensorFlow implementation of Generative Adversarial Imitation Learning (GAIL) and Behavioural Cloning (BC) for the classic CartPole-v0 environment from OpenAI Gym. The expert policies are generated using Proximal Policy Optimization (PPO).

  • State space: Continuous
  • Action space: Discrete

Code
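
The heart of GAIL is a discriminator trained to tell expert (state, action) pairs from the policy's own, whose output is then turned into a surrogate reward for the policy update (PPO here). The sketch below shows both pieces for CartPole's 4-dimensional state and one-hot-encoded 2-way action; it is written in PyTorch for consistency with the other sketches even though the linked implementation uses TensorFlow, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

# Discriminator D(s, a): trained to output ~1 on expert pairs, ~0 on policy pairs.
# Input size assumes CartPole: 4-dim state concatenated with a one-hot 2-way action.
disc = nn.Sequential(nn.Linear(4 + 2, 64), nn.Tanh(), nn.Linear(64, 1))
disc_opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

def discriminator_update(expert_sa, policy_sa):
    """One GAIL discriminator step on batches of concatenated (state, one-hot action) pairs."""
    expert_logits = disc(expert_sa)
    policy_logits = disc(policy_sa)
    loss = (bce(expert_logits, torch.ones_like(expert_logits)) +
            bce(policy_logits, torch.zeros_like(policy_logits)))
    disc_opt.zero_grad()
    loss.backward()
    disc_opt.step()

def gail_reward(state_action):
    """Surrogate reward for the PPO generator: high where D judges the pair expert-like."""
    with torch.no_grad():
        d = torch.sigmoid(disc(state_action))
    return -torch.log(1.0 - d + 1e-8)
```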


[2] GAIL for BipedalWalker-v2: A PyTorch implementation of Generative Adversarial Imitation Learning (GAIL) for the BipedalWalker-v2 environment from OpenAI Gym. The expert policies are generated using Proximal Policy Optimization (PPO).

  • State space: (Continuous) (1) hull angle, (2) angular velocity, (3) horizontal speed, (4) vertical speed, (5) position of joints, (6) joint angular speeds, (7) legs contact with ground, and (8) lidar rangefinder measurements
  • Action space: (Continuous) joint motor torques

Code
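
Because the action space is continuous, both the PPO expert and the GAIL generator need a stochastic policy over real-valued torques; a diagonal Gaussian whose mean comes from an MLP is the usual choice. Below is a minimal sketch for BipedalWalker's 24-dimensional observation and 4 joint torques in [-1, 1]; the architecture is a common default, not necessarily the one used in the linked code.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian policy over BipedalWalker's 4 joint torques (24-dim observation)."""
    def __init__(self, obs_dim=24, act_dim=4):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent log std

    def forward(self, obs):
        mean = self.mean_net(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())

    def act(self, obs):
        dist = self(torch.as_tensor(obs, dtype=torch.float32))
        action = dist.sample()
        # Sum per-dimension log-probs; clamp torques into the env's [-1, 1] range.
        return action.clamp(-1.0, 1.0).numpy(), dist.log_prob(action).sum()
```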