{"id":140,"date":"2023-11-18T23:38:33","date_gmt":"2023-11-18T23:38:33","guid":{"rendered":"https:\/\/sonic.fabio.org.uk\/?p=140"},"modified":"2023-11-18T23:46:42","modified_gmt":"2023-11-18T23:46:42","slug":"quick-start-reinforcement-racer","status":"publish","type":"post","link":"https:\/\/sonic.fabio.org.uk\/?p=140","title":{"rendered":"Quick start: Reinforcement Racer"},"content":{"rendered":"\n<p>Reinforcement learning is the often forgotten sibling of supervised and unsupervised machine learning. But now, with its important role in generative AI learning, I wanted to touch on the subject with a fun introductory example. Fun projects like these can always teach you a lot.<\/p>\n\n\n\n<p>I was inspired by YouTuber Nick Renotte (video: <a href=\"https:\/\/www.youtube.com\/watch?v=Mut_u40Sqz4\">https:\/\/www.youtube.com\/watch?v=Mut_u40Sqz4<\/a>) and retried his Project 2 from the two-year-old tutorial. In this tutorial, we teach a model to race a car. <\/p>\n\n\n\n<p>F1 drivers learn from experience, and so does a reinforcement learning agent. Let&#8217;s look at teaching such an agent to race through training and making it learn from repeated experience. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\">A quick recap of Reinforcement Learning (RL)<\/h2>\n\n\n\n<p>The docs of RL library, called &#8220;Stable Baselines3&#8221;,  states the following,<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\">\n<p>&#8220;Reinforcement Learning differs from other machine learning methods in several ways. The data used to train the agent is collected through interactions with the environment by the agent itself (compared to supervised learning where you have a fixed dataset for instance).&#8221;<\/p>\n<cite>https:\/\/stable-baselines3.readthedocs.io\/en\/master\/guide\/rl_tips.html<\/cite><\/blockquote>\n\n\n\n<p>So in RL, there is the concept of an agent\/model interacting with its surroundings and learning through rewards. 
Actions that lead to achieving the goal are rewarded and so are reinforced as &#8220;good&#8221; actions.<\/p>\n\n\n\n<p>For the environment, i.e. our race track, we can use the ready-made &#8220;CarRacing&#8221; environment from the &#8220;Gymnasium&#8221; project &#8211; the actively developed fork of OpenAI&#8217;s &#8220;Gym&#8221;. Gymnasium helps you get started quickly with reinforcement learning by providing many pre-made environments.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2023\/10\/image.png\" alt=\"\" class=\"wp-image-141\" width=\"841\" height=\"562\" srcset=\"https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2023\/10\/image.png 600w, https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2023\/10\/image-300x200.png 300w\" sizes=\"(max-width: 841px) 100vw, 841px\" \/><figcaption class=\"wp-element-caption\">CarRacing environment. The agent controls a car.<\/figcaption><\/figure>\n\n\n\n<p>In CarRacing, the reward is negative (-0.1) for every frame and positive (+1000\/N) for every track tile visited, where N is the total number of tiles visited in the track. The negative reward penalises staying still, encouraging the agent to reach the finish as quickly as possible, while the positive reward encourages the car to keep visiting new parts of the track. <\/p>\n\n\n\n<p>For our agent, I will use the PPO algorithm from the Stable Baselines3 library, which ships with a general but good default choice of hyperparameters for its models.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">PPO: Proximal Policy Optimization<\/h2>\n\n\n\n<p>PPO is an algorithm for training RL agents. 
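<\/p>\n\n\n\n<p>In Stable Baselines3 it is available as the PPO class, and training one directly takes only a few lines (a sketch, assuming stable-baselines3 and gymnasium[box2d] are installed; the small n_steps\/batch_size values are only there to keep the demo quick &#8211; the defaults are larger):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\">from stable_baselines3 import PPO\n\n# 'CnnPolicy' because CarRacing observations are raw pixel frames\nmodel = PPO('CnnPolicy', 'CarRacing-v2', n_steps=64, batch_size=64, verbose=0)\nmodel.learn(total_timesteps=128)  # tiny budget, just to exercise the API\n<\/code><\/pre>\n\n\n\n<p>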
It has notably been used in the fine-tuning of large language models via reinforcement learning from human feedback (RLHF: <a href=\"https:\/\/en.wikipedia.org\/wiki\/Reinforcement_learning_from_human_feedback\">https:\/\/en.wikipedia.org\/wiki\/Reinforcement_learning_from_human_feedback<\/a>).<\/p>\n\n\n\n<p>In PPO, the agent starts by selecting random actions, which become less random over time as it learns. PPO is conservative in its learning process and makes sure not to change its policy (essentially its brain) too much in each update. Constraining updates in this way leads to consistent performance, stability and speedy learning. However, it may also cause the policy to get stuck in a sub-optimal strategy. PPO seems to work well for this task in a short amount of time without requiring huge memory resources. It&#8217;s great for getting started.<\/p>\n\n\n\n<p>Read more about it: <a href=\"https:\/\/spinningup.openai.com\/en\/latest\/algorithms\/ppo.html\">https:\/\/spinningup.openai.com\/en\/latest\/algorithms\/ppo.html<\/a><\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2023\/10\/image-1-1024x666.png\" alt=\"\" class=\"wp-image-142\" width=\"840\" height=\"545\" srcset=\"https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2023\/10\/image-1-1024x666.png 1024w, https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2023\/10\/image-1-300x195.png 300w, https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2023\/10\/image-1-768x499.png 768w, https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2023\/10\/image-1.png 1254w\" sizes=\"(max-width: 840px) 100vw, 840px\" \/><figcaption class=\"wp-element-caption\">Stable baselines3 guide to choosing agents <a href=\"https:\/\/stable-baselines3.readthedocs.io\/en\/master\/guide\/rl_tips.html\" data-type=\"URL\" 
data-id=\"https:\/\/stable-baselines3.readthedocs.io\/en\/master\/guide\/rl_tips.html\">https:\/\/stable-baselines3.readthedocs.io\/en\/master\/guide\/rl_tips.html<\/a><\/figcaption><\/figure>\n\n\n\n<p>Whereas Nick used the Stable Baselines3 library directly, I will use its &#8220;RL Zoo&#8221; &#8211; a framework for using Stable Baselines3 in an arguably quicker and more efficient way &#8211; all on the command line. RL Zoo docs: https:\/\/stable-baselines3.readthedocs.io\/en\/master\/guide\/rl_zoo.html<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Training with RL Zoo &#8211; Reinforcement boogaloo<\/h2>\n\n\n\n<p>I found an easy way to get our PPO agent training (see: https:\/\/github.com\/DLR-RM\/rl-baselines3-zoo). I followed the installation from source (this requires git). I also recommend creating a virtual environment for all the dependencies. <\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\">git clone https:\/\/github.com\/DLR-RM\/rl-baselines3-zoo<\/code><\/pre>\n\n\n\n<p>My command-line instructions are aimed at Linux\/macOS, so you may need to translate some of them for your system.<\/p>\n\n\n\n<p>Make sure you have all the dependencies installed!<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\">pip install -r requirements.txt\npip install -e .<\/code><\/pre>\n\n\n\n<p>Now let&#8217;s start training. Specify the algorithm (PPO), the environment (CarRacing-v2) and the number of time steps (try different numbers!). For example, you can try 500000 time steps.<\/p>\n\n\n\n<p><code class=\"\" data-line=\"\">python -m rl_zoo3.train --algo ppo --env CarRacing-v2 --n-timesteps 500000<\/code><\/p>\n\n\n\n<p>This may take a while.  
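<\/p>\n\n\n\n<p>If the command fails straight away, it is worth checking that the CarRacing environment itself builds &#8211; it needs the Box2D extra (pip install 'gymnasium[box2d]', which in turn needs swig). A quick sanity check (a sketch, assuming gymnasium[box2d] is installed):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\">import gymnasium as gym\n\n# raises at make() time if Box2D\/swig are missing\nenv = gym.make('CarRacing-v2')\nobs, info = env.reset(seed=0)\nprint(obs.shape)  # observations are 96x96 RGB frames\nenv.close()<\/code><\/pre>\n\n\n\n<p>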
If you&#8217;re on Linux and get the error &#8220;legacy-install-failure&#8221;, you may need to install a few more packages.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\">sudo dnf install python3-devel\nsudo dnf install swig\nsudo dnf groupinstall &quot;Development Tools&quot;<\/code><\/pre>\n\n\n\n<p>Once training is complete, you can visualise the learning process by plotting the reward over time\/training episodes.<\/p>\n\n\n\n<p><code class=\"\" data-line=\"\">python scripts\/plot_train.py -a ppo -e CarRacing-v2 -y reward -f logs\/ -x steps<\/code><\/p>\n\n\n\n<p>A reward that increases with episodes indicates that the agent is learning.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"640\" height=\"480\" src=\"https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2023\/10\/Training_Episodic_Reward2.png\" alt=\"\" class=\"wp-image-144\" srcset=\"https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2023\/10\/Training_Episodic_Reward2.png 640w, https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2023\/10\/Training_Episodic_Reward2-300x225.png 300w\" sizes=\"(max-width: 640px) 100vw, 640px\" \/><\/figure>\n\n\n\n<p>Record a video of your latest agent with<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\">python -m rl_zoo3.record_video --algo ppo --env CarRacing-v2 -n 1000 --folder logs\/<\/code><\/pre>\n\n\n\n<p>I tested the agent after 4,000,000 time steps of training (~2hr for me) and got the episode below &#8211; what a good run!<\/p>\n\n\n\n<figure class=\"wp-block-video\"><video controls src=\"https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2023\/10\/final-model-ppo-CarRacing-v2-step-0-to-step-1000-online-video-cutter.com_.mp4\"><\/video><\/figure>\n\n\n\n<p>RL Zoo also allows you to continue training from a saved model if you want to improve it.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\">python train.py --algo ppo --env CarRacing-v2 -i 
logs\/ppo\/CarRacing-v2_1\/CarRacing-v2.zip -n 50000<\/code><\/pre>\n\n\n\n<p>Thanks for reading!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Reinforcement learning is the often forgotten sibling of supervised and unsupervised machine learning. But now, with its important role in generative AI learning, I wanted to touch on the subject with a fun introductory example. Fun projects like these can always teach you a lot. I was inspired by YouTuber Nick Renotte (video: https:\/\/www.youtube.com\/watch?v=Mut_u40Sqz4) and [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[8,12],"tags":[21,11,22],"_links":{"self":[{"href":"https:\/\/sonic.fabio.org.uk\/index.php?rest_route=\/wp\/v2\/posts\/140"}],"collection":[{"href":"https:\/\/sonic.fabio.org.uk\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sonic.fabio.org.uk\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sonic.fabio.org.uk\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sonic.fabio.org.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=140"}],"version-history":[{"count":12,"href":"https:\/\/sonic.fabio.org.uk\/index.php?rest_route=\/wp\/v2\/posts\/140\/revisions"}],"predecessor-version":[{"id":157,"href":"https:\/\/sonic.fabio.org.uk\/index.php?rest_route=\/wp\/v2\/posts\/140\/revisions\/157"}],"wp:attachment":[{"href":"https:\/\/sonic.fabio.org.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=140"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sonic.fabio.org.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=140"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sonic.fabio.org.uk\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=140"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}