{"id":11,"date":"2022-02-19T19:10:53","date_gmt":"2022-02-19T19:10:53","guid":{"rendered":"https:\/\/sonic.fabio.org.uk\/?p=11"},"modified":"2022-02-19T19:18:54","modified_gmt":"2022-02-19T19:18:54","slug":"pca-in-r","status":"publish","type":"post","link":"https:\/\/sonic.fabio.org.uk\/?p=11","title":{"rendered":"PCA in R"},"content":{"rendered":"\n<h5 class=\"wp-block-heading\" id=\"a-data-analytics-using-r-topic\">A Data Analytics using R Topic<\/h5>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-is-pca\">What is PCA?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"pca-for-dimensionality-reduction\">PCA for Dimensionality Reduction<\/h3>\n\n\n\n<p>Large datasets have many columns and variables. Having many variables (called features) makes the data high dimensional. Imagine a dataset with 100 features. To represent the data points on a graph, we would need 100 axes, one for each feature. <\/p>\n\n\n\n<p>Principal Component Analysis (or PCA) is one method of identifying the most important axes with the most variance. Transforming your data to the new axes, called Principal Components, allows you to see the data from an improved perspective. Plotting data using the Principal Components instead of the original features can make clusters and patterns more apparent. Not only this, PCA is a simple way to discard the less important features to reduce the dimensionality.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"problems-at-high-dimensions\">Problems at High Dimensions<\/h3>\n\n\n\n<p>Say you wanted to train a statistical learning algorithm to distinguish between groups or classes in the data. For this learner to perform well it needs a large sample set. Feeding the algorithm with a representative training sample will make a learner better at predicting classes. Gathering these examples in high dimensional space poses problems. 
This is because the volume of a high dimensional space grows rapidly with each added dimension, so a representative sample must cover far more space.<\/p>\n\n\n\n<p>There are more problems than just gathering examples. As the number of dimensions increases, the Euclidean distance between points increases. Consequently, the data becomes more sparse and dissimilar, making it more difficult for our learner to group points together by class.<\/p>\n\n\n\n<p>This difficulty that high dimensional datasets pose for machine learning algorithms is known as the <strong>Curse of Dimensionality<\/strong>.<\/p>\n\n\n\n<p>Principal Component Analysis (or PCA) can help. Let&#8217;s take a look at how.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"how-to-do-pca-in-r\">How to do PCA in R<\/h2>\n\n\n\n<p>I generated simulated data with 30 observations for each of three classes. Each observation has 100 different variables. To make the classes distinct in some way, each class cluster has a different mean. Let&#8217;s visualise a snapshot in just two dimensions using the first two columns.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"\" data-line=\"\"># obs is the data matrix with 100 variables; plot its first two columns\nplot(obs[, 1:2], col = c(rep(&quot;black&quot;, 30), rep(&quot;blue&quot;, 30), rep(&quot;red&quot;, 30)), pch = 19)<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"645\" height=\"398\" src=\"https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2022\/02\/00000b.png\" alt=\"\" class=\"wp-image-18\" srcset=\"https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2022\/02\/00000b.png 645w, https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2022\/02\/00000b-300x185.png 300w\" sizes=\"(max-width: 645px) 100vw, 645px\" \/><\/figure>\n\n\n\n<p>You can see there are three colours representing the three different classes. However, the groups overlap quite a bit. Let&#8217;s apply PCA to make the clusters more apparent. To do this we use the <em>prcomp<\/em> function in R. 
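<\/p>\n\n\n\n<p>If you want to follow along, data of this shape can be simulated roughly as below. This is only a sketch with a seed and class means of my choosing, not necessarily the exact code used to generate <em>obs<\/em>.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"\" data-line=\"\"># A sketch: 30 observations per class, 100 variables each,\n# with assumed class means of 0, 1 and 2\nset.seed(1)\nobs &lt;- rbind(matrix(rnorm(30 * 100, mean = 0), nrow = 30),\n             matrix(rnorm(30 * 100, mean = 1), nrow = 30),\n             matrix(rnorm(30 * 100, mean = 2), nrow = 30))<\/code><\/pre>\n\n\n\n<p>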
Let&#8217;s plot the first and second Principal Components now.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"\" data-line=\"\">pr.out &lt;- prcomp(obs)\n\n# the transformed data is in pr.out$x; plot the first two components\nplot(pr.out$x[, 1:2], col = c(rep(&quot;black&quot;, 30), rep(&quot;blue&quot;, 30), rep(&quot;red&quot;, 30)), pch = 19)<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"645\" height=\"398\" src=\"https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2022\/02\/000007.png\" alt=\"\" class=\"wp-image-19\" srcset=\"https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2022\/02\/000007.png 645w, https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2022\/02\/000007-300x185.png 300w\" sizes=\"(max-width: 645px) 100vw, 645px\" \/><figcaption>The data plotted using Principal Components (PC) 1 and 2<\/figcaption><\/figure>\n\n\n\n<p>The three classes become much more apparent, giving a clearer visualisation of the data. A clustering algorithm such as K-means would also train more effectively on these transformed coordinates.<\/p>\n\n\n\n<p>You can see that the classes are spread out mainly along PC1. 
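<\/p>\n\n\n\n<p>A quick way to quantify how much variance each component carries is the <em>summary<\/em> method for <em>prcomp<\/em> objects, which tabulates the standard deviation, proportion of variance and cumulative proportion of every component (values elided here):<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"\" data-line=\"\">summary(pr.out)\n# Importance of components:\n#                          PC1    PC2    PC3 ...\n# Standard deviation       ...\n# Proportion of Variance   ...\n# Cumulative Proportion    ...<\/code><\/pre>\n\n\n\n<p>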
We can plot the proportion of the total variance that each PC accounts for.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"\" data-line=\"\">pr.sd &lt;- pr.out$sdev # standard deviation of each PC\n\npr.var &lt;- pr.sd ^ 2 # variance of each PC\n\npve &lt;- pr.var \/ sum(pr.var) # proportion of variance explained\n\nplot(pve[1:20],\n     xlab = &#039;Principal Component&#039;,\n     ylab = &#039;PVE&#039;,\n     type = &#039;b&#039;,\n     col = &#039;blue&#039;)<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"646\" height=\"399\" src=\"https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2022\/02\/000023.png\" alt=\"\" class=\"wp-image-20\" srcset=\"https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2022\/02\/000023.png 646w, https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2022\/02\/000023-300x185.png 300w\" sizes=\"(max-width: 646px) 100vw, 646px\" \/><figcaption>The data varies along PC1 much more than any other component<\/figcaption><\/figure>\n\n\n\n<p>Principal Component 1 always has the greatest proportion of variance explained (PVE), followed by PC2, then PC3, and so on. In my data, 40% of the variance lies along PC1 alone, and the PVE falls away from there. The PC1 axis is the most important because the data is separated most along it, so PC1 will be the most valuable variable for a classifier to consider.<\/p>\n\n\n\n<p>So for this dataset, how much can we reduce the dimensionality by? 
Let&#8217;s plot the cumulative proportion of variance explained against the number of PCs considered with <em>cumsum<\/em>.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"\" data-line=\"\">cumsum(pve) # cumulative proportion of variance explained\n\nplot(cumsum(pve),\n     xlab = &#039;PC&#039;,\n     ylab = &#039;CPVE&#039;,\n     type = &#039;b&#039;,\n     col = &#039;red&#039;)<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"646\" height=\"399\" src=\"https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2022\/02\/000029.png\" alt=\"\" class=\"wp-image-24\" srcset=\"https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2022\/02\/000029.png 646w, https:\/\/sonic.fabio.org.uk\/wp-content\/uploads\/2022\/02\/000029-300x185.png 300w\" sizes=\"(max-width: 646px) 100vw, 646px\" \/><figcaption>Cumulative Proportion of Variance Explained with the number of Principal Components<\/figcaption><\/figure>\n\n\n\n<p>With just 50 PCs, we can halve the dimensionality of the dataset while retaining over 90% of the variance of the data. Our classifier algorithm can learn to distinguish classes much more efficiently with fewer dimensions.<\/p>\n\n\n\n<p>In summary then, PCA can reduce the dimensions of the dataset while keeping the most important information in the data.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A Data Analytics using R Topic What is PCA? PCA for Dimensionality Reduction Large datasets have many columns and variables. Having many variables (called features) makes the data high dimensional. Imagine a dataset with 100 features. To represent the data points on a graph, we would need 100 axes, one for each feature. 
Principal Component [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[7],"tags":[5,6,4,3],"_links":{"self":[{"href":"https:\/\/sonic.fabio.org.uk\/index.php?rest_route=\/wp\/v2\/posts\/11"}],"collection":[{"href":"https:\/\/sonic.fabio.org.uk\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sonic.fabio.org.uk\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sonic.fabio.org.uk\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sonic.fabio.org.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=11"}],"version-history":[{"count":8,"href":"https:\/\/sonic.fabio.org.uk\/index.php?rest_route=\/wp\/v2\/posts\/11\/revisions"}],"predecessor-version":[{"id":25,"href":"https:\/\/sonic.fabio.org.uk\/index.php?rest_route=\/wp\/v2\/posts\/11\/revisions\/25"}],"wp:attachment":[{"href":"https:\/\/sonic.fabio.org.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=11"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sonic.fabio.org.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=11"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sonic.fabio.org.uk\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=11"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}