SQLite in Python

Getting started

SQLite is a widely used SQL database engine. In this blog, I will go over how to connect to and query a SQLite database in Python.

Let’s import the necessary libraries for this example, and establish a connection to a database. We will be using the sqlite3 library and pandas.

import sqlite3
import pandas as pd
conn = sqlite3.connect('supermarket.db') # connect to a supermarket database

To execute SQL commands, we need to create a cursor object. Let’s use the cursor to create a table called shops, which will contain basic information about our supermarket stores (shop ID, size, postcode). The store sizes are small, medium and large.

cursor = conn.cursor() # create a cursor
cursor.execute("CREATE TABLE shops(shopid VARCHAR(100) PRIMARY KEY, Store_size VARCHAR(30), Postcode VARCHAR(8))")

So cursor.execute() is how we run SQL commands against the database.

We can insert data into the table using the INSERT command. If we do this, we need to call conn.commit() afterwards to save the changes to the database.
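For example, we could insert a single row with a parameterised query and then commit (the values here are purely hypothetical):

cursor.execute("INSERT INTO shops VALUES (?, ?, ?)", ('shop_001', 'medium', 'AB1 2CD')) # hypothetical row
conn.commit() # save the change to the database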

However, in this example, we have a pandas DataFrame called shops_df that we wish to insert into our SQL table shops. It has the same columns as the shops table. If, like me, you are more familiar with pandas than SQL, you may prefer to clean the data with pandas first and only store it in an SQL database afterwards.
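For illustration, a hypothetical shops_df with matching columns might look like this:

shops_df = pd.DataFrame({
    'shopid': ['shop_001', 'shop_002'],  # hypothetical example rows
    'Store_size': ['small', 'large'],
    'Postcode': ['AB1 2CD', 'EF3 4GH']
})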

The to_sql() function allows us to write records in a DataFrame to an SQL database.

shops_df.to_sql('shops', conn, if_exists='append', index=False) # write shops_df to shops table

Queries

The cursor behaves like an iterator object, so we can use a for loop to retrieve all the rows returned by a SELECT statement. To make querying easier in future, I’ll create a query function.

def query(q):
    try:
        for row in cursor.execute(q):
            print(row)
    except Exception as e:
        print("Something went wrong... ", e)

query('SELECT * FROM shops')

We can also save the results of a query by assigning them to a variable. The variable will be a list of tuples, with one tuple for each row.

q="SELECT shopid, Postcode FROM shops"
query_results = cursor.execute(q).fetchall() # save results as a list of tuples
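Each tuple can then be unpacked into its columns, for example:

for shopid, postcode in query_results: # one tuple per row
    print(shopid, postcode)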

Storing results from a query like this allows you to represent table data with graphs. For example, if your database had a table called sales containing the shop profits, you could plot trends in sales. I will illustrate this now.

Above: A preview of the sales table which is in our SQLite database already
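For reference, here is a minimal sketch of how such a sales table could be defined, assuming only the columns used in the query below (shopid, Year and Sales):

cursor.execute("CREATE TABLE sales(shopid VARCHAR(100), Year INTEGER, Sales REAL)") # assumed schema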

Let’s query the minimum, mean and maximum yearly sales for each store size…

c="""
SELECT shops.Store_size, sales.Year, MIN(sales.Sales), AVG(sales.Sales), MAX(sales.Sales)
FROM shops
JOIN sales
  ON shops.shopid = sales.shopid
GROUP BY shops.Store_size, sales.Year 
"""
query(c)
result = cursor.execute(c).fetchall()
Above: the results from our query

…next, I plot the mean sales for medium-sized stores against the year…

import matplotlib.pyplot as plt # for plotting

years = [] # store the year
means = [] # store the mean sales
size = 'medium'
for row in result:
    if row[0] == size:       # Store_size matches
        years.append(row[1]) # Year
        means.append(row[3]) # AVG(Sales)

plt.plot(years, means, linestyle='dashed', marker='o', ms=5)
plt.title(f"{size} store mean sales")
plt.xlabel("year")
plt.ylabel("sales")
plt.savefig(f"{size}_store_mean_sales.png")
plt.show()
Above: the mean sales for stores that are medium sized

Now that we are done, we close the connection to the database.

conn.close() # don't forget to close the connection afterwards

I hope this blog was helpful for getting started with the sqlite3 library: connecting to a database and using a cursor to execute SQL commands in Python.

PCA in R

A Data Analytics using R Topic

What is PCA?

PCA for Dimensionality Reduction

Large datasets have many columns, or variables. Having many variables (also called features) makes the data high dimensional. Imagine a dataset with 100 features: to represent the data points on a graph, we would need 100 axes, one for each feature.

Principal Component Analysis (or PCA) is one method of identifying the axes along which the data varies the most. Transforming your data onto these new axes, called Principal Components, allows you to see the data from an improved perspective. Plotting data using the Principal Components instead of the original features can make clusters and patterns more apparent. Not only this, PCA is a simple way to discard the less important features and reduce the dimensionality.

Problems at High Dimensions

Say you wanted to train a statistical learning algorithm to distinguish between groups or classes in the data. For this learner to perform well, it needs a large, representative training sample. Gathering enough examples in a high dimensional space poses problems, because high dimensional spaces are vast and the amount of data needed to cover them grows rapidly with each added dimension.

There are more problems than just gathering examples. As the number of dimensions increases, the Euclidean distance between points increases. Consequently, the data becomes more sparse and dissimilar, making it more difficult for our learner to group points together by class.
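A quick sketch in R illustrates this: the average pairwise distance between uniformly random points grows as we add dimensions.

# average pairwise Euclidean distance between 100 random points, at increasing dimensions
for (d in c(2, 10, 100)) {
  x <- matrix(runif(100 * d), ncol = d)
  cat(d, "dimensions:", mean(dist(x)), "\n")
}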

This collection of issues affecting machine learning algorithms on high dimensional datasets is known as the Curse of Dimensionality.

Principal Component Analysis (or PCA) can help. Let’s take a look at how.

How to do PCA in R

I generated simulated data with 30 observations for each of three classes. Each observation has 100 different variables. To make the classes distinct in some way, each class cluster has a different mean. Let’s visualise a snapshot in just two dimensions using the first two columns.
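As a rough sketch, data of this shape could be simulated as follows (the seed and mean shifts here are assumptions for illustration, not necessarily the values I used):

set.seed(1)
# 30 observations per class, 100 variables, each class shifted to a different mean
obs <- rbind(matrix(rnorm(30 * 100, mean = 0), nrow = 30),
             matrix(rnorm(30 * 100, mean = 1), nrow = 30),
             matrix(rnorm(30 * 100, mean = 2), nrow = 30))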

# obs is the data with 100 variables; plot only the first two columns
plot(obs[, 1:2], col = c(rep("black", 30), rep("blue", 30), rep("red", 30)), pch = 19)

You can see there are three colours representing the three different classes. However, the groups overlap quite a bit. Let’s apply PCA to make the clusters more apparent. To do this we use the prcomp function in R. Let’s plot the first and second Principal Components now.

pr.out <- prcomp(obs) # run PCA

plot(pr.out$x[, 1:2], col = c(rep("black", 30), rep("blue", 30), rep("red", 30)), pch = 19)
The data plotted using Principal Components (PC) 1 and 2

The three classes become much more apparent. This is an improved visualisation of the data. The training process of a clustering algorithm like K-means will benefit from PCA.
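As a sketch, a clustering algorithm such as K-means could be run on just the first two PC scores rather than all 100 original variables (k = 3 here because we know there are three classes):

km <- kmeans(pr.out$x[, 1:2], centers = 3, nstart = 20) # cluster on PC1 and PC2 only
plot(pr.out$x[, 1:2], col = km$cluster, pch = 19) # colour points by assigned cluster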

You can see that the classes are spread along PC1. We can plot the proportion of the total variance that each PC accounts for.

pr.sd <- pr.out$sdev # standard deviations
pr.var <- pr.sd ^ 2 # variance
pve <- pr.var / sum(pr.var) # proportion of variance explained

plot(pve[1:20],
     xlab = 'Principal Component',
     ylab = 'PVE',
     type = 'b',
     col = 'blue')
The data varies along PC1 much more than any other component

Principal Component 1 always has the greatest proportion of variance explained (PVE), followed by PC2, then PC3 and so on. In my data, 40% of the variance lies along PC1 alone, and the PVE decreases from there. We can see that the PC1 axis is the most important because the data is separated most along it. Therefore, PC1 will be the most valuable variable for a classifier to consider.

So for this dataset, how much can we reduce the dimensionality by? Let’s plot the cumulative proportion of variance explained against the number of PCs considered with cumsum.

cumsum(pve)

plot(cumsum(pve),
     xlab = 'PC',
     ylab = 'CPVE',
     type = 'b',
     col = 'red')
Cumulative Proportion of Variance Explained with the number of Principal Components

With just 50 PCs, we can halve the dimensionality of the dataset while retaining over 90% of the variance in the data. Our classifier algorithm can learn to distinguish classes much more efficiently with fewer dimensions.
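As a sketch, the reduced dataset is simply the scores on the first 50 PCs, which prcomp stores in pr.out$x:

obs_reduced <- pr.out$x[, 1:50] # 90 observations, now with 50 dimensions instead of 100
dim(obs_reduced)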

In summary, PCA can reduce the dimensions of a dataset while keeping the most important information in the data.