Articles

Common Date Manipulations on Data Science SQL Interviews

by Rakib Rimon R. SEO

Let’s talk about a common date manipulation on data science interviews and on the job. Extracting components from the date field like year, day, month is super common in analytics. Oftentimes you’re given a date field and asked to calculate or count some metric over years or months. But the date field has a granularity down to the day. You really just need to aggregate the date field up one level to month or year. 

 

This is a super common task in any analytical role. But it can get kind of confusing because (1) there are a bunch of different functions that either do the same thing or does something similar to another function but slightly different. It’s hard to know which function to choose. (2) Different SQL dialects have different functions. So a function you’d use in postgres might not be the same as a function you use in MySQL (2) The column in the database might not be in the proper format or data type, so you have to spend some time converting and manipulating the data and that can get confusing. 

 

So let’s start simple. Let’s go through one SQL example. We’ll illustrate a few functions that can be used to extract parts of a date. That’s a common use case and thing you’d do as a data scientist. 

 

You can also watch a video on how I manipulate dates in SQL here.

 

Date Manipulation In A Data Science Interview Question

Let’s take a look at this question here. It’s a simple example so I won’t go into too much detail on how to solve this problem. Instead I’ll focus more on the date component to the solution.

 

Now that we know what the question is asking us to find. Let’s go through the framework to solve this problem. 

 

Approach

1.    Take a look at the data

2.    Select the columns needed to answer the question

a.            Now that we understand the data, we can select the columns we know we’ll need to answer the question

b.            Inspection_date, violation_id, business_name

                Select what the output going to look like

 .              Another really important step is to visualize what the output and answer is going to look like. More specifically, what columns will be needed in the output?

a.            Year from the inspection_date column, number of inspections which will be a count(). 

b.            I know the business name and violation ID will be used to filter my data so it will be used in the WHERE clause

 

Coding and Manipulating Dates

 

I also like to code in steps and apply one logic and business rule at a time, then test my code. This is obviously a tip for if you’re on the job and have a working database to do this. But it also works on an interview where you’d guide the interviewer through your logic one at a time. 

 

So let’s apply the filters first. That’s usually where I’ll start.

 

SELECT *

FROM sf_restaurant_health_violations

WHERE business_name = 'Roxanne Cafe'

  AND violation_id IS NOT NULL

 

Now let’s try to get the output we want.

 

EXTRACT(year FROM request_date::DATE) to extract the year

https://www.postgresql.org/docs/9.1/functions-datetime.html

Returns a double percisions

 

SELECT EXTRACT (YEAR

                FROM inspection_date) AS YEAR,

               count(*) AS n_inspections

FROM sf_restaurant_health_violations

WHERE business_name = 'Roxanne Cafe'

  AND violation_id IS NOT NULL

GROUP BY YEAR

ORDER BY YEAR ASC



This will break because the inspection_date is not a date column. It’s an object which in this platform means that it’s either a text or varchar datatype. This platform executes the code using python so some of the components that you’ll see are python related. But we’ll try to improve this over time.

Cast the column into a date dtype using either double colons or the casting functions. Double colons is basically a casting function that you can use in postgres. Casting function can be used across other sql dialects like MySQL.

 

It’s also okay to put year in the GROUP BY because this SELECT clause gets executed first so the interpreter knows that there’s a column named year in this query

 

SELECT EXTRACT (YEAR

                FROM cast(inspection_date as DATE)) AS YEAR,

               count(*) AS n_inspections

FROM sf_restaurant_health_violations

WHERE business_name = 'Roxanne Cafe'

  AND violation_id IS NOT NULL

GROUP BY YEAR

ORDER BY YEAR ASC

 

A few other notes that are really important. 

 

It’s really common to have to isolate parts of the date in the SELECT clause just like we’re doing here. But it’s also common to isolate parts of dates in the WHERE clause if you’re trying to filter data. So say we’re trying to filter the data to only consider 2015 records. You could add the filter like this.

 

SELECT EXTRACT (YEAR

                FROM cast(inspection_date as DATE)) AS YEAR,

               count(*) AS n_inspections

FROM sf_restaurant_health_violations

WHERE business_name = 'Roxanne Cafe'

  AND violation_id IS NOT NULL

  AND EXTRACT (YEAR FROM cast(inspection_date as DATE)) = 2015

GROUP BY YEAR

ORDER BY YEAR ASC

 

Another important note to consider

Different sql dialects like MySQL, Postgres, Oracle, MS SQL server all have different functions to manipulate the dates. This Extract() function works on most dialects. 

 

If you’re on postgres, You can use date_part() which works like EXTRACT. 

 

SELECT date_part ('YEAR', inspection_date :: DATE) AS YEAR,

               count(*) AS n_inspections

FROM sf_restaurant_health_violations

WHERE business_name = 'Roxanne Cafe'

  AND violation_id IS NOT NULL

GROUP BY YEAR

ORDER BY YEAR ASC

 

You can also use YEAR() if you’re on MySQL.

 

Other dialects might not have date_part(). But have their own. So don’t get confused if you see different functions that do the same thing.

 

Conclusion

Selecting parts of a date from a date field is a common task in analytics and data science. In addition to year, you also might break the date down by month or year-month. 

 

There are also multiple functions that can accomplish this. A lot of it depends on the SQL dialect you’re using. I find this to be the most confusing part. A lot of people including myself get confused when there’s so many functions that seemingly do the same thing.


If you’d like to see a video on how I solved this problem in real time, watch it here.


Sponsor Ads


About Rakib Rimon R. Innovator   SEO

18 connections, 1 recommendations, 80 honor points.
Joined APSense since, September 13th, 2020, From Rajshahi, Bangladesh.

Created on Feb 19th 2021 12:38. Viewed 154 times.

Comments

No comment, be the first to comment.
Please sign in before you comment.