Today, I continued with my Coursera course on SQL. Other than that, I started listening to the Towards Data Science podcast and heard Joel Grus talk about the downsides of using Jupyter Notebooks. I was a little shocked to hear that he does not think they should be the default coding environment for a data scientist. However, once he explained the problems with reproducibility and hidden state, it made sense. Particularly in research, these are key aspects of building a robust model with solid results.
Another aspect of data science he talked about was unit testing: taking a small chunk of data and running small tests to make sure your code works as planned. This could save hours if there is a bug in the code and the model takes many hours to train. I will look for ways to implement this in my current projects and future work.
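Here is a minimal sketch of what that could look like in Python. The `normalize` function and the sample values are made up for illustration (they are not from the podcast); the point is only that a test on three numbers runs instantly, long before any training starts.

```python
def normalize(values):
    """Hypothetical preprocessing step: scale values linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread, so map everything to 0.0
        # instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def test_small_sample():
    # Tiny hand-checkable chunk of data: min -> 0.0, midpoint -> 0.5, max -> 1.0
    assert normalize([2, 4, 6]) == [0.0, 0.5, 1.0]


def test_constant_column():
    # Edge case that would otherwise raise ZeroDivisionError
    assert normalize([3, 3, 3]) == [0.0, 0.0, 0.0]


if __name__ == "__main__":
    test_small_sample()
    test_constant_column()
    print("all tests passed")
```

The same functions can be picked up by a test runner such as pytest, which collects functions whose names start with `test_`.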
Summary — Filtering & Sorting
- Why? To retrieve a specific subset of data, reduce the number of retrieved records, increase query performance and reduce strain on the client database
- WHERE
SELECT column_name
FROM table_name
WHERE column_name operator value;
- Can use IN, OR, BETWEEN, AND, NOT as well for filtering
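- IN vs OR: IN can take a list of multiple values, tends to run faster, and can contain another SELECT; with OR, be careful about evaluation order, since AND is processed before OR (use parentheses to make the intent explicit)
- Wildcards (used with LIKE): % matches any run of characters (e.g. '%value'), _ matches a single character (e.g. '_value', only in some DBMSs), [] matches a specific set of characters (also not supported everywhere); a wildcard can take longer to run than = or <, and its placement in the pattern matters
- ORDER BY: can order by a column that is not selected and by multiple columns; must be the last clause in the query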
SELECT column_name_1, column_name_2
FROM table_name
ORDER BY column_name_1 ASC|DESC, column_name_5 ASC|DESC, …; (choose ASC or DESC for each column)
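To try the notes above on real rows, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The `products` table, its columns, and its rows are all made up for illustration, and the `names` helper is my own. SQLite's LIKE does not support the [] wildcard, so only % and _ are shown.

```python
import sqlite3

# In-memory database with a hypothetical `products` table (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, category TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("Pizza", "food", 9.5),
        ("Pasta", "food", 7.0),
        ("Pen", "office", 1.5),
        ("Desk", "office", 120.0),
        ("Plant", "home", 12.0),
    ],
)


def names(query):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(query)]


# WHERE with a comparison operator
cheap = names("SELECT name FROM products WHERE price < 10 ORDER BY name")
# -> ['Pasta', 'Pen', 'Pizza']

# IN with a list of values
in_list = names(
    "SELECT name FROM products WHERE category IN ('office', 'home') ORDER BY name"
)
# -> ['Desk', 'Pen', 'Plant']

# AND is processed before OR, so this reads as: food OR (office AND price > 100)
no_parens = names(
    "SELECT name FROM products "
    "WHERE category = 'food' OR category = 'office' AND price > 100 ORDER BY name"
)
# -> ['Desk', 'Pasta', 'Pizza']

# Parentheses force: (food OR office) AND price > 100
with_parens = names(
    "SELECT name FROM products "
    "WHERE (category = 'food' OR category = 'office') AND price > 100 ORDER BY name"
)
# -> ['Desk']

# Wildcards: % matches any run of characters, _ matches exactly one character
starts_with_p = names("SELECT name FROM products WHERE name LIKE 'P%' ORDER BY name")
# -> ['Pasta', 'Pen', 'Pizza', 'Plant']
three_letters = names("SELECT name FROM products WHERE name LIKE 'P_n'")
# -> ['Pen']

# ORDER BY columns that are not selected, with a direction per column
ordered = names("SELECT name FROM products ORDER BY category ASC, price DESC")
# -> ['Pizza', 'Pasta', 'Plant', 'Desk', 'Pen']

conn.close()
```

The `no_parens` and `with_parens` queries differ only in the parentheses, yet return three rows versus one, which is the "be careful of order with OR" point in practice.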