Skip to content

Latest commit

 

History

History
37 lines (19 loc) · 7.82 KB

dictionary_challenge_post.md

File metadata and controls

37 lines (19 loc) · 7.82 KB

Want to really learn some advanced SQL?

Download the dictionary as a CSV. A single file with one column called WORDS where all the words in the English language are there.

Now there is a lot of things you can do with that data, look at lots of distributions, how many words start with A, B, etc. How many words are 1 letter long, 2 letters, etc.

What word comes after kayak? What word comes 10 words after kayak? How about two words before it? The average length of words, the median length, etc.

But that's all basic shit right? EDIT: Now try to find all palindromes in English. Then try to find all anagrams in English.

Let's say you're playing scrabble and you can only pick 7 letters, what 7 letters can you pick to spell the most 7 letter, 6 letter, 5 letter, 4 letter, 3 letter, 2 letter, and 1 letter words in English?

Now in Scrabble the letters you can select are limited to the tiles in the bag... so add that layer of complexity.

Now in Scrabble you can only use all of your letters without joining to another word if you go first. So add 8 letter words to the list.

When you have an answer to the Scrabble question you will more or less understand everything there is to understand about SQL.

edit: If you really want to go to the next level, calculate which 8 letter word in English you can spell to score the maximum amount of Scrabble points, and which 7 letters you'd need to have to do it. You could even add in the bonus tiles if you really wanted to get complex, or you could even add in a 9 or 10, whatever length word which would imply you joined a word above, and a word below to make one much longer word. Getting up to answering those questions would be very complex, and SQL probably isn't the best place to answer questions like that... but you could do it, and you can learn all of the complexities just starting from (1) table with (1) column of data. You may need to build functions, or a sproc, or views, or loops.

There are a lot of ways to approach this general problem, but all you need is the dictionary. You will likely need to create new tables that store information such as what letters are available in the Scrabble bag, and what their scores are. You may find it in your interest to index these new tables, etc. And all of this can be done on a cheap laptop. What starts out as being a very simple data set may have you writing some advanced queries that take a long time to run. Can you improve them? Are there better ways to go about what you're trying to do? This question involves a TON of advanced transformation, and I love talking about it with candidates in an interview. I have no ambition to have a candidate try to solve it on the spot (it would take a long time) but I like talking to them about the idea and seeing what they have to say about ways they might potentially try solving it, then I might ask more probing questions such as, "Have you written a lot of functions before?"

What I really love about this problem is that it really makes you focus on the 'basics' of SQL while touching more complex ideas. As a simple example, if I have all the letters to spell affairs do I also have all the letters I need to spell fair? How can you prove that? Are there 'duplicate' letters in affairs? If you have the letters afirs how can you tell that you have the right letters to spell fair?

So my point here is that you can really dive into the data, and you can really easily check your work. It's easy data to understand because you are a master of of the subject. You can start thinking about the problems mentally and imagining how you might solve them in SQL, and then use Google to ask specific questions about how to do a specific procedure. This will let you learn more. It will force you to architect a solution from the ground up and build your own database, as opposed to inheriting a random dataset and working through problems someone else has given you, you will have to work through your own problems as they come up. Being close to the data and understanding it intimately is a key to really learning SQL on an advanced level. Anyone can write a query that runs and spits out data, but not everyone can write a query that is correct, and which they can demonstrably show is correct. Doing that requires you really learn the data, or the database, and understand every part of it, and how it interacts with every other part.

Don't rely on someone else, or a class to give that data to you --> make it yourself.

Not interested in the dictionary challenge? Make a database of baseball statistics, or fantasy football statistics. Go out there and find data that is meaningful to you, which you already naturally understand, and then use SQL to do something meaningful with it. Find something where you are already a subject matter expert and use SQL to answer interesting questions. Build a database of recipes, and prices of ingredients. If you have 8 things in your fridge what recipes can you cook with 9 things (i.e. only needing to go out and buy one.)

If you do this you will actually learn. You will teach yourself, and more importantly you will teach yourself how to research SQL techniques to fit the objective of what you're trying to achieve. You'll learn how to imagine the problem in your head, and then just need to figure out how to translate that into SQL.

What I like to do whenever I look at a new data source at work is to imagine how I'd answer a question using a piece of paper. You know to answer the question my boss has I'd first have to calculate this... then that... then this... need these dimensions... Then I open up Excel and I mock something up. I share it with him and ask him if that's the format that he wants the data in. From that point I know what the final data should look like, and all I need to do is start asking questions of the data source to see how it operates. I'm currently working with a set of data where each account has (1) row of data each month in a table, unless that account is in a certain type of status and then it can have one row of data per day, or even 5 or 6 rows per day. If I want to do a day over day comparison I need to take the (1) row of data for the month and give it values for every other day of the month, but then for the other accounts I need to take the max value per day based on date, and then for any missing days use that value until we get a new record. It's not really that complex but it took a long time for me to understand how the table records data, and there's no magic bullet there. You do it slowly, you explore the data, you find specific account id's and track. You do things like select accountid, count() from table group by accountid order by count() desc and then find interesting accounts with lots of data to compare with other accounts with less data. You painstakingly review the transactional history of accounts in one table and compare it to the structure of other tables.

You're essentially becoming a subject matter expert all by yourself, and whenever you find something strange you take your questions to the developers that built the solution (if possible.) Then strangely over time you become the actual expert on those tables, and the developers will start asking you questions (hopefully) before they start making changes... to (hopefully) ensure that all the processes you've built touching that table aren't going to explode when they implement something new.

Sorry for the long winded response but this lock down has be bored. Going back up to my previous example of spelling fair from affairs: Write it down on a piece of paper and figure out how you'd solve that with a pen. Then Google SQL concepts that will help you achieve it. Do that with a dictionary (1) column of data... then later do it for baseball stats... then later do it for a company's financial data. Eventually you're a senior SQL developer, or architect, and most of what you know will be a product of what you've taught yourself.