more spark stuff
Shameek Agarwal committed Jan 5, 2025
1 parent 6511b62 commit 26a76c6
Showing 26 changed files with 783 additions and 363 deletions.
8 changes: 4 additions & 4 deletions _posts/2023-07-23-snowflake.md
@@ -678,9 +678,9 @@ title: Snowflake

## Cost Governance Framework

- **visibility** - understand, attribute and monitor the spend. e.g. dashboards, tagging, resource monitors. TODO - links
- **control** - limit and control the spend. e.g. TODO - links
- **optimization** - optimize the spend. e.g. TODO - links
- **visibility** - understand, attribute and monitor the spend. e.g. [dashboards](#dashboards), [tagging](#query-tagging), [resource monitors](#resource-monitors)
- **control** - [limit and control](#control-strategies) the spend
- **optimization** - optimize the spend

## Dashboards

@@ -741,7 +741,7 @@ title: Snowflake
- virtual warehouses managed by us
- virtual warehouses managed by cloud services
- we can also monitor credits used by serverless compute. recall how it is broken down by services
- so, we use `metering_history` for this, and filter out the rows for virtual warehouses
- so, we use `metering_history` for this, and filter out the rows for virtual warehouses by using `service_type != 'WAREHOUSE_METERING'` (we already created a chart for the warehouse_metering table above)
```sql
select
date_trunc('month', start_time),
2 changes: 1 addition & 1 deletion _posts/2023-08-12-messaging-systems.md
@@ -309,7 +309,7 @@ if __name__ == '__main__':
- we could have written the kafka producer and consumer code ourselves. issue - handling failures, retries, scaling elastically, data formats, etc would all have to be done by us
- miscellaneous advantages -
- kafka also acts as a "buffer" for the data, thus applying "back pressure" as needed
- once data is in kafka, we can stream it into multiple downstream targets. this is possible because kafka stores this data per entity in a different kafka topic. ss
- once data is in kafka, we can stream it into multiple downstream targets. this is possible because kafka stores this data per entity in a different kafka topic
- some architectures -
- an application can just put elements into the kafka topic, while kafka connect can transfer this data from kafka to the sink. application -> kafka -> kafka connect -> database
- we can have several old applications putting data into their database at rest. kafka connect can use cdc and put this data into kafka in near realtime. sometimes, kafka can also act as the permanent system of record. so, new applications can directly read from kafka to service requests. applications -> database -> kafka connect -> kafka -> application
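
to make the producer side of these architectures concrete, here is a minimal sketch of the "application -> kafka" leg. it assumes the kafka-python client, a local broker and an `orders` topic, none of which come from the notes above - the point is only that concerns like retries and acknowledgements are client configuration rather than code we write ourselves.

```python
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                        # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # serialize dicts as json
    acks="all",   # wait for all in-sync replicas - durability over latency
    retries=5,    # the client retries transient failures for us
)

# the application only appends events; kafka connect (e.g. a jdbc sink connector)
# would move them from the "orders" topic into the database downstream
producer.send("orders", {"order_id": 1, "amount": 42.5})
producer.flush()
```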
2 changes: 1 addition & 1 deletion _posts/2023-08-19-hadoop.md
@@ -85,7 +85,7 @@ title: Hadoop
- we can use the same class for combiner and reducer if we want
- combine may or may not run. e.g. if hadoop feels the amount of data is too small, the combine operation might not run. so, the following points are important -
- our combine operation should be optional i.e. we should be sure that even if our combine operation does not run, our results stay the same. e.g. we want to find out all the words that occur 200 or more times. we can only add the values for a key in a combiner. writing the word to the context based on the condition that it occurs 200 or more times can only happen inside the reducer since at that point, the reducer has all the values. basically, it might happen that one worker's combiner sees a count of 150 for a particular word while another worker's combiner sees a count of 60 for the same word
- input and output format of combine operation should be same so that it whether it runs or not makes no difference (and of course these types should also be the same as output of map and input of reduce)
- input and output format of combine operation should be same so that whether it runs or not makes no difference (and of course these types should also be the same as output of map and input of reduce)
- so, the entire process looks like this? - map -> combine -> partition -> shuffle -> sort -> group -> reduce
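
a small sketch of the 200-or-more-times example, written with the python mrjob library (an assumption - the notes above do not prescribe a framework). the combiner only folds partial counts, while the threshold check lives exclusively in the reducer, so the result is the same whether or not the combiner runs.

```python
from mrjob.job import MRJob


class MRFrequentWords(MRJob):
    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def combiner(self, word, counts):
        # safe: only sums the partial counts seen on this worker, and its
        # output type matches both the map output and the reduce input
        yield word, sum(counts)

    def reducer(self, word, counts):
        total = sum(counts)
        # the threshold check can only live here, where all values for the key are visible
        if total >= 200:
            yield word, total


if __name__ == "__main__":
    MRFrequentWords.run()
```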

## HDFS Commands
6 changes: 3 additions & 3 deletions _posts/2023-12-27-spark.md
@@ -446,7 +446,7 @@ print(counts_by_genre.collect())
## Execution Plan / Catalyst Optimizer Working

- the **catalyst optimizer** works internally in the following steps
- or we can say that spark executes the **execution plan** in the following steps -
- or we can say that spark executes the **execution plan** in the following steps
- generate an ast or **abstract syntax tree**. any errors in our field names, data types, sql function usage, etc would be caught here
- now we will have a **logical plan**
- perform optimization on our logical plan. the optimization here includes techniques like -
@@ -456,7 +456,7 @@ print(counts_by_genre.collect())
- generate a bunch of **physical plans**, and associate a cost with each of them. e.g. one plan uses shuffle join, another uses broadcast join
- finally, a **cost model** is used to pick the most optimal physical plan
- **wholestage code generation** - generate the bytecode to run on each executor
- note - i thing wherever **dag** (directed acyclic graph) is mentioned, it refers to this entire process
- note - i think wherever **dag** (directed acyclic graph) is mentioned, it refers to this entire process

![execution plan](/assets/img/spark/execution-plan.jpg)
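
these plans can be inspected directly. below is a minimal pyspark sketch (the dataframe, column names and app name are made up for illustration) that prints the parsed and analyzed logical plans, the optimized logical plan and the selected physical plan via `explain`.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

# a toy dataframe; real pipelines would read from a table or files
df = spark.createDataFrame(
    [("action", 7.1), ("drama", 8.3), ("action", 6.5)],
    ["genre", "rating"],
)

agg = df.filter(F.col("rating") > 7).groupBy("genre").count()

# prints the parsed and analyzed logical plans, the optimized logical plan
# and the chosen physical plan - roughly the steps described above
agg.explain(extended=True)
```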

@@ -474,7 +474,7 @@ print(counts_by_genre.collect())
- there are two ways to access external sources -
- ingest using external tools to write data from external sources to internal sources. data goes unmodified from different sources into internal sources. then spark reads from these internal sources directly
- make spark directly read from these different external sources
- **batch processing** prefers first option because for e.g. our db capacity was provisioned with otlp workloads in mind, and might not be optimized for spark based big data workloads. thus, it helps decouple the two from performance, security, etc perspective
- **batch processing** prefers first option because for e.g. our db capacity was provisioned with oltp workloads in mind, and might not be optimized for spark based big data workloads. thus, it helps decouple the two from performance, security, etc perspective
- **stream processing** prefers second option<br />
![spark architecture](/assets/img/spark/spark-architecture.drawio.png)
- the different tools to my knowledge include kafka by itself (probably the most common), talend, debezium, informatica
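
a hedged pyspark sketch of the two patterns - a batch read from an internal source that some external tool has already populated, and a streaming read directly from kafka. the paths, topic and broker address are placeholders, not taken from the notes above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sources-demo").getOrCreate()

# batch: an external tool has already landed the data in an internal source
# (here, parquet files in a data lake), so spark never hits the oltp database
orders_batch = spark.read.parquet("s3a://datalake/orders/")

# streaming: spark structured streaming reads the external source (kafka) directly
# (assumes the spark-sql-kafka connector package is available at runtime)
orders_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
)
```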
