<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Adi Polak</title><link>https://adipolak.github.io/adipolak-blog/</link><description>AI and Cloud expert sharing insights on AI systems, cloud computing, distributed systems, data analytics, and technical leadership</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><managingEditor>Adi Polak</managingEditor><copyright>Copyright &amp;#169; 2020 Adi Polak. All rights reserved.</copyright><lastBuildDate>Thu, 12 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://adipolak.github.io/adipolak-blog/feed.xml" rel="self" type="application/rss+xml"/><item><title>Data+AI Summit 2021 is Coming</title><link>https://adipolak.github.io/adipolak-blog/post/data_and_ai_summit2021/</link><pubDate>Sun, 11 Apr 2021 00:00:00 +0000</pubDate><author>Adi Polak</author><guid>https://adipolak.github.io/adipolak-blog/post/data_and_ai_summit2021/</guid><description>Knock Knock! Who's there? It's me, Summit. Summit who? Data + AI Summit</description><content:encoded><![CDATA[<p>It&rsquo;s almost been half a year since the last summit.</p>
<p>Data+AI Summit 2021 runs from Monday, May 24, through Friday, May 28.
Training will be held on May 24-25 and caters to a broader set of practitioners than in previous years:
Data Analyst, Data Engineer, Data Scientist, ML Engineer, Partner Data Engineer, Platform Engineer, Technical. The breadth of roles makes me curious about the various technical personas in the Data and AI space. It&rsquo;s not only Data Engineers and Data Scientists; a wide range of people can benefit from attending the summit.</p>
<p>Interestingly, Monday training starts at 6 am PT, allowing folks in European time zones to join in their afternoon.</p>
<p>The pass for the summit is free upon <a href="https://databricks.cventevents.com/event/45414668-315b-4f08-b539-d9269a28d939/regProcessStep1:699bc051-23ea-466e-991b-2be0ed69ee5c?_ga=2.119768332.560979754.1618133304-1179355131.1607861196&amp;RefId=General%20Attendee&amp;rp=1e9c37cb-3d4a-44ed-9736-1a5ce56f7f05">registration</a>.</p>
<p>The agenda is somewhat missing from the main page, but trying out multiple URLs brought me to the whole <a href="https://databricks.com/dataaisummit/north-america-2021/agenda">agenda on the site</a>.</p>
<p>There are 3 main focus areas:</p>
<ul>
<li>Productionizing Machine Learning</li>
<li>Databricks Experience</li>
<li>Technical Deep Dives</li>
</ul>
<img class="responsive" src="../../images/summit2021/summit-focus.png" alt="drawing">
<p>As you know by now, I think all the sessions are great! Sessions were carefully picked out of hundreds to provide the maximum value to the summit participants.</p>
<p>But let&rsquo;s assume we can attend only three. Which three would you choose?</p>
<p><ins>Here are my top picks:</ins></p>
<h3 id="1st-from-the-machine-learning-space---strategies-for-debugging-machine-learning-systemshttpsdatabrickscomsession_na21real-world-strategies-for-debugging-machine-learning-systems"><strong>1st:</strong> From the machine learning space - <a href="https://databricks.com/session_na21/real-world-strategies-for-debugging-machine-learning-systems">Strategies for Debugging Machine Learning Systems</a></h3>
<p>This session is interesting since there are many ways to build machine learning models, from centralized distributed machine learning to decentralized approaches like Federated Learning. Of course, there are many places where things can go wrong when training on one machine or on multiple machines. The space of security and adversaries in machine learning training is a magical one. Like a rare diamond, you don&rsquo;t know its worth until you examine it carefully.
For example, in federated learning, each device is training a model locally on its data and shares the model summary. Assuming one of the devices is an adversary, an attacker who wants to change the model&rsquo;s overall result can share a false summary with the rest of the group and impact the model&rsquo;s overall behavior.
In some cases, that means lower revenue, but when we use machine learning to save lives, as in healthcare use cases, the stakes are much higher.</p>
<h3 id="2nd-from-the-databricks-experience-space---video-analytics-at-scale-dl-cv-ml-on-databricks-platformhttpsdatabrickscomsession_na21video-analytics-at-scale-dl-cv-ml-on-databricks-platform"><strong>2nd</strong> From the Databricks Experience space - <a href="https://databricks.com/session_na21/video-analytics-at-scale-dl-cv-ml-on-databricks-platform">Video Analytics At Scale: DL, CV, ML On Databricks Platform</a></h3>
<p>The future of data is complex data, and complex data means sound and video. Let me bottom-line this for you: video is here to stay, be it social media, Netflix, or autonomous drones/cars/&hellip; The ability to process complex data at scale and enable machine learning at scale is the future.</p>
<p>And lastly, from the Technical Deep Dives:</p>
<h3 id="3rd-becoming-a-data-driven-organization-with-modern-lakehousehttpsdatabrickscomsession_na21becoming-a-data-driven-organization-with-modern-lakehouse"><strong>3rd</strong> <a href="https://databricks.com/session_na21/becoming-a-data-driven-organization-with-modern-lakehouse">Becoming a Data-Driven Organization with Modern Lakehouse</a></h3>
<p>Understanding the Modern lakehouse is one thing, but knowing how to rally all the stakeholders and build a robust one is a different challenge.</p>
<iframe src="https://giphy.com/embed/4JVTF9zR9BicshFAb7" width="480" height="360" frameBorder="0" class="responsive" allowFullScreen></iframe>
<hr>
<p>I&rsquo;m going to let you in on a secret: all sessions are going to be recorded and shared after the summit. But there is some magic in attending live, asking questions, and participating on Twitter and in chat with like-minded people who came to learn, exchange ideas, and network.</p>
<h3 id="links">Links:</h3>
<ul>
<li><a href="https://databricks.cventevents.com/event/45414668-315b-4f08-b539-d9269a28d939/regProcessStep1:699bc051-23ea-466e-991b-2be0ed69ee5c?_ga=2.119768332.560979754.1618133304-1179355131.1607861196&amp;RefId=General%20Attendee&amp;rp=1e9c37cb-3d4a-44ed-9736-1a5ce56f7f05">Registration</a></li>
<li><a href="https://databricks.com/dataaisummit/north-america-2021/agenda">Agenda</a></li>
<li>Learn about <a href="https://bit.ly/3uABica">Azure Databricks</a></li>
</ul>
]]></content:encoded><category>conference</category><category>apache spark</category><category>distributed-systems</category></item><item><title>Machine Learning in Production - Concepts you should know</title><link>https://adipolak.github.io/adipolak-blog/post/machine-learning-in-production---concepts-you-should-know/</link><pubDate>Thu, 04 Mar 2021 00:00:00 +0000</pubDate><author>Adi Polak</author><guid>https://adipolak.github.io/adipolak-blog/post/machine-learning-in-production---concepts-you-should-know/</guid><description>To productionize machine learning, know the concepts first</description><content:encoded><![CDATA[<p>Are you interested in learning about the Machine Learning side of data? Hurry 🎉 , you have reached the right place to start learning about it.</p>
<p>Here is a list of concepts for you to get started:</p>
<h2 id="ml-algorithm">ML Algorithm</h2>
<p>An ML algorithm is a procedure that runs on data and produces a machine learning model. Some of the popular ones are decision trees, Naive Bayes, and linear regression.</p>
<h2 id="ml-model">ML Model</h2>
<p>An ML model is the outcome of running an ML algorithm; it often contains a statistical representation of the data ingested into the algorithm. The ML model&rsquo;s input is data, and its output is a prediction, decision, or classification.</p>
<h2 id="training-set">Training set</h2>
<p>The training set is the data ingested into the machine learning algorithm; it trains the ML model.</p>
<h2 id="testing-set">Testing set</h2>
<p>The testing set is the dataset we test the ML model with. To test the ML model&rsquo;s accuracy, we feed the data into the model and measure the accuracy of the outcome. It helps us reason about the quality of the machine learning model.</p>
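<p>To make the split between the two sets concrete, here is a minimal, hedged sketch using scikit-learn; the dataset, split ratio, and algorithm are only illustrative choices:</p>
<pre tabindex="0"><code># a minimal sketch of splitting data into a training set and a testing set
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)        # features and labels

# 80% of the rows train the model, the remaining 20% test it
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)  # train on the training set
print(model.score(X_test, y_test))                      # accuracy on the testing set
</code></pre>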
<iframe src="https://giphy.com/embed/HUplkVCPY7jTW" width="480" height="360" frameBorder="0" class="responsive" allowFullScreen></iframe>
<h2 id="machine-learning-pipeline">Machine Learning pipeline</h2>
<p>The machine learning pipeline is an automation process of the machine learning workflow. It includes data transformation and correlation to fit the ML algorithm, running the algorithm to produce a model, and testing it with a test set.</p>
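<p>A minimal sketch of such a pipeline with scikit-learn could look like this; the transformation step and the algorithm are only placeholders for your own workflow:</p>
<pre tabindex="0"><code># a minimal ML pipeline sketch: transform the data, train a model, then test it
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipeline = Pipeline([
    ("scale", StandardScaler()),       # data transformation step
    ("model", LogisticRegression()),   # ML algorithm that produces the model
])

pipeline.fit(X_train, y_train)         # run the whole workflow on the training set
print(pipeline.score(X_test, y_test))  # test the resulting model with the test set
</code></pre>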
<h2 id="model-interpretability">Model interpretability</h2>
<p>ML Model interpretability is the degree to which a human can reason the machine learning model&rsquo;s output. The higher the degree, the easier it is for a human to understand the model&rsquo;s decision or prediction.</p>
<h2 id="data-quality">Data Quality</h2>
<p>Data quality measures the data&rsquo;s condition based on accuracy, precision, legitimacy, validity, reliability, consistency, completeness, and more. In machine learning, data quality is important for producing high-quality, unbiased machine learning models.</p>
<h2 id="data-drifts">Data drifts</h2>
<p>Data drift is an unexpected and undocumented change to the data&rsquo;s structure or semantics. Data drift can result in corrupted data and low data quality. Lack of awareness of data drift can result in lower-quality ML models.</p>
<h2 id="concept-drift">Concept drift</h2>
<p>Concept drift refers to the changes in target variables.
Target variables are the outcomes of the prediction process you do with machine learning models.
You can detect concept drift by measuring the statistical properties of the target variables.
The actual target variable can change over time in unforeseen ways, which presents a challenge since the predictions become less accurate as time passes.</p>
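<p>One simple way to watch for drift is to compare the statistical distribution of recent target values against a reference window; here is a hedged sketch using a two-sample Kolmogorov-Smirnov test, with made-up data and an illustrative threshold:</p>
<pre tabindex="0"><code># a minimal drift-monitoring sketch: compare recent target values to a reference window
import numpy as np
from scipy.stats import ks_2samp

reference = np.random.normal(loc=0.0, scale=1.0, size=1000)  # e.g. last month's targets
recent = np.random.normal(loc=0.5, scale=1.0, size=1000)     # e.g. this week's targets

statistic, p_value = ks_2samp(reference, recent)
if p_value &lt; 0.05:  # hypothetical threshold, tune it for your use case
    print("Possible drift detected - consider retraining the model")
</code></pre>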
<hr>
<iframe src="https://giphy.com/embed/EXFAJtutz5Ig8" width="480" height="360" frameBorder="0" class="responsive" allowFullScreen></iframe>
<p>I hope it was helpful for you and gave you more clarity about the concepts.</p>
<h2 id="-curious-to-learn-more">💡 Curious to learn more?</h2>
<p>Read here about how to create <a href="https://docs.microsoft.com/en-us/learn/paths/create-machine-learn-models/?WT.mc_id=social-00000-adpolak">machine learning models with python</a>.</p>
]]></content:encoded><category>data science</category><category>machine learning</category><category>terminology</category><category>ai</category></item><item><title>Delta Lake essential Fundamentals: Part 4 - Practical Scenarios</title><link>https://adipolak.github.io/adipolak-blog/post/delta-lake-essential-fundamentals---part-4/</link><pubDate>Mon, 22 Feb 2021 00:00:00 +0000</pubDate><author>Adi Polak</author><guid>https://adipolak.github.io/adipolak-blog/post/delta-lake-essential-fundamentals---part-4/</guid><description>Multi-part series that will take you from beginner to expert in Delta Lake</description><content:encoded><![CDATA[<p>🎉 Welcome to the 4th part of Delta Lake essential fundamentals: the practical scenarios! 🎉</p>
<p>There are many great features that you can leverage in Delta Lake: ACID transactions, schema enforcement, time travel, exactly-once semantics, and more.</p>
<p>Let&rsquo;s discuss two common data pipeline patterns and their solutions:</p>
<h2 id="spark-structured-streaming-etl-with-deltalake-that-serves-multiple-users">Spark Structured Streaming ETL with DeltaLake that serves multiple Users</h2>
<p><strong>Spark Structured Streaming</strong> -
Apache Spark Structured Streaming treats incoming data as an essentially unbounded table; there is a continuous stream of data ingested into the system, and as developers, we write the code to process that data continuously.
<strong>ETL</strong> stands for <strong>E</strong>xtract, <strong>T</strong>ransform and <strong>L</strong>oad.</p>
<p><strong>Scenario</strong> - Ingest data from a Kafka topic, process the information using Spark Structured Streaming, and save it to DeltaLake for multiple <em>users</em> to query on the fly. <br>
Note! The output of the solution/pipeline is used by real users and not by a machine, which means there is no need to refresh the data every couple of seconds; a person acting on the data can&rsquo;t consume updates that fast anyway.</p>
<h3 id="system-requirements">System Requirements</h3>
<p>Let&rsquo;s assume we have these system requirements: <br><br>
<strong>Input:</strong> a single unstructured input stream, for example a Kafka topic <br>
<strong>Output:</strong> structured tabular data for users to query <br>
<strong>Latency:</strong> 5 minutes <br>
<strong>Constraints:</strong> multiple users query the table at the same time</p>
<h3 id="high-level-pipeline-architecture">High-level Pipeline Architecture</h3>
<img class="responsive" src="../../images/Detla/kafka-spark-streaming-delta-scenario.png" alt="drawing">
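<p>A minimal PySpark sketch of this pipeline might look like the following; it assumes an existing <code>spark</code> session with Delta configured, and the broker address, topic name, schema, and paths are hypothetical placeholders:</p>
<pre tabindex="0"><code># a hedged sketch: Kafka to Structured Streaming to DeltaLake, with a 5-minute trigger
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType

schema = StructType().add("user_id", StringType()).add("event", StringType())

raw = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical brokers
         .option("subscribe", "events")                     # hypothetical topic
         .load())

parsed = (raw
          .select(from_json(col("value").cast("string"), schema).alias("data"))
          .select("data.*"))

(parsed.writeStream
   .format("delta")
   .outputMode("append")
   .option("checkpointLocation", "/delta/events/_checkpoints")
   .trigger(processingTime="5 minutes")   # matches the 5-minute latency requirement
   .start("/delta/events"))
</code></pre>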
<h3 id="advantages">Advantages</h3>
<p>In our scenario, we have multiple users who query the data on the fly and should see the same data - a <em>single source of truth</em>. In distributed streaming, two identical queries that run at the same time might return different results. This is why we introduce DeltaLake into the pipeline. We save the streaming tabular data in DeltaLake, which in practice means that user read operations take place on a DeltaTable snapshot, guaranteeing consistency of the data while the table is continuously being written to.<br>
Simultaneously, users can also run updates/deletes and fixes on the data when necessary; this is important for controlling incoming data that is bound to GDPR or other compliance requirements. Conflicts are resolved using the Delta conflict resolution mechanism (discussed in Part 2 - <a href="/post/delta-lake-essential-fundamentals-the-deltalog/">the DeltaLog</a>).</p>
<p>If you are using <a href="https://docs.microsoft.com/en-us/azure/databricks/delta/?WT.mc_id=delta-13569-adpolak">Databricks services</a>, you will get the Auto Optimize out of the box, which coalesces small files into larger files using <a href="https://docs.microsoft.com/en-us/azure/databricks/delta/optimizations/auto-optimize?WT.mc_id=delta-13569-adpolak">Auto Compaction</a>.</p>
<h3 id="when-to-exclude-delta-lake">When to exclude Delta Lake</h3>
<p>As much as it&rsquo;s important to know the advantages and when to use DeltaLake, it&rsquo;s important to understand when to exclude it. For example, when you need a latency of <strong>seconds</strong> to update a key-value output for lookup tables, you should probably avoid DeltaLake, since it introduces the overhead of optimistic concurrency and commits to the DeltaLog itself. But if your system can handle a couple of minutes of latency, you should consider using it to enforce a <em>single source of truth</em> for your users.</p>
<h4 id="what-to-use-instead-of-deltalake-for-updating-lookup-tables-with-seconds-latency">What to use instead of DeltaLake for updating lookup tables with seconds latency</h4>
<p>A lookup table is an array that replaces runtime computation with a simpler array-indexing operation, which means that the data is stored in memory. This makes read queries significantly faster than loading data from disk, which involves I/O operations. Hence, for stateful streaming operations, we would prefer in-memory or low-latency databases such as <a href="https://docs.microsoft.com/en-us/azure/azure-cache-for-redis/cache-overview?WT.mc_id=delta-13569-adpolak">Redis</a>, <a href="https://docs.microsoft.com/en-us/azure/cosmos-db/cassandra-introduction?WT.mc_id=delta-13569-adpolak">Cassandra</a> or Amazon DynamoDB.
<strong>This solution is more expensive</strong>, since it requires dedicated servers/services in the system versus using DeltaLake, which is a storage layer, but this is the price to pay for a lookup table that is updated and accurate for all users with seconds of latency.</p>
<!-- <highlight>
<p style="font-family:verdana;" >Note: the DetlaLake merge capabilities are currently supported in Databricks environment but not yet in the OSS.
</p>
</highlight> -->
<hr>
<p><br> <br></p>
<h2 id="join-multiple-data-streams-based-on-a-common-key-on-azure-databricks">Join Multiple Data Streams based on a common key on Azure Databricks</h2>
<p><strong>Scenario</strong> - Ingest data into the system from multiple different data streams that need to be joined on a common key and written to a shared table for future analytics/ML workloads/lookup tables.</p>
<h3 id="system-requirements-1">System Requirements</h3>
<p>Let&rsquo;s assume we have these system requirements: <br><br>
<strong>Input:</strong> multiple unstructured input streams from various sources <br>
<strong>Output:</strong> tabular data combining the streams&rsquo; inputs <br>
<strong>Latency:</strong> 2 minutes <br>
<strong>Constraints:</strong> support a join of one fast and one slow data stream with dimension changes</p>
<p>Slowly changing dimensions are a data management problem where the data warehouse contains relatively static data and schema linked to a dimension table whose schema and data can change as time passes. To learn more about it, read <a href="https://en.wikipedia.org/wiki/Slowly_changing_dimension">here</a>. Imagine user bank account information that needs to be joined with the user&rsquo;s bank transactions.</p>
<p>If you are familiar with the <a href="https://docs.microsoft.com/en-us/azure/databricks/kb/sql/bchashjoin-exceeds-bcjointhreshold-oom?WT.mc_id=delta-13569-adpolak">broadcast join mechanism</a>, the solution might look simple: broadcast the small static data table and use Spark Structured Streaming to stream the fast, ever-changing data for the join operation.<br>
But what if you are attempting to join two big tables that constantly change? This is when you need to understand how to leverage Delta Lake&rsquo;s versioning capabilities.</p>
<h3 id="high-level-pipeline-architecture-1">High-level Pipeline Architecture</h3>
<p>In this high-level architecture diagram, we have two Spark workloads; the first one is a batch job, reading data from MongoDB, processing it, and saving it to DeltaLake. The second one handles the fast data, ingesting data from a Kafka topic directly into Spark Structured Streaming and joining it with the slow data saved in DeltaLake. After the join and further logic, the data is kept in a tabular store for future use.</p>
<img class="responsive" src="../../images/Detla/azure-databricks-streaming-with-deltalake.jpg" alt="drawing">
<h3 id="advantages-1">Advantages</h3>
<p>Without a DeltaTable, Structured Streaming will hold a static view of the &ldquo;slow&rdquo; data, and it won&rsquo;t be updated until you restart the streaming query. But we already know that this data can be updated; think about a bank customer who changed their home address, which needs to be updated in the system.
Using DeltaLake introduces table versioning and allows you to control the changes and read/use the latest version inside the streaming query without restarting it.</p>
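<p>A minimal stream-static join sketch in PySpark could look like this; it assumes an existing <code>spark</code> session, and the paths, topic, schema, and column names are hypothetical. The exact reload-without-restart behavior depends on your Delta/Databricks version, as noted below:</p>
<pre tabindex="0"><code># a hedged sketch: join a fast Kafka stream with slow-changing data kept in DeltaLake
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

txn_schema = (StructType()
              .add("account_id", StringType())
              .add("amount", DoubleType()))

# slow-changing dimension data, maintained as a Delta table
accounts = spark.read.format("delta").load("/delta/accounts")

# fast stream of transactions arriving from Kafka
transactions = (spark.readStream
                  .format("kafka")
                  .option("kafka.bootstrap.servers", "broker:9092")
                  .option("subscribe", "transactions")
                  .load()
                  .select(from_json(col("value").cast("string"), txn_schema).alias("t"))
                  .select("t.*"))

# stream-static join on the common key
joined = transactions.join(accounts, on="account_id", how="left")

(joined.writeStream
   .format("delta")
   .option("checkpointLocation", "/delta/joined/_checkpoints")
   .start("/delta/joined"))
</code></pre>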
<highlight>
<p style="font-family:verdana;" > Note: the DeltaLake automatic reload without restart capabilities are currently supported in the Databricks environment but not yet in the OSS.
</p>
</highlight>
<hr>
<iframe src="https://giphy.com/embed/Ec5RkrmARxPmTuXgrZ" width="480" height="360" frameBorder="0" class="responsive" allowFullScreen></iframe>
<p>I hope you enjoyed reading about Delta Lake, the two practical scenarios, and the breakdown of the open-source project; <br>
I will continue to share scenarios, insights, and code samples throughout the blog. As always, if you have questions, suggestions, ideas, please don&rsquo;t hesitate to DM me on <a href="https://twitter.com/intent/follow?original_referer=http%3A%2F%2Flocalhost%3A1313%2F&amp;ref_src=twsrc%5Etfw&amp;region=follow_link&amp;screen_name=AdiPolak&amp;tw_p=followbutton">Adi Polak</a> 🐦.</p>
<p>If you would like to get monthly updates, consider <a href="https://sub.adipolak.com/subscribe">subscribing</a>.</p>
<h2 id="-learn-more">💡 Learn more!</h2>
<ul>
<li>Watch this video on how to <a href="https://www.youtube.com/watch?v=eOhAzjf__iQ">architect structured streaming</a>.</li>
<li>Read here about <a href="https://docs.microsoft.com/en-us/azure/databricks/getting-started/spark/streaming?WT.mc_id=delta-13569-adpolak">Azure Databricks and Streaming</a>.</li>
</ul>
<p>If you didn&rsquo;t get a chance to read the previous posts, read here: <br></p>
<ol>
<li><a href="/post/delta-lake-essential-fundamentals/">Delta Lake essential Fundamentals: Part 1 - ACID</a></li>
<li><a href="/post/delta-lake-essential-fundamentals-the-deltalog/">Delta Lake essential Fundamentals: Part 2 - The DeltaLog</a></li>
<li><a href="/post/delta-lake-essential-fundamentals-part-3/">Delta Lake essential Fundamentals: Part 3 - Compaction and Checkpoint</a></li>
<li>Delta Lake essential Fundamentals: Part 4 - Practical Scenarios (You are here)</li>
</ol>
]]></content:encoded><category>open-source</category><category>apache spark</category><category>delta lake</category><category>distributed-systems</category><category>beginner</category><category>scenarios</category></item><item><title>Delta Lake essential Fundamentals: Part 3 - compaction and checkpoint</title><link>https://adipolak.github.io/adipolak-blog/post/delta-lake-essential-fundamentals---part-3/</link><pubDate>Mon, 15 Feb 2021 00:00:00 +0000</pubDate><author>Adi Polak</author><guid>https://adipolak.github.io/adipolak-blog/post/delta-lake-essential-fundamentals---part-3/</guid><description>Multi-part series that will take you from beginner to expert in Delta Lake</description><content:encoded><![CDATA[<p>Let&rsquo;s understand what are Delta Lake compact and checkpoint and why they are important.</p>
<h2 id="checkpoint">Checkpoint</h2>
<p>There are two known checkpoint mechanisms in Apache Spark that can be confused with the DeltaLake checkpoint, so let&rsquo;s understand them and how they differ from each other:</p>
<h3 id="spark-rdd-checkpoint">Spark RDD Checkpoint</h3>
<p>A checkpoint in Spark RDD is a mechanism to persist the current RDD to a file in a dedicated checkpoint directory, while all references to its parent RDDs are removed.
This operation, by default, breaks data lineage when used without auditing.</p>
<h3 id="structured-streaming-checkpoint">Structured Streaming Checkpoint</h3>
<p>Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. Queries are processed by a micro-batch processing engine as a series of small batch jobs. Structured Streaming enables exactly-once fault-tolerance guarantees through checkpointing and write-ahead logs. The streaming engine records the offset range of the data being processed in each trigger; hence, if a trigger fails, we have the exact range of processed data and can recover from it.</p>
<h3 id="deltalake-checkpoint">DeltaLake checkpoint</h3>
<p>On each Delta table state computation, Delta reads the JSON files discussed in <a href="/post/delta-lake-essential-fundamentals-the-deltalog/">Delta Lake essential Fundamentals: Part 2 - The DeltaLog</a>. To avoid reading all the files and executing a long computation, every 10 commits the JSON files are aggregated into a <em>checkpoint</em> file of type Parquet. These checkpoint files save the entire state of the table at a point in time, allowing the Spark engine to avoid reprocessing thousands of tiny JSON files. This mechanism ensures that to compute the table state, Spark only needs to read the latest Parquet checkpoint file plus up to 10 JSON files, which makes the computation faster and more efficient.
Check out the visualization of the Delta checkpoint file from the Databricks site: <br>
<img class="responsive" src="../../images/Detla/checkpointfile.png" alt="drawing"></p>
<p>Checkpoint files can be a single file for a specific table version or multiple files, depending on their contents.</p>
<p>For a single-part checkpoint of table version (<code>n</code>) 10, the file name has the structure <code>n.checkpoint.parquet</code>:</p>
<pre tabindex="0"><code>00000000000000000010.checkpoint.parquet
</code></pre><p>For a multi-part checkpoint of table version (<code>n</code>) 10, each file name encodes fragment <code>o</code> of <code>p</code>: <code>n.checkpoint.o.p.parquet</code>:</p>
<pre tabindex="0"><code>00000000000000000010.checkpoint.0000000001.0000000003.parquet
00000000000000000010.checkpoint.0000000002.0000000003.parquet
00000000000000000010.checkpoint.0000000003.0000000003.parquet
</code></pre><p>Here is a snapshot of the function in charge of writing the checkpoint files; the modulo operation implements the checkpointInterval, which can be updated in DeltaConfig.
<br></p>
<img class="responsive" src="../../images/Detla/delta-lake-postcommit.png" alt="drawing">
<img class="responsive" src="../../images/Detla/deltalake-interval-config.png" alt="drawing">
<p>Delta Lake configuration can be set as a Spark configuration property or a Hadoop configuration, depending on the LogStore, the cloud used, etc.</p>
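<p>As a hedged illustration, the checkpoint interval can typically be tuned either as a session-level default or as a table property; the property names below follow the DeltaConfig shown above, but please verify them against your Delta Lake version:</p>
<pre tabindex="0"><code># assumption: property names may differ between Delta versions - verify before using
# session-level default for newly created tables
spark.conf.set("spark.databricks.delta.properties.defaults.checkpointInterval", "20")

# or per table, via a table property
spark.sql("""
  ALTER TABLE delta.`/delta/events`
  SET TBLPROPERTIES ('delta.checkpointInterval' = '20')
""")
</code></pre>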
<hr>
<p>I hope this provides more clarity into the differences between the three checkpoint mechanisms and their usage.</p>
<p>Next, let&rsquo;s examine the file compaction mechanism Delta recommends as part of its best practices:</p>
<h2 id="delta-lake-compact-files">Delta Lake Compact files</h2>
<p>Just as Delta Lake takes care of the small JSON DeltaLog files it creates, we as developers need to take care of the small files we might introduce into the system when adding data in small batches. Small batches happen when we have streaming workloads or continuously ingest small batches of data without compacting them.</p>
<p>Small files can hurt the efficiency of table reads, and they can also affect the performance of the file system itself. Ideally, a large number of small files should be rewritten into a smaller number of larger files regularly. This is known as compaction.</p>
<p>We can compact a table by repartitioning it to a smaller number of files.</p>
<p>Delta Lake also introduces the ability to set the <code>dataChange</code> field to false; this indicates that the operation did not change the data, only rearranged the data layout. But be careful with it: if you introduce a change that is not layout-only, it can corrupt the data in the table.</p>
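<p>A minimal compaction sketch in PySpark, assuming an existing <code>spark</code> session and a hypothetical Delta table path, could look like this; the target number of files is something you tune for your data size:</p>
<pre tabindex="0"><code># a hedged compaction sketch: rewrite the table into fewer, larger files
path = "/delta/events"   # hypothetical table location
num_files = 16           # target number of files

(spark.read
   .format("delta")
   .load(path)
   .repartition(num_files)
   .write
   .option("dataChange", "false")   # layout-only rewrite; the data itself is unchanged
   .format("delta")
   .mode("overwrite")
   .save(path))
</code></pre>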
<hr>
<p>For exploring and learning about Delta, you are invited to join me by watching the videos. Let me know if that is useful for you, and we can schedule a Twitch session as well.</p>
<h1 id="whats-next">What&rsquo;s next?</h1>
<p>Next, scenarios and use cases for DeltaLake!</p>
<p>As always, I would love to get your comments and feedback on <a href="https://twitter.com/intent/follow?original_referer=http%3A%2F%2Flocalhost%3A1313%2F&amp;ref_src=twsrc%5Etfw&amp;region=follow_link&amp;screen_name=AdiPolak&amp;tw_p=followbutton">Adi Polak</a> 🐦.</p>

    <div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/Aq8bo6OR48A?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>If you would like to get monthly updates, consider <a href="https://sub.adipolak.com/subscribe">subscribing</a>.</p>
]]></content:encoded><category>open-source</category><category>apache spark</category><category>delta lake</category><category>distributed-systems</category><category>beginner</category><category>deltalog</category></item><item><title>Delta Lake essential Fundamentals: Part 2 - The DeltaLog</title><link>https://adipolak.github.io/adipolak-blog/post/delta-lake-essential-fundamentals---the-deltalog/</link><pubDate>Thu, 11 Feb 2021 00:00:00 +0000</pubDate><author>Adi Polak</author><guid>https://adipolak.github.io/adipolak-blog/post/delta-lake-essential-fundamentals---the-deltalog/</guid><description>Multi-part series that will take you from beginner to expert in Delta Lake</description><content:encoded><![CDATA[<p>In the previous part, you learned what <a href="/post/delta-lake-essential-fundamentals">ACID transactions</a> are.<br>
In this part, you will understand how the Delta transaction log, named the DeltaLog, achieves ACID.</p>
<h2 id="transaction-log">Transaction Log</h2>
<p>A transaction log is a history of actions executed by a (TaDa 💡) database management system with the goal to guarantee <a href="/post/delta-lake-essential-fundamentals/">ACID properties</a> over a crash.</p>
<h2 id="deltalake-transaction-log---detlalog">DeltaLake transaction log - DetlaLog</h2>
<p>DeltaLog is a transaction log directory that holds an <strong>ordered</strong> record of every transaction committed on a Delta Lake table since it was created.
The goal of the DeltaLog is to be the <strong>single</strong> source of truth for readers who read from the same table at the same time. That means parallel readers read the <strong>exact</strong> same data.
This is achieved by tracking all the changes that users make (read, delete, update, etc.) in the DeltaLog.</p>
<p>DeltaLog can also contain statistics on the data; depending on the type of the data/field/column, each column can have min/max values. Having this extra metadata can help with faster querying. The DeltaTable read mechanism uses a simplified <a href="https://medium.com/microsoftazure/data-at-scale-learn-how-predicate-pushdown-will-save-you-money-7063b80878d7">predicate pushdown</a>.</p>
<p>Here is a simplification of DeltaLog on the file systems from Databricks site: <br>
<img class="responsive" src="../../images/Detla/deltalake-deltalog.png" alt="drawing"></p>
<p>The DeltaLog itself is a folder that consists of multiple JSON files. When it reaches 10 files, the DeltaTable performs checkpoint and compaction operations (we will dive into these in the next chapter).</p>
<p>Here is an example of a DeltaLog JSON file from the source code test resources; each entry in the file is one JSON object:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{<span style="color:#f92672">&#34;remove&#34;</span>:{<span style="color:#f92672">&#34;path&#34;</span>:<span style="color:#e6db74">&#34;part-00001-f1cb1cf9-7a73-439c-b0ea-dcba5c2280a6-c000.snappy.parquet&#34;</span>,<span style="color:#f92672">&#34;dataChange&#34;</span>:<span style="color:#66d9ef">true</span>}}
</span></span><span style="display:flex;"><span>{<span style="color:#f92672">&#34;remove&#34;</span>:{<span style="color:#f92672">&#34;path&#34;</span>:<span style="color:#e6db74">&#34;part-00000-f4aeebd0-a689-4e1b-bc7a-bbb0ec59dce5-c000.snappy.parquet&#34;</span>,<span style="color:#f92672">&#34;dataChange&#34;</span>:<span style="color:#66d9ef">true</span>}}
</span></span></code></pre></div><p>There is a total of two commits captured in this file:
<em>remove</em> - it can be a delete operation on a whole column or only on specific values in it. In this operation, the metadata field <em>dataChange</em> is set to true.</p>
<p>Here is a more complex JSON file example; each entry in the file is one JSON object:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{<span style="color:#f92672">&#34;metaData&#34;</span>:{<span style="color:#f92672">&#34;id&#34;</span>:<span style="color:#e6db74">&#34;2edf2c02-bb63-44e9-a84c-517fad0db296&#34;</span>,<span style="color:#f92672">&#34;format&#34;</span>:{<span style="color:#f92672">&#34;provider&#34;</span>:<span style="color:#e6db74">&#34;parquet&#34;</span>,<span style="color:#f92672">&#34;options&#34;</span>:{}},<span style="color:#f92672">&#34;schemaString&#34;</span>:<span style="color:#e6db74">&#34;{\&#34;type\&#34;:\&#34;struct\&#34;,\&#34;fields\&#34;:[{\&#34;name\&#34;:\&#34;id\&#34;,\&#34;type\&#34;:\&#34;integer\&#34;,\&#34;nullable\&#34;:true,\&#34;metadata\&#34;:{}},{\&#34;name\&#34;:\&#34;value\&#34;,\&#34;type\&#34;:\&#34;string\&#34;,\&#34;nullable\&#34;:true,\&#34;metadata\&#34;:{}}]}&#34;</span>,<span style="color:#f92672">&#34;partitionColumns&#34;</span>:[<span style="color:#e6db74">&#34;id&#34;</span>],<span style="color:#f92672">&#34;configuration&#34;</span>:{}}}
</span></span><span style="display:flex;"><span>{<span style="color:#f92672">&#34;remove&#34;</span>:{<span style="color:#f92672">&#34;path&#34;</span>:<span style="color:#e6db74">&#34;part-00001-6d252218-2632-416e-9e46-f32316ec314a-c000.snappy.parquet&#34;</span>,<span style="color:#f92672">&#34;dataChange&#34;</span>:<span style="color:#66d9ef">true</span>}}
</span></span><span style="display:flex;"><span>{<span style="color:#f92672">&#34;remove&#34;</span>:{<span style="color:#f92672">&#34;path&#34;</span>:<span style="color:#e6db74">&#34;part-00000-348d7f43-38f6-4778-88c7-45f379471c49-c000.snappy.parquet&#34;</span>,<span style="color:#f92672">&#34;dataChange&#34;</span>:<span style="color:#66d9ef">true</span>}}
</span></span><span style="display:flex;"><span>{<span style="color:#f92672">&#34;add&#34;</span>:{<span style="color:#f92672">&#34;path&#34;</span>:<span style="color:#e6db74">&#34;id=5/part-00000-f1e0b560-ca00-409e-a274-f1ab264bc412.c000.snappy.parquet&#34;</span>,<span style="color:#f92672">&#34;partitionValues&#34;</span>:{<span style="color:#f92672">&#34;id&#34;</span>:<span style="color:#e6db74">&#34;5&#34;</span>},<span style="color:#f92672">&#34;size&#34;</span>:<span style="color:#ae81ff">362</span>,<span style="color:#f92672">&#34;modificationTime&#34;</span>:<span style="color:#ae81ff">1501109076000</span>,<span style="color:#f92672">&#34;dataChange&#34;</span>:<span style="color:#66d9ef">true</span>}}
</span></span><span style="display:flex;"><span>{<span style="color:#f92672">&#34;add&#34;</span>:{<span style="color:#f92672">&#34;path&#34;</span>:<span style="color:#e6db74">&#34;id=6/part-00000-adb59f54-6b8f-4bfd-9915-ae26bd0f0e2c.c000.snappy.parquet&#34;</span>,<span style="color:#f92672">&#34;partitionValues&#34;</span>:{<span style="color:#f92672">&#34;id&#34;</span>:<span style="color:#e6db74">&#34;6&#34;</span>},<span style="color:#f92672">&#34;size&#34;</span>:<span style="color:#ae81ff">362</span>,<span style="color:#f92672">&#34;modificationTime&#34;</span>:<span style="color:#ae81ff">1501109076000</span>,<span style="color:#f92672">&#34;dataChange&#34;</span>:<span style="color:#66d9ef">true</span>}}
</span></span><span style="display:flex;"><span>{<span style="color:#f92672">&#34;add&#34;</span>:{<span style="color:#f92672">&#34;path&#34;</span>:<span style="color:#e6db74">&#34;id=4/part-00001-36c738bf-7836-479b-9cc1-7a4934207856.c000.snappy.parquet&#34;</span>,<span style="color:#f92672">&#34;partitionValues&#34;</span>:{<span style="color:#f92672">&#34;id&#34;</span>:<span style="color:#e6db74">&#34;4&#34;</span>},<span style="color:#f92672">&#34;size&#34;</span>:<span style="color:#ae81ff">362</span>,<span style="color:#f92672">&#34;modificationTime&#34;</span>:<span style="color:#ae81ff">1501109076000</span>,<span style="color:#f92672">&#34;dataChange&#34;</span>:<span style="color:#66d9ef">true</span>}}
</span></span></code></pre></div><p>In this example, there is the <em>metaData</em> object entry - it represents a change in the table metadata, either an update to the table schema or the creation of a new table.
Later we see two <em>remove</em> operations, followed by three <em>add</em> operations. These operation objects can have a <em>stat</em> field, which contains statistical information, such as the number of records, minValues, maxValues, and more.</p>
<p>These JSON files might also contain operation objects with fields such as &ldquo;STREAMING UPDATE&rdquo;, &ldquo;NOTEBOOK&rdquo; (if the operation took place from a notebook), isolationLevel, etc.</p>
<p>This information is valuable for managing the table and avoiding redundant full scans of the storage.</p>
<p>To simplify the connection between DeltaTable and DeltaLog, it&rsquo;s easier to think about DeltaTable as a direct result of a set of actions audited by the DeltaLog.</p>
<h2 id="deltalog-and-atomicity">DeltaLog and Atomicity</h2>
<p>From <a href="/post/delta-lake-essential-fundamentals">part one</a>, you already know that atomicity means that a transaction either happened or it did not. The DeltaLog itself consists of atomic operations; each line in the log (like the ones you saw above) represents an action, which is an atomic unit; these are called commits.
The transactions that took place on the data can be broken into multiple components, each of which individually represents a commit in the DeltaLog. Breaking complex operations into small transactions helps ensure atomicity.</p>
<h2 id="deltalog-and-isolation">DeltaLog and Isolation</h2>
<p>Operations such as update, delete, and add can harm isolation; hence, since we want to guarantee isolation with a DeltaTable, readers only get access to the table snapshot. This guarantees all parallel readers read the exact same data. For handling deletions, Delta postpones the actual delete operation on the files; it first tags the files as deleted and later removes them when it is considered safe (similar to Cassandra and Elasticsearch delete operations with a tombstone).</p>
<p>In the DeltaLake 0.8.1 source code, there is a comment saying that it&rsquo;s recommended to set the delete retention to at least 2 weeks, or longer than the duration of a job. <br>
<em>Note:</em> This will impact streaming workloads as well, because the actual files will need to be deleted at some point, which might result in blocking the stream.
<img class="responsive" src="../../images/Detla/delta-tombston-retention.png" alt="drawing"></p>
<h2 id="deltalog-and-consistency">DeltaLog and Consistency</h2>
<p>Delta Lake solves the problem of consistency by resolving conflicts with an optimistic concurrency algorithm.
The class in charge of this algorithm is the OptimisticTransaction class. It achieves this by using a <a href="https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/ReentrantLock.html">Java 8 ReentrantLock</a> that is controlled from a DeltaLog instance. <br>
Here is the code snippet: <br></p>
<img class="responsive" src="../../images/Detla/delta-log-optimistic-concurrency-algo.png" alt="drawing">
<p>A DeltaTable instance actively uses the ReentrantLock in the OptimisticTransaction under the <code>doCommitRetryIteratively</code> function.
The optimistic approach was chosen here because in the big data world there is a tendency to add more data than to update existing records.
It&rsquo;s rare to find and update a specific record; it is usually done when there was some data corruption in necessary data.</p>
<p>Here is the code snippet for the optimistic algorithm:
<img class="responsive" src="../../images/Detla/delta-log-OptimisticTransaction.png" alt="drawing"></p>
<p>Notice that in line 572, the program records the attempted version as the <code>commitVersion</code> instance, which is of type <code>var</code>.
<code>var</code> in Scala represents a mutable variable, which means we should expect its value to change.</p>
<p>In line 575, we start the algorithm:
it starts the <code>while(true)</code> loop and maintains an <code>attemptNumber</code> counter; if it&rsquo;s <code>==0</code>, it will try to commit. If it fails here, that means a file with this <code>commitVersion</code> was already written/committed into the table, and it will throw an exception. That exception is caught in lines 592+593. From there, with each failure, the algorithm increases the attemptNumber by 1.
After the first failure, the program won&rsquo;t go into the first if statement on line 577; it will go straight into the <code>else if</code> on line 579.
If the program reaches a state where <code>attemptNumber</code> is bigger than the maximum allowed/configured, it will throw a <code>DeltaErrors.maxCommitRetriesExceededException</code>.
The maxCommitRetriesExceededException provides information about the commit version, the first commit version attempt, the number of attempted commits, and the total time spent attempting this commit in ms.
Otherwise, it will try to record this update with the checkForConflict functionality in line 588.
Multiple scenarios can bring us to this state.</p>
<p>High-level pseudo-code:</p>
<pre tabindex="0"><code>while(tryCommit)
    if first attempt:
        do commit
    else if attempt number &gt; max retries:
        throw an exception - exit loop
    else:
        record retry operation
        try fixing logical conflicts - return valid commit version or throw an exception
        do commit
    retry on exceptions and attempt version +1
    if no exception - end loop
end     
</code></pre><p>To support the users, DeltaLake introduces a set of conflict exceptions that provide more information about the data and the conflicts:</p>
<img class="responsive" src="../../images/Detla/delte-concurrent-exceptions.png" alt="drawing">
<p>Let&rsquo;s look at some of the conflict scenarios.</p>
<h3 id="two-writers">Two Writers:</h3>
<p>This is the case of two writers who append data to the same table simultaneously, without reading anything. In this scenario, one writer will commit, and the second writer will read the first one&rsquo;s updates before adding its own. If it was only an append operation, like a counter both are incrementing, there is no need to redo all computations, and it will automatically commit; if that&rsquo;s not the case, writer number two will need to redo the computation given the new information from writer one.</p>
<h3 id="delete-and-read">Delete and Read:</h3>
<p>In a more complex scenario like this one, there is no automated solution. For concurrent delete-read, there is a dedicated <code>ConcurrentDeleteReadException</code>.
That means that if there is a request to delete a file that is being used for a read at the same time, the program throws an exception.</p>
<img class="responsive" src="../../images/Detla/ConcurrentDeleteReadException.png" alt="drawing">
<h3 id="delete-and-delete">Delete and Delete:</h3>
<p>When two operations delete the same file, which might happen due to a compaction mechanism or another operation, here too an exception will occur.</p>
<h2 id="deltalog-and-durability">DeltaLog and Durability</h2>
<p>Since all transactions made on a DeltaTable are stored directly on the disk/file system, durability is a given. All commits are <em>persisted</em> to disk. In case of a system failure, they can be restored from the disk.
(Unless there is a true disaster, like a fire, that damages the actual disks holding the information.)</p>
<hr>
<p>For exploring and learning about Delta, I did a deep dive into the source code itself. If you are interested in joining me, I captured it in videos; let me know if that is useful for you.</p>
<h1 id="whats-next">What&rsquo;s next?</h1>
<p>Next, we will see more examples, scenarios, and use cases for DeltaLake! We will learn about the compaction mechanism, schema enforcement, and how it can enforce exactly-once semantics.</p>
<p>As always, I would love to get your comments and feedback on <a href="https://twitter.com/intent/follow?original_referer=http%3A%2F%2Flocalhost%3A1313%2F&amp;ref_src=twsrc%5Etfw&amp;region=follow_link&amp;screen_name=AdiPolak&amp;tw_p=followbutton">Adi Polak</a> 🐦.</p>

    <div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/i24ZA6mmvDI?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>If you would like to get monthly updates, consider <a href="https://sub.adipolak.com/subscribe">subscribing</a>.</p>
]]></content:encoded><category>open-source</category><category>apache spark</category><category>delta lake</category><category>distributed-systems</category><category>beginner</category><category>deltalog</category></item><item><title>Delta Lake essential Fundamentals: Part 1 - ACID</title><link>https://adipolak.github.io/adipolak-blog/post/delta-lake-essential-fundamentals/</link><pubDate>Thu, 04 Feb 2021 00:00:00 +0000</pubDate><author>Adi Polak</author><guid>https://adipolak.github.io/adipolak-blog/post/delta-lake-essential-fundamentals/</guid><description>Multi-part series that will take you from beginner to expert in Delta Lake</description><content:encoded><![CDATA[<p>🎉 Welcome to the first part of Delta Lake essential fundamentals! 🎉</p>
<h2 id="what-is-delta-lake-">What is Delta Lake ?</h2>
<blockquote>
<p>Delta Lake is an open-source storage layer that brings ACID
transactions to Apache Spark™ and big data workloads. </p>
</blockquote>
<p>DeltaLake open source consists of 3 projects:</p>
<ol>
<li><a href="https://github.com/delta-io/delta">delta</a> - Delta Lake core, written in Scala.</li>
<li><a href="https://github.com/delta-io/delta-rs">delta-rs</a> - a native Rust library with Python and Ruby bindings.</li>
<li><a href="https://github.com/delta-io/connectors">connectors</a> - Connectors to popular big data engines outside Spark, written mostly in Scala.</li>
</ol>
<p>Delta provides us with the ability to <u>&ldquo;travel back in time&rdquo;</u> into previous versions of our data, and <u>scalable metadata</u> - meaning that if we have a large set of raw data stored in a data lake, the metadata provides us with the flexibility needed for analytics and exploration of the data. It also provides a mechanism to <u>unify streaming and batch data</u>.<br>
<u>Schema enforcement</u> - handles schema variations to prevent the insertion of bad/non-compliant records, and <u>ACID transactions</u> ensure that the users/readers never see inconsistent data.</p>
<highlight>
<p>It's important to remember that Delta Lake is not a DataBase (DB), yes, just like Apache Kafka is not a DB.<br>
It might 'feel' like one due to the support of ACID transactions, schema enforcements, etc.<br>
But it's not.</p>
</highlight>
<p>Part 1 focuses on ACID Fundamentals:</p>
<h2 id="acid-fundamentals-in-delta-lake">ACID Fundamentals in Delta Lake:</h2>
<p>Let&rsquo;s break it down to understand what each means and how it translates in Delta:</p>
<h4 id="atomicity">Atomicity</h4>
<p>A transaction either succeeds or it doesn&rsquo;t: all changes, updates, deletes, and other operations either happen as a single unit or not at all. Think binary, there is only yes or no - 1 or 0. In Delta, it means that a commit of a transaction happened and a new transaction log file was written. Transaction log file name example - <code>000001.json</code>, where the number represents the commit number.</p>
<h4 id="consistency">Consistency</h4>
<p>A transaction can only bring the DB from one state to another; data is valid according to all the rules, constraints, triggers, etc. The transaction itself can be consistent but incorrect. To achieve consistency, DeltaLake relies on the commit timestamp that comes from the storage system&rsquo;s modification timestamps. If you are using cloud provider storage such as <a href="https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction?WT.mc_id=delta-13569-adpolak">Azure blob</a> or AWS S3, the timestamp will come from the storage server.</p>
<h4 id="isolation">Isolation</h4>
<p>Transactions taking place concurrently result in the same state as if the transactions had been executed sequentially. This is the primary goal of concurrency control strategies. In Delta, after 10 commits, there is a merging mechanism that merges these commits into a checkpoint file. The checkpoint file has a timestamp; 1 second is added to the modification timestamp to avoid flakiness. This is how it looks in the Delta code base:
<img class="responsive" src="../../images/delta-lake-avoid-flakiness-commit.png" alt="drawing"></p>
<h4 id="durability">Durability</h4>
<p>Once a transaction has been committed, it will remain committed even if the system fails. Think about writing to disk vs. writing to RAM. A machine can fail, but if the commit data was written to disk, it can be restored. Delta writes all the commits in JSON files directly to the storage; data is not left floating in RAM for too long.</p>
<h2 id="whats-next">What&rsquo;s next?</h2>
<p>After understanding ACID basics and a bit about the transaction log (aka DeltaLog), you are ready for the next chapter! <br> There, we dive deeper into the DeltaLog, how it looks on disk, and the open-source code you need to be familiar with.</p>
<hr>
<h2 id="as-always-i-would-love-to-get-your-comments-and-feedback-on-adi-polakhttpstwittercomintentfolloworiginal_refererhttp3a2f2flocalhost3a13132fref_srctwsrc5etfwregionfollow_linkscreen_nameadipolaktw_pfollowbutton-">As always, I would love to get your comments and feedback on <a href="https://twitter.com/intent/follow?original_referer=http%3A%2F%2Flocalhost%3A1313%2F&amp;ref_src=twsrc%5Etfw&amp;region=follow_link&amp;screen_name=AdiPolak&amp;tw_p=followbutton">Adi Polak</a> 🐦.</h2>
<p>If you would like to get monthly updates, consider <a href="https://sub.adipolak.com/subscribe">subscribing</a>.</p>
]]></content:encoded><category>open-source</category><category>apache spark</category><category>delta lake</category><category>distributed-systems</category><category>beginner</category><category>acid</category></item><item><title>Apache Spark Ecosystem, Jan 2021 Highlights</title><link>https://adipolak.github.io/adipolak-blog/post/apache-spark-ecosystem/</link><pubDate>Tue, 12 Jan 2021 00:00:00 +0000</pubDate><author>Adi Polak</author><guid>https://adipolak.github.io/adipolak-blog/post/apache-spark-ecosystem/</guid><description>The ever growing Open Source Ecosystem</description><content:encoded><![CDATA[<p>If you&rsquo;ve been reading here for a while, you know that I&rsquo;m a big fan of Apache Spark and have been using it for more than 8 years.<br>
Apache Spark is continually growing. It started as part of the Hadoop family,<br>
but with <a href="https://medium.com/@acmurthy/hadoop-is-dead-long-live-hadoop-f22069b264ac">the slow death of Hadoop</a> and the fast growth of Kubernetes, many new tools, connectors, and open-source projects have emerged.</p>
<p>Let&rsquo;s take a look at three exciting open-source projects:</p>
<h2 id="ray"><strong>Ray:</strong></h2>
<img class="responsive" src="https://github.com/ray-project/ray/raw/master/doc/source/images/ray_header_logo.png" alt="drawing">
<p>Ray is an open-source, Python-based framework for building distributed applications.
Its main audience is ML developers and data scientists who would like to accelerate their machine learning workloads using distributed computing.
Ray was open-sourced by UC Berkeley&rsquo;s <a href="https://rise.cs.berkeley.edu/">RISELab</a>, the successor of the <a href="https://amplab.cs.berkeley.edu/">AMPLab</a>, where Apache Spark was created.
BTW, if you are curious, its next big five-year project is all about <strong>Real-time Intelligence with Secure Explainable decisions</strong>.
<br></br></p>
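<p>To give a feel for the programming model, here is a minimal Ray sketch; the function itself is just an illustration, and the same code runs locally or against a cluster:</p>
<pre tabindex="0"><code># a minimal Ray sketch: run tasks in parallel across cores or cluster nodes
import ray

ray.init()  # starts a local Ray instance, or connects to an existing cluster

@ray.remote
def square(x):
    return x * x

# schedule the tasks in parallel and collect the results
futures = [square.remote(i) for i in range(10)]
print(ray.get(futures))  # [0, 1, 4, ..., 81]
</code></pre>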
<p><span style="background-color: #FFFF00"> RayOnSpark </span> is a feature that was recently added to <a href="https://github.com/intel-analytics/analytics-zoo">Analytics Zoo</a>, an end-to-end data analytics + AI open-source platform that helps you unify multiple analytics workloads like recommendation, time series, computer vision, NLP, and more into one platform running on Spark, YARN, or K8s.
<br></br></p>
<p><span style="background-color: #DCDCDC"> &ldquo;RayOnSpark allows users to directly run Ray programs on Apache Hadoop*/YARN, so that users can easily try various emerging AI applications on their existing Big Data clusters in a distributed fashion. Instead of running big data applications and AI applications on two separate systems, which often introduces expensive data transfer and long end-to-end learning latency, RayOnSpark allows Ray applications to seamlessly integrate into Apache Spark* data processing pipeline and directly run on in-memory Spark RDDs or DataFrames.&rdquo; Jason Dai. </span></p>
<img class="responsive" src="https://miro.medium.com/max/728/1*Jv085PlSKouE9RRuvFNlDQ.png" alt="drawing">
<p>To learn more about Ray and RayOnSpark, checkout <a href="https://medium.com/riselab/rayonspark-running-emerging-ai-applications-on-big-data-clusters-with-ray-and-analytics-zoo-923e0136ed6a">Jason Dai article from RISELab publication</a>.</p>
<hr>
<p><br></br>
<br></br></p>
<h2 id="koalas"><strong>Koalas:</strong></h2>
<img style="width:auto;max-width:350px; height: auto;" src="https://raw.githubusercontent.com/databricks/koalas/master/icons/koalas-logo.png" alt="drawing">
<br></br>
<span style="background-color: #FFFF00"> Koalas </span> is pandas&rsquo; scalable sibling:
<p>From the <a href="https://pandas.pydata.org/docs/">Pandas</a> docs: <em>&ldquo;pandas is an open source, BSD-licensed library providing high-performance,
easy-to-use data structures and data analysis tools for the Python programming language.&rdquo;</em></p>
<p>From the <a href="https://koalas.readthedocs.io/en/latest/">Koalas</a> docs: <em>&ldquo;The Koalas project makes data scientists more productive when interacting with big data,
by implementing the pandas DataFrame API on top of Apache Spark.&rdquo;</em></p>
<p>If you are familiar with exploring and running analytics on data with <em>pandas</em>,<br>
<em>Koalas</em> provides a similar API for running the same analytics on Apache Spark DataFrames,<br>
which makes it easier for pandas users to run their workloads at scale.<br>
When using it, pay attention to the Koalas version: many newer versions are NOT available with Spark 2.4 and require a Spark 3.0 cluster.</p>
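<p>A minimal, hedged taste of the pandas-like API, assuming a Spark 3.0 environment with the <code>koalas</code> package installed (the data here is only a toy example):</p>
<pre tabindex="0"><code># a minimal Koalas sketch: pandas-style syntax, executed on Spark under the hood
import databricks.koalas as ks

kdf = ks.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

print(kdf["value"].mean())      # pandas-style column math, computed by Spark
print(kdf.groupby("id").sum())  # familiar groupby API, distributed under the hood
</code></pre>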
<p>Koalas is built with an internal frame that holds indexes and information on top of a Spark DataFrame.</p>
<img style="width:auto;max-width:650px; height: auto;" src="https://i.ytimg.com/vi/NpAMbzerAp0/maxresdefault.jpg" alt="drawing">
<br></br>
<p>To learn more about it, check out <a href="https://databricks.com/session_eu19/koalas-pandas-on-apache-spark">Tim Hunter&rsquo;s talk on Koalas</a> from Spark Summit 2019.</p>
<hr>
<p><br></br>
<br></br></p>
<h2 id="delta-lake"><strong>Delta Lake:</strong></h2>
<img style="width:auto;max-width:350px; height: auto;" src="https://camo.githubusercontent.com/5535944a613e60c9be4d3a96e3d9bd34e5aba5cddc1aa6c6153123a958698289/68747470733a2f2f646f63732e64656c74612e696f2f6c61746573742f5f7374617469632f64656c74612d6c616b652d77686974652e706e67" alt="drawing">
<p><a href="https://delta.io/">Delta Lake</a> is nothing new in the Spark ecosystem, but many still confuse Delta Lake with a &hellip; DataBase! (DB) Well.. Delta Lake is NOT a database.
Delta Lake is an open-source storage layer that brings ACID (atomicity, consistency,
isolation, and durability) transactions to Apache Spark and big data workloads, but it is not a DB! Just like <a href="https://docs.microsoft.com/en-us/learn/paths/store-data-in-azure/?WT.mc_id=blog-00000-adpolak">Azure Blob storage</a> and <a href="https://aws.amazon.com/s3/">AWS S3</a> do not act as databases, they are defined as storage.<br>
Delta helps with ACID, which is hard to achieve and a great pain point with distributed storage.
It provides scalable metadata handling on the data itself.
When combined with Spark, this is highly useful because the Spark SQL engine&rsquo;s
Catalyst optimizer uses this metadata to better plan and execute big data queries.</p>
<p>There is also data versioning through snapshots of the storage, named the Time Travel feature.
I recommend being mindful when using this feature, as saving snapshots and later using them might add overhead to the size and compute of your data.</p>
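<p>A minimal time-travel sketch in PySpark, with a hypothetical table path and an illustrative version and timestamp (and assuming an existing <code>spark</code> session):</p>
<pre tabindex="0"><code># a hedged time-travel sketch: read older versions of a Delta table
path = "/delta/events"  # hypothetical table location

df_v5 = (spark.read
           .format("delta")
           .option("versionAsOf", 5)               # read the table as of commit version 5
           .load(path))

df_jan = (spark.read
            .format("delta")
            .option("timestampAsOf", "2021-01-01")  # or as of a timestamp
            .load(path))
</code></pre>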
<p>If you are curious to learn more about it, read <a href="https://databricks.com/blog/2020/06/18/time-traveling-with-delta-lake-a-retrospective-of-the-last-year.html">here</a>.</p>
<hr>
<h2 id="thats-it">That&rsquo;s it.</h2>
<p>I hope you enjoyed reading this short recap of open source projects for January 2021.<br>
If you are interested in learning more and getting updates, follow <a href="https://twitter.com/AdiPolak">Adi Polak on Twitter</a>.</p>
]]></content:encoded><category>open-source</category><category>apache spark</category><category>koalas</category><category>pandas</category><category>delta lake</category><category>distributed-systems</category><category>ray</category><category>ray on spark</category><category>analytic zoo</category></item><item><title>Kubernetes and Virtual Kubelet in a nutshell</title><link>https://adipolak.github.io/adipolak-blog/post/kubernetes-and-virtual-kubelet-in-a-nutshell/</link><pubDate>Sun, 10 Jan 2021 00:00:00 +0000</pubDate><author>Adi Polak</author><guid>https://adipolak.github.io/adipolak-blog/post/kubernetes-and-virtual-kubelet-in-a-nutshell/</guid><description>Step by step tutorial on how to scale web app using the right infrastructure such as Kubernetes and virtual kubelet</description><content:encoded><![CDATA[<img class="responsive" src="https://images.unsplash.com/photo-1550587381-a9ec95bbe09e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1950&q=80;" alt="drawing">
<p>Today, you will learn how to take a web app (it can be in any programming language;<br>
we used Java &amp; Kotlin) and distribute it using Kubernetes (K8s) and Virtual Kubelet (VK).</p>
<p>Well, if you don&rsquo;t know yet why you should consider distributing your web app - read my post <a href="https://dev.to/azure/why-should-i-distribute-my-web-app-1kk8">here</a>.</p>
<p><span style="background-color: #FFFF00"><strong>So, you are probably asking yourself</strong></span><br>
<span style="background-color: #FFFF00"><strong>&ldquo;what is Kubernetes and what can I use it for?&rdquo;</strong></span><br>
<span style="background-color: #FFFF00"><strong>Just keep reading</strong></span></p>
<p>Kubernetes is an open-source container-orchestration system for automating application deployment, scaling, and management.
It is used to build distributed, scalable microservices.</p>
<p>It brings many new concepts and terminology we need to familiarize ourselves with; these are the very basics:</p>
<h2 id="basic-glossary">Basic Glossary:</h2>
<p><span style="background-color: #FFFF00"><strong>Node</strong></span> - Hardware component. Often a VM hosted on a cloud that provides CPU and RAM resources to be used by the Kubernetes cluster.<br>
<span style="background-color: #FFFF00"><strong>Kubernetes Master</strong></span> - A node or nodes in charge of managing the Kubernetes cluster state.<br>
<span style="background-color: #FFFF00"><strong>Kubelet</strong></span> - The primary &ldquo;node agent&rdquo; that runs on each node. It manages the containers that Kubernetes created on the node it is responsible for.<br>
It communicates with the K8s master.<br>
<span style="background-color: #FFFF00"><strong>Pod</strong></span> - Holds one or more containers. Containers that share the same pod also share resources and network.<br>
A pod&rsquo;s containers are scheduled together onto the same node - a physical machine or virtual machine (VM).<br>
It serves as the unit of deployment, horizontal scaling, and replication.<br>
<span style="background-color: #FFFF00"><strong>PodSpec</strong> </span> - A YAML or JSON file that describes the pod. The kubelet uses it to make sure that the containers are healthy and running according to expectations.<br>
<span style="background-color: #FFFF00"><strong>Cluster</strong> </span> - A series of nodes connected together.<br>
<span style="background-color: #FFFF00"><strong>Kubernetes API</strong></span> - A (REST) server that runs on the master node and speaks directly with the kubelets running on the nodes.<br>
There are many more concepts and terminology, but these are the basics we need in order to understand Virtual Kubelet and to start using K8s.</p>
<hr>
<p>In the chart from <a href="https://kubernetes.io/blog/2018/07/18/11-ways-not-to-get-hacked/">Kubernetes.io</a> we can see the nodes and master:</p>
<p><strong>Hey, where are the pods?</strong>
Well, the pods can be part of the Deployment or the ReplicaSet.<br>
The ReplicaSet/Deployment defines the replicas that are distributed among multiple nodes.<br>
<br>
Here is another chart that shows how the pods work, from <a href="https://thenewstack.io/kubernetes-deployments-work/">The New Stack</a> website:</p>
<img class="responsive"  src="https://storage.googleapis.com/cdn.thenewstack.io/media/2017/11/07751442-deployment.png" alt="drawing">
<p>Another diagram shows how a ReplicaSet works with a Deployment,<br>
where the Deployment can be viewed as a template for the ReplicaSet, with a default of 3 replicas.<br>
Diagram from the Nirmata <a href="https://www.nirmata.com/2018/03/03/kubernetes-for-developers-part-2-replica-sets-and-deployments/">site</a>:</p>
<img class="responsive"  src="https://www.nirmata.com/wp-content/uploads/2018/03/Deployment.png" alt="drawing">
<h3 id="how-kubernetes-works">How Kubernetes works?</h3>
<p>Kubernetes manages N nodes, and on each of those nodes there is a kubelet.<br>
Kubelets manage everything related to the node and the pods running on it. A pod is just a collection of containers.</p>
<p>We take an app, put it in a container, upload it to a container <a href="https://azure.microsoft.com/en-in/services/container-registry/?WT.mc_id=devto-blog-adpolak">registry</a>, and deploy it into Kubernetes.<br>
It is then deployed onto a VM somewhere that is managed by the Kubernetes cluster, in our case <a href="https://docs.microsoft.com/en-us/azure/aks?WT.mc_id=devto-blog-adpolak">Azure Kubernetes Service (AKS)</a>.<br>
We can see that VM and track it from the CLI and the UI - at that point,<br>
there is no per-second or pay-as-you-go billing, since it is the classic scenario of a<br>
managed K8s service where we pay for the provisioned machines even if we end up not using them.</p>
<h2 id="what-about-virtual-kubelet-vk">What about Virtual Kubelet (VK)?</h2>
<p>With Virtual Kubelet we don&rsquo;t see the actual nodes, only one virtual node for each service used.<br>
It acts as an abstraction for us and can spin up as many pods as needed.<br>
Behind the scenes, there can be multiple VMs, but we will see only one node for the specific service that we are using.<br>
We are not exposed to the VMs running in the managed service that <br>
we are using from the Virtual Kubelet.<br>
<strong>Virtual Kubelet acts as a stand-in that helps us proxy to other managed services</strong> at a higher level of abstraction.</p>
<p>Virtual Kubelet is an open-source implementation of the Kubernetes kubelet<br>
whose purpose is connecting Kubernetes to other APIs.<br>
It registers itself as a node and allows us to deploy virtually unlimited numbers of pods and containers.<br>
It gives us the ability to connect with serverless container platforms as well,<br>
meaning we can take any stateless app, containerize it, and provision it through<br>
the pods, and the Virtual Kubelet will manage it for us and shift it to the<br>
managed service. We don&rsquo;t need to manage the infrastructure.<br>
It can scale up or down - all managed by the service.<br>
Depending on the managed service in use, we can benefit from <a href="https://azure.microsoft.com/en-in/offers/ms-azr-0003p?WT.mc_id=devto-blog-adpolak">Pay-as-you-Go accounts</a>, flexible auto-scaling, and more.</p>
<hr>
<p>When combining AKS with Azure Container Instances (ACI), you benefit from fast orchestration of containers.<br>
We combine the two using <em>virtual nodes</em>, which results in automated container scheduling.<br>
Scheduling, in the container context, refers to the ability of the administrator to load a<br>
service onto a host system that defines how to run a specific container.<br>
Using ACI with <em>virtual nodes</em> results in faster provisioning of pods.</p>
<p><a href="https://docs.microsoft.com/en-us/azure/aks/virtual-nodes-cli?WT.mc_id=devto-blog-adpolak">Virtual nodes</a> can be used with AKS and are powered by the open-source Virtual Kubelet.</p>
<img class="responsive"  src="https://github.com/virtual-kubelet/virtual-kubelet/raw/master/website/static/img/diagram.svg?sanitize=true" alt="drawing">
<h3 id="pros"><strong>Pros:</strong></h3>
<p><strong>✅ Fully managed solution on top of Kubernetes</strong>
Allows us to connect to many managed solutions from various cloud providers in various regions.</p>
<p><strong>✅ Pay exactly for what you use</strong>
Managed solutions like <a href="https://azure.microsoft.com/en-us/services/container-instances?WT.mc_id=devto-blog-adpolak">ACI</a> or <a href="https://aws.amazon.com/fargate/">AWS Fargate</a> help us<br>
scale up or down according to our needs without intervention from our side.</p>
<p><strong>✅ Portability</strong>
Wherever K8s runs, you can run your Virtual Kubelet and connect it with your managed service.</p>
<p><strong>✅ Regions and other clusters</strong>
From Virtual Kubelet you can leverage services that run on other regions and even other cloud providers.</p>
<h3 id="cons"><strong>Cons:</strong></h3>
<p><strong>❗️ Security</strong>
In general, you should always think about security.<br>
Remember, security is everyone&rsquo;s job!
The overall security aspect of using Kubernetes is pretty complex to begin with.<br>
When adding Virtual Kubelet, one should be aware of security issues that can<br>
arise from communicating with other services outside of the Kubernetes cluster and outside of the region/cloud provider.<br>
If we decide to work with ACI or other internal services, we can establish an internal virtual<br>
network from the K8s cluster to ACI. This way we can mitigate this security concern.</p>
<hr>
<h2 id="lets-get-practical-with-a-tutorial">Let&rsquo;s get practical with a tutorial</h2>
<p>In the JVM world there are many frameworks that can help us create a web app fast, one that includes both the server and the UI.<br>
Our app uses Spring Boot, which has many embedded features, such as the server and more.<br>
For the server, we can pick from Tomcat, Jetty, or Undertow.</p>
<p>So you are probably asking yourself, how do I get started with Spring Boot?<br>
Go to the Spring Initializr <a href="https://start.spring.io/">site</a> and download a template, or download the demo app from this <a href="https://github.com/adipola/virtual-kubelet-kotlin-spring-demo">GitHub repository</a>.</p>
<p>In this tutorial, we will deploy a Kotlin-Spring app to a virtual node on a K8s cluster.
We will use the following services: AKS, ACR, and ACI.</p>
<img class="responsive" src="../../images/k8s.png" alt="drawing">
<p><br>
<strong>For the tutorial you will need:</strong></p>
<ol>
<li>Demo <a href="https://github.com/adipola/virtual-kubelet-kotlin-spring-demo">app</a></li>
<li>Azure <a href="https://azure.microsoft.com/en-us/free?WT.mc_id=devto-blog-adpolak">free</a> subscription</li>
<li><a href="https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?WT.mc_id=devto-blog-adpolak&amp;view=azure-cli-latest">Azure CLI</a></li>
<li><a href="https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough?WT.mc_id=devto-blog-adpolak">AKS cluster</a></li>
</ol>
<p><span style="background-color: #FFFF00"><strong>At this point we have an AKS cluster</strong></span>, an app to deploy to our cluster, and the CLI tools installed.<br>
For the second phase, we will need an ACI account and a Docker registry to store our app image (we will use Azure Container Registry - <a href="https://azure.microsoft.com/en-in/services/container-registry?WT.mc_id=devto-blog-adpolak">ACR</a>).</p>
<p>Our demo app already comes with a Dockerfile that defines the app image, so we can build and push it to ACR.<br>
Navigate in the terminal or CMD to your app directory and run:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>set ACR_NAME<span style="color:#f92672">={</span>acr name<span style="color:#f92672">}</span>
</span></span><span style="display:flex;"><span>az login
</span></span><span style="display:flex;"><span>az acr login --name $ACR_NAME
</span></span><span style="display:flex;"><span>docker build --no-cache -t demo .
</span></span><span style="display:flex;"><span>docker tag demo $ACR_NAME.azurecr.io/samples/demo
</span></span><span style="display:flex;"><span>docker push $ACR_NAME.azurecr.io/samples/demo
</span></span></code></pre></div><p>This is the push process:</p>
<p><img class="responsive" src="../../images/01-02-push_docker.png" alt="drawing"></p>
<p>To test it yourself, run Docker locally with the remote image:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>docker run -it --rm -p 8080:80 $ACR_NAME.azurecr.io/samples/demo
</span></span></code></pre></div><p>The docker container will start running locally and you will see something like this:
<img src="https://github.com/adipola/my-posts/blob/master/pictures/01-01-spring.png?raw=true" alt="">
You can stop it with <em>Ctrl+C</em>.</p>
<p>Now let&rsquo;s connect to our AKS cluster, for that we will need our resource group name and our AKS cluster name:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>set RES_GROUP<span style="color:#f92672">={</span>resource group name<span style="color:#f92672">}</span>
</span></span><span style="display:flex;"><span>set AKS_NAME<span style="color:#f92672">={</span>AKS name<span style="color:#f92672">}</span>
</span></span><span style="display:flex;"><span>az aks get-credentials --resource-group $RES_GROUP --name $AKS_NAME
</span></span></code></pre></div><p><em>Verify the connection to the cluster</em></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>kubectl get nodes
</span></span></code></pre></div><p>We will get a list of our nodes with their version, status, and more.</p>
<p>Next, we will create the authentication between the container registry (ACR) and AKS.<br>
This is an important step; without it, the AKS cluster will not be able to pull the image from the registry.
We will do it using a Kubernetes secret - follow <a href="https://docs.microsoft.com/bs-latn-ba/azure/container-registry/container-registry-auth-aks#access-with-kubernetes-secret?WT.mc_id=devto-blog-adpolak">this</a> guide.</p>
<p>In that tutorial you will run the following - remember to <strong>take a note</strong> of both values!</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Output used when creating Kubernetes secret.</span>
</span></span><span style="display:flex;"><span>echo <span style="color:#e6db74">&#34;Service principal ID: </span>$CLIENT_ID<span style="color:#e6db74">&#34;</span>
</span></span><span style="display:flex;"><span>echo <span style="color:#e6db74">&#34;Service principal password: </span>$SP_PASSWD<span style="color:#e6db74">&#34;</span>
</span></span></code></pre></div><p>Validate your connection and secret with logging into docker -</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>docker login $ACR_LOGIN_SERVER --username $CLIENT_ID --password $SP_PASSWD
</span></span></code></pre></div><p>If this fails, AKS will not be able to pull the image, and later in the tutorial you will get this error: <code>got HTTP response status code 400 error code “InaccessibleImage”</code>.
Make sure to follow the tutorial in the <a href="https://docs.microsoft.com/bs-latn-ba/azure/container-registry/container-registry-auth-aks#access-with-kubernetes-secret?WT.mc_id=devto-blog-adpolak">link</a> step by step.</p>
<h3 id="install-connector">Install connector:</h3>
<p>To install the connector and enable the use of virtual nodes, we will create a subnet in our network and install an AKS cluster there with the virtual node add-on.
This is a more secure approach, since we create an internal network that is isolated from our bigger K8s cluster.<br>
Follow the step-by-step guide <a href="https://docs.microsoft.com/en-us/azure/aks/virtual-nodes-cli?WT.mc_id=devto-blog-adpolak">here</a>, but don&rsquo;t deploy the app - we will deploy our app instead.</p>
<p>For deploying the app run:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>kubectl apply -f kotlin-spring-virtual-kublet-linux.yaml
</span></span></code></pre></div><p>This YAML file describes to K8s, the pods, and the kubelet how we want our app to run,<br>
and which deployments and services are in use. Each component in our file<br>
starts with <code>apiVersion</code> followed by <code>kind</code>, <code>metadata</code>, and <code>spec</code>. In our file we have one service named <code>azure-spring-kotlin-front-virtual-service</code>
and one deployment named <code>azure-spring-kotlin-front-virtual</code>.
Under the <code>deployment</code>, under <code>spec -&gt; template -&gt; spec</code>, we have the configuration<br>
for the node selector: we might have many nodes in our cluster, and we would like this app<br>
to be deployed to our virtual node and not to the rest.<br>
To achieve this, under <code>nodeSelector</code> we set <code>type</code> to the value <code>virtual-kubelet</code>.<br>
This tells the scheduler and the kubelet that this app should be deployed only on this specific type of node and no other.</p>
<p>Our second component is of <code>kind</code> <code>Service</code>; its spec type is <code>LoadBalancer</code>, and<br>
it will have an <code>External IP</code> for the app so we can load it in our browser.<br>
To do that, we need to expose it first - notice that we are exposing the deployment, not the LoadBalancer itself, since we can expose a deployment:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>kubectl expose deployment azure-spring-kotlin-front-virtual --type<span style="color:#f92672">=</span>LoadBalancer --port <span style="color:#ae81ff">80</span> --target-port <span style="color:#ae81ff">8080</span>
</span></span></code></pre></div><p>To find the <code>External IP</code>, run:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>kubectl get services
</span></span></code></pre></div><p>And look for the <code>External IP</code> in the <strong>azure-spring-kotlin-front-virtual</strong> entry.</p>
<h3 id="how-to-debug">How to debug:</h3>
<p>Use the following commands to debug and get a hold of what is happening in the cluster:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>kubectl get services
</span></span><span style="display:flex;"><span>kubectl get pods
</span></span><span style="display:flex;"><span>kubectl get deployment
</span></span></code></pre></div><p>From the commands above we will get the data and initial statuses of the various components. After figuring out what failed, we can run:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>kubectl describe <span style="color:#f92672">{</span>pod/service/node<span style="color:#f92672">}</span> <span style="color:#f92672">{</span>name of pod/service/node<span style="color:#f92672">}</span>
</span></span></code></pre></div><p>This will give us back a detailed description with information such as events. Under events we will see what failed; it can be, for example, <code>FailedSync</code> with app status <code>Terminated</code>, which usually reflects that the app crashed and we should check the pod logs using</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>kubectl logs <span style="color:#f92672">{</span>name of node<span style="color:#f92672">}</span>
</span></span></code></pre></div><p>There are many more commands for debugging a K8s cluster, and this was just the tip of the iceberg. Feel free to play around and investigate the API.</p>
<p>Have something to add that I forgot to mention? Want to discuss more options? Write in the comments or send a DM on <a href="https://twitter.com/AdiPolak">Twitter</a>.</p>
<h2 id="learn-more-">Learn more 💡</h2>
<p>👉🏼  Watch this <a href="https://azure.microsoft.com/en-us/resources/videos/azure-friday-virtual-kubelet-introduction?WT.mc_id=devto-blog-adpolak">video</a> on Virtual Kubelet by Ria Bhatia and Scott Hanselman</p>
<p>👉🏼 <a href="https://docs.microsoft.com/en-us/azure/dev-spaces/quickstart-java?toc=https%3A%2F%2Fdocs.microsoft.com%2Fen-us%2Fazure%2Faks%2FTOC.json&amp;bc=https%3A%2F%2Fdocs.microsoft.com%2Fen-us%2Fazure%2Fbread%2Ftoc.json&amp;WT.mc_id=devto-blog-adpolak">Quickstart:</a> Develop with Java on Kubernetes using Azure Dev Spaces</p>
<p>👉🏼 Java and <a href="https://azure.microsoft.com/en-us/develop/java/?WT.mc_id=devto-blog-adpolak">Azure</a></p>
<p>👉🏼 Kubernetes and Apache Spark on Azure <a href="https://docs.microsoft.com/en-us/azure/aks/spark-job?WT.mc_id=devto-blog-adpolak">tutorial</a></p>
<p><br></br></p>
<p><em>This article originally appeared in Adi Polak&rsquo;s Dev.to blog <a href="https://dev.to/adipolak/kubernetes-and-virtual-kubelet-in-a-nutshell-gn4">https://dev.to/adipolak/kubernetes-and-virtual-kubelet-in-a-nutshell-gn4</a>.</em></p>
]]></content:encoded><category>beginners</category><category>devops</category><category>tutorial</category><category>kubernetes</category></item><item><title>AI Systems and Applications</title><link>https://adipolak.github.io/adipolak-blog/hidden/ai-placeholder/</link><pubDate>Wed, 01 Jan 2020 00:00:00 +0000</pubDate><author>Adi Polak</author><guid>https://adipolak.github.io/adipolak-blog/hidden/ai-placeholder/</guid><description>Collection of insights on AI systems</description><category>ai</category></item><item><title>Technical Leadership Insights</title><link>https://adipolak.github.io/adipolak-blog/hidden/leadership-placeholder/</link><pubDate>Wed, 01 Jan 2020 00:00:00 +0000</pubDate><author>Adi Polak</author><guid>https://adipolak.github.io/adipolak-blog/hidden/leadership-placeholder/</guid><description>Collection of insights on technical leadership</description><category>leadership</category></item></channel></rss>