scribd · Avicennasis · Jun 29, 2026
diff --git a/_posts/2018-04-18-bandits-for-the-win.md b/_posts/2018-04-18-bandits-for-the-win.md
@@ -17,13 +17,13 @@ We’ll cover how we optimized our home page, transitioned from hard-coded rules
 
 ### Guessing which recommendation rows people like is probably a bad idea
 
-Users have diverse tastes. This bears repeating. Users have diverse tastes. I cannot understate the complexity of user preferences in a system that includes books, audiobooks, magazines and user-generated content. We need a system that automatically optimizes which rows to show various groups of users. And while we had a good idea of how well each row performed before implementing the multi-armed bandit, the row position biased the row type’s performance. Position bias, both horizontal and vertical, makes evaluating a recommendation system’s effectiveness challenging, since it’s one of the biggest determinants of interaction.
+Users have diverse tastes. This bears repeating. Users have diverse tastes. I cannot overstate the complexity of user preferences in a system that includes books, audiobooks, magazines and user-generated content. We need a system that automatically optimizes which rows to show various groups of users. And while we had a good idea of how well each row performed before implementing the multi-armed bandit, the row position biased the row type’s performance. Position bias, both horizontal and vertical, makes evaluating a recommendation system’s effectiveness challenging, since it’s one of the biggest determinants of interaction.
 
 ### There are too many combinations of rows to AB test
 
 The home page has 42 possible row types, which we can display in 10 row positions, resulting in 5*10¹⁵ potential combinations = 42! / (42 -10)! To give you some context, this is more stars than the entire Milky Way galaxy has!
 
-### The diversity of rows are important
+### The diversity of rows is important
 
 If diversity weren’t a factor, one reasonable solution would be to randomly display every row type until there was enough unbiased data to rank the rows. But given that the diversity and order of rows are essential, we needed a more sophisticated method.
 > The scale of the product opportunity
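
Reviewer note on the unchanged context line above that quotes 42! / (42 - 10)! ≈ 5*10¹⁵: the figure can be sanity-checked with a few lines of Python. This is only a sketch for checking the arithmetic; the function name `ordered_layouts` is hypothetical and not part of the post or the Scribd codebase. It assumes each row type appears at most once on the page and that order matters.

```python
import math


def ordered_layouts(row_types: int, positions: int) -> int:
    """Count ordered selections of `positions` distinct rows out of `row_types`.

    This is the partial permutation n! / (n - k)!, the same expression
    quoted in the post (42! / (42 - 10)!).
    """
    if positions > row_types:
        return 0  # cannot fill more positions than there are distinct row types
    return math.factorial(row_types) // math.factorial(row_types - positions)


count = ordered_layouts(42, 10)
print(f"{count:.2e} ordered home page layouts")
```

The result lands between 5×10¹⁵ and 6×10¹⁵, which agrees with the rounded value in the post.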

diff --git a/_posts/2019-08-28-real-time-data-platform.md b/_posts/2019-08-28-real-time-data-platform.md
@@ -12,7 +12,7 @@ team:
 - Core Platform
 ---
 
-> **Editors note:** *This is a cross-post from Tyler's [personal blog](https://brokenco.de/2019/08/28/real-time-data-platform.html)*
+> **Editor's note:** *This is a cross-post from Tyler's [personal blog](https://brokenco.de/2019/08/28/real-time-data-platform.html)*
 
 One of the harder parts about building new platform infrastructure at a company
 which has been around a while is figuring out exactly _where_ to
@@ -132,7 +132,7 @@ to that negative feedback, understand what lies beneath the frustrations.
 Finally, have a vision for the future, but build and deliver incrementally.
 When I first sketched this out, I was forthcoming in stating "this is a 2020
 project." I made sure to clarify that this did not mean we wouldn't deliver anything
-to the business for 18 months. Instead, I made made sure to explain that to
+to the business for 18 months. Instead, I made sure to explain that to
 execute on this overall vision would be a long journey with milestones along
 the way.
 

diff --git a/_posts/2019-12-23-data-eng-in-2020.md b/_posts/2019-12-23-data-eng-in-2020.md
@@ -119,7 +119,7 @@ I often think of the quote by [Charles Babbage](https://en.wikipedia.org/wiki/Ch
 > provoke such a question.
 
 Data quality is a concern that anybody in the Data Engineering space is
-familiar. For Scribd I think "quality" on two axis:
+familiar with. For Scribd I think of "quality" on two axes:
 
 * Integrity: is each record within this set formed the way the customer
   expects it, or in adherence with a predefined schema.
@@ -129,7 +129,7 @@ familiar. For Scribd I think "quality" on two axis:
   other sensitive information which must have extra care added in order to
   safe-guard our readers' privacy.
 
-Unfortunately data quality is an area where I think we need to substantial
+Unfortunately data quality is an area where I think we need to make substantial
 improvements. Data was at one time treated as a by-product of production
 systems. Now it is rightfully recognized as business-critical, and our
 practices must rise to meet the challenge.
@@ -151,4 +151,4 @@ available, insightful, and of high quality. Data by itself tells us
 nothing, but well-managed data pipelines that allow us to identify characteristics
 of text documents, or content which is interesting to read, is incredibly
 valuable to Scribd. Data Engineering helps us understand our data which helps
-Scribd build products which deliver great reads to the world
+Scribd build products which deliver great reads to the world.
diff --git a/_posts/2019-12-30-migrating-kafka-to-aws.md b/_posts/2019-12-30-migrating-kafka-to-aws.md
@@ -28,7 +28,7 @@ existed, stemmed from the operational difficulties of _just_ running the thing.
 It was almost like we were afraid to touch Kafka for fear it might fall over.
 Another part of that avoidance grew out of the functionality not matching
 developers' expectations.  When we first adopted Kafka,
-ours was an on-premise deloyment of version **0.10**. Developers used it for a
+ours was an on-premise deployment of version **0.10**. Developers used it for a
 few projects, unexpected things occasionally happened that were difficult to
 "fix" and we started avoiding it for new projects.
 

diff --git a/_posts/2020-03-02-breaking-up-the-dag-repo.md b/_posts/2020-03-02-breaking-up-the-dag-repo.md
@@ -28,7 +28,7 @@ source daemon I have written to make it possible.
 
 ## Delivering DAGs
 
-Every Airflow component expects the DAGs to present in a local DAG folder,
+Every Airflow component expects the DAGs to be present in a local DAG folder,
 accessed through a filesystem interface. There are 3 common approaches to meet
 this requirement:
 
@@ -116,7 +116,7 @@ For daemon Airflow components like web server and scheduler, we run
 S3 to local filesystem every 5 seconds. This is implemented using the sidecar
 container pattern. The DAG folder is mounted as a shared volume between the
 Airflow web/scheduler container and objinsync container. The sidecar
-objinsync container is setup to run the following command:
+objinsync container is set up to run the following command:
 
 ```
 /bin/objinsync pull s3://<S3_DAG_BUCKET>/airflow_home/dags <YOUR_AIRFLOW_HOME>/dags
@@ -125,7 +125,7 @@ objinsync container is setup to run the following command:
 For other components like task instance pod that runs to completion, we run
 `objinsync`in pull once mode where it only pulls the required DAG from S3 once
 before the Airflow component starts. This is implemented using Airflow K8S
-executor’s builtin git sync container feature. We are effectively replacing git
+executor’s built-in git sync container feature. We are effectively replacing git
 invocation with `objinsync` in this case.
 
 **Environment variables for Airflow scheduler:**

diff --git a/_posts/2020-03-24-introducing-kafka-player.md b/_posts/2020-03-24-introducing-kafka-player.md
@@ -39,7 +39,7 @@ yaml formatting of [JSON schema](https://json-schema.org/) to formalize our
 shared schema definitions, giving us the necessary API contract to enforce
 between producer and consumer.
 
-The snippet below shows a general version of what one of these message schemas look like in yaml.
+The snippet below shows a general version of what one of these message schemas looks like in yaml.
 This comes from one of our real schemas but with all of the interesting fields removed.
 
 ```yaml

diff --git a/_posts/2020-09-15-integrating-databricks-and-datadog.md b/_posts/2020-09-15-integrating-databricks-and-datadog.md
@@ -42,7 +42,7 @@ agent with the following init script on the driver node:
 echo "Running on the driver? $DB_IS_DRIVER"
 
 if [[ $DB_IS_DRIVER = "TRUE" ]]; then
-  echo "Setting up metrics for spark applicatin: ${APP_NAME}"
+  echo "Setting up metrics for spark application: ${APP_NAME}"
   echo "Driver ip: $DB_DRIVER_IP"
 
   cat << EOF >> /home/ubuntu/databricks/spark/conf/metrics.properties
@@ -149,15 +149,15 @@ class Datadog(val appName: String)(implicit spark: SparkSession) extends Seriali
 }
 ```
 
-To initializing the helper class takes two lines of code:
+Initializing the helper class takes two lines of code:
 
 ```scala
 implicit val spark = SparkSession.builder().getOrCreate()
 val datadog = new Datadog(AppName)
 ```
 
 Then you can use `datadog.statsdcli()` to create statsd clients from within
-both **driver** and **executors** to emit custom emtrics:
+both **driver** and **executors** to emit custom metrics:
 
 
 ```scala