From 9da7dd6e1638221446fa0d72ce5a3f099a5da1a5 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?L=C3=A9on=20Avic=20Simmons?=
Date: Mon, 29 Jun 2026 14:00:54 -0400
Subject: [PATCH] docs: fix 15 typos in blog post comments and documentation

---
 _posts/2018-04-18-bandits-for-the-win.md                | 4 ++--
 _posts/2019-08-28-real-time-data-platform.md            | 4 ++--
 _posts/2019-12-23-data-eng-in-2020.md                   | 6 +++---
 _posts/2019-12-30-migrating-kafka-to-aws.md             | 2 +-
 _posts/2020-03-02-breaking-up-the-dag-repo.md           | 6 +++---
 _posts/2020-03-24-introducing-kafka-player.md           | 2 +-
 _posts/2020-09-15-integrating-databricks-and-datadog.md | 6 +++---
 7 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/_posts/2018-04-18-bandits-for-the-win.md b/_posts/2018-04-18-bandits-for-the-win.md
index a70db49..e973f52 100644
--- a/_posts/2018-04-18-bandits-for-the-win.md
+++ b/_posts/2018-04-18-bandits-for-the-win.md
@@ -17,13 +17,13 @@ We’ll cover how we optimized our home page, transitioned from hard-coded rules

 ### Guessing which recommendation rows people like is probably a bad idea

-Users have diverse tastes. This bears repeating. Users have diverse tastes. I cannot understate the complexity of user preferences in a system that includes books, audiobooks, magazines and user-generated content. We need a system that automatically optimizes which rows to show various groups of users. And while we had a good idea of how well each row performed before implementing the multi-armed bandit, the row position biased the row type’s performance. Position bias, both horizontal and vertical, makes evaluating a recommendation system’s effectiveness challenging, since it’s one of the biggest determinants of interaction.
+Users have diverse tastes. This bears repeating. Users have diverse tastes. I cannot overstate the complexity of user preferences in a system that includes books, audiobooks, magazines and user-generated content. We need a system that automatically optimizes which rows to show various groups of users. And while we had a good idea of how well each row performed before implementing the multi-armed bandit, the row position biased the row type’s performance. Position bias, both horizontal and vertical, makes evaluating a recommendation system’s effectiveness challenging, since it’s one of the biggest determinants of interaction.

 ### There are too many combinations of rows to AB test

 The home page has 42 possible row types, which we can display in 10 row positions, resulting in 5*10¹⁵ potential combinations = 42! / (42 -10)! To give you some context, this is more stars than the entire Milky Way galaxy has!

-### The diversity of rows are important
+### The diversity of rows is important

 If diversity weren’t a factor, one reasonable solution would be to randomly display every row type until there was enough unbiased data to rank the rows. But given that the diversity and order of rows are essential, we needed a more sophisticated method.
 > The scale of the product opportunity
diff --git a/_posts/2019-08-28-real-time-data-platform.md b/_posts/2019-08-28-real-time-data-platform.md
index c90bcf8..5627fd8 100644
--- a/_posts/2019-08-28-real-time-data-platform.md
+++ b/_posts/2019-08-28-real-time-data-platform.md
@@ -12,7 +12,7 @@ team:
 - Core Platform
 ---

-> **Editors note:** *This is a cross-post from Tyler's [personal blog](https://brokenco.de/2019/08/28/real-time-data-platform.html)*
+> **Editor's note:** *This is a cross-post from Tyler's [personal blog](https://brokenco.de/2019/08/28/real-time-data-platform.html)*

 One of the harder parts about building new platform infrastructure at a
 company which has been around a while is figuring out exactly _where_ to
@@ -132,7 +132,7 @@ to that negative feedback, understand what lies beneath the frustrations.
 Finally, have a vision for the future, but build and deliver incrementally. When
 I first sketched this out, I was forthcoming in stating "this is a 2020 project."
 I made sure to clarify that this did not mean we wouldn't deliver anything
-to the business for 18 months. Instead, I made made sure to explain that to
+to the business for 18 months. Instead, I made sure to explain that to
 execute on this overall vision would be a long journey with milestones along
 the way.

diff --git a/_posts/2019-12-23-data-eng-in-2020.md b/_posts/2019-12-23-data-eng-in-2020.md
index f0eb331..fb045bf 100644
--- a/_posts/2019-12-23-data-eng-in-2020.md
+++ b/_posts/2019-12-23-data-eng-in-2020.md
@@ -119,7 +119,7 @@ I often think of the quote by [Charles Babbage](https://en.wikipedia.org/wiki/Ch
 > provoke such a question.

 Data quality is a concern that anybody in the Data Engineering space is
-familiar. For Scribd I think "quality" on two axis:
+familiar. For Scribd I think "quality" on two axes:

 * Integrity: is each record within this set formed the way the customer
   expects it, or in adherence with a predefined schema.
@@ -129,7 +129,7 @@ familiar. For Scribd I think "quality" on two axis:
   other sensitive information which must have extra care added in order to
   safe-guard our readers' privacy.

-Unfortunately data quality is an area where I think we need to substantial
+Unfortunately data quality is an area where I think we need to make substantial
 improvements. Data was at one time treated as a by-product of production
 systems. Now it is rightfully recognized as business-critical, and our
 practices must rise to meet the challenge.
@@ -151,4 +151,4 @@ available, insightful, and of high quality. Data by itself tells us nothing,
 but well-managed data pipelines that allow us to identify characteristics of
 text documents, or content which is interesting to read, is incredibly valuable
 to Scribd. Data Engineering helps us understand our data which helps
-Scribd build products which deliver great reads to the world
+Scribd build products which deliver great reads to the world.
diff --git a/_posts/2019-12-30-migrating-kafka-to-aws.md b/_posts/2019-12-30-migrating-kafka-to-aws.md
index c9418e1..5efc739 100644
--- a/_posts/2019-12-30-migrating-kafka-to-aws.md
+++ b/_posts/2019-12-30-migrating-kafka-to-aws.md
@@ -28,7 +28,7 @@ existed, stemmed from the operational difficulties of _just_ running the thing.
 It was almost like we were afraid to touch Kafka for fear it might fall over.
 Another part of that avoidance grew out of the functionality not matching
 developers' expectations. When we first adopted Kafka,
-ours was an on-premise deloyment of version **0.10**. Developers used it for a
+ours was an on-premise deployment of version **0.10**. Developers used it for a
 few projects, unexpected things occasionally happened that were difficult to
 "fix" and we started avoiding it for new projects.

diff --git a/_posts/2020-03-02-breaking-up-the-dag-repo.md b/_posts/2020-03-02-breaking-up-the-dag-repo.md
index a59f468..1b63cac 100644
--- a/_posts/2020-03-02-breaking-up-the-dag-repo.md
+++ b/_posts/2020-03-02-breaking-up-the-dag-repo.md
@@ -28,7 +28,7 @@ source daemon I have written to make it possible.

 ## Delivering DAGs

-Every Airflow component expects the DAGs to present in a local DAG folder,
+Every Airflow component expects the DAGs to be present in a local DAG folder,
 accessed through a filesystem interface. There are 3 common approaches to meet
 this requirement:

@@ -116,7 +116,7 @@ For daemon Airflow components like web server and scheduler, we run
 S3 to local filesystem every 5 seconds. This is implemented using the sidecar
 container pattern. The DAG folder is mounted as a shared volume between the
 Airflow web/scheduler container and objinsync container. The sidecar
-objinsync container is setup to run the following command:
+objinsync container is set up to run the following command:

 ```
 /bin/objinsync pull s3:///airflow_home/dags /dags
@@ -125,7 +125,7 @@ objinsync container is setup to run the following command:
 For other components like task instance pod that runs to completion, we run
 `objinsync`in pull once mode where it only pulls the required DAG from S3 once
 before the Airflow component starts. This is implemented using Airflow K8S
-executor’s builtin git sync container feature. We are effectively replacing git
+executor’s built-in git sync container feature. We are effectively replacing git
 invocation with `objinsync` in this case.

 **Environment variables for Airflow scheduler:**
diff --git a/_posts/2020-03-24-introducing-kafka-player.md b/_posts/2020-03-24-introducing-kafka-player.md
index ed8e33c..60d6d85 100644
--- a/_posts/2020-03-24-introducing-kafka-player.md
+++ b/_posts/2020-03-24-introducing-kafka-player.md
@@ -39,7 +39,7 @@ yaml formatting of [JSON schema](https://json-schema.org/) to formalize our
 shared schema definitions, giving us the necessary API contract to enforce
 between producer and consumer.

-The snippet below shows a general version of what one of these message schemas look like in yaml.
+The snippet below shows a general version of what one of these message schemas looks like in yaml.
 This comes from one of our real schemas but with all of the interesting fields removed.

 ```yaml
diff --git a/_posts/2020-09-15-integrating-databricks-and-datadog.md b/_posts/2020-09-15-integrating-databricks-and-datadog.md
index 35e4c28..6fcdc11 100644
--- a/_posts/2020-09-15-integrating-databricks-and-datadog.md
+++ b/_posts/2020-09-15-integrating-databricks-and-datadog.md
@@ -42,7 +42,7 @@ agent with the following init script on the driver node:
 echo "Running on the driver? $DB_IS_DRIVER"

 if [[ $DB_IS_DRIVER = "TRUE" ]]; then
-  echo "Setting up metrics for spark applicatin: ${APP_NAME}"
+  echo "Setting up metrics for spark application: ${APP_NAME}"
   echo "Driver ip: $DB_DRIVER_IP"

   cat << EOF >> /home/ubuntu/databricks/spark/conf/metrics.properties
@@ -149,7 +149,7 @@ class Datadog(val appName: String)(implicit spark: SparkSession) extends Seriali
 }
 ```

-To initializing the helper class takes two lines of code:
+To initialize the helper class takes two lines of code:

 ```scala
 implicit val spark = SparkSession.builder().getOrCreate()
@@ -157,7 +157,7 @@ val datadog = new Datadog(AppName)
 ```

 Then you can use `datadog.statsdcli()` to create statsd clients from within
-both **driver** and **executors** to emit custom emtrics:
+both **driver** and **executors** to emit custom metrics:

 ```scala