4 changes: 2 additions & 2 deletions _posts/2018-04-18-bandits-for-the-win.md
@@ -17,13 +17,13 @@ We’ll cover how we optimized our home page, transitioned from hard-coded rules

### Guessing which recommendation rows people like is probably a bad idea

-Users have diverse tastes. This bears repeating. Users have diverse tastes. I cannot understate the complexity of user preferences in a system that includes books, audiobooks, magazines and user-generated content. We need a system that automatically optimizes which rows to show various groups of users. And while we had a good idea of how well each row performed before implementing the multi-armed bandit, the row position biased the row type’s performance. Position bias, both horizontal and vertical, makes evaluating a recommendation system’s effectiveness challenging, since it’s one of the biggest determinants of interaction.
+Users have diverse tastes. This bears repeating. Users have diverse tastes. I cannot overstate the complexity of user preferences in a system that includes books, audiobooks, magazines and user-generated content. We need a system that automatically optimizes which rows to show various groups of users. And while we had a good idea of how well each row performed before implementing the multi-armed bandit, the row position biased the row type’s performance. Position bias, both horizontal and vertical, makes evaluating a recommendation system’s effectiveness challenging, since it’s one of the biggest determinants of interaction.

### There are too many combinations of rows to AB test

The home page has 42 possible row types, which we can display in 10 row positions, resulting in 5*10¹⁵ potential combinations = 42! / (42 -10)! To give you some context, this is more stars than the entire Milky Way galaxy has!
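
The figure quoted above can be checked with a few lines of Python. This is only a sanity-check sketch, not part of the original post: the constant names are made up for illustration, and `math.perm` assumes Python 3.8 or newer.

```python
import math

ROW_TYPES = 42   # row types available for the home page (from the post)
POSITIONS = 10   # row positions to fill (from the post)

# Ordered arrangements of 10 distinct rows chosen from 42: 42! / (42 - 10)!
layouts = math.perm(ROW_TYPES, POSITIONS)

# Cross-check against the factorial form quoted in the text.
assert layouts == math.factorial(ROW_TYPES) // math.factorial(ROW_TYPES - POSITIONS)

print(f"{layouts:.2e}")
```

The result is roughly 5.3 × 10¹⁵, consistent with the rounded 5*10¹⁵ in the text.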

-### The diversity of rows are important
+### The diversity of rows is important

If diversity weren’t a factor, one reasonable solution would be to randomly display every row type until there was enough unbiased data to rank the rows. But given that the diversity and order of rows are essential, we needed a more sophisticated method.
> The scale of the product opportunity
4 changes: 2 additions & 2 deletions _posts/2019-08-28-real-time-data-platform.md
@@ -12,7 +12,7 @@ team:
- Core Platform
---

-> **Editors note:** *This is a cross-post from Tyler's [personal blog](https://brokenco.de/2019/08/28/real-time-data-platform.html)*
+> **Editor's note:** *This is a cross-post from Tyler's [personal blog](https://brokenco.de/2019/08/28/real-time-data-platform.html)*

One of the harder parts about building new platform infrastructure at a company
which has been around a while is figuring out exactly _where_ to
@@ -132,7 +132,7 @@ to that negative feedback, understand what lies beneath the frustrations.
Finally, have a vision for the future, but build and deliver incrementally.
When I first sketched this out, I was forthcoming in stating "this is a 2020
project." I made sure to clarify that this did not mean we wouldn't deliver anything
-to the business for 18 months. Instead, I made made sure to explain that to
+to the business for 18 months. Instead, I made sure to explain that to
execute on this overall vision would be a long journey with milestones along
the way.

6 changes: 3 additions & 3 deletions _posts/2019-12-23-data-eng-in-2020.md
@@ -119,7 +119,7 @@ I often think of the quote by [Charles Babbage](https://en.wikipedia.org/wiki/Ch
> provoke such a question.

Data quality is a concern that anybody in the Data Engineering space is
-familiar. For Scribd I think "quality" on two axis:
+familiar. For Scribd I think "quality" on two axes:

* Integrity: is each record within this set formed the way the customer
expects it, or in adherence with a predefined schema.
@@ -129,7 +129,7 @@ familiar. For Scribd I think "quality" on two axis:
other sensitive information which must have extra care added in order to
safe-guard our readers' privacy.

-Unfortunately data quality is an area where I think we need to substantial
+Unfortunately data quality is an area where I think we need to make substantial
improvements. Data was at one time treated as a by-product of production
systems. Now it is rightfully recognized as business-critical, and our
practices must rise to meet the challenge.
@@ -151,4 +151,4 @@ available, insightful, and of high quality. Data by itself tells us
nothing, but well-managed data pipelines that allow us to identify characteristics
of text documents, or content which is interesting to read, is incredibly
valuable to Scribd. Data Engineering helps us understand our data which helps
-Scribd build products which deliver great reads to the world
+Scribd build products which deliver great reads to the world.
2 changes: 1 addition & 1 deletion _posts/2019-12-30-migrating-kafka-to-aws.md
@@ -28,7 +28,7 @@ existed, stemmed from the operational difficulties of _just_ running the thing.
It was almost like we were afraid to touch Kafka for fear it might fall over.
Another part of that avoidance grew out of the functionality not matching
developers' expectations. When we first adopted Kafka,
-ours was an on-premise deloyment of version **0.10**. Developers used it for a
+ours was an on-premise deployment of version **0.10**. Developers used it for a
few projects, unexpected things occasionally happened that were difficult to
"fix" and we started avoiding it for new projects.

6 changes: 3 additions & 3 deletions _posts/2020-03-02-breaking-up-the-dag-repo.md
@@ -28,7 +28,7 @@ source daemon I have written to make it possible.

## Delivering DAGs

-Every Airflow component expects the DAGs to present in a local DAG folder,
+Every Airflow component expects the DAGs to be present in a local DAG folder,
accessed through a filesystem interface. There are 3 common approaches to meet
this requirement:

@@ -116,7 +116,7 @@ For daemon Airflow components like web server and scheduler, we run
S3 to local filesystem every 5 seconds. This is implemented using the sidecar
container pattern. The DAG folder is mounted as a shared volume between the
Airflow web/scheduler container and objinsync container. The sidecar
-objinsync container is setup to run the following command:
+objinsync container is set up to run the following command:

```
/bin/objinsync pull s3://<S3_DAG_BUCKET>/airflow_home/dags <YOUR_AIRFLOW_HOME>/dags
@@ -125,7 +125,7 @@ objinsync container is setup to run the following command:
For other components like task instance pod that runs to completion, we run
`objinsync`in pull once mode where it only pulls the required DAG from S3 once
before the Airflow component starts. This is implemented using Airflow K8S
-executor’s builtin git sync container feature. We are effectively replacing git
+executor’s built-in git sync container feature. We are effectively replacing git
invocation with `objinsync` in this case.

**Environment variables for Airflow scheduler:**
2 changes: 1 addition & 1 deletion _posts/2020-03-24-introducing-kafka-player.md
@@ -39,7 +39,7 @@ yaml formatting of [JSON schema](https://json-schema.org/) to formalize our
shared schema definitions, giving us the necessary API contract to enforce
between producer and consumer.

-The snippet below shows a general version of what one of these message schemas look like in yaml.
+The snippet below shows a general version of what one of these message schemas looks like in yaml.
This comes from one of our real schemas but with all of the interesting fields removed.

```yaml
6 changes: 3 additions & 3 deletions _posts/2020-09-15-integrating-databricks-and-datadog.md
@@ -42,7 +42,7 @@ agent with the following init script on the driver node:
echo "Running on the driver? $DB_IS_DRIVER"

if [[ $DB_IS_DRIVER = "TRUE" ]]; then
-echo "Setting up metrics for spark applicatin: ${APP_NAME}"
+echo "Setting up metrics for spark application: ${APP_NAME}"
echo "Driver ip: $DB_DRIVER_IP"

cat << EOF >> /home/ubuntu/databricks/spark/conf/metrics.properties
@@ -149,15 +149,15 @@ class Datadog(val appName: String)(implicit spark: SparkSession) extends Seriali
}
```

-To initializing the helper class takes two lines of code:
+To initialize the helper class takes two lines of code:

```scala
implicit val spark = SparkSession.builder().getOrCreate()
val datadog = new Datadog(AppName)
```

Then you can use `datadog.statsdcli()` to create statsd clients from within
-both **driver** and **executors** to emit custom emtrics:
+both **driver** and **executors** to emit custom metrics:


```scala