Uber’s Data Platform with Zhenxiao Luo

Greatest Hits – Software Engineering Daily

Greatest Hits – Software Engineering Daily에서 제공하는 콘텐츠입니다. 에피소드, 그래픽, 팟캐스트 설명을 포함한 모든 팟캐스트 콘텐츠는 Greatest Hits – Software Engineering Daily 또는 해당 팟캐스트 플랫폼 파트너가 직접 업로드하고 제공합니다. 누군가가 귀하의 허락 없이 귀하의 저작물을 사용하고 있다고 생각되는 경우 여기에 설명된 절차를 따르실 수 있습니다 https://ko.player.fm/legal.

7+ y ago 55:34

MP3•에피소드 홈

저장한 시리즈 ("피드 비활성화" status)

When? This feed was archived on August 01, 2022 13:57 (3y ago). Last successful fetch was on February 14, 2022 03:52 (3+ y ago)

Why? 피드 비활성화 status. 잠시 서버에 문제가 발생해 팟캐스트를 불러오지 못합니다.

What now? You might be able to find a more up-to-date version using the search function. This series will no longer be checked for updates. If you believe this to be in error, please check if the publisher's feed link below is valid and contact support to request the feed be restored or if you have any other concerns about this.

When a user takes a ride on Uber, the app on the user’s phone is communicating with Uber’s backend infrastructure, which is writing to a database that maintains the state of that user’s activity. This database is known as a transactional database or “OLTP” (online transaction processing). Every active user and driver and UberEATS restaurant is writing data to the transactional data store.

Periodically, that data is copied from the transactional data system to a different data storage system, where that data can be queried for large-scale data analysis. For example, if a data scientist at Uber wants to get the average amount of miles that a given user rode in February, that data scientist would issue a query to the analytical data cluster.

Uber uses the Hadoop distributed file system (HDFS) to store analytical data. On this file system, Uber has a version history of all of the company’s useful historical data. Trip history, rider activity, driver activity–every data point that is in the transactional database–but in a file format that is easier to query for large scale processing. This file format is known as Parquet.

Data scientists, machine learning engineers, and real-time application developers all depend on the massive quantities of data that are stored in these Parquet files on Uber’s HDFS cluster. To simplify the access of that data by many different clients, Uber uses Presto, an analytical query engine originally built at Facebook.

Presto translates SQL queries into whatever query language is necessary to access the underlying storage medium–whether that storage system is an ElasticSearch cluster, a set of Parquet files, or a relational database. Presto is useful because it simplifies the relationship between data engineers and the application developers who are building on top of the data engineering infrastructure.

In today’s show, Zhenxiao Luo joins to give an end-to-end description of Uber’s data infrastructure–from the ingest point of the OLTP database to the OLAP data storage system on HDFS, to the wide range of data systems and applications that run on top of that OLAP data.

The post Uber’s Data Platform with Zhenxiao Luo appeared first on Software Engineering Daily.

168 에피소드

#Tech News #Open Source #News #Greatest Hits Software Engineering Daily