Build a data pipeline
Learn how to build resilient and performant data pipelines with Prefect.
In the Quickstart, you created a Prefect flow to get stars for a list of GitHub repositories. And in Schedule a flow, you learned how to schedule runs of that flow on remote infrastructure.
In this tutorial, you’ll learn how to turn this flow into a resilient and performant data pipeline. The real world is messy, and Prefect is designed to handle that messiness. For example:
- Your API requests can fail.
- Your API requests run too slowly.
- Your API requests run too quickly and you get rate limited.
- You waste time and money running the same tasks multiple times.
Instead of solving these problems in the business logic itself, use Prefect’s built-in features to handle them.
Retry on failure
The first improvement you can make is to add retries to the task that makes the HTTP request. Whenever a request fails, Prefect retries it a few times before giving up.
Concurrent execution of slow tasks
If individual API requests are slow, you can speed them up in aggregate by making multiple requests concurrently.
When you call the .map() method on a task, you submit a list of arguments to the task runner to run concurrently (alternatively, you could .submit() each argument individually).
Calling .result() on the list of futures returned by .map() will block until all tasks are complete.
Read more in the .map() documentation.
Avoid getting rate limited
One consequence of running tasks concurrently is that you’re more likely to hit the rate limits of whatever API you’re using. To avoid this, use Prefect to set a global concurrency limit.
Now, you can use this global concurrency limit in your code to rate limit your API requests.
Cache the results of a task
For efficiency, you can skip tasks that have already run. For example, if you don’t want to fetch the number of stars for a given repository more than once per day, you can cache those results for a day.
Run your improved flow
This is what your flow looks like after applying all of these improvements:
Run your flow twice: once to run the tasks and cache the results, and again to retrieve those results from the cache.
In the terminal output from the second flow run, the tasks whose results were cached should finish in a Cached state instead of making new API requests.
Next steps
In this tutorial, you built a resilient and performant data pipeline that uses the following techniques:
- Retries to handle transient errors
- Concurrency to speed up slow tasks
- Concurrency limits to avoid hitting the rate limits of your APIs
- Caching to skip repeated tasks
Next, learn how to handle data dependencies and ingest large amounts of data. You’ll use error handling, pagination, and nested flows to scrape data from GitHub.