- Airbyte is a FOSS, open-core tool for managing ETL operations between data sources.
- Airbyte uses dbt under the covers to do some of the heavy lifting during transformation.
Airbyte API
Both the cloud and the self-hosted platform provide a REST API and documentation is available here. The API routes are found at /api/public/v1/ on the same port as the UI.
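For example, we can sanity-check that the API is reachable. A small sketch, assuming a default local docker-compose install on port 8000 (adjust for your deployment):

```python
import requests

# /health is the public API's health-check endpoint; port 8000 is an
# assumption for a default local install.
resp = requests.get("http://localhost:8000/api/public/v1/health")
print(resp.status_code)  # 200 when the server is up
```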
There is also a Python SDK.
To use the Python SDK against a self-hosted deployment we need to override the base URL for the server like so:
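A minimal sketch, assuming the airbyte-api package (pip install airbyte-api) and a default local install with basic-auth credentials:

```python
import airbyte_api
from airbyte_api import models

# Point the client at the self-hosted public API instead of Airbyte Cloud.
# The URL and credentials below are assumptions for a default local deployment.
client = airbyte_api.AirbyteAPI(
    server_url="http://localhost:8000/api/public/v1",
    security=models.Security(
        basic_auth=models.SchemeBasicAuth(
            username="airbyte",
            password="password",
        )
    ),
)
```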
Selected Fields
As of v0.63.4 you can now get a list of fields available within a given stream (resolving issue 35798 in their GitHub project):
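This is exposed through the GET /v1/streams endpoint. A hedged sketch, assuming the airbyte-api SDK maps it to streams.get_stream_properties (the response attribute names are assumptions from the API spec) and reusing the client from above:

```python
from airbyte_api import api

# Ask the platform which streams, and which fields within each stream,
# the source can produce. The IDs are placeholders.
res = client.streams.get_stream_properties(
    request=api.GetStreamPropertiesRequest(
        source_id="<source-id>",
        destination_id="<destination-id>",
    )
)
# Attribute names below are assumptions; check your SDK version.
for stream in res.stream_properties_response.streams:
    print(stream.stream_name, stream.property_fields)
```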
We can then set which fields are enabled for a given connection:
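Field selection goes through PATCH /v1/connections/{connectionId}. Another hedged sketch reusing the client above; the configuration model names (StreamConfiguration, SelectedFieldInfo, field_path) are taken from the public API spec and may differ between SDK versions:

```python
from airbyte_api import api, models

# Enable only the id and email fields on a hypothetical "users" stream.
res = client.connections.patch_connection(
    request=api.PatchConnectionRequest(
        connection_id="<connection-id>",
        connection_patch_request=models.ConnectionPatchRequest(
            configurations=models.StreamConfigurations(
                streams=[
                    models.StreamConfiguration(
                        name="users",
                        selected_fields=[
                            models.SelectedFieldInfo(field_path=["id"]),
                            models.SelectedFieldInfo(field_path=["email"]),
                        ],
                    )
                ]
            )
        ),
    )
)
```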
Octavia
Octavia has been deprecated and is no longer supported.
- There is a GUI for configuring Airbyte but there is also a CLI which can be used to version-control configurations
- https://airbyte.com/tutorials/version-control-airbyte-configurations
- Secrets can be managed via a ~/.octavia config file: https://airbyte.com/tutorials/version-control-airbyte-configurations#create-a-postgres-source-with-octavia-cli
- I’ve written a little Python script to debug Octavia configurations here: https://tungsten.filament.uk.com/ravenscroftj/airbyte-config-tool
- Data warehouse tech stack with MySQL, DBT, Airflow
Airbyte Transformation
- It seems like you can use dbt to do transformations after you’ve copied the data from the source
- In order to connect Airbyte to GitLab you can create a deploy token or access token and connect over HTTP.
- Transforming raw data into datasets using Airbyte with dbt - WalkingTree Technologies
Authentication
- Airbyte connectors can authenticate via Basic Auth, Bearer Token, API Key, or OAuth (see the sketch below)
- https://docs.airbyte.com/connector-development/connector-builder-ui/authentication
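The four styles boil down to where the credentials go on the request. An illustrative sketch using requests (the endpoints, header names, and token URL are made up; real APIs vary):

```python
import requests

BASE = "https://api.example.com/v1/records"  # hypothetical API

# Basic Auth: username/password encoded into the Authorization header
requests.get(BASE, auth=("user", "secret"))

# Bearer Token: Authorization: Bearer <token>
requests.get(BASE, headers={"Authorization": "Bearer MY_TOKEN"})

# API Key: often a custom header, sometimes a query parameter
requests.get(BASE, headers={"X-API-Key": "MY_KEY"})

# OAuth (client credentials): exchange a client id/secret for a short-lived
# access token, then send it as a bearer token
token = requests.post(
    "https://auth.example.com/oauth/token",
    data={
        "grant_type": "client_credentials",
        "client_id": "MY_ID",
        "client_secret": "MY_SECRET",
    },
).json()["access_token"]
requests.get(BASE, headers={"Authorization": f"Bearer {token}"})
```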
Record Processing
- Airbyte connectors make HTTP requests and carry out transformations of the data they receive
- The documentation seems to imply that records can be arbitrary JSON objects - you just specify a JSON selector to point to where those records sit in the response
- We use JSON Schema to define the record object schema (see the sketch below)
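An illustrative sketch of both ideas: pull the records out of a response body with a path-style selector, then validate each record against a JSON Schema (the payload shape and schema here are made up):

```python
import jsonschema  # pip install jsonschema

# Hypothetical API response: the records live under data.records
response_body = {
    "data": {
        "records": [
            {"id": 1, "name": "alice"},
            {"id": 2, "name": "bob"},
        ]
    }
}

# The "selector": a path into the response pointing at the list of records
selector = ["data", "records"]
records = response_body
for key in selector:
    records = records[key]

# JSON Schema describing each record object
record_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
    },
    "required": ["id"],
}

for record in records:
    jsonschema.validate(record, record_schema)  # raises ValidationError on mismatch
```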
Upgrading Airbyte
If the deployment is done via docker-compose then we can use the provided script to upgrade the deployment, as per this guide.
Airbyte Docker Troubleshooting
- Make sure we are running the very latest versions of Docker and docker-compose, or else we get some strange error messages using the provided docker-compose.yaml. A comprehensive walkthrough can be found here
Permissions Error with temp files/workspaces
If you encounter issues like this one:
ERROR StatusConsoleListener FileManager (/tmp/workspace/4de9764c-5e55-4c00-9cf6-f23c54d6db2d/0/logs.log) java.io.IOException: Could not create directory /tmp/workspace/4de9764c-5e55-4c00-9cf6-f23c54d6db2d/0
Then you may need to set permissions properly on the airbyte_workspace volume, using a command like the one in the issue below:
- See issue here
Cannot Find image: airbyte/source-mysql:3.6.0
No idea why this is happening. If you run the upgrade script as detailed here it seems to fix it. Not exactly confidence-boosting.
RESOURCE_EXHAUSTED: grpc: received message larger than max
According to this bug, the state of the MySQL sync history can grow too big and cause Airbyte to fail every time. The only workaround is to recreate the connection.