FIVETRAN SETUP Guide
These days, data has become an invaluable asset. For business owners aiming to level up their IT and Business Intelligence infrastructures, seamlessly integrating and analyzing data from multiple sources is imperative. One of the solutions that have gained popularity in this regard is Fivetran—an ELT/ETL tool. This article provides an in-depth look at Fivetran and how it can aid in transforming raw data into actionable insights.
Table of Contents
First things first: Understanding the Need for ELT/ETL
Before we even dive into the specific Fivetran Setup, we need to quickly look into the basics concerning the tool. When talking about Fivetran, we talk about an ETL/ELT solution.
ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are processes for transferring data from source systems to a centralized data warehouse. The key difference is the order in which data is transformed and loaded (here’s more on ETL vs. ELT).
Why is it important to have a proper ELT or ETL setup in place?
Centralizes data from different sources
Streamlines data processing for analytics
Ensures data quality and consistency
One of the solutions offered in this scenario is Fivetran. It is a cloud-based, fully-managed data integration platform that automates the extraction of data from various sources and loads it into a data warehouse, simplifying the ELT (Extract, Load, Transform) process. It offers pre-built connectors for a wide range of applications, databases, and platforms, enabling businesses to centralize their data for analytics and reporting without extensive manual setup.
Automated Data Integration: Streamlines the ELT process by automatically extracting and loading data from sources to a data warehouse. This eliminates the need for manual data engineering tasks, saving time and ensuring consistent data flows.
Pre-built Connectors: Offers connectors for a vast array of sources including CRMs, databases, file storages, and more. These pre-configured integrations reduce the time and technical expertise required to connect various data sources.
Real-time Data Synchronization: Ensures timely data updates and consistency in the data warehouse. This means businesses can rely on up-to-date data for analytics and decision-making at any given moment.
Centralized Monitoring Dashboard: Provides a unified view to monitor and manage all data integrations. This centralized perspective helps in quickly identifying and resolving any integration-related issues.
Data Transformation: Allows for in-warehouse transformations to prepare data for analytics. This capability ensures that the data is not only centralized but also structured optimally for analysis purposes.
Historical Data Backfill: Automatically backfills historical data when a new connector is set up. This feature ensures that businesses don’t miss out on any valuable historical insights when adding new data sources.
Data Collision Resolution: Handles schema changes and data collisions to ensure data consistency. As source data structures evolve, it automatically adapts to these changes, preventing potential data inconsistencies.
Enterprise-grade Security: Compliant with industry standards and regulations to protect sensitive data. This ensures that businesses can trust Fivetran with their data, knowing it’s safeguarded against potential threats.
Scalability: Scales to handle increasing data volumes, making it suitable for both small businesses and large enterprises. As a business grows and data volume increases, it adjusts without any degradation in performance.
Error Handling: Automatically detects and retries failed data loads, with alerts for persistent issues. This proactive approach ensures that data flow interruptions are minimal, and businesses are promptly informed of any critical issues.
These expanded explanations showcase how its’ features work cohesively to provide a robust and efficient data integration platform. Leveraging its cloud-native architecture, it ensures real-time, scalable, and accurate data synchronization.
Your Fivetran setup offers a vast range of connectors to ensure seamless integration of data from various sources into a centralized data warehous. Their connectors are designed to handle various types of data sources, such as:
Databases: These connectors allow it to pull data from popular relational databases, NoSQL databases, and more. Examples include PostgreSQL, MySQL, MongoDB, and Oracle.
Applications: Fivetran provides connectors for various applications. These range from CRM tools like Salesforce, to marketing platforms like HubSpot or Google Analytics, finance tools like NetSuite, and many others.
Advertising Platforms: For businesses involved in online advertising, it offers connectors for platforms like Facebook Ads, Google Ads, and Bing Ads, enabling them to analyze their advertising ROI comprehensively.
File Storage: For businesses that store data in cloud-based file storage systems, with connectors for platforms like Amazon S3, Google Cloud Storage, and Azure Blob Storage.
Cloud Functions: Connectors that integrate with cloud function platforms like AWS Lambda or Google Cloud Functions, enabling businesses to process and transform data on-the-fly before loading it into the warehouse.
Version Control and Project Management: Connectors for platforms like GitHub, GitLab, and Jira, allowing businesses to track changes, manage projects, and analyze development workflows.
ERP Systems: Integration with popular ERP systems like Microsoft Dynamics or SAP ensures that businesses can pull in operational and financial data seamlessly.
E-commerce Platforms: Connectors for platforms like Shopify, Magento, and WooCommerce to help e-commerce businesses integrate sales, customer, and product data.
It’s important to note that the ecosystem of connectors is continuously growing, and they frequently add new integrations based on the evolving needs of businesses. The aforementioned categories provide a high-level overview, but the actual number of specific connectors within these categories is much more extensive. For a detailed list, it’s always a good idea to check Fivetran’s official documentation or website.
Missing a connector?
There are a few reasons why Fivetran (or any similar platform) might not offer certain connectors, which might be the case if a tool is still relatively new or a niche product. Other reasons can of course revolve around licensing topics e.g. in the case of proprietary systems, just to name a few potential reasons.
How about scalability?
Fivetran setups are highly scalable and can handle a large volume of data. It offers automatic scalability, which means it can accommodate growing data needs without requiring manual adjustments or additional infrastructure.
Fivetran’s architecture is designed to efficiently sync data from various sources to a data warehouse, ensuring smooth and reliable data integration even as the data volume increases. Additionally, Fivetran provides connectors for a wide range of data sources, allowing organizations to scale their data pipelines easily.
With its scalable infrastructure and extensive connector library, Fivetran offers a robust solution for managing data at scale.
Fivetran’s cloud-native architecture ensures it is inherently scalable, adeptly managing data demands of both small businesses and large enterprises.
By leveraging the benefits of cloud infrastructures, it dynamically scales its resources to meet growing data needs without manual intervention. It employs strategies like incremental updates and parallel processing to efficiently handle vast data volumes. Its integrations with modern cloud-based data warehouses, such as Snowflake, BigQuery, and Redshift, further bolster its scalability.
Adaptive data syncing adjusts to the frequency of data changes, ensuring optimal resource usage, and built-in error detection ensures reliability even as data volumes surge. With a usage-based pricing model and performance monitoring capabilities, Fivetran ensures businesses can smoothly scale their operations while maintaining cost-effectiveness and optimal performance.
In terms of cost, Fivetran offers a pricing model based on the volume of data being processed. While specific pricing details may vary depending on the organization’s needs, it is generally considered to be a cost-effective solution for data integration.
One reason for this is the automatic scalability, which allows it to handle large volumes of data without requiring manual adjustments or additional infrastructure. This means that organizations can scale their data pipelines without incurring significant costs related to infrastructure upgrades or additional resources.
Furthermore, the before mentioned extensive connector library eliminates the need for organizations to build and maintain custom integrations for each data source. This saves time and resources that would otherwise be spent on development and maintenance costs.
Overall, cost-effectiveness can be attributed to its scalable architecture, extensive connector library, and the efficiency it brings to data integration processes. It can be highly cost-effective, especially when considering the operational efficiency it brings and the costs it can save in terms of integration and maintenance.
However, businesses should carefully assess their specific needs, anticipated data volumes, and growth projections to get a clear picture of the total cost of ownership.
What a typical setup involving Fivetran could look like – An exemplary case
Now let’s assume you decided to go with the solution, of course the tool is just one (important) component within your overall framework.
It all starts with your various data sources that you would like to connect. Beyond the likes of e.g. HubSpot for your CRM data, Google Analytics for web metrics, Google Ads for advertising insights, or social media platforms that might offer relevant insights into campaign performance as well as for example brand engagement or customer sentiments towards your company.
If you have an e-commerce business, connecting your shop system for sales data or further customer information will of course also be crucial. Businesses need to of course always consider which further sources they want or need to connect depending on their individual use case. E-commerce platforms like Shopify or WooCommerce provide sales and further customer data.
Additionally, internal databases and more might provide highly valuable data that should not remain untouched. Ideally, the set of data sources ensures for a holistic view on your data to provide the foundation for actual business intelligence to allow for more informed, data driven strategic choices.
ELT through Fivetran and dbt:
Once the sources you want to connect are defined, Fivetran comes into play for the orchestration of a streamlined extraction process of your data. It then directly loads this raw data into a target data warehouse so it is readily available for further analysis and processing. This is where (our tool of choice in this scenario) comes into play.
dbt allows for the transformation of the raw data using SQL queries and modeling techniques. With dbt, data analysts and engineers can apply business logic, perform calculations, and create meaningful datasets optimized for analysis. Utilizing SQL, dbt helps transform this raw data into structured, analysis-ready formats.
The combination with dbt offers a streamlined and efficient ELT process. Fivetran takes care of the data extraction and loading, while dbt provides a powerful tool for transforming and modeling the data. This setup enables organizations to easily maintain and update their data pipelines, ensuring accurate and reliable data for analytics and reporting purposes.
In the data warehouse once gone through the elt-processing, your data from various sources is now stored and organized for further analysis and reporting. Your warehouse should provide a reliable and scalable infrastructure to handle large volumes of data efficiently, starting with the aspect of data modeling for example.
So you should be able to structure the data in a way that facilitates easy querying and analysis, which involves organizing it into tables, defining relationships between those, and creating indexes.
Your warehouse should be properly set up for performance for query performance, as this has a huge impact on fast and efficient data retrieval. And lastly, your setup should be a scalable one for future growth and user demands. It should be able to handle growing data volumes and increased user concurrency without performance losses.
Let’s assume you want to use Snowflake for warehousing. At kemb we have extensive experience in configuring similar setups. Snowflake is basically tailored for the cloud, separating compute and storage resources. It allows for flexible scaling and cost management and brings several more advantages. Nevertheless, the best choice for a data warehouse often depends on specific business needs and existing infrastructure.
Visualisation and Reporting:
Now lastly, the probably most important part when it comes to the actual working with the data is translating it into the needed reports and graphs for the various departments. It needs to be turned into more comprehensible formats to help with well informed and data based strategic decisions.
As proper visualization not only simplifies data interpretation but also needs to ensure that insights can be derived efficiently and effectively with tools such as Tableau or Power BI for example, to convert data into structured reports and interactive dashboards.
The Setup process
1. Before signing up
Before setting up Fivetran, we would recommend first checking whether the tool offers what you specifically need. Ideally start by mapping out your primary data sources and ensuring Fivetran’s connectors align with them. See the documentation for more information.
Although by the date of publishing this article, Fivetran already supports +300 connectors, you might also want to check flexibility for custom connections you may still need. Another important aspect is of course costs, which can be a bit tricky to estimate when you first start working with the tool due to its pricing model based on monthly active rows.
2. Connecting your data warehouse
Once signed up, you will need to prepare your data infrastructure for a seamless integration. Within your dashboard you will first need to define where the data should be loaded by connecting your data warehouse, such as above mentioned Snowflake for example and enter the necessary connection details.
3. Integrating your data sources
Once you’ve connected your data warehouse, you want to connect your various data sources. In your dashboard you can go through a list of already available connectors and select the ones you need.
Per source you will then of course have to go through the authentication process, before choosing the specific tables or fields you want to extract. Do not forget to define a sync frequency that aligns with your actual data refresh needs.
This could range from real-time updates to weekly or even monthly pulls, depending on the given data source this will most likely heavily require the involved teams input, as they are the experts knowing best, how frequently they would need their data updated.
With a view on costs, you should always make sure to have an efficient approach in place to avoid unnecessary bloats of data.
4. Preparing for Data Transformation
As mentioned before, while Fivetran is there for the extraction and loading part, dbt will be responsible for the transformation within your data warehouse to now bridge the gap between raw data pulls and insightful information.
So once you properly set up the loading of data into your warehouse you can install dbt and set up your new project, configuring the connection to your data warehouse. So you will need to specify exactly the type of database, connection details, credentials, and other specifics.
After that, you can start building models in dbt (your transformations) based on your transformation needs. Of course, dbt also allows for testing and documentation to ensure data quality. For ongoing transformations, you’d want to schedule dbt runs, where with the help of orchestration tools or simple cron jobs, you can ensure your data is regularly transformed after it loads.
Of course, there are more things to consider in the context of the dbt setup, but at this point we want to focus on setting up Fivetran. One last recommendation though in this context would be to also integrate dbt with Fivetran Logs. Ingesting these into your warehouse and use dbt to model and analyze them as this might provide relevant insights into the ETL process itself.
5. Your first synchronization
Once you connect the involved components within your data stack, it is time for your first synchronization! During this initial sync, the most important aspect is to ensure a complete and accurate transfer of historical data from your sources.
Watch for any discrepancies or errors as these often occur due to inconsistencies between data types or source-specific characteristics. Also have a look at the volume of data being transferred, as large amounts of data can not only impact the duration of your sync but as mentioned before have a huge impact on costs.
Post Sync Checks
Data Integrity Checks
Completeness: Ensure all records from the source have been transferred to the destination.
Uniqueness: Check for any unintended duplicate records.
Consistency: Data types and formats should be consistent between the source and destination. For instance, dates should not be converted to strings inadvertently.
Compare table structures, field names, and data types between the source and the warehouse to confirm that the schema is correctly replicated. Check for any missing tables or fields.
Transformation Validation with dbt:
After running dbt models, validate the transformed data’s accuracy and completeness.
Ensure that the dbt transformations haven’t introduced any errors, such as null values where there shouldn’t be, or overlooked aggregations.
Volume & Performance Monitoring:
Monitor the time taken for the initial sync. If it took longer than anticipated, consider optimizing settings or evaluating if Fivetran’s setup aligns with your data volume.
Track the computational costs associated with this sync, especially if you’re using a cloud data warehouse where costs can scale with data volume and query complexity.
Error Logs & Notifications:
Review any error logs or notifications generated by the tool during the sync process. These can provide clues about failed data transfers, transformation errors, or connection issues.
Set up alerts or notifications (if not already done) for future sync operations to be proactively informed about potential issues.
Historical Data Accuracy:
For time-series or historical data, ensure that the chronological data points are accurate and no gaps exist in the series.
Check for any anomalies or outliers that could indicate potential issues in the sync or source data.
Access & Permissions:
Confirm that the right access levels and permissions are set for the loaded data, ensuring data security and governance.
Make sure sensitive data is appropriately masked or encrypted, based on your organization’s data privacy policies.
Backup & Recovery:
After the initial sync, consider creating a backup of your data warehouse. This provides a recovery point should any issues arise in subsequent syncs or transformations.
Test the recovery process to ensure you can restore data if needed. Remember that this initial synchronization forms the basis for all subsequent data operations, so thorough validation and error checking are extremely important.
6. Maintenance of your setup
Maintaining your setup or rather your data stack as a whole requires consistent monitoring, optimization, and adaptation. You need to ensure that data integration remains seamless and efficient.
When it comes to technicalities, we can only recommend to regularly review Fivetran’s logs and dashboards, as this will help address any potential failures in syncing as well as data mismatches, or performance bottlenecks.
And of course, as your organization’s data sources evolve or requests will change, you will need to revisit and adjust your connectors accordingly to always ensure for new tables or fields to be integrated into your sync processes properly.
Also consider scheduling periodic backups of your data warehouse and run tests to confirm if recovery processes are effective. Moreover, always check out updates or feature releases. Regular maintenance ensures that your setup remains agile, robust, and aligned with your ever-evolving data integration needs.
Make sure to provide detailed documentation of the process. First of all it will help with smooth onboarding of new team members. Even more important though can it become when dealing with troubleshooting and identifying and to ensure for the consistent application of best practices across your setup.
Having documentation as a reference for the intended configuration and workflows will reduce ambiguities, support for knowledge sharing, and an efficient management of transparent processes.
Furthermore even with a view on aspects like data compliance, documentation can become extremely valuable with regards regulatory adherence.
Within your data stack as the backbone of your business intelligence, having a robust and streamlined ELT setup in place is of course (but not just) one of the key components.
We hope that our deep dive into setting up Fivetran did not just make that clear but also provided a good starting point for the actual process of the setup. A well-orchestrated ELT process not only optimizes data accessibility but also enhances its reliability and relevance.
If you are currently looking into optimizing your data stack and need support in choosing the right tools or setting up the needed framework, please feel free to reach out. We are happy to support you in the process!