Fivetran MAR Optimization

by Konstantin Wemhoener | Oct 20, 2022 | Business Intelligence

Many potential components of the modern data stack have consumption based pricing. As most of these tools offer free trials, you should make sure to make the most of the trial and determine the cost of your infrastructure. However once you are on a running setup there might be ways to save several hundred Euros or more by optimizing what data you consume and how you set up the whole infrastructure you use for ingestion. In this case we want to look at fivetran in conjunction with a snowflake warehouse.

The Pricing Model of Fivetran: Consumption based

Fivetran Pricing is based on the user’s actual monthly data consumption. Therefore, whatever possibilities you have to control or optimize this amount has an immediate impact on your costs, let alone efficiency as one should always question what is really needed or not. Given that next to direct cost for fivetran itself you will pay for compute resources (in this case snowflake warehouses) you might be able to also reduce secondary cost associated with fivetran.

How is consumption measured?

In the case of fivetran, pricing is based on a metric called MAR (monthly active rows). It might be initially tricky to grasp that concept but let’s give it a try (a good resource is of course Fivetran’s website as well):

  • You have a table with one million rows. Now this table gets synced when you completely set up your connector with a historical sync

  • Basically, this table grows by 10.000 rows daily, all of these new rows are counted as MAR, so will all newly added rows that get synced

  • Within the whole table, there are some rows that get updated during the month. Any row that gets updated once or more in a month will get counted once as a MAR. A row that has been created on the 2nd of a month and that gets updated 5 times within that same month, will be counted as one MAR for that month

  • There are some notable exceptions to this, the biggest being file connectors (file folders, SFTP/FTP,…). Here you will pay for the maximum amount of rows in each file that fivetran syncs in a month and you will get charged for the full table each month. Using lots of files in your account as sources can bloat the cost (see https://fivetran.com/docs/pricing#fileconnectors)

  • Some specific connectors have sync strategies, where they regularly need to sync an entire table as there is no tracking mechanism for fivetran in place to understand which rows are new or have changed. An overview of edge cases is summarized here (https://fivetran.com/docs/pricing#connectorspecificfunctionaldifferences)

If you use fivetran, you are most often aware how to track your MAR. When you open your account and have the right permissions, you can go on a per connector and per table level by clicking on “Account settings” -> “Billing and Usage” and selecting the usage tab. Another option is to use the fivetran log connector to write information on all connectors to your data warehouse / data lake and then build a custom reporting. We regularly support our clients in setting up a cost reporting on their whole infrastructure

Optimizing MAR

Several approaches can lead to a more cost-efficient setup with fivetran:

  • On the connector level, understand which tables within a connector account for the most MAR and whether they are actually used downstream for reporting. An example that regularly saves lots of MAR is switching off hourly reports from the Google Ads connector, which generate most of the MAR in this connector, but most companies only evaluate their ad performance on a daily level. Go through the statistics connector by connector and take the time to talk to users involved in downstream processes, to see if the “big spenders” are actually in use. If you are using tools like dbt or coalesce, you might understand from their data lineage features if these tables are implemented in your data modeling at all

  • MAR do not depend on whether a row is actually updated in the sources, but rather whether fivetran sees a change or not. Let me elaborate: a concrete example out of our consulting work was a table that held all the product information in our client’s ERP system. This table was needed downstream in multiple instances to join in product information. However,within this table there was a column that was updated every 5 minutes as a ping against the stock. Stock levels however were not of interest at that time. Fivetran offers an option to individually select or deselect most columns(which can also help exclude PII from getting to your destination). Removing that “ping-column” helped to massively reduce the MAR for that table as the remaining columns rarely changed during a month

  • Specific connectors require the composition of custom datasets you extract with fivetran. You will have to configure fivetran to generate tables with columns that you need from the sources data model. Let’s take Google Analytics as an example: there is a tradeoff between building very specific custom reports for each reporting use case vs building broader tables in custom reports that cover more dimensions and metrics than needed for each individual report, but then you only need that one table. Your free trial for each connector can help estimate which setup is more efficient. However, adding a not needed dimension to a custom report will directly multiply your MAR. Say you have a table with 200 rows daily for pageviews per source and medium. Adding the campaign dimension, those can easily become several thousand rows. If you do not report on a campaign level however, this is just additional cost with no value

  • Another issue we regularly encounter is that synchronization frequencies in fivetran are not aligned with the downstream use cases. You can technically have fivetran sync every minute (depending on your plan the minimum might be every hour). The impact on your MAR is actually negligible, because if a row is active in a month whether you get it at high or low frequency does not matter. There are use cases where high frequency syncs are required though: you need a (near-)realtime dashboard for inventory, you have no historization in your source and use dbt to build an incremental table where you want all status changes to be covered… However, if you do not need data every other minute, you may want to dial down the frequency, as this will raise the cost for your data warehouse infrastructure. If you go for a daily sync (given that use cases and amount of data allow for that) you even have the possibility to set the hour at which data gets synced. If feasible, set the hour to be the same for all sources so that your warehouse compute resource on snowflake only starts for just one and not multiple timeframes. Make sure to have your warehouse connected to fivetran auto-suspend in snowflake

  • The sizing of snowflake warehouses can also have a major impact on synchronization costs. In most use cases, we see that an XS warehouse is absolutely sufficient for running connector syncs. Make your way up from there, if it should not suffice

Conclusion: Many MAR optimization options, mostly depending on individual setups

As you can see there is multiple ways to optimize your MAR and associated cost with fivetran to improve your cost-value relation in this setup. However, there might be even more optimizations that can be found when evaluating your account. It all starts with a good analysis of what data is actually used and which tables incur the most MAR. Bringing this together can be challenging, as it might require multiple alignments with business users and other data teams, but we have regularly seen significant savings in company’s stacks. If you want to challenge your current setup with a trusted fivetran partner, feel free to reach out to use and we are happy to jump on a call with you.

Latest Posts

Are you facing similar challenges?

We would be happy to discuss ways we can best assist you. Do not hesitate to book a free consultation at a date of your choice!