Fivetran MAR Optimization
Many components of the modern data stack use consumption-based pricing. Since most of these tools offer free trials, you should make the most of the trial to estimate what your infrastructure will cost. But even once your Fivetran setup is up and running, there are often ways to save several hundred Euros or more by optimizing which data you consume and how you set up your ingestion infrastructure. In this post we look at Fivetran in conjunction with a Snowflake warehouse.
The Pricing Model of Fivetran: Consumption-Based
Fivetran pricing is based on your actual monthly data consumption. Any lever you have to control or reduce this volume therefore has an immediate impact on your costs; it also pays to question regularly what data is really needed in the first place. And since, on top of Fivetran's direct cost, you pay for the compute resources that load the data (in this case Snowflake warehouses), reducing the synced volume can lower these secondary costs as well.
How is consumption measured?
In the case of Fivetran, pricing is based on a metric called MAR (monthly active rows). The concept can be tricky to grasp at first, so let's walk through it (Fivetran's own documentation is of course a good resource as well):
Say you have a table with one million rows. When you first set up your connector, this table is synced completely in a historical sync.
Suppose the table then grows by 10,000 rows daily. All of these newly added rows that get synced are counted as MAR.
Within the table, some rows get updated during the month. Any row that is updated once or more in a month counts once as a MAR: a row that was created on the 2nd of a month and updated five times within that same month is counted as one MAR for that month.
There are some notable exceptions, the biggest being file connectors (file folders, SFTP/FTP,…). Here you pay for the maximum number of rows in each file that Fivetran syncs in a month, and you are charged for the full table every month. Using lots of files as sources can therefore bloat the cost (see https://fivetran.com/docs/pricing#fileconnectors).
Some connectors use sync strategies that regularly re-sync an entire table, because there is no tracking mechanism in place for Fivetran to understand which rows are new or have changed. An overview of these edge cases is summarized here: https://fivetran.com/docs/pricing#connectorspecificfunctionaldifferences
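To make the counting rules above concrete, here is a small Python sketch. It is a simplified model of the MAR logic, not Fivetran's actual metering code:

```python
def monthly_active_rows(events):
    """Count MAR for a single table in a single month.

    `events` is a list of (row_id, operation) tuples recorded during
    the month, where operation is "insert" or "update". A row counts
    exactly once per month, no matter how many times it is inserted
    or subsequently updated.
    """
    return len({row_id for row_id, _ in events})

# A row created and then updated five times still counts once;
# a second, untouched-after-insert row adds one more MAR.
events = [(1, "insert")] + [(1, "update")] * 5 + [(2, "insert")]
print(monthly_active_rows(events))  # -> 2
```

The historical sync from the example maps onto this model as one million "insert" events in the first month, i.e. one million MAR.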
If you use Fivetran, you should know how to track your MAR. With the right permissions, you can inspect it on a per-connector and per-table level by clicking on “Account settings” -> “Billing and Usage” and selecting the usage tab. Another option is to use the Fivetran log connector to write information on all connectors to your data warehouse / data lake and then build custom reporting on top of it. We regularly support our clients in setting up cost reporting across their whole infrastructure.
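Once the log data lands in your warehouse, a first custom report can be as simple as aggregating per-table MAR figures by connector. The sketch below assumes you have already fetched (connector, table, mar) tuples from the schema the log connector writes; the exact table and column names there depend on your setup, so treat them as placeholders:

```python
from collections import defaultdict

def mar_by_connector(rows):
    """Aggregate per-table MAR figures into a per-connector report,
    sorted by cost impact (highest total MAR first).

    `rows`: iterable of (connector, table, mar) tuples, e.g. fetched
    from the warehouse schema populated by the Fivetran log connector.
    """
    totals = defaultdict(int)
    for connector, _table, mar in rows:
        totals[connector] += mar
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

rows = [
    ("google_ads", "hourly_report", 900_000),
    ("google_ads", "daily_report", 30_000),
    ("erp", "products", 150_000),
]
print(mar_by_connector(rows))
# -> [('google_ads', 930000), ('erp', 150000)]
```

Sorting by total MAR immediately surfaces the "big spenders" worth investigating first.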
Several approaches can lead to a more cost-efficient Fivetran setup:
On the connector level, understand which tables within a connector account for the most MAR and whether they are actually used downstream for reporting. An example that regularly saves a lot of MAR is switching off hourly reports in the Google Ads connector: they generate most of the MAR in this connector, yet most companies only evaluate their ad performance on a daily level. Go through the statistics connector by connector and take the time to talk to the users involved in downstream processes, to see whether the “big spenders” are actually in use. If you use tools like dbt or Coalesce, their data lineage features can show whether these tables appear in your data modeling at all.
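With dbt, one way to spot candidate tables is to parse the target/manifest.json artifact and list sources that no model depends on. This is a sketch under the assumption that your manifest follows dbt's documented artifact structure (keys can vary between dbt versions, so verify against your own manifest):

```python
import json

def unused_sources(manifest_path):
    """Return dbt source IDs that no model references.

    Reads dbt's target/manifest.json: sources are keyed like
    "source.<project>.<source_name>.<table>", and each node lists
    its upstream dependencies under depends_on.nodes.
    """
    with open(manifest_path) as f:
        manifest = json.load(f)

    all_sources = set(manifest.get("sources", {}))
    referenced = set()
    for node in manifest.get("nodes", {}).values():
        referenced.update(
            dep
            for dep in node.get("depends_on", {}).get("nodes", [])
            if dep.startswith("source.")
        )
    return sorted(all_sources - referenced)
```

Any source this returns is a candidate for deactivating in Fivetran, after confirming with downstream users that nothing outside dbt reads it.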
MAR does not depend on whether a row is meaningfully updated in the source, but on whether Fivetran sees a change. Let me elaborate with a concrete example from our consulting work: a table that held all the product information in our client's ERP system. This table was needed downstream in multiple places to join in product information. However, within this table there was a column that was updated every 5 minutes…