Azure Data Lake Ingestion
One of the services used to ingest data into an Azure Data Lake is Azure Data Factory.
But when building a robust Data Lake Environment, you can blend Azure services together to produce a much richer and more vibrant solution.
Our approach to data ingestion considers the data source and the volatility of the data, from static to real-time.
By using the right tool (service) for each source data set, we can capitalize on the Azure services available to us.
These services, along with some of the tooling we provide in our .Net / .Net Core framework (emFramework), allow us to deliver a solution that is truly enterprise class.
Ingesting data into an Azure Data Lake isn't complex, but it requires a strong understanding of the source data, the Azure architecture, and the outcome expected.
However, when implementing each ingestion process it is important to ensure that several organizational needs are taken into consideration.
The performance needs of real-time data ingestion differ markedly from those of static ingestion, yet too many implementations take the same approach for both.
As each source data set is implemented, an explicit design consideration for performance is a must.
To ensure the Data Lake Environment is functioning on a minute-to-minute basis, and ready to support the organization's objectives, a strong approach to monitoring data ingestion is critical.
Implementing an event-centric ingestion architecture (the "Information Pipeline") allows for a rich set of options in a publish-subscribe orientation.
Our approach allows for a rich set of monitoring and alerting solutions otherwise unavailable or ignored within a typical implementation.
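As a minimal sketch of the publish-subscribe orientation, the following shows how monitoring and alerting subscribers can attach to the same ingestion events the loading logic emits. The bus, event names, and handlers here are hypothetical illustrations; a real deployment would typically use a managed eventing service such as Azure Event Grid or Event Hubs rather than an in-process class.

```python
from collections import defaultdict

class IngestionEventBus:
    """Minimal in-process publish-subscribe bus for ingestion events
    (illustrative stand-in for a managed eventing service)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        """Register a handler for a given event type."""
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        """Deliver an event to every subscriber of its type."""
        for handler in self._subscribers[event_type]:
            handler(payload)

alerts = []
bus = IngestionEventBus()
# Monitoring and alerting subscribe alongside the loading logic,
# so observability comes for free with each ingestion event.
bus.subscribe("file_landed", lambda e: alerts.append(f"ingesting {e['path']}"))
bus.subscribe("load_failed", lambda e: alerts.append(f"ALERT: {e['path']} failed"))

bus.publish("file_landed", {"path": "raw/sales/2024-01-01.csv"})
bus.publish("load_failed", {"path": "raw/sales/2024-01-02.csv"})
```

Because subscribers are decoupled from publishers, new monitoring or alerting consumers can be added without touching the ingestion code itself.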
By implementing an event-centric data ingestion architecture, the same pipeline can be used for standard data loading and for data lake initialization (bulk data loading), with the key difference being the scale applied to each class of load.
In taking an event-centric approach, the standard data load can be tested as part of repository initialization, and the scaling capabilities of the Data Lake Environment are exercised during that same initialization.
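The idea that one pipeline serves both load classes can be sketched as a single ingestion routine whose only variable is scale. The function name, the parallelism parameter, and the batching scheme below are hypothetical; they stand in for whatever concurrency mechanism the actual pipeline uses.

```python
def run_ingestion(files, parallelism):
    """Run the same ingestion steps for any class of load;
    only the scale (degree of parallelism) differs."""
    # Hypothetical batching: group files so `parallelism` of them
    # would be copied concurrently in a real implementation.
    batches = [files[i:i + parallelism] for i in range(0, len(files), parallelism)]
    loaded = []
    for batch in batches:
        # Stand-in for the actual copy/load step.
        loaded.extend(f"loaded:{name}" for name in batch)
    return loaded

# Standard daily load: small, low parallelism.
daily = run_ingestion(["sales.csv"], parallelism=1)

# Bulk initialization: many historical files, higher parallelism,
# exercising the same code path at greater scale.
initial = run_ingestion([f"hist_{y}.csv" for y in range(2015, 2024)], parallelism=8)
```

Because both loads flow through the identical routine, validating the bulk initialization also validates the standard load and the environment's ability to scale.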