Build data pipelines without code or extra tools
DataFactory is everything you need for data integration and building efficient data pipelines
Transform raw data into information and insights in ⅓ the time of other tools
DataFactory is made for
self-service data ingestion
streaming
real-time replication
transformations
cleansing
preparation
wrangling
machine learning modeling
Build Pipelines Visually
The days of writing pages of code to move and transform data are over. Drag data operations from a tool palette onto your pipeline canvas to build even the most complex pipelines.
- Palette of data transformations you can drag to a pipeline canvas
- Build pipelines in minutes that would otherwise take hours of coding
- Automate and operationalize with built-in approval and version-control mechanisms
Data Wrangling & Transformations
Data wrangling used to require one tool, pipeline building another, and machine learning yet another. DataFactory brings these functions together, where they belong.
- Perform operations easily using drag-and-drop transformations
- Wrangle datasets to prepare for advanced analytics
- Add & operationalize ML functions like segmentation and categorization without code
Complete Pipeline Orchestration
With DataFactory, you’ve got complete control over your pipelines — when they ingest, when they load targets — everything. It’s like your personal data pipeline robot doing your bidding.
- Perform all ETL, ELT, and lakehouse operations visually
- Schedule pipeline operations & receive notifications and alerts
Click into Details
DataFactory doesn’t sacrifice power for ease of use — open the nodes you’ve added to fine-tune operations exactly how you want them to perform. Use code or tune nodes using the built-in options.
- Click on any pipeline node to see the details of the transformation
- Adjust parameters with built-in controls or directly using SQL
- Analyze and gain insights into your data using visualizations and dashboards
Model and maintain an easily accessible cloud Data Lake
If you’re worried about standing up a data lake just for analytics, worry no more. DataFactory includes its own data lake to save you time and money. No need to buy yet another tool just for analytic data storage.
- Use for cold and warm data reporting & analytics
- Save costs vs buying a separate data lake platform
Hundreds of Connectors
Batch data? Buy a tool. Streaming data? Buy another tool. That’s so 2018. DataFactory connects to any data, anywhere, without having to buy YAT (yet another tool).
- Legacy databases
- Modern cloud databases
- Cloud ERPs
- REST API sources (streaming / batch)
- Object storage locations
- Flat files
Whatever you need for your data engineering,
DataFactory is there for you
Capability Matrix
Data sources: Connect, Discover, Ingest, Processing, Load
Use cases
What can you do with DataFactory?
Quick Insight on Datasets
Use DB Explorer to query data for rapid insights, powered by the Spark SQL engine
Save results as datasets to reuse as sources for pipeline building, data wrangling, and ML modeling
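To make this concrete, here is a minimal sketch of the kind of query such a Spark SQL exploration involves, written in plain PySpark outside the product; the file path, table, and column names are hypothetical.

    # Sketch only: a Spark SQL quick-insight query, illustrated outside DataFactory.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("quick-insight").getOrCreate()

    # Register a raw source as a temporary view so it can be queried with SQL.
    spark.read.parquet("/data/raw/orders").createOrReplaceTempView("orders")

    # Rapid insight: top customers by revenue over the last 30 days.
    insight = spark.sql("""
        SELECT customer_id, SUM(amount) AS revenue
        FROM orders
        WHERE order_date >= date_sub(current_date(), 30)
        GROUP BY customer_id
        ORDER BY revenue DESC
        LIMIT 10
    """)
    insight.show()

    # Persist the result as a dataset for reuse in pipelines, wrangling, or ML modeling.
    insight.write.mode("overwrite").saveAsTable("analytics.top_customers_30d")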
Easy Data Preparation
Use transformation nodes included in the tool palette to analyze the grain of the data and distribution of attributes
Use null and empty-record counts, value statistics, and length statistics to profile source data for efficient join and union operations
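As an illustration of the checks these nodes automate, here is a minimal profiling sketch in plain PySpark; the dataset path and column names are hypothetical.

    # Sketch only: null/empty counts, value statistics, length statistics, and a
    # grain check, the kind of profiling done before joins and unions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("profile").getOrCreate()
    df = spark.read.parquet("/data/raw/customers")  # hypothetical source

    profile = df.select(
        # Null / empty-record count for a text attribute.
        F.count(F.when(F.col("email").isNull() | (F.col("email") == ""), 1)).alias("email_null_or_empty"),
        # Value statistics for a numeric attribute.
        F.min("lifetime_value").alias("ltv_min"),
        F.max("lifetime_value").alias("ltv_max"),
        F.avg("lifetime_value").alias("ltv_avg"),
        # Length statistics for the join key.
        F.min(F.length("customer_id")).alias("key_len_min"),
        F.max(F.length("customer_id")).alias("key_len_max"),
        # Grain check: distinct keys vs. total rows.
        F.countDistinct("customer_id").alias("distinct_keys"),
        F.count(F.lit(1)).alias("total_rows"),
    )
    profile.show()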
Query-based CDC
Identify and consume changed data from source databases into downstream staging and integration layers in Delta Lake
Minimize impact on source systems by identifying incremental changes with variable-enabled, timestamp-based SQL queries
Incorporate only the latest changes into your data warehouse for near-real-time analytics and cost savings
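Conceptually, one query-based CDC cycle looks like the sketch below, written in plain PySpark with Delta Lake rather than DataFactory’s own nodes; the connection details, table names, and the last_load_ts variable are hypothetical.

    # Sketch only: a timestamp variable from the previous run limits the extract
    # to new changes, which are then merged into a Delta Lake staging table.
    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.appName("query-cdc").getOrCreate()

    last_load_ts = "2024-01-15 00:00:00"  # carried over from the previous run

    # Incremental, timestamp-based extract keeps the load on the source small.
    changes = (spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://source-db:5432/sales")
        .option("query", f"SELECT * FROM orders WHERE updated_at > '{last_load_ts}'")
        .option("user", "etl").option("password", "***")
        .load())

    # Merge only the latest changes into the Delta Lake staging layer.
    (DeltaTable.forName(spark, "staging.orders").alias("t")
        .merge(changes.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())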
Log-based CDC
Deliver real-time data streaming by reading database logs to identify continuous changes to source data
Optimize efficiency with background processes that scan database logs and capture changed data without impacting source transactions
Easily configure and schedule using built-in log-based CDC tools
Anomaly Detection
Quickly perform data pre-processing and cleansing to give the learning algorithm a meaningful training dataset
Cap anomalies using built-in rule sets and set guardrails to achieve a higher percentage of quality data
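A common guardrail is an interquartile-range rule; the sketch below shows the idea in plain PySpark, clipping outliers instead of dropping rows. The dataset and column names are hypothetical.

    # Sketch only: cap values outside the 1.5 * IQR guardrails so the downstream
    # learning algorithm trains on a cleaner dataset.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("anomaly-cap").getOrCreate()
    df = spark.read.table("staging.transactions")  # hypothetical dataset

    # Derive guardrails from the data itself.
    q1, q3 = df.approxQuantile("amount", [0.25, 0.75], 0.01)
    lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)

    # Cap anomalies rather than deleting records, preserving row counts.
    cleaned = df.withColumn(
        "amount",
        F.when(F.col("amount") > upper, upper)
         .when(F.col("amount") < lower, lower)
         .otherwise(F.col("amount")),
    )
    cleaned.write.mode("overwrite").saveAsTable("curated.transactions")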
Push-down Optimization
Use push-down-enabled transformation nodes so that transformation logic executes in the source or target database
Easily configure push-down operations within transformation nodes
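The sketch below illustrates the push-down idea in plain PySpark rather than DataFactory’s nodes: the aggregation is expressed as SQL that the source database executes, so only summarized rows travel over the network. Connection details and names are hypothetical.

    # Sketch only: without push-down, Spark would read the whole table and then
    # aggregate; with push-down, the GROUP BY runs inside the source database.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pushdown").getOrCreate()

    daily_revenue = (spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://source-db:5432/sales")
        .option("query", """
            SELECT order_date, SUM(amount) AS revenue
            FROM orders
            GROUP BY order_date
        """)
        .option("user", "etl").option("password", "***")
        .load())

    daily_revenue.write.mode("overwrite").saveAsTable("analytics.daily_revenue")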
The team was so excited that we were able to do it in a fraction of the time and so effectively.
$940K saved annually by automating data quality across nine data sources.
14 FTEs saved through automation. 60% reduction in time needed to test data.