Self-Service Data Ingestion
Self-service BI has become mainstream in the enterprise. BI tools can connect to and visualize data located in disparate systems; their primary purpose is to help businesses realize value from enterprise data-marts. The traditional cycle of discovering, planning, budgeting, allocating, and implementing data-marts within enterprises takes time and, consequently, delays value realization. With fairly-priced hardware now available in the market, paired with proven, enterprise-grade open-source technologies that offer virtually unlimited data storage and processing options, it is only logical to continue down the path of self-service data ingestion.
Data integration, in its simplest form, involves retrieving data from a source, applying single- or multi-step transformations, and saving the result in a target system. Enterprises leverage a variety of secondary data-stores to replicate transactional data for analytics, historical statistics, or other purposes. Keeping primary and secondary data-stores synchronized has traditionally meant building custom solutions, which requires a substantial investment of time and money. To address this problem, the Kafka ecosystem introduced a framework called ‘Kafka Connect’. The framework addresses only the data extraction / load aspect of the use-case; if an enterprise needs transformation or mediation, Apache Spark or other middleware technologies have to be brought in, with the transformed data fed back into Kafka before it is saved in the target system. For additional details on the system design, please refer to the Confluent documentation.
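The single- or multi-step transformation mentioned above can be sketched, in isolation from any particular framework, as a chain of per-record functions. This is a minimal illustration; the record shape, field names, and step functions are assumptions, not part of Kafka Connect or any specific product:

```python
# Minimal sketch of a multi-step record transformation. The record shape
# and the individual steps below are illustrative assumptions.

def mask_email(record):
    """Step 1: mask personally identifiable information."""
    user, _, domain = record["email"].partition("@")
    return {**record, "email": user[0] + "***@" + domain}

def add_full_name(record):
    """Step 2: derive a new field from existing ones."""
    return {**record, "full_name": f'{record["first"]} {record["last"]}'}

def transform(record, steps):
    """Apply the given transformation steps to a record, in order."""
    for step in steps:
        record = step(record)
    return record

source_record = {"first": "Ada", "last": "Lovelace", "email": "ada@example.com"}
result = transform(source_record, [mask_email, add_full_name])
print(result["email"])      # a***@example.com
print(result["full_name"])  # Ada Lovelace
```

In a streaming pipeline, the same per-record logic would run inside the stream processor rather than as plain function calls.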
To put it simply, the framework (see image) connects to the data-source, retrieves data, and stores it in a Kafka topic. The stored data can then be consumed by one or more sinks and ingested, in parallel, into one or multiple secondary stores. If the data needs to be transformed or translated, a Spark stream or a similar technology can be leveraged for stream processing.
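Once a Connect worker is running, a source is registered by submitting a JSON connector configuration to the worker's REST API (port 8083 by default). The sketch below uses the FileStreamSource connector that ships with Kafka; the connector name, file path, and topic name are placeholder assumptions:

```json
{
  "name": "file-source-example",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/tmp/source-data.txt",
    "topic": "ingested-lines"
  }
}
```

POSTing this document to `http://<connect-host>:8083/connectors` creates the connector, after which each line appended to the file is published to the `ingested-lines` topic for downstream sinks to consume.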
MSRCosmos has built an in-house product along similar lines with added capabilities, and it is being successfully leveraged by multiple customers. The next release of the product will leverage Kafka Connect for data ingestion while keeping the core strengths of the product (i.e. analytics + machine learning) intact.
We leverage Docker for testing new products, technologies, and frameworks, for obvious reasons. Here is a sample Docker Compose file for kick-starting Kafka Connect:
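A minimal single-node sketch using the Confluent community images; the image versions, single-broker replication factors, and JSON converters are assumptions suitable only for local testing:

```yaml
version: "3"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:7.4.0
    depends_on: [zookeeper]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  connect:
    image: confluentinc/cp-kafka-connect:7.4.0
    depends_on: [kafka]
    ports:
      - "8083:8083"
    environment:
      CONNECT_BOOTSTRAP_SERVERS: kafka:9092
      CONNECT_REST_PORT: 8083
      CONNECT_REST_ADVERTISED_HOST_NAME: connect
      CONNECT_GROUP_ID: connect-cluster
      # Internal topics for connector configs, offsets, and status;
      # replication factor 1 because this is a single-broker test setup.
      CONNECT_CONFIG_STORAGE_TOPIC: connect-configs
      CONNECT_OFFSET_STORAGE_TOPIC: connect-offsets
      CONNECT_STATUS_STORAGE_TOPIC: connect-status
      CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_KEY_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      CONNECT_VALUE_CONVERTER: org.apache.kafka.connect.json.JsonConverter
```

With `docker-compose up`, the Connect worker's REST API becomes available on port 8083 for registering connectors.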
Happy Business Transformation!