We now know why it is imperative for modern businesses to be omnipresent and, in a sense, omnipotent. Our interest here is what this demands of the IT infrastructure of an ever-expanding business. Obviously, the number of machines, servers, databases, and people interacting with the business in one way or another grows many-fold.
Of course, the interactions themselves may not be fully valuable on their own; but in how they unfold – where they traverse and where they end – lies data that can shed a lot of light on what is happening across the various customer touch-points of a business.
The secret recipe for a bright future
Data, if managed optimally, can yield precious insights that help a business understand its market better – both current and future – and serve it better. But this open secret is not a ready-made recipe for success. Effective data management is not only about storing large volumes of data, but about doing so in a structured, readily consumable manner.
The expanding and diversifying nature of business, needless to say, necessitates scalable and flexible IT infrastructure suited to the type, quality, and extent of the business's interaction with its customers. This means a lot of data flowing in from multiple sources – often a mixture of complex and unstructured data that doesn't fit neatly into tables.
The catalyst was brewing
So, it went on like this for quite some time. But in due course, the efforts of Google, Yahoo! and, eventually, Apache led to the advent of Hadoop. Hadoop opened up a world of possibilities for businesses with large chunks of data to work with.
However, since this huge volume of data is also likely to be scattered across many databases of differing types, it remained a cumbersome job for businesses to delve sufficiently into the data points and put the information lying within them to fruitful use.
In all of this, the power and utility of Hadoop were not diminished – they simply remained not fully harnessed. Its capacity as the super-catalyst was never in question and, perhaps, never can be.
Data ingestors / transformers came in to help
But the question and, hence, the problem of too much data from too many sources still persisted. IT infrastructures often seemed to have bitten off more than they could chew.
So what was required was a data ingestion / transformation tool that would seamlessly integrate big data (the huge volume of data generated by customers' many transactions with the business across its various touch-points) with on-premises as well as cloud databases and data warehouses such as Oracle, SQL Server, MySQL, DB2, and SAP HANA.
How about integration with other databases?
The data ingestion or transformation tool should also be able to do a lot more, such as integrating seamlessly with complex databases like SAP HANA, relational databases like Teradata and Netezza, and NoSQL databases like MongoDB and Cassandra.
How about the extraction capabilities?
Yeah, that one too. The tool should also be able to extract table data from source database servers into HDFS, Hive, and HBase.
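The core of this extraction pattern is easy to sketch. The following toy example – using SQLite as a stand-in for the source database and a local directory as a stand-in for an HDFS staging path, with an invented `orders` table – shows how table data gets dumped into chunked "part" files, the form in which ingestion tools typically stage data before loading it into HDFS or Hive. This is an illustrative sketch, not the implementation of any particular tool:

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

def extract_table(conn, table, target_dir, rows_per_file=2):
    """Dump a source table into CSV 'part' files, the way ingestion
    tools stage table data before loading it into HDFS/Hive."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    cur = conn.execute(f"SELECT * FROM {table}")
    header = [d[0] for d in cur.description]
    part, rows = 0, cur.fetchmany(rows_per_file)
    while rows:
        with open(target / f"part-{part:05d}.csv", "w", newline="") as f:
            w = csv.writer(f)
            w.writerow(header)
            w.writerows(rows)
        part += 1
        rows = cur.fetchmany(rows_per_file)
    return part

# Demo with an in-memory source database and a temporary "staging" dir.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 9.5), (2, 20.0), (3, 7.25)])
staging = tempfile.mkdtemp()
n_parts = extract_table(conn, "orders", staging)
print(n_parts)  # 3 rows at 2 rows per file -> 2 part files
```

In a real deployment the part files would land on HDFS and be registered as a Hive table; chunking matters because it lets the transfer run in parallel across mappers.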
If one asked the personnel who run such huge, widespread IT infrastructures, one would readily hear the following needs (not wants, mind you!):
- Facilitate conditional data transfer of table content to Hadoop
- Provide a scheduler for data transfer jobs
- Allow batch-wise execution of jobs across multiple databases
- Capture data changes
- And, of course, allow easy monitoring of job status as well as termination of unwanted jobs.
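Of the needs above, change capture is the one most worth unpacking. A common approach – the "incremental append" style used, for example, by Apache Sqoop's `--incremental append` mode with `--check-column` and `--last-value` – is to remember a watermark on a monotonically increasing column and pull only rows beyond it on each run. The sketch below, with an invented `events` table on SQLite, is an illustration of that idea rather than any vendor's implementation:

```python
import sqlite3

def incremental_pull(conn, table, check_column, last_value):
    """Fetch only rows whose check_column exceeds last_value --
    the 'incremental append' flavor of change capture."""
    cur = conn.execute(
        f"SELECT * FROM {table} WHERE {check_column} > ? "
        f"ORDER BY {check_column}",
        (last_value,),
    )
    rows = cur.fetchall()
    # The check column is the first column selected, so the new
    # watermark is the last row's first field.
    new_last = rows[-1][0] if rows else last_value
    return rows, new_last

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])

batch1, watermark = incremental_pull(conn, "events", "id", 0)
conn.execute("INSERT INTO events VALUES (4, 'd')")  # new row arrives
batch2, watermark = incremental_pull(conn, "events", "id", watermark)
print(len(batch1), len(batch2), watermark)  # 3 1 4
```

A scheduler then simply re-runs `incremental_pull` with the stored watermark, which also covers the conditional-transfer need: the `WHERE` clause is the condition.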
Yeah, keep 'em coming
The data ingestion or transformation tool should further facilitate near-line storage, cold-data archiving, tiered storage, and the creation of a data lake.
Why all these?
- First, to bring data into Hadoop HDFS and Hive so that Hadoop can serve as near-line storage.
- Second, to bring cold data into Hadoop HDFS and Hive, where it can be archived and used for analytics.
- Third, so that Hadoop HDFS can serve as tiered storage for the data the tool brings in.
- And finally, by connecting to several data sources and bringing their data into Hadoop HDFS and Hive, the tool should facilitate the creation of a data lake fed from multiple sources.
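What distinguishes a data lake from a pile of files is a consistent layout keyed by source and table, so that later consumers can find and register the data. The sketch below lands two toy "source systems" (an invented CRM and billing database, on SQLite) into a `<source>/<table>/` directory layout of the kind commonly used on HDFS; it is a minimal illustration, not a prescribed standard:

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

def land_in_lake(lake_root, source_name, conn, table):
    """Write one source table into a per-source, per-table directory,
    mirroring the <source>/<table>/ layout of a file-based data lake."""
    target = Path(lake_root) / source_name / table
    target.mkdir(parents=True, exist_ok=True)
    cur = conn.execute(f"SELECT * FROM {table}")
    with open(target / "part-00000.csv", "w", newline="") as f:
        w = csv.writer(f)
        w.writerow([d[0] for d in cur.description])
        w.writerows(cur.fetchall())

# Two toy source systems feeding one lake.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.execute("INSERT INTO customers VALUES (1, 'Acme')")
billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoices (id INTEGER, total REAL)")
billing.execute("INSERT INTO invoices VALUES (10, 99.0)")

lake = tempfile.mkdtemp()
land_in_lake(lake, "crm", crm, "customers")
land_in_lake(lake, "billing", billing, "invoices")
paths = sorted(p.relative_to(lake).as_posix()
               for p in Path(lake).rglob("*.csv"))
print(paths)  # ['billing/invoices/part-00000.csv', 'crm/customers/part-00000.csv']
```

On Hadoop, each such directory can then be exposed as a Hive external table, which is what turns the landed files into a queryable lake.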
So, are there any such tools or offerings that measure up to these requirements?
Frankly, very few – and not all of them live up to the demands of optimal big data management and utilization. There is, however, our product HCube, which promises to meet most of these requirements. So, anyone looking to address the above requirements may well explore HCube and let us know whether it is indeed an effective tool for achieving optimal data management and accelerated analytics.