Storing, querying and managing large masses of unstructured data is how many businesses operate today. The term “Big Data” has morphed and evolved to describe the logging and utilization of continuous data samples to generate forward-looking operational initiatives. Whether in global retail, investment banking, or healthcare, the ability of data to optimize businesses and change lives rests on an organization’s ability to utilize data lakes and manage large file transfers. The promise of an unstructured data lake is that you can rapidly identify correlations and covariance between data entities that you didn’t know existed, largely because the middleware now exists to support data scattered across repositories that were once segmented.
Key customer personas and trends in banking, healthcare and the online consumer market are blazing the trail for a bright future in big data with the data lake approach, but what does a data lake look like in a typical organization, and how is it supported? Additionally, and perhaps more notably, how is it changing?
While perspectives vary slightly, a data lake can be as simple as taking an enterprise cluster and moving your data into a Java-based file system. In practice, however, it is much more complex than that. Given the massive influx of data, and the increasing file size of data sets in general, global enterprises have numerous data repositories and are scrambling to create more each day. Also, the efficacy and quality of the data, whether it sits in warehouses, operational data stores, or applications and files such as Excel or log files, must be tested and validated for the data to be of any use. If you also add stream data, such as data created by sensors and mobile devices, the picture becomes less clear and the solution more complex.
In the end, achieving data optimization comes down to the management and virtualization of file transfers within, and between, large enterprise organizations.
The data lake rests above operational databases, enterprise warehouses, system application files, and “Big Data” structures like Hadoop or NoSQL, and is responsible for querying data without duplication for virtualization purposes. BI and reporting tools essentially read these evolving, expanding and disparate data sources as one logical data lake. But the challenge for IT operations and other large-enterprise employees looking to harness the power of unstructured data lies in their ability to pull and use data in their daily business functions. For example, can they easily and reliably extract and send large files between departments, locations and partners? Are organizations losing files in transfer? How much time is wasted extracting and organizing data to construct actionable business insights? Without a digital pathway to make sense of the data lake, global businesses are bogged down and bottlenecked by their data-scientist resources. This is why managed file transfer (MFT) solutions are rising, once again, as a primary construct that helps businesses scale to meet the demands of the growing and complex data lake.
You wouldn’t fill a lake using a teaspoon
Today, large organizations use several methods to feed their data lake, depending on the nature and origin of each data set. Data arrives either as a stream of events or as big files. Loading one big file certainly seems easier than taking care of millions of individual events. However, managing transfers of very large files is not that common. The matters of concern are scalability, data integrity, transfer acceleration and checkpoint/restart.
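To make two of those concerns concrete, here is a minimal sketch of checkpoint/restart and data integrity for a local file copy. It is an illustration under stated assumptions, not how any particular MFT product works: the function names, the 1 MiB chunk size, and the choice of the partial file's size as the checkpoint are all hypothetical.

```python
import hashlib
import os

CHUNK_SIZE = 1024 * 1024  # 1 MiB per chunk; an arbitrary illustrative value


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read chunk by chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resumable_copy(src, dst, chunk_size=CHUNK_SIZE):
    """Copy src to dst, resuming from whatever dst already contains."""
    # Checkpoint (assumption): the size of the partial destination file
    # tells us where to restart after an interruption.
    offset = os.path.getsize(dst) if os.path.exists(dst) else 0
    with open(src, "rb") as fin, open(dst, "ab") as fout:
        fin.seek(offset)
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
    # Integrity: the transfer only counts as done if both digests match.
    return sha256_of(src) == sha256_of(dst)
```

If the copy is interrupted, calling `resumable_copy` again with the same arguments continues from the last byte written instead of starting over, and the final digest comparison catches a corrupted or mismatched partial file.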
Your Business is already producing data – use it!
There is a good chance that you need to move business data from a business application (e.g. billing, ordering, manufacturing) to your data lake. It would be great to identify correlations and operational insights in order to optimize these business functions. The best way to achieve this is to change the source application so that it generates a file on a regular schedule, or on a triggered event, and pushes the resulting file to the data lake. In other words, it’s automated data logging that helps your business become smarter and more connected.
From this point, all you need to do is answer these questions:
- Is your team managing the data lake / DB aware of existing file transfers?
- Is your team managing the data lake / DB able to leverage existing integration programs rather than developing, maintaining and operating new ones?
- How will your team manage application changes?
File transfers are everywhere, but what is in the files?
It is very likely your IT organization is moving tons of files every single day. That is how your business runs, just like the 32% of businesses that rely on file transfers to support critical business functions. The reality behind that stat is that file transfers occur, and are operated and managed, between two servers with no knowledge of the content of the file, the criticality of the transfer, or how that file impacts the business. This leads to redundant file transfers and an inability to monitor or trace them.
Large Files Mean Long Transfer Times, Right?
The ultimate aim of a data lake is to serve as a repository for all data in your ecosystem. As a global enterprise, you will very likely have to deal with big files and long distances in your transfer operations daily. In these situations, a file of 1GB or larger may take hours to be transferred, delaying business. A robust managed file transfer solution changes this by opening up the floodgates to large file transfer capabilities. Here’s how.
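A rough back-of-the-envelope model shows why distance hurts and why acceleration helps. A single TCP stream can send at most one window of data per round trip, so its throughput is bounded by window size divided by round-trip time, however fast the link is. One common acceleration technique is to run several streams in parallel. The sketch below is a deliberately simplified model (it ignores packet loss, slow start and window scaling), and the window, latency and link figures are illustrative assumptions, not measurements or a description of any specific MFT product.

```python
def single_stream_throughput(window_bytes, rtt_seconds):
    """Upper bound for one TCP stream: at most one window per round trip."""
    return window_bytes / rtt_seconds


def transfer_seconds(file_bytes, window_bytes, rtt_seconds,
                     link_bytes_per_s, streams=1):
    """Estimate transfer time; parallel streams scale until the link is full."""
    rate = min(link_bytes_per_s,
               streams * single_stream_throughput(window_bytes, rtt_seconds))
    return file_bytes / rate


GIB = 1024 ** 3
# Illustrative assumptions: 64 KiB window, 200 ms round trip,
# 1 Gbit/s link (125,000,000 bytes per second).
one = transfer_seconds(GIB, 64 * 1024, 0.2, 125_000_000, streams=1)
many = transfer_seconds(GIB, 64 * 1024, 0.2, 125_000_000, streams=16)
print(f"1 stream: {one / 60:.1f} min, 16 streams: {many / 60:.1f} min")
```

Under these assumptions, a 1 GiB file takes roughly 55 minutes over a single stream and under 4 minutes over 16 parallel streams, even though the link itself never changed. Larger files scale linearly, which is how a single long-distance transfer stretches into hours.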
It is not only about filling the data lake; it’s about using it.
Finally, business analysts require the ability to download big sets of data for further investigation and data modeling. This creates constant demand for transfer acceleration, reliability and security of the source data in question. Data analysts will periodically extract some sets of data to explore them. Sometimes this may be a huge data sample that requires a high level of reliability to support the transfer. There are also questions about the rights associated with each data sample. How are the rights associated with the files managed? Who is tracking that the data is secured? Who is tracking that a file has been downloaded?
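Those three questions map onto two simple mechanisms: a permission check before each download and an audit record for every attempt. The sketch below is a hypothetical in-memory stand-in, assuming a plain mapping from data set names to allowed users. The class name, field names and placeholder payload are invented for illustration and do not reflect any real product’s API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditedStore:
    """Hypothetical in-memory stand-in for a data lake access layer."""
    permissions: dict  # data set name -> set of users allowed to download
    audit_log: list = field(default_factory=list)

    def download(self, user, dataset):
        allowed = user in self.permissions.get(dataset, set())
        # Record every attempt, allowed or not, so downloads stay traceable.
        self.audit_log.append({
            "user": user,
            "dataset": dataset,
            "allowed": allowed,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        if not allowed:
            raise PermissionError(f"{user} may not download {dataset}")
        return f"contents of {dataset}"  # placeholder for the real payload
```

Logging refused attempts alongside successful ones is a deliberate choice here: it answers both “who downloaded this file?” and “who tried to?”.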
How a company manages data and dataflow defines how well it is able to capitalize on it. The data lake is essentially the resource that is driving increased customer engagement and business efficiency in any vertical industry where file transfer and data management are critical. How well data can be gathered, shared, and secured will separate the winners from the losers in global enterprise.
MFT: ready for digital
 Re-Thinking Your Data Flow Transmission for Security Risk and Compliance – © Ovum report October 2014