Choosing storage for streaming large files in big data sets
Editor's note: In part two of our two-part series on big data analytics in the cloud, several cloud providers discuss the networking, storage and architectural challenges of supporting big data analysis in their environments. Don't miss the first part of this story: Service providers anticipate SMB demand for big-data cloud analytics.
Given the rigorous demands that big data places on networks, storage and servers, it's not surprising that some customers would outsource the hassle and expense to the cloud. Although cloud providers say they welcome this new business opportunity, supporting big data analysis in the cloud is forcing them to confront various, albeit manageable, architectural hurdles.
The elasticity of the cloud makes it ideal for big data analytics -- the practice of rapidly crunching large volumes of unstructured data to identify patterns and improve business strategies -- according to several cloud providers. At the same time, the cloud's distributed nature can be problematic for big data analysis.
"If you're running Hadoop clusters and things like this, they put a really heavy load on storage, and in most clouds, the performance of the storage isn't good enough," said Robert Jenkins, co-founder and chief technology officer of CloudSigma, a Zurich-based Infrastructure as a Service (IaaS) provider. "The big problem with clouds is making the storage perform to a level that enables this kind of computing, and this would be the biggest reason why some people wouldn't use the cloud for big data processing."
But Jenkins and other cloud providers emphasized that these challenges aren't insurmountable, and many providers already have plans to tweak their cloud architectures to improve the capacity, performance and agility of all their cloud services -- moves that they expect will also provide better support for big data in the cloud.
"It's the same thing that we're all dealing with as more companies adopt the cloud: How do we continue to supply for the demand?" said Joseph Corvaia, vice president of cloud computing at Evolve IP, a cloud provider based in Wayne, Pa. "But I don't know that we're necessarily doing anything differently than what we did before. [We're] just being very prudent about watching what's being consumed, maintaining the ratios of how quickly it's being consumed and adding capacity as needed, based on what we see the projections over a particular measured period."
Devising an architecture that supports big data analysis in the cloud is no more daunting than satisfying the rapidly growing appetite for cloud services in general, according to Henry Fastert, chief technologist and managing partner at SHI International, a large reseller, managed service provider (MSP) and cloud provider based in Somerset, N.J.
"Every day, as a cloud provider, especially at this point in time in the marketplace, I don't know if there's going to be some big [spike in] demand," Fastert said. "I had a situation recently where a micropayment gaming company asked me if I could add 2,000 eight-way virtual machines in a week. Fortunately, we could. We need to add capacity on a regular basis, but sometimes we need to add capacity in a very short period of time."
Cloud storage can drag down big data analysis
The cloud storage challenges in big data analytics fall into two categories: capacity and performance.
Scaling capacity, from a platform perspective, is something all cloud providers need to watch closely.
"Data retention continues to double and triple year-over-year because [customers] are keeping more of it. Certainly, that impacts us because we need to provide capacity," Corvaia said.
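To put rough numbers on that growth, the sketch below estimates how long a storage pool lasts when stored data multiplies by a fixed factor each year. It is only an illustration: the function name, the 400 TB and 1,600 TB figures and the assumption of smooth compound growth are hypothetical and do not come from Evolve IP.

```python
import math


def years_until_full(used_tb: float, capacity_tb: float, yearly_factor: float) -> float:
    """Years until used_tb reaches capacity_tb under compound growth.

    yearly_factor is the hypothetical multiplier per year
    (2.0 for doubling, 3.0 for tripling).
    """
    if used_tb <= 0 or yearly_factor <= 1.0:
        raise ValueError("need positive usage and a growth factor above 1")
    if used_tb >= capacity_tb:
        return 0.0
    # Solve used_tb * yearly_factor ** t = capacity_tb for t.
    return math.log(capacity_tb / used_tb) / math.log(yearly_factor)


# Hypothetical pool: 400 TB stored today, 1,600 TB installed.
doubling = years_until_full(400, 1600, 2.0)
tripling = years_until_full(400, 1600, 3.0)
print(f"doubling: {doubling:.2f} years, tripling: {tripling:.2f} years")
```

Under these assumptions, a pool with four times today's usage lasts about two years if data doubles annually and roughly 15 months if it triples, which is why providers track consumption continuously rather than planning once.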
Storage performance in a highly virtualized, distributed cloud can be tricky on its own, and the demands of big data analysis only magnify the issue, several cloud providers said.
SHI International's cloud strategy is built on the company's vCore model, its branding for a proprietary "finite collection of servers, storage and switching elements," replicated across SHI's cloud, Fastert said. The distributed storage architecture enables SHI to "really optimize the performance of our infrastructure because it's set up in that granular fashion," he said.
"Storage is something that's also impacted by specific types of virtualization loads, and so the way in which you spread tasks across your storage … will always impact your performance," he said. "The [vCore] model allows us to spread loads based on the characteristics of those loads, and so we constantly look at the characteristics of customers' loads across our vCore infrastructure … and then we do load balancing across them from a storage-performance point of view."
CloudSigma is one of several providers participating in the Helix Nebula consortium, an ecosystem of European cloud providers catering to scientific research organizations. Its customers include the European Space Agency (ESA), which is using CloudSigma's infrastructure to store enormous volumes of data collected from a new satellite launching next year, Jenkins said. The satellite, which will be pointed at Earth, will collect data about the environment, such as air temperatures and soil conditions, and stream that data back to the ESA's cloud in real time for analysis.
Big data customers like the ESA didn't drive CloudSigma to upgrade its storage, but they certainly benefit from it. The company upgraded its architecture to improve overall storage performance several months before pursuing its ecosystem strategy, Jenkins said, noting that making storage work well is "one of the hardest things" to accomplish in the cloud.
"When you have this multi-tenant environment and mix everyone's activities together, it tends to look more and more random," Jenkins said. "The magnetic discs are not good about jumping around because they go in loops, so the more random it is, the slower the performance becomes for the users. There's an inherent problem there, so that's why we wanted to move away from the systems we had to a system that was much more distributed and able to deal better with that kind of load."
Using a combination of open-source platforms and in-house development, CloudSigma built a tiered-storage architecture that makes more efficient use of a new distributed system of solid-state drives (SSDs) and magnetic storage, Jenkins said. The result is less variability and higher performance, as the data is spread out over 50 or 100 servers instead of being on one server, he said.
"We're taking the local storage that's on every server and making it into one big storage volume," Jenkins said. "It's kind of like a SAN [storage area network] except it's not a SAN."
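A minimal sketch can show why spreading a volume across many servers evens out load. It assumes simple round-robin striping of fixed-size blocks, which is a stand-in chosen for illustration; the article does not describe CloudSigma's actual placement scheme, and the function names here are hypothetical.

```python
from collections import Counter


def place_block(block_index: int, num_servers: int) -> int:
    """Map a logical block to a server using round-robin striping (assumed scheme)."""
    return block_index % num_servers


def load_per_server(block_indexes, num_servers: int) -> list:
    """Count how many of the requested blocks land on each server."""
    counts = Counter(place_block(b, num_servers) for b in block_indexes)
    return [counts.get(server, 0) for server in range(num_servers)]


# One tenant reading 1,000 consecutive blocks of a volume.
hot_blocks = range(5000, 6000)

on_one_server = load_per_server(hot_blocks, 1)
across_fifty = load_per_server(hot_blocks, 50)

print("busiest server, 1 server:  ", max(on_one_server))
print("busiest server, 50 servers:", max(across_fifty))
```

With one server, every request queues on the same disks. With 50, the same burst is divided evenly, so no single machine's drives have to absorb all of the seeks.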
Cloud networking and architecture considerations
The challenges of supporting customers demanding big data analysis in the cloud don't end with storage. Cloud providers say the work requires a more holistic approach to the network and overall cloud architecture.
That means acknowledging when big data analysis isn't suited for the cloud, said Jonathan King, vice president of cloud solutions at Savvis. But that's also where having a complementary set of hosting services comes in handy, he said.
"You'll have pieces of the big data engine that are running full-tilt all the time, which means that really is ideal for dedicated infrastructure, unlike other components, which are going to be variable, [and] that's ideal for cloud," King said. "A lot of these jobs are batch -- you're going to run them in four- or eight-hour increments at different times -- so having a burst-up from dedicated to virtual is really table stakes."
Big data analysis in the cloud also raises networking issues for service providers. With all of its partners and customers in one cloud, CloudSigma makes the most of its ecosystem strategy by running a 10-Gigabit Ethernet network, "which means that you can fire terabytes of data around really, really quickly and at a very low cost," Jenkins said. Savvis, which CenturyLink acquired last year, is also considering the network implications of big data in the cloud.
"You don't want to be shipping terabytes and petabytes around," King said. "Keep the data where it is, and then you move the analytics … to that data."
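Some back-of-the-envelope arithmetic shows the scale both providers are describing. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a link running at its full nominal rate with no protocol overhead, so real transfers would take longer; the function name and the `efficiency` parameter are hypothetical.

```python
TB = 10**12            # bytes, decimal terabyte
PB = 10**15            # bytes, decimal petabyte
TEN_GBE = 10 * 10**9   # bits per second, nominal 10-Gigabit Ethernet rate


def transfer_seconds(num_bytes: float, link_bits_per_second: float,
                     efficiency: float = 1.0) -> float:
    """Idealized time to move num_bytes over a link.

    efficiency is the assumed fraction of the nominal rate actually achieved.
    """
    return num_bytes * 8 / (link_bits_per_second * efficiency)


one_tb = transfer_seconds(TB, TEN_GBE)
one_pb = transfer_seconds(PB, TEN_GBE)

print(f"1 TB over 10 GbE: about {one_tb / 60:.1f} minutes")
print(f"1 PB over 10 GbE: about {one_pb / 86400:.1f} days")
```

At the ideal rate, a terabyte moves in roughly 13 minutes, while a petabyte takes more than nine days, which is consistent with both Jenkins' point about terabytes inside one cloud and King's advice to move the analytics to the data at larger scales.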
As SHI develops a big-data cloud service, likely to be released next year, the cloud provider is tapping its experience with high-performance computing (HPC) in the cloud through a partnership with HP to deliver IaaS to the Internet2 consortium. In addition to improving storage performance, SHI's vCore architecture also "self-optimizes" networking and server performance, Fastert said.
"As it turns out, that same exact work, in terms of the way we design and optimize vCore for [HPC], is completely amenable to big data analytics," he said. "Most cloud providers essentially use a monolithic architecture, where they may have large numbers of servers and shared storage and so on, but it's all a single architecture. When you use [the vCore] model, it allows you to optimize subsections of that infrastructure very easily. It turned out that exact same form of optimization works exceptionally well for big data analytics."
Let us know what you think about the story; email: Jessica Scarpati, Site Editor.