The move to data-driven science and decision-making is creating a need for comprehensive benchmarking of ‘big data’ applications, as well as of price/performance across the board, according to attendees at a recent workshop organized by the San Diego Supercomputer Center (SDSC) at the University of California, San Diego.
Big data applications are characterized by the need to provide timely analytics while dealing with large data volumes, high data rates, and a wide range of data sources. Many of those datasets are so voluminous that most conventional computers and software cannot effectively process them.
The Workshop on Big Data Benchmarking (WBDB2012), organized by SDSC’s Center for Large-scale Data Systems Research (CLDS), was held May 8-9 in San Jose, Calif. Supported by the National Science Foundation (NSF), the workshop was quickly oversubscribed and drew about 60 experts representing more than 45 institutions from industry and academia.
The meeting brought together experts in large-scale data, parallel database systems, benchmarking and system performance, cloud storage and computing, and related areas. In addition to the NSF, sponsors of the event included networking and storage solutions companies Seagate, Greenplum, Brocade, and Mellanox.
“This is an important first step toward the development of a set of benchmarks that will be needed to move ahead in a unified fashion when it comes to big data applications,” said Chaitan Baru, head of the WBDB2012 program committee. Baru is an SDSC Distinguished Scientist and director of the CLDS.
“Data is emerging as the key differentiator for creating value for enterprises,” said Milind Bhandarkar, chief scientist of Greenplum, a corporate sponsor of CLDS. “The storage scalability, along with processing flexibility offered by various big data platforms, has introduced a large variety of use cases. Choosing an appropriate data platform that meets the performance and scalability needs of an organization is not a trivial problem. Therefore, having an industry-standard suite of benchmarks that represent real workloads is most important. WBDB has taken an important first step of assembling a gathering of experts.”
One key topic of discussion focused on the range and types of applications and associated data that need to be modeled as part of a big data benchmarking exercise. In addition to discussion of industry examples, the workshop brought forward several leading examples from science, including a presentation on genomic data by Nicholas Schork, a professor at The Scripps Research Institute and the Scripps Translational Science Institute, and one on geospatial data by Shashi Shekhar, McKnight Distinguished University Professor in the Department of Computer Science at the University of Minnesota (U.Minn).
In a recent program solicitation, the NSF described big data as “large, diverse, complex, longitudinal, and/or distributed datasets generated from instruments, sensors, Internet transactions, email, video, click streams, and/or all other digital sources available today and in the future.”
“The move to data-driven science and decision-making has made the big data issue critical to all areas of science as well as enterprise applications,” said SDSC’s Baru. “Big data and data-intensive research is here to stay, and our ultimate goal is to provide clear and objective information to help characterize and understand hardware and system performance, as well as price and performance across the board, including computing, network connectivity, and data storage systems.”
WBDB2012 was hosted by Brocade at its Executive Briefing Center. “This was an excellent forum to begin to determine big data metrics, and we were honored to sponsor the workshop,” said Scott Pearson, director of big data solutions and a member of the big data strategy team at Brocade, a leading vendor of networking technologies.
SDSC’s second big data benchmarking workshop will be held in December 2012 in Pune, India, hosted by Persistent Systems. The community is also beginning to organize itself via regular phone conferences. For further information about the workshop and the regular meetings, visit the CLDS website (http://clds.sdsc.edu).
As an Organized Research Unit of UC San Diego, SDSC works with industry and government, as well as academia. Industry researchers and representatives interested in learning more about SDSC’s resources and expertise should contact Ron Hawkins at firstname.lastname@example.org or 858 534-5045.
Available for comment:
Chaitan Baru, (858) 534-5082 or email@example.com
Warren R. Froelich, SDSC Communications, 858 822-3622 or firstname.lastname@example.org