CMAS Community Developers/Users Forum Minutes
CMAS Conference October 28-30, 2013
Big Data Issues in the CMAS Community
Facilitator: Zac Adelman (UNC)
Panelists: Alison Eyth (USEPA OAQPS), Barron Henderson (U. Florida), Joshua Fu (U. Tennessee), Arastoo Pour Biazar (U. Alabama Huntsville), Christian Hogrefe (USEPA ORD)
Adelman opened the session by introducing the panel discussion. The session is presented as an open forum for discussing thoughts, experiences, and ideas within the community. He also presented a case study on building a data warehouse for transferring large modeling datasets. Tools are now available for electronic data transfer that are much faster and more efficient than mailing external hard drives (aka "sneaker-net").
The objective of the meeting is for the CMAS community to have a conversation about the challenges posed by big data. A series of charge questions will be presented to stimulate discussion. The panelists will have the chance to respond to the questions first and then the discussion will be open to the audience.
Question 1: The key issues surrounding big data include generation, management, storage, sharing/transfer, and visualization/analysis. While we're clearly realizing success in generating big data, in what other areas are we doing well as a community?
Henderson: Individual success in data management, although storing everything that you generate is not necessarily a success. Questioned the assumption that redoing a modeling run is always more expensive than storing the data. The relatively open nature of data exchange in the CMAS community is also a success. Visualization and analysis are a mixed success; there are still problems with the usability of 3-D visualization tools.
Eyth: The data transfers happening in the community are a sign of success, e.g. RPOs willing to copy/transfer larger data archives. The offers from the community to help share data are appreciated. The community-based development and improvement of modeling data is also a success.
Biazar: Data cataloging is an issue in the community. Transferring large data sets still requires using external hard drives.
Fu: Faster and more transparent data sharing will address many issues in the community. The Research Education Network is a project with NOAA and NSF to share data at petabyte scales; it includes tools for locating and displaying archived data. Need to look to these large projects for examples of how to create infrastructure and networks to support large-scale data sharing.
Hogrefe: Generating so much data, so quickly, is an accomplishment. The questions and environmental problems are expanding the need for computing resources and stretching the existing infrastructure. As the air quality modeling community interacts with other fields/media/disciplines, there will be more of a need for data sharing. The modeling process is now more holistic where it used to be more specialized; modeling groups are now thinking through the entire problem of generating and sharing data. There are now resources for streamlined and routine access to observations, intercomparison studies, and benchmarking models.
- Can't put all data in memory for visualization, need for tools to visualize and analyze data one time-step at a time
- Need to design datasets more intelligently; restarts need full-precision data, while visualization and analysis need fewer variables and less precision; consider the context of how data will be used when designing a data sharing archive
- Need for better compression algorithms that work well with modeling data
- To facilitate data sharing there is a need for more structured and clear methods for versioning data
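The precision point above can be illustrated with a minimal sketch: a restart archive keeps the full 8-byte floats, while an analysis copy packs the field into netCDF-style 16-bit scale/offset integers before compression. The field, its value range, and the array size below are hypothetical, chosen only for illustration.

```python
import struct
import zlib
import random

random.seed(0)
# Hypothetical surface-ozone field: 10,000 values in ppb, range [20, 80]
values = [random.uniform(20.0, 80.0) for _ in range(10000)]

# Full precision: 8-byte floats, as a restart file would require
full = struct.pack(f"{len(values)}d", *values)

# Analysis copy: 16-bit ints with a scale/offset, the netCDF packing idiom
scale, offset = (80.0 - 20.0) / 65535, 20.0
packed = struct.pack(
    f"{len(values)}h",
    *(round((v - offset) / scale) - 32768 for v in values))

# Rounding bounds the reconstruction error by scale/2 (~0.0005 ppb here),
# which is ample for visualization, and the packed copy compresses to a
# fraction of the full-precision one
print(len(zlib.compress(full)), len(zlib.compress(packed)))
```

In practice the same effect is available directly in netCDF via the CF `scale_factor`/`add_offset` attributes combined with deflate compression, so the analysis copy can stay in the community's standard format.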
Question 2: What are the more pressing data challenges and how might we address these challenges as a community?
Henderson: Awareness and cataloging: we need to know what's out there; usability: different types of data require different types of tools; GIOVANNI and MIRADOR are great data cataloging successes, although there are still issues with the underlying data; need for a metaprocessor that translates all metadata to the netCDF format to help support cataloging; cataloging will help usability
Eyth: Long-term data archival is a challenge for OAQPS; storage is not free, and OAQPS does not have the online storage capacity and network bandwidth to keep massive data archives online indefinitely
Biazar: Moving along, but the problems keep repeating themselves at larger scales; in the past, we couldn't save everything because of the cost of storage; now we save everything, and management and cataloging are the challenge; too many file formats require different tools for analysis, need to normalize to a single format like CF-compliant netCDF; data catalogs don't need to be located on a data server, they can be linked together to catalog a wider network of data archives
Fu: Need for data portals around the community; burden of storing and sharing data should be distributed around the community; see Earth System Grid as an example of community data sharing; don't need a single data center, create different portals instead
Hogrefe: Solution to large data: don't generate so much data; define the problem more precisely at the beginning, try to be more intelligent about how experiments are designed; use subsetting tools more effectively to parse the large data into more manageable pieces
- Problem of having sufficient metadata to describe modeling files, need funding and effort to address this issue; mechanisms exist for data transfer, no need to reinvent the wheel; implement/deploy these technologies for our community
- Software enhancements are needed to add additional metadata to modeling files
- Project underway at Washington State University to integrate modeling data with Earth System models; cyberinfrastructure task using Kepler workflows: includes full documentation of the data and uses higher-level workflows to generate descriptions of the processes and datasets
- Data are being generated in too much volume, too quickly; there is no time to analyze it all; need better visualization tools, particularly 3-D visualization; for regulatory modeling it is rare to look beyond the surface data
- Metadata is key; need for information sharing: who has what; it would be nice to have a GIS app to display who has what for different domains; need more robust test data for benchmarking visualization and analysis tools
- There is a challenge in subsetting regional data that doesn't exist with global datasets; need for adding lat-lon coordinates into regional air quality data for cataloging and sharing
- People struggle with the data formats and converting between them; need to create a list of tools available for data manipulation
- Demand for CMAS to build a data exchange catalog
- Need for a community toolbox, for accessing ad hoc tools
- Need for metadata standard in the community; look at what similar communities are doing
- Data cataloging is the problem with "sneaker net"
- Organize data sharing around the purpose of the data: need for catalog system, expect learning curve to use data sharing tools and datasets, collate data providers and get them to organize
- May not get away from ad hoc project orientation any time soon; need a way to track these project-based datasets
- Create a tool for sampling data to make the data sizes more manageable
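The suggestion above about adding lat-lon coordinates to regional data can be sketched with the spherical Lambert conformal (LCC) inverse formulas from Snyder's map projection monograph. The grid parameters below are hypothetical CMAQ-style values (true latitudes 33 and 45, center at 97W/40N, 12 km cells), not taken from any dataset in the discussion.

```python
import math

# Hypothetical CMAQ-style LCC parameters on a sphere of radius R
R = 6370000.0                      # sphere radius (m), MM5/WRF convention
phi1, phi2 = math.radians(33.0), math.radians(45.0)   # true latitudes
lam0, phi0 = math.radians(-97.0), math.radians(40.0)  # projection center

# Cone constant and radial scale (Snyder, spherical LCC)
n = (math.log(math.cos(phi1) / math.cos(phi2)) /
     math.log(math.tan(math.pi/4 + phi2/2) / math.tan(math.pi/4 + phi1/2)))
F = math.cos(phi1) * math.tan(math.pi/4 + phi1/2)**n / n
rho0 = R * F / math.tan(math.pi/4 + phi0/2)**n

def lcc_to_latlon(x, y):
    """Inverse LCC: projected metres (origin at the center) -> degrees."""
    rho = math.copysign(math.hypot(x, rho0 - y), n)
    theta = math.atan2(x, rho0 - y)
    lat = 2 * math.atan((R * F / rho)**(1 / n)) - math.pi/2
    return math.degrees(lat), math.degrees(theta / n + lam0)

# Tag a cell centre of a hypothetical 12 km grid with its lat-lon
x0, y0, dx = -2412000.0, -1620000.0, 12000.0
lat, lon = lcc_to_latlon(x0 + 100.5 * dx, y0 + 100.5 * dx)
```

Running this over every cell centre yields the LAT/LON arrays that make a regional file self-describing for cataloging; in practice a projection library would be used rather than hand-written formulas, but the arithmetic is small enough to embed in a conversion script.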
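The sampling tool requested above could be as simple as a thinning pass that keeps only the layers and time steps an analysis actually needs. The array layout ([time][layer][row][col]) and the strides below are assumptions for illustration, echoing the earlier point that regulatory analyses rarely look beyond the surface layer.

```python
# Sketch of a thinning pass for an analysis copy of a gridded
# concentration field stored as nested lists [time][layer][row][col].
def thin(conc, t_stride=4, layers=(0,)):
    """Return a subset keeping every t_stride-th step and the given layers."""
    return [[conc[t][l] for l in layers]
            for t in range(0, len(conc), t_stride)]

# Dummy field: 24 hourly steps x 35 layers x 2 x 2 cells
conc = [[[[0.0, 0.0], [0.0, 0.0]] for _ in range(35)] for _ in range(24)]
small = thin(conc)
print(len(small), len(small[0]))   # 6 time steps, 1 layer retained
```

Even this naive subset cuts the volume by a factor of 140 (24/6 time steps times 35/1 layers), which is often the difference between a transferable analysis archive and one that must travel by hard drive.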