Category: Data management

Keeping science reproducible in a world of custom code and data

Read the full story at Ars Technica.

Since the mid-1600s, the output from a typical scientific study has been an essay-style journal article describing the results. But today, in fields ranging from astronomy to microbiology, much of the technical work for a journal article involves writing code to manipulate data sets. If the data and code are not available, other researchers can’t reproduce the original authors’ work and, more importantly, may not be able to build upon the work to explore new methods and discoveries.

Thanks to cultural shifts and funding requirements, more researchers are warming up to open data and open code. Even 100-year-old journals like the Quarterly Journal of Economics or the Journal of the Royal Statistical Society now require authors to provide replication materials—including data and code—with any quantitative paper. Some researchers welcome the new paradigm and see the value in pushing science forward via deeper collaboration. But others feel the burden of learning to use distribution-related tools like Git, Docker, Jupyter, and other not-quite words.
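
For authors taking a first step toward shareable analyses, even lightweight provenance capture helps collaborators re-run the work. The sketch below is a hypothetical Python example, not drawn from the Ars Technica article: it records the interpreter, installed package versions, and a checksum of the input file alongside the results, so anyone re-running the code can confirm they have the same environment and data. The file name observations.csv is an assumption for illustration.

```python
# Minimal provenance capture for a shareable analysis (hypothetical sketch).
import hashlib
import json
import platform
import sys
from importlib import metadata

DATA_FILE = "observations.csv"  # assumed input file for this example


def sha256(path):
    """Return the SHA-256 checksum of a file so others can verify the exact input."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


provenance = {
    "python": sys.version,
    "platform": platform.platform(),
    "packages": {d.metadata["Name"]: d.version for d in metadata.distributions()},
    "input_sha256": sha256(DATA_FILE),
}

with open("provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)
```

Committed alongside the analysis code, a record like this gives a second lab a fair chance of reproducing the run before anyone has to learn Docker.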

Parts of the web are disappearing every day. Here’s how to save Internet history

Read the full story from Fast Company.

The websites of today are the historical evidence of tomorrow—but only if they are archived.
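
The article surveys archiving tools rather than code, but the simplest step it points to, submitting pages to a public web archive, takes only a few lines. The hypothetical Python sketch below sends a handful of URLs to the Internet Archive's Save Page Now endpoint; the endpoint form is the commonly documented one and the URL list is invented, so verify the current API and its rate limits before relying on this.

```python
# Hypothetical sketch: ask the Internet Archive's Save Page Now endpoint to
# capture a few pages (endpoint form may change; verify before relying on it).
import time
import urllib.request

URLS_TO_ARCHIVE = [
    "https://example.org/dataset-announcement",  # placeholder URLs
    "https://example.org/lab-news",
]

for url in URLS_TO_ARCHIVE:
    request = urllib.request.Request(
        "https://web.archive.org/save/" + url,
        headers={"User-Agent": "archive-sketch/0.1"},
    )
    try:
        with urllib.request.urlopen(request, timeout=60) as response:
            print(url, "->", response.status)
    except Exception as error:  # network errors, rate limits, etc.
        print(url, "failed:", error)
    time.sleep(5)  # be polite; the service throttles rapid submissions
```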

File not found: Biodiversity needs better data archiving

Read the full story from Michigan State University.

A Spartan-led research team reveals surprising gaps in ecological genetic data that could otherwise help global conservation efforts.

The United States Geological Survey Science Data Lifecycle Model

Download the document. See also this article in the Journal of eScience Librarianship for a discussion of the model’s development.

U.S. Geological Survey (USGS) data represent corporate assets with potential value beyond any immediate research use, and therefore need to be accounted for and properly managed throughout their lifecycle. Recognizing these needs, a USGS team developed a Science Data Lifecycle Model (SDLM) as a high-level view of data—from conception through preservation and sharing—to illustrate how data management activities relate to project workflows and to clarify what proper data management requires. In applying the Model to research activities, USGS scientists can ensure that data products will be well described, preserved, accessible, and fit for reuse. The Model also serves as a structure to help the USGS evaluate and improve policies and practices for managing scientific data, and to identify areas in which new tools and standards are needed.
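
As a rough illustration of how the Model can be put to work, the hypothetical Python sketch below treats the lifecycle elements as a sign-off checklist for a data product. The stage names follow the published Model (primary stages plus cross-cutting elements); the checklist mechanics and the example product name are invented for this sketch.

```python
# Hypothetical checklist built on the USGS Science Data Lifecycle Model stages.
from dataclasses import dataclass, field

PRIMARY_STAGES = ["Plan", "Acquire", "Process", "Analyze", "Preserve", "Publish/Share"]
CROSS_CUTTING = ["Describe (Metadata, Documentation)", "Manage Quality", "Backup & Secure"]


@dataclass
class DataProduct:
    name: str
    completed: set = field(default_factory=set)  # lifecycle elements signed off so far

    def sign_off(self, stage: str) -> None:
        if stage not in PRIMARY_STAGES + CROSS_CUTTING:
            raise ValueError(f"Unknown lifecycle element: {stage}")
        self.completed.add(stage)

    def outstanding(self) -> list:
        """Lifecycle elements that still need attention before release."""
        return [s for s in PRIMARY_STAGES + CROSS_CUTTING if s not in self.completed]


# Example: a dataset that has been planned, acquired, and described so far.
product = DataProduct("streamgage-2015")
for stage in ["Plan", "Acquire", "Describe (Metadata, Documentation)"]:
    product.sign_off(stage)
print(product.outstanding())
```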

How Scientists Can Protect Their Data From the Trump Administration

Read the full story in The Intercept.

If you’re an American scientist who’s worried that your data might get censored or destroyed by Trump’s radically anti-science appointees, here are some technologies that could help you preserve it, and preserve access to it.
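
The Intercept piece surveys preservation tools rather than code, but its core advice, keep verifiable copies in more than one place, can be sketched in a few lines. The hypothetical Python example below builds a SHA-256 manifest for a data directory and compares it against a mirrored copy; the directory paths are placeholders.

```python
# Hypothetical sketch: build checksum manifests so independent copies of a
# dataset can be compared for tampering or silent corruption.
import hashlib
import json
from pathlib import Path


def manifest(data_dir: str) -> dict:
    """Map each file's relative path to its SHA-256 checksum."""
    result = {}
    root = Path(data_dir)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            result[str(path.relative_to(root))] = hashlib.sha256(path.read_bytes()).hexdigest()
    return result


if __name__ == "__main__":
    # Compare a local copy against a mirrored copy (paths are illustrative).
    local = manifest("data/")
    mirror = manifest("/mnt/mirror/data/")
    changed = {p for p in local if mirror.get(p) != local[p]}
    print(json.dumps(sorted(changed), indent=2))
```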

Researchers make soil experiment data available in real time

Read the full story in EnvironmentalResearchWeb.

Traditionally, the scientific world has had closed doors. Since the 17th century, scientists have published their findings in learned journals, but the details of exactly what they did, how they gathered their data and what tools they used to reach their conclusions are not usually communicated beyond the laboratory walls. But times are changing. Modern technology is enabling scientists to share their data worldwide and put information into the hands of ordinary citizens. However, sharing data openly is not without its challenges.

October 2015 issue of NTIS National Technical Reports Newsletter features e-cycling publications

The October 2015 issue of NTIS’ National Technical Reports Newsletter features a sampling of new and historic information on electronics recycling that is available from NTIS via the NTRL website. The issue also includes links to the public access plans of several federal agencies and an overview of NTIS’ new NTRL database.

Geospatial Data: Progress Needed on Identifying Expenditures, Building and Utilizing a Data Infrastructure, and Reducing Duplicative Efforts

GAO-15-193: Published: Feb 12, 2015. Publicly Released: Mar 16, 2015.

Download at http://www.gao.gov/products/GAO-15-193.

What GAO Found

Federal agencies and state governments use a variety of geospatial datasets to support their missions. For example, after Hurricane Sandy in 2012, the Federal Emergency Management Agency used geospatial data to identify 44,000 households that were damaged and inaccessible and reported that, as a result, it was able to provide expedited assistance to area residents. Federal agencies report spending billions of dollars on geospatial investments; however, the estimates are understated because agencies do not always track geospatial investments. For example, these estimates do not include billions of dollars spent on earth-observing satellites that produce volumes of geospatial data. The Federal Geographic Data Committee (FGDC) and the Office of Management and Budget (OMB) have started an initiative to have agencies identify and report annually on geospatial-related investments as part of the fiscal year 2017 budget process.

FGDC and selected federal agencies have made progress in implementing their responsibilities for the National Spatial Data Infrastructure as outlined in OMB guidance; however, critical items remain incomplete. For example, the committee established a clearinghouse for records on geospatial data, but the clearinghouse lacks an effective search capability and performance monitoring. FGDC also initiated plans and activities for coordinating with state governments on the collection of geospatial data; however, state officials GAO contacted are generally not satisfied with the committee’s efforts to coordinate with them. Among other reasons, they feel that the committee is focused on a federal perspective rather than a national one, and that state recommendations are often ignored. In addition, selected agencies have made limited progress in their own strategic planning efforts and in using the clearinghouse to register their data to ensure they do not invest in duplicative data. For example, 8 of the committee’s 32 member agencies have begun to register their data on the clearinghouse, and they have registered 59 percent of the geospatial data they deemed critical. Part of the reason that agencies are not fulfilling their responsibilities is that OMB has not made it a priority to oversee these efforts. Until OMB ensures that FGDC and federal agencies fully implement their responsibilities, the vision of improving the coordination of geospatial information and reducing duplicative investments will not be fully realized.

OMB guidance calls for agencies to eliminate duplication, avoid redundant expenditures, and improve the efficiency and effectiveness of the sharing and dissemination of geospatial data. However, some data are collected multiple times by federal, state, and local entities, resulting in duplication in effort and resources. A new initiative to create a national address database could potentially result in significant savings for federal, state, and local governments. However, agencies face challenges in effectively coordinating address data collection efforts, including statutory restrictions on sharing certain federal address data. Until there is effective coordination across the National Spatial Data Infrastructure, there will continue to be duplicative efforts to obtain and maintain these data at every level of government.

Why GAO Did This Study

The federal government collects, maintains, and uses geospatial information—data linked to specific geographic locations—to help support varied missions, including national security and natural resources conservation. To coordinate geospatial activities, in 1994 the President issued an executive order to develop a National Spatial Data Infrastructure—a framework for coordination that includes standards, data themes, and a clearinghouse. GAO was asked to review federal and state coordination of geospatial data.

GAO’s objectives were to (1) describe the geospatial data that selected federal agencies and states use and how much is spent on geospatial data; (2) assess progress in establishing the National Spatial Data Infrastructure; and (3) determine whether selected federal agencies and states invest in duplicative geospatial data. To do so, GAO identified federal and state uses of geospatial data; evaluated available cost data from 2013 to 2015; assessed FGDC’s and selected agencies’ efforts to establish the infrastructure; and analyzed federal and state datasets to identify duplication.

What GAO Recommends

GAO suggests that Congress consider assessing statutory limitations on address data to foster progress toward a national address database. GAO also recommends that OMB improve its oversight of FGDC and federal agency initiatives, and that FGDC and selected agencies fully implement initiatives. The agencies generally agreed with the recommendations and identified plans to implement them.

For more information, contact David A. Powner at (202) 512-9286 or pownerd@gao.gov.

Computer equal to or better than humans at cataloging science

Read the full story from the University of Wisconsin.

In 1997, IBM’s Deep Blue computer beat chess wizard Garry Kasparov. This year, a computer system developed at the University of Wisconsin-Madison equaled or bested scientists at the complex task of extracting data from scientific publications and placing it in a database that catalogs the results of tens of thousands of individual studies…

The development, described in the journal PLOS ONE, marks a milestone in the quest to rapidly and precisely summarize, collate and index the vast output of scientists around the globe, says first author Shanan Peters, a professor of geoscience at UW-Madison.

The Gordon and Betty Moore Foundation selects awardees for $21 million in grants to stimulate data-driven discovery

The Gordon and Betty Moore Foundation today announced the selection of 14 Moore Investigators in Data-Driven Discovery. These scientists will catalyze new data-driven scientific discoveries through grants to their academic institutions totaling $21 million over five years. These unrestricted awards will enable the recipients to make a profound impact on scientific research by unlocking new types of knowledge and advancing new data science methods across a wide spectrum of disciplines. These Moore Investigator Awards are part of a $60 million, five-year Data-Driven Discovery Initiative within the Gordon and Betty Moore Foundation’s Science Program. The initiative – one of the largest privately funded data scientist programs of its kind – is committed to enabling new types of scientific breakthroughs by supporting interdisciplinary, data-driven researchers. Last year, the Moore Foundation announced an ambitious collaboration with three universities and the Alfred P. Sloan Foundation to create dedicated data science environments.

“Science is generating data at unprecedented volume, variety and velocity, but many areas of science don’t reward the kind of expertise needed to capitalize on this explosion of information,” said Chris Mentzel, Program Director of the Data-Driven Discovery Initiative. “We are proud to recognize these outstanding scientists, and we hope these awards will help cultivate a new type of researcher and accelerate the use of interdisciplinary, data-driven science in academia.”

By empowering data-driven investigators at universities, these new awards will strengthen incentives for data scientists in academia and create greater rewards for working between disciplines. The awards will support sustained collaborations among data science researchers to build on one another’s work, capitalize on the best practices and tools, and create solutions that can be used more broadly by others.

“Many areas of science are currently data-rich, but discovery-poor,” said Vicki Chandler, PhD, Chief Program Officer for Science at the Moore Foundation. “The Moore Investigator Awards in Data-Driven Discovery aim to reverse that trend by enabling researchers to harness the unprecedented diversity of scientific data now available and answer new kinds of questions. We hope that other funders, public and private, will join us in supporting this transformation.”

The selected Investigators are listed below. Please visit the Data-Driven Discovery Investigator page for profiles of these outstanding researchers.

2014 Moore Investigators in Data-Driven Discovery

  • Joshua Bloom, University of California, Berkeley
  • C. Titus Brown, University of California, Davis
  • Casey Greene, Dartmouth College
  • Jeffrey Heer, University of Washington
  • Carl Kingsford, Carnegie Mellon University
  • Laurel Larson, University of California, Berkeley
  • Christopher Re, Stanford University
  • Kimberly Reynolds, University of Texas Southwestern Medical Center
  • Amit Singer, Princeton University
  • Matthew Stephens, University of Chicago
  • Blair Sullivan, North Carolina State University
  • Matthew Turk, University of Illinois
  • Laura Waller, University of California, Berkeley
  • Ethan White, University of Florida
