Developing a semi-automatic data conversion tool for Korean ecological data standardization
Journal of Ecology and Environment, volume 41, Article number: 11 (2017)
Abstract
Demand for monitoring and studying long-term ecological change is growing worldwide. In line with this trend, many researchers in South Korea have attempted to share and integrate ecological data for practical use. Although some progress has been made, a major obstacle remains: existing ecological data in South Korea are scattered across the country in a variety of file formats. In this study, we address this problem by developing a semi-automatic data conversion tool for Korean ecological data standardization, based on predefined protocols for ecological data collection and management.
The current implementation of the tool supports only five kinds of indicator species (Libythea celtis, spittle bugs, mosquitoes, Pinus, and Quercus mongolica) and helps data managers quickly and efficiently obtain standardized ecological data from raw collection data. With this tool, data conversion proceeds in four steps: data file and protocol selection, species selection, attribute mapping, and data standardization. To assess the usability of the tool, we used it to standardize raw data on the five species collected from six observatory sites in Korean National Parks. As a result, we obtained standardized data in a common form in a relatively short time. With the help of this tool, various ecological data could be easily integrated into a nationwide common platform, providing broad applicability to many issues in ecological and environmental systems.
Background
Sharing and integrating ecological data is important for monitoring and studying long-term ecological change (Brunt et al. 2002). Currently, however, domestic data in South Korea are spread across numerous research sites, institutions, and individual researchers. Moreover, there has been no common protocol for ecological data collection and management, so the data are kept in a variety of forms. As a result, existing data are difficult to integrate, analyze, and manage for long-term ecological research, and domestic ecological data need to be standardized into a common form for data integration and further analysis (Michener et al. 2012, Bonet et al. 2014).
Until now, long-term ecological data have been collected in each country according to its own protocols and maintained in large databases in the form of Ecological Metadata Language (EML) (Fegraus et al. 2005). In particular, various long-term environmental monitoring projects, including the Environmental Change Network (ECN) (Morecroft et al. 2009), the National Ecological Observatory Network (NEON) (Keller et al. 2008), and the Long-Term Ecological Research network (LTER) (San Gil et al. 2009), provide large volumes of ecological data that are easily accessible to the public. Following this trend, Korea is also building a unified ecological data integration network. To this end, raw data that have already been collected need to be converted into a common form, and new data need to be collected under common protocols.
In this study, we developed a semi-automatic ecological data conversion tool that helps ecologists standardize ecological data easily and efficiently in a relatively short time while preserving the inherent meaning of the data. The conversion is based on predefined protocols for data collection and management. Figure 1 summarizes the overall workflow of the conversion procedure in our program.
Materials and methods
Ecological data are mostly stored in text-based tables. Each row in the table represents a record that contains the values of the attributes (or characteristics) of a target species. Each column corresponds to an attribute with a single data type and unit. For example, the attribute “search date” holds the date on which the raw data were collected, usually given in a format such as YYYY-MM-DD or DD-MM-YY. With our tool, raw data are standardized in four steps: (1) data file and protocol selection, (2) species selection, (3) attribute mapping, and (4) data standardization.
The first step, data file and protocol selection, is to upload the raw data file to be converted and to select the predefined protocols, which define the standard attributes and data types for the target species (see Fig. 2). In the present version of the tool, only CSV files are accepted as raw data files.
The second step is to specify the target species to be converted from the raw data file. When the raw data file contains many species, this step extracts and converts only the data for the specific (target) species that match the chosen protocol. If the raw data include only one species corresponding to the protocol, this step can be skipped. Figure 3 shows the user interface for choosing the list of target species to be extracted from the raw data. Here, users can locate the attribute that contains the names of the target species and add a particular species name to the “selected species list”. In this way, users can selectively convert only the part of the raw data that matches the chosen protocol. For convenience, the tool can also upload a list of species names to be converted, which makes it easier and faster to select many species.
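The species selection step can be illustrated with a minimal Python sketch. It is not the tool's actual implementation; the column names, the sample values, and the helper `filter_species` are hypothetical and serve only to show how records matching a selected species list might be extracted from a CSV table.

```python
import csv
import io

# Hypothetical raw data: column names and values are illustrative only.
RAW_CSV = """species_name,search_date,count
Libythea celtis,2010-05-01,3
Quercus mongolica,2010-05-01,12
Libythea celtis,2010-05-02,5
Aedes vexans,2010-05-02,40
"""

def filter_species(rows, species_attribute, selected_species):
    """Keep only the records whose species attribute is in the selected list."""
    wanted = {name.strip().lower() for name in selected_species}
    return [row for row in rows
            if row[species_attribute].strip().lower() in wanted]

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
selected = filter_species(rows, "species_name", ["Libythea celtis"])
print(len(selected))  # 2
```

Comparing names case-insensitively is a design choice of this sketch, not a documented behavior of the tool.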
Then, in the attribute mapping step, users specify which attributes in the raw data correspond to which standard attributes defined in the protocol. Once the relation between a raw attribute and a standard attribute has been specified, pressing the “mapping” button on the screen (Fig. 4) applies the mapping to the data conversion procedure. Raw data attributes that are not selected are excluded from the subsequent conversion process. For convenience, a mapping list between the two sets of attributes can also be used.
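As a rough illustration of attribute mapping, the following Python sketch renames mapped raw attributes to standard attribute names and drops the unmapped ones. The attribute names and the helper `apply_mapping` are assumptions made for this example and do not come from the tool or its protocols.

```python
def apply_mapping(rows, mapping):
    """Rename mapped raw attributes to standard names; drop unmapped ones."""
    return [{standard: row[raw] for raw, standard in mapping.items()}
            for row in rows]

# Hypothetical raw record with one attribute ("memo") that is not mapped.
raw_rows = [
    {"date": "01-MAY-2010", "sp": "Libythea celtis", "memo": "cloudy", "n": "3"},
]

# Hypothetical mapping list: raw attribute -> standard attribute.
mapping = {"date": "search_date", "sp": "species_name", "n": "individual_count"}

mapped = apply_mapping(raw_rows, mapping)
print(sorted(mapped[0]))  # ['individual_count', 'search_date', 'species_name']
```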
In the final step, the data type and unit of each attribute are transformed into a standardized format. For this purpose, we provide several functions: concatenation, separation, substitution, date conversion, unit conversion, and editing (Fig. 5). The concatenation function merges the values of two or more attributes into one new value; a text or symbol can be inserted as a delimiter when combining multiple values. The separation function divides a string into several chunks. For example, the attribute “search period” can be separated into two attributes, “search start date” and “search end date.” The substitution function replaces certain values, e.g., texts, numbers, delimiters, or symbols, with different values. The date conversion function specifies the desired format of the search date. For example, it separates the search date into three parts (day, month, and year) and then rearranges them into the desired order, such as YYYY-MM-DD. The unit conversion function changes the unit of the data, and the editing function transforms numerical data using a formula. Finally, the standardized data are saved to a new CSV file in table form.
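The standardization functions described above could be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the sample record, the “~” delimiter in the search period, and the substitution table are all hypothetical and are not taken from the tool itself.

```python
def concatenate(values, delimiter=""):
    """Merge the values of two or more attributes into one new value."""
    return delimiter.join(values)

def separate(value, delimiter):
    """Divide a string into chunks and trim surrounding whitespace."""
    return [chunk.strip() for chunk in value.split(delimiter)]

def substitute(value, replacements):
    """Replace a value if it appears in the table; otherwise keep it."""
    return replacements.get(value, value)

def convert_unit(value, factor):
    """Change the unit of a numeric value by a multiplicative factor."""
    return float(value) * factor

def edit(value, formula):
    """Transform a numeric value with a user-supplied formula."""
    return formula(float(value))

# Hypothetical raw record; attribute names and values are illustrative.
record = {
    "search_period": "2010-05-01 ~ 2010-05-07",
    "genus": "Quercus",
    "epithet": "mongolica",
    "weather": "S",
    "height_cm": "150",
    "dbh_cm": "20",
}

start_date, end_date = separate(record["search_period"], "~")
standardized = {
    "search_start_date": start_date,
    "search_end_date": end_date,
    "species_name": concatenate([record["genus"], record["epithet"]], " "),
    "weather": substitute(record["weather"], {"S": "sunny", "C": "cloudy"}),
    "height_m": convert_unit(record["height_cm"], 0.01),
    "radius_cm": edit(record["dbh_cm"], lambda dbh: dbh / 2),
}
print(standardized["search_start_date"], standardized["search_end_date"])
```

Each function takes and returns plain values, so the same helpers can be applied column by column before the result is written to a new CSV file.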
Results and discussion
Our semi-automatic data conversion tool is a desktop application that runs on Windows and Macintosh computers. It helps ecologists easily and efficiently create standardized data from raw collection data. To assess the usability of the tool, we standardized six datasets from six observatory sites located in Korean National Parks (for more information about the datasets, refer to Table 1). For this purpose, we used predefined protocols for five kinds of indicator species, selected by the long-term ecological research of Kyungpook National University in Korea (refer to Table 2 for details). Overall, the raw data, which varied widely in data types and terms, were successfully standardized according to the predefined protocols (refer to Table 3). For instance, using the separation and date conversion functions, search period was divided into search start date and search end date, and a search date such as 01-MAY-2010 was converted to 2010-05-01. The number of records converted according to the SC protocol is equal to or smaller than that of the original raw data, because the SC protocol contains only search date and environment information, and several entities can be found on the same search date.
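The date conversion example and the reduction in record count can be reproduced with a short Python sketch. It assumes that raw dates follow the DD-MON-YYYY pattern with English month abbreviations and that records are collapsed when they share the same search date and environment values; the helper names, the `temperature` attribute, and the sample records are hypothetical.

```python
# Month abbreviations are spelled out so the result does not depend on locale.
MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}

def convert_date(value, delimiter="-"):
    """Split DD-MON-YYYY into day, month, and year; reorder to YYYY-MM-DD."""
    day, month, year = value.split(delimiter)
    return f"{int(year):04d}-{MONTHS[month.upper()]:02d}-{int(day):02d}"

def project_unique(rows, attributes):
    """Keep only the given attributes and drop records that become identical."""
    seen, result = set(), []
    for row in rows:
        key = tuple(row[name] for name in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result

# Hypothetical raw records: two entities observed on the same search date.
raw = [
    {"search_date": "01-MAY-2010", "temperature": "18.5", "species": "Libythea celtis"},
    {"search_date": "01-MAY-2010", "temperature": "18.5", "species": "Quercus mongolica"},
    {"search_date": "02-MAY-2010", "temperature": "20.1", "species": "Libythea celtis"},
]
for row in raw:
    row["search_date"] = convert_date(row["search_date"])

site_records = project_unique(raw, ["search_date", "temperature"])
print(raw[0]["search_date"])        # 2010-05-01
print(len(raw), len(site_records))  # 3 2
```

Because the two records from the first date differ only in species, they collapse into one record once the species attribute is dropped, which is why the converted record count can be smaller than the raw count.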
Conclusions
With our tool, standardized data in a common form can be created in a relatively short time. Moreover, since the converted data can be stored and shared in the same format, comparative analyses across numerous ecological datasets become easier, regardless of the organization or project goals under which the data were collected. Consequently, this tool can broaden the applicability of ecological and environmental data, for example in uncovering the various effects of environmental factors on species.
Abbreviations
ECN: Environmental Change Network
EML: Ecological Metadata Language
LTER: Long-Term Ecological Research network
NEON: National Ecological Observatory Network
References
Bonet, F. J., Pérez-Pérez, R., Benito, B. M., De Albuquerque, F. S., & Zamora, R. (2014). Documenting, storing, and executing models in ecology: a conceptual framework and real implementation in a global change monitoring program. Environmental Modelling & Software, 52, 192–199.
Brunt, J. W., McCartney, P., Baker, K., & Stafford, S. G. (2002). The future of ecoinformatics in long term ecological research (Proceedings of the 6th World Multiconference on Systemics, Cybernetics and Informatics: SCI, pp. 14–18).
Fegraus, E. H., Andelman, S., Jones, M. B., & Schildhauer, M. (2005). Maximizing the value of ecological data with structured metadata: an introduction to Ecological Metadata Language (EML) and principles for metadata creation. Bulletin of the Ecological Society of America, 86, 158–168.
Keller, M., Schimel, D. S., Hargrove, W. W., & Hoffman, F. M. (2008). A continental strategy for the National Ecological Observatory Network. Frontiers in Ecology and the Environment, 6, 282–284.
Michener, W. K., et al. (2012). Participatory design of DataONE—enabling cyberinfrastructure for the biological and environmental sciences. Ecological Informatics, 11, 5–15.
Morecroft, M. D., et al. (2009). The UK Environmental Change Network: emerging trends in the composition of plant and animal communities and the physical environment. Biological Conservation, 142, 2814–2832.
San Gil, I., et al. (2009). The Long-Term Ecological Research community metadata standardisation project: a progress report. International Journal of Metadata, Semantics and Ontologies, 4, 141–153.
Acknowledgements
We would like to thank the anonymous reviewers for their valuable comments on the manuscript.
Funding
This study was supported by the Korea Ministry of Environment (MOE) as “Public Technology Program based on Environmental Policy (2014000210003).”
Availability of data and materials
The data are not publicly available because they were used under license for the current study.
Authors’ contributions
HL carried out the studies, performed the analysis, and wrote/reviewed the manuscript. HJ carried out the studies. MS participated in the design of the study and wrote/reviewed the manuscript. OK participated in the design of the study. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate