Journal of Environmental Quality 31:1538-1549 (2002)
© 2002 American Society of Agronomy, Crop Science Society of America, and Soil Science Society of America

TECHNICAL REPORTS
Ground Water Quality

Application of Classification-Tree Methods to Identify Nitrate Sources in Ground Water

Timothy B. Spruill (a,*), William J. Showers (b), and Stephen S. Howe (a)

a United States Geological Survey, 3916 Sunset Ridge Rd., Raleigh, NC 27607
b Dep. of Marine Earth and Atmospheric Sciences, North Carolina State University, Raleigh, NC 27695-8208

* Corresponding author (tspruill@usgs.gov)

Received for publication August 17, 2001.

    ABSTRACT
 
A study was conducted to determine if nitrate sources in ground water (fertilizer on crops, fertilizer on golf courses, irrigation spray from hog (Sus scrofa) wastes, and leachate from poultry litter and septic systems) could be classified with 80% or greater success. Two statistical classification-tree models were devised from 48 water samples containing nitrate from five source categories. Model 1 was constructed by evaluating 32 variables and selecting four primary predictor variables (δ15N, nitrate to ammonia ratio, sodium to potassium ratio, and zinc) to identify nitrate sources. A δ15N value of nitrate plus potassium >18.2 indicated animal sources; a value <18.2 indicated inorganic or soil organic N. A nitrate to ammonia ratio >575 indicated inorganic fertilizer on agricultural crops; a ratio <575 indicated nitrate from golf courses. A sodium to potassium ratio >3.2 indicated septic-system wastes; a ratio <3.2 indicated spray or poultry wastes. A value for zinc >2.8 indicated spray wastes from hog lagoons; a value <2.8 indicated poultry wastes. Model 2 was devised by using all variables except δ15N. This model also included four variables (sodium plus potassium, nitrate to ammonia ratio, calcium to magnesium ratio, and sodium to potassium ratio) to distinguish categories. Both models were able to distinguish all five source categories with better than 80% overall success and with 71 to 100% success in individual categories using the learning samples. Seventeen water samples that were not used in model development were tested using Model 2 for three categories, and all were correctly classified. Classification-tree models show great potential in identifying sources of contamination and variables important in the source-identification process.

Abbreviations: CART, classification and regression tree • USGS, United States Geological Survey


    INTRODUCTION
 
Nitrate in ground water has been known to be a potential human health problem for more than 50 yr, since Comly (1945) reported that concentrations of nitrate in drinking water could cause methemoglobinemia in infants. A drinking water standard of 45 mg/L for nitrate (10 mg/L of nitrate, as nitrogen) for United States public water supplies was established in 1962 (United States Department of Health, Education, and Welfare, 1962). This standard has remained in force since then and is the current maximum contaminant level (MCL) for public drinking water supplies (USEPA, 2001).

Some areas of the United States are more likely than others to have high nitrate concentrations in ground water. Susceptibility to nitrate contamination typically is highest in areas with sandy soils (Nolan et al., 1997). Within the Albemarle–Pamlico Drainage Basin of North Carolina and Virginia, the highest nitrate concentrations occurred in areas having sandy soils with relatively low organic carbon content (Spruill et al., 1997; Spruill et al., 1998). Such areas primarily are located in the inner Coastal Plain where dissolved carbon concentrations are less than 3 mg/L. Nitrate concentrations exceeded the 10 mg/L maximum contaminant level in about 5% of the ground water samples from these areas.

To control nitrate contamination in ground water, the nitrate sources must be identified before appropriate and effective management actions can be taken. Ground water can have many nitrate sources, both natural and anthropogenic (Madison and Brunett, 1985; Hallberg and Keeney, 1993; Spalding and Exner, 1993). Rain, forests, grasslands, agricultural lands, organic wastes (e.g., farm manures, sewage sludges, food-processing wastes, and crop residues), row crops, vegetable crops, and livestock production are all potential nitrate sources in ground water.

Nitrogen sources have increased over the last several decades (Smil, 1997; Vitousek et al., 1997). Nationally, nitrogen applications to agricultural lands have increased 20-fold over the last 50 yr, and the most dramatic increases have occurred over the last 30 yr (Puckett et al., 1999). On an annual basis, fertilizer is the largest input of nitrogen to most agricultural systems (Hallberg and Keeney, 1993). In North Carolina, confined feeding operations, particularly with respect to hog production, have increased from 2.2 million hogs in 1990 to more than 10 million hogs in 1999, primarily in the Coastal Plain, making North Carolina the second largest producer of hogs in the United States (Mallin, 2000). In addition, human populations have increased as much as 40% since 1990 in some counties included in this study (United States Census Bureau, 2001). Because of increased nitrogen sources, the many potential regional or local nitrate sources to ground water, and increasing numbers of people in close proximity to these sources, identifying the predominant nitrate sources in ground water may not be easy. Reliable methods are needed that can be used by natural resources scientists and managers to identify sources of nitrate-contaminated ground water.


    PREVIOUS STUDIES
 
Several studies have been conducted over the last 30 yr to identify nitrate sources in ground water (Kreitler, 1975; Kreitler and Jones, 1975; Gormly and Spalding, 1979; Fogg et al., 1998) and surface water (Showers et al., 1990). Gormly and Spalding (1979) used isotopes of nitrogen and found that the primary nitrate sources in ground water in Nebraska and corresponding δ15N range of values were +5 to +9‰ (per mil) for soil nitrogen, -2 to +7‰ for commercial fertilizer, and +10 to +23‰ for livestock. Komor and Anderson (1993) used δ15N to distinguish nitrate sources in ground water beneath five land-use settings in Minnesota and found that water from wells in livestock feedlots had an average δ15N concentration of 21.3‰; in cultivated irrigated fields, 7.4‰; in residential areas with septic systems, 6‰; in nonirrigated cropland, 3.4‰; and in natural undeveloped areas, 3.1‰. Several isotope chemists reported that δ15N concentrations of 10‰ or greater (Kreitler, 1975; Gormly and Spalding, 1979; Aravena et al., 1993; Fogg et al., 1998; Kendall and McDonnell, 1998) indicate that nitrogen from animals is present. In general, δ15N has been demonstrated to be an effective discriminator between plant or commercial fertilizer–derived nitrate and animal-derived nitrate, but divisions between multiple animal sources and humans are less well defined (Fogg et al., 1998; Kendall and McDonnell, 1998). However, Fogg et al. (1998) indicated that separations between septic and dairy or feedlot sources were possible and, based on their data, septic wastes had a δ15N signature range from 7.3 to 10.3‰, whereas the δ15N signature range of the animal sites was from 10 to 14‰.

Thus, although δ15N of nitrate can be used to distinguish between animal and organic N or inorganic fertilizer-derived nitrate, it has not been successfully used alone to distinguish between subcategories of animal-derived nitrate in ground water. Even coupling δ15N with other isotopes, such as δ18O, has not been particularly successful for determining differences between animal sources. Nitrate δ15N data in combination with other water quality variables, such as ions or ionic ratios, however, may be effective in distinguishing animal sources. For example, halogen ratios have been used to identify specific oil-field brines or salt contamination of freshwater aquifers (Whittemore and Pollock, 1979) or to discriminate among precipitation, natural ground water, domestic wastes, and saltwater contamination from evaporites (Davis et al., 1998). By including more variables in the source-identification process, the probability should be greater for successful discrimination among animal sources. Karr et al. (2001) recently coupled the information from both major ion and stable isotope chemistry of ground and surface water to identify sources of nitrate contamination.


    MULTIVARIATE STATISTICAL METHODS
 
Multivariate techniques, both computational and graphical, have been applied to determine the natural phenomena that control ground water quality. Waters associated with specific sources, such as aquifers or petroleum reservoirs, often can be distinguished by using trilinear and pattern diagrams, such as those devised by Piper (1944) and Stiff (1951). Hem (1985) presents several examples of the use of Piper diagrams for distinguishing water composition derived from specific aquifers. These techniques work, in general, because the specific minerals used for source identification either are dissolved by water moving through the rock matrix that composes the natural reservoir or contain connate waters that provide a unique signature of the source. However, for the same reason that makes these diagrams (which use only seven or eight ions) effective at discerning ions derived from a few natural sources, discerning anthropogenic sources with such a limited number of ions becomes considerably more difficult, because of the similarity of concentrations of the same few ions produced by many different natural and anthropogenic sources. The use of more sophisticated multivariate techniques, which can incorporate information from many more chemical ions, chemical isotopes, and associated properties to detect unique combinations of variables that identify each source, becomes imperative.

Multivariate statistical methods, capable of distinguishing complex relations among many variables, can be useful for source-identification problems. Alley (1993) presented an excellent overview of multivariate statistical techniques that have been applied to examine phenomena associated with water quality and to understand behavior and spatial patterns of water quality constituents. These techniques include cluster analysis, principal components analysis (PCA), and factor analysis. Steinhorst and Williams (1985) applied multivariate analysis, including analysis of variance, canonical analysis, and discriminant analysis to segregate ground water sources and to differentiate water quality associated with particular aquifers in basalt flows and interbeds in south-central Washington. Multivariate procedures, however, have not been used extensively to determine contamination sources from human activities.

A primary assumption behind this study is that the variability in one or more chemical constituents caused by anthropogenic sources is greater than that caused by other possible natural sources, such as minerals in rocks and soils of the region; therefore, certain constituents can be related to waste-specific sources. The waste-specific sources that often contribute to nitrate contamination are septic-system wastes; fertilizers applied to lawns, row crops, and golf courses; hog wastes leaking from lagoons or sprayed on crops as fertilizer; and chicken wastes applied to crops as fertilizer (Madison and Brunett, 1985; Hallberg and Keeney, 1993).

When the objective of an analysis is to determine into which predefined category a particular observation belongs, discriminant analysis (Davis, 1985) and classification or regression trees (Wilkinson, 2000) are appropriate techniques. Discriminant analysis is a multivariate technique, related to multiple regression, whereby linear equations are found that best discriminate the observations into two or more groups (Wilkinson, 2000). Although either discriminant analysis or classification-tree models are appropriate for the problem of classifying observations into predefined groups, classification-tree techniques have several advantages over discriminant analysis. The primary advantage of classification trees is that they are graphical and the output is more easily interpreted than strictly numerical methods, such as discriminant analysis (Breiman et al., 1984; StatSoft, 2001). As an example, classification-tree model output is hierarchical (StatSoft, 2001) and produces a visual representation of a dichotomous key, familiar to biologists, that visually and sequentially guides the user through a series of simple if–then statements from the beginning of the tree through a series of subgroups to the final group classification. Other advantages of classification trees over discriminant analysis procedures are that they are nonparametric (Breiman et al., 1984) and can incorporate categorical data, thus making classification-tree methods more versatile with respect to variables that can be included in model development.

After reviewing statistical procedures in available software, classification trees were selected as a versatile tool that can be applied and understood effectively by those who may not have extensive statistical training. Even though many statisticians are not familiar with classification-tree techniques (Wilkinson, 2000), tree models and their development began in the 1960s in the field of social sciences and have, for about the last 20 yr, been extensively used in medicine, marketing, and information management. Regression-tree models (similar to classification-tree models) have only recently been applied to water quality problems. Qian and Anderson (1999) used regression trees to identify factors that affect pesticide concentrations in the Willamette River basin in Oregon. Robertson et al. (2001) used regression trees to identify important environmental variables that affect nutrient concentrations in watersheds in the upper Midwest.

The purpose of this study was to apply tree-based classification methods to (i) determine which water quality variables, both with and without δ15N, could be used to identify the source of nitrate contamination with 80% or better success using selected chemical characteristics of the water sample from five known source categories, and (ii) determine if the chemical characteristics of water samples collected from wells in the North Carolina Coastal Plain and contaminated with nitrate can be used to identify the nitrate source. Ultimately, the intent of this study is to develop and demonstrate the potential of a simple predictive classification procedure that could be used and further developed by environmental scientists and regulators to identify principal nitrate sources present in ground water in a specific geographic area and perhaps apply these procedures to similar environmental problems. Throughout the remainder of this paper, the δ15N of nitrate will simply be referred to as δ15N.


    METHODS
 
Five common nitrate sources were selected for the analysis: hog wastes sprayed on cultivated fields (Spray), poultry wastes applied as litter (Poultry), septic-system wastes (Septic), inorganic fertilizer applied on golf courses (Golf), and inorganic fertilizer applied on row crops (Crop). Permission was obtained to sample ground water from 4 to 15 locations per category in the Coastal Plain of North Carolina (Fig. 1). Ground water samples were collected directly beneath each source area or, in the case of septic wastes, in the septic field or beneath fields sprayed with septic wastes. Forty-eight ground water samples from 48 wells were included for development of the model.



Fig. 1. Locations of sites sampled in North Carolina and nitrate contamination sources.

 
Wells included in the study were screened to intercept at least the upper 1.5 m of the saturated zone near the water table and were intended to intercept recent (<2 yr old) vertical recharge. The water table of the shallow aquifer usually is located within 3 m of the land surface in the North Carolina Coastal Plain and depth to water ranges between 1 and 3 m below land surface. United States Geological Survey (USGS) wells in the study area intercepted the upper 0.3 to 0.6 m of the saturated zone. In general, areas having sandy soils were selected for sampling to maximize the probability of contamination from nitrate and to ensure that adequate oxygen to maintain nitrate was present. Although only water samples having NO3–N concentrations greater than 3 mg/L were to be collected (concentration was estimated by using test strips for nitrate), a few samples received from the lab had lower concentrations. Four samples had concentrations too low (<0.5 mg/L) to analyze {delta}15N and were not used. Twenty-six wells were installed and/or used by the North Carolina Department of Environment and Natural Resources (NCDENR) as monitoring wells for a study of pesticides and nitrate in North Carolina ground water (Wade et al., 1997), onsite waste disposal, or other studies. Wells installed by the NCDENR typically were constructed of polyvinylchloride (PVC) with 1.5- to 3-m screens located in the saturated zone of the aquifer beneath the contaminant sources. The USGS installed temporary wells using a minipiezometer assembly (Winter et al., 1988) at 16 of the sites. The minipiezometer was hammered to the desired depth, the 2.5-cm screen extended, and the water sample collected through polytetrafluoroethylene (PTFE) or nylon tubing using a peristaltic pump. North Carolina State University installed six shallow PVC wells that were used in this study.

Each water sample was analyzed for 32 water quality variables that were included in model development (Table 1). Selected water quality data collected from the 48 wells are presented in Table 2. Water samples from 17 additional wells, most with 0.5- to 1.5-m screens, were used to test the resulting models and were collected as part of other USGS and NCDENR–North Carolina Department of Agriculture (NCDA) studies conducted in the study area (Table 3). All water samples collected between August 1996 and February 2000 were filtered through a 0.45-µm capsule filter by using either a peristaltic or submersible pump fitted with either PTFE or nylon tubing. The USGS National Water-Quality Laboratory in Denver, Colorado analyzed major inorganic ions and nutrient species according to methods in Fishman (1993). Either the Stable Isotope Laboratory at North Carolina State University or the USGS Stable Isotope Research Laboratory in Menlo Park, California analyzed samples for {delta}15N of nitrate. Determinations of {delta}15N were done according to methods presented in Chang et al. (1999) and Silva et al. (2000). Either the USGS National Water-Quality Laboratory or the NCDENR Division of Water Quality Laboratory analyzed the additional 17 well-water samples that were collected as part of the USGS Albemarle–Pamlico Water-Quality Assessment (NAWQA) Program (Spruill et al., 1998) or for the North Carolina Interagency Pesticide Study (Wade et al., 1997).


Table 1. Water quality variables with reporting units included in model development.

 

Table 2. Selected data used to develop classification-tree models.†

 

Table 3. Data from test sample used to validate Model 2.†

 
Two classification-tree models were devised by using the classification and regression tree (CART) procedure (Breiman et al., 1984) on the original 48-sample data set. Model 1 included nitrate δ15N because it is known to be highly valuable in discriminating animal and fertilizer nitrate. However, δ15N may not be available because of its cost or because it is not a standard analyte in most ground water monitoring networks. Therefore, all variables, except δ15N, were used in devising Model 2.

The basic idea behind classification-tree models is to create a hierarchical tree of key variables and values based on a sample of objects of known classes (termed the learning sample); the resulting tree is then used to predict classes from another independently obtained sample having the same variables but unknown classes (termed the test sample). Classification-tree procedures employed by many statistical programs begin by separating the initial group composed of all observations (termed the root node, which is also a parent node or split node) into two homogeneous groups (termed child nodes) (Fig. 2). The program does this by examining all possible variables and then selecting the best variable (termed the split variable) to split the group into two homogeneous groups (nodes that have the fewest misclassifications or lowest "impurity" and greatest reduction in error from the previous node). The two resulting child groups are now the new parent nodes. The program again splits each of the two new parent nodes into two more child nodes each. This process continues until all of the objects or observations are classified. The groups formed at the end of the tree, which cannot be split any more, form the terminal nodes of the tree (Fig. 2).



Fig. 2. Diagram of hypothetical classification tree showing node types, split variables, and associated split values.

 
A variety of tree models including THAID (Morgan and Messenger, 1973), CART (Breiman et al., 1984), FACT (Loh and Vanichsetakul, 1988), and QUEST (Loh and Shih, 1997) are available through several statistical software programs, and different tree models may generate different trees according to the classification algorithms employed by the particular model (StatSoft, 2001). Specific splitting algorithms for many of these programs are discussed in Loh and Shih (1997). The CART procedure (Breiman et al., 1984) and a variation, RPART (Therneau and Atkinson, 1997), both used in this analysis, evaluate all variables to determine which variable can make the best split (i.e., the variable that splits the parent group into the two purest child groups) using the GINI index of impurity, i(t) (Breiman et al., 1984). The GINI index is a measure of the total error (also known as deviance, D_i, for classification trees) in any node and is computed by:

\[ i(t) = 1 - \sum_{j} p_j^{2} \]

where the sum runs over the j classes in node t and p_j is the proportion of each class at the node (Loh and Shih, 1997). Thus, if the first, or root node, contains four classes in equal proportion, then the GINI index is 1 - [(1/4)^2 + (1/4)^2 + (1/4)^2 + (1/4)^2] or 1 - 1/4 or 0.75. A node with only one class (all observations are perfectly classified) would have a GINI impurity index value of 1 - (1)^2 or 0. The error after the split is the sum of the error of the two resulting child nodes, where D_i (child) = D_i (left child) + D_i (right child). The variable selected would be the one that most reduces the error between the parent and the sum of the error of the two new child nodes:

\[ \Delta D_i = D_i(\mathrm{parent}) - \left[ D_i(\mathrm{left\ child}) + D_i(\mathrm{right\ child}) \right] \]

A succinct description of the GINI index is presented in StatSoft (2001) and Qian and Anderson (1999). It should be noted that the models developed in this paper are not necessarily unique, and it is possible that the model algorithm could select more than one competing variable or split value, particularly with small sample sizes. However, both CART (Breiman et al., 1984) and RPART (Therneau and Atkinson, 1997) were used in the model development process and resulted in very similar models.
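The impurity calculation and split selection described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the CART or RPART code used in the study: the function names and the six-value data set are hypothetical, and the child impurities are summed without weights because that is how the text states the rule (CART implementations commonly weight each child by its share of the observations).

```python
from collections import Counter


def gini(labels):
    """GINI impurity i(t) = 1 - sum_j p_j^2 for the class labels in one node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_reduction(values, labels, threshold):
    """Parent impurity minus the summed impurity of the two child nodes.

    Assumption: children are summed unweighted, as described in the text.
    """
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    return gini(labels) - (gini(left) + gini(right))


def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; keep the best one."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: split_reduction(values, labels, t))


# Root node with four classes in equal proportion, as in the worked example.
print(gini(["Crop", "Golf", "Septic", "Spray"]))  # 0.75

# Hypothetical values of one predictor for six samples of two known classes.
values = [2.0, 4.0, 6.0, 20.0, 25.0, 30.0]
labels = ["fertilizer"] * 3 + ["animal"] * 3
print(best_threshold(values, labels))  # 13.0
```

A full tree would repeat this search over every candidate variable at every node, which is why competing variables with nearly equal reductions can yield different trees for small samples.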

An important consideration in devising tree models pertains to the construction of the "right-sized" classification tree (StatSoft, 2001). In essence, how large should the tree be to give the needed predictive accuracy without creating too complex a tree? For example, it may be possible to construct (or "grow") a tree that perfectly classifies all objects, but the resulting tree could be very long and complex, possibly ending with each observation in its own terminal node. A tree that is too short (having too few split nodes) will often have a higher predictive error (or cost) than a more complex tree with more splits and nodes. The issue of when to stop building the tree is a major topic in the classification-tree literature, and good discussions of the principal methods available (including test sample cross-validation, V-fold cross-validation, and global cross-validation) are presented in Breiman et al. (1984) and StatSoft (2001). However, because the intent of this study was largely exploratory in nature and the sample size of 48 observations with five separate groups was very small, a rigorous development of a final fully cross-validated tree model was not the focus of this paper.

In addition to the standard analysis of tree models, the classification success of the terminal nodes of both models (evaluated simply as the percentage of correct classifications of each group) was used to estimate the predictive classification potential of each model, similar to classification matrices produced by discriminant analysis procedures in several commercially available statistical programs. The 48 water analyses shown in Table 2 compose the learning sample by which both classification-tree models were constructed. These are the original observations (i.e., water samples with variables selected by the program for construction of Model 1 or Model 2) that form the basis for each model. If the performance were good (80% classification success or better on the learning sample was considered to be acceptable), there would be a basis for adopting the model for practical use or further development to test the model's predictive power and reliability.
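The percentage-correct evaluation described above amounts to tabulating a classification matrix and reading off the diagonal. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and the eight example classifications are invented for illustration and are not taken from Table 2, Table 4, or Table 5a.

```python
from collections import Counter, defaultdict


def classification_success(actual, predicted):
    """Return the classification matrix, percent correct per class, and overall."""
    matrix = defaultdict(Counter)  # matrix[actual class][predicted class] = count
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    per_class = {
        a: 100.0 * row[a] / sum(row.values()) for a, row in matrix.items()
    }
    correct = sum(row[a] for a, row in matrix.items())
    overall = 100.0 * correct / len(actual)
    return matrix, per_class, overall


# Hypothetical learning-sample results for two source categories.
actual = ["Crop"] * 4 + ["Septic"] * 4
predicted = ["Crop"] * 4 + ["Septic"] * 3 + ["Spray"]

matrix, per_class, overall = classification_success(actual, predicted)
print(per_class)        # {'Crop': 100.0, 'Septic': 75.0}
print(overall)          # 87.5
print(overall >= 80.0)  # True: meets the 80% acceptance criterion
```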

Testing on an independent sample and comparing classification success for each category between the learning sample and test sample can be used to demonstrate the practical predictive performance of the model (model validation). However, Model 1 could not be validated by testing with an independent sample, because the primary split variable selected by Model 1 included δ15N, which was not available for analyses of water samples from most wells where the nitrate source was known. All variables identified by Model 2 were available, and the predictive success of Model 2 was validated by using water analyses from an independently obtained test sample of 17 wells not used for Model 2 construction (Table 3). A Kruskal–Wallis test (Conover, 1980) was used when evaluating differences between distributions of model variables among the five sources.
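For readers unfamiliar with the Kruskal–Wallis test mentioned above, the statistic can be sketched as follows. This is a generic textbook formulation with the usual tie correction, not the software used in the study; the function name and the three small groups are hypothetical. The resulting H is compared with a chi-square distribution with (number of groups - 1) degrees of freedom to obtain a p-value, a step not shown here.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the standard correction for ties.

    Assumes every group is non-empty and the pooled values are not all equal.
    """
    pooled = sorted(v for g in groups for v in g)
    n_total = len(pooled)

    rank_of = {}   # value -> mean rank shared by all tied observations
    tie_term = 0   # sum of (t^3 - t) over groups of t tied values
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        i = j

    rank_sum_term = sum(
        sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups
    )
    h = 12.0 / (n_total * (n_total + 1)) * rank_sum_term - 3 * (n_total + 1)
    correction = 1 - tie_term / (n_total ** 3 - n_total)
    return h / correction


# Hypothetical values of one variable in three source categories.
print(round(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), 6))  # 7.2
```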


    RESULTS AND DISCUSSION
 
A classification-tree model (Model 1, Fig. 3) was devised by using all 32 variables (including δ15N). The classification tree consists of four splits and five terminal nodes. Only 46 of the original 48 samples were used because of missing zinc data for two of the water samples. The most important variables in this classification tree were potassium plus δ15N of nitrate (KNO315), nitrate to ammonia ratio (NO3NH4), sodium to potassium ratio (NAK), and zinc (ZN). The resulting classification matrix for evaluating Model 1 performance on the learning sample is shown in Table 4. Source classification of contamination by inorganic fertilizer in both the Crop and Golf categories resulted in 100% correct placement. The Septic category nitrate sources were classified with 75% success. Water samples from the Poultry category were placed with 71% success. Overall correct classification performance of Model 1 was approximately 88% for all five categories. Because all observations with δ15N of nitrate were used to develop the model, no independently collected observations (water samples) were available to test model performance.



Fig. 3. Classification tree for Model 1 using the predictor variables potassium plus δ15N of nitrate (KNO315), nitrate to ammonia ratio (NO3NH4), sodium to potassium ratio (NAK), and dissolved zinc (ZN), in micrograms per liter.
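Read as a dichotomous key, the Model 1 tree reduces to a short chain of if–then statements. The sketch below uses the split values reported in the abstract (18.2, 575, 3.2, and 2.8); everything else is an assumption made for illustration: the function and argument names are hypothetical, nitrate and ammonia are assumed to be in the same concentration units, values exactly equal to a split value are assigned to the lower branch, and the two sample inputs are invented rather than taken from Table 2.

```python
def classify_model1(k, d15n, no3, nh4, na, zn):
    """Assign a nitrate source category using the Model 1 split values.

    k, na   : potassium and sodium, mg/L
    d15n    : delta-15N of nitrate, per mil
    no3, nh4: nitrate and ammonia, same units (assumed); nh4 must be nonzero
    zn      : dissolved zinc, micrograms per liter
    """
    kno315 = k + d15n                  # potassium plus delta-15N of nitrate
    if kno315 > 18.2:                  # animal-derived nitrate
        if na / k > 3.2:               # high sodium to potassium ratio
            return "Septic"
        return "Spray" if zn > 2.8 else "Poultry"
    # inorganic fertilizer or soil organic N
    return "Crop" if no3 / nh4 > 575 else "Golf"


# Hypothetical water samples.
print(classify_model1(k=5.0, d15n=15.0, no3=10.0, nh4=0.1, na=40.0, zn=1.0))   # Septic
print(classify_model1(k=2.0, d15n=4.0, no3=14.5, nh4=0.02, na=5.0, zn=1.0))    # Crop
```

A sample with a nondetect for ammonia or potassium would need special handling before the ratios are formed; the sketch does not attempt that.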

 

Table 4. Performance of Model 1 on learning sample.

 
Model 2 was formulated without δ15N data. All 48 samples were used in model development. The model that resulted included the sum of sodium plus potassium (NAKSUM), nitrate to ammonia ratio, calcium to magnesium ratio (CMR), and sodium to potassium ratio (Fig. 4). Classification success ranged from 100% for ground water from beneath fertilized golf courses to 71% for water collected from beneath fields fertilized with poultry litter (Table 5a). Overall classification success for the model on the learning sample was about 85%, similar to Model 1. Seventeen samples collected from other areas in the Coastal Plain for three of the five categories were used for validating Model 2. Classification success for Crop, Spray, and Septic categories was 100% (Table 5b).



Fig. 4. Classification tree for Model 2 using the predictor variables sum of sodium plus potassium (NAKSUM), nitrate to ammonia ratio (NO3NH4), calcium to magnesium ratio (CMR), and sodium to potassium ratio (NAK).

 

Table 5a. Performance of Model 2 on learning sample.

 

Table 5b. Performance of Model 2 on test sample.

 
Application of classification trees to ground water quality data from eastern North Carolina appears to be very useful in identifying nitrate sources. Model 1 identified four important variables in discriminating between the five groups: potassium plus δ15N of nitrate, nitrate to ammonia ratio, sodium to potassium ratio, and zinc. Consistent with previous work, much of it summarized in Kendall and McDonnell (1998), δ15N of nitrate is very useful in distinguishing animal sources of N from the other two major environmental sources of N: soil organic N and fertilizer N. For discussion purposes, another model (not shown) was constructed by using only δ15N; it had a model-derived split value (SV) of about 8.5‰ and correctly classified most soil organic and/or inorganic fertilizer sources and animal-based N sources. Based on the learning sample, the model using δ15N alone was able to correctly classify 17 of 18 fertilizer- or organic N–derived nitrate samples and 29 of 30 animal-source samples. The addition of potassium, in milligrams per liter, to the δ15N per mil concentrations, however, better separated (i.e., caused less overlap of the distributions) the animal from the inorganic- and/or plant-derived nitrate nitrogen than δ15N alone, as shown in Fig. 5, and was selected by CART for this data set as the best first split. The primary improvement appears to result from the improved ability to separate poultry from the inorganic- and/or soil organic N–derived nitrogen sources and the Golf category from the animal-derived N sources.



 
Fig. 5. Distributions of (A) N15 (δ15N in per mil) and (B) potassium (mg/L) plus δ15N of nitrate (‰) (KNO315) in five source categories, demonstrating the effect of adding potassium to increase separation of the animal and fertilizer groups and particularly the Poultry and Septic categories.

 
In Model 1, the best discriminator of Golf from Crop samples for the model run shown was the nitrate to ammonia ratio (split value = 575). In general, the Golf water samples had much lower nitrate nitrogen concentrations (median = 2.9 mg/L) than the Crop samples (median = 14.5 mg/L). However, some model runs used nitrate concentrations (model not shown) or other nitrate-related ratios (nitrate to potassium ratio for Model 2) to separate these two groups. The sample size for the Golf category (N = 4), however, was so small that it might not be possible to distinguish Crop from Golf categories, unless ground water nitrate concentrations are lower at golf courses compared with those at cultivated fields. Thus, although ground water beneath golf courses appears to have lower nitrate concentrations compared with ground water beneath row crops, many more randomly selected water samples stratified by source would need to be collected to reach such a conclusion.

The best discriminator of septic waste from other animal-derived N sources was the sodium to potassium ratio. Based on information shown in Fig. 6, sodium concentrations in ground water contaminated by septic wastes were higher than those in ground water contaminated by other animal-derived wastes, and the sodium to potassium ratios of septic wastes were significantly higher (median of approximately 14, p < 0.05) than those of the other categories investigated (median of all categories < 3). Wilhelm et al. (1994) used sodium concentrations to identify septic-system contamination at a site in Canada. The concentrations were approximately 10 times the background sodium concentration of the ground water (Wilhelm et al., 1994) and the ratio of sodium to potassium in these septic wastes was about 8. Data from Zublena et al. (1993b) indicate that the sodium to potassium ratios for swine lagoon wastes and stockpiled broiler or layer litter (Zublena et al., 1993a) and common fertilizers (Zublena et al., 1991) are all less than 0.5, much lower than the sodium to potassium ratio (approximately 7.5 to 8) indicated by data from Wilhelm et al. (1994) for septic wastes. The sodium to potassium ratio data shown for septic wastes in the North Carolina Coastal Plain in Fig. 6 had a median of about 14 with 75% of the samples exceeding 8, which is comparable with the ratio shown in Wilhelm et al. (1994). The data from our study suggest that sodium relative to potassium is much higher in septic wastes than in either of the other animal-derived wastes, which may be due to the preponderance of sodium in the typical human diet and the use of salt in water softeners in rural areas. In any case, the sodium to potassium ratio appears to be a good identifier of septic-system wastes within the study area.



 
Fig. 6. Distributions of (A) NA (sodium, in milligrams per liter) and (B) NAK (sodium to potassium ratio, unitless) in five source categories showing increase of separation between septic and the other two animal source categories when NAK is used.

 
After segregating the septic from the poultry and hog-spray wastes (sodium to potassium ratio <3.2, Fig. 3), zinc was useful for further separating the hog and poultry wastes. From the model, a zinc value greater than 2.2 µg per liter (µg/L) indicated hog wastes, whereas values less than 2.2 µg/L indicated poultry wastes. Zinc is added to hog feed as a growth enhancer (National Research Council, 1998) and may be the reason for the higher concentrations observed in ground water samples collected beneath crops fertilized with hog spray.
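The sequence of Model 1 splits discussed above can be written as nested rules. The following is a minimal sketch under stated assumptions rather than a reproduction of the fitted tree: the function name is hypothetical; the split value for the first variable (potassium plus δ15N, KNO315) is not given in this discussion and must be supplied by the caller from Fig. 3; the direction of the first split (higher values taken as animal-derived) and of the nitrate to ammonia split (lower ratios taken as Golf) are inferred from the text; and the handling of values exactly equal to a split is arbitrary.

```python
def classify_model1(kno315, no3nh4, nak, zinc_ug_l, kno315_split):
    """Sketch of the Model 1 decision rules as described in the text.

    kno315       potassium (mg/L) plus delta-15N of nitrate (per mil)
    no3nh4       nitrate to ammonia ratio
    nak          sodium to potassium ratio
    zinc_ug_l    zinc concentration, micrograms per liter
    kno315_split first split value; NOT stated in the text, so it is a
                 required argument here (take it from the published tree)
    """
    if kno315 <= kno315_split:
        # Inorganic fertilizer / soil organic N branch.
        # ASSUMPTION: ratios at or below the reported split (575) indicate Golf,
        # because Golf samples had much lower nitrate than Crop samples.
        return "Golf" if no3nh4 <= 575 else "Crop"

    # Animal-derived branch: ratios below 3.2 are poultry or hog spray,
    # so ratios of 3.2 or more are treated as septic.
    if nak >= 3.2:
        return "Septic"

    # Zinc above 2.2 ug/L indicated hog wastes; below indicated poultry.
    # A value exactly equal to 2.2 is assigned to Poultry arbitrarily.
    return "Spray" if zinc_ug_l > 2.2 else "Poultry"


# Example call with a purely hypothetical first split of 10.0:
print(classify_model1(kno315=20.0, no3nh4=100.0, nak=14.0,
                      zinc_ug_l=1.0, kno315_split=10.0))
```

Keeping the unknown first split as an explicit argument avoids implying a value the text does not report; the other three thresholds (575, 3.2, and 2.2 µg/L) are those given in the discussion.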

From the performance data shown in Table 4 for the learning sample, Model 1 appears to be an excellent discriminator of nitrate from inorganic fertilizer on crops, golf courses, and sprayed hog wastes (100, 100, and 92%, respectively). Model 1 did not do as well in discriminating between poultry and septic sources, as indicated by the lower classification-success rates (71 and 75%, respectively; Table 4). This may be because, as previous researchers have shown, the δ15N values of septic sources have a wide range (7.3 to 10‰) that grades into values in both the Crop and Golf categories (Fig. 5), making discrimination difficult. The overlap was not improved by adding potassium (Fig. 5): the lower tail of the Septic distribution still overlaps with the Crop and Golf categories.

Thus, although δ15N by itself is not particularly successful in separating specific animal sources (Kendall and McDonnell, 1998) and shows no difference between animal categories in the area studied in the Coastal Plain of North Carolina (Fig. 5), using it in combination with other isotopes (such as δ18O, as suggested in Kendall and McDonnell, 1998) or ions, as demonstrated by results shown in this paper, can potentially segregate by animal-source category. An advantage of using major ions rather than isotopes is the generally lower cost of major-ion analysis. Although major ions alone can be used effectively in eastern North Carolina and probably most areas where the specific conductance of the shallow ground water is 350 µS/cm or less, specific models probably will need to be devised for areas where specific conductance is typically greater than this. Such areas include coastal areas and parts of the western and midwestern United States where evaporite deposits or saltwater intrusion occurs. In these areas, δ15N is probably the best indicator of nitrate sources, and further separation of nitrate sources by using major ions may be difficult or may require trace elements or other isotopes.

Nevertheless, in North Carolina and perhaps other areas of the East Coast where shallow ground water has relatively low dissolved solids, major ions can be used effectively to identify sources, as indicated by results shown for Model 2 (Fig. 4, Table 5a). In this model, sodium plus potassium, in mg/L, was found to be an excellent indicator of inorganic and/or soil organic N and animal-derived nitrate sources, with only one crop fertilizer–derived water sample misclassified as septic-derived N and one septic-derived sample classified as nitrate from an inorganic fertilizer source (Table 5a). The overall classification success rate for Model 2 on the learning sample was 85%. The primary distinguishing characteristic of water samples from golf courses was the low nitrate concentration, although statistical limitations of its use for this purpose have been mentioned already. The nitrate to ammonia ratio was used by Model 2 (as in Model 1) to best distinguish the two categories, although the split value (454) was lower in this model. The calcium to magnesium ratio (split value = 2.9) was best used to distinguish poultry from hog spray, and sodium to potassium ratio was best used to distinguish septic from hog spray. The performance of the calcium to magnesium ratio in identifying poultry sources was identical to the performance of zinc in Model 1 (71% success, Table 4). Calcium and magnesium may be easily leached in the North Carolina Coastal Plain, where the cation exchange capacity (CEC) is typically low (<2 cmolc/kg). The mobility of cations may be greatly enhanced in much of the Coastal Plain, which may allow for their use in source identification in this and other areas having low CEC.

Although additional samples would be desirable in formulating a more precise model, both Model 1 and Model 2 appear to be effective in identifying nitrate from specific waste sources, at least for inorganic fertilizer-derived nitrate (Crop, Golf) and animal-derived nitrate (Spray and Septic) categories. Model 2 was tested using 17 water samples that were not used in model formulation, yielding a 100% classification success rate for the three categories (Crop, Septic, and Spray) for which data were available. The reliability of the model is further substantiated in that one well (GR-851995; Table 3) in the test data set sampled in 1995 was identified as an inorganic fertilizer source and in 1999 was identified as a hog-waste spray source (GR-851999; Table 3). Hog spray was indeed used after 1995 for fertilizing crops grown in this field and the model correctly identified nitrate sources for each time period. The water sample from L2 in 1995 (L21995; Table 3) indicated inorganic fertilizer and/or soil organic nitrogen as a source and again in 2000 (L22000; Table 3). This area is not affected by spray and is upgradient from fields that received spray. In addition, two drainage ditches (MS4D1 and MS4D2; Table 3) drain fields fertilized with inorganic fertilizer and hog spray, respectively, and were identified correctly by the model.

A significant finding of this study was that, with the exception of nitrate, no anion was identified as an important classification variable. These results suggest that although anions generally are more mobile in water, they do not differ significantly in concentration among source categories in shallow ground water of the North Carolina Coastal Plain. Even nitrate was found to be important only in distinguishing fertilizer-derived nitrate beneath crops from that beneath golf courses; all four golf course samples used had lower nitrate, which may or may not be generally representative of golf courses. No significant differences were found among categories for sulfate (p > 0.05). Chloride in the Septic category was significantly higher (p < 0.05) than in the Crop, Golf, and Poultry categories, but not the Spray category (p > 0.10), which explains why sodium rather than chloride was selected by the model.


    CONCLUSIONS
 
There are many possible applications of the classification-tree models presented in this paper. These include determining nitrate sources in wells that appear unusual (e.g., determining the source of high nitrate concentrations in the vicinity of other wells that have much lower concentrations); determining the principal source of high nitrate where multiple sources may be contributing (e.g., a septic tank vs. nearby chicken or crop-farming operations); and evaluating the effectiveness of management actions (e.g., eliminating a source of contamination, such as a leaking sewer or spray application).

The classification-tree models developed in this study are useful in identifying variables that are important in the source-identification process, and they indicate that δ15N, dissolved calcium, magnesium, sodium, potassium, nitrate, ammonia, and zinc are potentially useful in identifying dominant nitrate sources in ground water in sandy recharge areas of the Coastal Plain. Anions in general were not identified in the modeling process as important in discriminating nitrate sources in the study area, although further work and larger sample sizes will be needed to verify this. Specifically, although the classification-tree models may be applied as presented here, they are not the only models possible, and additional ground water samples collected throughout the North Carolina Coastal Plain will be needed to better identify particular nitrate sources and improve the models, particularly for septic and poultry sources. Although this process may lead to more complicated tree models, it could also result in more precise classifications.

Although the simple models presented in this paper may be suitable for shallow aquifers in the North Carolina Coastal Plain and much of the middle Atlantic Coastal Plain, specific applications that may include other sources or contaminants (e.g., gas stations and landfills) in other areas would require the gathering of data from additional ground water sites, with samples collected from known sources, as was done in this study. Classification-tree models are widely available in statistical computer packages, are relatively easily implemented and interpreted, and appear to classify sources at a level of reliability that can be practically useful.

The nitrate-source identification techniques used here appear to be generally useful in the Coastal Plain of North Carolina and possibly other areas having shallow ground water and low specific conductance, although further research is necessary to address questions about resulting mixtures, influence of oxidation–reduction conditions in the aquifer, degradation or sorption of particular chemical indicators along flow paths, and interference with high background concentrations of ions that are used as indicators. As has been noted already, δ15N appears to be a reliable indicator under conditions where other chemical indicators would not be as effective. Thus, inclusion of δ15N in analyses is almost always advantageous for identification of sources and in establishing model plausibility. Data presented in this paper also demonstrate that routine inclusion of major ions as part of water quality studies that are not specifically directed at understanding the geochemistry can yield information that is highly useful, if not necessary, for meaningful data interpretation.


    ACKNOWLEDGMENTS
 
This project is a cooperative effort between the United States Geological Survey (USGS), the North Carolina Department of Environment and Natural Resources (NCDENR), and the United States Environmental Protection Agency (USEPA). Thanks to the many landowners, farmers, golf course managers, and others in eastern North Carolina who allowed access to their property. Special thanks to the USGS National Water-Quality Assessment Program; Song Qian, The Cadmus Group, Durham, NC; Diana Rashash, North Carolina State University Cooperative Extension, Onslow County; Wendell Gilliam, North Carolina State University; and Ray Milosh, Carl Bailey, Elizabeth Morey, Ted Mew, and Paul Dahlen, NCDENR. Finally, thanks to all of the reviewers of this paper who made many helpful comments and suggestions.


    REFERENCES
 





