09-08-2013 07:46 PM
I'm new to SAS, and I have a database with over 6000 measurements of arsenic in drinking water wells. Some wells were measured multiple times, and each measurement is a separate row. I want to change this so that there is a single row for each well, with new columns for the additional measurements, like "date2, arsenic2, depth2, address2..." Can anyone help me with the code?
Eventually I want to randomly select 90% of the wells to make a training dataset to build a model predicting arsenic at unmeasured locations. The other 10% of wells will then be used as a validation dataset to see how well the model works. I was thinking of doing this using excel to assign a random number to each well once I get one row per well.
If you've read this far, thanks very much!
09-09-2013 09:48 AM
I would think at least twice about changing the data structure from long to wide, as when you get to modeling the data, you would just have to change it back to long format if you wish to accommodate the repeated nature. If you could present the logic for wanting to change to a wide format, I am sure we can accommodate it--I just want to give you a heads-up that the modeling will almost certainly require the data to be in the long format in which it currently resides.
As far as random sampling, SAS is superior to Excel in almost every way I can think of. It is easy to assign a single random number to each well that is propagated to all records for that well.
09-09-2013 04:44 PM
Thanks for the quick reply Steve!
I suppose however I can randomly sample wells, rather than measurements, is fine. The wells don't have ID numbers so far, just latitude and longitude. I was planning to do a geostatistical interpolation (kriging) model in ArcGIS, for starters at least, although I'm no ArcGIS expert either.
09-09-2013 04:53 PM
If you have SAS Stat, you may want to look at Procs Krige2d and/or Variogram, which I'm pretty sure would prefer one measure per well.
You may want to consider picking either the most recent measure for each well, or at a similar time, if practical instead of a random sample, especially if the time between measures is "long".
09-09-2013 05:10 PM
I saw there were some spatial analyses in SAS. Temporal variability seems small compared to spatial variability, so I could ignore it at first, but we may want to incorporate it later. In any case I'm still looking for the code to randomly select wells rather than measurements, or convert my multiple rows per well (if it was measured more than once) to a single row for each well, with multiple measurements in new columns. As I told Steve, there's no "Well ID number" column yet, just latitude and longitude.
09-09-2013 10:51 PM
Once you have a Well_id variable, you can randomly select wells using:
proc surveyselect data=HAVE samprate=0.9 outall out=WANT seed=76876;
The WANT dataset will contain an extra variable Selected = 1 for selected wells and 0 for non-selected wells. Remove the seed= option to get a different selection at every execution. Remove outall option to get only selected wells in the WANT dataset.
09-10-2013 09:40 AM
You could model both types of variability (temporal and spatial) through PROC GLIMMIX, with temporal correlation handled as an R-side structure and spatial as a G-side structure.
To add on to PG's comment, from the long dataset, create a concatenated variable from the longitude and latitude, sort on this variable, and use either first. in a datastep or select unique in PROC SQL to get a dataset with only single wells. Process through SURVEYSELECT, and then merge back against the original data. This should put Selected=1 on all records for the random sample. From there the analysis can operate on a WHERE= basis.