BookmarkSubscribeRSS Feed
dauphine
Calcite | Level 5

Hi!

I'm new to SAS, and I have a database with over 6000 measurements of arsenic in drinking water wells. Some wells were measured multiple times, and each measurement is a separate row. I want to change this so that there is a single row for each well, with new columns for the additional measurements, like "date2, arsenic2, depth2, address2..." Can anyone help me with the code?

Eventually I want to randomly select 90% of the wells to make a training dataset to build a model predicting arsenic at unmeasured locations. The other 10% of wells will then be used as a validation dataset to see how well the model works. I was thinking of doing this using excel to assign a random number to each well once I get one row per well.

If you've read this far, thanks very much!

6 REPLIES 6
SteveDenham
Jade | Level 19

I would think at least twice about changing the data structure from long to wide, as when you get to modeling the data, you would just have to change it back to long format if you wish to accommodate the repeated nature.  If you could present the logic for wanting to change to a wide format, I am sure we can accommodate it--I just want to give you a heads-up that the modeling will almost certainly require the data to be in the long format in which it currently resides.

As far as random sampling, SAS is superior to Excel in almost every way I can think of.  It is easy to assign a single random number to each well that is propagated to all records for that well.

Good luck,

Steve Denham

dauphine
Calcite | Level 5

Thanks for the quick reply Steve!

I suppose however I can randomly sample wells, rather than measurements, is fine. The wells don't have ID numbers so far, just latitude and longitude. I was planning to do a geostatistical interpolation (kriging) model in ArcGIS, for starters at least, although I'm no ArcGIS expert either.

ballardw
Super User

If you have SAS Stat, you may want to look at Procs Krige2d and/or Variogram, which I'm pretty sure would prefer one measure per well.

You may want to consider picking either the most recent measure for each well, or at a similar time, if practical instead of a random sample, especially if the time between measures is "long".

dauphine
Calcite | Level 5

Thanks!

I saw there were some spatial analyses in SAS. Temporal variability seems small compared to spatial variability, so I could ignore it at first, but we may want to incorporate it later. In any case I'm still looking for the code to randomly select wells rather than measurements, or convert my multiple rows per well (if it was measured more than once) to a single row for each well, with multiple measurements in new columns. As I told Steve, there's no "Well ID number" column yet, just latitude and longitude.

PGStats
Opal | Level 21

Once you have a Well_id variable, you can randomly select wells using:

 

proc surveyselect data=HAVE samprate=0.9 outall out=WANT seed=76876;

samplingunit Well_id;

run;

The WANT dataset will contain an extra variable Selected = 1 for selected wells and 0 for non-selected wells. Remove the seed= option to get a different selection at every execution. Remove outall option to get only selected wells in the WANT dataset.

PG

PG
SteveDenham
Jade | Level 19

You could model both types of variability (temporal and spatial) through PROC GLIMMIX, with temporal correlation handled as an R-side structure and spatial as a G-side structure.

To add on to PG's comment, from the long dataset, create a concatenated variable from the longitude and latitude, sort on this variable, and use either first. in a datastep or select unique in PROC SQL to get a dataset with only single wells.  Process through SURVEYSELECT, and then merge back against the original data.  This should put Selected=1 on all records for the random sample.  From there the analysis can operate on a WHERE= basis.

Steve Denham

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 6 replies
  • 1673 views
  • 6 likes
  • 4 in conversation