Solved: Re: How to separate part of an alphanumeric string

Sam20001 · Posted 01-13-2023 05:57 PM

Hello,

I would like to know how I can separate part of a variable. The values of the variable named ID look like this: AAID1248987441122-998-IDSID-0-1. I need to extract only the 13 digits after the letter D and before the first hyphen (i.e., 1248987441122). The new variable will be named sid. I have tried SCAN and COMPRESS, but neither produces what I need.

sid = scan(ID,4) results in a value of only 1

sid = compress(ID, "", 'A') results in a value like this 1248987441122-998-0-1

Your help would be greatly appreciated.

Thank you.

ErikLund_Jensen · Posted 01-14-2023 07:46 AM

Hi @Sam20001

You could also use the PRXCHANGE function. It is more flexible, so it can be coded to handle many different input formats. The learning curve is a bit steep if you are unfamiliar with the SAS PRX functions, and I recommend the PRX tip sheet as a great way of getting started: https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf

In your case, the string seems to be machine generated, so a flexible solution is not called for. Other contributors have suggested different solutions that work perfectly well on your example string, so I am just beating the drum for the PRX functions; they have saved me the trouble of writing many lines of complicated code over the years. They are known to be inefficient, but in my experience that is not a problem worth considering unless the input runs to millions of observations.

data have;
  length ID $40;
  input ID $char40.;
  datalines;
AAID1248987441122-998-IDSID-0-1
ID1248987441122-998--0-1
1248987441122
123
AAID1-998-IDSID-0-1
AAIN1248987441122_998-IDSID-0-1
AAID-1248987441122DSID-0-1
;
data want;
  set have;
  SID = prxchange('s/(\D+)(\d*)(\D.*)/$2/',-1,ID);

  * other suggestions in this post;
  SID2 = compress(scan(ID, 1, "-"), , 'kd');
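  /* 'kd' must be the 4th argument (modifiers): keep digits, treat everything else as a delimiter */
  SID3 = scan(scan(ID,1,'-'),-1,,'kd');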
  SID4 = scan(ID,2,'D');
  SID5 = scan(scan(ID,2,'D'),1,'-');
run;
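
PRXCHANGE leaves any text that the pattern does not match in place, so a row that does not follow the expected layout can still produce a non-blank SID. A minimal sketch, assuming you would rather get a blank when the layout is absent, uses PRXPARSE, PRXMATCH and PRXPOSN instead. The pattern (a "D", exactly 13 digits, then a hyphen) and the names want_strict, rx and SID_strict are illustrative assumptions, not something from the posts in this thread.

data want_strict;
  set have;
  length SID_strict $13;
  retain rx;
  /* compile the pattern once; assumes exactly 13 digits between a 'D' and a hyphen */
  if _n_ = 1 then rx = prxparse('/D(\d{13})-/');
  /* PRXMATCH returns 0 when nothing matches, so SID_strict stays blank */
  if prxmatch(rx, ID) then SID_strict = prxposn(rx, 1, ID);
  drop rx;
run;

Rows that come back blank are the ones worth inspecting by hand.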


Reeza · Posted 01-13-2023 06:17 PM

SID = compress(scan(ID, 1, "-"), , 'kd');

Use both COMPRESS + SCAN.

SCAN gets the first portion of the text (everything before the first hyphen), and COMPRESS with the 'kd' modifiers keeps only the digits, which drops the leading letters.

Or, if the text always has a fixed layout, use SUBSTR.

SID = substr(ID, 5, 13);
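
Both approaches depend on the layout of ID, so it can be worth flagging rows that do not produce exactly 13 digits. A minimal sketch, assuming a data set named have with a character variable ID; the names check and bad_length are made up for illustration.

data check;
  set have;
  /* wide enough that an unexpectedly long digit string is not silently truncated */
  length SID $40;
  SID = compress(scan(ID, 1, '-'), , 'kd');
  /* 1 when the extracted value is not exactly 13 characters, otherwise 0 */
  bad_length = (lengthn(SID) ne 13);
run;

Because COMPRESS with 'kd' returns digits only, checking the length is enough here.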

Patrick · Posted 01-13-2023 06:18 PM

One or both of the two options below should work.

data sample;
  infile datalines truncover;
  input have :$31.;
  length want1 want2 $13;
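  /* 'kd' goes in the 4th argument (modifiers): keep digits, treat everything else as a delimiter */
  want1=scan(scan(have,1,'-'),-1,,'kd');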
  /* if the wanted digits are always on the same position and it's always 13 digits */ 
  /*  then substr() will work as well */
  want2=substr(have,5,13);
  datalines;
AAID1248987441122-998-IDSID-0-1
 
AAID99999999-998-IDSID-0-1
;

proc print data=sample;
run;

s_lassen · Posted 01-14-2023 06:00 AM

You can try something like this:

data want;
  set have;
  length sid $13;
  sid=scan(ID,2,'D');
run;

That will work if you always have 13 digits after the "D". If the digit string is sometimes shorter (so that you get the hyphen and possibly other stuff in the SID variable), you could change it to:

data want;
  set have;
  length sid $13;
  sid=scan(scan(ID,2,'D'),1,'-');
run;
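
Note that SCAN with 'D' as the delimiter splits on every uppercase "D", so it relies on the prefix containing exactly one "D" before the digits. A hedged alternative, assuming the wanted value is simply the first run of digits in ID, is to locate the first digit with ANYDIGIT and cut from there. The names want_alt and start are illustrative only.

data want_alt;
  set have;
  length sid $13;
  /* position of the first digit in ID; 0 when ID contains no digits */
  start = anydigit(ID);
  /* take everything from the first digit up to the next hyphen */
  if start > 0 then sid = scan(substr(ID, start), 1, '-');
  drop start;
run;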

Sam20001 · Posted 02-06-2023 11:52 AM

All the examples work. Thank you so much for your help with this!
