topic Re: Extract ID's separated by dashes from text string in SAS Programming

Extract ID's separated by dashes from text string

hkassam — Thu, 16 Sep 2010 19:50:46 GMT

I have a dataset that contains the following sample observations. Var1 has ID's separated by dashes. Var2 is date and Var3 is location.

Var1 Var2 Var3
AB6772-CD5885-ABCD6194-XY7199---- 09/14/2010 London
PQ5928-SP352-BZ5752----- 09/16/2010 New York

Each string in Var1 has from 1 to 8 ID's. In this example, there are 4 ID's in the first obs and 3 in the second. The dashes at the end of the string mean that that there are no more ID's for that obs.

I would like to extract each ID and transpose the file as follows:

ID Var2 Var3
AB6772 09/14/2010 London
CD5885 09/14/2010 London
ABCD6194 09/14/2010 London
XY7199 09/14/2010 London
PQ5928 09/16/2010 New York
SP352 09/16/2010 New York
BZ5752 09/16/2010 New York

I am looking for a simple way to achieve this. Any suggestions?

Re: Extract ID's separated by dashes from text string

data_null__ — Thu, 16 Sep 2010 20:11:17 GMT

Does this look right?

[pre]
data need;
input Var1 &$64. Var2:mmddyy. Var3 &$32.;
format var2 mmddyy.;
cards;
AB6772-CD5885-ABCD6194-XY7199---- 09/14/2010 London
PQ5928-SP352-BZ5752----- 09/16/2010 New York
;;;;
run;
data need;
length id $16;
set need;
do i = 1 to 8;
id = scan(var1,i,'-','O');
if missing(id) then leave;
output;
end;
drop var1 i;
run;
proc print;
run;
proc contents;
run;
[/pre]

Re: Extract ID's separated by dashes from text string

hkassam — Thu, 16 Sep 2010 20:32:22 GMT

Yes, it works like a charm. I inserted the do loop in my data step reading the original dataset and it did the job.

I did not realize the scan function is able to keep count of each word between the delimiters.

The irony is that I myself created the text string with the dash delimiters in the first place but now want to unravel it for further processing.

Thanks a millon.

Re: Extract ID's separated by dashes from text string

Cynthia_sas — Fri, 17 Sep 2010 18:37:03 GMT

Hi:
There is an alternate solution -- but it requires the following to be true:
1) there is always 1 space between VAR1 (ID), VAR2 and VAR3 and
2) the last piece of ID -ALWAYS- ends in a number

If these are true, then you can parse out ID when you first read the input file, as shown in the code below, using the ANYDIGIT function.

cynthia
[pre]
data readdash;
length var1 $35 var2 8 var3 $32;
** This infile/input combination assumes that there is ALWAYS;
** at least 1 space between each variable in the input file.;
infile datalines dlm=' ' ;
input Var1 &$ Var2:mmddyy10. Var3 &$32.;
format var2 mmddyy10.;

** As long as ID -always- ends in a number,;
** search backwards from the end of VAR1 and find the last digit;
** in the VAR1 variable. With ANYDIGIT, if start is negative, the;
** search starts at the END of the string and goes to the right.;
** So the value of GOTNUM is the position of the digit before the ----.;
lg = length(var1)*-1;
gotnum = anydigit(var1,lg);

** could obs 3 happen in the data??;
if gotnum gt 0 then id = substr(var1,1,gotnum);
else if gotnum = 0 then id = var1;
return;
datalines;
AB6772-CD5885-ABCD6194-XY7199---- 09/14/2010 London
PQ5928-SP352-BZ5752----- 09/16/2010 New York
ZZzzzz-SPeee-YYaaaa----- 09/16/2010 Albuquerque
;;;;
run;

proc print data=readdash;
run;
[/pre]

Re: Extract ID's separated by dashes from text string

Peter_C — Sun, 19 Sep 2010 16:51:38 GMT

pick up on Cynthia's solution and make the infile parsing work a little harder for you 😉
The DLM= option of the INFILE statement can accept a variable name as well as a constant. Each time an INPUT statement is executed from the file defined by that INFILE, the value of the DLM= variable is re-evaluated to prepare for parsing. So, for reading the ID variables, set the delimiter to dash (-) and then set it to blank to read the remains of the line. Once the whole line is read, loop over the ID variables to release an observation for each non-blank ID :[pre]data reduced( keep= ID Var2 Var3 ) ;
length id i1-i8 $11 Var2 6 Var3 $40 delimiter $1 ;
attrib var2 format= mmddyy10. informat= mmddyy10. ;
infile cards dsd dlm=delimiter column=c ;
delimiter = '-' ;
input i1-i8 @;
delimiter = ' ' ;
input +1 /* pass over the first blank*/
Var2 @ ;
Var3 = substr( _infile_, c ) ; *put rest of line into Var3 ;
array i(8) ;
do c=1 to 8 ;
if i(c) =' ' then continue ;
id = i(c) ;
output ;
end ;
list;cards;
AB6772-CD5885-ABCD6194-XY7199---- 09/14/2010 London
PQ5928-SP352-BZ5752----- 09/16/2010 New York
;[/pre]
great things available in data step INFILE and INPUT parsing

enjoy
peterC