Hello,
The basic goal of the program is to split any values that has range of values separated by dash or delimiter to be split into rows. The file we get is in excel format and we do not know which rows will need the split, the program should be able to detect it. So, my program does check the individual columns, I also know how to do the splitting for just one variable, but I am not sure of how to perform multiple splits on the same data. So, only the first column to be split works correctly, but the next column does not get split correctly.
The data looks like this:
var1 | var2 | var3 |
999997-999999 | 1 | |
999997-999999 | 7,8,9 | |
999997-999999 | 2 | |
999997-999999 | 7,8,9 | 1 |
999997-999999 | 1 | 1 |
This is what the expected data set should look like:
var1 | var2 | var3 |
999997 | 1 | |
999998 | 1 | |
999999 | 1 | |
999997 | 7 | |
999998 | 7 | |
999999 | 7 | |
999997 | 8 | |
999998 | 8 | |
999999 | 8 | |
999997 | 9 | |
999998 | 9 | |
999999 | 9 | |
999997 | 2 | |
999998 | 2 | |
999999 | 2 | |
999997 | 7 | |
999998 | 7 | |
999999 | 7 | |
999997 | 8 | |
999998 | 8 | |
999999 | 8 | |
999997 | 9 | |
999998 | 9 | |
999999 | 9 | |
999997 | 1 | 1 |
999998 | 1 | 1 |
999999 | 1 | 1 |
Here is the code I am using:
data test; set adult; array vars {*} _character_; do i = 1 to dim(vars); if findc(vars[i],'-') > 0 then do; first=scan(vars[i],1,'-'); last=scan(vars[i],2,'-'); pos = indexc(vars[i], '-'); do range = first to last; len = length(compress(put(range,32.))); len_new = pos - 1 - len; if len_new ne 0 then vars[i] = cats(repeat('0', len_new-1), range); else vars[i] = range; output; end; end; if findc(vars[i],',') > 0 then do; do j=1 by 1 while(scan(vars[i],j,',') ^=' '); new=scan(vars[i],j,','); output; end; end; end; *drop i j; run;
Please, help me fix this. Thanks!
After all of your explanations, it appears that you want to cross all numbers in the VAR1 range with all VAR2 values and then with all VAR3 values and output every combination. In principle, your code looks like a diligent stab at this logic. I'd suggest a bit simpler scheme:
data have ;
input (var1 var2 var3) (:$13.) ;
cards ;
999997-999999 . 1
999997-999999 . 7,8,9
999997-999999 . 2
999997-999999 7,8,9 .
999997-999999 1 .
999995-999997 1,2 3,4
999994 1,3 5,7,8
;
run ;
data want (drop = _:) ;
set have (rename=var1=_v1) ;
_v2 = var2 ;
_v3 = var3 ;
if cmiss (_v2) then _v2 = "#" ;
if cmiss (_v3) then _v3 = "#" ;
do _x2 = 1 to countw (_v2) ;
var2 = scan (_v2, _x2, ",") ;
var2 = ifc (var2 = "#", "", var2) ;
do _x3 = 1 to countw (_v3) ;
var3 = scan (_v3, _x3, ",") ;
var3 = ifc (var3 = "#", "", var3) ;
var1 = input (scan (_v1, 1), 8.) ;
do var1 = var1 to var1 max input (scan (_v1, 2), 8.) ;
output ;
end ;
end ;
end ;
run ;
The hash sign serves as a dummy character to simplify the scanning by eschewing more numerous IFs.
Kind regards
Paul D.
Can you explain why the row
999997-999999 | 7,8,9 | 1 |
has no values for var3 when expanded
but
999997-999999 | 1 | 1 |
does.
And you really need to show the expected result for something like
999997-999999 | 3,5,7 | 1,2 |
or explicitly state that there are absolutely never going to be cases where var2 and var3 both have more than one value at the same time.
Also, when there are values in var1 are they always sequential or could you have something like: 999995, 999997, 999999?
If you, or some one in your organization, took existing data and combined it into that form from another you might be better off going to an earlier data set.
@sson2019 wrote:
The value for var3 should be 1 when expanding
999997-999999 7,8,9 1
sorry I made a mistake in the expected data.
There can be cases were all three variables have a range of values like this:
999997-999999 3,5,7 1,2
var 1 can have sometimes values not showing range, it could have something like: 999995, 999997, 999999
This is how we receive data
So, what do the results for data like this look like in the output?
999997-999999 3,5,7 1,2
Is the order important?
It might be a lot easier to just convert your data into code instead. The values of your variables look like values you could use in a DO loop. For the ranges just replace the hyphen with TO. And make sure to replace the blanks with a missing value character.
filename code temp;
data _null_;
file code;
set have ;
array vars var1-var3 ;
put 'row=' _n_ ';' ;
do i=1 to dim(vars);
if missing(vars[i]) then vars[i]='.';
vars[i]=tranwrd(vars[i],'-',' to ');
put 'do ' vars[i]= ';' ;
end;
put 'output;';
do i=1 to dim(vars);
put 'end;';
end;
run;
data want ;
%include code ;
stop;
run;
After all of your explanations, it appears that you want to cross all numbers in the VAR1 range with all VAR2 values and then with all VAR3 values and output every combination. In principle, your code looks like a diligent stab at this logic. I'd suggest a bit simpler scheme:
data have ;
input (var1 var2 var3) (:$13.) ;
cards ;
999997-999999 . 1
999997-999999 . 7,8,9
999997-999999 . 2
999997-999999 7,8,9 .
999997-999999 1 .
999995-999997 1,2 3,4
999994 1,3 5,7,8
;
run ;
data want (drop = _:) ;
set have (rename=var1=_v1) ;
_v2 = var2 ;
_v3 = var3 ;
if cmiss (_v2) then _v2 = "#" ;
if cmiss (_v3) then _v3 = "#" ;
do _x2 = 1 to countw (_v2) ;
var2 = scan (_v2, _x2, ",") ;
var2 = ifc (var2 = "#", "", var2) ;
do _x3 = 1 to countw (_v3) ;
var3 = scan (_v3, _x3, ",") ;
var3 = ifc (var3 = "#", "", var3) ;
var1 = input (scan (_v1, 1), 8.) ;
do var1 = var1 to var1 max input (scan (_v1, 2), 8.) ;
output ;
end ;
end ;
end ;
run ;
The hash sign serves as a dummy character to simplify the scanning by eschewing more numerous IFs.
Kind regards
Paul D.
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.