- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hey All!
I have a character variable of varying lengths that I want to separate into separate variables. The variable always follows the following pattern - AA.AA.AA.AA.AA (2 characters then a period, 2 characters then a period, etc) - and each "AA" signifies a unique code I need to have separated for further analysis.
Example of what it looks like now:
orig_var |
AA.4F.9U.EB |
E5.U9 |
GB.D2.4F.EB.AA.M2 |
T4 |
YY.4D.U9 |
How I'd like for it to look:
orig_var | new_var1 | new_var2 | new_var3 | new_var4 | new_var5 | new_var6 |
AA.4F.9U.EB | AA | 4F | 9U | EB | . | . |
E5.U9 | . | . | . | . | ||
GB.D2.4F.EB.AA.M2 | GB | D2 | 4F | EB | AA | M2 |
T4 | T4 | . | . | . | . | . |
YY.4D.U9 | YY | 4D | U9 | . | . | . |
Any thoughts on how to code this to parse those 2-char codes?
Thanks!
Erice
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Do you know the max number of variables? if so loop using the scan function and . as your delimiter.
data want;
set have;
array parsed_vars(*) new_var1-new_var6;
i=1;
do while(scan(str, i, ".") ne "");
parsed_vars(i) =scan(str, i, ".");
i+1;
end;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Specify the period as your delimiter.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Can you be more specific? Also, note that orig_var is one variable in a SAS dataset that has like 200 other variables. It's not a raw data file for me to import.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
That changes things then. A simple approach would be to use something like the scan function.
newvar1=scan(orig_var, 1, '.');
newvar2=scan(orig_var, 2, '.');
etc
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Do you know the max number of variables? if so loop using the scan function and . as your delimiter.
data want;
set have;
array parsed_vars(*) new_var1-new_var6;
i=1;
do while(scan(str, i, ".") ne "");
parsed_vars(i) =scan(str, i, ".");
i+1;
end;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Reeza --
That worked except that there was a numeric vs character mismatch betweent he array variables and the orig_var pieces. When I added _char_ at the end of the array line, it work PERFECTLY. I was attempting some sort of array thing but just couldn't get it quite right on my own.
THANKS!!
Eric
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I think you should add the dollar sign instead of _char_. That would add in all the character variables to your array and you could potentially overwrite a variable if you had more than 6 variables in your variable that was to be parsed. In fact, it may not be a bad idea to explicitly declare the array length rather than use the * method.
array parsed_vars(6) $ new_var1-new_var6;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you don't know the max. number of elements in orig_var, it can be found and stored in a macro variable by proc sql. The number can then be used to dynamically create the required number of new variables in the output data set:
proc sql noprint;
select max(count(orig_var,'.'))+1 into :maxelements from have;
quit;
data want (drop=i); set have;
array parsed_vars $ 2 new_var1-new_var%eval(&maxelements);
do i = 1 to &maxelements;
parsed_vars{i} = scan(orig_var,i,'.');
end;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Another low-maintenance dynamic option can be Proc Transpose:
data have;
input var $40.;
length svar $ 2;
do i=1 by 1 until (missing(svar));
svar=scan(var,i,'.');
if not missing(svar) then
output;
end;
cards;
AA.4F.9U.EB
E5.U9
GB.D2.4F.EB.AA.M2
T4
YY.4D.U9
;
proc sort data=have;
by var;
run;
proc transpose data=have out=want (drop=_name_) prefix=new_var;
by var;
var svar;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
* Nice ! ;
* My minor edits:
* 1. no PRoc sort needed. ;
* 2. use ID statement ;
proc transpose data=have out=want (drop=_name_) prefix=var;
by var notsorted;
id i;
var svar;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
An alternative to Reeza's solution is to use a for-loop rather than a while (for people that don't trust while loops):
data want;
set have;
array parsed_vars(*) new_var1-new_var6;
do i = 1 to 6;
parsed_vars(i) =scan(str, i, ".");
end;
run;
The 6 can be redefined as a macro variable. In my code I have a length statement before the array statement, I do not know whether it works for al data without the length statement.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @CR204, curious what you mean by not trusting WHILE loops? They've always been trustworthy to me. I suppose one risk they bring is that you can accidentally write an infinite loop. But to me, they're still worthy of trust and are a useful tool. Are you thinking of other concerns?
Next up: Troy Martin Hughes presents Calling Open-Source Python Functions within SAS PROC FCMP: A Google Maps API Geocoding Adventure on Wednesday April 23.
Register now at https://www.basug.org/events.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Quentin While loops can be useful, but I dislike using them when a simple for loop would work. The "don't trust" part should be read with a wink. 😉
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi there,
I used your code which worked well in my project. Today, however, I am getting an error (Array subscript out of range at line 29, column 1).
This is my code: Not sure why it worked last week and not working today. Any help would be appreciated!
data Parsed_Topics;
set have;
array PARSED_VARS(*)$100 Topic1-Topic12;
i=1;
do while(scan(TrainingTopic, i, ";") ne "");
PARSED_VARS(i) =scan(TrainingTopic, i, ";");
i+1;
end;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
See if string variable TrainingTopic contains more than 12 delimiters ':' - 12 is the array upbound. Hop it helps.
L.L