Have a couple million records with a string like
"00 00 01 00 00 01 00 01 00 00 00 00 01 01 00 01 00 00 00 00 01"
"01 00 01 00 00 00 01 00 00 00 00 00 00 01 00 01 00 00 00 00 00"
String has a length of 56. Positions 2/4/6/8/10, etc are filled with either a 1 or most likely, 0.
Positions 1/3/5/7/9, etc. are all zeros and don't matter.
.
My job is parse the string of each record every two positions
(there are no spaces, that is just for clarification).
If there is a 1 in position two that means increment var1 +1
If there is ALSO a 1 in position four, (don't care about leading "0"'s
in position 1/3/5/9, etc.) increment var2 + 1. Each two positions
represent a different var, 28 total vars. String length is ALWAYS 56.
The entire 56 len string must be parsed every two characters. Potentially
there could be 28 variables that have to be incremented in each pass of
the string, (but not realistic, most likely there is only five or six)
which could be found in any section of the string, beginning to end
(as long as they are in position 2/4/6/8 up to 56, etc.)
Using the first example string, results would be
Var1 0
Var2 0
Var3 1
Var4 0
Var5 0
var6 1
var7 0
Var8 1
..
Var13 1
Var14 1
var15 0
Var16 1
Var21 1 (skipping a bunch of vars with 0's)
Now we process the second string.
Results are now:
Var1 1
Var2 0 (still zero, nothing in that position so far).
Var3 1+1 = 2
Var4 0 (still zero)
Var5 0 (still zero)
Var6 1 (from previous row)
Var7 1
Var8 0
..
Var14 1
Var15 0
Var 16 1
Var 21 1
This is what my boss gave me:
**if substr(BigString,2,1)='1' then var1+1;**
OK. Fine.
A) There are 27 more places to evaluate in the string.
B) there are a couple million records.
28 nested if then do loops doesn't sound like an answer (all I could think of). At least not to me.
Cut-and-pasted a lot of looping code as below, which check every two positions ,
but I get a finished product with twice as many records
as I started with in a table I can query, but I'm looking to store variables var1-28 to access them later.
(I don't have any choice about this).
Thanx.
data have;
set sources;
Obs = _n_;
len = length(sourcerow);
do i=1 to len by 2;
value = substr(sourcerow, i, 2);
if value gt '00' then
output;
end;
run;
If I understood correctly, like this?
data have;
input SOURCEROW $56.;
cards;
11111111111111110000000000000011111111111111111111111111
00000000000000111100000000000011111111111111111111111111
00000000000000001100000000000000000000000000000000000000
run;
data WANT;
set HAVE;
array VAR[28];
do I=2 to 56 by 2;
VAR[I/2]+ input(char(SOURCEROW,I),1.);
end;
run;
proc print noobs;
var VAR: ;
run;
2 |
VAR1 | VAR2 | VAR3 | VAR4 | VAR5 | VAR6 | VAR7 | VAR8 | VAR9 | VAR10 | VAR11 | VAR12 | VAR13 | VAR14 | VAR15 | VAR16 | VAR17 | VAR18 | VAR19 | VAR20 | VAR21 | VAR22 | VAR23 | VAR24 | VAR25 | VAR26 | VAR27 | VAR28 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
data have;
set sources;
array a{*} var1 -- var28;
Obs = _n_;
len = length(sourcerow);
do i=2 to len by 2;
a[i/2]= input(substr(sourcerow, i, 1));
end;
run;
OP, you can use PROC FREQ on the resulting table to get your counts but the solution from @smantha is quite succinct and does what you need efficiently.
Smantha,
Thanx for your quick reply.
Below is the error log from running the posted code:
56 data have;
57 set sources;
58 array a{*} var1 -- var28;
ERROR: Variable var1 cannot be found on the list of previously defined variables.
WARNING: Defining an array with zero elements.
59 Obs = _n_;
60 len = length(sourcerow);
61 do i=2 to len by 2;
62 a[i/2]= input(substr(sourcerow, i, 1));
-
22
76
ERROR 22-322: Syntax error, expecting one of the following: !, !!, &, *, **, +, ',', -, /, <,
<=, <>, =, >, ><, >=, AND, EQ, GE, GT, IN, LE, LT, MAX, MIN, NE, NG, NL, NOT,
NOTIN, OR, ^, ^=, |, ||, ~, ~=.
ERROR 76-322: Syntax error, statement will be ignored.
Not sure how to continue.
Your shown example string has 62 characters not counting the " characters and only represents 21 "tuples". Intentional, error or what?Or are the spaces not supposed to be in the string? Please provide actual representative data that matches your description. If the string has not spaces then make sure your example has no spaces, if it is 56 characters provide 56 characters.
Possibly better is provide an actual data set in the form of data step code we can use. Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the </> icon or attached as text to show exactly what you have and that we can test code against.
I believe you may be skipping things a bit Are there any other variables involved that have to be carried along?
Your comment of "Potentially there could be 28 variables that have to be incremented in each pass of
the string, (but not realistic, most likely there is only five or six)" makes it sound like you have blank and not values is that actually case?
Then show what the final desired output data set looks like.
Borrowing code from @smantha , here is a solution that gives you what you looking for
data sources;
sourcerow="00000100000100010000000001010001000000000101010001010001";
output;
sourcerow="01000100000001000000000000010001000000000000010001010100";
output;
run;
data want(KEEP=Var:);
array a{28} 4 var1-var28;
do _n_=1 by 1 until(eof);
set sources end=eof;
len=length(sourcerow);
do i=2 to len by 2;
a[i/2]=sum(a[i/2],input(substr(sourcerow, i, 1),1.));
end;
end;
output;
run;
Hope this helps,
Ahmed
If I understood correctly, like this?
data have;
input SOURCEROW $56.;
cards;
11111111111111110000000000000011111111111111111111111111
00000000000000111100000000000011111111111111111111111111
00000000000000001100000000000000000000000000000000000000
run;
data WANT;
set HAVE;
array VAR[28];
do I=2 to 56 by 2;
VAR[I/2]+ input(char(SOURCEROW,I),1.);
end;
run;
proc print noobs;
var VAR: ;
run;
2 |
VAR1 | VAR2 | VAR3 | VAR4 | VAR5 | VAR6 | VAR7 | VAR8 | VAR9 | VAR10 | VAR11 | VAR12 | VAR13 | VAR14 | VAR15 | VAR16 | VAR17 | VAR18 | VAR19 | VAR20 | VAR21 | VAR22 | VAR23 | VAR24 | VAR25 | VAR26 | VAR27 | VAR28 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
Excellent!! Accepted this reply most gratefully.
Now I look at a couple thousand records, and I see someone(not me, of course!) has entered a number other than "01"
somewhere in the string. We are only counting each non -"00" entry as a 1 for this application.
Is there a similar process I could run before processing that will loop through every 2 characters and, if not "00", change to "01".
Looking through the data, 99% of the values are indeed "01", but I see some "05","09", even some "12" & "14" buried in there.
Thanks so much.
@Jumboshrimps wrote:
Excellent!! Accepted this reply most gratefully.
Now I look at a couple thousand records, and I see someone(not me, of course!) has entered a number other than "01"
somewhere in the string. We are only counting each non -"00" entry as a 1 for this application.
Is there a similar process I could run before processing that will loop through every 2 characters and, if not "00", change to "01".
Looking through the data, 99% of the values are indeed "01", but I see some "05","09", even some "12" & "14" buried in there.
Thanks so much.
Try
VAR[I/2]+ ( input(char(SOURCEROW,I),1.) > 0);
SAS returns 1 for a true comparison and 0 for false. So as long as that "other" character is actually a digit this should work. If you are getting things like A, B, C then the whole approach needs some modification.
You are a genius. I checked each variable row after row, and no row increments more than one from the previous row.
😁
@Jumboshrimps wrote:
You are a genius. I checked each variable row after row, and no row increments more than one from the previous row.
😁
Not really, just the recipient of revealed wisdom from a prior boss when I was learning SAS back in 1987. Computers were slower and this sort of comparison to numeric ran faster than IF/Then/Else. I had asked why we had some code that looked something like:
newvar = (a=10)*(a*b*c) + (a=5)*(a*a*b) + (a=1)*(a*b*b);
Which is one way to pick a calculation instead of If a=10 then ... ; else if a=5 then ... ; else if a=1 then ... ;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.