Hello,
I tried to clean my data using the following code. As you can see, my data has only 6 columns, but for some reason it changes to 7 columns after I run it. Could anyone tell me why?
Is there a way to check my array _test for details? And how?
The extra column I got has colname=i, value=7. 🤣
Thanks.
Accepted Solutions
@stataq wrote:
Isn't i only used in looping? Why do I need to have it in my data?
No, "i" is not only used in looping. Almost any variable name, other than reserved words, can be used as the name of a loop index variable. The convention of using i, j, and k as "loop" variables comes from early coding practice. FORTRAN, which has limited variable types, by default treats variables whose names begin with the letters I through N as integers. Otherwise you had to declare variables as integers with extra code before using them where integers were required, such as loop index values. It is also common in mathematics to use i, j, and k as integer index values, and early coders usually came from the math world.
Hint: PROC CONTENTS will tell you the names of all the variables in your data set. If you ran it, it would show that the variable I had been added.
By default ANY variable that you reference anywhere in your code will end up in the output data set unless you explicitly DROP it (or provide a KEEP list).
A very common source of "added variables" is misspelling. If your data set has a variable named Group and you misspell the name in a statement, such as:
If grop=123 then <do something> ;
you have "added" a variable named "grop" to the data set. Typically this error results in a "Variable grop is uninitialized" note in the log, because no value is ever assigned to it.
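The DROP behavior described above can be sketched as follows. This is a minimal illustration, not the original poster's code: it assumes the SASHELP.CLASS sample data set that ships with SAS, and the output names WITH_I and WITHOUT_I are made up for the example.

```sas
/* The loop index i is an ordinary DATA step variable, so it is written out. */
data with_i;
  set sashelp.class;
  array chars {*} _character_;
  do i = 1 to dim(chars);
    chars{i} = compress(chars{i});
  end;
run;

/* Same step, but i is explicitly dropped from the output data set. */
data without_i;
  set sashelp.class;
  array chars {*} _character_;
  do i = 1 to dim(chars);
    chars{i} = compress(chars{i});
  end;
  drop i;
run;

/* List the variables in each data set to compare them. */
proc contents data=with_i;
run;

proc contents data=without_i;
run;
```

Since SASHELP.CLASS has five variables and none is named I, WITH_I should list six variables and WITHOUT_I five. Writing the output as data without_i (drop=i); is an equivalent way to leave the index out.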
@stataq wrote:
Hello,
I tried to clean my data using the following code. As you can see, my data has only 6 columns, but for some reason it changes to 7 columns after I run it. Could anyone tell me why?
Is there a way to check my array _test for details? And how?
The very simple solution to the question about why there are 7 columns is for you to LOOK AT data set TEST with your own eyes, and you should see what has happened.
Then you say "Is there a way to check my array _test for details? And how?", but I don't really understand what this means, what "details" you are referring to, or what you need to "check".
Suggestion: do not use code like this, which overwrites your original data set:
data test;
  set test;
Instead, use code like this, which does not overwrite your original data set:
data test1;
  set test;
Paige Miller
Thanks so much for the explanation. I want to make sure my array was set up correctly. Basically, I want to remove any spaces from my outputs, so I tried to clean my data in a loop. I checked dim(_test), which is 6, but for some reason it adds a 7th column to my data with name i and value 7.
Is there a way to output my data as example data? I wonder whether the problem is in my data, but I don't know how to output it as example data.
@stataq wrote:
Thanks so much for the explanation. I want to make sure my array was set up correctly. Basically, I want to remove any spaces from my outputs, so I tried to clean my data in a loop.
Your code seems to do this correctly. But you should check; you shouldn't have to ask us whether it is doing it properly.
I checked dim(_test), which is 6, but for some reason it adds a 7th column to my data with name i and value 7.
Whenever you create a variable in a DATA step, the variable is added to the SAS data set that gets created. You can of course drop this variable if you want.
Is there a way to output my data as example data? I wonder whether the problem is in my data, but I don't know how to output it as example data.
I'm not sure I understand what you want to do here.
Paige Miller
One way to check your data, and the logic of your DATA step, is to add PUT statements to write messages to the log.
Here's a step like yours, with PUT statements added:
data shoes ;
  set sashelp.shoes (obs=3);
  array chars {*} _character_ ;
  put "before loop " _n_= (_character_)(=) /;
  do i=1 to dim(chars) ;
    put "before compress " i= chars{i}= ;
    chars{i}=compress(chars{i}) ;
    put "after compress " i= chars{i}= /;
  end ;
  put "after loop " _n_= i= (_character_)(=) ///;
  drop i ;
run ;
The log is:
1    data shoes ;
2      set sashelp.shoes (obs=3);
3      array chars {*} _character_ ;
4      put "before loop " _n_= (_character_)(=) /;
5      do i=1 to dim(chars) ;
6        put "before compress " i= chars{i}= ;
7        chars{i}=compress(chars{i}) ;
8        put "after compress " i= chars{i}= /;
9      end ;
10     put "after loop " _n_= i= (_character_)(=) ///;
11     drop i ;
12   run ;

before loop _N_=1 Region=Africa Product=Boot Subsidiary=Addis Ababa

before compress i=1 Region=Africa
after compress i=1 Region=Africa

before compress i=2 Product=Boot
after compress i=2 Product=Boot

before compress i=3 Subsidiary=Addis Ababa
after compress i=3 Subsidiary=AddisAbaba

after loop _N_=1 i=4 Region=Africa Product=Boot Subsidiary=AddisAbaba

before loop _N_=2 Region=Africa Product=Men's Casual Subsidiary=Addis Ababa

before compress i=1 Region=Africa
after compress i=1 Region=Africa

before compress i=2 Product=Men's Casual
after compress i=2 Product=Men'sCasual

before compress i=3 Subsidiary=Addis Ababa
after compress i=3 Subsidiary=AddisAbaba

after loop _N_=2 i=4 Region=Africa Product=Men'sCasual Subsidiary=AddisAbaba

before loop _N_=3 Region=Africa Product=Men's Dress Subsidiary=Addis Ababa

before compress i=1 Region=Africa
after compress i=1 Region=Africa

before compress i=2 Product=Men's Dress
after compress i=2 Product=Men'sDress

before compress i=3 Subsidiary=Addis Ababa
after compress i=3 Subsidiary=AddisAbaba

after loop _N_=3 i=4 Region=Africa Product=Men'sDress Subsidiary=AddisAbaba

NOTE: There were 3 observations read from the data set SASHELP.SHOES.
NOTE: The data set WORK.SHOES has 3 observations and 7 variables.
There are three character variables in sashelp.shoes, so dim(chars) = 3. Note that the DO loop iterates 3 times for each record. The final value of the iterator variable i is 4: during the third iteration i=3, then at the bottom of the loop i is incremented to 4, which is greater than dim(chars), so the loop does not iterate again. The same logic explains why an array with 6 elements leaves i=7 in the output data set.
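The end-of-loop value can be checked in isolation with a small sketch. The upper bound of 6 is assumed here only to mirror the original poster's 6-element array; DATA _NULL_ is used so that no data set is created.

```sas
data _null_;
  /* Empty loop body: only the index variable changes. */
  do i = 1 to 6;
  end;
  /* i was incremented past the upper bound before the loop stopped. */
  put "after loop " i=;   /* writes: after loop i=7 */
run;
```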
This is very helpful. Thanks so much. 👍
Apparently your existing dataset did not already have a variable named I. So your DO loop added it.
Isn't i only used in looping? Why do I need to have it in my data?
I can manually drop the 'i' variable. Is there a way to prevent this from happening? I applied the same code to another similar data set and it did not add 'i'. Is this 'i' randomly added to the data set? It should not be, but I could not figure out why this happens.
It's not random. Any time you create a variable, such as the variable named I in the DO loop, it is added to the data set. If it was not added in a different example, that is because the variable I already existed in that other data set.
Paige Miller
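A minimal sketch of the "I already existed" case, assuming the SASHELP.CLASS sample data set. The data set names HAS_I and CLEANED, and the starting value i = 0, are hypothetical and chosen only for illustration.

```sas
/* Hypothetical input that already contains a variable named i. */
data has_i;
  set sashelp.class;
  i = 0;
run;

/* The loop reuses the existing i, so no new variable is added. */
data cleaned;
  set has_i;
  array chars {*} _character_;
  do i = 1 to dim(chars);
    chars{i} = compress(chars{i});
  end;
run;
```

Here HAS_I and CLEANED should report the same number of variables in the log, but the original values of i are overwritten by the loop, so reusing an existing variable name as a loop index is easy to miss and can silently damage data.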
@stataq wrote:
I can manually drop the 'i' variable. Is there a way to prevent this from happening? I applied the same code to another similar data set and it did not add 'i'. Is this 'i' randomly added to the data set? It should not be, but I could not figure out why this happens.
There is a way to loop over an array without having to specify the index variable: the DO OVER statement. Let's do an example using SASHELP.CLASS and take out all of the M characters (since none of those values have embedded spaces to be removed).
data want;
  set sashelp.class;
  array charvar _character_;
  do over charvar;
    charvar=compress(charvar,'M');
  end;
run;
NOTE: The data set WORK.WANT has 19 observations and 5 variables
Note that it still creates a variable, in this case one named _I_, but it also marks that variable to be dropped. That can cause the opposite of the problem you had, writing out fewer variables than were read in, if the index variable used for the implicit array reference is one that already existed. You can fix that with a KEEP statement.
That was how ARRAYs originally worked. SAS has decided to no longer document that syntax. But as of now it still works and is a convenient way to process an array like yours where there is no meaning to the index value.
@Tom wrote:
That was how ARRAYs originally worked. SAS has decided to no longer document that syntax. But as of now it still works and is a convenient way to process an array like yours where there is no meaning to the index value.
But DO OVER may not work in future releases of SAS, maybe even the next release; we don't know. So any program that uses DO OVER and isn't a "one time only" program could run into problems in the future. And you can't even say that "as of now it still works", because we don't know all the possible ways that DO OVER may be used, so right now it may work in 99% of cases and fail in the other 1%. So in my opinion, using DO OVER is not a good habit to get into, especially since its only benefit is that it eliminates the need to write drop i; in the DATA step.
Paige Miller