stataq
Quartz | Level 8

Hello,

I tried to clean my data using the following code. As you can see, my data has only 6 columns, but for some reason it changes to 7 columns after I run the code. Could anyone tell me why?

[Screenshot attachment: stataq_0-1698154183249.png]

Is there a way to inspect my array _test in detail? If so, how?

The extra column I got has the name i and the value 7. 🤣

Thanks.

Accepted Solutions
ballardw
Super User

@stataq wrote:

Isn't i only used in looping? Why do I need to have it in my data?


No, "i" is not only used in looping. Almost any variable name, other than reserved words, can be used as a loop index. The convention of using i, j, k as "loop" variables comes from old coding practice. FORTRAN, which has limited variable types, by default treats those names as integers; otherwise you had to declare a variable as an integer with extra code before using it where integers were required, such as loop index values. It is also common in mathematics to use i, j, k as integer index values, and early coders usually came from the math world.

Hint: Proc Contents will tell you the names of all the variables in your data set. If you had run it, it would have shown that the variable I had been added.
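
As a rough sketch of that check (assuming SASHELP.CLASS as a stand-in for the poster's data, and reusing the array name _test from the question), running PROC CONTENTS after the loop should list i alongside the original variables:

```sas
/* Hypothetical stand-in for the poster's TEST data set */
data test;
  set sashelp.class;
  array _test {*} _character_;
  do i = 1 to dim(_test);
    _test{i} = compress(_test{i});   /* strip blanks from each character value */
  end;
run;

/* List the variables in creation order; i shows up last */
proc contents data=test varnum;
run;
```

The VARNUM option only changes the listing order, so the added variable is easy to spot at the end.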

 

By default ANY variable that you reference anywhere in your code will end up in the output data set unless you explicitly DROP it (or provide a KEEP list).
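
A minimal sketch of both options, again assuming SASHELP.CLASS as stand-in data (the data set names want_drop and want_keep are made up for illustration):

```sas
/* Option 1: drop the loop index explicitly */
data want_drop;
  set sashelp.class;
  array chars {*} _character_;
  do i = 1 to dim(chars);
    chars{i} = compress(chars{i});
  end;
  drop i;   /* leave the loop index out of WANT_DROP */
run;

/* Option 2: list the variables to keep; anything else, including i, is left out */
data want_keep;
  set sashelp.class;
  array chars {*} _character_;
  do i = 1 to dim(chars);
    chars{i} = compress(chars{i});
  end;
  keep Name Sex Age Height Weight;
run;
```

DROP is usually less typing when only one or two helper variables were created; KEEP is safer when you want to guarantee the exact output layout.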

 

A very common source of "added variables" is misspelling. If your data set has a variable named Group and you misspell the name in a statement, such as:

If grop=123 then <do something> ;

you have "added" a variable named "grop" to the data set. Typically this error results in a "Variable grop is uninitialized" note in the log, because the variable is never assigned a value.
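
To see this in action, here is a small sketch using SASHELP.CLASS (an assumption, not the poster's data). The misspelled name agee and the new variable teen are made up for illustration:

```sas
data check;
  set sashelp.class;
  /* Typo: the real variable is Age, so SAS creates a new numeric variable AGEE */
  if agee > 13 then teen = 1;
run;

/* CHECK now contains the five original variables plus AGEE and TEEN */
proc contents data=check varnum;
run;
```

Because agee is always missing, the condition is never true, so teen is missing on every row as well. Searching the log for "uninitialized" is a quick way to catch this kind of typo.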


12 REPLIES
PaigeMiller
Diamond | Level 26

@stataq wrote:

Hello,

I tried to clean my data using the following code. As you can see, my data has only 6 columns, but for some reason it changes to 7 columns after I run the code. Could anyone tell me why?

[Screenshot attachment: stataq_0-1698154183249.png]

Is there a way to inspect my array _test in detail? If so, how?


 

The very simple solution to the question about why there are 7 columns is for you to LOOK AT data set TEST with your own eyes, and you should see what has happened.

 

Then you say "Is it a way to check my array _test for details? and how?", but I don't really understand what this means, what "details" you are referring to, or what you need to "check".


Suggestion: do not use code like this, which overwrites your original data set:

data test;
    set test;
    /* ... your cleaning statements ... */
run;

Instead, use code like this, which does not overwrite your original data set:

data test1;
    set test;
    /* ... your cleaning statements ... */
run;

--
Paige Miller
stataq
Quartz | Level 8

Thanks so much for the explanation. I want to make sure my array was set up correctly. Basically, I want to remove any spaces from my outputs. I tried to clean my data in a loop. I checked dim(_test), which is 6, but for some reason a 7th column with name i and value 7 is added to my data.

Is there a way to post my data as example data? I wonder whether the problem is in my data, but I don't know how to output it as example data.

PaigeMiller
Diamond | Level 26

@stataq wrote:

Thanks so much for the explanation. I want to make sure my array was set up correctly. Basically, I want to remove any spaces from my outputs. I tried to clean my data in a loop.


Your code seems to do this correctly. But you should check; you shouldn't have to ask us whether it is doing it properly.

I checked dim(_test), which is 6, but for some reason a 7th column with name i and value 7 is added to my data.

Whenever you create a variable in a DATA step, the variable is added to the SAS data set that gets created. You can of course drop this variable if you want.

 

Is there a way to post my data as example data? I wonder whether the problem is in my data, but I don't know how to output it as example data.

I'm not sure I understand what you want to do here. 

Quentin
Super User

One way to check your data, and the logic of your DATA step, is to add PUT statements to write messages to the log.  

 

Here's a step like yours, with PUT statements added:

data shoes ;
  set sashelp.shoes (obs=3);
  array chars {*} _character_ ;
  put "before loop " _n_= (_character_)(=) /;
  do i=1 to dim(chars) ;
    put "before compress " i= chars{i}= ;
    chars{i}=compress(chars{i}) ;
    put "after compress "  i= chars{i}= /;
  end ;
  put "after loop " _n_= i= (_character_)(=) ///;
  drop i ;
run ;

The log is:

1    data shoes ;
2      set sashelp.shoes (obs=3);
3      array chars {*} _character_ ;
4      put "before loop " _n_= (_character_)(=) /;
5      do i=1 to dim(chars) ;
6        put "before compress " i= chars{i}= ;
7        chars{i}=compress(chars{i}) ;
8        put "after compress "  i= chars{i}= /;
9      end ;
10     put "after loop " _n_= i= (_character_)(=) ///;
11     drop i ;
12   run ;

before loop _N_=1 Region=Africa Product=Boot Subsidiary=Addis Ababa

before compress i=1 Region=Africa
after compress i=1 Region=Africa

before compress i=2 Product=Boot
after compress i=2 Product=Boot

before compress i=3 Subsidiary=Addis Ababa
after compress i=3 Subsidiary=AddisAbaba

after loop _N_=1 i=4 Region=Africa Product=Boot Subsidiary=AddisAbaba



before loop _N_=2 Region=Africa Product=Men's Casual Subsidiary=Addis Ababa

before compress i=1 Region=Africa
after compress i=1 Region=Africa

before compress i=2 Product=Men's Casual
after compress i=2 Product=Men'sCasual

before compress i=3 Subsidiary=Addis Ababa
after compress i=3 Subsidiary=AddisAbaba

after loop _N_=2 i=4 Region=Africa Product=Men'sCasual Subsidiary=AddisAbaba



before loop _N_=3 Region=Africa Product=Men's Dress Subsidiary=Addis Ababa

before compress i=1 Region=Africa
after compress i=1 Region=Africa

before compress i=2 Product=Men's Dress
after compress i=2 Product=Men'sDress

before compress i=3 Subsidiary=Addis Ababa
after compress i=3 Subsidiary=AddisAbaba

after loop _N_=3 i=4 Region=Africa Product=Men'sDress Subsidiary=AddisAbaba
NOTE: There were 3 observations read from the data set SASHELP.SHOES.
NOTE: The data set WORK.SHOES has 3 observations and 7 variables.

There are three character variables in sashelp.shoes, so dim(chars) = 3.  Note that the DO loop iterates 3 times for each record.  The final value of the iterator variable i is 4, because during the third iteration i=3, then at the bottom of the loop i is incremented by 1, then the loop does not iterate again.
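
The same arithmetic explains the value 7 in the original question, where dim(_test) is 6. A minimal sketch that isolates just the loop (no input data needed):

```sas
data _null_;
  do i = 1 to 6;
    /* the body would run for i = 1, 2, ..., 6 */
  end;
  /* i was incremented once more before the stop test failed */
  put "after loop " i=;   /* writes: after loop i=7 */
run;
```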

stataq
Quartz | Level 8

this is very helpful. Thanks so much.👍

 

Tom
Super User

Apparently your existing dataset did not already have a variable named I.  So your DO loop added it.

stataq
Quartz | Level 8

Isn't i only used in looping? Why do I need to have it in my data?

stataq
Quartz | Level 8

I can manually drop the 'i' variable. Is there a way to prevent this from happening? I applied the same code to another similar data set and it did not add 'i'. Is this 'i' randomly added to the data set? It should not be, but I could not figure out why this happens.

PaigeMiller
Diamond | Level 26

It's not random. Any time you create a variable, such as the variable named I in the DO loop, it is added to the data set. If it was not added in a different example, that is because the variable I already existed in that other data set in the other example.
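
A small sketch of that second situation, assuming SASHELP.CLASS as stand-in data with a made-up pre-existing variable i:

```sas
/* Hypothetical input that already carries a variable named i */
data have;
  set sashelp.class;
  i = 99;
run;

/* The loop reuses the existing i, so no new variable is added */
data want;
  set have;
  array chars {*} _character_;
  do i = 1 to dim(chars);
    chars{i} = compress(chars{i});
  end;
run;
```

HAVE and WANT end up with the same number of variables, but note the side effect: the original values of i are overwritten by the loop. With the two character variables in SASHELP.CLASS, every row of WANT should have i=3 instead of 99.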

Tom
Super User

@stataq wrote:

I can manually drop the 'i' variable. Is there a way to prevent this from happening? I applied the same code to another similar data set and it did not add 'i'. Is this 'i' randomly added to the data set? It should not be, but I could not figure out why this happens.


There is a way to loop over an array without having to specify the index variable: the DO OVER statement. Let's do an example using SASHELP.CLASS and take out all of the M characters (since none of those values have embedded spaces to be removed).

data want;
  set sashelp.class;
  array charvar _character_;
  do over charvar;
    charvar=compress(charvar,'M');
  end;
run;

NOTE: The data set WORK.WANT has 19 observations and 5 variables.

[Screenshot attachment: Tom_0-1698159724347.png]

 

Note it still does create a variable. In this case it creates a variable named _I_, but it also marks the variable to be dropped. That can cause the opposite of the problem you had: writing out fewer variables than were read in, if the index variable used for the implicit array reference was one that already existed. You can fix that with a KEEP statement.

 

That was how ARRAYs originally worked. SAS has decided to no longer document that syntax.  But as of now it still works and is a convenient way to process an array like yours where there is no meaning to the index value. 

PaigeMiller
Diamond | Level 26

@Tom wrote:

That was how ARRAYs originally worked. SAS has decided to no longer document that syntax.  But as of now it still works and is a convenient way to process an array like yours where there is no meaning to the index value. 


But DO OVER may not work in future releases of SAS, maybe even the next release, we don't know. So any program that isn't a "one time only" program that uses DO OVER could run into problems in the future. And you can't even say that "as of now it still works" as we don't know all the possible ways that DO OVER may be used and so right now maybe it works for 99% and fails on the 1%.  So in my opinion, using DO OVER is not a good habit to get into, especially since the only benefit of using DO OVER is that it eliminates the need to write drop i; in the data step.

