some datasteps problem

Accepted Solution Solved
Reply
Contributor
Posts: 64
Accepted Solution

some datasteps problem

1.        How can I achieve the output provided by the below mentioned coded using DATASTEPS.

proc print data = auss;

by gender;

sumby Gender;

run;

2.          How can we achieve 2nd hightest value using a DATASTEP. I do use PROC SQL to achieve same result.

DATA Emp;

input EmpID Salary;

cards;

1    100

2    200

3    500

4    400

5    400

;

run;

proc sql;

select a.Salary

from anand.aus a

where 2= (select count(distinct (Salary)) from anand.aus b where a.salary<=b.salary);

quit;

3.          Can anybody explain seeds of RANUNI. The following codes gives the 80% of sample data but I unable to understand the seeds (1234567).

I do use the following code but scared of the functioning of seeds (1234567).

data subset;

     set anand.blood;

     if ranuni(12345645) ge .2;

run;

Regards

Rahul


Accepted Solutions
Solution
‎12-19-2013 12:44 AM
Super Contributor
Posts: 276

Re: some datasteps problem

Posted in reply to Rahul_SAS

Hi Rahul,

For Question 1.

Not sure what you are trying to ask.Proc Print produre listing output ,how come this possible to get in Datastep ??

Or you want to add total sum of all the numaric variables at the end of the row in datastep ??.If that is your question,

Here you go,

Data Sal;

Input Id $ Sal;

cards;

100 2

101 4

102 6

103 10

;

run;

Data Sum(drop=sal_sum);

set sal end=a;output;

retain sal_sum;

If _n_=1 then sal_sum=sal;

else sal_sum=sal_sum+sal;

if a then do;

sal=sal_sum;id='ID' ;output;end;

run;

For Question 2 :

Yes, we can get n th highest salary in datastep in many ways.

Here is the one of the  way,,

Data Sal;

Input Id $ Sal;

cards;

100 2

101 4

102 6

103 10

104 10

;

run;

Proc Sort data=sal nodupkey;

by descending sal ;

Run;

Data Sal1;

set sal;

if _n_=2 ;

run;

For Question 3 :

Rechard already given a detailed explanation on this .

Thanks,
Sanjeev.K

View solution in original post


All Replies
Super Contributor
Posts: 644

Re: some datasteps problem

Posted in reply to Rahul_SAS

The first 2 questions look like course questions to me.

For Q1 I suggest you have a look at the ARRAY, RETAIN and PUT statements, the SUM() function and first. and last. processing.

For Q2 the best non SQL solution would be to sort descending and use _N_ to take the second observation.

You should know that random functions generated by any digital computer are not truly random; they are sometimes referred to as pseudo-random numbers.  These functions generate a stream of seemingly random numbers that over a wide range are indistinguishable from "noise" or a truly random sequence.  Eventuallly, however, after generating an enormously large sequence of these numbers the series repeat.  Think of shuffling a deck of cards to randomise them: there are a finite number (factorial 52) of different card sequences that can be generated, but eventually a shuffle would be repeated.

Random number generators, being based on digital (0 1) processes, would naturally always provide the same series of seemingly random but repeatable values, and that is usually undesirable. The purpose of the seed is to start the sequence at a different point from the beginning so that the sequence is unrecognisable.  Usually the method employed is to start the sequence at a different point each time, using a varying number to point to the first "random" number in the sequence.  This is the behaviour of RANUNI (0), which uses the value of datetime() to point to the starting number in the sequence.  After the first call to RANUNI () the following numbers come from the following poitions in the sequence and so every value appears to be unpredictable.

Sometimes you need predictable 'random' numbers, ie always to start at the same point in the sequence; typically during development or testing.  In this case you can provide an explicit seed - any positive number you like - which will point to the location of the starting value of the series you specify.

In the example given your 80% sample will always include the same rows in the output, however many times you repeat the data step.  Specifying a different seed would select a different 80% sample (largely overlapping the original selection, because 80% is most of the original set anyway).   Specifying RANUNI(0) will give you a diferent selection each time.

So why are you using a fixed seed?  Is it because someone has told you always to specify a seed, and has suggested 1234567 as a suitable number?

My retort would be that unless you know exactly why you need to specify a seed, and are prepared to use a different seed each time you specify, you should always use RANUNI(0).

Richard in NZ

Super Contributor
Super Contributor
Posts: 444

Re: some datasteps problem

Posted in reply to Rahul_SAS

or you can use this for Q2

DATA Emp;

input EmpID Salary;

cards;

1    100

2    200

3    500

4    400

5    400

;

run;

proc transpose data=emp out=emp(drop=_name_) prefix=Salary;

var salary;

run;

data want;

set emp;

Second_Largest=largest(2,of salary1-salary5);

proc print;run;

Solution
‎12-19-2013 12:44 AM
Super Contributor
Posts: 276

Re: some datasteps problem

Posted in reply to Rahul_SAS

Hi Rahul,

For Question 1.

Not sure what you are trying to ask.Proc Print produre listing output ,how come this possible to get in Datastep ??

Or you want to add total sum of all the numaric variables at the end of the row in datastep ??.If that is your question,

Here you go,

Data Sal;

Input Id $ Sal;

cards;

100 2

101 4

102 6

103 10

;

run;

Data Sum(drop=sal_sum);

set sal end=a;output;

retain sal_sum;

If _n_=1 then sal_sum=sal;

else sal_sum=sal_sum+sal;

if a then do;

sal=sal_sum;id='ID' ;output;end;

run;

For Question 2 :

Yes, we can get n th highest salary in datastep in many ways.

Here is the one of the  way,,

Data Sal;

Input Id $ Sal;

cards;

100 2

101 4

102 6

103 10

104 10

;

run;

Proc Sort data=sal nodupkey;

by descending sal ;

Run;

Data Sal1;

set sal;

if _n_=2 ;

run;

For Question 3 :

Rechard already given a detailed explanation on this .

Thanks,
Sanjeev.K

Contributor
Posts: 64

Re: some datasteps problem

Posted in reply to kuridisanjeev

Hi ALL,

Thank you so much for explaining things so well. It really helped me a lot and improved my understanding towards SAS.

       

Richard- Kindly clarify one more thing about Ranuni. when I use RANUNI(0), it gives unpredictable random numbur of observations but number of observations keep changing all the time so what if I have to get a certain number of observations (let say 70%)?????

Super Contributor
Posts: 644

Re: some datasteps problem

Posted in reply to Rahul_SAS

Rahul

Pseudo random sequences may not be truly random but for all practical purposes they are indistinguishable from random, and this means that the number selected using ranuni(0) may not exactly provide you with 80% every time.  In fact there is a finite chance that if you repeat the process often enough you might get a 100% sample, or a 50% sample.  But on average the sample size would be expected to approach 80% on average over a number of repeats.

An example of what I mean is where you toss a coin 20 times.  On average, you would expect 10 head and 10 tails.  But in any given trial you might end up with more or less than 10 of each.  You might be surprised to get 15:5 but it is entirely possible.

If you want to get an exact random sample of a particular size I suspect Proc Surveyselect is what you would use.  Here is another approach:

  • In your dataset create an extra column "random" and populate it with ranuni(0).
  • Note the number of rows in the dataset and calculate 80% of that number
  • Sort the data on "random"
  • in a following data step set obs= the number you calculated.

Voila.

A couple of other comments I have.

1.  Why are you trying for an 80% sample of your data?  Normally a sample would be a small fraction of the total, maybe larger if you are creating a training and a validation dataset for model building.

2.  In terms of clarity, it is much better to write code that can be readily followed.  Your original data step would be more transparent if it were written

data subset;

     set anand.blood;

     if ranuni(0) le .8;

run;

Richard in NZ

Super User
Posts: 11,343

Re: some datasteps problem

Posted in reply to Rahul_SAS

If you're selecting samples it may be worthwhile to examin Proc SurveySelect as you don't have to create an additional variable and options allow setting sampling rate directly. If you have subgroups (strata) that you want sampled at specified rates its an even better idea.

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 541 views
  • 6 likes
  • 5 in conversation