## some datasteps problem

Solved
Frequent Contributor
Posts: 79

# some datasteps problem

1.        How can I achieve the output provided by the below mentioned coded using DATASTEPS.

proc print data = auss;

by gender;

sumby Gender;

run;

2.          How can we achieve 2nd hightest value using a DATASTEP. I do use PROC SQL to achieve same result.

DATA Emp;

input EmpID Salary;

cards;

1    100

2    200

3    500

4    400

5    400

;

run;

proc sql;

select a.Salary

from anand.aus a

where 2= (select count(distinct (Salary)) from anand.aus b where a.salary<=b.salary);

quit;

3.          Can anybody explain seeds of RANUNI. The following codes gives the 80% of sample data but I unable to understand the seeds (1234567).

I do use the following code but scared of the functioning of seeds (1234567).

data subset;

set anand.blood;

if ranuni(12345645) ge .2;

run;

Regards

Rahul

Accepted Solutions
Solution
‎12-19-2013 12:44 AM
Super Contributor
Posts: 276

## Re: some datasteps problem

Hi Rahul,

For Question 1.

Not sure what you are trying to ask.Proc Print produre listing output ,how come this possible to get in Datastep ??

Or you want to add total sum of all the numaric variables at the end of the row in datastep ??.If that is your question,

Here you go,

Data Sal;

Input Id \$ Sal;

cards;

100 2

101 4

102 6

103 10

;

run;

Data Sum(drop=sal_sum);

set sal end=a;output;

retain sal_sum;

If _n_=1 then sal_sum=sal;

else sal_sum=sal_sum+sal;

if a then do;

sal=sal_sum;id='ID' ;output;end;

run;

For Question 2 :

Yes, we can get n th highest salary in datastep in many ways.

Here is the one of the  way,,

Data Sal;

Input Id \$ Sal;

cards;

100 2

101 4

102 6

103 10

104 10

;

run;

Proc Sort data=sal nodupkey;

by descending sal ;

Run;

Data Sal1;

set sal;

if _n_=2 ;

run;

For Question 3 :

Rechard already given a detailed explanation on this .

Thanks,
Sanjeev.K

All Replies
Super Contributor
Posts: 644

## Re: some datasteps problem

The first 2 questions look like course questions to me.

For Q1 I suggest you have a look at the ARRAY, RETAIN and PUT statements, the SUM() function and first. and last. processing.

For Q2 the best non SQL solution would be to sort descending and use _N_ to take the second observation.

You should know that random functions generated by any digital computer are not truly random; they are sometimes referred to as pseudo-random numbers.  These functions generate a stream of seemingly random numbers that over a wide range are indistinguishable from "noise" or a truly random sequence.  Eventuallly, however, after generating an enormously large sequence of these numbers the series repeat.  Think of shuffling a deck of cards to randomise them: there are a finite number (factorial 52) of different card sequences that can be generated, but eventually a shuffle would be repeated.

Random number generators, being based on digital (0 1) processes, would naturally always provide the same series of seemingly random but repeatable values, and that is usually undesirable. The purpose of the seed is to start the sequence at a different point from the beginning so that the sequence is unrecognisable.  Usually the method employed is to start the sequence at a different point each time, using a varying number to point to the first "random" number in the sequence.  This is the behaviour of RANUNI (0), which uses the value of datetime() to point to the starting number in the sequence.  After the first call to RANUNI () the following numbers come from the following poitions in the sequence and so every value appears to be unpredictable.

Sometimes you need predictable 'random' numbers, ie always to start at the same point in the sequence; typically during development or testing.  In this case you can provide an explicit seed - any positive number you like - which will point to the location of the starting value of the series you specify.

In the example given your 80% sample will always include the same rows in the output, however many times you repeat the data step.  Specifying a different seed would select a different 80% sample (largely overlapping the original selection, because 80% is most of the original set anyway).   Specifying RANUNI(0) will give you a diferent selection each time.

So why are you using a fixed seed?  Is it because someone has told you always to specify a seed, and has suggested 1234567 as a suitable number?

My retort would be that unless you know exactly why you need to specify a seed, and are prepared to use a different seed each time you specify, you should always use RANUNI(0).

Richard in NZ

Super Contributor
Posts: 464

## Re: some datasteps problem

or you can use this for Q2

DATA Emp;

input EmpID Salary;

cards;

1    100

2    200

3    500

4    400

5    400

;

run;

proc transpose data=emp out=emp(drop=_name_) prefix=Salary;

var salary;

run;

data want;

set emp;

Second_Largest=largest(2,of salary1-salary5);

proc print;run;

Solution
‎12-19-2013 12:44 AM
Super Contributor
Posts: 276

## Re: some datasteps problem

Hi Rahul,

For Question 1.

Not sure what you are trying to ask.Proc Print produre listing output ,how come this possible to get in Datastep ??

Or you want to add total sum of all the numaric variables at the end of the row in datastep ??.If that is your question,

Here you go,

Data Sal;

Input Id \$ Sal;

cards;

100 2

101 4

102 6

103 10

;

run;

Data Sum(drop=sal_sum);

set sal end=a;output;

retain sal_sum;

If _n_=1 then sal_sum=sal;

else sal_sum=sal_sum+sal;

if a then do;

sal=sal_sum;id='ID' ;output;end;

run;

For Question 2 :

Yes, we can get n th highest salary in datastep in many ways.

Here is the one of the  way,,

Data Sal;

Input Id \$ Sal;

cards;

100 2

101 4

102 6

103 10

104 10

;

run;

Proc Sort data=sal nodupkey;

by descending sal ;

Run;

Data Sal1;

set sal;

if _n_=2 ;

run;

For Question 3 :

Rechard already given a detailed explanation on this .

Thanks,
Sanjeev.K

Frequent Contributor
Posts: 79

## Re: some datasteps problem

Hi ALL,

Thank you so much for explaining things so well. It really helped me a lot and improved my understanding towards SAS.

Richard- Kindly clarify one more thing about Ranuni. when I use RANUNI(0), it gives unpredictable random numbur of observations but number of observations keep changing all the time so what if I have to get a certain number of observations (let say 70%)?????

Super Contributor
Posts: 644

## Re: some datasteps problem

Rahul

Pseudo random sequences may not be truly random but for all practical purposes they are indistinguishable from random, and this means that the number selected using ranuni(0) may not exactly provide you with 80% every time.  In fact there is a finite chance that if you repeat the process often enough you might get a 100% sample, or a 50% sample.  But on average the sample size would be expected to approach 80% on average over a number of repeats.

An example of what I mean is where you toss a coin 20 times.  On average, you would expect 10 head and 10 tails.  But in any given trial you might end up with more or less than 10 of each.  You might be surprised to get 15:5 but it is entirely possible.

If you want to get an exact random sample of a particular size I suspect Proc Surveyselect is what you would use.  Here is another approach:

• In your dataset create an extra column "random" and populate it with ranuni(0).
• Note the number of rows in the dataset and calculate 80% of that number
• Sort the data on "random"
• in a following data step set obs= the number you calculated.

Voila.

A couple of other comments I have.

1.  Why are you trying for an 80% sample of your data?  Normally a sample would be a small fraction of the total, maybe larger if you are creating a training and a validation dataset for model building.

2.  In terms of clarity, it is much better to write code that can be readily followed.  Your original data step would be more transparent if it were written

data subset;

set anand.blood;

if ranuni(0) le .8;

run;

Richard in NZ

Super User
Posts: 13,584