BookmarkSubscribeRSS Feed
acordes
Rhodochrosite | Level 12

@IanWakeling ... continuing with the subscript theme:

 

I need to create a idx matrix dynamically.

 

X2=X[(1:5)||(20:30),];

 

In the above matrices, the last one determines how many elements get selected. This gets applied to the first matrix where row1 and row2 are the start end index. 

Here I want to select:

X2=X[(1:18) || (49:(49+59) || ...];

 

 

SUBSCRIPTS ADVANCED.png

4 REPLIES 4
acordes
Rhodochrosite | Level 12

I can solve it with a do loop, but it there a more elegant way to achieve this:

 

H3 is the first matrix and H5 the last. 

 

H6=DO(H3[1,1],H5[1],1);


DO I=2 TO NCOL(H5);
H7=DO(H3[1,I],H3[1,I] + H5[I] -1,1);
H6=H6 || H7;
END;

H6=H6`;

PRINT H6;
IanWakeling
Barite | Level 11

I can't think of any elegant way to avoid the loop.  If H3 and H5 have a lot of columns, then it is generally not a good idea to 'grow' a matrix by concatenation within a loop, since it involves creating a new matrix for each iteration.  The following avoid this issue by 1st declaring H6 to be large enough to hold everything.

 

H6 = j(sum(H5), 1);
e = 1;
do i = 1 to ncol(H3);
  r = 0:(H5[i] - 1);
  H6[ r + e ] = r + H3[1, i];
  e = e + H5[i];
end;
Rick_SAS
SAS Super FREQ

I don't understand the question. Can you provide a program that shows what you have and what you want?

acordes
Rhodochrosite | Level 12

What I asked for is a small piece of the following problem I am encountering: 

During the task of finding the best regression model, there's always a discussion going on what should be the scope of the data. I am NOT talking here about splitting the universe into training and validation data sets. 

Here I allude to the fact that the data owner could recommend taking out certain observations before running the regression. 

So I apply the corresponding filters and end up having the data set that feeds into the regression model. 

This time I didn't save neither the final data set nor the filter. I only have the universe data set and the regression model estimates. 

The task is to shuffle from the universe in such a manner that I approximate the estimates. 

 

I've rebuilt my approach with the SASHELP.CARS data set. 

 

My credit goes to @Ksharp for his many fantastic contributions in the regard of the Genetic Algorithms, especially the https://communities.sas.com/t5/SAS-IML-File-Exchange/Genetic-Algorithm-For-Integer-Programming-Footb...

 

And to @Rick_SAS for his Sweep operator trick. 

 

Frankly spoken I do not really control the crossover and mutuation options in the GA block. But I am happy with the solution I found.

 

My code goes like this:

  1. I departure from SASHELP.CARS
  2. I run a regression and make an output with the residuals
  3. I sort the output by the residuals
  4. I create ranks according to the residuals
  5. I subset the data set by playing with the residuals. (This mimics the data owners' perspective to leave some observations out due to special conditions under which they could have been produced).
  6. I run the regression whose results and estimates I delivered (and whose associated data set got lost and has to be reconstructed)
  7. I save the estimates and write them later as objective values to the PROC IML where the GA evolves
  8. I go back to the complete universe data set and run a proc transpose to create a dummy coding for the class variables in the regression
  9. I load this dummy matrix into IML
  10. I build the GA making use of the sweep operator and returning the deviation of the estimates from the objective values
  11. As I do the problem encoding as fixed-length real, I want to translate a value of 0.5 (i.e.) from the solution vector to an IDX 1:50 if the class has 100 members
  12. To improve the algorithm I use a "2." solution vector to move the 50 observations within the range. So perhaps in this case I select observations 20:70 from the Type "Truck" into the training data set. 
  13. As I had sorted the data set by its class variable and residual, 1:50 for the Type "Truck" from 100 available would mean that I choose the best 50 observations in terms of  residuals. The sorting adds logic to the GA and allows for better controlling the slope.

     


PROC GLM DATA=SASHELP.CARS PLOTS=ALL;
WHERE (UPCASE(Type) NE "HYBRID") * (Cylinders NE .);
CLASS TYPE  ;
MODEL MPG_Highway =  Cylinders Horsepower TYPE  / SOLUTION;
OUTPUT OUT=PRED_CARS PREDICTED=PRED RESIDUAL=RES;
RUN;

PROC SORT DATA=PRED_CARS;
BY Type DESCENDING RES;
RUN;

PROC RANK DATA=WORK.PRED_CARS GROUPS=4 OUT=PRED_CARS1;
BY Type;
VAR RES;
RANKS R_RES;
RUN;

PROC FREQ DATA=WORK.PRED_CARS1;
TABLE TYPE * R_RES;
RUN;

DATA PRED_CARS2;
SET WORK.PRED_CARS1;
WHERE  ((UPCASE(TYPE)="TRUCK")*(R_RES NE 0) + 
(UPCASE(TYPE)="SPORTS")*(R_RES NE 3) + 
(UPCASE(TYPE)="SEDAN")*(R_RES LT 2) +
(UPCASE(TYPE) NOT IN ("TRUCK" "SPORTS" "SEDAN")) );
RUN;

PROC FREQ DATA=WORK.PRED_CARS2;
TABLE TYPE * R_RES;
RUN;

ODS OUTPUT PARAMETERESTIMATES=PARMS;
PROC GLM DATA=WORK.PRED_CARS2 PLOTS=ALL;
CLASS TYPE (REF="SUV") ;
MODEL MPG_Highway =  Cylinders Horsepower TYPE  / SOLUTION;
RUN;

PROC TRANSREG DATA=	SASHELP.CARS DESIGN;
WHERE (UPCASE(Type) NE "HYBRID") * (Cylinders NE .);
   MODEL IDENTITY(LENGTH) 
         CLASS(TYPE / ZERO=NONE );
		 ID Cylinders Horsepower MPG_Highway Type;
   OUTPUT OUT=UNIVERSE(DROP=LENGTH _:) ;
RUN;

DATA UNIVERSE;
FORMAT INTERCEPT 3.0 &_TRGIND;
SET WORK.UNIVERSE;
INTERCEPT=1;
RUN;

PROC SORT DATA=WORK.UNIVERSE;
BY Type;
RUN;

DATA WORK.UNIVERSE;
SET WORK.UNIVERSE;
BY TYPE;
IF FIRST.TYPE THEN I+1;
N=I;
DROP I;
RUN;

PROC SQL;
SELECT "'"||STRIP(NAME)||"'" INTO :VARS SEPARATED BY " "
FROM DICTIONARY.COLUMNS
WHERE LIBNAME="WORK" AND MEMNAME="UNIVERSE";
QUIT; 

%PUT &VARS;


PROC IML;
USE WORK.UNIVERSE NOBS NOBS;
READ ALL VAR {N};
READ ALL VAR _NUM_ INTO VALS
[C=varNUMERIC];
READ ALL VAR {"TYPE"};
CLOSE UNIVERSE;

/*the objective values*/
OBJ_B1=-1.1609;
OBJ_B2=-0.022394;
OBJ_B0=33.4; /*this is i.e. the objective value for the intercept*/
OBJ_B3=4.99;

TEST=VALS[,{'INTERCEPT'  'TypeSedan' 'TypeSports' 
'TypeTruck' 'TypeWagon' 'TypeSUV' 'Cylinders' 'Horsepower' 'MPG_Highway'}];

S=SWEEP(TEST`*TEST,1:NCOL(TEST)-1);

PRINT S;

START_END=LOC(N^=T({.}||REMOVE(N,NOBS))) //
              LOC(N^=T(REMOVE(N,1)||{.}));

H2=START_END[2,]-START_END[1,]+1;

H3=START_END // H2 // CUSUM(H2);

ST_END=H3;

PRINT ST_END;

IDXH=DO(1,NCOL(ST_END)*2,2);

IDX_X=1:NCOL(ST_END)*2;

RANGE=IDX_X[,1:NCOL(ST_END)];
POS=IDX_X[,NCOL(ST_END)+1:NCOL(ST_END)*2];



IDER=(J(1, NCOL(ST_END),0.3) // J(1, NCOL(ST_END),1)) || (J(1, NCOL(ST_END),0) // J(1, NCOL(ST_END),1));

start football(x) global(VALS, ST_END,OBJ_B1,OBJ_B2, OBJ_B0, OBJ_B3, RANGE, POS, 
MUESTRA, BETA1, BETA2, BETA0, BETA3, IDX );

RANGER=X[,RANGE];
POSR=X[,POS];

X2=CEIL(ST_END[3,]#RANGER);

IDX=DO(ST_END[1,1],X2[1],1) + FLOOR((ST_END[3,1]-X2[1])*POSR[1]);

DO I=2 TO NCOL(RANGE);
IDX1=DO(ST_END[1,I],ST_END[1,I] + X2[I] -1,1) + FLOOR((ST_END[3,I]-X2[I])*POSR[I]);
IDX=IDX || IDX1;
END;

IDX=IDX`;

MUESTRA=VALS[IDX,{'INTERCEPT'  'TypeSedan' 'TypeSports' 
'TypeTruck' 'TypeWagon' 'TypeSUV' 'Cylinders' 'Horsepower' 'MPG_Highway'}];

S=SWEEP(MUESTRA`*MUESTRA,1:NCOL(MUESTRA)-1);

BETA1=S[NROW(S)-2,NCOL(S)];
BETA2=S[NROW(S)-1,NCOL(S)];
BETA0=S[1,NCOL(S)];
BETA3=S[NROW(S)-4,NCOL(S)];


POINTS = ABS((BETA1-OBJ_B1)/OBJ_B1) + ABS((BETA2-OBJ_B2)/OBJ_B2) + ABS((BETA0-OBJ_B0)/OBJ_B0)
+ ABS((BETA3-OBJ_B3)/OBJ_B3);

if NROW(MUESTRA) < 250 | ABS(BETA0 - OBJ_B0) > 1  then points=1;

return (points);

finish;

ENCOD=NCOL(ST_END)*2;

id=gasetup(1,ENCOD,123);


call gasetobj(id,0,"football");

call gasetcro(id,1.0,2);

call gasetmut(id,0.2,2,1);

call gasetsel(id,100,1,0.95);

call gainit(id,1000,IDER);

print id;

niter = 40;

summary = j(niter,3);

mattrib summary [c = {"Min Points", "A?", "B?"} l=""];

do i = 1 to niter;

call garegen(id);

call gagetval(value, id);

summary[i,1] = value[1];

summary[i,2] = value[2];

summary[i,3] = value[3];

end;

call gagetmem(mem, value, id, 1);

PRINT MEM BETA1 BETA2 BETA0 BETA3;


iteration = t(1:niter);

print iteration summary;

call gaend(id);

MUESTRA_CH=TYPE[IDX,];

CREATE MUESTRA_CH FROM MUESTRA_CH [COLNAME='TYPE'];
APPEND FROM MUESTRA_CH;
CLOSE;

CREATE MUESTRA FROM MUESTRA [COLNAME= {'INTERCEPT'  'TypeSedan' 'TypeSports' 
'TypeTruck' 'TypeWagon' 'TypeSUV' 'Cylinders' 'Horsepower' 'MPG_Highway'}];
APPEND FROM MUESTRA;
CLOSE;

quit;

DATA MUESTRA_ALL;
MERGE MUESTRA MUESTRA_CH;
RUN;

ODS OUTPUT PARAMETERESTIMATES=PARMS;
PROC GLM DATA=WORK.MUESTRA PLOTS=ALL;
MODEL MPG_Highway =  Cylinders Horsepower &_TRGIND / SOLUTION;
RUN;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

Multiple Linear Regression in SAS

Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.

Find more tutorials on the SAS Users YouTube channel.

From The DO Loop
Want more? Visit our blog for more articles like these.
Discussion stats
  • 4 replies
  • 732 views
  • 2 likes
  • 3 in conversation