noetsi

I am very new to lasso, although I have read a lot of articles on it recently. I am mainly trying to determine whether I am coding lasso correctly, but also whether I am interpreting the results SAS generates correctly. First, a little background on why I am using lasso. The federal government chose 49 control variables for a major project I was assigned to. Initially I used all of them to be consistent with their usage. I am testing the impact of service delivery on income, controlling for these variables, with linear regression. To do this I first identify who was eligible for a service and then determine whether they received at least one service: 1 if they did, 0 if they did not. These service indicators are the predictors of income I really care about. The problem is that for some services the eligible population is quite small, as few as 43 cases, so I end up with more predictors than cases. That is why I am using lasso (I also think that using 49 predictors, few of which have known theoretical links to income, is absurd). The lasso code I run is included below. My first concern is that the results say:

Selection stopped because all candidate effects for entry are linearly dependent on effects in the model.

 

This occurs every time I run lasso. I am not sure whether it is an issue. Nothing I have found in the documentation addresses it. Second, the printout says:

The selected model, based on SBC, is the model at Step 8.

followed by a list of variables. Are these the variables the lasso selected? The overall model includes a set of education predictors (education has 8 levels, so there are about 7 dummies). The lasso chose, say, 5 of the 7 dummies. Because dummies are normally used as a complete set (everyone is in some level), I assume I should not omit the levels the lasso leaves out, but as I said, I am new to this. To choose the lasso model I am using a larger population that contains the subpopulations I end up analyzing. Is it valid to choose the variables that way and then use them with the smaller subpopulations? Those subpopulations are so tiny that I doubt lasso will run if I use only them. And since everyone in the large population is a special service customer, I see a link between the large population I used for the lasso and the subpopulations.
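As a way to see what the model at the chosen step contains, here is a minimal sketch that asks PROC GLMSELECT for step-by-step details and for the coefficient and criterion plots. It uses only a handful of the predictors from the full code and assumes they are numeric 0/1 variables, so no CLASS statement is needed; if they are stored as character values, the CLASS statement from the full code would have to stay. The effects listed for the selected model are the ones with nonzero lasso coefficients at the step where SBC is smallest.

```sas
/* Sketch only: a few of the original predictors, assumed numeric 0/1. */
ods graphics on;

proc glmselect data=work.setest plots=(coefficients criteria);
  /* details=steps prints the effect entered or removed at each step */
  model Qtr2_Wage = Veteran Female "Low-income"n "Employed at application"n
        / selection=lasso(choose=sbc stop=none) details=steps;
run;
```

The coefficient plot shows when each effect enters the model along the lasso path, and the criterion plot shows how SBC changes from step to step, which makes it easier to judge how clear-cut the choice of Step 8 is.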


 

proc sql;

Create table work.SEtest as
Select * from dora.incomerev
where plantype ='4';
quit;

proc glmselect data= work.setest;
CLASS 
"Age 25 to 44"n (ref ="0")
"Associate’s degree"n (ref ="0")
"Bachelor’s degree"n (ref ="0")
"Beyond a bachelor’s degree"n (ref ="0")
"High school diploma or equivalen"n (ref ="0")
/*"Individuals has a significant di"n (ref ="0")removed for SE analysis */
"Postsecondary education no degre"n (ref ="0")
"Race: Black"n (ref ="0")
"Race: More than one"n (ref ="0")
"Special education certicate/comp"n (ref ="0")
"Age 19 to 24"n (ref ="0")
"Age 45 to 54"n (ref ="0")
"Age 55 to 59"n (ref ="0")
"Age 60+"n (ref ="0")
'Age 16 to 18'n (ref ="0")
"Race: Asian"n (ref ="0")
"Race: Hawaiian/Pacific Islander"n (ref ="0")
"Race: White"n (ref ="0")
 "Foster care youth"n (ref ="0")
"Psychosocial and psychological d"n (ref ="0")
"Intellectual and learning disabi"n (ref ="0")
"Physical disability"n (ref ="0")
"Auditory and communicative disab"n (ref ="0")
Veteran (ref ="0")
"TANF recipient"n (ref ="0")
"Single parent"n (ref ="0")
/*"Received career services"n (ref ="0") */
/*"Received training services"n (ref ="0")*/
/*"Received other services"n (ref ="0")*/
"Received public support at appli"n (ref ="0")
"Employed at application"n (ref ="0")
"Homeless individual, runaway you"n (ref ="0")
"Low-income"n (ref ="0")
"Limited English-language profici"n (ref ="0")
"Migrant and seasonal farmworker"n (ref ="0")
"Long-term unemployed"n (ref ="0")
/* "Individuals is most significant"n (ref ="0")removed for SE analysis */
"Ethnicity-Hispanic Ethnicity"n (ref ="0")
"Ex-offender"n (ref ="0")
"Displaced homemaker"n (ref ="0")
Female (ref ="0")

	;
	MODEL Qtr2_Wage=	
"Age 25 to 44"n 
"Associate’s degree"n 
"Bachelor’s degree"n
"Beyond a bachelor’s degree"n
"High school diploma or equivalen"n 
/*"Individuals has a significant di"n */
"Postsecondary education no degre"n 
"Race: Black"n 
"Race: More than one"n
"Special education certicate/comp"n
"Age 19 to 24"n 
"Age 45 to 54"n 
"Age 55 to 59"n
"Age 60+"n
'Age 16 to 18'n 
"Race: Asian"n 
"Race: Hawaiian/Pacific Islander"n 
"Race: White"n
"Foster care youth"n
"Psychosocial and psychological d"n 
"Intellectual and learning disabi"n
"Physical disability"n 
"Auditory and communicative disab"n 
Veteran
"TANF recipient"n
"Single parent"n 
/*"Received career services"n
"Received training services"n 
"Received other services"n */
"Received public support at appli"n
"Employed at application"n
"Homeless individual, runaway you"n 
"Low-income"n 
"Limited English-language profici"n
"Migrant and seasonal farmworker"n 
"Long-term unemployed"n 
/*"Individuals is most significant"n */
"Ethnicity-Hispanic Ethnicity"n 
"Ex-offender"n 
"Displaced homemaker"n
Female 
"Construction Employment"n 
"Educational, or Health Care Rela"n 
"Financial Services Employment"n
"Information Services Employment"n
"Leisure, Hospitality, or Enterta"n
"Natural Resources Employment"n 
"Other Services Employment"n 
"Trade and Transportation Employm"n 
"Professional and Business Servic"n 
"Manufacturing Related Employment"n
"totalgovernment"n
 
/ selection=lasso(choose=sbc stop=none);

run;

 

1 REPLY
noetsi
I realize that something I said above is very confusing, but I could not figure out how to edit it. I have a large population made up of many subpopulations that get services. Each subpopulation is used in its own regression, one per service, and each regression includes only those members who are eligible for that service. I used the larger population to select the lasso variables, and I then use those variables in every subpopulation regression. I do not run a separate lasso for each subpopulation because I think there are too few cases for that to work. Is that a valid use of lasso selection?
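The two-stage workflow described here could be sketched roughly as follows. This only illustrates the mechanics and does not answer whether the approach is valid. It relies on the `_GLSIND` macro variable that PROC GLMSELECT creates to hold the selected effects. The eligibility flag `eligible_career` is a hypothetical name that does not appear in the original code, and the predictors shown are a small subset assumed to be numeric 0/1. It is worth checking in the log, with `%put`, that variables written as name literals come through in `&_GLSIND` in a form PROC REG accepts.

```sas
/* Stage 1: run the lasso on the larger population.                      */
/* Only a few of the original predictors are shown, assumed numeric 0/1. */
proc glmselect data=work.setest;
  model Qtr2_Wage = Veteran Female "Low-income"n "Employed at application"n
        / selection=lasso(choose=sbc stop=none);
run;

/* GLMSELECT stores the selected effects in the macro variable _GLSIND */
%put NOTE: Selected effects: &_GLSIND;

/* Stage 2: refit by ordinary least squares on one subpopulation,       */
/* adding the service indicator of interest to the selected controls.   */
proc reg data=work.setest;
  where eligible_career = 1;  /* hypothetical eligibility flag */
  model Qtr2_Wage = "Received career services"n &_GLSIND;
run;
quit;
```

One caveat with any refit of this kind: the standard errors and p-values from the second stage do not account for the fact that the controls were chosen by a data-driven selection step.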
