About triley

triley · ‎05-12-2020

Thanks. I am trying to see if there is anything besides basic imputation (I should have been clearer in the original question). I would almost like to treat the prediction separately depending on whether the state has the data or not (i.e., one model for states with the extra variables and another for states without them).
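A minimal sketch of that two-model idea, assuming the "have" table built from sashelp.cars in the original question (x1 is missing for every row where origin='Asia'). The split on missing(x1) and the choice of PROC REG are only illustrative; any other regression procedure could be swapped in.

```sas
/* Model A: groups where x1 is observed, using all predictors */
proc reg data=have(where=(not missing(x1)));
  model y = x1 x2 x3;
run;
quit;

/* Model B: groups where x1 is entirely missing, using only the
   predictors that are available everywhere */
proc reg data=have(where=(missing(x1)));
  model y = x2 x3;
run;
quit;
```

Each group is then scored with its own model. The cost is that the coefficients for x2 and x3 are estimated separately in each group, from fewer rows.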

triley · ‎05-12-2020

I was curious about the best ways to handle/model missing values when all values for certain states are missing. I am trying to predict purchase volume by customer, and I have customers in all 50 states. There are a few variables for which I only have data in about half of the states, but I also want to use other independent variables that are available in all states. My two initial thoughts were to:

1) include a "state_has_data" indicator and use that in the model, or
2) create a model that estimates the volume for the states that do have data, and use that prediction when there are missing values.

Are there better ways of handling this? I've included an example below of something similar to what I'm trying to accomplish, using the sashelp.cars data set. In the example, "x1" where origin='Asia' is equivalent to a state with missing values for the independent variable. I've also added a regression procedure at the end to help add context. I am still exploring other regression procedures as well (e.g. proc COUNTREG, proc GLM, etc.). Also, I only have SAS E.G., without Miner or other modeling 'add-ons'.

proc sql;
  create table have as
  select make
        ,Origin
        ,avg(MSRP) as x1
        ,avg(Horsepower) as x2
        ,avg(MPG_Highway) as x3
        ,count(1) as y
  from (select make
              ,Origin
              ,case when Origin = 'Asia' then . else MSRP end as MSRP
              ,Horsepower
              ,MPG_Highway
        from sashelp.cars
       )
  group by make
          ,Origin
  order by Origin
  ;
quit;

proc reg data=have;
  model y = x1-x3;
run;
quit;
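Option 1 could be sketched as follows, assuming the "have" table above. The names has_x1 and x1_filled are made up for illustration, and the zero fill is just a placeholder value: because has_x1 is in the model, it absorbs the level difference between the two groups, and the x1 slope only affects rows where x1 is present.

```sas
/* Flag rows that have x1 and give the others a placeholder value */
data have_flag;
  set have;
  has_x1    = not missing(x1);   /* 1 when x1 is available, 0 otherwise */
  x1_filled = coalesce(x1, 0);   /* 0 stands in for the missing x1 */
run;

proc reg data=have_flag;
  model y = has_x1 x1_filled x2 x3;
run;
quit;
```

Unlike fitting two separate models, this keeps a single set of coefficients for x2 and x3 across all groups.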

triley · ‎05-04-2020

Thank you for providing another solution. As I am not very familiar with hash tables, I was wondering whether it would be possible to modify this slightly, take it one step back, and use it to build the "have" table. In other words, if I had NAME1 and RECORDS1 in one table and NAME2 and RECORDS2 in a different table, could I use a similar approach to join the two and find the matches? I'm not sure whether you can use HASH to find "similar" names based on some function (e.g. COMPGED), but I think that is the step where this would be really helpful, since the SQL step I'm currently using takes over a day to produce that table. Also, if you are able to provide a solution, I can submit a new question on SAS Communities and "Accept" your solution there, since I already accepted one that worked for this question.

triley · ‎04-27-2020

I believe I found a solution using a combination of the replies above. Thank you all for your time. I used the "if records2 > records1 then do;" section to get the info in the order I needed, then used proc optnet to get the clusters, then joined back to my original data, sorted by the appropriate "records" amount, and took the name with the largest value.
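A rough sketch of that cluster-then-pick-the-largest approach, assuming the "have" table from the original question (NAME1, RECORDS1, NAME2, RECORDS2) and a SAS/OR license for PROC OPTNET. The intermediate table names are made up, and the OUT_NODES= column names (node, concomp) are as I understand them, so they should be checked against the PROC OPTNET documentation.

```sas
/* 1) Treat each matched pair as a link and label connected components */
proc optnet data_links=have out_nodes=nodes;
  data_links_var from=name1 to=name2;
  concomp;
run;

proc sql;
  /* 2) One row per name with its record count */
  create table name_records as
  select name1 as name, records1 as records from have
  union
  select name2 as name, records2 as records from have;

  /* 3) Attach the component id to each name */
  create table ranked as
  select n.concomp, r.name, r.records
  from nodes n
       inner join name_records r
       on n.node = r.name;

  /* 4) Pair the highest-count name in each component with every other name */
  create table want as
  select b.name as name1, o.name as name2
  from ranked b
       inner join ranked o
       on b.concomp = o.concomp and b.name ne o.name
  where b.records = (select max(m.records)
                     from ranked m
                     where m.concomp = b.concomp);
quit;
```

If two names in a component tie for the highest count, step 4 returns rows for both, so a tie-break rule would be needed in that case.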

triley · ‎04-27-2020

Thank you for your response. It looks like the table provided the output I was looking for, but I'm not clear whether proc optnet takes the "RECORDS" variables into consideration. I am not familiar with the procedure, so it may be built in, but I just wanted to clarify...

triley · ‎04-27-2020

Thank you for the response. As for your original question, I only have ~600 total records. For the code you provided, I think I understand what you are doing, but I wasn't able to reproduce the outcome I was looking for (even after trying to make some modifications). I was trying to modify the parts where you had a data step with a "data" and "set" statement using the same table name (e.g. have3, have4). After I tried modifying them, I was still getting a final dataset with a null value in the "name1" column. Is that expected?

triley · ‎04-27-2020

No, I don't believe there should ever be a case like the one you mentioned. There could be a case like the one I have below, though, where RICKY is the name I would like to "keep/retain" in NAME1, with the association to RICHARD and RICK in NAME2.

HAVE:
NAME1    RECORDS1  NAME2    RECORDS2
RICHARD  1539      RICK     9
RICHARD  1539      RICKY    1681
RICKY    1681      RICK     9
RICKY    1681      RICHARD  1539

WANT:
NAME1    RECORDS1  NAME2    RECORDS2
RICKY    1681      RICK     9
RICKY    1681      RICHARD  1539

triley · ‎04-24-2020

I want everything to roll up to the name with the highest record count, so since JOHN matched JOHNNY, and JOHNNY matched JONATHAN, JONATHAN should also roll up under JOHN. For your BRAD/BRADLEY scenario, the number of records will always be the same for each name, so there will never be a scenario like the one you mentioned.

triley · ‎04-24-2020

The subject might be misleading, but what I'm trying to do is take a dataset with matched names and create a new dataset that lists all matches under the "best" name (based on the highest value in the RECORDS1 column). I have provided both a visual of what I'm trying to do and the HAVE/WANT code. If anyone needs additional information to help solve this, please let me know. Thanks in advance!

HAVE:
NAME1     RECORDS1  NAME2    RECORDS2
TOM       5243      TOMMY    4
BRAD      873       BRADLEY  219
BRADLEY   219       BRAD     873
JOHN      61017     JOHNNY   905
JOHNNY    905       JOHN     61017
JONATHAN  500       JOHNNY   905

WANT:
NAME1  NAME2
TOM    TOMMY
BRAD   BRADLEY
JOHN   JOHNNY
JOHN   JONATHAN

DATA have;
INFILE DATALINES DSD;
INPUT NAME1 $ RECORDS1 NAME2 $ RECORDS2;
DATALINES;
TOM,5243,TOMMY,4
BRAD,873,BRADLEY,219
BRADLEY,219,BRAD,873
JOHN,61017,JOHNNY,905
JOHNNY,905,JOHN,61017
JONATHAN,500,JOHNNY,905
;
run;

DATA want;
INFILE DATALINES DSD;
INPUT NAME1 $ NAME2 $ ;
DATALINES;
TOM,TOMMY
BRAD,BRADLEY
JOHN,JOHNNY
JOHN,JONATHAN
;
run;

triley · ‎07-11-2017

Makes sense...Thank you!

triley · ‎07-10-2017

Great, thank you... One more thing, and I promise I won't ask anything additional of you. In my real data I have a "zip2_b" as well. I tried just adding it here:

HAVE2.definekey('ZIP_B','ZIP2_B');

but it gave me an error. Essentially I want to do the same thing as a join like this:

where a.zip_a=b.zip_b or a.zip_a=b.zip2_b or a.zip2_a=b.zip_b or a.zip2_a=b.zip2_b
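One possible way to get that "or" behavior, as a sketch rather than the solution given in the thread: listing two variables in definekey builds a single composite key (both must match), so instead the two zip columns of HAVE2 can be stacked into one lookup column and probed once per zip in HAVE1. The column zip2_b is assumed to exist as described above (it is not in the posted sample data), and the names zip_lookup, pairs, zip, and probe are made up for illustration.

```sas
/* Stack zip_b and zip2_b into one lookup column */
data zip_lookup;
  set have2;
  length zip $ 6;
  zip = zip_b;
  if not missing(zip) then output;
  if not missing(zip2_b) and zip2_b ne zip_b then do;
    zip = zip2_b;
    output;
  end;
  keep zip key_b name_b;
run;

data pairs;
  if 0 then set zip_lookup;            /* put zip, key_b, name_b in the PDV */
  if _n_ = 1 then do;
    declare hash h(dataset: 'zip_lookup', multidata: 'y');
    h.defineKey('zip');
    h.defineData('key_b', 'name_b');
    h.defineDone();
  end;
  set have1;
  array probe{2} zip_a zip2_a;         /* the two zips to look up */
  do i = 1 to 2;
    if not missing(probe{i}) then do;
      rc = h.find(key: probe{i});
      do while (rc = 0);
        match_ind = (name_a = name_b);
        output;
        rc = h.find_next();
      end;
    end;
  end;
  drop i rc zip;
run;

/* A pair reached through more than one zip appears more than once; keep one copy */
proc sort data=pairs out=want nodupkey;
  by key_a key_b;
run;
```

Blank zips are skipped here, whereas the SQL join would match a blank to a blank, so the two are only equivalent when blanks should not count as matches.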

triley · ‎07-10-2017

Chris, I really appreciate all the help. I wasn't able to get the MATCH_IND to work, though... I tried putting it everywhere between data; and run; and it didn't give me values for every record. Could you maybe run it and verify that it is working on your end, and then provide me with the code you got to work? Thanks again...

triley · ‎07-09-2017

What if the variables are named the same in both tables? My "real" data has both of the "name" variables called name (so instead of name_a=name_b it is name=name).

triley · ‎07-07-2017

The hash table took my query from around 20 minutes to around 20 seconds! Looks like I need to start learning how to use hash tables more now... Thank you very much for providing this info! One question about hash tables, though: can I still use this to compare a variable that isn't brought in? Let's say that in my example there is also an address for each record, and for the matching I need to check whether the address is the same. Do I need to declare that in the "define data" part? In other words, how could I write something equivalent to the SQL a.address=b.address?

triley · ‎07-06-2017

I was wondering if anyone had a better approach for a process I am currently running. What I'm trying to do is join two tables together based on zip, and then apply some matching techniques to find the "best" matches from the combined tables. I have the process working OK now using SQL joins, but the run time is very long. I was wondering if any of you had any ideas to optimize this process. I think it's also important to point out that I don't want to remove the non-matches (match_ind = 0); I need to keep all records even if they don't result in a "match". Please let me know if you have any questions or need further explanation of what I'm trying to do. I would be really interested to see if someone has a way to do this with either key indexing or hash tables, as I haven't had much experience with either. Thanks in advance!

Tom

data have1;
input key_a 2. name_a $ 4. zip_a $ 6. zip2_a $ 6.;
datalines ;
1 aaa 12345
2 bbb 12345
3 ccc 55555 12345
4 ddd 99999
;
run;

data have2;
input key_b 2. name_b $ 4. zip_b $ 6. ;
datalines ;
5 aaa 12345
6 ggg 12345
7 ccc 12345
8 ddd 99999
9 hhh 99999
;
run;

proc sql;
  create table want as
  select a.*
        , b.key_b
        , b.name_b
        , (a.name_a=b.name_b) as match_ind
  from have1 a
       inner join have2 b
       on a.zip_a=b.zip_b or a.zip2_a=b.zip_b
  order by key_a, key_b
  ;
quit;

WANT:
key_a  name_a  zip_a  zip2_a  key_b  name_b  match_ind
1      aaa     12345          5      aaa     1
1      aaa     12345          6      ggg     0
1      aaa     12345          7      ccc     0
2      bbb     12345          5      aaa     0
2      bbb     12345          6      ggg     0
2      bbb     12345          7      ccc     0
3      ccc     55555  12345   5      aaa     0
3      ccc     55555  12345   6      ggg     0
3      ccc     55555  12345   7      ccc     1
4      ddd     99999          8      ddd     1
4      ddd     99999          9      hhh     0
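For reference, a hash-object version of the join above might look roughly like this. It is a sketch under the assumption that HAVE2 fits in memory, not necessarily the code that was posted in reply. The names want_hash, h, and rc are made up for illustration.

```sas
data want_hash;
  if 0 then set have2;                 /* put key_b, name_b, zip_b in the PDV */
  if _n_ = 1 then do;
    /* multidata: 'y' keeps every have2 row that shares a zip */
    declare hash h(dataset: 'have2', multidata: 'y');
    h.defineKey('zip_b');
    h.defineData('key_b', 'name_b');
    h.defineDone();
  end;
  set have1;

  /* all have2 rows whose zip_b equals zip_a */
  rc = h.find(key: zip_a);
  do while (rc = 0);
    match_ind = (name_a = name_b);     /* non-matches are kept with 0 */
    output;
    rc = h.find_next();
  end;

  /* all have2 rows whose zip_b equals zip2_a (skipped when blank or a repeat of zip_a) */
  if not missing(zip2_a) and zip2_a ne zip_a then do;
    rc = h.find(key: zip2_a);
    do while (rc = 0);
      match_ind = (name_a = name_b);
      output;
      rc = h.find_next();
    end;
  end;
  drop rc zip_b;
run;

proc sort data=want_hash;
  by key_a key_b;
run;
```

As with the inner join, a HAVE1 row with no zip match in HAVE2 produces no output row. Every pair that shares a zip is kept, whether or not the names match.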

Online Status	Offline
Date Last Visited	‎05-14-2020 12:24 PM

Handling missing values in volume prediction models

Creating a table based on prior records

Table Joining