About Cruise

Cruise · ‎12-04-2017

I'd like to perform a one-to-one mentor-to-student match/linkage. One mentor can take only one student, and only the student nearest to the mentor's residential location. The program below works on this demo data. However, I have no idea how many passes it would take on my real data, where I have around N=500 mentors and N=3500 students, and this ratio can differ substantially across counties (N=62). What would be a more efficient way to program the multiple passes? (SAS 9.4)

/*MENTORS*/
data mentor;
  input mentor_ID $ User_X User_Y;
  cards;
U1 53.8 -32.1 0 23 1
U2 23.8 -96.3 1 25 2
U3 34.5 -70.7 0 28 1
U4 28.7 -76.5 1 35 2
;
run;

/*STUDENTS*/
data students;
  input student_ID $ User_X User_Y;
  cards;
A1 53.6 -32.1 1 63 1
A2 35.6 -12.3 0 25 2
A3 63.4 -85.4 1 69 1
A4 34.5 -70.9 0 45 2
A5 37.8 -77.7 1 55 1
;
run;

/*CALCULATION OF DISTANCE BETWEEN MENTORS AND STUDENTS*/
proc sql;
  create table distance(keep=mentor_id student_id distance) as
  select t1.*,
         t1.user_x*1 as x1, t1.user_y*1 as y1,
         t2.user_x*1 as x2, t2.user_y*1 as y2,
         t2.student_id,
         geodist(calculated x1, calculated y1, calculated x2, calculated y2) as distance
  from mentor t1, students t2
  group by t1.mentor_id
  having (calculated distance)=calculated distance;
quit;

/*KEEP MINIMUM DISTANCE*/
/*minimum distance*/
proc sql;
  create table dist_min(keep=mentor_id student_id distance) as
  select t1.*,
         t1.user_x*1 as x1, t1.user_y*1 as y1,
         t2.user_x*1 as x2, t2.user_y*1 as y2,
         t2.student_id,
         geodist(calculated x1, calculated y1, calculated x2, calculated y2) as distance
  from mentor t1, students t2
  group by t1.mentor_id
  having min(calculated distance)=calculated distance;
quit;

/*CHECK OUTPUTS*/
proc print data=distance; run;
proc print data=dist_min; run;

/*MENTOR TO STUDENT / ONE-TO-ONE MATCHES*/
/**pass1 */
proc sort data=dist_min;
  by student_id distance;
run;

data dist_min_keep;
  set dist_min;
  by student_id distance;
  if first.student_id then output;
run;

/*MENTORS MATCHED TO MANY STUDENTS*/
*PASS2;
proc sort data=mentor;
  by mentor_id;
run;

proc sort data=dist_min_keep;
  by mentor_id distance;
run;

data mentor2(where=(distance = .));
  merge mentor (in=a) dist_min_keep (in=b);
  by mentor_id;
  drop student_id;
  if a;
run;

data mentor2a(drop=distance);
  set mentor2;
run;

/*STUDENTS MATCHED TO SAME MENTORS*/
proc sort data=students;
  by student_id;
run;

proc sort data=dist_min_keep;
  by student_id;
run;

data student2(where=(distance=.));
  merge students (in=a) dist_min_keep (in=b);
  by student_id;
  drop mentor_id;
  if a;
run;

data student2a(drop=distance);
  set student2;
run;

*PASS2;
/*CALCULATE AND KEEP MINIMUM DISTANCE AMONG NOT DISTINCTLY MATCHED MENTORS AND STUDENTS FROM THE FIRST PASS */
proc sql;
  create table dist_2p (keep=mentor_id student_id distance) as
  select t1.user_x*1 as x1, t1.user_y*1 as y1, t1.mentor_id,
         t2.user_x*1 as x2, t2.user_y*1 as y2, t2.student_id,
         geodist(calculated x1, calculated y1, calculated x2, calculated y2) as distance
  from mentor2a t1, student2a t2
  group by t1.mentor_id
  having min(calculated distance)=calculated distance;
quit;

data final_matches;
  set Dist_min_keep dist_2p;
run;

proc print data=mentor;
proc print data=students;
proc print data=final_matches;
run;
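One way to avoid guessing the number of passes is a single greedy pass: sort all mentor-student pairs by distance and accept a pair only when both its mentor and its student are still free. The sketch below is a minimal illustration of that idea, not a definitive solution. It assumes the all-pairs table distance (mentor_id, student_id, distance) built by the first PROC SQL step in the post; the names pairs, greedy_matches, used_m and used_s are made up for this example. Greedy matching does not guarantee the smallest total distance, and for county-level matching the pairs would presumably be built within each county first.

```sas
/* All mentor-student pairs, closest first (assumes table DISTANCE exists) */
proc sort data=distance out=pairs;
  by distance;
run;

/* Single greedy pass: keep a pair only if both sides are still unmatched */
data greedy_matches;
  if _n_ = 1 then do;
    /* hypothetical helper hashes that remember who is already matched */
    declare hash used_m();
    used_m.defineKey('mentor_id');
    used_m.defineData('mentor_id');
    used_m.defineDone();

    declare hash used_s();
    used_s.defineKey('student_id');
    used_s.defineData('student_id');
    used_s.defineDone();
  end;

  set pairs;

  /* check() returns 0 when the key is already stored */
  if used_m.check() ne 0 and used_s.check() ne 0 then do;
    used_m.add();
    used_s.add();
    output;
  end;
run;

proc print data=greedy_matches;
run;
```

Because every pair is visited once in distance order, the number of passes no longer depends on the mentor-to-student ratio; the cost is dominated by building and sorting the pair table.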

Cruise · ‎11-28-2017

Summarizing from the responses.

PROC SQL approach:

proc sql;
  create table sharp as
  select a.id as id_a, b.id as id_b,
         a.latitude as lat_a, b.latitude as lat_b
  from data1 as a full join data2 as b
  on a.id=b.id;
quit;

data sharp1;
  set sharp;
  length source $ 9;
  if lat_a ^=. then source='a';
  if lat_b ^=. then source='b';
  if lat_a =. and lat_b=. then source='both_miss';
  if lat_a ^=. and lat_b^=. then source='match';
run;

Data step (merge) approach:

proc sort data=first; by uid; run;
proc sort data=second; by uid; run;

data check;
  merge first (in=ina rename=(latitude=lat_a))
        second(in=inb rename=(latitude=lat_b));
  by uid;
  if ina then id_a=uid;
  if inb then id_b=uid;
  drop uid;
  if lat_a=. then first=.;
  if lat_b=. then second=.;
  if lat_a^=. then first=1;
  if lat_b^=. then second=1;
run;

/* reads WORK.CHECK as created by the data step above */
proc freq data=check;
  tables first*second;
run;

Cruise · ‎11-28-2017

I have achieved it in a data step. I was wondering whether PROC SQL has a way to achieve the same result.

Cruise · ‎11-27-2017

How to trace source datasets in the proc sql output? My goal is to figure out the contribution of missing values, say, for the variable "latitude". I'm a novice with CASE WHEN and COALESCE in proc sql, so the proc sql code below is only meant to illustrate what I envisioned might work. There are nice SUGI papers, but I didn't find one that discusses an indicator variable for the "both_missing" scenario.

data data1;
  input id latitude;
  cards;
1001 43.2
1003 43.6
1004 .
;

DATA data2;
  INPUT id latitude;
  CARDS;
1001 43.2
1002 48.3
1004 .
;

data want;
  input id_a id_b lat_a lat_b indicator :$9.;
  cards;
1001 1001 43.2 43.2 match
. 1002 . 48.3 a_miss
1003 . 43.6 . b_miss
1004 1004 . . both_miss
;

PROC SQL;
  CREATE TABLE want AS
  SELECT a.id as id_a, a.latitude as lat_a,
         b.id as id_b, b.latitude as lat_b,
         CASE (a.latitude = b.latitude)
           WHEN lat_a=lat_b THEN 'Match'
           WHEN lat_a=. THEN 'a_miss'
           WHEN lat_b=. THEN 'b_miss'
           WHEN lat_a=. and lat_b=. THEN 'both_miss'
           ELSE 'else'
         END AS indvar LENGTH=5
  FROM data1 a join data2 b
  ON a.id = b.id;
QUIT;
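A minimal PROC SQL sketch of the indicator idea, assuming the data1 and data2 tables from the post. The output name want_sql and the extra level 'differ' are made up for this example. It uses a full join so that ids present in only one table survive, tests the "both missing" case first so that it is not swallowed by the single-missing branches, and gives the indicator a length of 9 so that 'both_miss' is not truncated.

```sas
proc sql;
  create table want_sql as
  select a.id       as id_a,
         b.id       as id_b,
         a.latitude as lat_a,
         b.latitude as lat_b,
         case
           /* most specific condition first */
           when a.latitude is missing and b.latitude is missing then 'both_miss'
           when a.latitude is missing then 'a_miss'
           when b.latitude is missing then 'b_miss'
           when a.latitude = b.latitude then 'match'
           else 'differ'   /* both present but unequal (hypothetical level) */
         end as indicator length=9
  from data1 as a
       full join data2 as b
       on a.id = b.id;
quit;
```

Note that 'a_miss' here covers both an id that is absent from data1 and an id that is present with a missing latitude; testing a.id as well would separate the two cases if that distinction matters.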

Cruise · ‎11-24-2017

@Reeza Hi Reeza, the code below worked. Please let me know if anything looks not quite right. I hope I got it.

PROC SURVEYSELECT data=test method=srs n=1 seed=1235 out=test_sample;
  strata id;
run;

Cruise · ‎11-22-2017

How to randomly select a single measurement for each individual from repeated-measurement data? My original data has the shape of 'test', and the resulting data would look like 'want', except that its observations would be picked at random.

data test;
  input ID Age_at_measurement_right Age_at_measurement_left;
  cards;
3232 1.1 1.5
3232 19.3 18.3
3232 33.2 32.1
3236 1.2 1.6
3236 16.2 13.2
3236 23.1 22.1
3266 1.5 1.3
3266 19.3 18.3
3266 33.1 32.3
;

data want;
  input ID Age_at_measurement_right Age_at_measurement_left;
  cards;
3232 19.3 18.3
3236 23.1 22.1
3266 1.5 1.3
;

Cruise · ‎11-21-2017

Yes, Mkeintz, both datasets have unique individual identifiers.

Cruise · ‎11-21-2017

Thanks a lot, guys. I totally agree with you on the need to stratify the data for multiple sql sessions. @art297 and @mkeintz, I have no issue with exact duplicates between the two datasets, because one of them is the National Synthetic Population 2010 dataset by RTI https://www.rti.org/impact/synthpop where they created a synthetic dot for every person. Those dots are spread randomly within the block group and are not intended to represent the exact location of each individual. My goal is to identify the synthetic 'dot-person' that matches an actual person in my data by nearest geographical distance, while also matching on some demographics. I hope that is clear enough!

Cruise · ‎11-20-2017

Matching people if geographically closest in distance and some other demographics: I have two datasets, 'one' and 'two'. Dataset 'one' has 20 million people with their location (latitude and longitude), gender and race/ethnicity (synthetic data). Dataset 'two' has 120,000 people with their location (latitude and longitude), gender and race/ethnicity. Dataset 'one' includes dataset 'two'. I'd like to treat a pair as 'matched' if the following conditions are met:
- the distance between the individuals in 'one' and 'two' is the shortest, and
- gender and race/ethnicity match.

The code below runs fine. However, would you agree that it is true to my logic? Does my program narrow the result down to the matched cases as explained above?

proc sql;
  create table want as
  select s.longitude*1 as s_long, s.latitude*1 as s_lat,
         c.longitude*1 as c_long, c.latitude*1 as c_lat,
         c.sex as c_sex, s.sex as s_sex,
         geodist(calculated s_lat, calculated s_long, calculated c_lat, calculated c_long) as distance
  from two c inner join one s
  on c.sex=s.sex and c.race=s.race
  group by c.uniq_id
  having min(calculated distance)=calculated distance;
quit;

Cruise · ‎11-17-2017

Figured it out from RW9's insights and ballardw's suggestion: I created a copy folder with all the datasets and use that as my library m. Naming the library once on the PROC DATASETS statement and dropping the libref prefix from the dataset names in the CHANGE statement did the job. All I have to do now is put it in a macro and do the same for the remaining 30-40 datasets. Thanks a lot.

LIBNAME m "......\work_data";

proc datasets lib=m;
  change oldname=newname;
run;
quit;

Cruise · ‎11-17-2017

I'd like to call datasets from a specific library and change their names in the work library. I'm not allowed to change the dataset names permanently because the data are shared. I have 30 datasets and will run the proc datasets in a macro once my code works outside the macro. What am I doing wrong?

LIBNAME m "......\work_data";

proc datasets lib=m;
  change m.data2015_bf=data_2015;
run;
quit

Error log:

126 proc datasets lib=m;
127 change m.data2015_bf=data_2015;
           ----------------
           22
           201
ERROR 22-322: Expecting a name.
ERROR 201-322: The option is not recognized and will be ignored.
128 run;
NOTE: Enter RUN; to continue or QUIT; to end the procedure.
NOTE: Statements not processed because of errors noted above.
129 quit
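The log error appears to come from the two-level name in the CHANGE statement: CHANGE takes one-level member names, and the library is supplied by the LIB= option. A hedged sketch of one way to get a renamed copy in WORK without touching the shared library is to copy first and rename in WORK afterwards. It assumes the libref m is assigned as in the post and that WORK does not already contain data2015_bf or data_2015.

```sas
/* Copy the shared dataset into WORK; the original in M is left untouched */
proc copy in=m out=work memtype=data;
  select data2015_bf;
run;

/* Rename the WORK copy; CHANGE uses one-level names only */
proc datasets lib=work nolist;
  change data2015_bf=data_2015;
run;
quit;
```

Since both steps take plain member names, wrapping them in a macro with the old and new names as parameters should be straightforward for the remaining datasets.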

Cruise · ‎08-30-2017

I'd like to keep 'age_in_days' in the dataset 'plot', although it is obviously not distinct. Putting age_in_days after 'having' or before 'distinct' in the select statement didn't work. My goal is a line graph of 'age_in_days' against 'tot'.

proc sql;
  create table plot as
  select distinct id, a1, a2, a3
  from temp1
  where (0 <= age_in_days <= 10)
  group by id, a1, a2, a3
  having date = min(date);
quit;

proc transpose data=plot out=final;
run;

data final1(keep=tot _NAME_);
  set final;
  tot=sum(of Col:);
run;

Cruise · ‎08-28-2017

Great ways to align the labels. Thank you. Do you know why "by descending" has no effect in the proc sort? Do the proc print "style" options override options set earlier?

proc means data=have stackods n maxdec=0;
  class op_fac_name;
  var a1-a17;
  ods output summary=fac_types(drop=NObs _control_ where=(N ne 0));
run;

proc sort data=fac_types;
  by descending op_fac_name n;
run;

proc print data=fac_types label noobs style(header)={just=l};
  label variable='Defect Types';
  label n='Total number of defects';
  label op_fac_name='Facility name';
  format variable $group.;
  title "FIRST INCIDENCE";
run;

Cruise · ‎08-28-2017

I'd like to send proc means' output directly to a PDF file. To make it more visually appealing, I wonder if I could left-align the column labels and the title in the final output, as shown in image WANT below?

proc means data=have stackods n maxdec=0;
  class fac;
  var a1-a17;
  ods output summary=result(drop=NObs _control_ where=(N ne 0));
run;

proc print data=result label noobs;
  label variable='Birth Defect Types';
  label n='Total number of defects';
  label op_fac_name='Facility name';
  format variable $group.;
  title "FIRST INCIDENCE";
run;

Cruise · ‎08-28-2017

Hi Reeza. Thanks a million. Proc means is the way. By the way, I had "options nolabel;" in my program, which conflicted with all later label statements. I realized that late and corrected it. Now all my label statements are applied with no conflict.


proc sgplot - how to show / force fixed values on x-axis?

PROC SGPLOT how to get more diverse colors

Re: Proc sgplot how to achieve specific order for labels in keylegend?

Re: Swimmer's plot, how to show dose level and text inside the bars

Proc sgplot how to achieve specific order for labels in keylegend?

Swimmer's plot, how to show dose level and text inside the bars

Re: Data merge by multiple variables keeping distinct levels of both d...

Re: PROC SGPLOT how to get more diverse colors

Split mixture of strings separated by multiple different delimiters

Re: Compute IQR and STD per record to proc gmap

Re-organize table using proc tabulate or report or transpose?

Match/Link with multiple passes

Re: How to trace source datasets in the proc sql output?

How to trace source datasets in the proc sql output?

Re: Random selection of single measurement of each individuals from a ...

Random selection of single measurement of each individuals from a repe...

Re: Matching people if geographically closest in distance and some oth...

Matching people if geographically closest in distance and some other d...

Re: Datasets from specific library and change names in work library

Datasets from specific library and change names in work library

Proc sql, Select Distinct and not Distinct variables

Re: How to left align the text in Proc Means output?

How to left align the text in Proc Means output?

Re: PROC FREQ doesn't order with order=freq