About wlierman

wlierman · 06-15-2021

I have a SAS dataset of 1,694,321 obs. that I wanted to convert to CSV and send to another party. The log of the export is here:

    NOTE: The file 'I:\Health Analytics\AMB_INFORMATICS\ORPHEUS_ONE_Syntax_Log_MM\Merged_File\Merged_ORPHEUS_ONE_6.15.2021.csv' is:
          Filename=I:\Health Analytics\AMB_INFORMATICS\ORPHEUS_ONE_Syntax_Log_MM\Merged_File\Merged_ORPHEUS_ONE_6.15.2021.csv,
          RECFM=V,LRECL=32767,File Size (bytes)=0,
          Last Modified=15Jun2021:10:49:23,
          Create Time=15Jun2021:10:49:23

    NOTE: 1694322 records were written to the file 'I:\Health Analytics\AMB_INFORMATICS\ORPHEUS_ONE_Syntax_Log_MM\Merged_File\Merged_ORPHEUS_ONE_6.15.2021.csv'.
          The minimum record length was 22.
          The maximum record length was 118.
    NOTE: There were 1694321 observations read from the data set SASONE.OR_ONE_ALL_MERGE5A.
    NOTE: DATA statement used (Total process time):
          real time           2.50 seconds
          cpu time            2.28 seconds

    1694321 records created in I:\Health Analytics\AMB_INFORMATICS\ORPHEUS_ONE_Syntax_Log_MM\Merged_File\Merged_ORPHEUS_ONE_6.15.2021.csv from SASONE.OR_ONE_ALL_MERGE5A.

    NOTE: "I:\Health Analytics\AMB_INFORMATICS\ORPHEUS_ONE_Syntax_Log_MM\Merged_File\Merged_ORPHEUS_ONE_6.15.2021.csv" file was successfully created.
    NOTE: PROCEDURE EXPORT used (Total process time):
          real time           2.71 seconds
          cpu time            2.36 seconds

When I open the file to test whether it reads into Excel, it doesn't load completely but stops after loading 1,048,576 obs. I thought a CSV file could be loaded into Excel without running into a size constraint. What is the best method to solve this problem? Thanks. wklierman
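For context on the symptom described above: 1,048,576 is the maximum number of rows on a single Excel worksheet, so the CSV itself is probably complete (the log reports 1,694,322 records written, presumably the 1,694,321 observations plus one header row) and only the Excel view is cut off. If the recipient has to open the data in Excel, one option is to export it in pieces that each fit on a sheet. A minimal sketch, assuming an arbitrary split point of 1,000,000 rows and hypothetical output paths under `C:\temp`:

```sas
/* Sketch only: the split point and the output paths are assumptions. */

/* Rows 1 through 1,000,000 */
proc export data=SASONE.OR_ONE_ALL_MERGE5A (firstobs=1 obs=1000000)
    outfile="C:\temp\part1.csv"   /* hypothetical path */
    dbms=csv
    replace;
run;

/* Rows 1,000,001 through the end of the dataset */
proc export data=SASONE.OR_ONE_ALL_MERGE5A (firstobs=1000001)
    outfile="C:\temp\part2.csv"   /* hypothetical path */
    dbms=csv
    replace;
run;
```

Each piece gets its own header row. If the recipient reads the file with SAS, R, Python, or a database loader instead of Excel, the single full CSV should not need splitting.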

wlierman · 06-15-2021

Thank you for the help. I have a dataset that merged without dups, so I'm sending that forward. I am going to work on this smaller but dup-containing dataset. Thanks again. wklierman

wlierman · 06-15-2021

I have been working on a merge of two data sets. The numbers of obs are:

    Orpheus_one_for_matching = 1,454,515 (originally 1,817,358 obs)
    Orpheus_for_matching     =   299,088 (originally 326,329 obs)

I created a concatenated common variable as shown:

    Data SASONE.ORPHEUS_ONE_For_Matching;
      Set SASONE.ORPHEUS_ONE_For_Matching;
      Common_ID_4 = CATS(LastNm,FirstNm);
    run;

    Data SASONE1.ORPHEUS_For_Matching;
      Set SASONE1.ORPHEUS_For_Matching;
      Common_ID_4 = CATS(LastNm,FirstNm);
    run;

I merged the two datasets on Common_ID_4:

    Data SASONE.ORPOne_AllVarMerge_5A (drop = FirstMerge);
      Merge SASONE.ORPHEUS_ONE_For_Matching /*(In=In2)*/
            SASONE1.ORPHEUS_For_Matching; *(In=In1);
      by Common_ID_4;
      *If In1 then output;
      *In1 = In2;
      If FirstNm = 'ANONYMOUS' then delete;
      Else If FirstNm = 'HIV' then delete;
    run;

There were still duplicates, so I did a proc sort and used last.Common_ID_4:

    proc sort data = SASONE.ORPOne_AllVarMerge_5A NODUPKEYS;
      by DOB;
    run;

    Data SASONE.ORPOne_AllVarMerge_5_Keep;
      Set SASONE.ORPOne_AllVarMerge_5;
      by Common_ID_4;
      If Last.Common_ID_4 Then output;
    run;

This resulted in 1,694,385 obs in the combined data set.

I suspect there are duplicates, so I ran the following:

    Proc sql noprint;
      CREATE TABLE SASONE.ORPOne_AllVarDups_5_Check AS
      SELECT 'SASONE.ORPHEUS_ONE_For_Matching' As Dataset,
             Count(Distinct Common_ID_4) as Ndistinct,
             Count(*) as N
        From SASONE.ORPHEUS_ONE_For_Matching
      Outer Union Corresponding
      SELECT 'SASONE1.ORPHEUS_For_Matching' As Dataset,
             Count(Distinct Common_ID_4) as Ndistinct,
             Count(*) as N
        From SASONE1.ORPHEUS_For_Matching
      Outer Union Corresponding
      SELECT 'SASONE.ORPOne_AllVarMerge_5A' As Dataset,
             Count(Distinct Common_ID_4) as Ndistinct,
             Count(*) as N
        From (SELECT Common_ID_4 From SASONE.ORPHEUS_ONE_For_Matching
              Outer Union Corresponding
              SELECT Common_ID_4 From SASONE1.ORPHEUS_For_Matching);
    quit;

The next step was to print the duplicate obs:

    Proc sql noprint;
      CREATE TABLE SASONE.DUPS_EXAMINE AS
      SELECT Common_ID_4
        FROM (SELECT Common_ID_4 FROM SASONE.ORPHEUS_ONE_For_Matching
              OUTER UNION CORRESPONDING
              SELECT Common_ID_4 FROM SASONE1.ORPHEUS_For_Matching)
       GROUP BY Common_ID_4
      HAVING Count(Common_ID_4) > 1;
    quit;
    /* 229,206 obs 6.14.2021 */

The dups still left in the dataset total 229,206, with a portion coming from each individual data set. My question is: how can I delete the dups from the data set with 1,694,385 obs? Nothing seems to work, neither the last.var method nor proc sort with NODUPKEY. What would you propose as a solution, so I can provide a dataset without those remaining duplicates? Thank you. wklierman
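Two details in the code above may be worth checking before deleting anything. First, `proc sort ... NODUPKEYS; by DOB;` keeps one row per date of birth, not one row per person, so it removes people who merely share a birthday. Second, the `DUPS_EXAMINE` query stacks the keys from both sources, so a `Common_ID_4` that occurs once in each dataset (a genuine match) is also counted as having `Count > 1`. One possible approach is to de-duplicate each source on the merge key before merging and to keep the discarded rows for review. A minimal sketch, in which every `work.` dataset name is hypothetical:

```sas
/* Sketch only: all work.* dataset names below are hypothetical. */

/* Keep the first row per key; rows that are dropped go to DUPOUT= */
proc sort data=SASONE.ORPHEUS_ONE_For_Matching
          out=work.one_unique dupout=work.one_dups nodupkey;
    by Common_ID_4;
run;

proc sort data=SASONE1.ORPHEUS_For_Matching
          out=work.orp_unique dupout=work.orp_dups nodupkey;
    by Common_ID_4;
run;

/* With unique keys on both sides, the merge yields at most
   one output row per Common_ID_4 */
data work.merged_unique;
    merge work.one_unique (in=in_one)
          work.orp_unique (in=in_orp);
    by Common_ID_4;
    in_both = in_one and in_orp;   /* 1 when the key is in both sources */
run;
```

Because `Common_ID_4` is built only from last and first name, this also collapses different people who share a name, so the `DUPOUT=` datasets are worth inspecting before the result is passed on.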

wlierman · 06-09-2021

Thank you for the link to the paper - looks to be helpful. Also I appreciate your iterative approach to exploring this challenge. That iterative approach is very closely related to the response from SASkiwi. Thank you for your help. I really appreciate it. wklierman

wlierman · 06-09-2021

I really like the iterative approach. That sounds like it could provide a very complete approach - one that I can share with the researchers who eventually will use the data. Thank you. wklierman

wlierman · 06-08-2021

I have two datasets that need to be merged. The potential problem is that neither dataset has what I would call an obvious choice for a sort / by variable. These are a REALD data set of individuals who are identified by REALD measures and a Medicaid program data set (like TANF and SNAP). I will give only a couple of fictionalized rows from each.

Orpheus dataset (data set with respondents to disability questions on survey):

    FirstNm   LastNm     DOB        Sex   MiddleNm   City
    James     Eastwood   09/19/75   M     A          Medford
    Alishay   Connell    03/06/91   F                Eugene

ONE dataset (Medicaid program):

    FirstNm   LastNm     DOB        Sex   MiddleNm   City
    Daniel    Hart       07/18/80   M     Patrick    Baker City
    Jade                 01/02/88   F                Portland

There are missing values (like LastNm, the second field above). Also, some of the raw data has extraneous entries like commas and single and double quotes, especially around names.

I have looked over some SAS articles on LexJansen.com, but they are not quite what I was hoping to find. What I'd like to find is a straightforward way to merge these two datasets that have some problem data entries and missing values.

What would be the best method to achieve a merge of these datasets? Concatenating two fields, then sorting and using the concatenated variable to perform the merge? Or adding some type of indicator variable to each dataset to serve the sorting and merging by requirements?

Your thoughts and help are much valued, thank you. wklierman
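The concatenation idea in the question can be sketched as follows. This is only an illustration under stated assumptions: `work.orpheus` and `work.orpheus_keyed` are hypothetical dataset names, `DOB` is assumed to be stored as a numeric SAS date, and `Match_Key`, `First_Clean`, `Last_Clean`, and `Key_Complete` are made-up variable names. The step strips commas and quotes from the names, upper-cases them, and builds one key from last name, first name, and date of birth.

```sas
/* Sketch only: dataset and new variable names are hypothetical,
   and DOB is assumed to be a numeric SAS date. */
data work.orpheus_keyed;
    set work.orpheus;
    length First_Clean Last_Clean $50 Match_Key $120;

    /* Remove commas, double quotes and single quotes, then upper-case */
    First_Clean = upcase(compress(FirstNm, ',"' || "'"));
    Last_Clean  = upcase(compress(LastNm,  ',"' || "'"));

    /* CATX trims each piece and skips blank ones */
    Match_Key = catx('|', Last_Clean, First_Clean, put(DOB, yymmddn8.));

    /* Flag rows where a key component is missing, for separate review */
    Key_Complete = not missing(LastNm)
               and not missing(FirstNm)
               and not missing(DOB);
run;
```

The same step would be run on the second dataset, after which both can be sorted and merged by `Match_Key`. Rows with `Key_Complete = 0`, such as the example row with no last name, probably need a separate, looser matching pass rather than the exact-key merge.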

wlierman · 05-13-2021

Your code approach also produces the output that I am needing. I would also post your email as a solution along with the solution from Tom. But I don't know how to submit a second solution. Thanks for your help. wlierman

wlierman · 05-13-2021

Hello, By chunks I meant different portions of the dataset - demographics for the contact; race and ethnicity breakouts; language; and the disability section that you have helped me on. While all the REALD data is sparsely populated, the disability section is the most sparse. Your earlier coding helped lead to a solution, so I will label it as a solution. Thanks again. wlierman

wlierman · 05-12-2021

I have a dataset that has over 181,000 rows and upwards of 80-90 fields. It is not a huge, huge dataset but it is very sparse. (It is based on an epidemiologist survey of covid-19 cases - so not everyone supplied some type of response.) What I need is code that will allow me to check various chunks (sections of columns) to get a count of the number of numeric observations that have some type of response (age when a disability begins). So hypothetically the table could be:

    Variable_name   # missing   # non-missing   total    Percent_missing
    DEYEage         61,000      2,000           63,000   96.8
    DEARage         59,900      3,100           63,000   95.0

I would also like to be able to "reuse" the code to test other parts of the dataset for missing/nonmissing, maybe for annual income, population etc. Thank you for your help. wlierman
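One way to produce a row of the table described above is PROC SQL, where `COUNT(variable)` counts only nonmissing values and works for both numeric and character columns. The sketch below wraps the query in a small macro so it can be reused for other variables. The macro name `miss_count`, its parameters, and the dataset `work.survey` are all hypothetical.

```sas
/* Sketch only: miss_count, dsn=, var= and work.survey are hypothetical. */
%macro miss_count(dsn=, var=);
    proc sql;
        select "&var" as Variable_name length=32,
               count(*) - count(&var) as n_missing,     /* missing rows    */
               count(&var)            as n_nonmissing,  /* nonmissing rows */
               count(*)               as total,
               calculated n_missing / calculated total * 100
                   as percent_missing format=5.1
        from &dsn;
    quit;
%mend miss_count;

%miss_count(dsn=work.survey, var=DEYEage)
%miss_count(dsn=work.survey, var=DEARage)
```

Because the count does not depend on the variable's type, the same call should also work for character fields such as the Yes/No disability indicators, where a blank value counts as missing.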

wlierman · 05-12-2021

I have a dataset that has over 181,000 rows and upwards of 80-90 fields. It is not a huge, huge dataset but it is very sparse. (It is based on an epidemiologist survey of covid-19 cases - so not everyone supplied some type of response.) What I need is code that will allow me to check various chunks (sections of columns) to get a count of the number of observations that have some type of response (could be No or Yes). I don't want the No or Yes counted separately, just whether there was a response or the obs is blank / missing. So hypothetically the table could be:

    Variable name   # non-missing   # missing   Total    percent_missing
    DEYEdi          45,000          18,000      63,000   28.5
    DEARdi          50,000          13,000      63,000   20.6

I would also like to be able to "reuse" the code to test other parts of the dataset for missing/nonmissing, maybe for zipcodes, counties, cities etc. Thank you for your help. (I will open another question for counting missing/nonmissing numeric vars in the dataset.) wlierman

wlierman · 05-12-2021

Hello, I ran the code but ran into the usual error here:

         proc format;
    335    value $ missfmt ' ' = "Missing" other = "Not_missing_char";
    NOTE: Format $MISSFMT is already on the library WORK.FORMATS.
    NOTE: Format $MISSFMT has been output.
    336    value nmissfmt . = "Missing" other = "Not_missing_num";
    NOTE: Format NMISSFMT is already on the library WORK.FORMATS.
    NOTE: Format NMISSFMT has been output.
    NOTE: PROCEDURE FORMAT used (Total process time):
          real time           24.06 seconds
          cpu time            0.28 seconds

    337  Data OPERA.AKA_data_test;
    338  *Format DLEAdi DMHDdi DCOMdi DEARdi DEYEdi DDRSdi DOUTdi DREMdi DPHYdi $missfmt.;
    339  Format DEYEage DCOMage DREMage DEARage DDRSage DOUTage DLEAage DLIMage DMHDage DPHYage
    339!        AgAcq1st nmissfmt.;
    340  set OPERA.AKA_data_forum;
    ERROR: Variable DEYEage has been defined as both character and numeric.
    ERROR: Variable DCOMage has been defined as both character and numeric.
    ERROR: Variable DREMage has been defined as both character and numeric.
    ERROR: Variable DEARage has been defined as both character and numeric.
    ERROR: Variable DDRSage has been defined as both character and numeric.
    ERROR: Variable DOUTage has been defined as both character and numeric.
    ERROR: Variable DLEAage has been defined as both character and numeric.
    ERROR: Variable DLIMage has been defined as both character and numeric.
    ERROR: Variable DMHDage has been defined as both character and numeric.
    ERROR: Variable DPHYage has been defined as both character and numeric.
    ERROR: Variable AgAcq1st has been defined as both character and numeric.
    341  run;
    NOTE: The SAS System stopped processing this step because of errors.
    WARNING: The data set OPERA.AKA_DATA_TEST may be incomplete. When this step was stopped there were 0 observations and 12 variables.
    WARNING: Data set OPERA.AKA_DATA_TEST was not replaced because this step was stopped.
    NOTE: DATA statement used (Total process time):
          real time           0.03 seconds
          cpu time            0.00 seconds

I think the procedure may work. What is the best method to correct this? I use the rename, swap, and drop the new_var at the end of the code. Is that the best? Thanks. wlierman
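The errors in the log suggest that the age variables are stored as character in `OPERA.AKA_data_forum`, while the `Format` statement placed before `set` declares them numeric by attaching the numeric format `nmissfmt.`. The rename-convert-drop pattern mentioned in the post is a common way around this. A minimal sketch for two of the variables, assuming they really are character and hold digits or blanks; the output dataset `work.aka_numeric` and the `_c` suffix are hypothetical:

```sas
/* Sketch only: work.aka_numeric and the _c names are hypothetical. */
data work.aka_numeric;
    /* Rename the character originals on the way in */
    set OPERA.AKA_data_forum
        (rename=(DEYEage=DEYEage_c DEARage=DEARage_c));

    /* Create numeric versions under the original names.
       ?? suppresses the log messages for values that cannot be
       read as numbers; those values become numeric missing. */
    DEYEage = input(DEYEage_c, ?? 32.);
    DEARage = input(DEARage_c, ?? 32.);

    drop DEYEage_c DEARage_c;
run;
```

After this step a numeric format such as `nmissfmt.` can be applied to `DEYEage` and `DEARage` without the type conflict. The remaining age variables would follow the same pattern.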

wlierman · 05-12-2021

I'm not connecting the dots on this. I ran the code that was provided, which I am sending from the proc format statement on:

    proc format;
      value $ missfmt ' ' = "Missing" other = "Not_Missing";
      value nmissfmt low-high ="Missing" other="Not_Missing";
    run;

    * turn off the output and capture the one-way freq table TEMP dataset;
    ods select none;
    ods table onewayfreqs=temp;
    proc freq data=&INPUT_DSN.;
      table _all_ / missing;
      format _numeric_ nmissfmt. _character_ $missfmt.;
    run;

    ** turn outputs back on;
    ods select all;

    ** Collapse to one observation per variable;
    Data &OUTPUT_DSN;
      length name $32 missing not_missing total 8;
      Set temp;
      by table notsorted;
      If first.table then call missing(of missing not_missing);
      name = substr(table,7);
      if vvaluex(name)='Missing' then missing=frequency;
      else not_missing=frequency;
      retain missing not_missing;
      if last.table then do;
        missing = sum(0,missing);
        not_missing=sum(0,not_missing);
        total=sum(missing,not_missing);
        percent = divide(missing,total);
        output;
      end;
      keep name missing not_missing total percent;
    run;

The output is not correct. If the numeric format was working properly, the age variables should have but a handful of nonmissing values - like 1200 or 12000 - and the percent would be .90. The brief output from the actual output data set shows that the numeric variables are being counted as though they are the character variables. The only reason that there are so many non-missing character variables is that the value is either No or Yes - but there are blanks too, which seem to be counted okay. I have attached a one-page Word doc with a handful of examples. Thanks for helping. wlierman

wlierman · 05-12-2021

The method is really close to a solution. I re-ran my code to produce the tables. I noticed, though, that for the numeric variables the missing and non-missing counts looked reversed: the number of non-missing was listed as missing in the table, and the missing count was shown in the non-missing column. If the var was character, the count looked correct. How is the count adjusted when the missing value is for a numeric variable? Thank you. wlierman
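A possible explanation for the reversed numeric counts, assuming the numeric format was defined as `value nmissfmt low-high ="Missing" other="Not_Missing";` (the definition quoted in another post on this page): the range `low-high` matches every nonmissing numeric value, so real values get the label "Missing" and actual missing values fall into `other`. A sketch of the format pair with the numeric labels the other way round:

```sas
proc format;
    /* A blank character value is treated as missing */
    value $missfmt ' '   = "Missing"
                   other = "Not_Missing";

    /* . is the ordinary numeric missing value;
       every other value, including all real numbers, is nonmissing */
    value nmissfmt .     = "Missing"
                   other = "Not_Missing";
run;
```

With this definition, special missing values such as `.A` would still land in `other`, so the range would need widening if the data uses them.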

wlierman · 05-11-2021

Thank you. I will test out the code. It looks like it is perfect for numeric vars. I am in the middle of testing some other code - but will be sure to credit your help too. Thanks. wlierman

wlierman · 05-11-2021

Okay I will try that. The included tables are exactly what I am trying to get at. Thanks. wlierman

