Solved: Re: 4 variables not aligning after concatenating, impacting formats

twagner · Posted 11-21-2022 01:40 AM

Hi all,

I have a formatting question that is rather frustrating. I combined (concatenated) two CSV files after ensuring that the type of variable, length, format, and informat all aligned. Below are the variables from the combined file.

My issue is that four of the variables from the second file (8215 and down) did not align in the same way. On the surface this isn't a big deal, it appears to be mostly cosmetic. But then when I go to apply a format, such as for gender, 0 = male and 1 = female, it works for IDs 1 - 8214, but then erases the responses for 8215 and down. I tried to remove the format post concatenation and then apply my format, created via proc format, but it's unsuccessful.

I went to run a chi-square test using two of the variables, study_group and gender, and this is the result that I got. It should be a two x two table, with 0 and 1 as the only options for each variable. I have never experienced this before.

s_lassen · Posted 11-23-2022 07:22 AM

So, one of the .csv files seems to have some character values right aligned. To correct this, you could left-align all the character values in the data read after importing, e.g.:

data <name of table read>;
  set <name of table read>;
  array chars _char_;
  do _N_=1 to dim(chars);
    chars(_N_)=left(chars(_N_));
    end;
run;

View solution in original post

andreas_lds · Posted 11-21-2022 01:52 AM

Leading blanks in alphanumeric variables are often the result of not-properly converting numeric variables to alphanumeric. Can you post the code used to read the csv-files?

twagner · Posted 11-21-2022 09:33 AM

Absolutely. Here it is, the only thing I edited for the post was the name of the file pathway.

PROC IMPORT OUT = IG_PS_2019 /* Output data set */

DATAFILE = "G:file pathway" /* Input data set */

DBMS = csv; /* File type of input data set */

GUESSINGROWS=75;

RUN;

PROC IMPORT OUT = CG_PS_2019 /* Output data set */

DATAFILE = "G:file pathway" /* Input data set */

DBMS = csv; /* File type of input data set */

GUESSINGROWS=75;

RUN;

Maybe the issue came when trying to adjust the variables to be the same type, format, and informat between the two csv files.

DATA IG_PS_2019_r;
SET IG_PS_2019;
IF insurance___2 = 1 THEN medicare = 1; ELSE medicare = 0; ** Create Medicare variable **;
DROP insurance___1 insurance___2 insurance___3 insurance___4 insurance___5 
insurance___6 cc_total_age ldl_baseline_value race study_group gender redcap_id; ** Drop unwanted variables **;
race_r = put(race,8.); ** Converting from numeric to character **;
study_group_r = put(study_group,8.);
gender_r = put(gender,8.); 
record_id = put(redcap_id,8.); ** Taking chance to rename variable here**;
RENAME race_r=race study_group_r=study_group gender_r=gender;
FORMAT medicare BEST12. gender_r race_r study_group_r $8. record_id $11.;
INFORMAT medicare BEST12. gender_r race_r study_group_r $8. record_id $11.;
RUN;

DATA CG_PS_2019_r;
LENGTH gender race study_group $8.;
SET CG_PS_2019;
	age_r = input(age, 8.); ** Converting from character to numeric **;
	cc_total_r = input(cc_total, 8.);
	num_meds_r = input(num_meds, 8.);
	sbp_baseline_value_r = input(sbp_baseline_value, 8.);
	hba1c_baseline_value_r = input(hba1c_baseline_value, 8.);
	index_date = input(_2019_Visit_Date, YYMMDD10.); ** Taking chance to rename variable here **;
	medicare_r = input(medicare, 8.);
DROP age cc_total num_meds sbp_baseline_value hba1c_baseline_value insurance_plan insurance
_2019_Eligibility_Date _2019_Visit_Date medicare;
RENAME age_r=age cc_total_r=cc_total num_meds_r=num_meds sbp_baseline_value_r=sbp_baseline_value
hba1c_baseline_value_r=hba1c_baseline_value medicare_r=medicare;
FORMAT age_r cc_total_r num_meds_r sbp_baseline_value_r hba1c_baseline_value_r medicare_r best12. index_date YYMMDD10.
gender race study_group $8.;
INFORMAT age_r cc_total_r num_meds_r sbp_baseline_value_r hba1c_baseline_value_r medicare_r best32. index_date YYMMDD10.
gender race study_group $8.;
RUN;

The IG_PS_2019_r file is the one in which the positioning of the variables mentioned earlier starts to become a little off.

andreas_lds · Posted 11-22-2022 01:02 AM

The recommended tool for reading csv-files is the data step. Giving you full control about the types of all variables making additional steps to fix wrong guessed types and lengths superfluous.

The format used in the put function needs to match the number of digits that are expected, seems to be 1 in this case, too. A numeric format in the put function will always right-align the value, this can be avoided by adding -L:

race_r = put(race,8.-L);

Tom · Posted 11-23-2022 09:04 AM

Maybe the issue came when trying to adjust the variables to be the same type, format, and informat between the two csv files.

Yes. Don't do that. Do not "import" CSV files. Instead just read them in with a data step. Your example PROC IMPORT steps are a useful tool to help you LOOK at the data and figure out how you would want to define the variables so that you can then write your own data step to read them.

This will allow you to have full control over how the variables are named, their type and storage length. Any formats that you might need to attach to them and any labels you might want to attach to them. Note that of the 11 variables listed in the first photograph of your original post only one has any need to have a format attached, the one with the date values. The strings do not need to have any format attached. And unless you have some reason to display a fixed number of decimal places or the magnitude of the numbers would required more than 12 character to display there is no need to attach formats to the regular numeric variables either.

So you first one the data step could be as simple as this:

data IG_PS_2019;
  infile  "G:file pathway"  dsd truncover firstobs=2;
  input .... 
  format index_date yymmdd10.;
run;

You just need to fill out the INPUT statement to name the variables. For the INDEX_DATE add the :yymmdd. after it to tell SAS to use the YYMMDD informat to read it. For the character variables add :$10. after them (where 10 is the number of bytes of storage you want the variable to use) to force SAS to guess that the variables should be defined as character of that length.

s_lassen · Posted 11-23-2022 07:22 AM

So, one of the .csv files seems to have some character values right aligned. To correct this, you could left-align all the character values in the data read after importing, e.g.:

data <name of table read>;
  set <name of table read>;
  array chars _char_;
  do _N_=1 to dim(chars);
    chars(_N_)=left(chars(_N_));
    end;
run;

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!