Hello, I am analyzing data for a project. Unfortunately, the data was provided in tables in PDF documents. We imported the PDF tables into Excel and some of the data split into multiple rows. We then read the data into SAS for analysis. Is there a way in SAS to combine data in different rows without a linking ID (the linking ID itself is split). I've attached a sample of fake data below that has the same structural issues I am looking to resolve. Each row should start with "Initiating Action:" - any data in a row that does not have this needs to be combined with the row above it. Any recommendations for how I should proceed with cleaning this are much appreciated.
What I have:
Event Type | ID | Case Type |
Initiating Action: Pet | X-000-CV-2024- | Admin |
0123 | ||
Initiating Action: | X-000-CV-2024- | Contract & |
Complaint | 4432 | Debts |
Initiating Action: Pet | X-000-CV-2022- | Admin |
0681 | ||
Initiating Action: | X-000-CV-2023 | Miscellaneous |
Complaint | 0873 | Issue |
What it needs to look like:
Event Type | ID | Case Type |
Initiating Action: Pet | X-000-CV-2024-0123 | Admin |
Initiating Action: Complaint | X-000-CV-2024-4432 | Contract & Debts |
Initiating Action: Pet | X-000-CV-2022-0681 | Admin |
Initiating Action: Complaint | X-000-CV-2023-0873 | Miscellaneous issue |
This is a RENAME exercise which can be automated. Uses a version of the UPDATE trick. You can use SQL to generate the rename statement. Notice the issue with spacing sometimes you want the space others not. I update this UPDATE statement example using UPDATEMODE option.
data have;
infile cards dsd dlm='09'x firstobs=2;
length G_ 8;
input (Event ID Type) (:$64.);
if event eq: 'Initiating' then G_ + 1;
cards;
Event Type ID Case Type
Initiating Action: Pet X-000-CV-2024- Admin
0123
Initiating Action: X-000-CV-2024- Contract &
Complaint 4432 Debts
Initiating Action: Pet X-000-CV-2022- Admin
0681
Initiating Action: X-000-CV-2023 Miscellaneous
Complaint 0873 Issue
;;;;
run;
proc print;
run;
data want;
update have(obs=0) have(rename=(Event=C_Event ID=C_ID Type=C_Type)) updatemode=nomissingcheck;
by G_;
array C_ C_:;
array C Event ID Type;
do over c_;
c = catx(' ',of c C_);
end;
drop C_:;
run;
proc print;
run;
First, I feel your pain. I once received a bunch of "data" in the form of Fillable PDF forms and had multiple issues with getting values of of PDF into a usable form.
My first try would be to start over. How did you get the tables from PDF to Excel? Acrobat reader export tool?
Different PDF reader export? Copy and Paste?
Did you do anything in Excel like examine the data for hidden rows/columns? Merged cells? before attempting to read the data into SAS? How did you get the data from Excel to SAS?
With my project I had to manually fix a number of recurring hidden/merged cells and sometimes copy the data from one cell in Excel to another after getting rid of the merged cells and un-hiding all the hidden row/columns.
Clean up the Excel, then save as CSV and read that file and you may have more luck.
With LAG function you can get the values from a previous row of data but without an example of data in the from of a working data step I'm not going to try. For example, you are showing things that aren't normally valid variable names.
However your example shows problems with just determining when and where a row was split. Which may mean you'll have better luck cleaning before reading into SAS if possible.
This is a RENAME exercise which can be automated. Uses a version of the UPDATE trick. You can use SQL to generate the rename statement. Notice the issue with spacing sometimes you want the space others not. I update this UPDATE statement example using UPDATEMODE option.
data have;
infile cards dsd dlm='09'x firstobs=2;
length G_ 8;
input (Event ID Type) (:$64.);
if event eq: 'Initiating' then G_ + 1;
cards;
Event Type ID Case Type
Initiating Action: Pet X-000-CV-2024- Admin
0123
Initiating Action: X-000-CV-2024- Contract &
Complaint 4432 Debts
Initiating Action: Pet X-000-CV-2022- Admin
0681
Initiating Action: X-000-CV-2023 Miscellaneous
Complaint 0873 Issue
;;;;
run;
proc print;
run;
data want;
update have(obs=0) have(rename=(Event=C_Event ID=C_ID Type=C_Type)) updatemode=nomissingcheck;
by G_;
array C_ C_:;
array C Event ID Type;
do over c_;
c = catx(' ',of c C_);
end;
drop C_:;
run;
proc print;
run;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.