Automatically Cleaning Laboratory Results in R using the ‘lab2clean’ package

1. Introduction

Moving clinical laboratory data from primary everyday clinical use to secondary research use presents a significant challenge. Because preprocessing and cleaning these data requires substantial time and expertise, and because all-in-one tools tailored to this need are lacking, we developed our algorithm lab2clean as an open-source R package. The lab2clean package automates and standardizes the intricate process of cleaning clinical laboratory results. With a focus on improving the data quality of laboratory result values and units, our goal is to equip researchers with a straightforward, plug-and-play tool that makes it easier to unlock the potential of clinical laboratory data in clinical research and clinical machine learning (ML) model development.

The lab2clean package contains four key functions. Two functions that clean and validate result values (Version 1.0) are described in detail in Zayed et al. (2024) [https://doi.org/10.1186/s12911-024-02652-7]. The clean_lab_result() function cleans and standardizes the laboratory results, and the validate_lab_result() function performs validation to ensure the plausibility of these results. The other two functions, which standardize and harmonize result units (added in Version 2.0), are described in detail in Zayed et al. (2025) [https://doi.org/10.1016/j.ijmedinf.2025.106131]. The standardize_lab_unit() function cleans and standardizes the formats of laboratory units of measurement according to the Unified Code for Units of Measure (UCUM), and the harmonize_lab_unit() function harmonizes the units found in a laboratory dataset to reference units following either SI or Conventional units, converting the numeric result values accordingly.

This vignette aims to explain the theoretical background, usage, and customization of these functions.

2. Setup

Installing and loading the `lab2clean` package

You can install and load the lab2clean package directly in R.

#install.packages("lab2clean")

After installation, load the package:

library(lab2clean)

3. Function 1: Clean and Standardize results

The clean_lab_result() function has five arguments:

lab_data : A dataset containing laboratory data
raw_result : The column in lab_data that contains raw result values to be cleaned
locale : A string representing the locale for the laboratory data. Defaults to “NO”
report : A report is written in the console. Defaults to “TRUE”.
n_records : In case you are loading a grouped list of distinct results, then you can assign the n_records to the column that contains the frequency of each distinct result. Defaults to NA

Let us demonstrate the clean_lab_result() function using Function_1_dummy and inspect the first six rows:

data("Function_1_dummy", package = "lab2clean")
head(Function_1_dummy,6)

raw_result	frequency
?	108
*	243
[	140
_	268
1.1 x 10^9	284
2.34 x 10E12	42

This demonstration dataset contains two columns: raw_result and frequency. The raw_result column holds raw laboratory results, and frequency indicates how often each result appeared. Let’s explore the report and n_records arguments:

cleaned_results <- clean_lab_result(Function_1_dummy, raw_result = "raw_result", report = TRUE, n_records = "frequency")

#> Step 1: Handling records with extra variables stored with the result value removing interpretative flags, or units
#> ==========================================================================================
#> ⚠ 8 distinct results (8.742% of the total result records) with interpretative flags (e.g. positive, negative, H, L) -> flags removed with cleaning comment added flag).
#> ⚠ 17 distinct results (20.043% of the total result records) with unit (%, exponents, or other units) -> units removed with cleaning comment added Percent, Exponent, or Units).
#> Step 2: classify and standardize different scale types - part 1
#> ==========================================================================================
#> ✔ 3 distinct results (5.373% of the total result records) of scale type ‘Ord.2’, which describes grades of positivity (e.g. 2+, 3+).
#> ✔ 7 distinct results (7.966% of the total result records) of scale type ‘Qn.2’, which describes inequality results (e.g. >120, <1).
#> ✔ 4 distinct results (6.233% of the total result records) of scale type ‘Qn.3’, which describes numeric range results (e.g. 2-4).
#> ✔ 4 distinct results (3.092% of the total result records) of scale type ‘Qn.4’, which describes titer results (e.g. 1/40).
#> ✔ 55 distinct results (61.335% of the total result records) of scale type ‘Qn.1’, which describes numeric results (e.g. 56, 5.6, 5600).
#> ⚠ 4 distinct results (4.853% of the total result records) with numeric result values that cannot be determined without predefined locale setting (US or DE) -> cleaning comment added locale_check).
#> ✔ 4 distinct results (4.888% of the total result records) of scale type ‘Ord.1’, which describes positive or negative results (Neg, Pos, or Normal).
#> ✔ 1 distinct results (1.019% of the total result records) of scale type ‘Nom.1’, which describes blood groups (e.g. A+, AB).
#> Last Step: Classifying non-standard text records
#> ==========================================================================================
#> ⚠ 0 distinct results (0% of the total result records) with multiple result values (e.g. postive X & negative Y) -> cleaning comment added (multiple_results).
#> ⚠ 0 distinct results (0% of the total result records) with words about sample or specimen (e.g. sample not found) -> cleaning comment added (test_not_performed).
#> ⚠ 8 distinct results (8.777% of the total result records) with meaningless inputs (e.g. = , .) -> cleaning comment added (No_result).
#> ⚠ 1 distinct results (1.317% of the total result records) that could not be standardized or classified -> cleaning comment added (not_standardized).
#> ==========================================================================================
#> ✔ 78 distinct results (89.906% of the total result records) were cleaned, classified, and standardized.
#> ⏰ Time taken is 0 min, 0 sec
#>
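The percentages in a report like the one above are weighted by the record counts supplied through n_records. The following is a minimal sketch of that weighting in base R, not the package's internal code; the toy data and the rule for "meaningless" inputs are assumptions made for illustration.

```r
# Toy grouped data: each distinct raw result with its record count (made-up values)
toy <- data.frame(
  raw_result = c("?", "5.6", ">120", "="),
  frequency  = c(10, 60, 25, 5),
  stringsAsFactors = FALSE
)

# Hypothetical rule: treat "?" and "=" as meaningless inputs
is_meaningless <- toy$raw_result %in% c("?", "=")

# Number of distinct results, and their share of all underlying records
n_distinct  <- sum(is_meaningless)
pct_records <- 100 * sum(toy$frequency[is_meaningless]) / sum(toy$frequency)

cat(n_distinct, "distinct results (", pct_records, "% of the total result records)\n")
```

Without a frequency column, only the count of distinct results can be reported, because the number of records behind each distinct result is unknown.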

The report describes in detail how each step of the cleaning process was applied and offers some descriptive insights into the data. The n_records argument adds percentages to each of the aforementioned steps to enhance the reporting. For simplicity, we will use report = FALSE in the rest of this tutorial:

cleaned_results <- clean_lab_result(Function_1_dummy, raw_result = "raw_result", report = FALSE)

#> ✔ 78 result records were cleaned, classified, and standardized.
#> ⏰ Time taken is 0 min, 0 sec
#>

cleaned_results

raw_result	frequency	clean_result	scale_type	cleaning_comments
?	108	NA	NA	No_result
*	243	NA	NA	No_result
[	140	NA	NA	No_result
_	268	NA	NA	No_result
1.1 x 10^9	284	1.1	Qn.1	Exponents
2.34 x 10E12	42	2.34	Qn.1	Exponents
2,34 X 10^12	173	2.34	Qn.1	Exponents
3.14159 * 10^30	271	3.142	Qn.1	Exponents
1.1x10+9	179	1.1	Qn.1	Exponents
2,34X10^12	153	2.34	Qn.1	Exponents
3.14159*10^30	288	3.142	Qn.1	Exponents
3.142*10^30	152	3.142	Qn.1	Exponents
1,1 x 10e9	213	1.1	Qn.1	Exponents
3	185	3	Qn.1	NA
1.1 x 10^-9	58	1.1	Qn.1	Exponents
2.34 X 10-12	273	2.34	Qn.1	Exponents
3.14159E-30	96	3.142	Qn.1	Exponents
1x10^9	41	1	Qn.1	Exponents
1E9	119	1	Qn.1	Exponents
2+	288	2+	Ord.2	NA
+	270	1+	Ord.2	NA
+++	217	3+	Ord.2	NA
0-1	203	0-1	Qn.3	NA
1-2	298	1-2	Qn.3	NA
1-	207	1	Qn.1	flag
01-02	221	1-2	Qn.3	NA
1 -2	177	1-2	Qn.3	NA
3 - 2	190	NA	NA	not_standardized
-	108	Neg	Ord.1	NA
+ 230	70	230	Qn.1	flag
100*	290	100	Qn.1	NA
+56	274	56	Qn.1	flag
- 5	216	5	Qn.1	flag
80%	245	80	Qn.1	Percent
-5	37	-5	Qn.1	NA
> 12	159	>12	Qn.2	NA
<1050	235	<1050	Qn.2	NA
< 02	88	<2	Qn.2	NA
>= 20.3	116	>=20.3	Qn.2	NA
>1:40	93	>1:40	Qn.4	NA
1/80	69	1:80	Qn.4	NA
<1/20	142	<1:20	Qn.4	NA
< 1/020	142	<1:020	Qn.4	NA
=	130	NA	NA	No_result
/	71	NA	NA	No_result
0.2	67	0.2	Qn.1	NA
33 Normal	93	33	Qn.1	flag
negative 0.1	156	0.1	Qn.1	flag
H 256	102	256	Qn.1	flag
30%	262	30	Qn.1	Percent
23 %	42	23	Qn.1	Percent
1056	149	1056	Qn.1	NA
1056040	246	1056040	Qn.1	NA
3560	63	3560	Qn.1	NA
0,3	181	0.3	Qn.1	NA
15,6	86	15.6	Qn.1	NA
2.9	64	2.9	Qn.1	NA
02.9	233	2.9	Qn.1	NA
2.90	272	2.9	Qn.1	NA
250	131	250	Qn.1	NA
1.025	210	1.025	Qn.1	locale_check
1.025	56	1.025	Qn.1	locale_check
1025	134	1025	Qn.1	NA
1025	104	1025	Qn.1	NA
1025.7	250	1025.7	Qn.1	NA
1.025,7	151	1025.7	Qn.1	NA
1.025,36	249	1025.36	Qn.1	NA
1,025.36	249	1025.36	Qn.1	NA
>1.025,36	244	>1025.36	Qn.2	NA
<=1,025.36	149	<=1025.36	Qn.2	NA
1.015	234	1.015	Qn.1	locale_check
1,060	200	1,060	Qn.1	locale_check
2,5	222	2.5	Qn.1	NA
2.5	30	2.5	Qn.1	NA
>3,48	158	>3.48	Qn.2	NA
3.48	89	3.48	Qn.1	NA
93	133	93	Qn.1	NA
,825	195	0.825	Qn.1	NA
0,825	125	0.825	Qn.1	NA
1.256894	60	1.257	Qn.1	NA
.	96	NA	NA	No_result
,	210	NA	NA	No_result
Négatif 0.3	143	0.3	Qn.1	flag
Négatif	243	Neg	Ord.1	NA
Pøsitivo	58	Pos	Ord.1	NA
A+	147	A	Nom.1	NA
pos & negative Y	296	Neg	Ord.1	NA

This function creates three different columns:

1- clean_result: The cleaned version of the raw_result column. For example, “?” is converted to NA, “3.14159 * 10^30” to “3.142”, and “+++” to “3+”.

2- scale_type : Categorizes the cleaned results into specific types like Quantitative (Qn), Ordinal (Ord), or Nominal (Nom), with further subcategories for nuanced differences, such as differentiating simple numeric results (Qn.1) from inequalities (Qn.2), range results (Qn.3), or titer results (Qn.4) within the Quantitative scale.

3- cleaning_comments: Provides insights on how the results were cleaned.
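To make the scale_type idea concrete, here is a simplified sketch that assigns a scale type to already-cleaned strings using regular expressions. The function classify_scale() and its patterns are hypothetical and far cruder than the package's rules; they only illustrate how the subcategories differ.

```r
# Hypothetical classifier for already-cleaned result strings (illustration only)
classify_scale <- function(x) {
  x <- trimws(x)
  if (grepl("^[<>]=?[0-9]+(\\.[0-9]+)?$", x)) return("Qn.2")  # inequality, e.g. >12
  if (grepl("^[0-9]+-[0-9]+$", x))            return("Qn.3")  # numeric range, e.g. 1-2
  if (grepl("^[<>]?1:[0-9]+$", x))            return("Qn.4")  # titer, e.g. 1:80
  if (grepl("^[0-9]\\+$", x))                 return("Ord.2") # grade of positivity, e.g. 2+
  if (grepl("^-?[0-9]+(\\.[0-9]+)?$", x))     return("Qn.1")  # plain numeric, e.g. 56
  NA_character_                                               # not classified
}

examples    <- c("56", ">12", "1-2", "1:80", "2+", "?")
scale_types <- vapply(examples, classify_scale, character(1), USE.NAMES = FALSE)
print(data.frame(clean_result = examples, scale_type = scale_types))
```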

The process above provided a generic description on how the clean_lab_result() function operates. It would be useful to delve into more details on the exact way that some of the specific raw results are cleaned:

Locale variable:

In the clean_lab_result() function, the locale argument addresses the variations in number formats, with different decimal and thousand separators, that arise from locale-specific settings used internationally. We chose to standardize these locale-specific formats so that the cleaned results follow the English (US) convention. If the user does not specify the locale of the dataset, the default is NO, which means not specified. For example, rows 71 and 72 have locale_check in cleaning_comments, and their cleaned results are 1.015 and 1,060 respectively. This means that either the “US” or the “DE” locale must be specified to interpret these result values. Specifying the locale as US or DE gives different values, as follows:

Function_1_dummy_subset <- Function_1_dummy[c(71,72),, drop = FALSE]
cleaned_results <- clean_lab_result(Function_1_dummy_subset, raw_result = "raw_result", report = FALSE, locale = "US")

#> ✔ 2 result records were cleaned, classified, and standardized.
#> ⏰ Time taken is 0 min, 0 sec
#>

cleaned_results

	raw_result	frequency	clean_result	scale_type	cleaning_comments
71	1.015	234	1.015	Qn.1
72	1,060	200	1060	Qn.1

cleaned_results <- clean_lab_result(Function_1_dummy_subset, raw_result = "raw_result", report = FALSE, locale = "DE")

#> ✔ 2 result records were cleaned, classified, and standardized.
#> ⏰ Time taken is 0 min, 0 sec
#>

cleaned_results

	raw_result	frequency	clean_result	scale_type	cleaning_comments
71	1.015	234	1015	Qn.1
72	1,060	200	1.06	Qn.1
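The effect of the locale on ambiguous values can be sketched in a few lines of base R. The helper parse_locale_number() below is hypothetical and is not part of lab2clean; it assumes "US" uses a comma as thousand separator and "DE" uses a dot.

```r
# Hypothetical helper: interpret separators according to the stated locale
parse_locale_number <- function(x, locale = c("US", "DE")) {
  locale <- match.arg(locale)
  if (locale == "US") {
    x <- gsub(",", "", x, fixed = TRUE)   # drop thousand separators
  } else {
    x <- gsub(".", "", x, fixed = TRUE)   # drop thousand separators
    x <- gsub(",", ".", x, fixed = TRUE)  # decimal comma becomes decimal point
  }
  as.numeric(x)
}

us_values <- parse_locale_number(c("1.015", "1,060"), "US")
de_values <- parse_locale_number(c("1.015", "1,060"), "DE")
print(us_values)
print(de_values)
```

Under these assumptions the same two strings parse to 1.015 and 1060 for US, and to 1015 and 1.06 for DE, which matches the two tables above.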

Language in common words:

In the clean_lab_result() function, we support 19 distinct languages in representing frequently used terms such as “high,” “low,” “positive,” and “negative.” For example, the word Pøsitivo is included in the common words and will be cleaned as Pos.

Let us see how this data table works in our function:

data("common_words", package = "lab2clean")
common_words

Language	Positive	Negative	Not_detected	High	Low	Normal	Sample	Specimen
English	Positive	Negative	Not detected	High	Low	Normal	Sample	Specimen
Spanish	Positivo	Negativo	No detectado	Alto	Bajo	Normal	Muestra	Especimen
Portuguese	Positivo	Negativo	Nao detectado	Alto	Baixo	Normal	Amostra	Especime
French	Positif	Negatif	Non detecte	Eleve	Bas	Normal	Echantillon	Specimen
German	Positiv	Negativ	Nicht erkannt	Hoch	Niedrig	Normal	Probe	Probe
Italian	Positivo	Negativo	Non rilevato	Alto	Basso	Normale	Campione	Campione
Dutch	Positief	Negatief	Niet gedetecteerd	Hoog	Laag	Normaal	Staal	Monster
Polish	Dodatni	Ujemny	Nie wykryto	Wysoki	Niski	Normalny	Probka	Probka
Swedish	Positiv	Negativ	Inte upptackt	Hog	Lag	Normal	Prov	Prov
Danish	Positiv	Negativ	Ikke opdaget	Hoj	Lav	Normal	Prove	Prove
Norwegian	Positiv	Negativ	Ikke oppdaget	Hoy	Lav	Normal	Prove	Prove
Finnish	Positiivinen	Negatiivinen	Ei havaittu	Korkea	Matala	Normaali	Nayte	Nayte
Czech	Pozitivni	Negativni	Nezjisteno	Vysoky	Nizky	Normalni	Vzorek	Vzorek
Hungarian	Pozitiv	Negativ	Nem eszlelt	Magas	Alacsony	Normal	Mintavetel	Mintadarab
Croatian	Pozitivan	Negativan	Nije otkriveno	Visok	Nizak	Normalan	Uzorak	Uzorak
Slovak	Pozitivny	Negativny	Nezistene	Vysoky	Nizky	Normalny	Vzorka	Vzorka
Slovenian	Pozitiven	Negativen	Ni zaznano	Visok	Nizek	Normalno	Vzorec	Vzorec
Estonian	Positiivne	Negatiivne	Ei tuvastatud	Korge	Madal	Normaalne	Proov	Proov
Lithuanian	Teigiamas	Neigiamas	Neaptiktas	Aukstas	Zemas	Normalus	Pavyzdys	Pavyzdys

As seen in this data, there are 19 languages for 8 common words. If the result is a word meaning positive or negative, it is cleaned to Pos or Neg, unless it is accompanied by a number, in which case the word is removed and a flag is added to cleaning_comments. For example, the result Négatif 0.3 is cleaned as 0.3 and the result 33 Normal is cleaned as 33. Finally, if the result contains one of the words “Sample” or “Specimen”, the cleaning comment test_not_performed is added, indicating that the test was not performed.
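The word-handling behaviour described here can be approximated with a small lookup. This is a sketch under stated assumptions: the word vectors are a short ASCII excerpt of the table, clean_word_result() is a hypothetical name, and accented spellings are not handled.

```r
# Illustrative excerpt of the word lists, lower-cased (not the full table)
positive_words <- c("positive", "positivo", "positif", "positiv")
negative_words <- c("negative", "negativo", "negatif", "negativ")
normal_words   <- c("normal", "normale", "normaal")

# Hypothetical helper: returns the cleaned result and the cleaning comment
clean_word_result <- function(x) {
  x      <- tolower(trimws(x))
  number <- regmatches(x, regexpr("[0-9]+(\\.[0-9]+)?", x))
  word   <- gsub("[^a-z]", "", x)
  if (!word %in% c(positive_words, negative_words, normal_words)) {
    return(c(clean_result = NA_character_, cleaning_comments = NA_character_))
  }
  if (length(number) == 1) {
    # A number accompanies the word: keep the number, record a flag
    return(c(clean_result = number, cleaning_comments = "flag"))
  }
  clean <- if (word %in% positive_words) "Pos" else if (word %in% negative_words) "Neg" else "Normal"
  c(clean_result = clean, cleaning_comments = NA_character_)
}

print(clean_word_result("negative 0.1"))
print(clean_word_result("33 Normal"))
print(clean_word_result("Positivo"))
```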

Flag creation:

In addition to the common words, when there is a space between a numeric value and a minus character, this also creates a flag. For example, result - 5 is cleaned as 5 with a flag, but the result -5 is cleaned as -5, and no flag is created because we can assume it was a negative value.
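The distinction between a spaced and an unspaced minus sign can be expressed as a single pattern test. The helper below is a hypothetical sketch of that rule, not the package's implementation.

```r
# Hypothetical helper: a minus followed by whitespace is treated as a flag,
# a minus attached to the digits is treated as a negative sign
clean_signed_value <- function(x) {
  x <- trimws(x)
  if (grepl("^-\\s+[0-9]", x)) {
    list(clean_result = sub("^-\\s+", "", x), cleaning_comments = "flag")
  } else {
    list(clean_result = x, cleaning_comments = NA_character_)
  }
}

spaced   <- clean_signed_value("- 5")
unspaced <- clean_signed_value("-5")
str(spaced)
str(unspaced)
```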

4. Function 2: Validate results

The validate_lab_result() function has seven arguments:

lab_data : A data frame containing laboratory data
result_value : The column in lab_data with quantitative result values for validation
result_unit : The column in lab_data with result units in a UCUM-valid format
loinc_code : The column in lab_data indicating the LOINC code of the laboratory test
patient_id : The column in lab_data indicating the identifier of the tested patient.
lab_datetime : The column in lab_data with the date or datetime of the laboratory test.
report : A report is written in the console. Defaults to “TRUE”.

Let us check how our package validates the results using the validate_lab_result() function. Let us consider the Function_2_dummy data, which contains 86,863 rows, and inspect its first 6 rows:

data("Function_2_dummy", package = "lab2clean")
head(Function_2_dummy, 6)

patient_id	lab_datetime1	loinc_code	result_value	result_unit
10000003	2023-08-09	1975-2	19	umol/L
10000003	2023-08-09	1968-7	20	umol/L
10000003	2023-09-09	1975-2	19	mmol/L
10000003	2023-09-09	1968-7	20	umol/L
10000003	2023-09-09	1968-7	20	umol/L
10000011	2023-10-09	1975-2	19	umol/L

Let us apply the validate_lab_result() and see its functionality:

validate_results <- validate_lab_result(Function_2_dummy, 
                                        result_value="result_value",
                                        result_unit="result_unit",
                                        loinc_code="loinc_code",
                                        patient_id = "patient_id" , 
                                        lab_datetime="lab_datetime1")

#> Preprocessing Step for Duplicate Records
#> ===============================================================================================
#> ⚠ 166 duplicate records were flagged.
#> These are multiple records of the same test for the same patient at the same result timestamp.
#> Check 1: Reportable Limits Check
#> ===============================================================================================
#> ⚠ 5 extremely low result records were flagged (low_unreportable).
#> ⚠ 2 extremely high records were flagged (high_unreportable).
#> Check 2: Logic Consistency Checks
#> ===============================================================================================
#> ⚠ 7 result records were flagged for violating relational logic rules (logic_flag).
#> Check 3: Delta Change Limits Checks
#> ===============================================================================================
#> ⚠ 55 records were flagged for having extreme change values from previous results within 7 days (delta_flag_7d).
#> ⚠ 15 records were flagged for having extreme change values from previous results within 8-90 days (delta_flag_8_90d).
#> ===============================================================================================
#> ✔ 99.712% of the lab data records were validated with no flag detected.
#> ⏰ Time taken is 0 min, 1.8 sec
#>
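The duplicate preprocessing step in the report describes duplicates as multiple records of the same test for the same patient at the same timestamp. A minimal base R sketch of that definition follows; the toy data are invented, and the choice of key columns is an assumption based on the report's wording.

```r
# Toy data (made-up values): the first two rows share patient, test and timestamp
labs <- data.frame(
  patient_id   = c("A", "A", "A", "B"),
  loinc_code   = c("2093-3", "2093-3", "2093-3", "2093-3"),
  lab_datetime = c("2023-09-09", "2023-09-09", "2023-10-09", "2023-09-09"),
  result_value = c(230, 215, 230, 180),
  stringsAsFactors = FALSE
)

# Flag every member of a duplicated group, not only the later occurrences
key    <- labs[, c("patient_id", "loinc_code", "lab_datetime")]
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
labs$flag <- ifelse(is_dup, "duplicate", NA_character_)
print(labs)
```

The result value is deliberately left out of the key, so two records with different values at the same timestamp are both flagged.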

The validate_lab_result() function generates a flag column, with different checks:

head(validate_results, 6)

loinc_code	result_unit	patient_id	lab_datetime1	result_value	flag
13457-7	mg/dL	1e+07	2023-09-09	100.0	NA
13457-7	mg/dL	1e+07	2023-10-09	100.0	logic_flag
1751-7	g/dl	1e+07	2023-08-09	3.1	NA
1751-7	g/dl	1e+07	2023-09-09	7.5	logic_flag
1751-7	g/dl	1e+07	2023-10-09	7.5	NA
18262-6	mg/dL	1e+07	2023-11-09	100.0	NA

levels(factor(validate_results$flag))

#> [1] “delta_flag_7d” “delta_flag_8_90d” “duplicate”
#> [4] “high_unreportable” “logic_flag” “low_unreportable”

We can now subset specific patients to explain the flags:

subset_patients <- validate_results[validate_results$patient_id %in% c("14236258", "10000003", "14499007"), ]
subset_patients

loinc_code	result_unit	patient_id	lab_datetime1	result_value	flag
13457-7	mg/dL	10000003	2023-09-09	100.0	NA
13457-7	mg/dL	10000003	2023-10-09	100.0	logic_flag
1751-7	g/dl	10000003	2023-08-09	3.1	NA
1751-7	g/dl	10000003	2023-09-09	7.5	logic_flag
1751-7	g/dl	10000003	2023-10-09	7.5	NA
18262-6	mg/dL	10000003	2023-11-09	100.0	NA
1968-7	umol/L	10000003	2023-08-09	20.0	logic_flag
1968-7	umol/L	10000003	2023-09-09	20.0	duplicate
1968-7	umol/L	10000003	2023-09-09	20.0	duplicate
1968-7	umol/L	10000003	2023-10-09	20.0	NA
1975-2	umol/L	10000003	2023-08-09	19.0	logic_flag
1975-2	mmol/L	10000003	2023-09-09	19.0	NA
2085-9	mg/dL	10000003	2023-09-09	130.0	NA
2085-9	mg/dL	10000003	2023-10-09	130.0	logic_flag
2085-9	mg/dL	10000003	2023-11-09	130.0	NA
2093-3	mg/dL	10000003	2023-08-09	230.0	NA
2093-3	mg/dL	10000003	2023-09-09	230.0	duplicate
2093-3	mg/dL	10000003	2023-09-09	215.0	duplicate
2093-3	mg/dL	10000003	2023-10-09	230.0	logic_flag
2093-3	ng/dL	10000003	2023-11-09	230.0	NA
2885-2	g/dl	10000003	2023-08-09	7.0	NA
2885-2	g/dl	10000003	2023-09-09	7.0	logic_flag
2885-2	mg/dl	10000003	2023-10-09	7.0	NA
2160-0	mg/dL	14236258	2180-11-23 22:30:00	13.2	NA
2160-0	mg/dL	14236258	2181-02-22 08:10:00	13.1	NA
2160-0	mg/dL	14236258	2181-03-07 11:00:00	9.4	NA
2160-0	mg/dL	14236258	2181-03-24 16:35:00	27.2	delta_flag_8_90d
2160-0	mg/dL	14236258	2181-03-25 06:25:00	16.8	delta_flag_7d
2160-0	mg/dL	14236258	2181-03-26 06:10:00	19.0	NA
2160-0	mg/dL	14236258	2181-04-02 10:00:00	9.7	delta_flag_7d
2160-0	mg/dL	14236258	2181-06-29 14:00:00	16.9	delta_flag_8_90d
2160-0	mg/dL	14236258	2181-06-30 05:32:00	10.8	delta_flag_7d
2160-0	mg/dL	14236258	2181-07-10 22:44:00	10.0	NA
2160-0	mg/dL	14236258	2181-07-10 23:25:00	10.3	NA
2160-0	mg/dL	14236258	2181-07-11 10:00:00	11.6	NA
2160-0	mg/dL	14236258	2181-07-12 02:30:00	13.6	NA
2160-0	mg/dL	14236258	2181-10-17 17:10:00	10.6	NA
2160-0	mg/dL	14236258	2181-10-18 06:40:00	12.6	NA
2160-0	mg/dL	14236258	2181-11-30 07:00:00	19.7	delta_flag_8_90d
2160-0	mg/dL	14236258	2181-12-17 06:44:00	12.1	delta_flag_8_90d
2160-0	mg/dL	14499007	2180-06-02 07:10:00	1.0	NA
2160-0	mg/dL	14499007	2180-10-26 15:00:00	0.8	NA
2160-0	mg/dL	14499007	2180-10-27 05:53:00	1.0	NA
2160-0	mg/dL	14499007	2180-10-27 15:15:00	0.0	low_unreportable
2160-0	mg/dL	14499007	2180-10-28 06:35:00	0.9	NA
2160-0	mg/dL	14499007	2180-10-29 05:52:00	1.0	NA
2160-0	mg/dL	14499007	2180-10-30 12:26:00	0.9	NA
2160-0	mg/dL	14499007	2180-10-31 03:11:00	0.8	NA
2160-0	mg/dL	14499007	2180-11-01 06:20:00	1.0	NA
2160-0	mg/dL	14499007	2180-11-02 04:22:00	1.0	NA

Patient 14236258 has both delta_flag_8_90d and delta_flag_7d flags. The delta limits are calculated from the lower and upper percentiles, set to 0.0005 and 0.9995 respectively. While the delta check is effective in identifying potentially erroneous result values, we acknowledge that it may also flag clinically relevant changes. Therefore, it is crucial that users interpret these flagged results in conjunction with the patient’s clinical context.
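A delta check of this kind can be sketched as follows. This is an illustration only: it assumes the delta is the simple difference from the previous result of the same patient, and the limits limits_7d and limits_8_90d are made-up stand-ins for percentile-derived limits.

```r
# Toy series for one patient and one test (made-up values)
labs <- data.frame(
  patient_id   = c(1, 1, 1, 1),
  lab_datetime = as.Date(c("2023-01-01", "2023-01-03", "2023-01-20", "2023-06-01")),
  result_value = c(1.0, 5.0, 1.2, 1.1)
)
labs <- labs[order(labs$patient_id, labs$lab_datetime), ]

# Change from the previous result and days elapsed, computed per patient
lag_diff      <- function(v) c(NA, diff(v))
labs$delta    <- ave(labs$result_value, labs$patient_id, FUN = lag_diff)
labs$gap_days <- ave(as.numeric(labs$lab_datetime), labs$patient_id, FUN = lag_diff)

# Hypothetical limits standing in for the lower and upper percentile limits
limits_7d    <- c(lower = -2, upper = 2)
limits_8_90d <- c(lower = -3, upper = 3)
outside <- function(d, lim) !is.na(d) & (d < lim[["lower"]] | d > lim[["upper"]])

labs$flag <- NA_character_
labs$flag[outside(labs$delta, limits_7d) & labs$gap_days <= 7] <- "delta_flag_7d"
labs$flag[outside(labs$delta, limits_8_90d) &
            labs$gap_days >= 8 & labs$gap_days <= 90] <- "delta_flag_8_90d"
print(labs)
```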

Let us also explain two tables that we used for the validation function. Let us begin with the reportable interval table.

data("reportable_interval", package = "lab2clean")
reportable_interval_subset <- reportable_interval[reportable_interval$interval_loinc_code == "2160-0", ]
reportable_interval_subset

interval_loinc_code	UCUM_unit	low_reportable_limit	high_reportable_limit
2160-0	mg/dL	1e-04	120

Patient 14499007 has a flag named low_unreportable. As we can see, for the loinc_code “2160-0”, the result was 0.0, which is outside the reportable range (0.0001, 120). Similarly, patient 17726236 has a high_unreportable flag.
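The reportable limits check amounts to comparing each value with the two limits for its LOINC code. The sketch below uses the limits shown for “2160-0”; the value 150 is invented to trigger the upper limit, and whether the package treats the limits themselves as reportable is not stated here, so strict comparisons are an assumption.

```r
# Limits taken from the reportable_interval row for 2160-0 shown above
low_limit  <- 1e-04
high_limit <- 120

# 0.0 and 0.9 appear in the patient data above; 150 is a made-up value
values <- c(0.0, 0.9, 150)

flag <- ifelse(values < low_limit, "low_unreportable",
        ifelse(values > high_limit, "high_unreportable", NA_character_))
print(data.frame(result_value = values, flag = flag))
```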

Logic rules ensure that related test results are consistent:

data("logic_rules", package = "lab2clean")
logic_rules <- logic_rules[logic_rules$rule_id == 3, ]
logic_rules

rule_id	rule_index	rule_part	rule_part_type
3	1	2093-3	loinc_code
3	2	>(	operator
3	3	2085-9	loinc_code
3	4	+	operator
3	5	13457-7	loinc_code
3	6	)	operator

Patient 10000003 has both logic_flag and duplicate flags. The duplicate flag means that this patient has duplicate rows, whereas the logic_flag should be interpreted as follows. For the loinc_code “2093-3”, which is total cholesterol, the rule requires “2093-3” > “2085-9” + “13457-7”, or equivalently total cholesterol > HDL cholesterol + LDL cholesterol (from the logic rules table). For patient 10000003, LDL (“13457-7”) equals 100.0, HDL (“2085-9”) equals 130.0, and total cholesterol (“2093-3”) equals 230.0. The rule is therefore not satisfied, because 230 > 100 + 130, i.e. 230 > 230, is false, and a logic flag is created.
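The arithmetic behind this logic flag can be checked directly. The snippet below evaluates rule 3 for the values of patient 10000003; the variable names are illustrative and not taken from the package.

```r
# Values for patient 10000003 on 2023-10-09, from the table above
total_chol <- 230  # LOINC 2093-3
hdl        <- 130  # LOINC 2085-9
ldl        <- 100  # LOINC 13457-7

# Rule 3: total cholesterol must be strictly greater than HDL + LDL
rule_holds <- total_chol > (hdl + ldl)
logic_flag <- !rule_holds
print(logic_flag)
```

Because the comparison is strict, a total exactly equal to the sum of its parts violates the rule.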

5. Function 3: Clean and Standardize units of measurement

The standardize_lab_unit() function has four arguments:

lab_data : A dataset containing laboratory data
raw_unit : The column in lab_data that contains raw units to be cleaned.
report : A report is written in the console. Defaults to “TRUE”.
n_records : In case you are loading a grouped list of distinct units, then you can assign the n_records to the column that contains the frequency of each distinct unit. Defaults to NA

Let us check how our package standardizes the units of measurement using the standardize_lab_unit() function. Let us consider the Function_3_dummy data, which contains 32 rows, and inspect its first 6 rows:

data("Function_3_dummy", package = "lab2clean")
head(Function_3_dummy, 6)

unit_raw	n_records	note
mg / dl	42	spaces+case
Mg/Dl	15	case
mcg/L	30	mcg alias
µg/L	27	micro sign U+00B5
<U+00B5>g/L	10	textual unicode
μg/L	11	greek mu U+03BC

This demonstration dataset contains three columns: unit_raw, n_records, and note. The unit_raw column holds raw laboratory units as reported in the database, n_records indicates how often each unit appeared, and note describes the different cases handled by our function.

standardized_units <- standardize_lab_unit(Function_3_dummy, raw_unit = "unit_raw", n_records = "n_records")

#> Step 1: Preprocessing unit srings
#> ==========================================================================================
#> ⚠ 1 distinct unit strings (5.155% of the total result records) with no units after pre-processing -> cleaning comment added No unit).
#> Step 2: Lookup in commom units database
#> ==========================================================================================
#> ✔ 19 distinct units (61.856% of the total result records) were matched to ucum codes.
#> Step 3: Check Syntax Integrity of units with no UCUM match
#> ==========================================================================================
#> ⚠ 0 distinct unit strings (0% of the total result records) with not valid syntax -> detailed cleaning comments added not_valid - reason).
#> Step 4: Parsing of units which passesd checks (tokenize and classify)
#> ==========================================================================================
#> ⚠ 2 distinct unit strings (1.804% of the total result records) with unrecognisable text -> cleaning comment added not_valid - unrecognisable text).
#> Step 5: Restructuring of parsed units (apply correction rules & final validation)
#> ==========================================================================================
#> ⚠ 0 distinct unit strings (0% of the total result records) with space characters (not corrected) -> cleaning comment added not_valid - space characters).
#> ✔ 10 distinct units (31.186% of the total result records) were transformed and validated to ucum codes.
#> ==========================================================================================
#> ✔ 29 distinct results (93.041% of the total result records) were standardized to UCUM.
#> ⏰ Time taken is 0 min, 0.1 sec
#>

This function creates two new columns:

head(standardized_units, 10)

	unit_raw	n_records	note	cleaning_comments	ucum_code
1		20	empty	No unit	NA
2	%	22	percent only	NA	%
4	10^3/uL	7	caret exponent	NA	10*3/uL
5	10exp3/uL	8	10exp exponent	NA	10*3/uL
6	10E3/uL	9	10E exponent	NA	10*3/uL
7	10*3/ul	16	lowercase uL + star	NA	10*3/uL
9	U.I./mL	8	U.I. form	NA	[IU]/mL
10	I.U./mL	8	I.U. form	NA	[IU]/mL
14	IU/L	18	plain IU	NA	[IU]/L
16	meq/l	13	meq alias	NA	meq/L

1- ucum_code: Cleaned and standardized units according to UCUM syntax.

2- cleaning_comments: Comments about the cleaning process for each unit.
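One of the transformations visible in the table above is the rewriting of exponent notations into the UCUM form 10*3. The sketch below reproduces only that single step with a regular expression; normalize_exponent() is a hypothetical helper, and the package's actual pipeline (lookup, parsing, validation) is considerably more involved.

```r
# Hypothetical helper: rewrite "10^3", "10exp3" and "10E3" prefixes as "10*3"
normalize_exponent <- function(unit) {
  gsub("^10(\\^|exp|E)([0-9]+)", "10*\\2", unit)
}

raw_units  <- c("10^3/uL", "10exp3/uL", "10E3/uL")
ucum_units <- normalize_exponent(raw_units)
print(data.frame(unit_raw = raw_units, ucum_code = ucum_units))
```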

6. Function 4: Harmonize results to reference units

The harmonize_lab_unit() function has six arguments:

lab_data : A data frame containing laboratory data
result_value : The column in lab_data with quantitative result values to be harmonized
result_unit : The column in lab_data with result units in a UCUM-valid format
loinc_code : The column in lab_data indicating the LOINC code of the laboratory test
preferred_unit_system : A string representing the preference of the user for the unit system used for standardization. Defaults to “SI”, the other option is “Conventional”.
report : A report is written in the console. Defaults to “TRUE”.

Let us demonstrate the harmonize_lab_unit() function using Function_4_dummy and inspect the first six rows:

data("Function_4_dummy", package = "lab2clean")
head(Function_4_dummy,6)

loinc_code	result_value	result_unit
26444-0	0.00	x1000/<U+00B5>L
14334-7	0.90	meq/L
785-6	30.00	pg
2160-0	0.81	mg/dL
14679-5	13.10	<U+00B5>g/dL
1963-8	27.00	mmol/l31 mOsm/l

This dataset -for demonstration purposes- contains three columns: loinc_code, result_value and the result_unit.

harmonized_units <- harmonize_lab_unit(Function_4_dummy,
                                       loinc_code="loinc_code",
                                       result_value="result_value",
                                       result_unit="result_unit")

#> Step 1: Extracting unit parameters (dimension & magnitude)
#> Step 2: Setting reference unit (LOINC-UCUM mapping)
#> Step 3: Check compatibility between reported unit and reference unit
#> Step 4: Executing regular conversion
#> Step 5: Executing mass<>molar conversion
#> Step 6: Checking LOINC codes
#> ===============================================================================================
#> Reporting Results:
#> ✔ 37.5% of the lab data records were harmonized to reference units.
#> ✔ 4 records had reported units same as reference units -> result value not converted.
#> ✔ 2 records had different reported units, but equivalent to reference units -> result value not converted.
#> ✔ 5 records harmonized to reference units by regular conversions.
#> ✔ 7 records harmonized to reference units by mass to mole conversions.
#> ✔ 0 records harmonized to reference units by mole to mass conversions.
#> ===============================================================================================
#> ⚠ 62.5% of the lab data records could NOT harmonized to reference units.
#> ⚠ 5 records not harmonized: no conversion between units of different dimensions.
#> ⚠ 1 records not harmonized: no conversion between arbitrary units and non-arbitrary units.
#> ⚠ 0 records not harmonized: no conversion between different arbitrary units.
#> ⚠ 4 records not harmonized: no reference unit available for the given loinc codes.
#> ⚠ 5 records not harmonized: no molecular weight available for the given analytes (loinc codes).
#> ⚠ 15 records not harmonized: reported units are not ucum-valid.
#> ⚠ 0 records not harmonized: reported units require special conversion.
#> ⚠ 0 records not harmonized: result values are not numeric.
#> ⚠ 0 records not harmonized: no reported units.
#> ===============================================================================================
#> ⚠ 10 records with updated loinc code to match the harmonized unit system.
#> ⏰ Time taken is 0 min, 0.1 sec
#>

This function creates six different columns:

head(harmonized_units, 6)

loinc_code	result_value	result_unit	new_loinc_code	new_value	harmonized_unit	OMOP_concept_id	property_group_id	cleaning_comments
14334-7	0.90	meq/L	3719-2	0.90000	mmol/L	8753	NA	harmonized: different_unit_same_value, loinc_unit_mismatch
785-6	29.00	pg	785-6	29.00000	pg	8564	NA	harmonized: source = reference unit
785-6	30.00	pg	785-6	30.00000	pg	8564	NA	harmonized: source = reference unit
2160-0	0.70	mg/dL	14682-9	61.88123	umol/L	8749	LG100-4/LG2923-3	harmonized: mass_to_mole_conversion
2160-0	0.82	mg/dL	14682-9	72.48944	umol/L	8749	LG100-4/LG2923-3	harmonized: mass_to_mole_conversion
2160-0	0.81	mg/dL	14682-9	71.60542	umol/L	8749	LG100-4/LG2923-3	harmonized: mass_to_mole_conversion

1- harmonized_unit: Harmonized units according to the preferred unit system.

2- OMOP_concept_id: The concept id of the harmonized unit, necessary for databases standardized to the OMOP Common Data Model.

3- new_value: The result value after the conversion.

4- new_loinc_code: The unit conversion can lead to a LOINC code that differs from the reported one in two cases:
* The reported unit did not match the property of the given LOINC code, for example “mmol/L” with a LOINC code of mass concentration property. In this case “loinc_unit_mismatch” is added to the cleaning comments.
* A mass<>molar conversion was executed.

harmonized_units[which(harmonized_units$loinc_code != harmonized_units$new_loinc_code), ]

	loinc_code	result_value	result_unit	new_loinc_code	new_value	harmonized_unit	OMOP_concept_id	property_group_id	cleaning_comments
1	14334-7	0.900	meq/L	3719-2	0.90000	mmol/L	8753	NA	harmonized: different_unit_same_value, loinc_unit_mismatch
4	2160-0	0.700	mg/dL	14682-9	61.88123	umol/L	8749	LG100-4/LG2923-3	harmonized: mass_to_mole_conversion
5	2160-0	0.820	mg/dL	14682-9	72.48944	umol/L	8749	LG100-4/LG2923-3	harmonized: mass_to_mole_conversion
6	2160-0	0.810	mg/dL	14682-9	71.60542	umol/L	8749	LG100-4/LG2923-3	harmonized: mass_to_mole_conversion
9	2991-8	0.001	nmol/L	14914-6	1.00000	pmol/L	8729	LG100-4/LG2970-4	harmonized: regular_conversion, loinc_unit_mismatch
10	2991-8	0.005	nmol/L	14914-6	5.00000	pmol/L	8729	LG100-4/LG2970-4	harmonized: regular_conversion, loinc_unit_mismatch
11	2991-8	0.003	nmol/L	14914-6	3.00000	pmol/L	8729	LG100-4/LG2970-4	harmonized: regular_conversion, loinc_unit_mismatch
16	53049-3	88.000	mg/dL	40193-5	4.88455	mmol/L	8753	NA	harmonized: mass_to_mole_conversion

5- property_group_id: The code of the LOINC group (parent group ID / group ID).

6- cleaning_comments: Comments about the harmonization and conversion process for each lab result, with two main cases:
* Success: harmonized, either with the same value or with a converted new value:
  - No conversion when the source and reference units are identical or equivalent.
  - Conversion, with the method specified as either regular or mass<>molar conversion.
* Failure: not harmonized, with a detailed reason for each failure case.

levels(factor(harmonized_units$cleaning_comments))

#>  [1] "harmonized: different_unit_same_value"
#>  [2] "harmonized: different_unit_same_value, loinc_unit_mismatch"
#>  [3] "harmonized: mass_to_mole_conversion"
#>  [4] "harmonized: mass_to_mole_conversion, loinc_unit_mismatch"
#>  [5] "harmonized: regular_conversion"
#>  [6] "harmonized: regular_conversion, loinc_unit_mismatch"
#>  [7] "harmonized: source = reference unit"
#>  [8] "not_harmonized: Non UCUM unit"
#>  [9] "not_harmonized: between arbitrary units and non-arbitrary units"
#> [10] "not_harmonized: different dimensions"
#> [11] "not_harmonized: different dimensions, loinc_unit_mismatch"
#> [12] "not_harmonized: no molecular weight available"
#> [13] "not_harmonized: no_reference_unit_available"
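Because every comment follows the pattern "status: detail", the column can be split for quick summaries. The following base R sketch works on a small hand-made vector of comments copied from the levels listed above. The variable names (comments, status, detail, loinc_mismatch) are illustrative and not part of the package.

```r
# A few comment strings copied from the levels above; the vector is made up
# for illustration and does not come from a real harmonization run.
comments <- c(
  "harmonized: mass_to_mole_conversion",
  "harmonized: regular_conversion, loinc_unit_mismatch",
  "harmonized: source = reference unit",
  "not_harmonized: different dimensions",
  "not_harmonized: Non UCUM unit"
)

# Status is everything before the first colon
status <- sub(":.*$", "", comments)

# Detail is everything after the first colon and the spaces that follow it
detail <- sub("^[^:]*: *", "", comments)

# Flag rows where the reported unit did not match the LOINC property
loinc_mismatch <- grepl("loinc_unit_mismatch", comments, fixed = TRUE)

comment_summary <- data.frame(status, detail, loinc_mismatch,
                              stringsAsFactors = FALSE)

# Count harmonized versus not harmonized results
table(comment_summary$status)
```

On a real result, the same expressions could be applied to harmonized_units$cleaning_comments to see how many rows failed and why.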

The harmonize_lab_unit() function has an argument named preferred_unit_system.

preferred_unit_system: Depending on the user’s preference, the reference units can be switched from SI units (usually molar concentration) to the conventional units commonly used in practice (usually mass concentration) through mass<>molar conversions. For LOINC codes that have no mass<>molar equivalent, the conventional and SI units are considered the same. For some LOINC codes, the molar concentration is the one used conventionally. The following examples show how the output differs between the two settings of preferred_unit_system:

Function_4_dummy_subset <- Function_4_dummy[c(27, 15, 38, 45),, drop = FALSE]
harmonized_units <- harmonize_lab_unit(Function_4_dummy_subset,
                                       loinc_code="loinc_code",
                                       result_value="result_value",
                                       result_unit="result_unit",
                                       report = FALSE,
                                       preferred_unit_system = "SI")

#> ⏰ Time taken is 0 min, 0.1 sec
#>

harmonized_units

loinc_code	result_value	result_unit	new_loinc_code	new_value	harmonized_unit	OMOP_concept_id	property_group_id	cleaning_comments
2160-0	0.820	mg/dL	14682-9	72.48944	umol/L	8749	LG100-4/LG2923-3	harmonized: mass_to_mole_conversion
2991-8	0.001	nmol/L	14914-6	1.00000	pmol/L	8729	LG100-4/LG2970-4	harmonized: regular_conversion, loinc_unit_mismatch
2951-2	142.000	mmol/L	2951-2	142.00000	mmol/L	8753	LG100-4/LG11363-5	harmonized: source = reference unit
786-4	32.000	g/dL	786-4	320.00000	g/L	8636	NA	harmonized: regular_conversion

harmonized_units <- harmonize_lab_unit(Function_4_dummy_subset,
                                       loinc_code="loinc_code",
                                       result_value="result_value",
                                       result_unit="result_unit", 
                                       report = FALSE,
                                       preferred_unit_system = "conventional")

#> ⏰ Time taken is 0 min, 0.1 sec
#>

harmonized_units

loinc_code	result_value	result_unit	new_loinc_code	new_value	harmonized_unit	OMOP_concept_id	property_group_id	cleaning_comments
2160-0	0.820	mg/dL	2160-0	8.2000000	mg/L	8751	LG100-4/LG6657-3	harmonized: regular_conversion
2991-8	0.001	nmol/L	2991-8	0.2883998	ng/L	8725	LG100-4/LG11447-6	harmonized: mole_to_mass_conversion, loinc_unit_mismatch
2951-2	142.000	mmol/L	2951-2	142.0000000	mmol/L	8753	LG100-4/LG11363-5	harmonized: source = reference unit
786-4	32.000	g/dL	786-4	320.0000000	g/L	8636	NA	harmonized: regular_conversion
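The two runs differ only for rows whose reference unit depends on the unit system. For the 0.001 nmol/L row, the two routes can be reproduced by hand in base R. This is a minimal sketch: the molecular weight of 288.4 g/mol is an assumption back-calculated from the table (0.2883998 / 0.001) rather than read from the package, and the variable names are illustrative.

```r
# Reported value for LOINC 2991-8 in the subset above
value_nmol_l <- 0.001

# SI route: regular conversion within molar units, nmol/L -> pmol/L
value_pmol_l <- value_nmol_l * 1000

# Conventional route: mole-to-mass conversion.
# nmol/L multiplied by g/mol gives ng/L, so no extra scaling factor is needed.
assumed_mw <- 288.4   # assumption, back-calculated from the table
value_ng_l <- value_nmol_l * assumed_mw

# The 0.82 mg/dL row needs only a regular conversion in the conventional
# system, mg/dL -> mg/L
value_mg_l <- 0.82 * 10

c(SI_pmol_per_L = value_pmol_l,
  conventional_ng_per_L = value_ng_l,
  conventional_mg_per_L = value_mg_l)
```

Rows such as 2951-2 and 786-4 are unchanged between the two runs because their reference unit is the same in both systems.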

6. Customization

We fully acknowledge the importance of customization to accommodate diverse user needs and tailor the functions to specific datasets. To this end, the data in common_words, logic_rules, reportable_interval, RWD_units_to_UCUM_V2, annotable_strings, and loinc_reference_unit_v1 are not hard-coded within the function scripts but are instead provided as separate data files in the “data” folder of the package. This approach allows users to benefit from the default data we have included, which reflects our best knowledge, while also providing the flexibility to append or modify the data as needed.

For example, users can customize the common_words RData file by adding phrases that are used across different languages and laboratory settings. This allows the clean_lab_result() function to better accommodate the specific linguistic and contextual nuances of their datasets. Similarly, users can adjust the logic_rules and reportable_interval data files used by the validate_lab_result() function to reflect the requirements or standards of their research or clinical environment. Users can also extend the RWD_units_to_UCUM_V2 data file by adding locally used units or strings (especially those with non-English letters or abbreviations) together with their UCUM-valid equivalents, which customizes the output of the standardize_lab_unit() function. Likewise, the annotable_strings data file can be extended by adding non-English strings for analytes that are used locally in units. Finally, the harmonize_lab_unit() function can be customized by adding reference units for LOINC codes that are not covered in loinc_reference_unit_v1, or even by editing the reference units of existing LOINC codes (though this is not recommended).
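The general pattern of extending a lookup table can be sketched in base R as follows. This is only an illustration: unit_lookup is a hypothetical stand-in, and its column names raw_unit and ucum_unit are assumptions. The actual structure of RWD_units_to_UCUM_V2 should be inspected (for example with str()) before rows are appended to it.

```r
# Hypothetical stand-in for a unit lookup table. The real
# RWD_units_to_UCUM_V2 object may use different column names.
unit_lookup <- data.frame(
  raw_unit  = c("mg/dl", "mmol/l"),
  ucum_unit = c("mg/dL", "mmol/L"),
  stringsAsFactors = FALSE
)

# Locally used unit strings with their UCUM-valid equivalents (illustrative)
local_units <- data.frame(
  raw_unit  = c("mg%", "G/l"),
  ucum_unit = c("mg/dL", "10*9/L"),
  stringsAsFactors = FALSE
)

# Append the local entries, keeping the first mapping if a raw unit repeats
unit_lookup_custom <- rbind(unit_lookup, local_units)
unit_lookup_custom <- unit_lookup_custom[!duplicated(unit_lookup_custom$raw_unit), ]

unit_lookup_custom
```

Keeping the default rows first and dropping later duplicates means that local additions never silently override a mapping shipped with the package. Reversing the order in rbind() would give local entries priority instead.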

By providing these customizable data files, we aim to ensure that the lab2clean package is not only powerful but also adaptable to the varied needs of the research and clinical communities.

Automatically Cleaning Laboratory Results in R using the ‘lab2clean’ package

Ahmed Zayed, Ilias Sarikakis, Arne Janssens, Pavlos Mamouris
