# Multi-Document Statistics

As stated in an earlier question about sortby, I am building a repository of (anonymous) information about surgical procedures performed at a large research hospital over a period of several years. This data is used in support of statistical studies of clinical questions.

In a typical study, we first identify a patient population that satisfies the study criteria (which often are very specific) and then extract selected data from those patients’ records for statistical analysis. So far I have not been able to find a simple way using Quip to first identify a group of patients and then do simple calculations “across” the group. I would appreciate any suggestions the user community might make to address the following example.

In the general form of this example, we want to know the averave nunmber of red blood cell units given during or after surgery (, ) to patients in each of three groups who underwent a particular surgical procedure in each of four years (1990-1993). One XML document contains the complete relevant record for one patient and may contain reports of several surgeries and other procedures and tests. If a patient did not receive red blood cell units, the and elements will be missing, even though these patients need to be counted when calculating the average.

In simplified form, the following query correctly identifies one group of patients and operations for the year 1990:

{
for \$b in collection(“Demo500”)/PatientRecord
[.//Surgery[./@SurgeryDate > “1989-12-31” and ./@SurgeryDate < “1994- 01-01”]/RevascularizationProcedure],
\$c in \$b//Surgery[./@SurgeryDate > “1989-12-31” and ./@SurgeryDate < “1994-01-01”][./RevascularizationProcedure]
where empty(\$c//ValveRepair) and
empty(\$c//ValveReplacement) and
empty . . .
return
(
for \$Pri90 in \$c[substring(string-value(./@SurgeryDate), 1, 4) = ‘1990’ and
substring(string-value(./HistoryOfCardiacOperations), 1, 2) = ‘No’]
where not(contains(string-value(\$Pri90/EmergencySurgeryIndication), ‘Emergent’))
return

{
(
A-Elective Primary CABG,
{string-value(\$b/@PatientID)},
{string-value(\$Pri90/@SurgeryDate)},
\$Pri90//RBCUnitsIntra,
\$Pri90//RBCUnitsPost
)
}

)
}

At this point, \$Pri90 contains a forest of surgery nodes, each of which contains details about one operation of interest. I can easily extract information (e.g., ) about individual operations, but so far have not been able to calculate statistics (i.e., average ) for the entire group.

Any suggestions? Would the ‘concatenate’ function when implemented allow me to combine these nodes into a sequence that could be processed with a Let statement?

Thanks again.

Bill

XQuery provides functions to calculate the sum
or the average value of a list of numbers, e.g.
try sum(1 to 10).

So for your example you should some up all the
blood values you find in the patient records
and divide this by the size (use function 'count’
of all relevant patient records) (you could try to
use the avg function, but since there are records with no entry this might get not exactly the result you want.)

Take care that the function ‘sum’ is realy invoked
on a list of numbers and not on a list of string
values or even nodes. You will probably need to
use a constructor, which converts some string into
a number.

Tiny example:
let \$xs :=

s
387.4

e

p
387.4

let \$sum := sum(for \$x in \$xs/patient/va/text() return float(\$x))
return \$sum div count(\$xs/patient)

Sven Eric

Thanks for responding to both of my questions. I will try to apply your strategy to the “average RBC units” calculation.

Bill

I have been experimenting with your “Tiny Example” and would like to clarify a few things. Sometimes the simplest cases are the most enlightening, or maybe just the “dumbest.” So at the risk of asking some dumb questions, here goes.

1.) My original question was not about the sum, avg, and count functions but rather about how to get my data into a form where I could use them. Recall, I can construct a variable \$Pri90 which contains a forest of nodes, each of which contains some of the data for which I want to generate statistics.

It’s as if, insteasd of beginning your Tiny example with “let \$xs := …” you had said:

for \$xs in

s
387.4

e

p
378.4
200.2

Now how do I calculate the average value of all the va’s? In other words, in the language of the W3C Xquery 1.0 (2.9) how can I duplicate the effect of the let statement to convert the tuple of variable bindings (which I have) into a variable which is bound directly to the entire expression (which sum and other functions could process correctly)? Can’t just say let \$c := \$xs because \$c will continue to be a collection of independent values.

2.) When I tried to actually run your Tiny example, along with an assortment of “missing parentheses” and “missing return clause” messages, I also got one that said “Unknown function - float”. You were quite right, I do need floating point to do what I need to do, so that could become a problem for us after awhile.

3.) I also have a have a simple scope problem.
The following code works fine:

{
let \$xs :=

s
387

e

p
200

return
{\$xs/patient/va}
}

producing:

387
200

If I replace

return
{\$xs/patient/va}
}

with

return
count(\$xs/patient/va)
}

I also get a correct result

2.

But if I try to get both results using

return
\$xs/patient/va,
count(\$xs/patient/va)
}

or

return
count(\$xs/patient/va),
\$xs/patient/va
}

(with assorted arrangements of commas and curly brackets), I get the error message

Error: (20,3) Unknown variable xs.

What is wrong?

Thanks.

Bill

question, about the first question I need to read
closer in order to understand, what exactly your problem is:

2.) sorry that the ‘float’ constructor function is
not yet in the released version of quip. The next
version, which will be released around next
monday, will contain that function. Furthermore I
am sorry that my example did not work directly.
Actually I tested it before posting it, but
probably copy/paste was not performed correctly by me and some parts got lost

3.) As you might have allready guessed, you have a
grouping problem here. Simply compare:
for \$x in (1,2)
return ,

and
for \$x in (1,2)
return (
, )

I gues now you see what was your problem. If you
do not put the two results you want to return in
parantheses, then only the first one belongs to
the reurn clause, i.e.
for \$x in (1,2)
return
,

is parsed as:

(for \$x in (1,2)
return

, )

Sven Eric

I am sorry, I am not sure that I grab your first
question. What is the problem? If you have a
document of all your patient data:

a{6}
b
c{1}
d{8}
e{4}

you can bind this to a variable:

let \$patients :=

a{6}
b
c{1}
d{8}
e{4}

return \$patients

You can bind the number of patients in a varaible:

let \$patients :=

a{6}
b
c{1}
d{8}
e{4}

let \$numberOfPatients := count(\$patients/patient)
return \$numberOfPatients

or you can sum up values in the patients records:

let \$patients :=
let \$patients :=

a{6}
b
c{1}
d{8}
e{4}

let \$numberOfPatients := count(\$patients/patient)
let \$numberOfPatients
:= sum(\$patients/patient/val/node())
return \$numberOfPatients

and eventually you can iterate over patient and
filter them:

let \$patients :=

a{6}
b
c{1}
d{8}
e{4}

let \$numberOfPatients := count(\$patients/patient)
let \$sumOfValues
:= sum(\$patients/patient/val/node())
let \$avg := \$sumOfValues div \$numberOfPatients
for \$patient in \$patients/patient
where empty(\$patient/val/node()) or \$patient/val/node() < \$avg
return {\$avg,\$patient}

Does this in some respect cover your question?

Sven Eric

My original concern – possibly misguided – with using ‘let’ the way you suggest was a feeling that with a large and complex data base it might lead to unneccessarily long response times or overflow conditions somewhere, or both.

The operational system we have in mind currently contains about 172,000 anonymous patient records with an average size in XML of more than 13 kb, based on a sample of 500 records, and grows at the rate of 25 records per day, seven days a week. I have been trying to demonstrate the use of QUIP for nine samples selected from our files of more that 200 actual searchs conducted during the last two years.

The problem at hand is the following:

Find the total number of elective and emergency primary and re-op isolated coronary- artery-bypass-graft (CABG) patients between 1990 and 1993, and the mean red blood cell units (RBC’s) used per patient each year. (Primary means first open-heart surgery, re-op means any open-heart surgery after the first one, isolated means no other procedure performed at the same time as the bypass.) Data are to be presented in three groups:

1) Elective (anyone that is not an emergency) Primary Isolated CABG
2) Reoperations (elective or emergent) Isolated CABG
3) Emergent Primary CABG

Combine the Intra and Post-Op RBC’s and list the mean units per patient for each year. (The Intra and Post-op units are stored in the XML data model as separate sub-elements under the complex element , which is located at the 6th level in the hierarchy. I expect to flatten this out somewhat in the next prototype design.).

At a detail level we are completing the following table.

—Average RBC’s per patient–
Year 1990 1991 1992 1993

Group 1 x x x x
Group 2 x x x x
Group 3 x x x x

In addition to the detailed data requested we are usually also expected to provide a set of base data about each of the patients in the selected group. This includes items like birth date, ***, prior diagnoses for certain conditions, prior blood test results, status as a smoker, etc. It is therefore unacceptable to retrieve detailed data without the ability to retrieve other associated patient-specific information.

My approach to this problem was to begin with:

for \$b in collection (“Demo500”)/PatientRecord[./EpisodeOfCare/TherapeuticProcedures/Surgery[./@SurgeryDate > “1989-12-31” and ./@Surgery < “1994-01-01”]/RevascularizationProcedure],

\$c in \$b/EpisodeOfCare/Management/TherapeuticProcedures/Surgery[./@SurgeryDate > “1989-12-31” and ./@SurgeryDate < “1994-01-01”][./RevascularizationProcedure]

(\$b contains complete patient records, \$c contains only the surgical details for those operations which included revascularization (CABG).)

I would retrieve the same set of patients and the same set of operations if I said:

let \$b := collection(“Demo500”)/PatientRecord???.. ,
\$c := \$b/Episode???..

Next, to identify the operations that were isolated CABG, I started adding conditions in a where clause. For example,

for \$b in collection (“Demo500”)/PatientRecord??? ,
\$c in \$b/Episode???.
where empty(\$c//ValveReplacement)

continues to produce a correct (but shorter) list of patients and procedures, while

let \$b := collection (“Demo500”)/PatientRecord??? ,
\$c := \$b/Episode???.
where empty(\$c//ValveReplacement)

does not produce any output at all.

To complete the query I still have to specify six more empty conditions, divide the results into twelve sets (three groups x four years), and calculate the required averages for each set, including the fact that any given operation may have both intra-op and post-op numbers that have to be taken into the same average.

Bill