Oct 4, 2009

Fusion Quality and Statistics

Have you heard about Fusion Data Analysis?


Fusion techniques have been tried out in many European countries for several decades. In many cases, data fusion is a very efficient way to recover information. The procedure has three simple steps:

1. Extract a sub-sample from the large respondent data set and split it into two halves
2. Use these two sub-samples to produce a third, fused sample
3. Run your calculations on this fused data

Then we can compare the results obtained from the fused data set with those from the observed data set.
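To make the three steps and the comparison concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the respondent data is simulated, the variable names (`age`, `sex`, `reads_paper`) are hypothetical, and the fusion itself is a plain nearest-donor match standing in for the real algorithm, not the Procustrean Fusion Algorithm described below.

```python
import random

def nearest_donor_fusion(donors, recipients, common, extra, rng):
    """Stand-in fusion: each recipient receives `extra` from one donor
    chosen at random among those matching it on the most common variables."""
    fused = []
    for r in recipients:
        scores = [sum(d[c] == r[c] for c in common) for d in donors]
        best = max(scores)
        donor = rng.choice([d for d, s in zip(donors, scores) if s == best])
        row = {c: r[c] for c in common}
        row[extra] = donor[extra]  # value ascribed from the single linked donor
        fused.append(row)
    return fused

def share(rows, key):
    """Proportion of rows where `key` is True."""
    return sum(1 for row in rows if row[key]) / len(rows)

rng = random.Random(0)

# Simulated respondents (hypothetical data): age and sex are the common
# variables, reads_paper is the additional variable to be ascribed.
respondents = []
for _ in range(400):
    age = rng.choice(["young", "old"])
    sex = rng.choice(["f", "m"])
    p = 0.7 if age == "old" else 0.3
    respondents.append({"age": age, "sex": sex, "reads_paper": rng.random() < p})

# Step 1: split the sub-sample into two halves.
donors, recipients = respondents[:200], respondents[200:]

# Step 2: fuse the two halves into a third, fused sample.
fused = nearest_donor_fusion(donors, recipients, ["age", "sex"], "reads_paper", rng)

# Step 3: run the same calculation on fused and observed data, then compare.
observed_share = share(recipients, "reads_paper")
fused_share = share(fused, "reads_paper")
print(f"observed: {observed_share:.3f}  fused: {fused_share:.3f}")
```

Because the recipients' true answers are known here, the gap between the two shares gives a rough idea of how much information the fusion recovers.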

The fusion algorithm used is the latest available and includes a new distance measure for evaluating the similarity between donors and recipients.

In the Procustrean Fusion Algorithm, sample 1 is called the donor sample and sample 2 is called the recipient sample.

The Procustrean Fusion Algorithm (PFA) obeys five principles:

1- Each recipient should receive data from a single donor.

2- The data collected for a donor is transferred as a whole to the linked recipients.

3- Any donor already linked should be strongly discouraged from producing further links.

4- The cross-distributions between common and additional variables should be preserved unchanged by the ascription process.

5- The similarity between two respondents should be evaluated globally.

The rationale underlying the first two principles is to avoid breaking the inter-correlations between the additional variables during the ascription process.

The third principle protects against a decrease of the effective sample size.
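Here is a small Python sketch of why reusing donors hurts. It is my own illustration, not the actual PFA: the greedy matching, the `penalty` parameter and the toy one-dimensional distance are all assumptions, and the effective sample size is computed with Kish's approximation, treating the number of times each donor is used as its weight.

```python
from collections import Counter

def penalized_match(donors, recipients, distance, penalty):
    """Greedy sketch: link each recipient to exactly one donor, adding
    `penalty` to a donor's cost for every link it already has."""
    uses = Counter()
    links = []
    for r in recipients:
        best = min(
            range(len(donors)),
            key=lambda i: distance(donors[i], r) + penalty * uses[i],
        )
        uses[best] += 1
        links.append(best)
    return links

def effective_sample_size(links):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights),
    with each donor's number of uses as its weight."""
    counts = Counter(links).values()
    return sum(counts) ** 2 / sum(c * c for c in counts)

# Toy data: respondents reduced to a single number (hypothetical).
donors = [0.0, 1.0, 2.0, 3.0]
recipients = [0.1, 0.2, 0.3, 0.4]

def dist(d, r):
    return abs(d - r)

free = penalized_match(donors, recipients, dist, penalty=0.0)
spread = penalized_match(donors, recipients, dist, penalty=2.0)

print(free, effective_sample_size(free))      # every recipient uses donor 0
print(spread, effective_sample_size(spread))  # links spread over three donors
```

Without a penalty all four recipients copy the same donor, so the fused file carries the information of one respondent only. With the penalty the links are spread over three donors and the effective sample size rises accordingly. A greedy pass like this depends on the order of the recipients, so take it as a picture of the idea rather than a recipe.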

The fourth principle refers to one of the basic requirements for a good fusion.

The last principle is somewhat more subtle. The idea behind it is that two donors considered close (i.e. similar) need to be so not only on the common variables but also on the additional variables; otherwise the fusion process could distort the dependencies existing among these variables.

Altogether, the five principles above help protect the fused database against distortion of the relationships existing between the variables.

This article may be useful when you manage any project (Six Sigma or otherwise) that requires analysis of a huge database.
Let me know if you have any questions.

Regards,
Tina