*--------------------------------*
* BIAS-Analysis
*--------------------------------*

/* 

•	Use the 5 characteristics that most strongly predict non-participation (characteristics are always taken from wave 13)
•	Exclude all cases from the gross sample and the realized sample that were not part of the experiment (e.g., temporary suspensions)
First, calculate for the total sample:

(1) What is the proportion with this characteristic X in the gross sample? 
(2) What is the proportion with this characteristic X in the realized sample? 
(3) The difference of (2) - (1) is the nonresponse bias of characteristic X.
Now, repeat the same three steps for all non-incentivized cases from the low-propensity half:
(4) What is the proportion with this characteristic X in this part of the gross sample? 
(5) What is the proportion with this characteristic X in this part of the realized sample? 
(6) The difference of (5) - (4) is the nonresponse bias of characteristic X in this part of the sample.
And finally, for the incentivized cases from the low-propensity half:
(7) What is the proportion with this characteristic X in this part of the gross sample? 
(8) What is the proportion with this characteristic X in this part of the realized sample? 
(9) The difference of (8) - (7) is the nonresponse bias of characteristic X in this part of the sample.
Additionally, for completeness, we need the whole thing in the high-propensity half: 
(10) What is the proportion with this characteristic X in this part of the gross sample? 
(11) What is the proportion with this characteristic X in this part of the realized sample? 
(12) The difference of (11) - (10) is the nonresponse bias of characteristic X in this part of the sample.
(13) = ((11) * N_h_net + (5) * N_l_net + ((7) + (6)) * N_l_i_gross * N_l_net / N_l_gross) / (N_h_net + N_l_net + N_l_i_gross * N_l_net / N_l_gross)
(14) = ((11) * N_h_net + (8) * N_l_i_net + ((4) + (9)) * N_l_gross * N_l_i_net / N_l_i_gross) / (N_h_net + N_l_i_net + N_l_gross * N_l_i_net / N_l_i_gross)
Where:
N_h_net is the number of high-propensity cases in the realized sample
N_l_gross is the number of non-incentivized low-propensity cases in the gross sample
N_l_net is the number of non-incentivized low-propensity cases in the realized sample
N_l_i_gross is the number of incentivized low-propensity cases in the gross sample
N_l_i_net is the number of incentivized low-propensity cases in the realized sample
(15) Bias in the scenario in which nobody was incentivized: (13) - (1)
(16) Bias in the scenario in which everyone was incentivized: (14) - (1)
In a table for each characteristic, it can be arranged as follows:
Column labels: 
1: Proportion/mean in gross sample wave 14 
2: Proportion/mean in realized sample wave 14 
3: Nonresponse bias in wave 14
Row labels: 
1: Total sample 
2: Only high-propensity cases 
3: Only non-incentivized low-propensity cases 
4: Only incentivized low-propensity cases 
5: Hypothetical outcome if nobody was incentivized
6: Hypothetical outcome if everyone was incentivized
Arrangement of values in the table:
(1)  (2)  (3)
(10) (11) (12)
(4)  (5)  (6)
(7)  (8)  (9)
--------------
(1)  (13) (15)
(1)  (14) (16)

*/
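
* Minimal sketch, not part of the original analysis: a helper for the
* hypothetical proportions (13) and (14) described above. The program name
* bias_hypo, its argument names and r(p_hyp) are illustrative assumptions.
* "Applied arm" is the low-propensity arm whose response behaviour is applied
* to everyone. Arguments, in order:
*   p_h_net      (11) proportion among realized high-propensity cases
*   p_obs_net    proportion in the realized sample of the applied arm: (5) or (8)
*   p_cf_gross   proportion in the gross sample of the other arm: (7) or (4)
*   bias_obs     nonresponse bias of the applied arm: (6) or (9)
*   n_h_net      N_h_net
*   n_obs_net    realized cases in the applied arm
*   n_obs_gross  gross cases in the applied arm
*   n_cf_gross   gross cases in the other arm
capture program drop bias_hypo
program define bias_hypo, rclass
	args p_h_net p_obs_net p_cf_gross bias_obs n_h_net n_obs_net n_obs_gross n_cf_gross
	// expected realized cases in the other arm under the applied response rate
	tempname w
	scalar `w' = `n_cf_gross' * `n_obs_net' / `n_obs_gross'
	return scalar p_hyp = (`p_h_net' * `n_h_net' + `p_obs_net' * `n_obs_net' ///
		+ (`p_cf_gross' + `bias_obs') * `w') / (`n_h_net' + `n_obs_net' + `w')
end
* Usage with made-up numbers for (13): bias_hypo 0.30 0.25 0.28 -0.02 500 100 400 400
* followed by: display r(p_hyp)    (201/700, about 0.287)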

/* List of attributes

The relevance of attributes in the training process can be determined in two ways; the results differ only slightly

List 1
kon_nrh       100.000
palter         59.561 *
real_wave      29.679
HEK1200        26.986 *
prop_wave      26.596
HEK0600        22.057
PMI0100        13.569 *
interesse      13.340 *
depind         11.187
dauerHH        10.161
dauerP          9.859
PA0200          9.798 *
HW03003         9.610
beruf2          8.530 *
HA0100          8.235
PA0300          8.079
PA0100          8.043
verstaendnis2   6.725
int_alter       6.691
ekinu181        6.228

List 2
kon_nrh        100.00
palter          74.01
dauerHH         48.81
dauerP          48.28
HEK0600         44.18
real_wave       41.26
int_alter       38.15
prop_wave       35.23
HEK1200         28.51
int_erfahrung   25.97
PSK0200         24.33 *
PA0200          22.39
depind          22.22
PA0100          21.80 *
PG0100          20.71
PA0800          17.90 *
PA0300          17.85
PA0900          17.38
PA1000          16.97
interesse2      14.57
*/




* use relevant attributes from wave 13 (from raw data used for training)
use  "$temp\data_w13.dta", clear 
	keep hnr palter HEK0600 PSK0200 HA0100 PA0300 PA0800 PA0100 PMI0100 interesse
save "$temp\welle13.dta", replace	
	
	
// load original dataset 
use "$treatment\PASS_W14_Panelstichprobe_HH_Experimentzuweisung_infas_7123_20200205.dta", clear 
		drop if missing(treatment) // drop temporary dropouts (not part of the experiment, so no treatment assigned)
save "$temp\hnr_exp.dta", replace	

// merge Wave 13 
merge 1:1 hnr using "$temp\welle13.dta", nogen keep(3)

// keep only cases for which propensities were estimated
	merge 1:1 hnr using "$temp\predictions_w13.dta", nogen keep(3)

save "$temp\aufbereitung_bias1.dta", replace	// = gross sample

* merge with contact data
merge 1:1 hnr using "$temp\03_aufber_EXP_kw$aktkw.dta", nogen keep(3)

	
// gross sample
	gen brutto = 1
// net sample
	gen netto = 0
	replace netto = 1 if interview == 1

	tab brutto netto, row


save "$temp\aufbereitung_bias2.dta", replace	
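
* Minimal sketch, not part of the original analysis: steps (1) to (3) for a
* single characteristic in the total sample. It assumes that palter is numeric
* and that its mean is the statistic of interest; the scalar names m_gross,
* m_net and nr_bias are illustrative.
quietly summarize palter if brutto == 1
scalar m_gross = r(mean)                           // (1) mean in the gross sample
quietly summarize palter if netto == 1
scalar m_net = r(mean)                             // (2) mean in the realized sample
scalar nr_bias = scalar(m_net) - scalar(m_gross)   // (3) nonresponse bias
display "gross = " scalar(m_gross) "  net = " scalar(m_net) "  bias = " scalar(nr_bias)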
