Random variables

1. Random variables

The water monitoring project at the Montreux Jazz has been running since 2016. Since then, the data has been collected and processed by a variety of people.

Objective: Standardize the nomenclature from the different sampling years. Provide a model for storing and collecting data in the future.

Purpose: Define the probability that a survey will exceed a threshold value within the period of the year defined by the survey results.
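
As a rough illustration of the quantity named in the purpose, the sketch below computes the observed share of surveys above a threshold. It assumes pandas, uses the count and coef values from the five example rows shown under "Survey data", and the threshold of 500 is a hypothetical value chosen only for the example. The function name `exceedance_fraction` is also an assumption, not part of the project code.

```python
import pandas as pd

# count * coef for the five example survey rows (colonies per 100 mL).
per_100ml = pd.Series([0.0 * 250, 22.0 * 25, 8.0 * 25, 2.0 * 100, 0.0 * 25])

# Hypothetical threshold, for illustration only.
THRESHOLD = 500.0


def exceedance_fraction(values: pd.Series, threshold: float) -> float:
    """Share of surveys whose value is strictly above the threshold."""
    return float((values > threshold).mean())


p_exceed = exceedance_fraction(per_100ml, THRESHOLD)
```

This is only the empirical frequency for a handful of rows; a probability model for the random variable would be built on the full, standardized data.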

1.1. Definitions

colony: a circular growth of individual bacteria from a water sample
colony-count: the number of colonies of the same color for a given media type
media/medium: the provided nutrients and substrates of a microbial plate or card
color: the observed color of the colony
label: the assumed category of the color:
- Bioindicator
- Coliform
- Other
coef: the correction factor applied to the colony count so that results can be reported as colonies per 100 mL of the original water sample.

The purpose of the sampling is to identify colonies that appear in the media and classify them as one of the possible labels. The label of interest is Bioindicator: it represents the bacteria that originate from the organism of interest. The organism in this case is people, so the Bioindicator bacteria come from fecal contaminants.

1.2. Methods

The process requires collaborating with the data-manager(s) from the different project years and ensuring that the data from each year can be combined and interpreted together. The data for this collaboration is stored in the componentdata folder.

The mapping from each previous label to its new label is stored in a dictionary or an array that covers the different possibilities of medium, color, label and coefficient. The new labels are then applied to a data-frame.

The finished data (the result of the collaboration) is stored in the end folder.

1.3. Sample data

The sample data is an example of the desired output per year. This includes the following parameters:

colony-count
label
location
coefficient*count
week number
day of year
is-jazz: boolean
rainfall in millimeters
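
The derived parameters in this list can be computed from the date, count and coef columns. The following is a minimal sketch, assuming pandas and ISO week numbering; the three rows are copied from the survey table in the next section, and the column name `count_per_100ml` is a hypothetical name for the coefficient*count value.

```python
import pandas as pd

# Three rows copied from the survey table; column names follow that table.
df = pd.DataFrame(
    {
        "date": ["2016-07-05", "2016-07-12", "2016-06-21"],
        "count": [0.0, 22.0, 2.0],
        "coef": [250, 25, 100],
    }
)
df["date"] = pd.to_datetime(df["date"])

# coefficient * count: colonies per 100 mL of the original sample.
# "count_per_100ml" is a hypothetical column name.
df["count_per_100ml"] = df["count"] * df["coef"]

# Week number (ISO calendar, an assumed convention) and day of year.
df["week"] = df["date"].dt.isocalendar().week.astype(int)
df["doy"] = df["date"].dt.dayofyear
```

With these assumptions the week and doy values agree with the ones shown in the survey table for the same dates.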

1.4. Survey data

The format of the survey data after collaboration:

	date	location	sample	date_sample	event	before event	after event	medium	label	count	coef	week	doy	year	color	image
0	2016-07-05	MRD	MRD1	('2016-07-05', 'MRD1')	True	False	False	EasyGel	Bioindicator	0.0	250	27	187	2016	big_blue	none
1	2016-07-12	MRD	MRD1	('2016-07-12', 'MRD1')	True	False	False	EasyGel	Bioindicator	22.0	25	28	194	2016	big_blue	none
2	2016-07-19	MRD	MRD1	('2016-07-19', 'MRD1')	False	False	True	EasyGel	Bioindicator	8.0	25	29	201	2016	big_blue	none
3	2016-06-21	MRD	MRD1	('2016-06-21', 'MRD1')	False	True	False	EasyGel	Bioindicator	2.0	100	25	173	2016	big_blue	none
4	2016-06-28	MRD	MRD1	('2016-06-28', 'MRD1')	False	True	False	EasyGel	Bioindicator	0.0	25	26	180	2016	big_blue	none

1.4.1. Current data to process

None

The data from 2017 will require quite a bit of formatting.

1.4.2. Applying labels

The colors that were used for the observations can be placed into three broad categories.

Bioindicator
Coliform
Other

The microbiologist determines the correct label for the recorded color based on the specifics of the media/medium used to grow the culture.

The colors appropriate to each label are stored in an array (a Python list). The color for each record is tested for membership in each of the arrays. If it is in one of the arrays, the label that corresponds to that array is returned. If the color is not in any array, the original value is returned. The result is added to the data-frame.

bioindicators = ["Dark Blue", "Blue", "Turquoise fast", "metallic_green", "green_met", "fluo_halo", "big_blue"]
coliforms = ["Pink", "pink", "purple", "med_blue"]
other = ["Turquoise", "Turquoise slow", "other", "mauve", "fluo_other", "green"]

def translate_colors(x, bioindicators, coliforms, other):
    if x in bioindicators:
        return "Bioindicator"
    elif x in coliforms:
        return "Coliform"
    elif x in other:
        return "Other"
    else:
        return x

stddf["label"] = stddf.color.apply(lambda x: translate_colors(x, bioindicators, coliforms, other))
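
Because unknown colors pass through unchanged, it can help to list the values that did not receive one of the three labels. The sketch below is one possible way to do that, assuming pandas: it inverts the three lists into a hypothetical `color_to_label` dictionary and applies it with `Series.map`. The color "orange" is an invented value used only to show the fall-through case.

```python
import pandas as pd

bioindicators = ["Dark Blue", "Blue", "Turquoise fast", "metallic_green",
                 "green_met", "fluo_halo", "big_blue"]
coliforms = ["Pink", "pink", "purple", "med_blue"]
other = ["Turquoise", "Turquoise slow", "other", "mauve", "fluo_other", "green"]

# Invert the three lists into one lookup table: color -> label.
# "color_to_label" is a hypothetical name, not part of the project code.
groups = {"Bioindicator": bioindicators, "Coliform": coliforms, "Other": other}
color_to_label = {c: label for label, colors in groups.items() for c in colors}

# "orange" is an invented color that is in none of the lists.
colors = pd.Series(["big_blue", "pink", "mauve", "orange"])

# Colors without an entry become NaN, then fall back to the original value.
labels = colors.map(color_to_label).fillna(colors)

# Values that still need a decision from the microbiologist.
unmapped = sorted(set(labels) - set(groups))
```

An empty `unmapped` list would indicate that every recorded color has been assigned a label.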

We do the same for the media/medium, except that we use a dictionary to store that information:

media_names = {
    "ECC-A Card":"ECC-A",
    "new ECCA":"ECC-A",
    "E-coli side": "E coli",
    "Double side E coli": "E coli",
    "ECC-side":"ECC",
    "Double side ECC":"ECC",
    "selective":"Levine",
    "media":"EasyGel",
    "plus uv":"EasyGelPlus",
    "UVplus":"EasyGelPlus",
    "non-restrictive":"LB",
    "levine": "Levine",
    "easy_gel":"EasyGel",
    "unil_kitchen":"LB",
    "micrology_card": "ECC"
}

def translate_media(x, media_names):
    # Return the standardized medium name, or x unchanged if it has no entry.
    return media_names.get(x, x)


stddf["medium"] = stddf.media.apply(lambda x: translate_media(x, media_names))

1.4.3. Labeling the date range of interest

Here are the Jazz dates for all sampling years:

2016: 2016-07-01 - 2016-07-16
2017: 2017-06-30 - 2017-07-15
2020: 2020-07-03 - 2020-07-18
2022: 2022-07-01 - 2022-07-16
2023: 2023-06-30 - 2023-07-15

before event: samples taken before the beginning of the event of interest

after event: samples taken after the end of the event
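
The three flags can be derived from the sample date and the date ranges listed above. This is a minimal sketch, assuming pandas and assuming that the start and end dates are both inclusive; the names `JAZZ_DATES` and `label_event` are hypothetical.

```python
import pandas as pd

# Start and end of the event per sampling year, as listed above.
JAZZ_DATES = {
    2016: ("2016-07-01", "2016-07-16"),
    2017: ("2017-06-30", "2017-07-15"),
    2020: ("2020-07-03", "2020-07-18"),
    2022: ("2022-07-01", "2022-07-16"),
    2023: ("2023-06-30", "2023-07-15"),
}


def label_event(dates: pd.Series) -> pd.DataFrame:
    """Return the event / before event / after event flags for each date.

    Assumes inclusive bounds. Raises KeyError for a year without an entry.
    """
    dates = pd.to_datetime(dates)
    start = dates.dt.year.map(lambda y: pd.Timestamp(JAZZ_DATES[y][0]))
    end = dates.dt.year.map(lambda y: pd.Timestamp(JAZZ_DATES[y][1]))
    return pd.DataFrame(
        {
            "event": (dates >= start) & (dates <= end),
            "before event": dates < start,
            "after event": dates > end,
        }
    )


flags = label_event(pd.Series(["2016-07-05", "2016-06-21", "2016-07-19"]))
```

For these three dates the flags agree with the corresponding rows of the survey table.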

2. Rainfall

Expected format of the rain data:

	date	mm
0	2016-06-21	4.0
1	2016-06-22	0.6
2	2016-06-23	0.9
3	2016-06-24	13.1
4	2016-06-25	9.8
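
One way to attach the rainfall to the survey records is a left join on the date, so that every survey row is kept even when no rain value exists for that day. The sketch below assumes pandas and daily rain totals; the rain rows are the five shown above, and the two survey rows are reduced to the columns needed for the example.

```python
import pandas as pd

# The five rain records shown above.
rain = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2016-06-21", "2016-06-22", "2016-06-23", "2016-06-24", "2016-06-25"]
        ),
        "mm": [4.0, 0.6, 0.9, 13.1, 9.8],
    }
)

# Two survey dates from the survey table, reduced to two columns.
survey = pd.DataFrame(
    {
        "date": pd.to_datetime(["2016-06-21", "2016-06-28"]),
        "location": ["MRD", "MRD"],
    }
)

# Left join: keep every survey row, leave mm empty where no rain record exists.
merged = survey.merge(rain, on="date", how="left")
```

Survey dates without a matching rain record end up with a missing `mm` value, which makes gaps in the rain data easy to spot.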