1. Random variables#
The water monitoring project at the Montreux Jazz has been running since 2016. Since then, the data has been collected and processed by a variety of people.
Objective: Standardize the nomenclature from the different sampling years. Provide a model for storing and collecting data in the future.
Purpose: Define the probability that a survey will exceed a threshold value within the period of the year defined by the survey results.
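As a rough illustration of the purpose stated above, the probability of exceeding a threshold can be estimated empirically as the share of surveys whose reported value is above that threshold. This is only one simple estimator and not necessarily the one used in the project. The function name `exceedance_probability` and the threshold of 100 are assumptions made for this sketch; the example values are the count × coef products of the five example rows in the survey data table.

```python
def exceedance_probability(values, threshold):
    """Return the fraction of values strictly greater than threshold."""
    values = list(values)
    if not values:
        raise ValueError("at least one survey value is required")
    n_exceeding = sum(1 for v in values if v > threshold)
    return n_exceeding / len(values)

# count * coef for the five example survey rows (colonies per 100 mL).
example_counts = [0, 550, 200, 200, 0]

# Hypothetical threshold, chosen only for illustration.
p = exceedance_probability(example_counts, threshold=100)
print(p)
```

Three of the five example values are above 100, so the estimate here is 3/5. With so few records the estimate is very coarse; it is shown only to make the definition concrete.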
1.1. Definitions#
colony: a circular growth of individual bacteria from a water sample
colony-count: the number of colonies of the same color for a given media type
media/medium: the provided nutrients and substrates of a microbial plate or card
color: the observed color of the colony
label: the assumed category of the color:
Bioindicator
Coliform
Other
coef: the correction factor applied to the colony count so that results can be reported per 100 mL of the original water sample.
The purpose of the sampling is to identify the colonies that appear in the media and to classify each one under one of the possible labels. The label of interest is Bioindicator: it represents the bacteria that originate from the organism of interest. In this case the organism is people, and the Bioindicator bacteria originate from fecal contaminants.
1.2. Methods#
The process requires collaborating with the data-manager(s) from the different project years and ensuring that the data from each year can be combined and interpreted together. The data for this collaboration is stored in the componentdata folder.
The mapping from each previous label to its new label is stored in a dictionary or an array that covers the possible values of medium, color, label and coefficient. The new labels are then applied to a data-frame.
The finished data (the result of the collaboration) is stored in the end folder.
1.3. Sample data#
The sample data is an example of the desired output per year. This includes the following parameters:
colony-count
label
location
coefficient*count
week number
day of year
is-jazz: boolean
rain fall in millimeters
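A minimal sketch of how the parameters listed above could be derived from survey records with pandas. The input column names `date`, `location`, `label`, `count`, `coef` and `event` follow the survey data format in section 1.4; the output names `count_100ml` and `is_jazz` are assumptions chosen for this example. Rainfall is left out because it comes from a separate table.

```python
import pandas as pd

# Two records taken from the example rows of the survey data table.
survey = pd.DataFrame({
    "date": pd.to_datetime(["2016-07-12", "2016-06-21"]),
    "location": ["MRD", "MRD"],
    "label": ["Bioindicator", "Bioindicator"],
    "count": [22.0, 2.0],
    "coef": [25, 100],
    "event": [True, False],
})

sample = survey[["location", "label", "count"]].copy()
# coefficient * count: colony count reported per 100 mL.
sample["count_100ml"] = survey["count"] * survey["coef"]
# ISO week number and day of year, both taken from the sample date.
sample["week"] = survey["date"].dt.isocalendar().week.astype(int)
sample["doy"] = survey["date"].dt.dayofyear
# is-jazz: True when the sample was taken during the event.
sample["is_jazz"] = survey["event"]

print(sample)
```

The week and day-of-year values produced this way match the `week` and `doy` columns of the same two rows in the survey data table.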
1.4. Survey data#
The format of the survey data after collaboration:
| | date | location | sample | date_sample | event | before event | after event | medium | label | count | coef | week | doy | year | color | image |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-07-05 | MRD | MRD1 | ('2016-07-05', 'MRD1') | True | False | False | EasyGel | Bioindicator | 0.0 | 250 | 27 | 187 | 2016 | big_blue | none |
| 1 | 2016-07-12 | MRD | MRD1 | ('2016-07-12', 'MRD1') | True | False | False | EasyGel | Bioindicator | 22.0 | 25 | 28 | 194 | 2016 | big_blue | none |
| 2 | 2016-07-19 | MRD | MRD1 | ('2016-07-19', 'MRD1') | False | False | True | EasyGel | Bioindicator | 8.0 | 25 | 29 | 201 | 2016 | big_blue | none |
| 3 | 2016-06-21 | MRD | MRD1 | ('2016-06-21', 'MRD1') | False | True | False | EasyGel | Bioindicator | 2.0 | 100 | 25 | 173 | 2016 | big_blue | none |
| 4 | 2016-06-28 | MRD | MRD1 | ('2016-06-28', 'MRD1') | False | True | False | EasyGel | Bioindicator | 0.0 | 25 | 26 | 180 | 2016 | big_blue | none |
1.4.2. Applying labels#
The colors recorded for the observations can be placed into three broad categories:
Bioindicator
Coliform
Other
The microbiologist determines the correct label for the recorded color based on the specifics of the media/medium used to grow the culture.
The colors appropriate to each label are stored in an array. The color for each record is tested for membership in one of the arrays. If it is in one of the arrays, the name of that array is returned. If the color is not in any array the original value is returned. The result is added to the data-frame.
# Colors recorded in the field, grouped by the label they map to.
bioindicators = ["Dark Blue", "Blue", "Turquoise fast", "metallic_green", "green_met", "fluo_halo", "big_blue"]
coliforms = ["Pink", "pink", "purple", "med_blue"]
other = ["Turquoise", "Turquoise slow", "other", "mauve", "fluo_other", "green"]

def translate_colors(x, bioindicators, coliforms, other):
    """Return the label for color x, or x itself if the color is unknown."""
    if x in bioindicators:
        return "Bioindicator"
    elif x in coliforms:
        return "Coliform"
    elif x in other:
        return "Other"
    else:
        return x

# stddf is the data-frame of survey records; it needs a "color" column.
stddf["label"] = stddf["color"].apply(lambda x: translate_colors(x, bioindicators, coliforms, other))
We do the same for the media/medium, except that a dictionary is used to store the mapping:
# Previous media names (keys) and the standardized name for each (values).
media_names = {
    "ECC-A Card": "ECC-A",
    "new ECCA": "ECC-A",
    "E-coli side": "E coli",
    "Double side E coli": "E coli",
    "ECC-side": "ECC",
    "Double side ECC": "ECC",
    "selective": "Levine",
    "media": "EasyGel",
    "plus uv": "EasyGelPlus",
    "UVplus": "EasyGelPlus",
    "non-restrictive": "LB",
    "levine": "Levine",
    "easy_gel": "EasyGel",
    "unil_kitchen": "LB",
    "micrology_card": "ECC",
}

def translate_media(x, media_names):
    """Return the standardized name for medium x, or x itself if unknown."""
    return media_names.get(x, x)

# stddf needs a "media" column holding the previous media names.
stddf["medium"] = stddf["media"].apply(lambda x: translate_media(x, media_names))
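Because both translation functions return the original value when no match is found, unrecognized entries pass through silently. The sketch below shows one way such entries could be listed for review. The helper name `find_unmapped` and the example values (including the color "orange") are assumptions made for illustration and are not part of the project code.

```python
def find_unmapped(values, known):
    """Return the sorted list of distinct values that are not in known."""
    return sorted(set(values) - set(known))

# Hypothetical labels after translation: "orange" was not in any color list,
# so it was returned unchanged.
labels_after_translation = ["Bioindicator", "Coliform", "orange", "Other", "orange"]
expected_labels = ["Bioindicator", "Coliform", "Other"]

unmapped = find_unmapped(labels_after_translation, expected_labels)
print(unmapped)  # ['orange']
```

On the data-frame itself, the same check could be run as `find_unmapped(stddf["label"], expected_labels)`, and likewise for `stddf["medium"]` against `media_names.values()`.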
1.4.3. Labeling the date range of interest#
Here are the Jazz dates for every sampling year:
2016: 2016-07-01 - 2016-07-16
2017: 2017-06-30 - 2017-07-15
2020: 2020-07-03 - 2020-07-18
2022: 2022-07-01 - 2022-07-16
2023: 2023-06-30 - 2023-07-15
before event: samples before the beginning of the event of interest
after event: samples after the end of the event
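The date ranges and definitions above can be turned into the three boolean columns of the survey data (`event`, `before event`, `after event`) along these lines. This is a minimal sketch: the names `JAZZ_DATES` and `label_event` are hypothetical, and treating the first and last day as part of the event is an assumption.

```python
from datetime import date

# Event dates per sampling year, copied from the list above.
JAZZ_DATES = {
    2016: (date(2016, 7, 1), date(2016, 7, 16)),
    2017: (date(2017, 6, 30), date(2017, 7, 15)),
    2020: (date(2020, 7, 3), date(2020, 7, 18)),
    2022: (date(2022, 7, 1), date(2022, 7, 16)),
    2023: (date(2023, 6, 30), date(2023, 7, 15)),
}

def label_event(sample_date, jazz_dates=JAZZ_DATES):
    """Return (event, before_event, after_event) flags for one sample date."""
    start, end = jazz_dates[sample_date.year]
    # Assumption: the start and end dates count as part of the event.
    return (start <= sample_date <= end, sample_date < start, sample_date > end)

print(label_event(date(2016, 7, 5)))   # (True, False, False)
print(label_event(date(2016, 7, 19)))  # (False, False, True)
print(label_event(date(2016, 6, 21)))  # (False, True, False)
```

The three results agree with the flags shown for the same dates in the survey data table. A sampling year that is missing from the dictionary raises a `KeyError`, which makes a forgotten year visible rather than mislabeling it.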
2. Rain fall#
Expected format of rain data:
| | date | mm |
|---|---|---|
| 0 | 2016-06-21 | 4.0 |
| 1 | 2016-06-22 | 0.6 |
| 2 | 2016-06-23 | 0.9 |
| 3 | 2016-06-24 | 13.1 |
| 4 | 2016-06-25 | 9.8 |
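Since the sample data includes rainfall in millimeters for each record, one way to attach it is a join on the date. The sketch below assumes a same-day match between the survey date and the rain date; the project may instead use a different rule, such as rain on preceding days. The variable names are chosen for illustration.

```python
import pandas as pd

# Rain data in the expected format (first rows of the table above).
rain = pd.DataFrame({
    "date": pd.to_datetime(["2016-06-21", "2016-06-22", "2016-06-23"]),
    "mm": [4.0, 0.6, 0.9],
})

# Two survey records from section 1.4, reduced to the columns needed here.
survey = pd.DataFrame({
    "date": pd.to_datetime(["2016-06-21", "2016-06-28"]),
    "location": ["MRD", "MRD"],
    "count": [2.0, 0.0],
})

# A left join keeps every survey record; dates without rain data get NaN.
merged = survey.merge(rain, on="date", how="left")
print(merged)
```

The left join is used so that a survey record is never dropped when the rain table has no entry for its date; the missing value then shows up as NaN in the `mm` column and can be handled explicitly.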