1. Random variables

The water monitoring project at the Montreux Jazz Festival has been running since 2016. Since then, the data has been collected and processed by a variety of people.

Objective: Standardize the nomenclature from the different sampling years. Provide a model for storing and collecting data in the future.

Purpose: Define the probability that a survey will exceed a threshold value within the period of the year defined by the survey results.

1.1. Definitions

  • colony: a circular growth of individual bacteria from a water sample

  • colony-count: the number of colonies of the same color for a given media type

  • media/medium: the provided nutrients and substrates of a microbial plate or card

  • color: the observed color of the colony

  • label: the assumed category of the color:

    • Bioindicator

    • Coliform

    • Other

  • coef: the correction factor applied to a colony count so that it can be reported per 100 mL of the original water sample.
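As a worked example of the correction factor, the sketch below multiplies a colony count by its coef to give colonies per 100 mL. The function name is hypothetical; the values 22.0 and 25 are taken from a row of the survey data in section 1.4.

```python
def count_per_100ml(colony_count, coef):
    """Scale a raw colony count to colonies per 100 mL of the original sample."""
    return colony_count * coef


# A plate with 22 colonies and a correction factor of 25
print(count_per_100ml(22.0, 25))  # 550.0
```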

The purpose of the sampling is to identify the colonies that appear on the media and classify each one under one of the possible labels. The label of interest is Bioindicator, which represents bacteria that originate from the organism of interest. In this case the organism is people, so the Bioindicator bacteria originate from fecal contamination.

1.2. Methods

The process requires collaborating with the data-manager(s) from the different project years and ensuring that the data from each year can be combined and interpreted together. The data for this collaboration is stored in the componentdata folder.

The mapping from each previous label to its new label is stored in a dictionary or an array, covering the possible values of medium, color, label and coefficient. The new labels are then applied to a data-frame.

The finished data (the result of the collaboration) is stored in the end folder.

1.3. Sample data

The sample data is an example of the desired output per year. This includes the following parameters:

  1. colony-count

  2. label

  3. location

  4. coefficient*count

  5. week number

  6. day of year

  7. is-jazz: boolean

  8. rainfall in millimeters
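The week number and day of year can both be derived from the sample date. The sketch below uses only the Python standard library and assumes ISO week numbering, which agrees with the first row of the survey data in section 1.4 (2016-07-05 is week 27, day 187). The function name is hypothetical.

```python
from datetime import date


def week_and_doy(iso_date):
    """Return (ISO week number, day of year) for a 'YYYY-MM-DD' string."""
    d = date.fromisoformat(iso_date)
    week = d.isocalendar()[1]      # assumption: ISO 8601 week numbering
    doy = d.timetuple().tm_yday    # 1 = January 1st
    return week, doy


print(week_and_doy("2016-07-05"))  # (27, 187)
```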

1.4. Survey data

The format of the survey data after collaboration:

   date        location  sample  date_sample             event  before event  after event  medium   label         count  coef  week  doy  year  color     image
0  2016-07-05  MRD       MRD1    ('2016-07-05', 'MRD1')  True   False         False        EasyGel  Bioindicator  0.0    250   27    187  2016  big_blue  none
1  2016-07-12  MRD       MRD1    ('2016-07-12', 'MRD1')  True   False         False        EasyGel  Bioindicator  22.0   25    28    194  2016  big_blue  none
2  2016-07-19  MRD       MRD1    ('2016-07-19', 'MRD1')  False  False         True         EasyGel  Bioindicator  8.0    25    29    201  2016  big_blue  none
3  2016-06-21  MRD       MRD1    ('2016-06-21', 'MRD1')  False  True          False        EasyGel  Bioindicator  2.0    100   25    173  2016  big_blue  none
4  2016-06-28  MRD       MRD1    ('2016-06-28', 'MRD1')  False  True          False        EasyGel  Bioindicator  0.0    25    26    180  2016  big_blue  none

1.4.1. Current data to process

None

The data from 2017 will require quite a bit of formatting.

1.4.2. Applying labels

The colors that were used for the observations can be placed into three broad categories.

  1. Bioindicator

  2. Coliform

  3. Other

The microbiologist determines the correct label for the recorded color based on the specifics of the media/medium used to grow the culture.

The colors appropriate to each label are stored in an array. The color of each record is tested for membership in each array. If it is found, the label for that array is returned. If the color is not in any array, the original value is returned. The result is added to the data-frame.

# Observed colors grouped by the label they are assigned to
bioindicators = ["Dark Blue", "Blue", "Turquoise fast", "metallic_green", "green_met", "fluo_halo", "big_blue"]
coliforms = ["Pink", "pink", "purple", "med_blue"]
other = ["Turquoise", "Turquoise slow", "other", "mauve", "fluo_other", "green"]

def translate_colors(x, bioindicators, coliforms, other):
    """Return the label for color x, or x itself if it is in none of the lists."""
    if x in bioindicators:
        return "Bioindicator"
    if x in coliforms:
        return "Coliform"
    if x in other:
        return "Other"
    return x

# stddf is the survey data-frame; its "color" column holds the recorded colors
stddf["label"] = stddf["color"].apply(lambda x: translate_colors(x, bioindicators, coliforms, other))

We do the same for the media/medium, except that a dictionary is used to store the mapping.

media_names =  {
    "ECC-A Card":"ECC-A",
    "new ECCA":"ECC-A",
    "E-coli side": "E coli",
    "Double side E coli": "E coli",
    "ECC-side":"ECC",
    "Double side ECC":"ECC",
    "selective":"Levine",
    "media":"EasyGel",
    "plus uv":"EasyGelPlus",
    "UVplus":"EasyGelPlus",
    "non-restrictive":"LB",
    "levine": "Levine",
    "easy_gel":"EasyGel",
    "unil_kitchen":"LB",
    "micrology_card": "ECC"
}

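def translate_media(x, media_names):
    """Return the standard name for medium x, or x itself if it has no entry."""
    return media_names.get(x, x)


# The "media" column holds the medium names as originally recorded
stddf["medium"] = stddf["media"].apply(lambda x: translate_media(x, media_names))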
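Because both translation functions return the original value when no match is found, unmatched colors or media names stay in the data-frame unchanged. A minimal sketch for listing such values so they can be reviewed is given below. The helper name and the sample values are hypothetical and only illustrate the idea.

```python
def unmapped_values(values, *known_groups):
    """Return the sorted unique values that appear in none of the known groups."""
    known = set().union(*known_groups)
    return sorted(set(values) - known)


# Illustrative values only, not taken from the survey data
colors_seen = ["big_blue", "pink", "yellowish", "green", "big_blue"]
print(unmapped_values(colors_seen, ["big_blue"], ["pink"], ["green"]))  # ['yellowish']
```

For the media names, the keys of the dictionary can be passed as the single known group.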

1.4.3. Labeling the date range of interest

Here are the Jazz dates for all the sampling years:

  • 2016: 2016-07-01 - 2016-07-16

  • 2017: 2017-06-30 - 2017-07-15

  • 2020: 2020-07-03 - 2020-07-18

  • 2022: 2022-07-01 - 2022-07-16

  • 2023: 2023-06-30 - 2023-07-15

before event: samples before the beginning of the event of interest

after event: samples after the end of the event
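The three event columns in the survey data can be derived from the sample date and the event dates listed above. The sketch below is one possible way to do this with the standard library. The names JAZZ_DATES and event_flags are hypothetical, and it assumes that the first and last listed days count as part of the event.

```python
from datetime import date

# Event dates per sampling year, as listed above
JAZZ_DATES = {
    2016: (date(2016, 7, 1), date(2016, 7, 16)),
    2017: (date(2017, 6, 30), date(2017, 7, 15)),
    2020: (date(2020, 7, 3), date(2020, 7, 18)),
    2022: (date(2022, 7, 1), date(2022, 7, 16)),
    2023: (date(2023, 6, 30), date(2023, 7, 15)),
}


def event_flags(iso_date, event_dates=JAZZ_DATES):
    """Return (event, before event, after event) for a 'YYYY-MM-DD' sample date."""
    d = date.fromisoformat(iso_date)
    start, end = event_dates[d.year]  # raises KeyError for a year with no listed dates
    # Assumption: the first and last listed days are part of the event
    return (start <= d <= end, d < start, d > end)


print(event_flags("2016-07-05"))  # (True, False, False)
print(event_flags("2016-06-21"))  # (False, True, False)
print(event_flags("2016-07-19"))  # (False, False, True)
```

These results agree with the event, before event and after event values of the corresponding rows in the survey data of section 1.4.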

2. Rainfall

Expected format of the rain data:

   date        mm
0  2016-06-21  4.0
1  2016-06-22  0.6
2  2016-06-23  0.9
3  2016-06-24  13.1
4  2016-06-25  9.8
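To add the rainfall to the sample data, each sample can be matched to the rain record for the same date. The sketch below uses plain Python dictionaries rather than a data-frame so that it stays self-contained. The function name and the rain_mm key are hypothetical, and a sample with no rain record for its date gets None.

```python
from datetime import date

# The five rain records shown above, keyed by date
rain_mm = {
    date(2016, 6, 21): 4.0,
    date(2016, 6, 22): 0.6,
    date(2016, 6, 23): 0.9,
    date(2016, 6, 24): 13.1,
    date(2016, 6, 25): 9.8,
}


def add_rain(records, rain_by_date):
    """Return copies of the records with a 'rain_mm' entry (None if no rain record)."""
    return [{**r, "rain_mm": rain_by_date.get(r["date"])} for r in records]


samples = [
    {"date": date(2016, 6, 21), "count": 2.0},
    {"date": date(2016, 7, 5), "count": 0.0},  # no rain record for this date above
]
with_rain = add_rain(samples, rain_mm)
print(with_rain[0]["rain_mm"])  # 4.0
print(with_rain[1]["rain_mm"])  # None
```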