%load_ext watermark
import pandas as pd
import setvariables as conf_
import reportclass as r_class
from typing import Type, Optional, Callable
from typing import List, Dict, Union, Tuple
from sklearn.preprocessing import MinMaxScaler

Checking the assistant#

This page is a reference point for testing the accuracy of the GPT assigned to accompany readers of the federal report. The GPT should reproduce the calculations on this page at any time. This includes values not in the federal report. Stakeholders will need to apply these results to their proper geographic or administrative responsibilities. The hammerdirt GPT assists in this process.

The product is a dataframe that combines columns from the ReportClass and columns from the LandUseClass. The intention is to allow easy access to the magnitude of topographical features within 1 500 m of the observed density for any object in the data.

Important

April 17, 2023: The app that uses the hammerdirtGPT is in demo form. We have abandoned the initial method of defining the prompt through the API and are now developing a RAG application. One component of the prompt context is the result of the user's request. With this we combine the references from the federal report and any updated references that can be included.

Changes to class definitions: Building a RAG application means that we have to consider both the user's visualisation of the report and the consumption of that data by the AI model: a data frame or array for the former, JSON for the latter. These considerations will have a transformative effect on all the code in this module.
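The dual output requirement can be sketched as follows. This is a minimal illustration with invented values, not the module's actual serialisation code: the same frame is kept tabular for the reader and serialised to JSON records for the model context.

```python
import pandas as pd

# Invented sample of the report data: one row per (location, code)
df = pd.DataFrame({
    "location": ["aabach", "aabach"],
    "code": ["G27", "G30"],
    "pcs/m": [0.33, 0.12],
})

# For the reader: the tabular display is the DataFrame itself
table_view = df

# For the model: one JSON object per record (a hypothetical serialisation choice)
json_view = df.to_json(orient="records")
print(json_view)
```

The `orient="records"` form keeps each survey row self-describing, which is convenient when rows are injected individually into a prompt context.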

PREVIOUS

November 20, 2023: There is a known issue we are working on now. Remind the assistant to follow instructions, specifically in the following cases:

  1. Always getting a value of zero for the median sample total

    • The GPT has specific instructions on this

  2. Tells you the correct columns are not available

    • The GPT has the column names and definitions from this page

The data has a two-column index, something the GPT does not always recognize. An issue has been submitted.
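A minimal sketch of the two-column (MultiIndex) situation, with invented values, assuming an index of (location, date). Flattening the index with `reset_index()` turns both levels into ordinary columns, which is easier for the assistant to interpret:

```python
import pandas as pd

# A frame indexed by two columns: location and date
idx = pd.MultiIndex.from_tuples(
    [("aabach", "2020-10-22"), ("aabach", "2020-11-05")],
    names=["location", "date"],
)
df = pd.DataFrame({"pcs/m": [0.0, 1.2]}, index=idx)

# reset_index() makes the two index levels explicit columns
flat = df.reset_index()
print(flat.columns.tolist())  # ['location', 'date', 'pcs/m']
```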

Note

The assistant's role is to provide mathematical and graphical representations of the data in response to the researcher's request. This often involves aggregating values at different levels, combining attributes and the like.

This page allows all users to verify that these complex transactions are happening correctly. The GPT may not use the same method to calculate the final result, but the results should be the same.
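For example, the median sample total (the subject of known issue 1 above) depends on aggregating to the sample before taking the median. A sketch with invented numbers shows why the order of operations matters: each sample lists many object codes, most of them zero, so a row-level median is misleading.

```python
import pandas as pd

# Invented data: two samples, two codes each
df = pd.DataFrame({
    "sample_id": ["s1", "s1", "s2", "s2"],
    "code": ["G27", "G30", "G27", "G30"],
    "pcs/m": [0.4, 0.0, 1.0, 0.6],
})

# Correct: aggregate to the sample first, then take the median
sample_totals = df.groupby("sample_id")["pcs/m"].sum()
print(sample_totals.median())  # 1.0, the median of the sample totals 0.4 and 1.6

# The naive row-level median differs
print(df["pcs/m"].median())  # 0.5
```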

Default data of hammerdirt GPT:#

beta version = 0.01

The default data for the GPT can be reproduced on the command line if the hammerdirtgpt package is installed:

# Collecting required data to establish a report
# This includes the language maps for all the common
# abbreviations and columns or index labels.
c_l = r_class.language_maps()

# The survey data in units of pcs/m or pcs/m². The reports
# are aggregated first to the sample_id, which means that the operations
# are the same whether using pcs/m or pcs/m².
surveys = r_class.collect_survey_data_for_report()

# The support or environmental data. This includes plain text descriptions
# of the codes, details for each survey location, and topographical data
# extracted from the buffer around each survey location.
codes, beaches, land_cover, land_use, streets, river_intersect_lakes = r_class.collect_env_data_for_report()

# Add columns to the survey data. The support data contains information that can be used to
# group objects or survey locations that may not be stored with the survey data. In this
# example an administrative label is attached to each survey_id: the cantonal label.
survey_data = surveys.merge(beaches['canton'], left_on='slug', right_index=True, validate='many_to_one')
# survey_data = survey_data.loc[survey_data.code == 'G27'].copy()
survey_data = survey_data[survey_data.feature_name != 'aare'].copy()

# ! USER DEFINED INPUT
# Temporal and geographic boundaries.
boundaries = dict(feature_type ='l', language='fr', start_date='2015-01-01', end_date='2022-01-01')
# Make the report data and report
top_label, language, w_df, w_di = r_class.report_data(boundaries, survey_data.copy(), beaches, codes)
a_report = r_class.ReportClass(w_df, boundaries, top_label, 'fr', c_l)
w_df_locations = w_df.slug.unique()

# call the land use class on the two different location groups
m_ui = LandUse(land_cover, land_use, streets, w_df_locations)

# for the region of interest
lcui = m_ui.use_of_land.copy()
lc_sti, no_datai = match_topo_attributes_to_surveys(lcui, a_report.w_df)

# the basic work data contains the survey results and the 
# topographical data merged on the <slug> column
work_data_i = merge_topodata_and_surveydata(lc_sti, a_report.w_df)

new_names = {
    'slug':'location',
    'loc_date':'sample_id',
    'pcs_m':'pcs/m',
    'Obstanlage': "orchards",
    'Reben':'vineyards',
    'Siedl':'buildings',
    'Wald':'forest',
    'land_use':'public services'
}
gptdf = work_data_i.rename(columns=new_names)

The preceding code produces the following table:

location sample_id date feature_name parent_boundary city canton pcs/m quantity code feature_type orchards vineyards buildings forest undefined public services streets
0 aabach ('aabach', '2020-10-22') 2020-10-22 zurichsee linth Schmerikon St. Gallen 0.0 0 G1 l 0.0 0.0 0.204522 0.535456 0.561892 0.018241 0.207248
1 aabach ('aabach', '2020-10-22') 2020-10-22 zurichsee linth Schmerikon St. Gallen 0.0 0 G10 l 0.0 0.0 0.204522 0.535456 0.561892 0.018241 0.207248
2 aabach ('aabach', '2020-10-22') 2020-10-22 zurichsee linth Schmerikon St. Gallen 0.0 0 G100 l 0.0 0.0 0.204522 0.535456 0.561892 0.018241 0.207248
3 aabach ('aabach', '2020-10-22') 2020-10-22 zurichsee linth Schmerikon St. Gallen 0.0 0 G101 l 0.0 0.0 0.204522 0.535456 0.561892 0.018241 0.207248
4 aabach ('aabach', '2020-10-22') 2020-10-22 zurichsee linth Schmerikon St. Gallen 0.0 0 G102 l 0.0 0.0 0.204522 0.535456 0.561892 0.018241 0.207248

Hand the file to the assistant#

Add language definitions#

The language definitions ensure an efficient transmission of intent from the observer to the model. We could leave the translations and definitions to a translator and thus reduce the weight of the .csv file or API request. However, this would require an additional service call by the client to get the requested information translated. Providing the definitions according to the standard set in the federal report is a good baseline. If there is support amongst stakeholders to change the definitions, this can be handled by a pull request or by raising an issue on the repo.

gptdf['fr'] = gptdf.code.map(lambda x: codes.loc[x, 'fr'])
gptdf['en'] = gptdf.code.map(lambda x: codes.loc[x, 'en'])
gptdf['de'] = gptdf.code.map(lambda x: codes.loc[x, 'de'])

gptdf.to_csv('data/in_process/lakes.csv', index=False)
gptdfx.head()
sample_id location date feature_name parent_boundary city canton feature_type orchards vineyards ... forest undefined public services streets code pcs/m quantity fr en de
0 ('aabach', '2020-10-22') aabach 2020-10-22 zurichsee linth Schmerikon St. Gallen l 0.0 0.0 ... 0.535456 0.561892 0.018241 0.207248 G1 0.0 0 Anneaux pour six packs Six pack rings Sixpack-Ringe
1 ('aabach', '2020-10-22') aabach 2020-10-22 zurichsee linth Schmerikon St. Gallen l 0.0 0.0 ... 0.535456 0.561892 0.018241 0.207248 G10 0.0 0 Emballage fast food Food containers single use foamed or plastic Lebensmittelbehälter zum einmaligen Gebrauch a...
2 ('aabach', '2020-10-22') aabach 2020-10-22 zurichsee linth Schmerikon St. Gallen l 0.0 0.0 ... 0.535456 0.561892 0.018241 0.207248 G100 0.0 0 Médical conteneurs/tubes/ emballages Medical; containers/tubes/ packaging Medizin; Behälter/Röhrchen/Verpackungen
3 ('aabach', '2020-10-22') aabach 2020-10-22 zurichsee linth Schmerikon St. Gallen l 0.0 0.0 ... 0.535456 0.561892 0.018241 0.207248 G101 0.0 0 Sac pour déjections canines Dog feces bag Robidog Hundekot-Säcklein, andere Hundekotsäck...
4 ('aabach', '2020-10-22') aabach 2020-10-22 zurichsee linth Schmerikon St. Gallen l 0.0 0.0 ... 0.535456 0.561892 0.018241 0.207248 G102 0.0 0 Tongs Flip-flops Flip-Flops

5 rows × 21 columns

gptdfx[gptdfx.code == 'G79']
sample_id location date feature_name parent_boundary city canton feature_type orchards vineyards ... forest undefined public services streets code pcs/m quantity fr en de

0 rows × 21 columns

gptdfx.columns
Index(['sample_id', 'location', 'date', 'feature_name', 'parent_boundary',
       'city', 'canton', 'feature_type', 'orchards', 'vineyards', 'buildings',
       'forest', 'undefined', 'public services', 'streets', 'code', 'pcs/m',
       'quantity', 'fr', 'en', 'de'],
      dtype='object')

Column names and definitions#

These column names and definitions are given to the GPT assistant.

  1. location: the name of the location used by people doing the survey

  2. sample_id: the combination of the location and date, the unique identifier of a sampling event

  3. date: the date of the sample

  4. feature_name: the name of the park, lake, or river where the sample was collected

  5. parent_boundary: a designated survey area, usually a river basin or regional label

  6. city: the municipality where the sample was taken

  7. canton: the canton where the sample was taken

  8. pcs/m: the number of objects identified by the column code collected at the sampling event divided by the length of shoreline, river bank or trail that was sampled.

  9. quantity: the number of objects identified by the column code collected at the sampling event

  10. code: the Marine Litter Watch object code

  11. feature_type: identifies the sample location as either a park, lake or river

  12. orchards: % of dry land attributed to this land-use within 1’500 m of the survey location

  13. vineyards: % of dry land attributed to this land-use within 1’500 m of the survey location

  14. buildings: % of dry land attributed to this land-use within 1’500 m of the survey location

  15. forest: % of dry land attributed to this land-use within 1’500 m of the survey location

  16. undefined: % of dry land with no land-use label

  17. public services: % of dry land attributed to hospitals, schools, sports, administration

  18. streets: the number of meters of streets within 1’500 m of the survey location, scaled between 0 and 1

  19. fr: French code definitions

  20. en: English code definitions

  21. de: German code definitions
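The streets column is scaled to the 0–1 range. A minimal sketch of that scaling using scikit-learn's MinMaxScaler (imported in the first cell of this page); the raw meter values here are invented, and the actual pipeline may scale differently:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw street lengths: meters of streets within 1 500 m of each location
df = pd.DataFrame({"streets": [12_000.0, 48_000.0, 30_000.0]})

# Rescale so the minimum maps to 0 and the maximum maps to 1
scaler = MinMaxScaler()
df["streets_scaled"] = scaler.fit_transform(df[["streets"]])
print(df["streets_scaled"].tolist())  # [0.0, 1.0, 0.5]
```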

Note

The GPT will go through data exploration at the beginning of the chat. These column definitions are given to the GPT and can be requested at any time. The definitions the GPT gives you should be very close to these; if they are not, tell the GPT to use the provided definitions in its instructions, and these definitions should come back.

Verifying the output#

Test statistics#

Asking for each of these individually or telling the assistant to produce them all should yield the following results:

  • the median sample total of the data frame

  • the total quantity

  • the number of lakes

  • the number of samples

  • the number of cantons

  • the number of cities

# geo_features: a list of grouping columns defined earlier in the notebook (not shown on this page)
gp_dt = gptdfx.groupby(['sample_id', *geo_features], as_index=False).agg({'pcs/m':'sum', 'quantity':'sum'})

lakes = gptdfx[gptdfx.feature_type == 'l'].feature_name.nunique()
cities = gptdfx.city.nunique()
quantity = gptdfx.quantity.sum()
samples = gptdfx.sample_id.nunique()
cantons = gptdfx.canton.nunique()
pc_med = gp_dt['pcs/m'].median()

test_1 = dict(lakes=lakes, cities=cities, quantity=quantity, samples=samples, cantons=cantons, median_pcs_m = pc_med)
print(test_1)
{'lakes': 16, 'cities': 67, 'quantity': 146066, 'samples': 753, 'cantons': 14, 'median_pcs_m': 2.77}

Most common#

The most common codes are those codes that are either in the top ten by quantity or present in at least 50% of the surveys.
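The selection rule can be sketched as follows, with invented data and assuming one row per (sample_id, code): take the union of the top ten codes by total quantity and the codes found in at least 50% of the samples (a fail rate of at least 0.5). This is an illustration of the rule, not the ReportClass implementation.

```python
import pandas as pd

# Invented data: two samples, two codes
df = pd.DataFrame({
    "sample_id": ["s1", "s1", "s2", "s2"],
    "code": ["G27", "G95", "G27", "G95"],
    "quantity": [12, 0, 7, 3],
})

# Top ten codes by total quantity
totals = df.groupby("code")["quantity"].sum()
top_ten = set(totals.nlargest(10).index)

# Fail rate: fraction of samples in which the code was found at least once
n_samples = df["sample_id"].nunique()
fail_rate = df[df["quantity"] > 0].groupby("code")["sample_id"].nunique() / n_samples
frequent = set(fail_rate[fail_rate >= 0.5].index)

# Most common = union of the two criteria
most_common = top_ten | frequent
print(sorted(most_common))  # ['G27', 'G95']
```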

most_common, weight = a_report.most_common
most_common
quantity % pcs_m fail rate
G27 29033 0.198674 0.33 0.904636
Gfrags 17073 0.116831 0.04 0.905960
Gfoams 14989 0.102570 0.00 0.754967
G208 10445 0.071475 0.00 0.328477
G30 8931 0.061115 0.12 0.845033
G67 6926 0.047395 0.06 0.650331
Gcaps 5382 0.036829 0.00 0.750993
G95 4610 0.031546 0.00 0.482119
G200 4232 0.028960 0.01 0.508609
G178 2558 0.017504 0.03 0.607947
G156 2190 0.014986 0.00 0.429139
G177 1447 0.009902 0.01 0.525828

Aggregating samples#

Sample total pcs/m#

gp_dt['pcs/m'].describe()
count    753.000000
mean       5.628738
std        9.367900
min        0.040000
25%        1.220000
50%        2.770000
75%        6.000000
max       77.100000
Name: pcs/m, dtype: float64

Single code#

cigarette ends

gp_dtcode = gptdfx[gptdfx.code.isin(['G27'])].groupby(['sample_id', *geo_features], as_index=False).agg({'pcs/m':'sum', 'quantity':'sum'})
gp_dtcode['pcs/m'].describe()
count    753.000000
mean       0.907304
std        1.605537
min        0.000000
25%        0.090000
50%        0.340000
75%        1.040000
max       19.700000
Name: pcs/m, dtype: float64

Combining codes#

combining cigarette ends and snack wrappers

gp_dtcodes = gptdfx[gptdfx.code.isin(['G27', 'G30'])].groupby(['sample_id', *geo_features], as_index=False).agg({'pcs/m':'sum', 'quantity':'sum'})
gp_dtcodes['pcs/m'].describe()
count    753.000000
mean       1.238433
std        2.035005
min        0.000000
25%        0.210000
50%        0.550000
75%        1.450000
max       23.270000
Name: pcs/m, dtype: float64

Single feature#

the results on Bielersee

gp_dtbsee = gptdfx[gptdfx.feature_name == 'bielersee'].groupby(['sample_id', *geo_features], as_index=False).agg({'pcs/m':'sum', 'quantity':'sum'})
gp_dtbsee['pcs/m'].describe()
count    51.000000
mean      4.023725
std       2.995087
min       0.400000
25%       1.450000
50%       3.380000
75%       5.470000
max      14.800000
Name: pcs/m, dtype: float64

Combined features#

Bielersee and Thunersee

gp_dtbt = gptdfx[gptdfx.feature_name.isin(['bielersee', 'thunersee'])].groupby(['sample_id', *geo_features], as_index=False).agg({'pcs/m':'sum', 'quantity':'sum'})
gp_dtbt['pcs/m'].describe()
count    94.000000
mean      2.738085
std       2.682369
min       0.160000
25%       0.862500
50%       1.685000
75%       3.477500
max      14.800000
Name: pcs/m, dtype: float64

Land use#

Correlation matrix of the land use variables with each other

corrs = gp_dtbt[geo_features[1:-1]].corr()
corrs
vineyards orchards buildings forest undefined public services streets
vineyards 1.000000 -0.088721 -0.174585 -0.087445 0.057518 -0.110552 0.101623
orchards -0.088721 1.000000 -0.195273 0.586422 -0.335178 -0.220276 -0.185527
buildings -0.174585 -0.195273 1.000000 -0.249992 -0.699952 0.833019 0.631690
forest -0.087445 0.586422 -0.249992 1.000000 -0.487967 -0.077841 -0.129156
undefined 0.057518 -0.335178 -0.699952 -0.487967 1.000000 -0.664196 -0.487873
public services -0.110552 -0.220276 0.833019 -0.077841 -0.664196 1.000000 0.831108
streets 0.101623 -0.185527 0.631690 -0.129156 -0.487873 0.831108 1.000000
Author: hammerdirt-analyst

conda environment: cantonal_report

pandas: 2.0.3