%load_ext watermark
import pandas as pd
import setvariables as conf_
import reportclass as r_class
from typing import Type, Optional, Callable
from typing import List, Dict, Union, Tuple
from sklearn.preprocessing import MinMaxScaler
Checking the assistant#
This page is a reference point for testing the accuracy of the GPT assigned to accompany readers of the federal report. The GPT should be able to reproduce the calculations on this page at any time, including values that are not in the federal report. Stakeholders will need to apply these results to their own geographic or administrative areas of responsibility; the hammerdirt GPT assists in this process.
The product is a dataframe that combines columns from the ReportClass and columns from the LandUseClass. The intention is to allow easy access to the magnitude of topographical features within 1 500 m of the observed density for any object in the data.
Important
April 17, 2023: The app that uses the hammerdirtGPT is in demo form. We have abandoned the initial method of defining the prompt through the API and are now developing a RAG application. One component of the prompt context is the result of the user's request; with this we combine the references from the federal report and any updated references that can be included.
Changes to class definitions: Building a RAG application means that we have to consider both the user's visualisation of the report and the consumption of that data by the AI model: a dataframe or array for the former, JSON for the latter. These considerations will have a transformative effect on all the code in this module.
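The two consumption formats can be illustrated with a minimal sketch; the column names below are made up for illustration and are not taken from the module:

```python
import pandas as pd

# A minimal sketch of the two consumption formats described above:
# a DataFrame for the reader's visualisation, JSON records for the
# model's prompt context. The column names are illustrative.
df = pd.DataFrame({
    'location': ['aabach', 'aabach'],
    'code': ['G27', 'G30'],
    'pcs_m': [0.33, 0.12],
})

# JSON records can be spliced directly into a RAG prompt context
records = df.to_json(orient='records')
print(records)
```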
PREVIOUS
November 20, 2023: There is a known issue we are working on now. Remind the assistant to follow instructions, specifically in the following cases:
Always returning a value of zero for the median sample total: the GPT has specific instructions on this.
Telling you that the correct columns are not available: the GPT has the column names and definitions from this page.
Not recognizing the two-column index of the data: an issue has been submitted.
Note
The assistant's role is to provide mathematical and graphical representations of the data in response to the researcher's request. This often involves aggregating values at different levels, combining attributes and the like.
This page allows all users to verify that these complex transactions are happening correctly. The GPT may not use the same method to calculate the final result, but the results should be the same.
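A toy example of this principle, with made-up numbers: two different computation paths to the median sample total should agree.

```python
import pandas as pd

# Made-up survey totals: two different routes to the median
# sample total should give the same answer, even if the GPT
# groups the data differently than the report code does.
df = pd.DataFrame({
    'sample_id': ['a', 'a', 'b', 'b', 'c'],
    'pcs_m': [0.5, 1.5, 2.0, 1.0, 4.0],
})

# route 1: groupby, sum per sample, then median
m1 = df.groupby('sample_id')['pcs_m'].sum().median()

# route 2: pivot_table with the same aggregation
m2 = df.pivot_table(index='sample_id', values='pcs_m', aggfunc='sum')['pcs_m'].median()

assert m1 == m2 == 3.0  # sample totals are 2.0, 3.0, 4.0
```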
Default data of hammerdirt GPT:#
beta version = 0.01
The default data for the GPT can be reproduced on the command line if the hammerdirtgpt
package is installed:
# Collecting required data to establish a report
# This includes the language maps for all the common
# abbreviations and columns or index labels.
c_l = r_class.language_maps()
# The survey data in units of pcs/m or pcs/m². The reports
# are aggregated first to the sample_id, which means the operations
# are the same whether using pcs/m or pcs/m².
surveys = r_class.collect_survey_data_for_report()
# The support or environmental data. This includes plain-text descriptions
# of the codes, details for each survey location and topographical data
# extracted from the buffer around each survey location.
codes, beaches, land_cover, land_use, streets, river_intersect_lakes = r_class.collect_env_data_for_report()
# Add columns to survey data. The support data contains information that can be used to
# group objects or survey locations that may not be stored with the survey data. In this
# example an administrative label is attached to each survey_id. The cantonal label:
survey_data = surveys.merge(beaches['canton'], left_on='slug', right_index=True, validate='many_to_one')
# survey_data = survey_data.loc[survey_data.code == 'G27'].copy()
survey_data = survey_data[survey_data.feature_name != 'aare'].copy()
# ! USER DEFINED INPUT
# Temporal and geographic boundaries.
boundaries = dict(feature_type ='l', language='fr', start_date='2015-01-01', end_date='2022-01-01')
# Make the report data and report
top_label, language, w_df, w_di = r_class.report_data(boundaries, survey_data.copy(), beaches, codes)
a_report = r_class.ReportClass(w_df, boundaries, top_label, 'fr', c_l)
w_df_locations = w_df.slug.unique()
# call the land use class on the locations in the report data
m_ui = LandUse(land_cover, land_use, streets, w_df_locations)
# for the region of interest
lcui = m_ui.use_of_land.copy()
lc_sti, no_datai = match_topo_attributes_to_surveys(lcui, a_report.w_df)
# the basic work data contains the survey results and the
# topographical data merged on the <slug> column
work_data_i = merge_topodata_and_surveydata(lc_sti, a_report.w_df)
new_names = {
'slug':'location',
'loc_date':'sample_id',
'pcs_m':'pcs/m',
'Obstanlage': "orchards",
'Reben':'vineyards',
'Siedl':'buildings',
'Wald':'forest',
'land_use':'public services'
}
gptdf = work_data_i.rename(columns=new_names)
The preceding code produces the following table:
| | location | sample_id | date | feature_name | parent_boundary | city | canton | pcs/m | quantity | code | feature_type | orchards | vineyards | buildings | forest | undefined | public services | streets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aabach | ('aabach', '2020-10-22') | 2020-10-22 | zurichsee | linth | Schmerikon | St. Gallen | 0.0 | 0 | G1 | l | 0.0 | 0.0 | 0.204522 | 0.535456 | 0.561892 | 0.018241 | 0.207248 |
| 1 | aabach | ('aabach', '2020-10-22') | 2020-10-22 | zurichsee | linth | Schmerikon | St. Gallen | 0.0 | 0 | G10 | l | 0.0 | 0.0 | 0.204522 | 0.535456 | 0.561892 | 0.018241 | 0.207248 |
| 2 | aabach | ('aabach', '2020-10-22') | 2020-10-22 | zurichsee | linth | Schmerikon | St. Gallen | 0.0 | 0 | G100 | l | 0.0 | 0.0 | 0.204522 | 0.535456 | 0.561892 | 0.018241 | 0.207248 |
| 3 | aabach | ('aabach', '2020-10-22') | 2020-10-22 | zurichsee | linth | Schmerikon | St. Gallen | 0.0 | 0 | G101 | l | 0.0 | 0.0 | 0.204522 | 0.535456 | 0.561892 | 0.018241 | 0.207248 |
| 4 | aabach | ('aabach', '2020-10-22') | 2020-10-22 | zurichsee | linth | Schmerikon | St. Gallen | 0.0 | 0 | G102 | l | 0.0 | 0.0 | 0.204522 | 0.535456 | 0.561892 | 0.018241 | 0.207248 |
Hand file to assistant#
Add language definitions#
The language definitions ensure an efficient transmission of intent from the observer to the model. We could leave the translations and definitions to a translator and thus reduce the weight of the .csv file or API request. However, this would require an additional service call by the client to get the requested information translated. Providing the definitions according to the standard set in the federal report is a good baseline. If there is support amongst stakeholders to change the definitions, this can be handled by a pull request or by raising an issue on the repo.
gptdf['fr'] = gptdf.code.map(lambda x: codes.loc[x, 'fr'])
gptdf['en'] = gptdf.code.map(lambda x: codes.loc[x, 'en'])
gptdf['de'] = gptdf.code.map(lambda x: codes.loc[x, 'de'])
gptdf.to_csv('data/in_process/lakes.csv', index=False)
# continue with a working copy of the exported data
gptdfx = gptdf.copy()
gptdfx.head()
| | sample_id | location | date | feature_name | parent_boundary | city | canton | feature_type | orchards | vineyards | ... | forest | undefined | public services | streets | code | pcs/m | quantity | fr | en | de |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ('aabach', '2020-10-22') | aabach | 2020-10-22 | zurichsee | linth | Schmerikon | St. Gallen | l | 0.0 | 0.0 | ... | 0.535456 | 0.561892 | 0.018241 | 0.207248 | G1 | 0.0 | 0 | Anneaux pour six packs | Six pack rings | Sixpack-Ringe |
| 1 | ('aabach', '2020-10-22') | aabach | 2020-10-22 | zurichsee | linth | Schmerikon | St. Gallen | l | 0.0 | 0.0 | ... | 0.535456 | 0.561892 | 0.018241 | 0.207248 | G10 | 0.0 | 0 | Emballage fast food | Food containers single use foamed or plastic | Lebensmittelbehälter zum einmaligen Gebrauch a... |
| 2 | ('aabach', '2020-10-22') | aabach | 2020-10-22 | zurichsee | linth | Schmerikon | St. Gallen | l | 0.0 | 0.0 | ... | 0.535456 | 0.561892 | 0.018241 | 0.207248 | G100 | 0.0 | 0 | Médical conteneurs/tubes/ emballages | Medical; containers/tubes/ packaging | Medizin; Behälter/Röhrchen/Verpackungen |
| 3 | ('aabach', '2020-10-22') | aabach | 2020-10-22 | zurichsee | linth | Schmerikon | St. Gallen | l | 0.0 | 0.0 | ... | 0.535456 | 0.561892 | 0.018241 | 0.207248 | G101 | 0.0 | 0 | Sac pour déjections canines | Dog feces bag | Robidog Hundekot-Säcklein, andere Hundekotsäck... |
| 4 | ('aabach', '2020-10-22') | aabach | 2020-10-22 | zurichsee | linth | Schmerikon | St. Gallen | l | 0.0 | 0.0 | ... | 0.535456 | 0.561892 | 0.018241 | 0.207248 | G102 | 0.0 | 0 | Tongs | Flip-flops | Flip-Flops |
5 rows × 21 columns
gptdfx[gptdfx.code == 'G79']
| sample_id | location | date | feature_name | parent_boundary | city | canton | feature_type | orchards | vineyards | ... | forest | undefined | public services | streets | code | pcs/m | quantity | fr | en | de |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 21 columns
gptdfx.columns
Index(['sample_id', 'location', 'date', 'feature_name', 'parent_boundary',
'city', 'canton', 'feature_type', 'orchards', 'vineyards', 'buildings',
'forest', 'undefined', 'public services', 'streets', 'code', 'pcs/m',
'quantity', 'fr', 'en', 'de'],
dtype='object')
Column names and definitions#
These column names and definitions are given to the GPT assistant.
location: the name of the location used by people doing the survey
sample_id: the combination of the location and date, the unique identifier of a sampling event
date: the date of the sample
feature_name: the name of the park, lake, or river where the sample was collected
parent_boundary: a designated survey area, usually a river basin or regional label
city: the municipality where the sample was taken
canton: the canton where the sample was taken
pcs/m: the number of objects identified by the column code collected at the sampling event divided by the length of shoreline, river bank or trail that was sampled.
quantity: the number of objects identified by the column code collected at the sampling event
code: the Marine Litter Watch object code
feature_type: identifies the sample location as either a park, lake or river
orchards: % of dry land attributed to this land-use within 1 500 m of the survey location
vineyards: % of dry land attributed to this land-use within 1 500 m of the survey location
buildings: % of dry land attributed to this land-use within 1 500 m of the survey location
forest: % of dry land attributed to this land-use within 1 500 m of the survey location
undefined: % of dry land with no land-use label
public services: % of dry land attributed to hospitals, schools, sports, administration
streets: the number of meters of streets within 1 500 m of the survey location, scaled between 0 and 1
fr: french code definitions
en: english code definitions
de: german code definitions
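The streets column is the only attribute that is scaled rather than expressed as a percentage of dry land. A sketch of how such a 0 to 1 scaling can be produced with the MinMaxScaler imported at the top of this page; the street lengths below are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up street lengths in meters within the 1 500 m buffer
street_meters = np.array([[1200.0], [5400.0], [300.0], [9900.0]])

# MinMaxScaler maps the minimum value to 0 and the maximum to 1;
# intermediate values are interpolated linearly
scaled = MinMaxScaler().fit_transform(street_meters)
print(scaled.ravel())  # min -> 0.0, max -> 1.0
```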
Note
The GPT will go through data exploration at the beginning of the chat. These column definitions are given to the GPT and can be requested at any time. The definitions the GPT gives you should be very close to these. If they are not, tell the GPT to use the provided definitions in its instructions; the correct definitions should come back.
Verifying the output#
Test statistics#
Asking for each of these individually or telling the assistant to produce them all should yield the following results:
the median sample total of the data frame
the total quantity
the number of lakes
the number of samples
the number of cantons
the number of cities
# geo_features holds the location-level grouping columns (defined elsewhere in the module)
gp_dt = gptdfx.groupby(['sample_id', *geo_features], as_index=False).agg({'pcs/m':'sum', 'quantity':'sum'})
lakes = gptdfx[gptdfx.feature_type == 'l'].feature_name.nunique()
cities = gptdfx.city.nunique()
quantity = gptdfx.quantity.sum()
samples = gptdfx.sample_id.nunique()
cantons = gptdfx.canton.nunique()
pc_med = gp_dt['pcs/m'].median()
test_1 = dict(lakes=lakes, cities=cities, quantity=quantity, samples=samples, cantons=cantons, median_pcs_m = pc_med)
print(test_1)
{'lakes': 16, 'cities': 67, 'quantity': 146066, 'samples': 753, 'cantons': 14, 'median_pcs_m': 2.77}
Most common#
The most common codes are those codes that are either in the top ten by quantity or present in at least 50% of the surveys.
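The selection rule can be sketched as follows, on toy data; the ReportClass implementation may differ in detail, and all names below are illustrative:

```python
import pandas as pd

# Toy data: a code is "most common" if it is in the top ten by total
# quantity OR is found (quantity > 0) in at least 50% of the samples.
df = pd.DataFrame({
    'sample_id': ['s1', 's1', 's2', 's2'],
    'code':      ['G27', 'G30', 'G27', 'G30'],
    'quantity':  [10, 0, 5, 1],
})

# top ten codes by total quantity
top_ten = df.groupby('code')['quantity'].sum().nlargest(10).index

# fraction of samples in which each code was found
n_samples = df['sample_id'].nunique()
fail_rate = df[df['quantity'] > 0].groupby('code')['sample_id'].nunique() / n_samples

most_common = set(top_ten) | set(fail_rate[fail_rate >= 0.5].index)
print(sorted(most_common))  # ['G27', 'G30'] (both qualify in this toy set)
```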
most_common, weight = a_report.most_common
most_common
| | quantity | % | pcs_m | fail rate |
|---|---|---|---|---|
| G27 | 29033 | 0.198674 | 0.33 | 0.904636 |
| Gfrags | 17073 | 0.116831 | 0.04 | 0.905960 |
| Gfoams | 14989 | 0.102570 | 0.00 | 0.754967 |
| G208 | 10445 | 0.071475 | 0.00 | 0.328477 |
| G30 | 8931 | 0.061115 | 0.12 | 0.845033 |
| G67 | 6926 | 0.047395 | 0.06 | 0.650331 |
| Gcaps | 5382 | 0.036829 | 0.00 | 0.750993 |
| G95 | 4610 | 0.031546 | 0.00 | 0.482119 |
| G200 | 4232 | 0.028960 | 0.01 | 0.508609 |
| G178 | 2558 | 0.017504 | 0.03 | 0.607947 |
| G156 | 2190 | 0.014986 | 0.00 | 0.429139 |
| G177 | 1447 | 0.009902 | 0.01 | 0.525828 |
Aggregating samples#
Sample total pcs/m#
gp_dt['pcs/m'].describe()
count 753.000000
mean 5.628738
std 9.367900
min 0.040000
25% 1.220000
50% 2.770000
75% 6.000000
max 77.100000
Name: pcs/m, dtype: float64
Single code#
cigarette ends
gp_dtcode = gptdfx[gptdfx.code.isin(['G27'])].groupby(['sample_id', *geo_features], as_index=False).agg({'pcs/m':'sum', 'quantity':'sum'})
gp_dtcode['pcs/m'].describe()
count 753.000000
mean 0.907304
std 1.605537
min 0.000000
25% 0.090000
50% 0.340000
75% 1.040000
max 19.700000
Name: pcs/m, dtype: float64
Combining codes#
combining cigarette ends and snack wrappers
gp_dtcodes = gptdfx[gptdfx.code.isin(['G27', 'G30'])].groupby(['sample_id', *geo_features], as_index=False).agg({'pcs/m':'sum', 'quantity':'sum'})
gp_dtcodes['pcs/m'].describe()
count 753.000000
mean 1.238433
std 2.035005
min 0.000000
25% 0.210000
50% 0.550000
75% 1.450000
max 23.270000
Name: pcs/m, dtype: float64
Single feature#
the results on Bielersee
gp_dtbsee = gptdfx[gptdfx.feature_name == 'bielersee'].groupby(['sample_id', *geo_features], as_index=False).agg({'pcs/m':'sum', 'quantity':'sum'})
gp_dtbsee['pcs/m'].describe()
count 51.000000
mean 4.023725
std 2.995087
min 0.400000
25% 1.450000
50% 3.380000
75% 5.470000
max 14.800000
Name: pcs/m, dtype: float64
Combined features#
Bielersee and Thunersee
gp_dtbt = gptdfx[gptdfx.feature_name.isin(['bielersee', 'thunersee'])].groupby(['sample_id', *geo_features], as_index=False).agg({'pcs/m':'sum', 'quantity':'sum'})
gp_dtbt['pcs/m'].describe()
count 94.000000
mean 2.738085
std 2.682369
min 0.160000
25% 0.862500
50% 1.685000
75% 3.477500
max 14.800000
Name: pcs/m, dtype: float64
Land use#
Correlation matrix of the land-use variables with each other, computed here on the Bielersee and Thunersee samples:
corrs = gp_dtbt[geo_features[1:-1]].corr()
corrs
| | vineyards | orchards | buildings | forest | undefined | public services | streets |
|---|---|---|---|---|---|---|---|
| vineyards | 1.000000 | -0.088721 | -0.174585 | -0.087445 | 0.057518 | -0.110552 | 0.101623 |
| orchards | -0.088721 | 1.000000 | -0.195273 | 0.586422 | -0.335178 | -0.220276 | -0.185527 |
| buildings | -0.174585 | -0.195273 | 1.000000 | -0.249992 | -0.699952 | 0.833019 | 0.631690 |
| forest | -0.087445 | 0.586422 | -0.249992 | 1.000000 | -0.487967 | -0.077841 | -0.129156 |
| undefined | 0.057518 | -0.335178 | -0.699952 | -0.487967 | 1.000000 | -0.664196 | -0.487873 |
| public services | -0.110552 | -0.220276 | 0.833019 | -0.077841 | -0.664196 | 1.000000 | 0.831108 |
| streets | 0.101623 | -0.185527 | 0.631690 | -0.129156 | -0.487873 | 0.831108 | 1.000000 |
Author: hammerdirt-analyst
conda environment: cantonal_report
pandas: 2.0.3