%load_ext watermark
import pandas as pd
import setvariables as conf_
import reportclass as r_class
from typing import Type, Optional, Callable
from typing import List, Dict, Union, Tuple
from sklearn.preprocessing import MinMaxScaler

Checking the assistant#

This page is a reference point for testing the accuracy of the GPT assigned to accompany readers of the federal report. The GPT should reproduce the calculations on this page at any time. This includes values not in the federal report. Stakeholders will need to apply these results to their proper geographic or administrative responsibilities. The hammerdirt GPT assists in this process.

The product is a dataframe that combines columns from the ReportClass and columns from the LandUseClass. The intention is to allow easy access to the magnitude of topographical features within 1 500 m of the observed density for any object in the data.

Important

April 17, 2023: The app that uses the hammerdirtGPT is in demo form. We have abandoned the initial method of defining the prompt through the API and are now developing a RAG application. One component of the prompt context is the result of the user's request. With this we combine the references from the federal report and any updated references that can be included.

Changes to class definitions: Building a RAG application means that we have to consider both the user's visualisation of the report and the consumption of that data by the AI model: a data frame or array for the former, JSON for the latter. These considerations will have a transformative effect on all the code in this module.
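The dual output requirement can be sketched as follows. This is a minimal illustration with invented values, not the module's actual serialisation code: the same frame is kept tabular for the reader and serialised to JSON records for the model context.

```python
import pandas as pd

# Invented sample of the report data: one row per (location, code)
df = pd.DataFrame({
    "location": ["aabach", "aabach"],
    "code": ["G27", "G30"],
    "pcs/m": [0.33, 0.12],
})

# For the reader: the tabular display is the DataFrame itself
table_view = df

# For the model: one JSON object per record (a hypothetical serialisation choice)
json_view = df.to_json(orient="records")
print(json_view)
```

The `orient="records"` form keeps each survey row self-describing, which is convenient when rows are injected individually into a prompt context.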

PREVIOUS

November 20, 2023: There is a known issue we are working on now. Remind the assistant to follow instructions, specifically in the following cases:

  1. Always getting a value of zero for the median sample total

    • The GPT has specific instructions on this

  2. Tells you the correct columns are not available

    • The GPT has the column names and definitions from this page

The data has a two-column index, something the GPT does not always recognize. An issue has been submitted.
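A minimal sketch of the two-column (MultiIndex) situation, with invented values, assuming an index of (location, date). Flattening the index with `reset_index()` turns both levels into ordinary columns, which is easier for the assistant to interpret:

```python
import pandas as pd

# A frame indexed by two columns: location and date
idx = pd.MultiIndex.from_tuples(
    [("aabach", "2020-10-22"), ("aabach", "2020-11-05")],
    names=["location", "date"],
)
df = pd.DataFrame({"pcs/m": [0.0, 1.2]}, index=idx)

# reset_index() makes the two index levels explicit columns
flat = df.reset_index()
print(flat.columns.tolist())  # ['location', 'date', 'pcs/m']
```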

Note

The assistant's role is to provide mathematical and graphical representations of the data in response to the researcher's request. This often involves aggregating values at different levels, combining attributes and the like.

This page allows all users to verify that these complex transactions are happening correctly. The GPT may not use the same method to calculate the final result, but the results should be the same.
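For example, the median sample total (the subject of known issue 1 above) depends on aggregating to the sample before taking the median. A sketch with invented numbers shows why the order of operations matters: each sample lists many object codes, most of them zero, so a row-level median is misleading.

```python
import pandas as pd

# Invented data: two samples, two codes each
df = pd.DataFrame({
    "sample_id": ["s1", "s1", "s2", "s2"],
    "code": ["G27", "G30", "G27", "G30"],
    "pcs/m": [0.4, 0.0, 1.0, 0.6],
})

# Correct: aggregate to the sample first, then take the median
sample_totals = df.groupby("sample_id")["pcs/m"].sum()
print(sample_totals.median())  # 1.0, the median of the sample totals 0.4 and 1.6

# The naive row-level median differs
print(df["pcs/m"].median())  # 0.5
```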

Default data of hammerdirt GPT:#

beta version = 0.01

The default data for the GPT can be reproduced on the command line if the hammerdirtgpt package is installed:

# Collecting required data to establish a report
# This includes the language maps for all the common
# abbreviations and columns or index labels.
c_l = r_class.language_maps()

# The survey data in units of pcs/m or pcs/m². The reports
# are aggregated first to the sample_id, which means that the operations
# are the same whether using pcs/m or pcs/m².
surveys = r_class.collect_survey_data_for_report()

# The support or environmental data. This includes plain text descriptions
# of the codes, details for each survey location, and topographical data
# extracted from the buffer around each survey location.
codes, beaches, land_cover, land_use, streets, river_intersect_lakes = r_class.collect_env_data_for_report()

# Add columns to the survey data. The support data contains information that can be used to
# group objects or survey locations that may not be stored with the survey data. In this
# example an administrative label is attached to each survey_id: the cantonal label.
survey_data = surveys.merge(beaches['canton'], left_on='slug', right_index=True, validate='many_to_one')
# survey_data = survey_data.loc[survey_data.code == 'G27'].copy()
survey_data = survey_data[survey_data.feature_name != 'aare'].copy()

# ! USER DEFINED INPUT
# Temporal and geographic boundaries.
boundaries = dict(feature_type ='l', language='fr', start_date='2015-01-01', end_date='2022-01-01')
# Make the report data and report
top_label, language, w_df, w_di = r_class.report_data(boundaries, survey_data.copy(), beaches, codes)
a_report = r_class.ReportClass(w_df, boundaries, top_label, 'fr', c_l)
w_df_locations = w_df.slug.unique()

# call the land use class on the two different location groups
m_ui = LandUse(land_cover, land_use, streets, w_df_locations)

# for the region of interest
lcui = m_ui.use_of_land.copy()
lc_sti, no_datai = match_topo_attributes_to_surveys(lcui, a_report.w_df)

# the basic work data contains the survey results and the 
# topographical data merged on the <slug> column
work_data_i = merge_topodata_and_surveydata(lc_sti, a_report.w_df)

new_names = {
    'slug':'location',
    'loc_date':'sample_id',
    'pcs_m':'pcs/m',
    'Obstanlage': "orchards",
    'Reben':'vineyards',
    'Siedl':'buildings',
    'Wald':'forest',
    'land_use':'public services'
}
gptdf = work_data_i.rename(columns=new_names)

The preceding code produces the following table:

location sample_id date feature_name parent_boundary city canton pcs/m quantity code feature_type orchards vineyards buildings forest undefined public services streets
0 aabach ('aabach', '2020-10-22') 2020-10-22 zurichsee linth Schmerikon St. Gallen 0.0 0 G1 l 0.0 0.0 0.204522 0.535456 0.561892 0.018241 0.207248
1 aabach ('aabach', '2020-10-22') 2020-10-22 zurichsee linth Schmerikon St. Gallen 0.0 0 G10 l 0.0 0.0 0.204522 0.535456 0.561892 0.018241 0.207248
2 aabach ('aabach', '2020-10-22') 2020-10-22 zurichsee linth Schmerikon St. Gallen 0.0 0 G100 l 0.0 0.0 0.204522 0.535456 0.561892 0.018241 0.207248
3 aabach ('aabach', '2020-10-22') 2020-10-22 zurichsee linth Schmerikon St. Gallen 0.0 0 G101 l 0.0 0.0 0.204522 0.535456 0.561892 0.018241 0.207248
4 aabach ('aabach', '2020-10-22') 2020-10-22 zurichsee linth Schmerikon St. Gallen 0.0 0 G102 l 0.0 0.0 0.204522 0.535456 0.561892 0.018241 0.207248

Hand the file to the assistant#

Add language definitions#

The language definitions ensure an efficient transmission of intent from the observer to the model. We could leave the translations and definitions to a translator and thus reduce the weight of the .csv file or API request. However, this would require an additional service call by the client to get the requested information translated. Providing the definitions according to the standard set in the federal report is a good baseline. If there is support amongst stakeholders to change the definitions, this can be handled by a pull request or by raising an issue on the repo.

gptdf['fr'] = gptdf.code.map(lambda x: codes.loc[x, 'fr'])
gptdf['en'] = gptdf.code.map(lambda x: codes.loc[x, 'en'])
gptdf['de'] = gptdf.code.map(lambda x: codes.loc[x, 'de'])

gptdf.to_csv('data/in_process/lakes.csv', index=False)
gptdfx.head()
sample_id location date feature_name parent_boundary city canton feature_type orchards vineyards ... forest undefined public services streets code pcs/m quantity fr en de
0 ('aabach', '2020-10-22') aabach 2020-10-22 zurichsee linth Schmerikon St. Gallen l 0.0 0.0 ... 0.535456 0.561892 0.018241 0.207248 G1 0.0 0 Anneaux pour six packs Six pack rings Sixpack-Ringe
1 ('aabach', '2020-10-22') aabach 2020-10-22 zurichsee linth Schmerikon St. Gallen l 0.0 0.0 ... 0.535456 0.561892 0.018241 0.207248 G10 0.0 0 Emballage fast food Food containers single use foamed or plastic Lebensmittelbehälter zum einmaligen Gebrauch a...
2 ('aabach', '2020-10-22') aabach 2020-10-22 zurichsee linth Schmerikon St. Gallen l 0.0 0.0 ... 0.535456 0.561892 0.018241 0.207248 G100 0.0 0 Médical conteneurs/tubes/ emballages Medical; containers/tubes/ packaging Medizin; Behälter/Röhrchen/Verpackungen
3 ('aabach', '2020-10-22') aabach 2020-10-22 zurichsee linth Schmerikon St. Gallen l 0.0 0.0 ... 0.535456 0.561892 0.018241 0.207248 G101 0.0 0 Sac pour déjections canines Dog feces bag Robidog Hundekot-Säcklein, andere Hundekotsäck...
4 ('aabach', '2020-10-22') aabach 2020-10-22 zurichsee linth Schmerikon St. Gallen l 0.0 0.0 ... 0.535456 0.561892 0.018241 0.207248 G102 0.0 0 Tongs Flip-flops Flip-Flops

5 rows × 21 columns

gptdfx[gptdfx.code == 'G79']
sample_id location date feature_name parent_boundary city canton feature_type orchards vineyards ... forest undefined public services streets code pcs/m quantity fr en de

0 rows × 21 columns

gptdfx.columns
Index(['sample_id', 'location', 'date', 'feature_name', 'parent_boundary',
       'city', 'canton', 'feature_type', 'orchards', 'vineyards', 'buildings',
       'forest', 'undefined', 'public services', 'streets', 'code', 'pcs/m',
       'quantity', 'fr', 'en', 'de'],
      dtype='object')

Column names and definitions#

These column names and definitions are given to the GPT assistant.

  1. location: the name of the location used by people doing the survey

  2. sample_id: the combination of the location and date, the unique identifier of a sampling event

  3. date: the date of the sample

  4. feature_name: the name of the park, lake, or river where the sample was collected

  5. parent_boundary: a designated survey area, usually a river basin or regional label

  6. city: the municipality where the sample was taken

  7. canton: the canton where the sample was taken

  8. pcs/m: the number of objects identified by the column code collected at the sampling event divided by the length of shoreline, river bank or trail that was sampled.

  9. quantity: the number of objects identified by the column code collected at the sampling event

  10. code: the Marine Litter Watch object code

  11. feature_type: identifies the sample location as either a park, lake or river

  12. orchards: % of dry land attributed to this land-use within 1’500 m of the survey location

  13. vineyards: % of dry land attributed to this land-use within 1’500 m of the survey location

  14. buildings: % of dry land attributed to this land-use within 1’500 m of the survey location

  15. forest: % of dry land attributed to this land-use within 1’500 m of the survey location

  16. undefined: % of dry land with no land-use label

  17. public services: % of dry land attributed to hospitals, schools, sports, administration

  18. streets: the number of meters of streets within 1’500 m of the survey location, scaled between 0 and 1

  19. fr: French code definitions

  20. en: English code definitions

  21. de: German code definitions
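The streets column is scaled to the 0–1 range. A minimal sketch of that scaling using scikit-learn's MinMaxScaler (imported in the first cell of this page); the raw meter values here are invented, and the actual pipeline may scale differently:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw street lengths: meters of streets within 1 500 m of each location
df = pd.DataFrame({"streets": [12_000.0, 48_000.0, 30_000.0]})

# Rescale so the minimum maps to 0 and the maximum maps to 1
scaler = MinMaxScaler()
df["streets_scaled"] = scaler.fit_transform(df[["streets"]])
print(df["streets_scaled"].tolist())  # [0.0, 1.0, 0.5]
```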

Note

The GPT will go through data exploration at the beginning of the chat. These column definitions are given to the GPT and can be requested at any time. The definitions the GPT gives you should be very close to these; if they are not, tell the GPT to use the provided definitions in its instructions, and these definitions should come back.

Verifying the output#

Test statistics#

Asking for each of these individually or telling the assistant to produce them all should yield the following results:

  • the median sample total of the data frame

  • the total quantity

  • the number of lakes

  • the number of samples

  • the number of cantons

  • the number of cities

# geo_features: a list of grouping columns defined earlier in the notebook (not shown on this page)
gp_dt = gptdfx.groupby(['sample_id', *geo_features], as_index=False).agg({'pcs/m':'sum', 'quantity':'sum'})

lakes = gptdfx[gptdfx.feature_type == 'l'].feature_name.nunique()
cities = gptdfx.city.nunique()
quantity = gptdfx.quantity.sum()
samples = gptdfx.sample_id.nunique()
cantons = gptdfx.canton.nunique()
pc_med = gp_dt['pcs/m'].median()

test_1 = dict(lakes=lakes, cities=cities, quantity=quantity, samples=samples, cantons=cantons, median_pcs_m = pc_med)
print(test_1)
{'lakes': 16, 'cities': 67, 'quantity': 146066, 'samples': 753, 'cantons': 14, 'median_pcs_m': 2.77}

Most common#

The most common codes are those codes that are either in the top ten by quantity or present in at least 50% of the surveys.
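The selection rule can be sketched as follows, with invented data and assuming one row per (sample_id, code): take the union of the top ten codes by total quantity and the codes found in at least 50% of the samples (a fail rate of at least 0.5). This is an illustration of the rule, not the ReportClass implementation.

```python
import pandas as pd

# Invented data: two samples, two codes
df = pd.DataFrame({
    "sample_id": ["s1", "s1", "s2", "s2"],
    "code": ["G27", "G95", "G27", "G95"],
    "quantity": [12, 0, 7, 3],
})

# Top ten codes by total quantity
totals = df.groupby("code")["quantity"].sum()
top_ten = set(totals.nlargest(10).index)

# Fail rate: fraction of samples in which the code was found at least once
n_samples = df["sample_id"].nunique()
fail_rate = df[df["quantity"] > 0].groupby("code")["sample_id"].nunique() / n_samples
frequent = set(fail_rate[fail_rate >= 0.5].index)

# Most common = union of the two criteria
most_common = top_ten | frequent
print(sorted(most_common))  # ['G27', 'G95']
```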

most_common, weight = a_report.most_common
most_common
quantity % pcs_m fail rate
G27 29033 0.198674 0.33 0.904636
Gfrags 17073 0.116831 0.04 0.905960
Gfoams 14989 0.102570 0.00 0.754967
G208 10445 0.071475 0.00 0.328477
G30 8931 0.061115 0.12 0.845033
G67 6926 0.047395 0.06 0.650331
Gcaps 5382 0.036829 0.00 0.750993
G95 4610 0.031546 0.00 0.482119
G200 4232 0.028960 0.01 0.508609
G178 2558 0.017504 0.03 0.607947
G156 2190 0.014986 0.00 0.429139
G177 1447 0.009902 0.01 0.525828

Aggregating samples#

Sample total pcs/m#

gp_dt['pcs/m'].describe()
count    753.000000
mean       5.628738
std        9.367900
min        0.040000
25%        1.220000
50%        2.770000
75%        6.000000
max       77.100000
Name: pcs/m, dtype: float64

Single code#

cigarette ends

gp_dtcode = gptdfx[gptdfx.code.isin(['G27'])].groupby(['sample_id', *geo_features], as_index=False).agg({'pcs/m':'sum', 'quantity':'sum'})
gp_dtcode['pcs/m'].describe()
count    753.000000
mean       0.907304
std        1.605537
min        0.000000
25%        0.090000
50%        0.340000
75%        1.040000
max       19.700000
Name: pcs/m, dtype: float64

Combining codes#

combining cigarette ends and snack wrappers

gp_dtcodes = gptdfx[gptdfx.code.isin(['G27', 'G30'])].groupby(['sample_id', *geo_features], as_index=False).agg({'pcs/m':'sum', 'quantity':'sum'})
gp_dtcodes['pcs/m'].describe()
count    753.000000
mean       1.238433
std        2.035005
min        0.000000
25%        0.210000
50%        0.550000
75%        1.450000
max       23.270000
Name: pcs/m, dtype: float64

Single feature#

the results on Bielersee

gp_dtbsee = gptdfx[gptdfx.feature_name == 'bielersee'].groupby(['sample_id', *geo_features], as_index=False).agg({'pcs/m':'sum', 'quantity':'sum'})
gp_dtbsee['pcs/m'].describe()
count    51.000000
mean      4.023725
std       2.995087
min       0.400000
25%       1.450000
50%       3.380000
75%       5.470000
max      14.800000
Name: pcs/m, dtype: float64

Combined features#

Bielersee and Thunersee

gp_dtbt = gptdfx[gptdfx.feature_name.isin(['bielersee', 'thunersee'])].groupby(['sample_id', *geo_features], as_index=False).agg({'pcs/m':'sum', 'quantity':'sum'})
gp_dtbt['pcs/m'].describe()
count    94.000000
mean      2.738085
std       2.682369
min       0.160000
25%       0.862500
50%       1.685000
75%       3.477500
max      14.800000
Name: pcs/m, dtype: float64

Land use#

Correlation matrix of the land use variables with each other

corrs = gp_dtbt[geo_features[1:-1]].corr()
corrs
vineyards orchards buildings forest undefined public services streets
vineyards 1.000000 -0.088721 -0.174585 -0.087445 0.057518 -0.110552 0.101623
orchards -0.088721 1.000000 -0.195273 0.586422 -0.335178 -0.220276 -0.185527
buildings -0.174585 -0.195273 1.000000 -0.249992 -0.699952 0.833019 0.631690
forest -0.087445 0.586422 -0.249992 1.000000 -0.487967 -0.077841 -0.129156
undefined 0.057518 -0.335178 -0.699952 -0.487967 1.000000 -0.664196 -0.487873
public services -0.110552 -0.220276 0.833019 -0.077841 -0.664196 1.000000 0.831108
streets 0.101623 -0.185527 0.631690 -0.129156 -0.487873 0.831108 1.000000
Author: hammerdirt-analyst

conda environment: cantonal_report

pandas: 2.0.3