Delete models/spabert/notebooks

- models/spabert/notebooks/GAN-SpaBERT_pytorch.ipynb +0 -0
- models/spabert/notebooks/README.md +0 -167
- models/spabert/notebooks/Setup.ipynb +0 -0
- models/spabert/notebooks/SpaBertEmbeddingTest1.ipynb +0 -0
- models/spabert/notebooks/WHGDataset.py +0 -77
- models/spabert/notebooks/Working with SpaBERT Embedding.ipynb +0 -0
- models/spabert/notebooks/__pycache__/WHGDataset.cpython-310.pyc +0 -0
- models/spabert/notebooks/spabert-entity-linking.ipynb +0 -287
- models/spabert/notebooks/spabert-fine-tuning.ipynb +0 -262
- models/spabert/notebooks/tutorial_datasets/mlm_mem_keeppos_ep0_iter06000_0.2936.pth +0 -3
- models/spabert/notebooks/tutorial_datasets/osm_mn.csv +0 -0
- models/spabert/notebooks/tutorial_datasets/output.csv.json +0 -3
- models/spabert/notebooks/tutorial_datasets/spabert-base-uncased-finetuned-osm-mn.pth +0 -3
- models/spabert/notebooks/tutorial_datasets/spabert_osm_mn.json +0 -3
- models/spabert/notebooks/tutorial_datasets/spabert_whg_wikidata.json +0 -3
- models/spabert/notebooks/tutorial_datasets/spabert_wikidata_sampled.json +0 -3
models/spabert/notebooks/GAN-SpaBERT_pytorch.ipynb
DELETED
The diff for this file is too large to render. See raw diff.
models/spabert/notebooks/README.md
DELETED
@@ -1,167 +0,0 @@

# Tutorials for Testing and Fine-Tuning SpaBERT

This repository provides two Jupyter Notebooks: one for testing entity linking (one of SpaBERT's downstream tasks) and one for the fine-tuning procedure used to train on geo-entities from other knowledge bases (e.g., [World Historical Gazetteer](https://whgazetteer.org/)).

1. The first step is cloning the SpaBERT repository onto your machine. Run the following command to do this:

`git clone https://github.com/zekun-li/spabert.git`

2. You will need the IPython kernel for Jupyter installed before running the code in this tutorial. Run the following command to ensure ipykernel is installed:

`pip install ipykernel`

3. Before starting the Jupyter notebooks, run the following command to make sure you have all required packages:

`pip install -r requirements.txt`

The requirements.txt file is located in the spabert directory.

```
- spabert
  | - datasets
  | - experiments
  | - models
  | - notebooks
  | - utils
  | - __init__.py
  | - README.md
  | - requirements.txt
  | - train_mlm.py
```

## Installing Model Weights

Make sure you have git-lfs installed (https://git-lfs.com for Windows & macOS; https://github.com/git-lfs/git-lfs/blob/main/INSTALLING.md for Linux).

Please run the following commands separately, in order, to install the pre-trained and fine-tuned model weights:

`git lfs install`

`git clone https://huggingface.co/knowledge-computing-lab/spabert-base-uncased`

`git clone https://huggingface.co/knowledge-computing-lab/spabert-base-uncased-finetuned-osm-mn`

Once the model weights are installed, you'll see two files: `mlm_mem_keeppos_ep0_iter06000_0.2936.pth` and `spabert-base-uncased-finetuned-osm-mn.pth`.
Move these files to the tutorial_datasets folder. After moving them, the file structure should look like this:

```
- notebooks
  | - tutorial_datasets
  | | - mlm_mem_keeppos_ep0_iter06000_0.2936.pth
  | | - osm_mn.csv
  | | - spabert_osm_mn.json
  | | - spabert_whg_wikidata.json
  | | - spabert_wikidata_sampled.json
  | | - spabert-base-uncased-finetuned-osm-mn.pth
  | - README.md
  | - spabert-entity-linking.ipynb
  | - spabert-fine-tuning.ipynb
  | - WHGDataset.py
```
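
As a quick sanity check (a minimal sketch, assuming the files are already in `tutorial_datasets/`), you can confirm the weights downloaded as real checkpoints rather than as git-lfs pointer stubs:

```python
import torch

# A real LFS download is ~500 MB; a pointer stub is only a few bytes
# and torch.load will fail on it.
state_dict = torch.load(
    'tutorial_datasets/mlm_mem_keeppos_ep0_iter06000_0.2936.pth',
    map_location='cpu')
print(len(state_dict), 'tensors in checkpoint')  # expect a few hundred entries
```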

## Jupyter Notebook Descriptions

### [spabert-fine-tuning.ipynb](https://github.com/Jina-Kim/spabert/blob/main/notebooks/spabert-fine-tuning.ipynb)

This Jupyter Notebook shows how to fine-tune SpaBERT using point data from OpenStreetMap (OSM) in Minnesota. SpaBERT is pre-trained on OSM point data from California and London. Instructions for pre-training your own model can be found on the SpaBERT GitHub.
Here are the steps to run:

1. Define which dataset you want to use (e.g., OSM in New York or Minnesota)
2. Read data from the CSV file and construct a KDTree for computing nearest neighbors (see the sketch after this list)
3. Create the dataset for fine-tuning SpaBERT from the KDTree, using the dataset you chose
4. Load the pre-trained model
5. Load the dataset using the SpaBERT data loader
6. Train the model for 1 epoch using the fine-tuning model and save it
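
As a minimal sketch of step 2, using a few hypothetical coordinates (the notebook itself builds the tree from `osm_mn.csv`):

```python
import scipy.spatial as scp

# (lng, lat) pairs for a few hypothetical places
coordinate_list = [[-92.1215, 46.7729], [-95.7570, 44.5269], [-93.2650, 44.9778]]
tree = scp.KDTree(coordinate_list)

# query with k=2: the nearest hit of a stored point is the point itself,
# so the second hit is its closest *other* neighbor
distances, indices = tree.query([[-92.12, 46.77]], k=2)
print(indices[0])
```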
### [spabert-entity-linking.ipynb](https://github.com/Jina-Kim/spabert/blob/main/notebooks/spabert-entity-linking.ipynb)

This Jupyter Notebook shows how to create an entity-linking dataset and how to perform entity linking using SpaBERT. The dataset used here is a pre-matched dataset between World Historical Gazetteer (WHG) and Wikidata. The model is evaluated with Hits@K and Mean Reciprocal Rank (MRR).
Here are the steps to run:

1. Load the fine-tuned model from the previous Jupyter notebook
2. Load the datasets using the WHG data loader
3. Calculate embeddings for WHG and Wikidata entities using SpaBERT, ranking candidates by embedding similarity (see the sketch after this list)
4. Calculate Hits@1, Hits@5, Hits@10, and MRR
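
The ranking in steps 3 and 4 boils down to cosine similarity between embeddings; a minimal sketch with random stand-in vectors:

```python
import numpy as np
import scipy.spatial as sp

rng = np.random.default_rng(0)
candidate_embs = rng.normal(size=(5, 768))  # stand-ins for Wikidata embeddings
query_emb = rng.normal(size=(1, 768))       # stand-in for one WHG embedding

# cosine similarity = 1 - cosine distance; higher means closer
sim = 1 - sp.distance.cdist(candidate_embs, query_emb, 'cosine')
ranking = np.argsort(-sim[:, 0])            # candidate indices, best first
print(ranking)
```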
## Dataset Descriptions

There are two types of tutorial datasets used for fine-tuning SpaBERT: CSV and JSON files.

- CSV file - sample taken from OpenStreetMap (OSM)
  - Minnesota State `./tutorial_datasets/osm_mn.csv`

An example data structure:

| row_id | name | latitude | longitude |
| ------ | ---- | -------- | --------- |
| 0 | Duluth | -92.1215 | 46.7729 |
| 1 | Green Valley | -95.757 | 44.5269 |

- JSON files - ready-to-use files for SpaBERT's data loader - [SpatialDataset](../datasets/dataset_loader.py)
  - OSM Minnesota State `./tutorial_datasets/spabert_osm_mn.json`
    - Generated from `./tutorial_datasets/osm_mn.csv` using spabert-fine-tuning.ipynb
  - WHG `./tutorial_datasets/spabert_whg_wikidata.json`
    - Geo-entities from WHG that link to Wikidata
  - Wikidata `./tutorial_datasets/spabert_wikidata_sampled.json`
    - Sampled from entities delivered by WHG. These entities were linked between WHG and Wikidata by WHG prior to being delivered to us.

The file contains one JSON object per line. Each JSON object describes the spatial context of an entity using nearby entities.
A sample JSON object looks like the following:

```json
{
  "info":{
    "name":"Duluth",
    "geometry":{
      "coordinates":[
        46.7729,
        -92.1215
      ]
    }
  },
  "neighbor_info":{
    "name_list":[
      "Duluth",
      "Chinese Peace Belle and Garden",
      ...
    ],
    "geometry_list":[
      {
        "coordinates":[
          46.7729,
          -92.1215
        ]
      },
      {
        "coordinates":[
          46.7770,
          -92.1241
        ]
      },
      ...
    ]
  }
}
```
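
Since each line is a standalone JSON object (JSON Lines), the file can be inspected with a plain loop; a minimal sketch, assuming `spabert_osm_mn.json` is in place:

```python
import json

with open('tutorial_datasets/spabert_osm_mn.json') as f:
    for line in f:
        record = json.loads(line)
        name = record['info']['name']
        n_neighbors = len(record['neighbor_info']['name_list'])
        print(f'{name}: {n_neighbors} neighbors')
        break  # just inspect the first record
```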
To perform entity linking with SpaBERT, you must have a dataset structured similarly to the format above, with the addition of a `qid` field for the ground-truth Wikidata entity.

A sample JSON object looks like the following:

```json
{
  "info":{
    "name":"Duluth",
    "geometry":{
      "coordinates":[
        46.7729,
        -92.1215
      ]
    },
    "qid":"Q485708"
  },
  "neighbor_info":{
    ...
  }
}
```
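
Converting a fine-tuning record into an entity-linking record is just a matter of attaching the QID; a minimal sketch reusing the Duluth example (neighbor lists left empty for brevity):

```python
import json

# fine-tuning-style record, as produced by spabert-fine-tuning.ipynb
place = {"info": {"name": "Duluth",
                  "geometry": {"coordinates": [46.7729, -92.1215]}},
         "neighbor_info": {"name_list": [], "geometry_list": []}}

# attach the ground-truth Wikidata QID to make it an entity-linking record
place["info"]["qid"] = "Q485708"
print(json.dumps(place))
```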
models/spabert/notebooks/Setup.ipynb
DELETED
The diff for this file is too large to render. See raw diff.

models/spabert/notebooks/SpaBertEmbeddingTest1.ipynb
DELETED
The diff for this file is too large to render. See raw diff.
models/spabert/notebooks/WHGDataset.py
DELETED
@@ -1,77 +0,0 @@

```python
import json
import sys

sys.path.append("../")
from datasets.dataset_loader import SpatialDataset
from transformers import BertTokenizer


class WHGDataset(SpatialDataset):
    # initializes the dataset loader and reads the dataset into a Python object
    def __init__(self, data_file_path, tokenizer=None, max_token_len=512,
                 distance_norm_factor=1, spatial_dist_fill=100,
                 sep_between_neighbors=False):
        if tokenizer is None:
            self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        else:
            self.tokenizer = tokenizer
        self.read_data(data_file_path)
        self.max_token_len = max_token_len
        self.distance_norm_factor = distance_norm_factor
        self.spatial_dist_fill = spatial_dist_fill
        self.sep_between_neighbors = sep_between_neighbors

    # returns a specific item from the dataset given an index
    def __getitem__(self, idx):
        return self.load_data(idx)

    # returns the length of the loaded dataset
    def __len__(self):
        return self.len_data

    # returns the mean absolute lat/lng offset between an entity and its neighbors
    def get_average_distance(self, idx):
        line = self.data[idx]
        line_data_dict = json.loads(line)
        pivot_pos = line_data_dict['info']['geometry']['coordinates']

        neighbor_geom_list = line_data_dict['neighbor_info']['geometry_list']
        lat_diff = 0
        lng_diff = 0
        for neighbor in neighbor_geom_list:
            coordinates = neighbor['coordinates']
            lat_diff = lat_diff + abs(pivot_pos[0] - coordinates[0])
            lng_diff = lng_diff + abs(pivot_pos[1] - coordinates[1])
        avg_lat_diff = lat_diff / len(neighbor_geom_list)
        avg_lng_diff = lng_diff / len(neighbor_geom_list)
        return (avg_lat_diff, avg_lng_diff)

    # reads the dataset from the given file path; run on initialization
    def read_data(self, data_file_path):
        with open(data_file_path, 'r') as f:
            data = f.readlines()

        self.len_data = len(data)
        self.data = data

    # loads and parses one record of the dataset
    def load_data(self, idx):
        line = self.data[idx]
        line_data_dict = json.loads(line)

        # get pivot info
        pivot_name = str(line_data_dict['info']['name'])
        pivot_pos = line_data_dict['info']['geometry']['coordinates']

        # get neighbor info
        neighbor_info = line_data_dict['neighbor_info']
        neighbor_name_list = neighbor_info['name_list']
        neighbor_geom_list = neighbor_info['geometry_list']

        parsed_data = self.parse_spatial_context(pivot_name, pivot_pos,
                                                 neighbor_name_list,
                                                 neighbor_geom_list,
                                                 self.spatial_dist_fill)
        parsed_data['qid'] = line_data_dict['info']['qid']

        return parsed_data
```
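
A minimal usage sketch (the constructor arguments mirror the ones the entity-linking notebook below passes for the WHG dataset):

```python
from transformers import BertTokenizer
from WHGDataset import WHGDataset

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
whg_dataset = WHGDataset(
    data_file_path='tutorial_datasets/spabert_whg_wikidata.json',
    tokenizer=tokenizer,
    max_token_len=512,
    distance_norm_factor=25,
    spatial_dist_fill=100)

print(len(whg_dataset))   # number of records
sample = whg_dataset[0]   # parsed spatial context for the first entity
print(sample['pivot_name'], sample['qid'])
```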
models/spabert/notebooks/Working with SpaBERT Embedding.ipynb
DELETED
The diff for this file is too large to render. See raw diff.

models/spabert/notebooks/__pycache__/WHGDataset.cpython-310.pyc
DELETED
Binary file (2.49 kB)
models/spabert/notebooks/spabert-entity-linking.ipynb
DELETED
@@ -1,287 +0,0 @@

Cell 1 (code): load the SpaBERT model and the fine-tuned weights.

```python
import sys
from transformers import BertTokenizer
from transformers.models.bert.modeling_bert import BertForMaskedLM
import torch
from WHGDataset import WHGDataset

sys.path.append("../")
from datasets.usgs_os_sample_loader import USGS_MapDataset
from datasets.wikidata_sample_loader import Wikidata_Geocoord_Dataset, Wikidata_Random_Dataset
from models.spatial_bert_model import SpatialBertModel
from models.spatial_bert_model import SpatialBertConfig
from models.spatial_bert_model import SpatialBertForMaskedLM
from utils.find_closest import find_ref_closest_match, sort_ref_closest_match
from utils.common_utils import load_spatial_bert_pretrained_weights, get_spatialbert_embedding, get_bert_embedding, write_to_csv
from utils.baseline_utils import get_baseline_model


# load our SpaBERT model
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

config = SpatialBertConfig()
model = SpatialBertModel(config)

model.to(device)
model.eval()

# load the weights produced by spabert-fine-tuning.ipynb; keys in the
# checkpoint carry a 'bert.' prefix, so match them against the bare model keys
pre_trained_model = torch.load('tutorial_datasets/fine-spabert-base-uncased-finetuned-osm-mn.pth')
cnt_layers = 0
model_keys = model.state_dict()
for key in model_keys:
    if 'bert.' + key in pre_trained_model:
        model_keys[key] = pre_trained_model['bert.' + key]
        cnt_layers += 1
    else:
        print("No weight for", key)
print(cnt_layers, 'layers loaded')

model.load_state_dict(model_keys)
```
Cell 2 (code): load the entity-linking datasets.

```python
# load entity-linking datasets

sep_between_neighbors = False
wikidata_dict_per_map = {}
wikidata_dict_per_map['wikidata_emb_list'] = []
wikidata_dict_per_map['wikidata_qid_list'] = []
wikidata_dict_per_map['names'] = []


whg_dataset = WHGDataset(
    data_file_path='tutorial_datasets/spabert_whg_wikidata.json',
    tokenizer=tokenizer,
    max_token_len=512,
    distance_norm_factor=25,
    spatial_dist_fill=100,
    sep_between_neighbors=sep_between_neighbors)

wikidata_dataset = WHGDataset(
    data_file_path='tutorial_datasets/spabert_wikidata_sampled.json',
    tokenizer=tokenizer,
    max_token_len=512,
    distance_norm_factor=50000,
    spatial_dist_fill=20,
    sep_between_neighbors=sep_between_neighbors)


matched_wikid_dataset = []
for i in range(len(wikidata_dataset)):
    emb = wikidata_dataset[i]
    matched_wikid_dataset.append(emb)
    max_dist_lng = max(emb['norm_lng_list'])
    max_dist_lat = max(emb['norm_lat_list'])
```
Cell 3 (code): compute embeddings and rank candidates (entity linking).

```python
import sys
sys.path.append('../')
from experiments.entity_matching.data_processing import request_wrapper
import scipy.spatial as sp
import numpy as np
## ENTITY LINKING ##


# rank the candidate set for each query entity by embedding similarity
def disambiguify(model, model_name, usgs_dataset, wikidata_dict_list,
                 candset_mode='all_map', if_use_distance=True, select_indices=None):

    if select_indices is None:
        select_indices = range(0, len(wikidata_dict_list))

    assert candset_mode in ['all_map', 'per_map']
    wikidata_emb_list = wikidata_dict_list['wikidata_emb_list']
    wikidata_qid_list = wikidata_dict_list['wikidata_qid_list']
    ret_list = []
    for i in range(len(usgs_dataset)):
        if (i % 1000) == 0:
            print("disambiguify at " + str((i / len(usgs_dataset)) * 100) + "%")
        if model_name == 'spatial_bert-base' or model_name == 'spatial_bert-large':
            usgs_emb = get_spatialbert_embedding(usgs_dataset[i], model, use_distance=if_use_distance)
        else:
            usgs_emb = get_bert_embedding(usgs_dataset[i], model)
        # cosine similarity between the query embedding and every candidate embedding
        sim_matrix = 1 - sp.distance.cdist(np.array(wikidata_emb_list), np.array([usgs_emb]), 'cosine')
        closest_match_qid = sort_ref_closest_match(sim_matrix, wikidata_qid_list)

        sorted_sim_matrix = np.sort(sim_matrix, axis=0)[::-1]  # descending order

        ret_dict = dict()
        ret_dict['pivot_name'] = usgs_dataset[i]['pivot_name']

        ret_dict['sorted_match_qid'] = [a[0] for a in closest_match_qid]
        ret_dict['sorted_sim_matrix'] = [a[0] for a in sorted_sim_matrix]

        ret_list.append(ret_dict)

    return ret_list


candset_mode = 'all_map'
for i in range(0, len(matched_wikid_dataset)):
    if (i % 1000) == 0:
        print("processing at: " + str(i / len(matched_wikid_dataset) * 100) + "%")
    entity = matched_wikid_dataset[i]
    wikidata_emb = get_spatialbert_embedding(matched_wikid_dataset[i], model)
    wikidata_dict_per_map['wikidata_emb_list'].append(wikidata_emb)
    wikidata_dict_per_map['wikidata_qid_list'].append(matched_wikid_dataset[i]['qid'])
    wikidata_dict_per_map['names'].append(wikidata_dataset[i]['pivot_name'])

ret_list = disambiguify(model, 'spatial_bert-base', whg_dataset, wikidata_dict_per_map,
                        candset_mode=candset_mode, if_use_distance=True, select_indices=None)
write_to_csv('tutorial_datasets/', "output.csv", ret_list)
```
Cell 4 (code): evaluate entity linking with Hits@K.

```python
# Evaluate entity linking
import os
import pandas as pd
import json

# define the ground-truth file for evaluation
gt_dir = os.path.abspath("tutorial_datasets/spabert_wikidata_sampled.json")

# define the file where we wrote out predictions
prediction_path = os.path.abspath('tutorial_datasets/output.csv.json')

# build the ground-truth dictionary: entity name -> Wikidata QID
gt_dict = dict()

with open(gt_dir) as f:
    data = f.readlines()
    for line in data:
        d = json.loads(line)
        gt_dict[d['info']['name']] = d['info']['qid']


rank_list = []
hits_at_1 = 0
hits_at_5 = 0
hits_at_10 = 0
out_dict = {'title': [], 'rank': []}

with open(prediction_path) as f:
    data = f.readlines()
    for line in data:
        pred_dict = json.loads(line)
        pivot_name = pred_dict['pivot_name']
        sorted_matched_uri = pred_dict['sorted_match_qid']
        sorted_sim_matrix = pred_dict['sorted_sim_matrix']
        if pivot_name in gt_dict:
            gt_uri = gt_dict[pivot_name]
            rank = sorted_matched_uri.index(gt_uri) + 1
            if rank == 1:
                hits_at_1 += 1
            if rank <= 5:
                hits_at_5 += 1
            if rank <= 10:
                hits_at_10 += 1
            rank_list.append(rank)
            out_dict['title'].append(pivot_name)
            out_dict['rank'].append(rank)

# normalize hit counts into Hits@K rates
hits_at_1 = hits_at_1 / len(rank_list)
hits_at_5 = hits_at_5 / len(rank_list)
hits_at_10 = hits_at_10 / len(rank_list)

print(hits_at_1)
print(hits_at_5)
print(hits_at_10)

out_df = pd.DataFrame(out_dict)
out_df
```
Cell 5 (markdown):

Mean Reciprocal Rank is a statistical measure for evaluating processes that produce a list of possible responses to a query, ordered by probability of correctness.

First we obtain the rank from the ranked list shown above.

Next we calculate the reciprocal rank for each rank. The reciprocal is the inverse of the rank: for a rank of 1 the reciprocal rank is 1/1, and for a rank of 2 the reciprocal rank is 1/2.

The mean reciprocal rank is the average of the reciprocal ranks.

This measure gives us a general sense of how well our model predicts entities based on their embeddings.

An in-depth description of Mean Reciprocal Rank can be found here: https://en.wikipedia.org/wiki/Mean_reciprocal_rank

An important thing to keep in mind when calculating mean reciprocal rank is that it tends to scale inversely with the size of your candidate set.

Our candidate set has a length of 4624.
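
As a small worked example with hypothetical ranks (not taken from this notebook):

```python
# ranks 1, 2, and 10 give reciprocal ranks 1.0, 0.5, and 0.1
ranks = [1, 2, 10]
reciprocals = [1.0 / r for r in ranks]
mrr = sum(reciprocals) / len(reciprocals)
print(mrr)  # (1 + 0.5 + 0.1) / 3 = 0.533...
```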
Cell 6 (code): calculate the mean reciprocal rank (MRR).

```python
# calculating the mean reciprocal rank (MRR)
import numpy as np

reciprocal_list = [1. / rank for rank in rank_list]

MRR = np.mean(reciprocal_list)

print(MRR)
```

(Notebook metadata: kernel "ucgis23workshop", Python 3.11.3, nbformat 4.)
models/spabert/notebooks/spabert-fine-tuning.ipynb
DELETED
@@ -1,262 +0,0 @@

Cell 1 (code): point to the OSM data used for fine-tuning.

```python
import json
import pandas as pd

# LOCATION OF THE OSM DATA FOR FINE-TUNING
data = 'tutorial_datasets/osm_mn.csv'
```
Cell 2 (code): read the CSV and build a KDTree over the coordinates.

```python
## CONSTRUCT DATASET FOR FINE-TUNING ##

# Read data from the .csv data file
state_frame = pd.read_csv(data)


# construct lists of names and coordinates from the data
name_list = []
coordinate_list = []
for i, item in state_frame.iterrows():
    name = item[1]
    lat = item[2]
    lng = item[3]
    name_list.append(name)
    coordinate_list.append([lng, lat])


# construct a KDTree out of the coordinate list for when we make the neighbor lists
import scipy.spatial as scp

ordered_neighbor_coordinate_list = scp.KDTree(coordinate_list)
```
Cell 3 (code): inspect the data frame.

```python
state_frame
```

Cell 4 (code): get the 20 nearest neighbors of each entity and write the fine-tuning JSON.

```python
# Get the top 20 nearest neighbors for each entity in the dataset
with open('tutorial_datasets/SPABERT_finetuning_data.json', 'w') as out_f:
    for i, item in state_frame.iterrows():
        name = item[1]
        lat = item[2]
        lng = item[3]
        coordinates = [lng, lat]

        # k=21 because the nearest hit is the entity itself
        _, nearest_neighbors_idx = ordered_neighbor_coordinate_list.query([coordinates], k=21)

        # we want to store their names and coordinates
        nearest_neighbors_name = []
        nearest_neighbors_coords = []

        # iterate over the nearest-neighbors list
        for idx in nearest_neighbors_idx[0]:
            # get the name and coordinates of the neighbor
            neighbor_name = name_list[idx]
            neighbor_coords = coordinate_list[idx]
            nearest_neighbors_name.append(neighbor_name)
            nearest_neighbors_coords.append({"coordinates": neighbor_coords})

        # construct the neighbor-info dictionary for SpaBERT embedding construction
        neighbor_info = {"name_list": nearest_neighbors_name, "geometry_list": nearest_neighbors_coords}

        # construct the full dictionary object for SpaBERT embedding construction
        place = {"info": {"name": name, "geometry": {"coordinates": coordinates}}, "neighbor_info": neighbor_info}

        out_f.write(json.dumps(place))
        out_f.write('\n')
```
Cell 5 (code): load the pre-trained model components.

```python
### FINE-TUNE SPABERT
import sys
from transformers.models.bert.modeling_bert import BertForMaskedLM
from transformers import BertTokenizer
sys.path.append("../")
from models.spatial_bert_model import SpatialBertConfig
from utils.common_utils import load_spatial_bert_pretrained_weights
from models.spatial_bert_model import SpatialBertForMaskedLM

# load the dataset we just created
dataset = 'tutorial_datasets/SPABERT_finetuning_data.json'

# path to the pre-trained SpaBERT weights
pretrained_model = 'tutorial_datasets/mlm_mem_keeppos_ep0_iter06000_0.2936.pth'


# load the BERT model and tokenizer as well as the SpaBERT config
bert_model = BertForMaskedLM.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
config = SpatialBertConfig()
```

Cell 6 (code): initialize SpaBERT from BERT and load the pre-trained SpaBERT weights.

```python
# load the pre-trained SpaBERT model
import torch
model = SpatialBertForMaskedLM(config)

model.load_state_dict(bert_model.state_dict(), strict=False)

pre_trained_model = torch.load(pretrained_model)

model_keys = model.state_dict()
cnt_layers = 0
for key in model_keys:
    if key in pre_trained_model:
        model_keys[key] = pre_trained_model[key]
        cnt_layers += 1
    else:
        print("No weight for", key)
print(cnt_layers, 'layers loaded')

model.load_state_dict(model_keys)
```
Cell 7 (code): load the fine-tuning dataset with the data loader.

```python
from datasets.osm_sample_loader import PbfMapDataset
from torch.utils.data import DataLoader

# load the fine-tuning dataset with the data loader
fine_tune_dataset = PbfMapDataset(data_file_path=dataset,
                                  tokenizer=tokenizer,
                                  max_token_len=300,
                                  distance_norm_factor=0.0001,
                                  spatial_dist_fill=20,
                                  with_type=False,
                                  sep_between_neighbors=False,
                                  label_encoder=None,
                                  mode=None)
# initialize the data loader
train_loader = DataLoader(fine_tune_dataset, batch_size=12, num_workers=5,
                          shuffle=False, pin_memory=True, drop_last=True)
```
Cell 8 (code): move the model to a GPU if one is available.

```python
import torch
# cast our loaded model to a GPU if one is available, otherwise use the CPU
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

# set the model to training mode
model.train()
```

Cell 9 (code): run the fine-tuning loop for 1 epoch and save the result.

```python
### FINE-TUNING PROCEDURE ###
from tqdm import tqdm
from transformers import AdamW
# initialize the optimizer
optim = AdamW(model.parameters(), lr=5e-5)

# set up the loop with tqdm and the data loader
epoch = tqdm(train_loader, leave=True)
iter = 0
for batch in epoch:
    # reset gradients calculated in the previous step
    optim.zero_grad()

    # pull all tensor batches required for training
    input_ids = batch['masked_input'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    position_list_x = batch['norm_lng_list'].to(device)
    position_list_y = batch['norm_lat_list'].to(device)
    sent_position_ids = batch['sent_position_ids'].to(device)

    labels = batch['pseudo_sentence'].to(device)

    # get the model outputs
    outputs = model(input_ids, attention_mask=attention_mask, sent_position_ids=sent_position_ids,
                    position_list_x=position_list_x, position_list_y=position_list_y, labels=labels)

    # calculate the loss
    loss = outputs.loss

    # perform backpropagation
    loss.backward()

    optim.step()
    epoch.set_postfix({'loss': loss.item()})

    iter += 1
torch.save(model.state_dict(), "tutorial_datasets/fine-spabert-base-uncased-finetuned-osm-mn.pth")
```

(Notebook metadata: kernel "base", Python 3.11.3, nbformat 4.)
models/spabert/notebooks/tutorial_datasets/mlm_mem_keeppos_ep0_iter06000_0.2936.pth
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e591f9d4798bee6d0deb59dcbbaefb31f08fdc5c751b81e4b52b95ddca766b71
size 531897899

models/spabert/notebooks/tutorial_datasets/osm_mn.csv
DELETED
The diff for this file is too large to render. See raw diff.

models/spabert/notebooks/tutorial_datasets/output.csv.json
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5c5b0c617f0cf93c320a8d8a38a3af7f092ad54eebfeca866d2905a03bcae6f8
size 37566110

models/spabert/notebooks/tutorial_datasets/spabert-base-uncased-finetuned-osm-mn.pth
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7046e57275530ee40ec10a80ecb67977e6f6530dc935c6abf77cf1d56c3d0f9a
size 531904817

models/spabert/notebooks/tutorial_datasets/spabert_osm_mn.json
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0665091faef52166a44c6ef253af85c99e30f7a63150b4542d42768793a088f6
size 65595132

models/spabert/notebooks/tutorial_datasets/spabert_whg_wikidata.json
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bba735c975231c42f467285b4f17ce4ba58262557f2769b89c863b7f37302209
size 52811876

models/spabert/notebooks/tutorial_datasets/spabert_wikidata_sampled.json
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d13476f6583e96ebc7272af910f99decc062b4053f92cc927837c49a777e6e86
size 27841961