# FBMC Flow Forecasting MVP - Day 0 Quick Start Guide
## Environment Setup (45 Minutes)

**Target**: From zero to working local + HF Space environment with all dependencies verified

---

## Prerequisites Check (5 minutes)

Before starting, verify you have:

```bash
# Check Git
git --version
# Need: 2.x+

# Check Python
python3 --version
# Need: 3.10+
```

**API Keys & Accounts Ready:**
- [ ] ENTSO-E Transparency Platform API key
- [ ] Hugging Face account with payment method for Spaces
- [ ] Hugging Face write token (for uploading datasets)

**Important Data Storage Philosophy:**
- **Code** → Git repository (small, version controlled)
- **Data** → HuggingFace Datasets (separate, not in Git)
- **NO Git LFS** needed (following data science best practices)
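
For illustration, consuming one of those datasets later looks like this (a minimal sketch; the dataset name is a placeholder until Day 1 publishes the real ones):

```python
# Minimal sketch: code lives in Git, data is pulled at runtime from HF Datasets.
# "YOUR_USERNAME/fbmc-cnecs-2023-2025" is a placeholder dataset name.
from datasets import load_dataset

ds = load_dataset("YOUR_USERNAME/fbmc-cnecs-2023-2025", split="train")
print(ds)
```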

---

## Step 1: Create Hugging Face Space (10 minutes)

1. **Navigate to**: https://huggingface.co/new-space

2. **Configure Space:**
   - **Owner**: Your username/organization
   - **Space name**: `fbmc-forecasting` (or your preference)
   - **License**: Apache 2.0
   - **Select SDK**: `JupyterLab`
   - **Select Hardware**: `A10G GPU ($30/month)` (**CRITICAL**)
   - **Visibility**: Private (recommended for MVP)

3. Click the **Create Space** button

4. **Wait 2-3 minutes** for Space initialization

5. **Verify Space Access:**
   - Visit: `https://huggingface.co/spaces/YOUR_USERNAME/fbmc-forecasting`
   - Confirm JupyterLab interface loads
   - Check hardware: Should show "A10G GPU" in bottom-right

---

## Step 2: Local Environment Setup (25 minutes)

### 2.1 Clone HF Space Locally (2 minutes)

```bash
# Clone your HF Space
git clone https://huggingface.co/spaces/YOUR_USERNAME/fbmc-forecasting
cd fbmc-forecasting

# Verify remote
git remote -v
# Should show: https://huggingface.co/spaces/YOUR_USERNAME/fbmc-forecasting
```

### 2.2 Create Directory Structure (1 minute)

```bash
# Create project directories
mkdir -p notebooks \
         notebooks_exported \
         src/{data_collection,feature_engineering,model,utils} \
         config \
         results/{forecasts,evaluation,visualizations} \
         docs \
         tools \
         tests

# Note: data/ directory will be created by download scripts
# It is NOT tracked in Git (following best practices)

# Verify structure (tree may not be installed; find is a portable fallback)
tree -L 2 || find . -maxdepth 2 -type d
```

### 2.3 Install uv Package Manager (2 minutes)

```bash
# Install uv (ultra-fast pip replacement)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Add to PATH (if not automatic; newer uv installers use ~/.local/bin)
export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"

# Verify installation
uv --version
# Should show: uv 0.x.x
```

### 2.4 Create Virtual Environment (1 minute)

```bash
# Create .venv with uv
uv venv

# Activate (Linux/Mac)
source .venv/bin/activate

# Activate (Windows)
# .venv\Scripts\activate

# Verify activation
which python
# Should point to: /path/to/fbmc-forecasting/.venv/bin/python
```

### 2.5 Install Dependencies (2 minutes)

```bash
# Create requirements.txt
cat > requirements.txt << 'EOF'
# Core Data & ML
polars>=0.20.0
pyarrow>=13.0.0
numpy>=1.24.0
scikit-learn>=1.3.0

# Time Series Forecasting
chronos-forecasting>=1.0.0
transformers>=4.35.0
torch>=2.0.0

# Data Collection
entsoe-py>=0.5.0
jao-py>=0.6.0
requests>=2.31.0

# HuggingFace Integration (for Datasets, NOT Git LFS)
datasets>=2.14.0
huggingface-hub>=0.17.0

# Visualization & Notebooks
altair>=5.0.0
marimo>=0.9.0
jupyter>=1.0.0
ipykernel>=6.25.0

# Utilities
pyyaml>=6.0.0
python-dotenv>=1.0.0
tqdm>=4.66.0

# HF Space Integration
gradio>=4.0.0
EOF

# Install with uv (ultra-fast)
uv pip install -r requirements.txt

# Create lockfile for reproducibility
uv pip compile requirements.txt -o requirements.lock
```

**Verify installations:**
```bash
python -c "import polars; print(f'polars {polars.__version__}')"
python -c "import marimo; print(f'marimo {marimo.__version__}')"
python -c "import torch; print(f'torch {torch.__version__}')"
python -c "from chronos import ChronosPipeline; print('chronos-forecasting ✓')"
python -c "from datasets import Dataset; print('datasets ✓')"
python -c "from huggingface_hub import HfApi; print('huggingface-hub ✓')"
python -c "import jao; print(f'jao-py {jao.__version__}')"
```
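
Optionally, run a quick Chronos smoke test on synthetic data (a minimal sketch assuming the standard `ChronosPipeline.from_pretrained` / `predict` API; the first run downloads model weights):

```python
import torch
from chronos import ChronosPipeline

# Load the smallest Chronos checkpoint on CPU (swap device_map to "cuda" on the A10G Space)
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cpu",
    torch_dtype=torch.float32,
)

context = torch.randn(1, 64).cumsum(-1)  # one synthetic random-walk series
forecast = pipeline.predict(context, prediction_length=12)
print(forecast.shape)  # (batch, num_samples, prediction_length)
```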

### 2.6 Configure .gitignore (Data Exclusion) (2 minutes)

```bash
# Create .gitignore - CRITICAL for keeping data out of Git
cat > .gitignore << 'EOF'
# ============================================
# Data Files - NEVER commit to Git
# ============================================
# Following data science best practices:
# - Code goes in Git
# - Data goes in HuggingFace Datasets
data/
*.parquet
*.pkl
*.csv
*.h5
*.hdf5
*.feather

# ============================================
# Model Artifacts
# ============================================
models/checkpoints/
*.pth
*.safetensors
*.ckpt

# ============================================
# Credentials & Secrets
# ============================================
.env
config/api_keys.yaml
*.key
*.pem

# ============================================
# Python
# ============================================
__pycache__/
*.pyc
*.pyo
*.egg-info/
.pytest_cache/
.venv/
venv/

# ============================================
# IDE & OS
# ============================================
.vscode/
.idea/
*.swp
.DS_Store
Thumbs.db

# ============================================
# Jupyter
# ============================================
.ipynb_checkpoints/

# ============================================
# Temporary Files
# ============================================
*.tmp
*.log
.cache/
EOF

# Stage .gitignore
git add .gitignore

# Verify data/ will be ignored (data/ is already listed above)
git check-ignore data/test.parquet
# Should output: data/test.parquet (confirming it's ignored)
```

**Why NO Git LFS?**
Following data science best practices:
- **Code** → Git (fast, version controlled)
- **Data** → HuggingFace Datasets (separate, scalable)
- **NOT** Git LFS (expensive, non-standard for ML projects)

**Data will be:**
- Downloaded via scripts (Day 1)
- Uploaded to HF Datasets (Day 1)
- Loaded programmatically (Days 2-5)
- NEVER committed to Git repository

### 2.7 Configure API Keys & HuggingFace Access (3 minutes)

```bash
# Create config directory structure
mkdir -p config

# Create API keys configuration
cat > config/api_keys.yaml << 'EOF'
# ENTSO-E Transparency Platform
entsoe_api_key: "YOUR_ENTSOE_API_KEY_HERE"

# OpenMeteo (free tier - no key required)
openmeteo_base_url: "https://api.open-meteo.com/v1/forecast"

# Hugging Face (for uploading datasets)
hf_token: "YOUR_HF_WRITE_TOKEN_HERE"
hf_username: "YOUR_HF_USERNAME"
EOF

# Create .env file for environment variables
cat > .env << 'EOF'
ENTSOE_API_KEY=YOUR_ENTSOE_API_KEY_HERE
OPENMETEO_BASE_URL=https://api.open-meteo.com/v1/forecast
HF_TOKEN=YOUR_HF_WRITE_TOKEN_HERE
HF_USERNAME=YOUR_HF_USERNAME
EOF
```

**Get your HuggingFace Write Token:**
1. Visit: https://huggingface.co/settings/tokens
2. Click "New token"
3. Name: "FBMC Dataset Upload"
4. Type: **Write** (required for uploading datasets)
5. Copy token

**Now edit the files with your actual credentials:**
```bash
# Option 1: Use text editor
nano config/api_keys.yaml  # Update all YOUR_*_HERE placeholders
nano .env                  # Update all YOUR_*_HERE placeholders

# Option 2: Use sed (GNU sed shown; on macOS use `sed -i ''` instead of `sed -i`)
sed -i 's/YOUR_ENTSOE_API_KEY_HERE/your-actual-entsoe-key/' config/api_keys.yaml .env
sed -i 's/YOUR_HF_WRITE_TOKEN_HERE/hf_your-actual-token/' config/api_keys.yaml .env
sed -i 's/YOUR_HF_USERNAME/your-username/' config/api_keys.yaml .env
```

**Verify credentials are set:**
```bash
# Should NOT see any "YOUR_*_HERE" placeholders
grep "YOUR_" config/api_keys.yaml
# Empty output = good!
```
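
Since `python-dotenv` is already in the stack, here is a small sketch of how scripts can pick these credentials up at runtime:

```python
# Sketch: read credentials from .env via python-dotenv (already in requirements.txt)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
entsoe_key = os.environ["ENTSOE_API_KEY"]
hf_token = os.environ["HF_TOKEN"]
assert not entsoe_key.startswith("YOUR_"), "Placeholder still present in .env"
print("✓ Credentials loaded from .env")
```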

### 2.8 Create Data Management Utilities (5 minutes)

```bash
# Create data collection module with HF Datasets integration
cat > src/data_collection/hf_datasets_manager.py << 'EOF'
"""HuggingFace Datasets manager for FBMC data storage."""

import polars as pl
from datasets import Dataset, DatasetDict
from huggingface_hub import HfApi
from pathlib import Path
import yaml

class FBMCDatasetManager:
    """Manage FBMC data uploads/downloads via HuggingFace Datasets."""

    def __init__(self, config_path: str = "config/api_keys.yaml"):
        """Initialize with HF credentials."""
        with open(config_path) as f:
            config = yaml.safe_load(f)

        self.hf_token = config['hf_token']
        self.hf_username = config['hf_username']
        self.api = HfApi(token=self.hf_token)

    def upload_dataset(self, parquet_path: Path, dataset_name: str, description: str = ""):
        """Upload Parquet file to HuggingFace Datasets."""
        print(f"Uploading {parquet_path.name} to HF Datasets...")

        # Load Parquet as polars, convert to HF Dataset
        df = pl.read_parquet(parquet_path)
        dataset = Dataset.from_pandas(df.to_pandas())

        # Create full dataset name
        full_name = f"{self.hf_username}/{dataset_name}"

        # Upload to HF
        dataset.push_to_hub(
            full_name,
            token=self.hf_token,
            private=False  # public dataset (free storage); set True to keep it private
        )

        print(f"✓ Uploaded to: https://huggingface.co/datasets/{full_name}")
        return full_name

    def download_dataset(self, dataset_name: str, output_path: Path):
        """Download dataset from HF to local Parquet."""
        from datasets import load_dataset

        print(f"Downloading {dataset_name} from HF Datasets...")

        # Download from HF
        dataset = load_dataset(
            f"{self.hf_username}/{dataset_name}",
            split="train"
        )

        # Convert to polars and save
        df = pl.from_pandas(dataset.to_pandas())
        output_path.parent.mkdir(parents=True, exist_ok=True)
        df.write_parquet(output_path)

        print(f"✓ Downloaded to: {output_path}")
        return df

    def list_datasets(self):
        """List all FBMC datasets for this user."""
        datasets = self.api.list_datasets(author=self.hf_username)
        fbmc_datasets = [d for d in datasets if 'fbmc' in d.id.lower()]

        print(f"\nFBMC Datasets for {self.hf_username}:")
        for ds in fbmc_datasets:
            print(f"  - {ds.id}")

        return fbmc_datasets

# Example usage (will be used in Day 1)
if __name__ == "__main__":
    manager = FBMCDatasetManager()

    # Upload example (Day 1 will use this)
    # manager.upload_dataset(
    #     parquet_path=Path("data/raw/cnecs_2023_2025.parquet"),
    #     dataset_name="fbmc-cnecs-2023-2025",
    #     description="FBMC CNECs data: Oct 2023 - Sept 2025"
    # )

    # Download example (HF Space will use this)
    # manager.download_dataset(
    #     dataset_name="fbmc-cnecs-2023-2025",
    #     output_path=Path("data/raw/cnecs_2023_2025.parquet")
    # )
EOF

# Create data download orchestrator
cat > src/data_collection/download_all.py << 'EOF'
"""Download all FBMC data from HuggingFace Datasets."""

from pathlib import Path
try:
    # Package import (used when HF Space does `from src.data_collection.download_all import setup_data`)
    from .hf_datasets_manager import FBMCDatasetManager
except ImportError:
    # Direct script execution: `python src/data_collection/download_all.py`
    from hf_datasets_manager import FBMCDatasetManager

def setup_data(data_dir: Path = Path("data/raw")):
    """Download all datasets if not present locally."""
    manager = FBMCDatasetManager()

    datasets_to_download = {
        "fbmc-cnecs-2023-2025": "cnecs_2023_2025.parquet",
        "fbmc-weather-2023-2025": "weather_2023_2025.parquet",
        "fbmc-entsoe-2023-2025": "entsoe_2023_2025.parquet",
    }

    data_dir.mkdir(parents=True, exist_ok=True)

    for dataset_name, filename in datasets_to_download.items():
        output_path = data_dir / filename

        if output_path.exists():
            print(f"✓ {filename} already exists, skipping")
        else:
            try:
                manager.download_dataset(dataset_name, output_path)
            except Exception as e:
                print(f"✗ Failed to download {dataset_name}: {e}")
                print(f"  You may need to run Day 1 data collection first")

    print("\n✓ Data setup complete")

if __name__ == "__main__":
    setup_data()
EOF

# Make scripts executable
chmod +x src/data_collection/hf_datasets_manager.py
chmod +x src/data_collection/download_all.py

echo "✓ Data management utilities created"
```

**What This Does:**
- `hf_datasets_manager.py`: Upload/download Parquet files to/from HF Datasets
- `download_all.py`: One-command data setup for HF Space or analysts

**Day 1 Workflow:**
1. Download data from JAO/ENTSO-E/OpenMeteo to `data/raw/`
2. Upload each Parquet to HF Datasets (separate from Git)
3. Git repo stays small (only code)
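
As a preview of step 2 above, a sketch of the Day 1 upload loop built on the manager (dataset names match those assumed in `download_all.py`):

```python
# Sketch of the Day 1 upload step; run from the repo root after data collection.
from pathlib import Path
from src.data_collection.hf_datasets_manager import FBMCDatasetManager

manager = FBMCDatasetManager()
for dataset_name, filename in {
    "fbmc-cnecs-2023-2025": "cnecs_2023_2025.parquet",
    "fbmc-weather-2023-2025": "weather_2023_2025.parquet",
    "fbmc-entsoe-2023-2025": "entsoe_2023_2025.parquet",
}.items():
    manager.upload_dataset(Path("data/raw") / filename, dataset_name)
```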

**HF Space Workflow:**
```python
# In your Space's app.py startup:
from src.data_collection.download_all import setup_data
setup_data()  # Downloads from HF Datasets, not Git
```

### 2.9 Create First Marimo Notebook (5 minutes)

```bash
# Create initial exploration notebook
cat > notebooks/01_data_exploration.py << 'EOF'
import marimo

__generated_with = "0.9.0"
app = marimo.App(width="medium")

@app.cell
def __():
    import marimo as mo
    import polars as pl
    import altair as alt
    from pathlib import Path
    return mo, pl, alt, Path

@app.cell
def __(mo):
    mo.md(
        """
        # FBMC Flow Forecasting - Data Exploration

        **Day 1 Objective**: Explore JAO FBMC data structure

        ## Steps:
        1. Load downloaded Parquet files
        2. Inspect CNECs, PTDFs, RAMs
        3. Identify top 200 binding CNECs (50 Tier-1 + 150 Tier-2)
        4. Visualize temporal patterns
        """
    )
    return

@app.cell
def __(Path):
    # Data paths
    DATA_DIR = Path("../data/raw")
    CNECS_FILE = DATA_DIR / "cnecs_2023_2025.parquet"
    return DATA_DIR, CNECS_FILE

@app.cell
def __(mo, CNECS_FILE):
    # Check if data exists (marimo renders a cell's last expression,
    # so build the markdown as an expression rather than inside if/else branches)
    status = (
        mo.md("✓ CNECs data found - ready for Day 1 analysis")
        if CNECS_FILE.exists()
        else mo.md("⚠ CNECs data not yet downloaded - run Day 1 collection script")
    )
    status
    return

if __name__ == "__main__":
    app.run()
EOF

# Test Marimo installation
marimo edit notebooks/01_data_exploration.py
# This opens a browser tab with the interactive notebook
# Close after verifying it loads correctly (Ctrl+C in the terminal)
```

### 2.10 Create Utility Modules (2 minutes)

```bash
# Create data loading utilities
cat > src/utils/data_loader.py << 'EOF'
"""Data loading utilities for FBMC forecasting project."""

import polars as pl
from pathlib import Path
from typing import Optional

def load_cnecs(data_dir: Path, start_date: Optional[str] = None, end_date: Optional[str] = None) -> pl.DataFrame:
    """Load CNEC data with optional date filtering."""
    cnecs = pl.read_parquet(data_dir / "cnecs_2023_2025.parquet")

    if start_date:
        cnecs = cnecs.filter(pl.col("timestamp") >= start_date)
    if end_date:
        cnecs = cnecs.filter(pl.col("timestamp") <= end_date)

    return cnecs

def load_weather(data_dir: Path, grid_points: Optional[list] = None) -> pl.DataFrame:
    """Load weather data with optional grid point filtering."""
    weather = pl.read_parquet(data_dir / "weather_2023_2025.parquet")

    if grid_points:
        weather = weather.filter(pl.col("grid_point").is_in(grid_points))

    return weather
EOF

# Create __init__.py files
touch src/__init__.py
touch src/utils/__init__.py
touch src/data_collection/__init__.py
touch src/feature_engineering/__init__.py
touch src/model/__init__.py
```
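
Example usage of these helpers (hypothetical dates; requires the Day 1 data in `data/raw/` and assumes the `timestamp` column compares cleanly against ISO date strings):

```python
from pathlib import Path
from src.utils.data_loader import load_cnecs

# Load one month of CNEC data (dates are illustrative)
cnecs = load_cnecs(Path("data/raw"), start_date="2024-01-01", end_date="2024-01-31")
print(cnecs.shape)
```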

### 2.11 Initial Commit (2 minutes)

```bash
# Stage all changes (note: data/ is excluded by .gitignore)
git add .

# Create initial commit
git commit -m "Day 0: Initialize FBMC forecasting MVP environment

- Add project structure (notebooks, src, config, tools)
- Configure uv + polars + Marimo + Chronos + HF Datasets stack
- Create .gitignore (excludes data/ following best practices)
- Install jao-py Python library for JAO data access
- Configure ENTSO-E, OpenMeteo, and HuggingFace API access
- Add HF Datasets manager for data storage (separate from Git)
- Create data download utilities (download_all.py)
- Create initial exploration notebook

Data Strategy:
- Code → Git (this repo)
- Data → HuggingFace Datasets (separate, not in Git)
- NO Git LFS (following data science best practices)

Infrastructure: HF Space (A10G GPU, \$30/month)"

# Push to HF Space
git push origin main

# Verify push succeeded
git status
# Should show: "Your branch is up to date with 'origin/main'"

# Verify no data files were committed
git ls-files | grep "\.parquet"
# Should be empty (no .parquet files in Git)
```

---

## Step 3: Verify Complete Setup (5 minutes)

### 3.1 Python Environment Verification

```bash
# Activate environment if not already
source .venv/bin/activate

# Run comprehensive checks
python << 'EOF'
import sys
print(f"Python: {sys.version}")

packages = [
    "polars", "pyarrow", "numpy", "scikit-learn",
    "torch", "transformers", "marimo", "altair",
    "entsoe", "jao", "requests", "yaml", "gradio",
    "datasets", "huggingface_hub"
]

print("\nPackage Versions:")
for pkg in packages:
    try:
        if pkg == "entsoe":
            import entsoe
            print(f"✓ entsoe-py: {entsoe.__version__}")
        elif pkg == "jao":
            import jao
            print(f"✓ jao-py: {jao.__version__}")
        elif pkg == "yaml":
            import yaml
            print(f"✓ pyyaml: {yaml.__version__}")
        elif pkg == "huggingface_hub":
            from huggingface_hub import HfApi
            print(f"✓ huggingface-hub: Ready")
        else:
            mod = __import__(pkg)
            print(f"✓ {pkg}: {mod.__version__}")
    except Exception as e:
        print(f"✗ {pkg}: {e}")

# Test Chronos specifically
try:
    from chronos import ChronosPipeline
    print("\n✓ Chronos forecasting: Ready")
except Exception as e:
    print(f"\n✗ Chronos forecasting: {e}")

# Test HF Datasets
try:
    from datasets import Dataset
    print("✓ HuggingFace Datasets: Ready")
except Exception as e:
    print(f"✗ HuggingFace Datasets: {e}")

print("\nAll checks complete!")
EOF
```

### 3.2 API Access Verification

```bash
# Test ENTSO-E API
python << 'EOF'
from entsoe import EntsoePandasClient
import yaml

# Load API key
with open('config/api_keys.yaml') as f:
    config = yaml.safe_load(f)

api_key = config['entsoe_api_key']

if 'YOUR_ENTSOE_API_KEY_HERE' in api_key:
    print("⚠ ENTSO-E API key not configured - update config/api_keys.yaml")
else:
    try:
        client = EntsoePandasClient(api_key=api_key)
        print("✓ ENTSO-E API client initialized successfully")
    except Exception as e:
        print(f"✗ ENTSO-E API error: {e}")
EOF

# Test OpenMeteo API
python << 'EOF'
import requests

response = requests.get(
    "https://api.open-meteo.com/v1/forecast",
    params={
        "latitude": 52.52,
        "longitude": 13.41,
        "hourly": "temperature_2m",
        "start_date": "2025-01-01",
        "end_date": "2025-01-02"
    }
)

if response.status_code == 200:
    print("✓ OpenMeteo API accessible")
else:
    print(f"✗ OpenMeteo API error: {response.status_code}")
EOF

# Test HuggingFace authentication
python << 'EOF'
from huggingface_hub import HfApi
import yaml

with open('config/api_keys.yaml') as f:
    config = yaml.safe_load(f)

hf_token = config['hf_token']
hf_username = config['hf_username']

if 'YOUR_HF' in hf_token or 'YOUR_HF' in hf_username:
    print("⚠ HuggingFace credentials not configured - update config/api_keys.yaml")
else:
    try:
        api = HfApi(token=hf_token)
        user_info = api.whoami()
        print(f"✓ HuggingFace authenticated as: {user_info['name']}")
        print(f"  Can create datasets: {'datasets' in user_info.get('auth', {}).get('accessToken', {}).get('role', '')}")
    except Exception as e:
        print(f"✗ HuggingFace authentication error: {e}")
        print(f"  Verify token has WRITE permissions")
EOF
```

### 3.3 HF Space Verification

```bash
# Check HF Space status
echo "Visit your HF Space: https://huggingface.co/spaces/YOUR_USERNAME/fbmc-forecasting"
echo ""
echo "Verify:"
echo "  1. JupyterLab interface loads"
echo "  2. Hardware shows 'A10G GPU' in bottom-right"
echo "  3. Files from git push are visible"
echo "  4. Can create new notebook"
```
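
The same check can be done programmatically (a sketch assuming `huggingface_hub`'s `space_info` API; the repo id is a placeholder):

```python
# Optional: programmatic Space check via huggingface_hub
from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN from the environment if set
info = api.space_info("YOUR_USERNAME/fbmc-forecasting")
print(info.runtime)  # stage should be RUNNING; hardware should mention a10g
```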

### 3.4 Final Checklist

```bash
# Print final status
cat << 'EOF'
╔═══════════════════════════════════════════════════════════╗
║           DAY 0 SETUP VERIFICATION CHECKLIST               ║
╚═══════════════════════════════════════════════════════════╝

Environment:
  [ ] Python 3.10+ installed
  [ ] Git installed (NO Git LFS needed)
  [ ] uv package manager installed

Local Setup:
  [ ] Virtual environment created and activated
  [ ] All Python dependencies installed (20 packages including jao-py)
  [ ] API keys configured (ENTSO-E + OpenMeteo + HuggingFace)
  [ ] HuggingFace write token obtained
  [ ] Project structure created (8 directories)
  [ ] .gitignore configured (data/ excluded)
  [ ] Initial Marimo notebook created
  [ ] Data management utilities created (hf_datasets_manager.py)

Git & HF Space:
  [ ] HF Space created (A10G GPU, $30/month)
  [ ] Repository cloned locally
  [ ] .gitignore excludes all data files (*.parquet, data/)
  [ ] Initial commit pushed to HF Space (code only, NO data)
  [ ] HF Space JupyterLab accessible
  [ ] Git repo size < 50 MB (no data committed)

Verification Tests:
  [ ] Python imports successful (polars, chronos, jao-py, datasets, etc.)
  [ ] ENTSO-E API client initializes
  [ ] OpenMeteo API responds (status 200)
  [ ] HuggingFace authentication successful (write access)
  [ ] Marimo notebook opens in browser

Data Strategy Confirmed:
  [ ] Code goes in Git (version controlled)
  [ ] Data goes in HuggingFace Datasets (separate storage)
  [ ] NO Git LFS setup (following data science best practices)
  [ ] data/ directory in .gitignore

Ready for Day 1: [ ]

Next Step: Run Day 1 data collection (8 hours)
- Download data locally via jao-py/APIs
- Upload to HuggingFace Datasets (separate from Git)
- Total data: ~12 GB (stored in HF Datasets, NOT Git)
EOF
```

---

## Troubleshooting

### Issue: uv installation fails
```bash
# Alternative: Use pip directly
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

### Issue: Git LFS files not syncing
**Not applicable** - We're using HuggingFace Datasets, not Git LFS.

If you see Git LFS references, you may have an old version of this guide. Data files should NEVER be in Git.

### Issue: HuggingFace authentication fails
```bash
# Verify token is correct
python << 'EOF'
from huggingface_hub import HfApi
import yaml

with open('config/api_keys.yaml') as f:
    config = yaml.safe_load(f)

try:
    api = HfApi(token=config['hf_token'])
    print(api.whoami())
except Exception as e:
    print(f"Error: {e}")
    print("\nTroubleshooting:")
    print("1. Visit: https://huggingface.co/settings/tokens")
    print("2. Verify token has WRITE permission")
    print("3. Copy token exactly (starts with 'hf_')")
    print("4. Update config/api_keys.yaml and .env")
EOF
```

### Issue: Cannot upload to HuggingFace Datasets
```bash
# Common causes:
# 1. Token doesn't have write permissions
#    Fix: Create new token with "write" scope

# 2. Dataset name already exists
#    Fix: Use different name or add version suffix
#    Example: fbmc-cnecs-2023-2025-v2

# 3. File too large (>5GB single file limit)
#    Fix: Split into multiple datasets or use sharding

# Test upload with small sample:
python << 'EOF'
from datasets import Dataset
import pandas as pd

# Create tiny test dataset
df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
dataset = Dataset.from_pandas(df)

# Try uploading
try:
    dataset.push_to_hub("YOUR_USERNAME/test-dataset", token="YOUR_TOKEN")
    print("✓ Upload successful - authentication works")
except Exception as e:
    print(f"✗ Upload failed: {e}")
EOF
```

### Issue: Marimo notebook won't open
```bash
# Check marimo installation
marimo --version

# Try running without opening browser
marimo run notebooks/01_data_exploration.py

# Check for port conflicts
lsof -i :2718  # Default Marimo port
```

### Issue: ENTSO-E API key invalid
```bash
# Verify key in ENTSO-E Transparency Platform:
# 1. Login: https://transparency.entsoe.eu/
# 2. Navigate: Account Settings → Web API Security Token
# 3. Copy key exactly (no spaces)
# 4. Update: config/api_keys.yaml and .env
```

### Issue: HF Space shows "Building..." forever
```bash
# Check HF Space logs:
# Visit: https://huggingface.co/spaces/YOUR_USERNAME/fbmc-forecasting
# Click: "Settings" → "Logs"

# Common fix: Ensure requirements.txt is valid
# Test locally:
pip install -r requirements.txt --dry-run
```

### Issue: jao-py import fails
```bash
# Verify jao-py installation
python -c "import jao; print(jao.__version__)"

# If missing, reinstall
uv pip install "jao-py>=0.6.0"

# Check package is in environment
uv pip list | grep jao
```

---

## What's Next: Day 1 Preview

**Day 1 Objective**: Download 24 months of historical data (Oct 2023 - Sept 2025)

**Data Collection Tasks:**
1. **JAO FBMC Data** (4-5 hours)
   - CNECs: ~900 MB (24 months)
   - PTDFs: ~1.5 GB (24 months)
   - RAMs: ~800 MB (24 months)
   - Shadow prices: ~600 MB (24 months)
   - LTN nominations: ~400 MB (24 months)
   - Net positions: ~300 MB (24 months)

2. **ENTSO-E Data** (2-3 hours)
   - Generation forecasts: 13 zones × 24 months
   - Actual generation: 13 zones × 24 months
   - Cross-border flows: ~20 borders × 24 months

3. **OpenMeteo Weather** (1-2 hours)
   - 52 grid points × 24 months
   - 8 variables per point
   - Parallel download optimization

**Total Data Size**: ~12 GB (compressed Parquet)

**Day 1 Script**: Will use jao-py Python library with rate limiting and parallel download logic.
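
The exact jao-py calls are Day 1 material, but the concurrency pattern itself is simple. A minimal sketch, with the hypothetical `fetch_month` standing in for the real jao-py download call:

```python
# Sketch of the rate-limited parallel download pattern planned for Day 1.
# fetch_month is a hypothetical stand-in for the actual jao-py call.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from threading import Semaphore

rate_gate = Semaphore(4)  # allow at most 4 requests in flight

def fetch_month(month: str) -> str:
    with rate_gate:
        time.sleep(0.5)  # crude spacing between requests
        return f"downloaded {month}"  # replace with the real jao-py call

months = [f"2024-{m:02d}" for m in range(1, 13)]
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(fetch_month, m): m for m in months}
    for fut in as_completed(futures):
        print(fut.result())
```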

---

## Summary

**Time Investment**: 45 minutes
**Result**: Production-ready local + cloud development environment

**You Now Have:**
- ✓ HF Space with A10G GPU ($30/month)
- ✓ Local Python environment (20 packages including jao-py and HF Datasets)
- ✓ jao-py Python library for JAO data access
- ✓ ENTSO-E + OpenMeteo + HuggingFace API access configured
- ✓ HuggingFace Datasets manager for data storage (separate from Git)
- ✓ Data download/upload utilities (hf_datasets_manager.py)
- ✓ Marimo reactive notebook environment
- ✓ .gitignore configured (data/ excluded, following best practices)
- ✓ Complete project structure (8 directories)

**Data Strategy Implemented:**
```
Code (version controlled)     →  Git Repository (~50 MB)
Data (storage & versioning)   →  HuggingFace Datasets (~12 GB)
NO Git LFS (following data science best practices)
```

**Ready For**: Day 1 data collection (8 hours)
- Download 24 months data locally (jao-py + APIs)
- Upload to HuggingFace Datasets (not Git)
- Git repo stays clean (code only)

---

**Document Version**: 2.0
**Last Updated**: 2025-10-29
**Project**: FBMC Flow Forecasting MVP (Zero-Shot)