nvedant07 committed
Commit e892eae · verified · 1 Parent(s): 79a4676

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. LICENSE +31 -0
  2. README.md +444 -0
  3. config.json +109 -0
  4. config.py +262 -0
  5. config.yaml +96 -0
  6. generation_config.json +4 -0
  7. model-00001-of-00061.safetensors +3 -0
  8. model-00002-of-00061.safetensors +3 -0
  9. model-00003-of-00061.safetensors +3 -0
  10. model-00004-of-00061.safetensors +3 -0
  11. model-00005-of-00061.safetensors +3 -0
  12. model-00006-of-00061.safetensors +3 -0
  13. model-00007-of-00061.safetensors +3 -0
  14. model-00008-of-00061.safetensors +3 -0
  15. model-00009-of-00061.safetensors +3 -0
  16. model-00010-of-00061.safetensors +3 -0
  17. model-00011-of-00061.safetensors +3 -0
  18. model-00012-of-00061.safetensors +3 -0
  19. model-00013-of-00061.safetensors +3 -0
  20. model-00014-of-00061.safetensors +3 -0
  21. model-00015-of-00061.safetensors +3 -0
  22. model-00016-of-00061.safetensors +3 -0
  23. model-00017-of-00061.safetensors +3 -0
  24. model-00018-of-00061.safetensors +3 -0
  25. model-00019-of-00061.safetensors +3 -0
  26. model-00020-of-00061.safetensors +3 -0
  27. model-00021-of-00061.safetensors +3 -0
  28. model-00022-of-00061.safetensors +3 -0
  29. model-00023-of-00061.safetensors +3 -0
  30. model-00024-of-00061.safetensors +3 -0
  31. model-00025-of-00061.safetensors +3 -0
  32. model-00026-of-00061.safetensors +3 -0
  33. model-00027-of-00061.safetensors +3 -0
  34. model-00028-of-00061.safetensors +3 -0
  35. model-00029-of-00061.safetensors +3 -0
  36. model-00030-of-00061.safetensors +3 -0
  37. model-00031-of-00061.safetensors +3 -0
  38. model-00032-of-00061.safetensors +3 -0
  39. model-00033-of-00061.safetensors +3 -0
  40. model-00034-of-00061.safetensors +3 -0
  41. model-00035-of-00061.safetensors +3 -0
  42. model-00036-of-00061.safetensors +3 -0
  43. model-00037-of-00061.safetensors +3 -0
  44. model-00038-of-00061.safetensors +3 -0
  45. model-00039-of-00061.safetensors +3 -0
  46. model-00040-of-00061.safetensors +3 -0
  47. model-00041-of-00061.safetensors +3 -0
  48. model-00042-of-00061.safetensors +3 -0
  49. model-00043-of-00061.safetensors +3 -0
  50. model-00044-of-00061.safetensors +3 -0
LICENSE ADDED
@@ -0,0 +1,31 @@
+ The following applies to all files in this repository, unless otherwise noted:
+
+ Copyright (c) 2025 Aleph Alpha Research GmbH. All rights reserved.
+
+ This project is licensed under the terms of the Open Aleph License 1.0, available at
+ https://github.com/Aleph-Alpha/.github/blob/main/oal.pdf
+
+ ---
+ Excerpt from the license text:
+
+ Subject to the terms and conditions of this License, the Licensor grants you a non-exclusive, worldwide,
+ non-transferable, non-sublicensable, and royalty-free limited right to use, copy, modify, distribute, make
+ otherwise publicly available, and reproduce the Works and Derivative Works under Licensor’s copyright,
+ for any Non-Commercial and Non-Administrative purpose.
+ You may not use, copy, modify, distribute, make otherwise publicly available, reproduce, or sublicense the
+ Works or Derivative Works except as expressly provided under and in accordance with this License.
+ Your rights granted under this License will automatically terminate if you fail to comply with any of the
+ terms of this License.
+
+ EXCEPT FOR DAMAGES CAUSED BY INTENT OR FRAUDULENTLY CONCEALED
+ DEFECTS, AND EXCEPT FOR DAMAGES RESULTING FROM BREACH OF ANY
+ WARRANTY OR GUARANTEE EXPRESSLY GIVEN BY LICENSOR IN THE OPEN ALEPH LICENSE,
+ IN NO EVENT WILL LICENSOR BE LIABLE TO YOU ON ANY LEGAL THEORY FOR ANY
+ DAMAGES ARISING OUT OF THE OPEN ALEPH LICENSE OR THE USE OF THE WORK. ANY
+ MANDATORY STATUTORY LIABILITY UNDER APPLICABLE LAW REMAINS
+ UNAFFECTED.
+
+ EXCEPT AS EXPRESSLY STATED IN THIS LICENSE OR REQUIRED BY APPLICABLE
+ LAW, THE WORKS ARE PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES
+ OF ANY KIND INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES REGARDING
+ THE CONTENTS, ACCURACY, OR FITNESS FOR A PARTICULAR PURPOSE.
README.md ADDED
@@ -0,0 +1,444 @@
+ ---
+ language:
+ - en
+ - de
+ license: other
+ thumbnail: https://huggingface.co/Aleph-Alpha/llama-3_1-70b-tfree-hat-sft/raw/main/source/aleph_alpha_logo_thumbnail.png
+ license_name: open-aleph-license
+ license_link: LICENSE
+ tags:
+ - Aleph Alpha Research
+ - pytorch
+ - Hierarchical Autoregressive Transformer
+ - HAT
+ model-index:
+ - name: llama-3_1-70b-tfree-hat-sft
+   results: []
+ ---
+
+ <div align="center">
+ <img src="source/aleph_alpha_logo.svg" width="60%" alt="Aleph Alpha Research Logo" />
+ </div>
+
+ <div align="center" style="line-height: 1;">
+ <a href="https://aleph-alpha.com/research/" target="_blank" style="margin: 2px;">
+ <img alt="Homepage" src="source/aleph_alpha_homepage_badge.svg" style="display: inline-block; vertical-align: middle;" />
+ </a>
+ <a href="https://huggingface.co/Aleph-Alpha" target="_blank" style="margin: 2px;">
+ <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-AlephAlpha%20Research-e3ff00?color=e3ff00&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
+ </a>
+ </div>
+
+ <div align="center" style="line-height: 1;">
+ <a href="https://twitter.com/Aleph__Alpha" target="_blank" style="margin: 2px;">
+ <img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-AlephAlpha_Research-white?logo=x&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
+ </a>
+ <a href="https://huggingface.co/Aleph-Alpha/llama-3_1-70b-tfree-hat-sft/blob/main/LICENSE" style="margin: 2px;">
+ <img alt="License" src="https://img.shields.io/badge/License-Open Aleph License-white?&color=white" style="display: inline-block; vertical-align: middle;"/>
+ </a>
+ </div>
+
+ <hr>
+
+ # llama-3_1-70b-tfree-hat-sft
+ <!-- markdownlint-disable first-line-h1 -->
+ <!-- markdownlint-disable html -->
+ <!-- markdownlint-disable no-duplicate-header -->
+
+ This model card provides an overview of our **tokenizer-free llama-3.1-70b-tfree-hat model**, a foundation model based on Llama, developed by Aleph Alpha Research* and publicly available under the Open Aleph License, a license that explicitly allows non-commercial research and educational use.
+
+ The model is based on the Llama 3.1 70B base model’s pre-trained backbone, replacing the Llama tokenizer with our Hierarchical Autoregressive Transformer (HAT) architecture, originally described in our [paper](https://arxiv.org/abs/2501.10322). This novel architecture integrates character-level encoding and decoding with the word-level backbone, allowing for improved text compression (fewer sequence positions) and better performance in the languages it has been trained on, potentially higher robustness to prompt changes, and improved adaptability to new languages & domains via fine-tuning.
+
+ The model was pre- and post-trained in English & German on carefully curated data in compliance with applicable EU and national regulations, including copyright and data privacy laws. The model has not been optimized for code generation and math and is thus not evaluated extensively on the respective benchmarks.
+
+ Please note that the realized inference speed strongly depends on the maturity of the inference implementation, beyond the intrinsic text compression of any model. The currently available inference implementation is not yet optimized, so any speed benchmark must take this into account. We are releasing an optimized [vLLM-based inference solution](https://github.com/Aleph-Alpha/vllm) that is still under active development.
+
+ You can find all model weights and their corresponding safetensors conversions at the following links:
+
+ | Model Name | Link |
+ | --- | --- |
+ | `llama-3_1-70b-tfree-hat-sft` | [Link](https://huggingface.co/Aleph-Alpha/llama-3_1-70b-tfree-hat-sft) |
+
+ # Model Access
+
+ We provide access to our model through the channels listed below.
+
+ - **HuggingFace**: The model’s weights are available on HuggingFace under the [Open Aleph License](https://github.com/Aleph-Alpha/.github/blob/main/oal.pdf), a license explicitly allowing for non-commercial research and educational use.
+
+ We do not collect PII (personally identifiable information) for any of these channels. We do not log user inputs to the models. We do not train on user data.
+
+ **Note**: The same models are made available to users regardless of their geographic location and their input language, but subject to sanction regimes, technology export regulations, and other restrictions that may apply. The same offering is provided to all countries within and outside the European Union if no legal restrictions apply.
+
+ # How to use
+
+ ## Inference
+
+ We release a vLLM-based inference implementation adapted to our llama-3.1-70b-tfree-hat model [here](https://github.com/Aleph-Alpha/vllm). Please note that this inference implementation is still under active development.
+
+ ## Prompt formatting
+
+ The prompt format used for this model is identical to the [Llama prompt format](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/). We highly recommend using it when prompting the models to ensure optimal performance for the supervised fine-tuned and direct-preference-optimized model versions. You can format your prompt in the recommended format by setting `add_llama_template=True` in the `model._prepare_input` method.
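+
+ For orientation, the sketch below assembles a single-turn prompt in the Llama 3.1 chat layout by hand, using the special tokens listed under `special_token_dict` in `config.json` (`<|begin_of_text|>`, `<|start_header_id|>`, `<|end_header_id|>`, `<|eot_id|>`). It is only an illustration of the expected layout; in practice, prefer the `add_llama_template=True` path described above.
+
+ ```python
+ # Illustrative only: builds the Llama 3.1 chat layout by hand. The released code
+ # applies the same template via `model._prepare_input(..., add_llama_template=True)`.
+ def format_llama_prompt(system: str, user: str) -> str:
+     return (
+         "<|begin_of_text|>"
+         "<|start_header_id|>system<|end_header_id|>\n\n" + system + "<|eot_id|>"
+         "<|start_header_id|>user<|end_header_id|>\n\n" + user + "<|eot_id|>"
+         "<|start_header_id|>assistant<|end_header_id|>\n\n"
+     )
+
+ prompt = format_llama_prompt("You are a helpful assistant.", "Wie funktioniert ein HAT-Modell?")
+ ```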
+
+ # Evaluation
+
+ **Performance**: Our T-Free model delivers strong performance in both English and German. For evaluation purposes, we compare our model with [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct). The respective benchmarks and results can be found in the tables below.
+
+ **Efficiency**: Our tokenizer-free approach results in improved text compression, providing a foundation for improved efficiency in inference speed. We measure compression in terms of words processed across all languages and domains and define the metric as **tokenizer fertility**, or **bytes per sequence position**, where a higher value indicates better performance. Latency and throughput are currently out of scope for research-centric evaluations and will be addressed in the future. Our evaluation framework automatically measures **bytes per sequence position** across datasets, allowing us to derive text compression scores and analyze variations across different dataset distributions. The resulting end-to-end efficiency depends on the inference implementation and therefore lies beyond the scope of the inference code provided here and the reported compression scores.
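+
+ As a concrete reading of the metric, the snippet below computes bytes per sequence position for a given text and position count; it is a generic illustration of the definition above, not output of our evaluation framework, and the position counts are made up.
+
+ ```python
+ def bytes_per_position(text: str, num_positions: int) -> float:
+     """Text compression as defined above: UTF-8 bytes divided by sequence positions."""
+     return len(text.encode("utf-8")) / num_positions
+
+ sentence = "Tokenizer-free models map whole words to single positions."
+ print(bytes_per_position(sentence, 9))   # hypothetical word-level count  -> ~6.4
+ print(bytes_per_position(sentence, 14))  # hypothetical subword count     -> ~4.1
+ ```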
+
+ **Disclaimer**: The results presented below were generated using our internal inference implementation, not the inference module mentioned above. We plan to make source-available both our evaluation framework and a high-performance vLLM integration for this model in the coming weeks to ensure reproducibility. Our goal with this initial release is to provide the community with a straightforward codebase that demonstrates the architecture and supports basic inference capabilities.
+
+ **Metric Glossary**
+
+ `log_acc`: Average Loglikelihood Accuracy<br>
+ `norm_log_acc`: Average Normalized Loglikelihood Accuracy<br>
+ `comp_acc`: Average Completion Accuracy<br>
+ `norm_prob_mass`: Average Normalized Probability Mass<br>
+ `bleu`: Average BLEU Score<br>
+ `rouge_gm`: Average ROUGE-Geometric-Mean<br>
+ `F1`: Average F1<br>
+ `CS`: Chatbot Style<br>
+ `IF`: Instruction Following<br>
+ `LC`: Language Consistency<br>
+ `CI`: Concordance Index<br>
+ `ES`: Exponential Similarity
+
+ ## SFT Benchmarks
+
+ **MTBench win rates**
+
+ English/German MTBench numbers are based on datasets created with [FastChat](https://github.com/LumiOpen/FastChat) for the corresponding models.
+
+ | | **vs.** [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) **(Eng)** | **vs.** [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) **(Ger)** |
+ | --- | --- | --- |
+ | llama-3_1-70b-tfree-hat-sft | 0.633 | 0.639 |
+
+ Comparison on a broader set of evals:
+
+ | Group | Task | Metric Name | Num Fewshot | [llama-3_1-70b-tfree-hat-sft](https://huggingface.co/Aleph-Alpha/llama-3_1-70b-tfree-hat-sft) | [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) | [llama-3_1-70b-tfree-hat-sft](https://huggingface.co/Aleph-Alpha/llama-3_1-70b-tfree-hat-sft) Compression | [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) Compression |
+ | --- | --- | --- | --- | --- | --- | --- | --- |
+ | Knowledge | MMLU | `norm_log_acc` | 5 | 0.773 | 0.818 | 5.818 | 4.922 |
+ | Knowledge | Full Text MMLU | `norm_log_acc` | 5 | 0.786 | 0.830 | 5.850 | 5.109 |
+ | Knowledge | MMLU Pro | `norm_log_acc` | 5 | 0.513 | 0.573 | 5.135 | 4.115 |
+ | Knowledge | GPQA | `log_acc` | 0 | 0.360 | 0.545 | 5.260 | 3.845 |
+ | Knowledge | BBH | `norm_log_acc` | 3 | 0.652 | 0.706 | 5.332 | 4.435 |
+ | Knowledge | OpenBookQA | `norm_log_acc` | 10 | 0.526 | 0.556 | 7.101 | 7.092 |
+ | Knowledge | TriviaQA | `comp_acc` | 5 | 0.582 | 0.757 | 6.924 | 6.018 |
+ | Knowledge | TruthfulQA | `norm_prob_mass` | 6 | 0.176 | 0.191 | 6.575 | 5.702 |
+ | Reasoning | ARC Easy | `norm_log_acc` | 25 | 0.920 | 0.911 | 7.018 | 6.426 |
+ | Reasoning | ARC Challenge | `norm_log_acc` | 25 | 0.739 | 0.741 | 6.860 | 6.244 |
+ | Reasoning | Winogrande | `norm_log_acc` | 5 | 0.749 | 0.697 | 6.856 | 6.615 |
+ | Reasoning | HellaSwag | `norm_log_acc` | 10 | 0.809 | 0.665 | 5.980 | 5.300 |
+ | German | MMMLU | `norm_log_acc` | 5 | 0.715 | 0.783 | 6.630 | 3.934 |
+ | German | [ARC Easy DE](https://huggingface.co/datasets/openGPT-X/arcx) | `norm_log_acc` | 25 | 0.848 | 0.825 | 7.872 | 4.926 |
+ | German | [ARC Challenge DE](https://huggingface.co/datasets/openGPT-X/arcx) | `norm_log_acc` | 25 | 0.669 | 0.653 | 7.798 | 4.878 |
+ | German | [Winogrande DE](https://huggingface.co/datasets/demelin/wino_x) | `norm_log_acc` | 5 | 0.793 | 0.761 | 7.225 | 5.374 |
+ | German | [HellaSwag DE](https://huggingface.co/datasets/openGPT-X/hellaswagx) | `norm_log_acc` | 10 | 0.727 | 0.707 | 6.971 | 4.150 |
+ | German | [TruthfulQA DE](https://huggingface.co/datasets/openGPT-X/truthfulqax) | `norm_prob_mass` | 6 | 0.170 | 0.174 | 7.378 | 4.674 |
+ | German | [GSM8K DE](https://huggingface.co/datasets/openGPT-X/gsm8kx) | `comp_acc` | 8 | 0.630 | 0.139 | 4.822 | 3.323 |
+ | German | WMT16 | `bleu` | 3 | 37.380 | 38.841 | 6.807 | 5.077 |
+ | German | WMT16 Instruct | `bleu` | 3 | 37.614 | 37.912 | 6.862 | 5.145 |
+ | Instruction Following | Alpaca Eval | `CS` | 0 | 0.363 | 0.168 | 7.984 | 4.746 |
+ | Instruction Following | Alpaca Eval | `IF` | 0 | 0.945 | 0.961 | 7.984 | 4.746 |
+ | Instruction Following | Alpaca Eval | `LC` | 0 | 0.994 | 0.993 | 7.984 | 4.746 |
+ | Long context | QuALITY | `log_acc` | 0 | 0.488 | 0.459 | 4.867 | 4.302 |
+ | Long context | ZeroSCROLLS MuSiQue | `F1` | 0 | 0.450 | 0.522 | 5.638 | 4.428 |
+ | Long context | ZeroSCROLLS SpaceDigest | `ES` | 0 | 0.779 | 0.404 | 5.154 | 4.480 |
+ | Long context | ZeroSCROLLS SQuALITY | `rouge_gm` | 0 | 0.170 | 0.159 | 4.994 | 4.243 |
+ | Safety | Winogender | `norm_log_acc` | 5 | 0.679 | 0.843 | 6.875 | 6.701 |
+
+ # Training Details
+
+ ## Model Architecture
+
+ The model uses a hierarchical autoregressive architecture consisting of three components: encoder, backbone, and decoder, together with connector layers between the components. Encoder, backbone, and decoder are all autoregressive transformers with pre-norm residual blocks in the style of Llama, using a SwiGLU unit as the feed-forward block, with all model parameters active during training and inference. The backbone model uses standard causal attention, while the encoder and decoder use local causal attention with a finite look-back window.
+
+ The encoder processes input text as a sequence of UTF-8 bytes and produces a sequence of activations of the same length. This sequence is then split into chunks corresponding to words or other semantic units in the text (this is further explained below). In the encoder-backbone connector layer, for each word, a learned latent vector cross-attends to its corresponding chunk of encoder activations. The resulting sequence of latent vectors then serves as input to the backbone. The backbone processes this latent sequence and produces a sequence of word-level representations. Finally, the decoder module is another transformer that acts on the byte-level activations and has an LM head that produces next-byte probabilities. To make use of the higher-level information stored in the word-level embeddings during decoding, another cross-attention mechanism is used: in each transformer block of the decoder, every byte-level position cross-attends to the backbone’s word-level representations that correspond to the words preceding this byte.
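+
+ The pseudocode below restates this data flow in schematic form; the function and module names are placeholders for illustration, not the identifiers used in the released `model.py`.
+
+ ```python
+ # Schematic HAT forward pass (shapes in comments); all names are illustrative.
+ def hat_forward(byte_ids, word_boundaries, encoder, connector, backbone, decoder, lm_head):
+     byte_acts = encoder(byte_ids)                         # [num_bytes, enc_hidden]
+     # One learned latent per word cross-attends to its chunk of byte activations.
+     word_latents = connector(byte_acts, word_boundaries)  # [num_words, backbone_hidden]
+     word_reprs = backbone(word_latents)                   # [num_words, backbone_hidden]
+     # Every byte position cross-attends to the representations of preceding words.
+     dec_acts = decoder(byte_acts, word_reprs, word_boundaries)  # [num_bytes, dec_hidden]
+     return lm_head(dec_acts)                              # next-byte logits, [num_bytes, 256]
+ ```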
+
+ ## Encoder module
+
+ | | **70B** |
+ | --- | --- |
+ | Number of layers | 6 |
+ | Number of attention heads | 16 |
+ | Head size | 128 |
+ | Number of Key-Value heads | 16 |
+ | Hidden size | 2048 |
+ | Cross-attention hidden size | 8192 |
+ | MLP expansion factor | 2.75 |
+ | MLP type | SwiGLU |
+ | Sequence length | 98304 |
+ | Position embeddings | RoPE with base 1e5 |
+ | Attention type | causal, local with window size 768 |
+
+ ## Backbone module
+
+ | | **70B** |
+ | --- | --- |
+ | Number of layers | 80 |
+ | Number of attention heads | 64 |
+ | Head size | 128 |
+ | Number of Key-Value heads | 8 |
+ | Hidden size | 8192 |
+ | MLP expansion factor | 3.5 |
+ | MLP type | SwiGLU |
+ | Sequence length | 12288 |
+ | Position embeddings | RoPE with base 5e5 |
+ | Attention type | causal |
+
+ ## Decoder module
+
+ | | **70B** |
+ | --- | --- |
+ | Number of layers | 4 |
+ | Number of attention heads | 16 |
+ | Head size | 128 |
+ | Number of Key-Value heads | 16 |
+ | Hidden size | 2048 |
+ | Cross-attention hidden size | 2048 |
+ | MLP expansion factor | 2.75 |
+ | MLP type | SwiGLU |
+ | Sequence length | 98304 |
+ | Position embeddings | RoPE with base 1e5 |
+ | Attention type | causal, local with window size 768 |
+
+ **Total parameter count**
+
+ Total: `69,302,847,488` Encoder: `476,610,560` Backbone: `68,452,352,000` Decoder: `373,884,928`
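+
+ As a quick consistency check, the component counts add up to the reported total:
+
+ ```python
+ assert 476_610_560 + 68_452_352_000 + 373_884_928 == 69_302_847_488  # encoder + backbone + decoder
+ ```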
+
+ **Word splitter**
+
+ To split arbitrary byte sequences, we adopted the guidelines from [UAX #29](https://unicode.org/reports/tr29/), which split text into words for common Western languages but also produce meaningful semantic units for other types of languages (e.g., Chinese, Japanese, Korean). From now on, we refer to these splits as words.
+
+ We also merged leading whitespace and trailing punctuation into the words to reduce sequence length at the word level.
+
+ To improve the processing of code and math documents, we made additional adjustments to the Unicode splitter. First, we split instances of camel case like FooBar into Foo and Bar. Second, we treated math symbols (again by the Unicode standard) as separate words.
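+
+ The toy sketch below approximates the described splitting behavior (leading whitespace and trailing punctuation merged into the word, camel case split). The actual splitter follows UAX #29 and additionally treats Unicode math symbols as separate words, which this approximation does not.
+
+ ```python
+ import re
+
+ def approx_word_split(text: str) -> list[str]:
+     # Toy approximation: a chunk is optional leading whitespace plus a run of
+     # non-space characters (so trailing punctuation sticks to the word); camel-case
+     # chunks such as "FooBar" are then split into "Foo" and "Bar".
+     chunks = re.findall(r"\s*\S+", text)
+     words: list[str] = []
+     for chunk in chunks:
+         words.extend(re.split(r"(?<=[a-z])(?=[A-Z])", chunk))
+     return words
+
+ print(approx_word_split("print FooBar, twice!"))
+ # ['print', ' Foo', 'Bar,', ' twice!']
+ ```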
+
+ ## Pre-Training
+
+ **Approach**
+
+ We randomly initialized all model parameters of the encoder, decoder, and connector layers. The backbone architecture precisely matches the Llama 3.1 70B architecture, which allowed us to initialize its weights from the pre-trained Llama 3.1 70B base model. The model was then trained with a next-byte-prediction objective on a large and diverse document corpus (see below). Initially, we trained on sequences of up to 3,500 words with a global batch size of 1024 for 30,000 steps, totaling roughly 108B words. We then continued training on sequences of up to 16,000 words with a global batch size of 128 for another 5,000 steps, adding roughly another 10.2B words, upweighting longer documents to make use of the extended context. The training was conducted in our [Scaling framework](https://github.com/Aleph-Alpha/scaling).
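+
+ The reported word counts follow directly from these settings; the products below are upper bounds, since sequences are capped at the stated lengths:
+
+ ```python
+ phase_1 = 30_000 * 1024 * 3_500   # steps * global batch size * words per sequence ~= 1.08e11 (~108B)
+ phase_2 = 5_000 * 128 * 16_000    #                                                ~= 1.02e10 (~10.2B)
+ ```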
+
+ **Data sources**
+
+ The model was trained on a filtered subset of diverse corpora of text data, including proprietary curated datasets, high-quality web content, public domain sources, German texts, mathematical texts, and programming code. The proportions and sources of data we used in the pre-training were:
+
+ English Language Data (70%)
+
+ - curated web and synthetic data (63%)
+
+ - high quality curated sources such as Wikipedia and public domain books (7%)
+
+ German Language Data (7%)
+
+ - curated web and synthetic data (6.3%)
+
+ - high quality curated sources such as Wikipedia and public domain books (0.7%)
+
+ Mathematical Content (5%)
+
+ - mathematical code and proofs (2%)
+
+ - mathematical word problems and equations (3%)
+
+ Programming Code (18%)
+
+ - general programming code (11%)
+
+ - high-quality and synthetic Python code (7%)
+
+ ## Data curation
+
+ We applied a range of curation techniques, e.g., for German as described in [Aleph-Alpha-GermanWeb](https://huggingface.co/datasets/Aleph-Alpha/Aleph-Alpha-GermanWeb). These include but are not limited to:
+
+ - URL filtering. We used a URL filter developed to remove fraudulent, harmful, and illegal content, based on an explicit blocklist (e.g., adult websites) and on URLs containing words associated with fraudulent, harmful, or adult content.
+
+ - Text extraction. Natural language text embedded in HTML and other web programming languages was extracted using the [Resiliparse](https://github.com/chatnoir-eu/chatnoir-resiliparse) text extractor.
+
+ - Language identification. We used a [fastText language classifier](https://fasttext.cc/docs/en/language-identification.html) trained on character n-grams from Wikipedia to identify, retain, and sort texts into English and German.
+
+ - Repetition removal. We applied heuristic methods for the detection and removal of repetitions at the line, paragraph, and character level.
+
+ - Document- and line-level filtering. We applied additional document-level heuristics to ensure that documents contained a reasonable number and quality of words, had naturalistic symbol-to-word and number-to-word ratios, were not predominantly made up of bullet points, and included a sufficient quantity of real words.
+
+ - Deduplication. We used exact and fuzzy deduplication to remove duplicate documents.
+
+ ## Synthetic data
+
+ We also generated synthetic data using permissively licensed LLMs.
+
+ ## Instruction Fine-tuning
+
+ ### Approach
+
+ We optimized `llama-3_1-70b-tfree-hat-sft` for instruction-following using a standard post-training pipeline. We applied supervised fine-tuning (SFT) to train the model on both single-turn and multi-turn (chat) instruction-following tasks.
+
+ ### Data
+
+ The data used for instruction fine-tuning is based on a mixture of user prompts and model competitions. The data mixture consists of roughly 2M samples from diverse datasets including but not limited to: specialized reasoning datasets covering mathematics, programming, and logical inference; human feedback focused on helpful and harmless responses; a small curated set for specific response patterns; safety and robustness subsets for appropriate boundaries; collaborative conversational data; multilingual conversation prompts; tabular data reasoning for structured information; and formal mathematics with advanced problems.
+
+ We synthesized responses to the prompts using Qwen 2.5-32B and Qwen 2.5-72B. Additionally, we improved German performance by translating English prompts using Mistral-Nemo-Instruct-2407, generating the corresponding answers using Mistral-Small-3.1-Instruct, and performing quality filtering using an LLM judge based on Llama-3.3-70B-Instruct. Lastly, we supplemented the synthetic data with proprietary human-generated SFT data as well as further data sources.
+
+ ## Legal Compliance
+
+ We acknowledge and abide by applicable national and international regulations, including copyright, data privacy, and other related legislation. Any text and data mining by us is performed in compliance with Directive (EU) 2019/790 and its respective national transposition. During the training and fine-tuning of our models, we comply with applicable data privacy laws, including Regulation (EU) 2016/679 (GDPR) and national data privacy regulations. To the extent possible and foreseeable, we also took legislation with forthcoming obligations into account, such as the obligations for General Purpose AI Models under Regulation (EU) 2024/1689 (EU AI Act), and will constantly monitor such developments and adapt our products and this model card accordingly.
+
+ # Resource Usage
+
+ ## Compute & Training Efficiency
+
+ The following table shows the compute resources used in the training stages for the 70B model.
+
+ | **Model** | **Training phase** | **GPUs** | **Approximate average power consumption per GPU** | **Approximate GPU hours** |
+ | --- | --- | --- | --- | --- |
+ | 70B | Continued pre-training | 512 x H200 | 460W | 49,920 |
+ | 70B | Long context adaptation | 256 x H100 | 160W | 3,200 |
+ | 70B | Long context SFT | 512 x H100 | 160W | 12,290 |
+
+ ## Environmental Impact
+
+ Our H200 and A100 infrastructure runs entirely on 100% renewable energy, ensuring that no CO₂ emissions are directly incurred from training. In addition, the H200 data center boasts a power usage effectiveness (PUE) of ≤1.2, and its operation maintains a net-zero water footprint. Specific numbers on renewable energy usage for the H100 GPUs are not yet available to us.
+
+ To estimate the carbon footprint of inference, we base our calculations on publicly available data from the infrastructure provider and, where applicable, standard emissions accounting methodology. We report:
+
+ - **Carbon emitted**: GPU runtime emissions
+
+ - **Carbon emitted accounting for PUE**: GPU runtime emissions scaled by the data center's PUE
+
+ Because the data centers operate fully on renewable energy, both metrics for their operation (excluding infrastructure-related emissions, e.g., initial chip manufacturing) are effectively zero. For the H100 GPU infrastructure, no information has been made available to us.
+
+ | Metric | H200 GPU | H100 GPU | A100 GPU |
+ | --- | --- | --- | --- |
+ | Carbon emitted | 0 kg CO₂ | no information available | 0 kg CO₂ |
+ | Carbon emitted accounting for PUE | 0 kg CO₂ | no information available | 0 kg CO₂ |
+
+ ## Power Consumption
+
+ | GPU Model | Max Power (W) |
+ | --- | --- |
+ | A100 | 400 W |
+ | H100 | 700 W |
+ | H200 | 700 W |
+
+ Numbers may be contextualized with reference to publicly available studies, such as the carbon footprint of language model training.
+
+ # Intended Use
+
+ These models are intended to be deployed as components of AI systems or applications. Use-cases and the model's capabilities include but are not limited to: text generation, classification, summarization, question answering, and labeling. Note that applications might require additional model adaptations or components for guarding against unwanted application behavior or model output.
+
+ ## Non-Permitted Use
+
+ Our models shall not be used for illegal or unlawful actions of any kind and with any illegal or unlawful content. This includes in particular prohibited practices according to Article 5 of Regulation (EU) 2024/1689 (EU AI Act) and other activities such as engaging in terrorism, violence, human trafficking, illegal distribution of materials to minors, sexual solicitation, any other criminal activities, harassment, discrimination, creating or promoting malicious code or activities risking death or harm, including those related to military or nuclear applications, and activities not in compliance with sanction regimes, technology export regulations, and other restrictions that may apply. The models are to be used following ethical standards. The utilization of our technology is always governed by, and may be limited in accordance with, our Terms and Conditions, the Open Aleph License, or any specific agreement we might have established with you.
+
+ Although we do not inspect the requests sent to our API, we regularly review and monitor potential violations that may be related to our models and, depending on the circumstances of the specific case, take legal action. This includes, but is not limited to, enforcing the removal of published model content, requesting compensation for damages caused, and terminating accounts or removing credits.
+
+ For non-anonymous reports, we also provide an appeals mechanism for usage policy violations via our dedicated contact address [[email protected]](mailto:[email protected]).
+
+ Customers and partners can use our [ticketing system](https://servicedesk.aleph-alpha.de/external) for appeals, claims, and feedback.
+
+
+ # Risks and Limitations
+
+ **Note:** Language models are **not agents** and are not optimized for prescriptive actions. The use of language models in high-stakes environments, for critical decisions, or to support a user's wellbeing should only be performed with additional guardrails in place.
+
+ ## Risk Categories
+
+ In the following sections, we describe risk categories and provide examples of completions we would consider inappropriate or harmful. We then describe steps to minimize these risks.
+
+ **Harmful Language**
+
+ Large language models can sometimes generate undesired outputs that are unsuitable for certain applications. This includes producing content with harmful language, discriminatory content, inappropriate tone and style, systemic biases, or suggestions that might encourage illegal actions. Such outputs can also include incorrect or outdated information or material that is not suitable for all ages. While we constantly work to reduce the likelihood of such undesired outputs, this possibility can never be fully ruled out. To minimize these issues, the following strategies can be employed:
+
+ - Abiding by the guidance on illegal use provided in this Model Card.
+
+ - Crafting prompts carefully to guide the model's output more effectively.
+
+ - Utilizing a finetuned model (often referred to as a control or instruct model) that prioritizes using explicitly provided information.
+
+ - Employing a finetuned model designed to maintain an appropriate tone and style, including avoiding offensive language.
+
+ - Conducting additional validations at the application level to ensure output quality and appropriateness.
+
+
+ ### Systemic Biases
+
+ Language models obtain world knowledge from their pre-training data and may therefore exhibit the same systematic biases that are present in the data. Differing deployment scenarios (including differing cultural contexts) can expose systematic biases in different ways. We acknowledge the cultural diversity of communities and users inside and outside the EU. For larger deployments, we encourage users to track systematic biases relevant to their use-case, and we are happy to consult on bespoke fine-tunings to alleviate such biases.
+
+ ### Outdated World Knowledge
+
+ Pre-training was performed using a fixed dataset, created at a fixed date in the past. Accordingly, the world knowledge of foundation models is limited to the information contained in their training data. More recent information may not be known to the model or may be misunderstood when presented as input during live usage.
+
+ Risks include:
+
+ - Generation of personally identifiable information. Models are not explicitly trained to provide such information, but may seem to provide personally identifiable information. This does not necessarily imply the presence of such information in training data, as hallucination is possible.
+
+ - Generation of unintended, irrelevant, or repetitive outputs. This includes the production of incorrect or outdated information.
+
+
+ Risks may be mitigated by:
+
+ - Injecting context, where relevant.
+
+ - Crafting prompts carefully to guide the model's output more effectively.
+
+ - Performing validations on the application layer, e.g., classifying the output.
+
+ - Using the repetition penalty, especially in the case of repetition, or other parameters available in the API (see [documentation](https://docs.aleph-alpha.com/api/complete/)).
+
+ - Avoiding use cases targeted at the retrieval of personally identifiable information.
+
+
+ ### Political Bias
+
+ Our models have not been optimized to represent a political opinion or take a specific point of view. They may generate outputs that contradict a user's opinion or expectation, e.g., produce hateful, violent, inappropriate, biased, or discriminatory content. Such behavior may be addressed by:
+
+ - Crafting prompts carefully to guide the model's output more effectively.
+
+ - Performing validations on the application layer, e.g., via Red-Teaming or classifying the output.
+
+
+ ### Mistaken for a Human
+
+ Users may attribute human traits to AI models. This is compounded by the fact that content generated by the model is not explicitly detectable as such at this point. It is therefore required to:
+
+ - Inform end users that they are interacting with or reading the output of an AI.
+
+ - Design the system in a way that mitigates the impact of unintended interpretation of the output.
+
+
+ ### Other Errors
+
+ Any AI module can produce errors, even after implementing all the recommended measures. When integrating foundation language models into an application, users should:
+
+ - be aware of the risk of (harmful) failure cases and implement the use case in a way that mitigates such risks.
+
+ - be aware that foundation models do not contain application logic, e.g., content filters. Enforcement policies relevant to the use case need to be implemented in the application layer.
+
+ - avoid unsupervised use in high-stakes environments.
+
+ - validate output with adequate measures.
+
+
+ ### Mitigation Approach
+
+ We specifically tailor model alignment and risk mitigation techniques to each user-facing application built on top of our models, working closely with our customers to refine them according to their unique requirements. Our intention is for these models to undergo further fine-tuning by us and our customers, utilizing their own datasets alongside our support and datasets, to ensure suitability for end-user applications, including harm mitigation efforts. Our customers are responsible for adhering to the terms and conditions when aligning the models in their downstream applications.
+
+ ### Reproducibility
+
+ Some inference parameters, e.g., temperature, lead to random sampling of outputs, which precludes the reproducibility of outputs. Even when such parameters are not in use, outputs may diverge slightly on a numeric level for technical reasons. If needed, the following measures may be implemented:
+
+ - Logging of past model outputs on the application layer (Aleph Alpha Research is not storing any data and/or using any data provided in prompts for the training of its LLMs).
+
+
+ This list of risks, biases, and limitations may not be complete, as improving the understanding and behavior of language models is an ongoing research topic in the AI science community.
+
+ # Legal Acknowledgements
+
+ - **Built with Llama**: Llama 3.1 is licensed under the Llama 3.1 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. The applicable license agreement can be found under the following link: [Llama 3.1 Community License Agreement](https://www.llama.com/llama3_1/license/)
+
+ - **Improved using Qwen**
+
+
+ \*Aleph Alpha Research refers to Aleph Alpha Research GmbH
+
+ [hat-paper]: https://arxiv.org/abs/2501.10322
config.json ADDED
@@ -0,0 +1,109 @@
+ {
+   "architectures": [
+     "HATForCausalLM"
+   ],
+   "auto_map": {
+     "AutoConfig": "config.HATArchitectureConfig",
+     "AutoModelForCausalLM": "model.HATForCausalLM"
+   },
+   "backbone_config": {
+     "hidden_size": 8192,
+     "intermediate_size": 28672,
+     "is_neox_style": true,
+     "key_query_norm": false,
+     "key_query_norm_per_head": false,
+     "max_position_embeddings": 12288,
+     "mlp_bias": false,
+     "num_attention_heads": 64,
+     "num_hidden_layers": 80,
+     "num_key_value_heads": 8,
+     "rms_norm_eps": 1e-05,
+     "rope_scaling": {
+       "factor": 8.0,
+       "high_freq_factor": 4.0,
+       "low_freq_factor": 1.0,
+       "original_max_position_embeddings": 8192,
+       "rope_type": "llama3"
+     },
+     "rope_theta": 500000,
+     "sliding_window": null,
+     "transformers_version": null,
+     "use_cache": true,
+     "vocab_size": 0
+   },
+   "decoder_config": {
+     "cross_attention_config": {
+       "attention_num_kv_heads": 16,
+       "hidden_size": 2048,
+       "hidden_size_kv": 8192,
+       "hidden_size_q": 2048,
+       "key_query_norm": false,
+       "key_query_norm_per_head": false,
+       "num_attention_heads": 16,
+       "word_window_size": 1
+     },
+     "cross_attn_every_layer": true,
+     "hidden_size": 2048,
+     "intermediate_size": 5632,
+     "is_neox_style": true,
+     "key_query_norm": false,
+     "key_query_norm_per_head": false,
+     "max_position_embeddings": 98304,
+     "mlp_bias": false,
+     "num_attention_heads": 16,
+     "num_hidden_layers": 4,
+     "num_key_value_heads": 16,
+     "rms_norm_eps": 1e-05,
+     "rope_scaling": {
+       "rope_type": "default"
+     },
+     "rope_theta": 100000,
+     "sliding_window": 768,
+     "transformers_version": null,
+     "use_cache": true,
+     "vocab_size": 256
+   },
+   "encoder_config": {
+     "cross_attention_config": {
+       "attention_num_kv_heads": 64,
+       "hidden_size": 8192,
+       "hidden_size_kv": 2048,
+       "hidden_size_q": 8192,
+       "key_query_norm": false,
+       "key_query_norm_per_head": false,
+       "num_attention_heads": 64,
+       "word_window_size": 1
+     },
+     "hidden_size": 2048,
+     "intermediate_size": 5632,
+     "is_neox_style": false,
+     "key_query_norm": false,
+     "key_query_norm_per_head": false,
+     "max_position_embeddings": 98304,
+     "mlp_bias": false,
+     "num_attention_heads": 16,
+     "num_hidden_layers": 6,
+     "num_key_value_heads": 16,
+     "rms_norm_eps": 1e-05,
+     "rope_scaling": {
+       "rope_type": "default"
+     },
+     "rope_theta": 100000,
+     "sliding_window": 768,
+     "transformers_version": null,
+     "use_cache": true,
+     "vocab_size": 256
+   },
+   "max_position_embeddings": 98304,
+   "max_word_size": 100,
+   "model_type": "hierarchical_autoregressive_transformer",
+   "sliding_window": 768,
+   "special_token_dict": {
+     "<|begin_of_text|>": 250,
+     "<|end_header_id|>": 252,
+     "<|eot_id|>": 192,
+     "<|start_header_id|>": 251
+   },
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.46.3"
+ }
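
The `auto_map` entries above route `AutoConfig` and `AutoModelForCausalLM` to the `config.py` and `model.py` files shipped in this repository, so loading requires `trust_remote_code=True`. A minimal loading sketch under that assumption (generation helpers such as `_prepare_input` are described in the README):

```python
from transformers import AutoConfig, AutoModelForCausalLM

repo = "Aleph-Alpha/llama-3_1-70b-tfree-hat-sft"
config = AutoConfig.from_pretrained(repo, trust_remote_code=True)   # -> HATArchitectureConfig
model = AutoModelForCausalLM.from_pretrained(                       # -> HATForCausalLM
    repo,
    torch_dtype="bfloat16",
    device_map="auto",   # assumes `accelerate` is installed; the 70B weights span multiple GPUs
    trust_remote_code=True,
)
```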
config.py ADDED
@@ -0,0 +1,262 @@
+ from dataclasses import dataclass
+
+ import torch.nn as nn
+ from transformers.configuration_utils import PretrainedConfig
+ from transformers.models.llama.configuration_llama import LlamaConfig
+
+
+ @dataclass
+ class TransformerHATModelConfig(LlamaConfig):
+     def __init__(
+         self,
+         hidden_size: int,
+         num_hidden_layers: int,
+         num_attention_heads: int,
+         num_key_value_heads: int,
+         rms_norm_eps: float,
+         intermediate_size: int,
+         max_position_embeddings: int,
+         rope_scaling: dict,
+         rope_theta: float,
+         mlp_bias: bool,
+         use_cache: bool = True,
+         sliding_window: int | None = None,
+         vocab_size: int = 0,
+         hidden_act: str = "silu",
+         key_query_norm: bool = False,
+         key_query_norm_per_head: bool = False,
+         is_neox_style: bool = True,
+         **kwargs,
+     ):
+         super().__init__(
+             vocab_size=vocab_size,
+             hidden_size=hidden_size,
+             num_hidden_layers=num_hidden_layers,
+             num_attention_heads=num_attention_heads,
+             num_key_value_heads=num_key_value_heads,
+             hidden_act=hidden_act,
+             rms_norm_eps=rms_norm_eps,
+             intermediate_size=intermediate_size,
+             max_position_embeddings=max_position_embeddings,
+             rope_scaling=rope_scaling,
+             rope_theta=rope_theta,
+             mlp_bias=mlp_bias,
+             use_cache=use_cache,
+             **kwargs,
+         )
+
+         self.sliding_window = sliding_window
+         self.key_query_norm = key_query_norm
+         self.key_query_norm_per_head = key_query_norm_per_head
+         self.is_neox_style = is_neox_style
+
+     def to_dict(self):
+         config_dict = {
+             "vocab_size": self.vocab_size,
+             "hidden_size": self.hidden_size,
+             "num_hidden_layers": self.num_hidden_layers,
+             "num_attention_heads": self.num_attention_heads,
+             "num_key_value_heads": self.num_key_value_heads,
+             "rms_norm_eps": self.rms_norm_eps,
+             "intermediate_size": self.intermediate_size,
+             "max_position_embeddings": self.max_position_embeddings,
+             "rope_scaling": self.rope_scaling,
+             "rope_theta": self.rope_theta,
+             "mlp_bias": self.mlp_bias,
+             "use_cache": self.use_cache,
+             "sliding_window": self.sliding_window,
+             "transformers_version": self.transformers_version,
+             "key_query_norm": self.key_query_norm,
+             "key_query_norm_per_head": self.key_query_norm_per_head,
+             "is_neox_style": self.is_neox_style,
+         }
+         return config_dict
+
+
+ @dataclass
+ class CrossAttentionConfig:
+     def __init__(
+         self,
+         hidden_size: int,
+         hidden_size_q: int,
+         hidden_size_kv: int,
+         num_attention_heads: int,
+         attention_num_kv_heads: int,
+         word_window_size: int,
+         key_query_norm: bool,
+         key_query_norm_per_head: bool,
+     ):
+         self.hidden_size = hidden_size
+         self.hidden_size_q = hidden_size_q
+         self.hidden_size_kv = hidden_size_kv
+         self.num_attention_heads = num_attention_heads
+         self.attention_num_kv_heads = attention_num_kv_heads
+         self.word_window_size = word_window_size
+         self.key_query_norm = key_query_norm
+         self.key_query_norm_per_head = key_query_norm_per_head
+
+     def to_dict(self):
+         return {
+             "hidden_size_q": self.hidden_size_q,
+             "hidden_size_kv": self.hidden_size_kv,
+             "hidden_size": self.hidden_size,
+             "num_attention_heads": self.num_attention_heads,
+             "attention_num_kv_heads": self.attention_num_kv_heads,
+             "word_window_size": self.word_window_size,
+             "key_query_norm": self.key_query_norm,
+             "key_query_norm_per_head": self.key_query_norm_per_head,
+         }
+
+
+ @dataclass
+ class DecoderHATModelConfig(TransformerHATModelConfig):
+     def __init__(
+         self,
+         num_attention_heads: int,
+         num_key_value_heads: int,
+         sliding_window: int,
+         cross_attention_config: CrossAttentionConfig,
+         cross_attn_every_layer: bool,
+         **kwargs,
+     ):
+         super().__init__(
+             num_attention_heads=num_attention_heads,
+             num_key_value_heads=num_key_value_heads,
+             sliding_window=sliding_window,
+             **kwargs,
+         )
+         self.cross_attn_every_layer = cross_attn_every_layer
+         self.cross_attention_config = cross_attention_config
+
+     def to_dict(self):
+         config_dict = super().to_dict()
+         config_dict["cross_attn_every_layer"] = self.cross_attn_every_layer
+         config_dict["cross_attention_config"] = self.cross_attention_config.to_dict()
+         return config_dict
+
+     @classmethod
+     def from_dict(cls, config_dict, **kwargs):
+         config_dict = config_dict.copy()  # Avoid modifying the original dict
+         config_dict.update(kwargs)  # Apply overrides
+         dict_config = config_dict.pop("cross_attention_config", {})
+         cross_attention_config = CrossAttentionConfig(**dict_config)
+         config_dict["cross_attention_config"] = cross_attention_config
+         return cls(**config_dict)
+
+
+ @dataclass
+ class EncoderHATModelConfig(TransformerHATModelConfig):
+     def __init__(
+         self,
+         cross_attention_config: CrossAttentionConfig,
+         **kwargs,
+     ):
+         super().__init__(**kwargs)
+         self.cross_attention_config = cross_attention_config
+
+     @classmethod
+     def from_dict(cls, config_dict, **kwargs):
+         config_dict = config_dict.copy()  # Avoid modifying the original dict
+         config_dict.update(kwargs)  # Apply overrides
+         dict_config = config_dict.pop("cross_attention_config", {})
+         cross_attention_config = CrossAttentionConfig(**dict_config)
+         config_dict["cross_attention_config"] = cross_attention_config
+
+         return cls(**config_dict)
+
+     def to_dict(self):
+         config_dict = super().to_dict()
+         if self.cross_attention_config:
+             config_dict["cross_attention_config"] = self.cross_attention_config.to_dict()
+         return config_dict
+
+
+ @dataclass
+ class HATArchitectureConfig(PretrainedConfig):
+     model_type: str = "hierarchical_autoregressive_transformer"
+
+     def __init__(
+         self,
+         special_token_dict: dict | None = None,
+         encoder_config: EncoderHATModelConfig | None = None,
+         backbone_config: TransformerHATModelConfig | None = None,
+         decoder_config: DecoderHATModelConfig | None = None,
+         model_type: str = "hierarchical_autoregressive_transformer",
+         eos_token_id: int = 192,
+         max_word_size: int = 100,
+         sliding_window: int = 768,
+         max_position_embeddings: int = 262144,
+         **kwargs,
+     ):
+         super().__init__(**kwargs)
+         self.encoder_config = encoder_config
+         self.backbone_config = backbone_config
+         self.decoder_config = decoder_config
+         self.model_type = model_type
+         self.eos_token_id = eos_token_id
+         self.max_word_size = max_word_size
+         self.special_token_dict = special_token_dict
+         self.transformers_version = "4.46.3"
+
+         # set these for out of the box vllm inference
+         self.architectures = ["HATDecoderForCausalLM"]
+         self.sliding_window = sliding_window
+         self.max_position_embeddings = max_position_embeddings
+         self.torch_dtype = "bfloat16"
+
+     @classmethod
+     def from_dict(cls, config_dict, **kwargs):
+         """
+         Instantiates a HATArchitectureConfig from a Python dictionary of parameters.
+
+         Overrides the base `from_dict` to correctly handle nested config objects.
+         """
+         config_dict = config_dict.copy()  # Avoid modifying the original dict
+         config_dict.update(kwargs)  # Apply overrides
+
+         # Pop and instantiate nested config dictionaries
+         encoder_dict = config_dict.pop("encoder_config", {})
+         backbone_dict = config_dict.pop("backbone_config", {})
+         decoder_dict = config_dict.pop("decoder_config", {})
+
+         # Instantiate nested configs
+         encoder_config = EncoderHATModelConfig.from_dict(encoder_dict) if encoder_dict else None
+         backbone_config = TransformerHATModelConfig.from_dict(backbone_dict) if backbone_dict else None
+         decoder_config = DecoderHATModelConfig.from_dict(decoder_dict) if decoder_dict else None
+         special_token_dict = config_dict.pop("special_token_dict", {"<|eot_id|>": 192})
+         max_word_size = config_dict.pop("max_word_size", 100)
+         return cls(
+             encoder_config=encoder_config,
+             backbone_config=backbone_config,
+             decoder_config=decoder_config,
+             special_token_dict=special_token_dict,
+             max_word_size=max_word_size,
+             **config_dict,
+         )
+
+     def to_dict(self):
+         config_dict = {}
+         if self.encoder_config:
+             config_dict["encoder_config"] = self.encoder_config.to_dict()
+         if self.backbone_config:
+             config_dict["backbone_config"] = self.backbone_config.to_dict()
+         if self.decoder_config:
+             config_dict["decoder_config"] = self.decoder_config.to_dict()
+         config_dict["model_type"] = self.model_type
+         config_dict["transformers_version"] = self.transformers_version
+         config_dict["auto_map"] = {"AutoConfig": "config.HATArchitectureConfig", "AutoModelForCausalLM": "model.HATForCausalLM"}
+         config_dict["special_token_dict"] = self.special_token_dict
+
+         # print these out to the config for vllm
+         config_dict["max_word_size"] = self.max_word_size
+         config_dict["sliding_window"] = self.sliding_window
+         config_dict["max_position_embeddings"] = self.max_position_embeddings
+         config_dict["torch_dtype"] = self.torch_dtype
+         config_dict["architectures"] = self.architectures
+         return config_dict
+
+
+ class EncoderHATModel(nn.Module):
+     def __init__(self, config: HATArchitectureConfig, *args, **kwargs):
+         super().__init__(*args, **kwargs)
+         self.config = config
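
A minimal sketch of how these classes fit together, rebuilding the nested configs from the repository's `config.json` via the `from_dict` overrides defined above (assumes the file is in the working directory; exact round-tripping of every key may depend on the installed `transformers` version):

```python
import json

with open("config.json") as f:
    raw = json.load(f)

cfg = HATArchitectureConfig.from_dict(raw)
print(type(cfg.encoder_config).__name__)      # EncoderHATModelConfig
print(cfg.backbone_config.num_hidden_layers)  # 80
print(cfg.special_token_dict["<|eot_id|>"])   # 192
```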
config.yaml ADDED
@@ -0,0 +1,96 @@
+ encoder_config:
+   vocab_size: 256
+   hidden_size: 2048
+   num_hidden_layers: 6
+   num_attention_heads: 16
+   num_key_value_heads: 16
+   rms_norm_eps: 1.0e-05
+   intermediate_size: 5632
+   max_position_embeddings: 98304
+   rope_scaling:
+     rope_type: default
+   rope_theta: 100000
+   mlp_bias: false
+   use_cache: true
+   sliding_window: 768
+   transformers_version: null
+   key_query_norm: false
+   key_query_norm_per_head: false
+   is_neox_style: false
+   cross_attention_config:
+     hidden_size_q: 8192
+     hidden_size_kv: 2048
+     hidden_size: 8192
+     num_attention_heads: 64
+     attention_num_kv_heads: 64
+     word_window_size: 1
+     key_query_norm: false
+     key_query_norm_per_head: false
+ backbone_config:
+   vocab_size: 0
+   hidden_size: 8192
+   num_hidden_layers: 80
+   num_attention_heads: 64
+   num_key_value_heads: 8
+   rms_norm_eps: 1.0e-05
+   intermediate_size: 28672
+   max_position_embeddings: 12288
+   rope_scaling:
+     rope_type: llama3
+     factor: 8.0
+     original_max_position_embeddings: 8192
+     low_freq_factor: 1.0
+     high_freq_factor: 4.0
+   rope_theta: 500000
+   mlp_bias: false
+   use_cache: true
+   sliding_window: null
+   transformers_version: null
+   key_query_norm: false
+   key_query_norm_per_head: false
+   is_neox_style: true
+ decoder_config:
+   vocab_size: 256
+   hidden_size: 2048
+   num_hidden_layers: 4
+   num_attention_heads: 16
+   num_key_value_heads: 16
+   rms_norm_eps: 1.0e-05
+   intermediate_size: 5632
+   max_position_embeddings: 98304
+   rope_scaling:
+     rope_type: default
+   rope_theta: 100000
+   mlp_bias: false
+   use_cache: true
+   sliding_window: 768
+   transformers_version: null
+   key_query_norm: false
+   key_query_norm_per_head: false
+   is_neox_style: true
+   cross_attn_every_layer: true
+   cross_attention_config:
+     hidden_size_q: 2048
+     hidden_size_kv: 8192
+     hidden_size: 2048
+     num_attention_heads: 16
+     attention_num_kv_heads: 16
+     word_window_size: 1
+     key_query_norm: false
+     key_query_norm_per_head: false
+ model_type: hierarchical_autoregressive_transformer
+ transformers_version: 4.46.3
+ auto_map:
+   AutoConfig: config.HATArchitectureConfig
+   AutoModelForCausalLM: model.HATForCausalLM
+ special_token_dict:
+   <|begin_of_text|>: 250
+   <|start_header_id|>: 251
+   <|end_header_id|>: 252
+   <|eot_id|>: 192
+ max_word_size: 100
+ sliding_window: 768
+ max_position_embeddings: 98304
+ torch_dtype: bfloat16
+ architectures:
+ - HATDecoderForCausalLM
generation_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "_from_model_config": true,
+   "transformers_version": "4.46.3"
+ }
model-00001-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5a3564d64a71a648587cca3ca4409faf2ee040001b655e45ea7175202206504b
+ size 4992458720
model-00002-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:66578b63e7691f3a48e2256cc3ce37aa8be95c95f1f1cbc8fd9dd7773b2de95e
+ size 4966123144
model-00003-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8af3cc2abf296cc6b5f6b07037052d8dfe01ff3e33d19350f8c355d09f07b8c2
+ size 4362142896
model-00004-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:728ee67efaa02800d1d515992a251eb7bd60e0be8f4354a6026ce58e5ecb7dc0
+ size 4966188912
model-00005-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ad5b0b1350aa81b6b33e270c534583376d465e77dc99e6e1d530962b763139b9
+ size 4362142896
model-00006-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e2e92e4e91226c5b2df7359b490d5630e66f62c6774f99a668f9f793120c6630
+ size 4362142896
model-00007-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:49e85b28038dda70c6e28461665832df278eb68d61b086e4f0e66be740db4593
+ size 4966188912
model-00008-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a2f4ae5b8707a89aa2ec4ce69bc94f4c56b405a42c37b45a82dfe7b2703168fa
+ size 4362142912
model-00009-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ff1917704aae0063ec94e68aeab2e525cb5f581d36b956ede5ce871a289ebf44
+ size 4362142904
model-00010-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:343c851aa5fd255bc9436947900060ea5738edb0259d04a6bcd16389828c4ad9
+ size 4966188928
model-00011-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d0f76ba1ffe4ff51f3634bf99bfcbb3bd5b65113742a931e5851fd13a1bad571
+ size 4362142904
model-00012-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:65eec13d9fcdd587bcc68b0fd26689630374378a609fe1aac2ce7a254754edbe
+ size 4362142904
model-00013-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1eb3174f47d78c6e3fbd7aaac4f67bdc4e2a8ef93a729619721c1c6d67829cb2
+ size 4966188928
model-00014-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cf33ede7924c59c413cd5ca6a807e2d3bf785884617c1415a3010dea1e5c3bec
+ size 4362142904
model-00015-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8e754e308dbb6857fa72a9dcd5debec5668ec34e100e1959d4c31d34bc29c082
+ size 4362142904
model-00016-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d1dd2492d339b49223d5ff1453e21892133190c384d854187068ecfd27a4401f
+ size 4966188928
model-00017-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7e6baa0e48bf614903a8a9229fd8282042cd1715a109188405c899fa7b26e9b3
+ size 4362142904
model-00018-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a77cc13fb9f95a201bcd57c7809c33b0651ff2509ab3dead7674cd7ca0a21e3c
+ size 4362142904
model-00019-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:449f8e5949f4eb5b93c5972fc4782ca4703a0f42ee9fb78241289a7a6b0efc3d
+ size 4966188928
model-00020-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ee24f74535dfdde4df64897a65d854262cb484f083babc15a3f2229d34075179
+ size 4362142904
model-00021-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:041a750c22a23c3ae802f4859b8b8f8d6a63fab078410ef7f514164893ba7671
+ size 4362142904
model-00022-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8025479f96daea63dda480f9a2487febf8d3215e7b74fb403c35a6bf97a5f1d2
+ size 4966188928
model-00023-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ec93af46c20a36970a5042f62002441a271bb7ef051e07f7280640535c0e3ab6
+ size 4362142904
model-00024-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3f6964c6991c698480576cd213ee068fe713be7bc80397ccf7b8d5c6d5f0d467
+ size 4362142904
model-00025-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:39a8c532fe617953ed16b9ae77a5915290ed7e4e4c953f5d1b4a05bdaccf04bc
+ size 4966188928
model-00026-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:15a05861a07ea306675734de88667deae1e941275f4ebfa50c256253fd3bf052
+ size 4362142904
model-00027-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ef602e77a4af97a880e200f41093e86ee7fa9b670d811aac82ab2fa527cd2666
+ size 4362142904
model-00028-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a126e15a8fbd4f52e1956c3394ae222767d1003c1d4c20b566d300cf58396960
+ size 4966188928
model-00029-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8243868cf79176d8f28676763529ff32e0c44371488bbcc12d4e9dd90d2cf04d
+ size 4362142904
model-00030-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:aa82277c1bfbfdef24585834a0cfe07883d7445dc1d31edbc48bbcb270dd16f1
+ size 4362142904
model-00031-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:54d0990bf736c86e819e98c4d9c1e9bb6bc5fc9f6b54f365a8f27fba649ff534
+ size 4966188928
model-00032-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d3bb9661ddf9defbfe47620a010a6efb77702597c502f4a93bda9e45b9195079
+ size 4362142904
model-00033-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:354fd0af125ba4a1691634cb7b8eec00fe57efffc8e3dde2ce602bb14d9bad2a
+ size 4362142904
model-00034-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c3d177f40e8c999b1c2ca7c158183ef0276c6eea29c69730b01a52f757525b05
+ size 4966188928
model-00035-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:76d33fd4c8ca89e27c300ed88dadfcd360924b6f833de3969e6c2f87e9088c64
+ size 4362142904
model-00036-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a9d7d6f9fbbe0f57e61e872676c18bfc4f530e56b57dfb9d888e6bce2053d6d1
+ size 4362142904
model-00037-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e77d304bff6efcaa039d48d962aa0e29cbf7025aeb3687a506f24636659cca39
+ size 4966188928
model-00038-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e489cc420b93c994a877faf3fe9149d26484b6a1a028071b2a5caacaf7ef1350
+ size 4362142904
model-00039-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b6dcd371dd6a2b91b0264e9c95f6dad3d56520e498e6dae5fef9de54319b56c9
+ size 4362142904
model-00040-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:71d773ade2594bd860f813b1f9e9ee9ebf23c0799622cf442720da466cadc399
+ size 4966188928
model-00041-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:692446421bd2399a65fe5a3de491a309d02123b7054b34ea48ca0d39208bde7e
+ size 4362142904
model-00042-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:acff5db3dbd2523fa33075fa6a832bf0524d59c535e4d2d94172321d5d2faae4
+ size 4362142904
model-00043-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:44687293209be69938c912920e387b079fbec59c14e26e46a37d2a0d26b379df
+ size 4966188928
model-00044-of-00061.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7bddb88ca0fab9fd7ef924f575a1c54a720619a668bc0885e07a212d8fd26942
+ size 4362142904