stableai-org committed
Commit e03cc02 · verified · 1 Parent(s): 270cf6c

Upload LimiX-2M


This is the LimiX-2M model from stableai-org.

Files changed (3):
  1. LimiX-2M.ckpt +3 -0
  2. README.md +292 -3
  3. config.json +6 -0
LimiX-2M.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:16f385d5201dc17cd6a8eb18becb0b4f8b3603838423b9f06062106acc965513
+ size 9558253
README.md CHANGED
@@ -1,3 +1,292 @@
- ---
- license: apache-2.0
- ---

<div align="center">
<img src="https://raw.githubusercontent.com/limix-ldm/LimiX/refs/heads/main/doc/LimiX-Logo.png" alt="LimiX logo" width="89%">
</div>

# News :boom:
- 2025-08-29: LimiX V1.0 released.
- 2025-11-10: LimiX-2M is officially released! Compared to LimiX-16M, this smaller variant offers significantly lower GPU memory usage and faster inference. Its retrieval mechanism has also been enhanced, further improving model performance while reducing both inference time and memory consumption.

# ➤ Overview
<div align="center">
<img src="https://github.com/limix-ldm/LimiX/raw/main/doc/LimiX_Summary.png" alt="LimiX summary" width="89%">
</div>
We introduce LimiX, the first installment of our LDM series. LimiX aims to push generality further: a single model that handles classification, regression, missing-value imputation, feature selection, sample selection, and causal inference under one training and inference recipe, advancing the shift from bespoke pipelines to unified, foundation-style tabular learning.

LimiX adopts a transformer architecture optimized for structured-data modeling and task generalization. The model first embeds the features X and targets Y from the prior knowledge base into token representations. Within the core modules, attention is applied across both the sample and feature dimensions to identify salient patterns in key samples and features. The resulting high-dimensional representations are then passed to regression and classification heads, enabling the model to support diverse predictive tasks.

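To make this concrete, the sketch below renders the same flow (tokenize X and Y, alternate attention over the feature and sample axes, then apply task heads) as a minimal PyTorch module. It is illustrative only and is not the LimiX implementation; every module name, dimension, and depth here is a placeholder.

```python
# Illustrative sketch only, NOT the LimiX source: it mimics the flow described above,
# embedding X and Y into tokens, attending across the feature and sample axes, and
# passing the result to task-specific heads.
import torch
import torch.nn as nn


class DualAxisBlock(nn.Module):
    """Attention over the feature axis, then over the sample axis."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.feature_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sample_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (n_samples, n_features, dim)
        h, _ = self.feature_attn(tokens, tokens, tokens)   # mix features within each sample
        h = h.transpose(0, 1)                              # (n_features, n_samples, dim)
        h, _ = self.sample_attn(h, h, h)                   # mix samples within each feature
        return h.transpose(0, 1)                           # back to (n_samples, n_features, dim)


class TinyTabularTransformer(nn.Module):
    """Toy stand-in for the described architecture: embeddings, dual-axis attention, heads."""

    def __init__(self, dim: int = 64, n_classes: int = 2, depth: int = 2):
        super().__init__()
        self.x_embed = nn.Linear(1, dim)   # one token per feature value
        self.y_embed = nn.Linear(1, dim)   # target embedding for the context rows
        self.blocks = nn.ModuleList(DualAxisBlock(dim) for _ in range(depth))
        self.cls_head = nn.Linear(dim, n_classes)
        self.reg_head = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor, y_context: torch.Tensor) -> torch.Tensor:
        # x: (n_samples, n_features), y_context: (n_samples,)
        tokens = self.x_embed(x.unsqueeze(-1)) + self.y_embed(y_context.view(-1, 1, 1))
        for block in self.blocks:
            tokens = block(tokens)
        pooled = tokens.mean(dim=1)        # pool over the feature axis
        return self.cls_head(pooled)       # or self.reg_head(pooled) for regression


# Shape check with random data.
model = TinyTabularTransformer()
logits = model(torch.randn(32, 10), torch.randint(0, 2, (32,)).float())
print(logits.shape)  # torch.Size([32, 2])
```
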
For details, please refer to the technical report: [LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence](https://arxiv.org/abs/2509.03505) or [LimiX_Technical_Report.pdf](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf).

# ➤ Superior Performance
The LimiX model achieves state-of-the-art (SOTA) performance across multiple tasks.

## ➩ Classification
<div align="center">
<img src="https://github.com/limix-ldm/LimiX/raw/main/doc/Classifier.png" alt="Classification" width="80%">
</div>

## ➩ Regression
<div align="center">
<img src="https://github.com/limix-ldm/LimiX/raw/main/doc/Regression.png" alt="Regression" width="60%">
</div>

## ➩ Missing Value Imputation
<div align="center">
<img src="https://github.com/limix-ldm/LimiX/raw/main/doc/MissingValueImputation.png" alt="Missing value imputation" width="80%">
</div>

# ➤ Tutorials
## ➩ Installation
### Option 1 (recommended): Use the Dockerfile
Download the [Dockerfile](https://github.com/limix-ldm/LimiX/blob/main/Dockerfile), then build the image:
```bash
docker build --network=host -t limix/infe:v1 --build-arg FROM_IMAGES=nvidia/cuda:12.2.0-base-ubuntu22.04 -f Dockerfile .
```

### Option 2: Build manually
Download the prebuilt flash_attn wheel:
```bash
wget -O flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.0.post2/flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
```
Install the Python dependencies (inside a Python 3.12 environment, e.g. 3.12.7):
```bash
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1
pip install flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
pip install scikit-learn einops huggingface-hub matplotlib networkx numpy pandas scipy tqdm typing_extensions xgboost kditransform hyperopt
```

### Download the source code
```bash
git clone https://github.com/limix-ldm/LimiX.git
cd LimiX
```

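As an optional sanity check (not part of the official instructions), the pinned versions and GPU visibility can be confirmed before running inference:

```python
# Optional environment check: verify the pinned packages and GPU visibility.
import torch
import flash_attn

print("torch:", torch.__version__)            # expected: 2.7.1
print("flash_attn:", flash_attn.__version__)  # expected: 2.8.0.post2
print("CUDA available:", torch.cuda.is_available())
```
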
# ➤ Inference
LimiX supports tasks such as classification, regression, and missing value imputation.
## ➩ Model download
| Model size | Download link | Tasks supported |
| --- | --- | --- |
| LimiX-16M | [LimiX-16M.ckpt](https://huggingface.co/stableai-org/LimiX-16M/tree/main) | ✅ classification ✅ regression ✅ missing value imputation |
| LimiX-2M | [LimiX-2M.ckpt](https://huggingface.co/stableai-org/LimiX-2M/tree/main) | ✅ classification ✅ regression ✅ missing value imputation |

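Either checkpoint can also be fetched programmatically. The call below mirrors the one used in the examples later on this card, pointed at this repository's LimiX-2M weights:

```python
# Download the LimiX-2M checkpoint from the Hugging Face Hub.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="stableai-org/LimiX-2M",
    filename="LimiX-2M.ckpt",
    local_dir="./cache",
)
print(ckpt_path)
```
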
## ➩ Interface description

### Model Creation
```python
class LimiXPredictor:
    def __init__(self,
                 device: torch.device,
                 model_path: str,
                 mix_precision: bool = True,
                 inference_config: list | str,
                 categorical_features_indices: List[int] | None = None,
                 outlier_remove_std: float = 12,
                 softmax_temperature: float = 0.9,
                 task_type: Literal['Classification', 'Regression'] = 'Classification',
                 mask_prediction: bool = False,
                 inference_with_DDP: bool = False,
                 seed: int = 0)
```
| Parameter | Data Type | Description |
|--------|----------|----------|
| device | torch.device | The hardware device on which the model is loaded |
| model_path | str | The path to the model checkpoint to load |
| mix_precision | bool | Whether to enable mixed-precision inference |
| inference_config | list/str | The configuration used for inference (a path to a configuration file, or a list) |
| categorical_features_indices | list | The indices of categorical columns in the tabular data |
| outlier_remove_std | float | The threshold for removing outliers, expressed as a multiple of the standard deviation |
| softmax_temperature | float | The temperature used to control the behavior of the softmax operator |
| task_type | str | The task type, either "Classification" or "Regression" |
| mask_prediction | bool | Whether to enable missing value imputation |
| inference_with_DDP | bool | Whether to enable DDP (DistributedDataParallel) during inference |
| seed | int | The seed controlling random states |
### Predict
```python
def predict(self, x_train: np.ndarray, y_train: np.ndarray, x_test: np.ndarray) -> np.ndarray:
```
| Parameter | Data Type | Description |
| ------- | ---------- | ----------------- |
| x_train | np.ndarray | The input features of the training set |
| y_train | np.ndarray | The target variable of the training set |
| x_test | np.ndarray | The input features of the test set |

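To make the interface concrete, here is a minimal call pattern built only from the signature above. The checkpoint and configuration paths are placeholders and the data is synthetic; depending on your setup you may also need the distributed environment variables shown in the full examples below.

```python
# Minimal call pattern for the interface documented above (placeholder paths, synthetic data).
import numpy as np
import torch

from inference.predictor import LimiXPredictor  # run from the LimiX repository root

rng = np.random.default_rng(0)
x_train = rng.normal(size=(200, 8))
y_train = (x_train[:, 0] > 0).astype(np.int64)   # synthetic binary target
x_test = rng.normal(size=(50, 8))

clf = LimiXPredictor(
    device=torch.device("cuda"),
    model_path="path/to/LimiX-2M.ckpt",                      # placeholder checkpoint path
    inference_config="config/cls_default_noretrieval.json",  # one of the files listed below
    categorical_features_indices=None,                       # e.g. [2, 5] for categorical columns
    task_type="Classification",
    seed=0,
)
proba = clf.predict(x_train, y_train, x_test)  # class probabilities for x_test
print(proba.shape)                             # (n_test, n_classes)
```
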
## ➩ Inference Configuration File Description
| Configuration File Name | Description | Difference |
| ------- | ---------- | ----- |
| cls_default_retrieval.json | Default **classification task** inference configuration file **with retrieval** | Better classification performance |
| cls_default_noretrieval.json | Default **classification task** inference configuration file **without retrieval** | Faster inference, lower memory requirements |
| reg_default_retrieval.json | Default **regression task** inference configuration file **with retrieval** | Better regression performance |
| reg_default_noretrieval.json | Default **regression task** inference configuration file **without retrieval** | Faster inference, lower memory requirements |
| reg_default_noretrieval_MVI.json | Default inference configuration file for the **missing value imputation task** | |

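All of these files live under the repository's `config/` directory, as the examples on this card show. If you want to select one programmatically, a small helper expression is enough (assuming you run from the repository root):

```python
# Build the path to one of the default inference configurations listed above.
task = "cls"           # "cls" for classification, "reg" for regression
use_retrieval = True   # False trades some accuracy for speed and lower memory use

config_path = f"config/{task}_default_{'retrieval' if use_retrieval else 'noretrieval'}.json"
print(config_path)     # e.g. config/cls_default_retrieval.json
```
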
## ➩ Ensemble Inference Based on Sample Retrieval

For a detailed technical introduction to ensemble inference based on sample retrieval, please refer to the [technical report](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf).

Because of its inference speed and memory requirements, ensemble inference based on sample retrieval currently supports only hardware with specifications higher than the NVIDIA RTX 4090 GPU.

### Classification Task

```bash
python inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
```

### Regression Task

```bash
python inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
```

### Customizing Data Preprocessing for Inference Tasks
#### First, Generate the Inference Configuration File

```python
generate_inference_config()
```

### Classification Task
#### Single GPU or CPU

```bash
python inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
```

#### Multi-GPU Distributed Inference

```bash
torchrun --nproc_per_node=8 inference_classifier.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data --inference_with_DDP
```

### Regression Task
#### Single GPU or CPU

```bash
python inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data
```

#### Multi-GPU Distributed Inference

```bash
torchrun --nproc_per_node=8 inference_regression.py --save_name your_save_name --inference_config_path path_to_retrieval_config --data_dir path_to_data --inference_with_DDP
```

### Retrieval Optimization Project
This project implements an optimized retrieval system; to achieve the best performance, Optuna is used to tune the retrieval hyperparameters.
#### Installation
Ensure you have the required dependency installed:
```bash
pip install optuna
```
#### Usage
To search for the best retrieval parameters for your dataset, refer to the code below (it assumes `RetrievalSearchHyperparameters` has been imported from the LimiX codebase and that the training/test splits are already prepared):
```python
searchInference = RetrievalSearchHyperparameters(
    dict(device_id=0, model_path=model_path), X_train, y_train, X_test, y_test,
)
config, result = searchInference.search(
    n_trials=10,
    metric="AUC",
    inference_config='config/cls_default_retrieval.json',
    task_type="cls",
)
```
This will launch an Optuna study to find the best combination of retrieval parameters for your specific dataset and use case.

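The search returns a `config` and a `result`. Presumably the returned configuration can then be handed back to `LimiXPredictor` via `inference_config`, which accepts a list or a string; this hand-off is an assumption on this card, so consult the repository's retrieval utilities if it does not hold.

```python
# Assumed hand-off: reuse the searched configuration for subsequent predictions.
clf = LimiXPredictor(device='cuda', model_path=model_path, inference_config=config)
prediction = clf.predict(X_train, y_train, X_test)
```
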
## ➩ Classification
```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from huggingface_hub import hf_hub_download
import numpy as np
import os, sys

# Environment variables for single-process (non-distributed) execution.
os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"

# Make the LimiX repository root importable when running from a subdirectory.
ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
if ROOT_DIR not in sys.path:
    sys.path.insert(0, ROOT_DIR)
from inference.predictor import LimiXPredictor

# Binary classification on the breast cancer dataset.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Download the checkpoint from the Hugging Face Hub.
model_file = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir="./cache")

clf = LimiXPredictor(device='cuda', model_path=model_file, inference_config='config/cls_default_retrieval.json')
prediction = clf.predict(X_train, y_train, X_test)

print("roc_auc_score:", roc_auc_score(y_test, prediction[:, 1]))
print("accuracy_score:", accuracy_score(y_test, np.argmax(prediction, axis=1)))
```
For additional examples, refer to [inference_classifier.py](./inference_classifier.py).

## ➩ Regression
```python
from functools import partial

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from huggingface_hub import hf_hub_download
try:
    from sklearn.metrics import root_mean_squared_error as mean_squared_error
except ImportError:
    # Older scikit-learn versions: fall back to mean_squared_error(squared=False).
    from sklearn.metrics import mean_squared_error
    mean_squared_error = partial(mean_squared_error, squared=False)
import os, sys

# Environment variables for single-process (non-distributed) execution.
os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"

# Make the LimiX repository root importable when running from a subdirectory.
ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
if ROOT_DIR not in sys.path:
    sys.path.insert(0, ROOT_DIR)
from inference.predictor import LimiXPredictor

# Regression on the California housing dataset.
house_data = fetch_california_housing()
X, y = house_data.data, house_data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Standardize the target using training-set statistics.
y_mean = y_train.mean()
y_std = y_train.std()
y_train_normalized = (y_train - y_mean) / y_std
y_test_normalized = (y_test - y_mean) / y_std

model_path = hf_hub_download(repo_id="stableai-org/LimiX-16M", filename="LimiX-16M.ckpt", local_dir="./cache")

model = LimiXPredictor(device='cuda', model_path=model_path, inference_config='config/reg_default_retrieval.json')
y_pred = model.predict(X_train, y_train_normalized, X_test)

# Compute RMSE and R² on the normalized targets.
if hasattr(y_pred, "cpu"):  # handle a torch.Tensor return value
    y_pred = y_pred.cpu().numpy()
rmse = mean_squared_error(y_test_normalized, y_pred)
r2 = r2_score(y_test_normalized, y_pred)

print(f'RMSE: {rmse}')
print(f'R2: {r2}')
```
For additional examples, refer to [inference_regression.py](./inference_regression.py).

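Because the metrics above are computed on the normalized target, predictions can also be mapped back to the original units before reporting. A short follow-on to the example above, reusing its variables:

```python
# Map normalized predictions back to the original target scale.
y_pred_original = y_pred * y_std + y_mean
rmse_original = mean_squared_error(y_test, y_pred_original)
print(f'RMSE (original units): {rmse_original}')
```
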
## ➩ Missing value imputation
For the demo file, see [examples/demo_missing_value_imputation.py](examples/demo_missing_value_imputation.py).

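The demo file is the authoritative reference. As a rough orientation, the sketch below shows how imputation would plausibly be invoked given the constructor options documented above (`mask_prediction=True` together with `reg_default_noretrieval_MVI.json`); the exact shape and semantics of the returned values are assumptions here, so follow the demo for the real workflow.

```python
# Hedged sketch of missing value imputation; see the official demo for the actual usage.
import numpy as np

from inference.predictor import LimiXPredictor  # run from the LimiX repository root

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 6))
y_train = X_train[:, 0]                            # placeholder target column
X_test = rng.normal(size=(60, 6))
X_test[rng.random(X_test.shape) < 0.1] = np.nan    # punch random holes to impute

imputer = LimiXPredictor(
    device='cuda',
    model_path='path/to/LimiX-2M.ckpt',            # placeholder checkpoint path
    inference_config='config/reg_default_noretrieval_MVI.json',
    task_type='Regression',
    mask_prediction=True,                          # documented switch for missing value imputation
)
imputed = imputer.predict(X_train, y_train, X_test)  # assumed: imputed values for X_test
```
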
# ➤ Links
- LimiX paper: [LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence](https://arxiv.org/abs/2509.03505)
- LimiX Technical Report: [LimiX_Technical_Report.pdf](https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf)
- Detailed instructions for using LimiX: [the official LimiX documentation](https://www.limix.ai/doc/)
- Balance Comprehensive Challenging Omni-domain Classification Benchmark: [bcco_cls](https://huggingface.co/datasets/stableai-org/bcco_cls)
- Balance Comprehensive Challenging Omni-domain Regression Benchmark: [bcco_reg](https://huggingface.co/datasets/stableai-org/bcco_reg)

# ➤ License
The code in this repository is open-sourced under the [Apache-2.0](LICENSE.txt) license, while use of the LimiX model weights is subject to the Model License. The LimiX weights are fully available for academic research and may be used commercially upon obtaining proper authorization.

# ➤ Citation
```bibtex
@article{LimiX,
  title={LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence},
  author={LimiXTeam},
  journal={arXiv preprint arXiv:2509.03505},
  year={2025}
}
```

config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "model_name": "LimiX-2M",
+   "author": "stableai-org",
+   "description": "This is the LimiX-2M model from stableai-org.",
+   "license": "apache-2.0"
+ }