switch to gradio
Files changed:

- README.md +42 -18
- app.py +111 -120
- requirements.txt +1 -3
README.md
CHANGED
@@ -3,36 +3,36 @@ title: Dinercall Intent Demo
 emoji: 🏆
 colorFrom: red
 colorTo: gray
-sdk:
-sdk_version:
+sdk: gradio
+sdk_version: 5+
 app_file: app.py
 pinned: false
 license: apache-2.0
 short_description: restaurant reservation intent detector
 ---

-
 # 🍽️ 餐廳訂位意圖識別系統 (Mandarin Reservation Intent Classifier)

-🎙️
+🎙️ 本系統讓使用者可以透過**語音錄音**或**文字輸入**,自動判斷是否具有「訂位意圖」,是語音助理或自動客服前端的理想元件之一。這個版本基於 **Gradio** 建構,具有簡單直觀的分頁式輸入模式切換(「麥克風」或「文字」)。

 ---

 ## 🔍 功能介紹

 - 🧠 **語音辨識**:使用 fine-tuned Whisper 模型 [`Jingmiao/whisper-small-zh_tw`](https://huggingface.co/Jingmiao/whisper-small-zh_tw) 將語音轉為繁體中文文字。
-- 🤖 **意圖分類**:使用微調的 ALBERT
+- 🤖 **意圖分類**:使用微調的 ALBERT 中文模型或 Qwen 模型判斷輸入是否包含訂位意圖。
 - 📱 **支援手機與桌機**:介面具備良好響應性,適用於各類瀏覽器與行動裝置。
-- 🔊
+- 🔊 **雙重輸入模式**:使用者可在「麥克風」和「文字」兩種模式間切換,以提供語音或手動輸入。

 ---

 ## 🚀 使用方式

-1.
-
-
-
+1. 選擇輸入模式:
+   - 「麥克風」:點擊錄音按鈕開始錄音,錄製完成後自動轉文字並判斷意圖。
+   - 「文字」:直接在文字框中輸入語句,再點擊「執行辨識」按鈕。
+2. 從下拉選單選擇使用的模型(例如 ALBERT-tiny、ALBERT-base 或 Qwen)。
+3. 按下「執行辨識」後,系統將顯示轉換後的文字、意圖判斷結果,並以 TTS(語音合成)的方式回應。

 ---

@@ -44,29 +44,53 @@ short_description: restaurant reservation intent detector
 ### 中文意圖分類模型:
 - [`Luigi/albert-tiny-chinese-dinercall-intent`](https://huggingface.co/Luigi/albert-tiny-chinese-dinercall-intent)
 - [`Luigi/albert-base-chinese-dinercall-intent`](https://huggingface.co/Luigi/albert-base-chinese-dinercall-intent)
+- 或使用 [`Qwen/Qwen2.5-0.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)(透過 Outlines 整合)

 ---

 ## 📦 依賴環境

 ```txt
-
-
+llama-cpp-python
+gradio>=5.0.0
+transformers
 torch
-
+soundfile
+outlines
+numpy>=1.24,<2.0
+kokoro
+huggingface-hub
+jieba
+docopt
+ordered-set
+cn2an
+pypinyin
+sentencepiece
 ```

 ---

 ## 🛠️ 開發者備註

--
--
+- 本應用現改為 Gradio App,適合在 Hugging Face Spaces 上部署,並支援 Gradio V5 的最新功能。
+- 採用雙重輸入模式(麥克風與文字)讓使用者能靈活切換輸入方式。
 - 若需延伸本系統至其他語言或多輪對話,歡迎 fork 本專案進行改造!

 ---

-© 2024 by [Your Name or Team]. Made with ❤️ using Hugging Face +
-
+© 2024 by [Your Name or Team]. Made with ❤️ using Hugging Face + Gradio.
+---
+
+### Explanation
+
+- **README.md:**
+  - The SDK and app_file information has been updated to indicate a Gradio-based application.
+  - The features have been revised to highlight the dual-input mode (麥克風 vs. 文字).
+  - The installation instructions and usage steps now reflect the updated Gradio interface.
+
+- **requirements.txt:**
+  - The dependencies for Streamlit and streamlit-mic-recorder have been removed.
+  - Gradio (version 5.0.0 or higher) has been added as the primary UI framework.
+  - The remaining dependencies support the models and other processing components.

-
+Feel free to customize further as needed for your deployment or additional features!
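Aside: the two ALBERT checkpoints listed in the new README can also be queried outside the Space. Below is a minimal standalone sketch that mirrors the `load_transformers_model` / `predict_intent` path in the new app.py (shown in the next diff) and relies only on the checkpoint's own `id2label` mapping; the example sentence is made up.

```python
# Standalone check of one of the intent classifiers used by the Space.
import torch
from torch.nn.functional import softmax
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "Luigi/albert-tiny-chinese-dinercall-intent"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "你好,我想訂明天晚上七點,四位"  # hypothetical caller utterance
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    probs = softmax(model(**inputs).logits, dim=-1)[0]

# Report the score for every label the checkpoint defines.
for idx, score in enumerate(probs.tolist()):
    print(f"{model.config.id2label[idx]}: {score:.2%}")
```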
app.py
CHANGED
@@ -1,56 +1,58 @@
-import
-from
-from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
-import outlines # Use outlines with transformers integration
-from torch.nn.functional import softmax
+import gradio as gr
+from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
 import torch
+from torch.nn.functional import softmax
+import numpy as np
+import soundfile as sf
+import io
 import tempfile
+import outlines  # For Qwen integration via outlines
+import kokoro  # For TTS synthesis
 import re
 from pathlib import Path
-import
-import
-import numpy as np
-import soundfile as sf
-from kokoro import KPipeline
+from functools import lru_cache
+import warnings

-#
+# Suppress FutureWarnings (e.g. about using `inputs` vs. `input_features`)
+warnings.filterwarnings("ignore", category=FutureWarning)

-#
+# ------------------- Model Identifiers -------------------
 whisper_model_id = "Jingmiao/whisper-small-zh_tw"
-
-# Qwen LLM model identifier (using outlines transformers integration)
 qwen_model_id = "Qwen/Qwen2.5-0.5B-Instruct"

-# Available models for text classification (intent detection) via Transformers
 available_models = {
     "ALBERT-tiny (Chinese)": "Luigi/albert-tiny-chinese-dinercall-intent",
     "ALBERT-base (Chinese)": "Luigi/albert-base-chinese-dinercall-intent",
-    "Qwen (via Transformers - outlines)": "qwen"
+    "Qwen (via Transformers - outlines)": "qwen"
 }

-#
-
-@st.cache_resource
+# ------------------- Caching and Loading Functions -------------------
+@lru_cache(maxsize=1)
 def load_whisper_pipeline():
-
-
-
-
-
+    pipe = pipeline("automatic-speech-recognition", model=whisper_model_id)
+    # Move model to GPU if available for faster inference
+    if torch.cuda.is_available():
+        pipe.model.to("cuda")
+    return pipe
+
+@lru_cache(maxsize=2)
+def load_transformers_model(model_id: str):
     tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
     model = AutoModelForSequenceClassification.from_pretrained(model_id)
+    if torch.cuda.is_available():
+        model.to("cuda")
     return tokenizer, model

-@
+@lru_cache(maxsize=1)
 def load_qwen_model():
-    # Load Qwen using the outlines transformers integration.
-    # Note that the prompt-based interaction requires proper chat tokens.
     return outlines.models.transformers(qwen_model_id)

-
+@lru_cache(maxsize=1)
+def get_tts_pipeline():
+    return kokoro.KPipeline(lang_code="z")

-
-
+# ------------------- Inference Functions -------------------
+def predict_with_qwen(text: str):
     model = load_qwen_model()
     prompt = f"""
 <|im_start|>system

@@ -76,10 +78,11 @@ Classify the following message: "{text}"
     else:
         return f"未知回應: {prediction}"

-def predict_intent(text, model_id):
-    # Use ALBERT-based Transformers for intent detection.
+def predict_intent(text: str, model_id: str):
     tokenizer, model = load_transformers_model(model_id)
     inputs = tokenizer(text, return_tensors="pt")
+    if torch.cuda.is_available():
+        inputs = {k: v.to("cuda") for k, v in inputs.items()}
     with torch.no_grad():
         logits = model(**inputs).logits
     probs = softmax(logits, dim=-1)

@@ -89,20 +92,7 @@ def predict_intent(text, model_id):
     else:
         return f"❌ 無訂位意圖 (Not Reservation intent)(訂位信心度 Confidence: {confidence:.2%})"

-def
-    text = Path(path).read_text(encoding="utf-8")
-    text = re.sub(r"(?s)^---.*?---", "", text).strip()
-    text = re.sub(r"^# .*?\n+", "", text)
-    return text
-
-# ------------------ TTS Integration via kokoro ------------------
-
-@st.cache_resource
-def get_tts_pipeline():
-    # Instantiate and cache the KPipeline for TTS; setting language code to Chinese.
-    return KPipeline(lang_code="z")
-
-def get_tts_message(intent_result):
+def get_tts_message(intent_result: str):
     if intent_result and "訂位意圖" in intent_result and "無" not in intent_result:
         return "稍後您將會從簡訊收到訂位連結"
     elif intent_result:

@@ -110,7 +100,7 @@ def get_tts_message(intent_result):
     else:
         return "未能判斷意圖"

-def play_tts_message(message, voice='af_heart'):
+def tts_audio_output(message: str, voice: str = 'af_heart'):
     pipeline_tts = get_tts_pipeline()
     generator = pipeline_tts(message, voice=voice)
     audio_chunks = []

@@ -118,78 +108,79 @@ def play_tts_message(message, voice='af_heart'):
         audio_chunks.append(audio)
     if audio_chunks:
         audio_concat = np.concatenate(audio_chunks)
-    else:
-
-
-
-
-        return
-
-
-
-
-
-
-
-
-
-
-
-#
-
-
-
-
-
-
-
-
-
-
-# Process audio recording input
-if audio:
-    st.success("錄音完成!")
-    st.audio(audio["bytes"], format="audio/wav")
-    with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmpfile:
-        tmpfile.write(audio["bytes"])
-        tmpfile_path = tmpfile.name
-
-    with st.spinner("🧠 Whisper 處理語音中..."):
-        try:
-            whisper_pipe = load_whisper_pipeline()
-            result = whisper_pipe(tmpfile_path)
-            transcription = result["text"]
-            st.success(f"📝 語音轉文字:{transcription}")
-        except Exception as e:
-            st.error(f"❌ Whisper 錯誤:{str(e)}")
-            transcription = ""
-
-    if transcription:
-        with st.spinner("預測中..."):
-            if model_id == "qwen":
-                result_text = predict_with_qwen(transcription)
-            else:
-                result_text = predict_intent(transcription, model_id)
-        st.success(result_text)
-        tts_text = get_tts_message(result_text)
-        st.info(f"TTS 語音內容: {tts_text}")
-        audio_message = play_tts_message(tts_text)
-        play_audio_auto(audio_message, mime="audio/wav")
-
-# Process text input for intent classification
-text_input = st.text_input("✍️ 或手動輸入語句")
-if text_input and st.button("🚀 送出"):
-    with st.spinner("預測中..."):
-        if model_id == "qwen":
-            result_text = predict_with_qwen(text_input)
-        else:
-
-
-
-
-
-
-
-with
-
-
+        # Return as tuple (sample_rate, numpy_array) for gr.Audio (sample rate used: 24000 Hz)
+        return (24000, audio_concat)
+    else:
+        return None
+
+def transcribe_audio(audio_file):
+    whisper_pipe = load_whisper_pipeline()
+    # audio_file is the file path from gr.Audio (with type="filepath")
+    result = whisper_pipe(audio_file)
+    return result["text"]
+
+# ------------------- Main Processing Function -------------------
+def classify_intent(mode, audio_file, text_input, model_choice):
+    # Determine input based on explicit mode.
+    if mode == "Microphone" and audio_file is not None:
+        transcription = transcribe_audio(audio_file)
+    elif mode == "Text" and text_input:
+        transcription = text_input
+    else:
+        return "請提供語音或文字輸入", "", None
+
+    # Classify the transcribed or provided text.
+    if available_models[model_choice] == "qwen":
+        classification = predict_with_qwen(transcription)
+    else:
+        classification = predict_intent(transcription, available_models[model_choice])
+    # Generate TTS message and audio.
+    tts_msg = get_tts_message(classification)
+    tts_audio = tts_audio_output(tts_msg)
+    return transcription, classification, tts_audio
+
+# ------------------- Gradio Blocks Interface Setup -------------------
+with gr.Blocks() as demo:
+    gr.Markdown("## 🍽️ 餐廳訂位意圖識別")
+    gr.Markdown("錄音或輸入文字,自動判斷是否具有訂位意圖。")
+
+    with gr.Row():
+        # Input Mode Selector
+        mode = gr.Radio(choices=["Microphone", "Text"], label="選擇輸入模式", value="Microphone")
+
+    with gr.Row():
+        # Audio and Text inputs – only one will be visible based on mode selection.
+        audio_input = gr.Audio(sources=["microphone"], type="filepath", label="語音輸入 (點擊錄音)")
+        text_input = gr.Textbox(lines=2, placeholder="請輸入文字", label="文字輸入")
+
+    # Initially, only the microphone input is visible.
+    text_input.visible = False
+
+    # Change event for mode selection to toggle visibility.
+    def update_visibility(selected_mode):
+        if selected_mode == "Microphone":
+            return gr.update(visible=True), gr.update(visible=False)
+        else:
+            return gr.update(visible=False), gr.update(visible=True)
+    mode.change(fn=update_visibility, inputs=mode, outputs=[audio_input, text_input])
+
+    with gr.Row():
+        model_dropdown = gr.Dropdown(choices=list(available_models.keys()),
+                                     value="ALBERT-tiny (Chinese)", label="選擇模型")
+
+    with gr.Row():
+        classify_btn = gr.Button("執行辨識")
+
+    with gr.Row():
+        transcription_output = gr.Textbox(label="轉換文字")
+    with gr.Row():
+        classification_output = gr.Textbox(label="意圖判斷結果")
+    with gr.Row():
+        tts_output = gr.Audio(type="numpy", label="TTS 語音輸出")
+
+    # Button event triggers the classification. Gradio will show a spinner during processing.
+    classify_btn.click(fn=classify_intent,
+                       inputs=[mode, audio_input, text_input, model_dropdown],
+                       outputs=[transcription_output, classification_output, tts_output])
+
+demo.launch()
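Note that the hunks above skip the body of `predict_with_qwen` (the chat prompt and the generation call), so only the `outlines.models.transformers(qwen_model_id)` loader is visible. A rough sketch of how such a constrained classification step could look with the outlines API follows; the prompt wording and the two choice strings are illustrative assumptions, not the Space's actual code.

```python
# Hypothetical reconstruction of the elided Qwen classification step.
import outlines

qwen_model_id = "Qwen/Qwen2.5-0.5B-Instruct"
model = outlines.models.transformers(qwen_model_id)  # as in load_qwen_model() above

def classify_reservation(text: str) -> str:
    # ChatML-style prompt; the exact system instruction used by the Space is not shown in the diff.
    prompt = (
        "<|im_start|>system\n"
        "你是餐廳訂位意圖分類器,只回答「訂位」或「非訂位」。<|im_end|>\n"
        f"<|im_start|>user\n{text}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    # outlines.generate.choice constrains decoding to one of the listed strings.
    generator = outlines.generate.choice(model, ["訂位", "非訂位"])
    return generator(prompt)

print(classify_reservation("我想訂位,今晚八點,兩位"))
```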
requirements.txt
CHANGED
@@ -3,11 +3,9 @@
 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu

 llama-cpp-python
-
-streamlit-mic-recorder
+gradio>=5.0.0
 transformers
 torch
-faster-whisper
 soundfile
 outlines
 numpy>=1.24,<2.0
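To hear the TTS response locally after installing these requirements, the kokoro pipeline used by app.py can be exercised on its own. A small sketch, assuming kokoro's usual `(graphemes, phonemes, audio)` yield and the 24 kHz rate that app.py reports to `gr.Audio`:

```python
# Offline check of the Space's TTS path (kokoro + soundfile from the list above).
import numpy as np
import soundfile as sf
from kokoro import KPipeline

pipeline_tts = KPipeline(lang_code="z")  # Chinese, as in get_tts_pipeline()
chunks = []
for _, _, audio in pipeline_tts("稍後您將會從簡訊收到訂位連結", voice="af_heart"):
    chunks.append(audio)

if chunks:
    # app.py returns (24000, np.concatenate(chunks)) to gr.Audio; here we write a wav file instead.
    sf.write("tts_check.wav", np.concatenate(chunks), 24000)
```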