add deepseek and openrouter (#24)
* add deepseek and openrouter #22

* updating docs

* fix dependency version conflict

* Update requirements.txt

---------

Co-authored-by: David G <[email protected]>
fqrious and himynamesdave authored Feb 5, 2025
1 parent 3e7a1e2 commit 78c57a8
Showing 11 changed files with 112 additions and 22 deletions.
2 changes: 2 additions & 0 deletions .env.example
@@ -3,6 +3,8 @@ INPUT_TOKEN_LIMIT=
 OPENAI_API_KEY=
 ANTHROPIC_API_KEY=
 GOOGLE_API_KEY=
+OPENROUTER_API_KEY=
+DEEPSEEK_API_KEY=
 TEMPERATURE=
 ## CTIBUTLER
 CTIBUTLER_BASE_URL=
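The two keys added above are read from the process environment at runtime. As a rough sketch of how a pre-flight check for the chosen provider's key could look (this helper and the mapping are hypothetical, not part of the commit; note `GOOGLE_API_KEY` backs the gemini provider):

```python
import os

# Hypothetical mapping of provider name -> env var, mirroring .env.example.
PROVIDER_ENV = {
    "openrouter": "OPENROUTER_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}

def has_api_key(provider: str) -> bool:
    # True when the env var for this provider is set and non-blank.
    return bool(os.environ.get(PROVIDER_ENV[provider], "").strip())

os.environ["DEEPSEEK_API_KEY"] = "sk-example"
print(has_api_key("deepseek"))  # True
```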
15 changes: 12 additions & 3 deletions .env.markdown
@@ -10,12 +10,21 @@ However, if you just want to experiment, set the following values
 * (REQUIRED IF USING AI MODES) Ensure the input/output token count meets requirements and is supported by the model selected. Files containing more tokens than the specified limit will not be processed.
 * `TEMPERATURE`: `0.0`
   * The temperature value ranges from 0 to 2, with lower values indicating greater determinism and higher values indicating more randomness in responses.
+
+**A small note on selecting a provider**
+
+Below are the models you can use. We highly recommend using [OpenRouter](https://openrouter.ai/) (`OPENROUTER_API_KEY`), which gives you access to a wide range of models and providers. Functionally, there is no benefit to using one provider over another (the txt2detection logic is the same for all). We provide the option to use provider-supplied API keys (e.g. `OPENAI_API_KEY`) for those who cannot or do not want to use OpenRouter.
+
+* `OPENROUTER_API_KEY`=
+  * (REQUIRED IF USING MODELS PROVIDED BY OPENROUTER IN AI MODES) get it from: https://openrouter.ai/settings/keys
+* `DEEPSEEK_API_KEY`=
+  * (REQUIRED IF USING DEEPSEEK MODELS DIRECTLY IN AI MODES) get it from: https://platform.deepseek.com/api-key
 * `OPENAI_API_KEY`: YOUR_API_KEY
-  * (REQUIRED IF USING OPENAI MODELS IN AI MODES) get it from https://platform.openai.com/api-keys
+  * (REQUIRED IF USING OPENAI MODELS DIRECTLY IN AI MODES) get it from: https://platform.openai.com/api-keys
 * `ANTHROPIC_API_KEY`: YOUR_API_KEY
-  * (REQUIRED IF USING ANTHROPIC MODELS IN AI MODES) get it from https://console.anthropic.com/settings/keys
+  * (REQUIRED IF USING ANTHROPIC MODELS DIRECTLY IN AI MODES) get it from: https://console.anthropic.com/settings/keys
 * `GOOGLE_API_KEY`:
-  * (REQUIRED IF USING GOOGLE GEMINI MODELS IN AI MODES) get it from the Google Cloud Platform (making sure the Gemini API is enabled for the project)
+  * (REQUIRED IF USING GOOGLE GEMINI MODELS DIRECTLY IN AI MODES) get it from the Google Cloud Platform (making sure the Gemini API is enabled for the project)

 ## CTIBUTLER

24 changes: 19 additions & 5 deletions README.md
@@ -82,25 +82,39 @@ python3 txt2detection.py \
 * `--external_refs` (optional): txt2detection will automatically populate the `external_references` of the report object it creates for the input. You can use this value to add additional objects to `external_references`. Note, you can only add `source_name` and `external_id` values currently. Pass as `source_name=external_id`. e.g. `--external_refs txt2stix=demo1 source=id` would create the following objects under the `external_references` property: `{"source_name":"txt2stix","external_id":"demo1"},{"source_name":"source","external_id":"id"}`
 * `--detection_language_key` (required): the detection rule language you want the output to be in. You can find a list of detection language keys in `config/detection_languages.yaml`
 * `ai_provider` (required): defines the `provider:model` to be used. Select one option. Currently supports:
-  * Provider: `openai:`, models e.g.: `gpt-4o`, `gpt-4o-mini`, `gpt-4-turbo`, `gpt-4` ([More here](https://platform.openai.com/docs/models))
-  * Provider: `anthropic:`, models e.g.: `claude-3-5-sonnet-latest`, `claude-3-5-haiku-latest`, `claude-3-opus-latest` ([More here](https://docs.anthropic.com/en/docs/about-claude/models))
-  * Provider: `gemini:models/`, models: `gemini-1.5-pro-latest`, `gemini-1.5-flash-latest` ([More here](https://ai.google.dev/gemini-api/docs/models/gemini))
+  * Provider (env var required `OPENROUTER_API_KEY`): `openrouter:`, providers/models `openai/gpt-4o`, `deepseek/deepseek-chat` ([More here](https://openrouter.ai/models))
+  * Provider (env var required `OPENAI_API_KEY`): `openai:`, models e.g.: `gpt-4o`, `gpt-4o-mini`, `gpt-4-turbo`, `gpt-4` ([More here](https://platform.openai.com/docs/models))
+  * Provider (env var required `ANTHROPIC_API_KEY`): `anthropic:`, models e.g.: `claude-3-5-sonnet-latest`, `claude-3-5-haiku-latest`, `claude-3-opus-latest` ([More here](https://docs.anthropic.com/en/docs/about-claude/models))
+  * Provider (env var required `GOOGLE_API_KEY`): `gemini:models/`, models: `gemini-1.5-pro-latest`, `gemini-1.5-flash-latest` ([More here](https://ai.google.dev/gemini-api/docs/models/gemini))
+  * Provider (env var required `DEEPSEEK_API_KEY`): `deepseek:`, models `deepseek-chat` ([More here](https://api-docs.deepseek.com/quick_start/pricing))

 e.g.

 ```shell
 python3 txt2detection.py \
 --input_file tests/files/CVE-2024-56520.txt \
---name "lynx ransomware" \
+--name "CVE-2024-56520" \
 --tlp_level green \
 --labels label1,label2 \
 --external_refs txt2stix=demo1 source=id \
 --detection_language spl \
---ai_provider openai:gpt-4o \
+--ai_provider openrouter:openai/gpt-4o \
 --report_id a70c4ca8-77d5-4c6f-96fb-9726ec89d242 \
 --use_identity '{"type":"identity","spec_version":"2.1","id":"identity--8ef05850-cb0d-51f7-80be-50e4376dbe63","created_by_ref":"identity--9779a2db-f98c-5f4b-8d08-8ee04e02dbb5","created":"2020-01-01T00:00:00.000Z","modified":"2020-01-01T00:00:00.000Z","name":"siemrules","description":"https://github.com/muchdogesec/siemrules","identity_class":"system","sectors":["technology"],"contact_information":"https://www.dogesec.com/contact/","object_marking_refs":["marking-definition--94868c89-83c2-464b-929b-a1a8aa3c8487","marking-definition--97ba4e8b-04f6-57e8-8f6e-3a0f0a7dc0fb"]}'
 ```
+
+e.g.
+
+```shell
+python3 txt2detection.py \
+--input_file tests/files/CVE-2024-56520.txt \
+--name "CVE-2024-56520" \
+--tlp_level green \
+--detection_language sigma \
+--ai_provider openrouter:openai/gpt-4o \
+--report_id b02df393-995d-421e-b66c-721000e058d2
+```
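The `--external_refs` format described above (`source_name=external_id` pairs) maps onto STIX `external_references` objects. A minimal sketch of that mapping (hypothetical helper; the real CLI parsing is not shown in this diff):

```python
def parse_external_refs(pairs):
    # Each pair arrives as "source_name=external_id",
    # e.g. --external_refs txt2stix=demo1 source=id
    refs = []
    for pair in pairs:
        source_name, _, external_id = pair.partition("=")
        refs.append({"source_name": source_name, "external_id": external_id})
    return refs

print(parse_external_refs(["txt2stix=demo1", "source=id"]))
# [{'source_name': 'txt2stix', 'external_id': 'demo1'}, {'source_name': 'source', 'external_id': 'id'}]
```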

## Adding new detection languages

Adding a new detection language is fairly trivial. However, there is an implicit assumption that the model understands the detection rule structure. Results can therefore be mixed, so it is worth testing in detail.
11 changes: 11 additions & 0 deletions pyproject.toml
@@ -33,6 +33,17 @@ Issues = "https://github.com/muchdogesec/txt2detection/issues"
 [project.scripts]
 txt2detection = "txt2detection.__main__:main"

+[project.optional-dependencies]
+llms = [
+    "llama-index-core==0.12.7",
+    "llama-index-llms-anthropic==0.6.3",
+    "llama-index-llms-gemini==0.4.2",
+    "llama-index-llms-openai==0.3.11",
+    "llama-index-llms-openai-like==0.3.3",
+    "llama-index-llms-deepseek==0.1.1",
+    "llama-index-llms-openrouter==0.3.1",
+]
+

 [tool.hatch.build.targets.wheel.force-include]
 "config" = "txt2detection/config"
15 changes: 9 additions & 6 deletions requirements.txt
@@ -3,7 +3,7 @@ aiohappyeyeballs==2.4.3; python_version >= '3.8'
 aiohttp==3.11.0; python_version >= '3.9'
 aiosignal==1.3.1; python_version >= '3.7'
 annotated-types==0.7.0; python_version >= '3.8'
-anthropic[bedrock,vertex]==0.39.0; python_version >= '3.8'
+anthropic[bedrock,vertex]>=0.39.0; python_version >= '3.8'
 antlr4-python3-runtime==4.9.3
 anyio==4.6.2.post1; python_version >= '3.9'
 attrs==24.2.0; python_version >= '3.7'
@@ -38,18 +38,21 @@ idna==3.10; python_version >= '3.6'
 jiter==0.7.1; python_version >= '3.8'
 jmespath==1.0.1; python_version >= '3.7'
 joblib==1.4.2; python_version >= '3.8'
-llama-index-core==0.11.23; python_full_version >= '3.8.1' and python_version < '4.0'
-llama-index-llms-anthropic==0.4.1; python_full_version >= '3.8.1' and python_version < '4.0'
-llama-index-llms-gemini==0.3.7; python_version >= '3.9' and python_version < '4.0'
-llama-index-llms-openai==0.2.16; python_full_version >= '3.8.1' and python_version < '4.0'
+llama-index-core==0.12.7; python_full_version >= '3.8.1' and python_version < '4.0'
+llama-index-llms-anthropic==0.6.3; python_full_version >= '3.8.1' and python_version < '4.0'
+llama-index-llms-deepseek==0.1.1; python_version >= '3.9' and python_version < '4.0'
+llama-index-llms-gemini==0.4.2; python_full_version >= '3.8.1' and python_version < '4.0'
+llama-index-llms-openai==0.3.11; python_full_version >= '3.8.1' and python_version < '4.0'
+llama-index-llms-openai-like==0.3.3; python_full_version >= '3.8.1' and python_version < '4.0'
+llama-index-llms-openrouter==0.3.1; python_full_version >= '3.8.1' and python_version < '4.0'
 marshmallow==3.23.1; python_version >= '3.9'
 multidict==6.1.0; python_version >= '3.8'
 mypy-extensions==1.0.0; python_version >= '3.5'
 nest-asyncio==1.6.0; python_version >= '3.5'
 networkx==3.4.2; python_version >= '3.10'
 nltk==3.9.1; python_version >= '3.8'
 numpy==1.26.4; python_version >= '3.9'
-openai==1.54.4; python_version >= '3.8'
+openai==1.58.1; python_version >= '3.8'
 packaging==24.2; python_version >= '3.8'
 pillow==10.4.0; python_version >= '3.8'
 propcache==0.2.0; python_version >= '3.8'
1 change: 1 addition & 0 deletions tests/files/EC2-exfil.txt
@@ -0,0 +1 @@
+Detect a user attempting to exfiltrate an Amazon EC2 AMI Snapshot. This rule lets you monitor the ModifyImageAttribute CloudTrail API calls to detect when an Amazon EC2 AMI snapshot is made public or shared with an AWS account. This rule also inspects: @requestParameters.launchPermission.add.items.group array to determine if the string all is contained. This is the indicator which means the RDS snapshot is made public. @requestParameters.launchPermission.add.items.userId array to determine if the string * is contained. This is the indicator which means the RDS snapshot was shared with a new or unknown AWS account.
7 changes: 5 additions & 2 deletions txt2detection/ai_extractor/__init__.py
@@ -6,8 +6,11 @@

 from .base import BaseAIExtractor

-for path in ["openai", "anthropic", "gemini"]:
+class ModelError(Exception):
+    pass
+
+for path in ["openai", "anthropic", "gemini", "deepseek", "openrouter"]:
     try:
         __import__(__package__ + "." + path)
     except Exception as e:
-        logging.warning("%s not installed", path, exc_info=True)
+        logging.warning("%s not supported, please install missing modules", path, exc_info=True)
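The loop above imports each provider module defensively, so a missing optional dependency only produces a warning instead of breaking the whole package. The same pattern in isolation (the module names here are stand-ins, not the real provider modules):

```python
import importlib
import logging

available = []
# "json" stands in for an installed provider module; the second name
# stands in for an optional dependency that is not installed.
for path in ["json", "definitely_not_installed_xyz"]:
    try:
        importlib.import_module(path)
        available.append(path)
    except Exception:
        logging.warning("%s not supported, please install missing modules", path, exc_info=True)

print(available)  # ['json']
```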
19 changes: 19 additions & 0 deletions txt2detection/ai_extractor/deepseek.py
@@ -0,0 +1,19 @@
+import logging
+import os
+
+from .base import BaseAIExtractor
+from llama_index.llms.deepseek import DeepSeek
+
+class DeepseekExtractor(BaseAIExtractor, provider='deepseek'):
+    def __init__(self, **kwargs) -> None:
+        kwargs.setdefault('temperature', float(os.environ.get('TEMPERATURE', 0.0)))
+        kwargs.setdefault('model', 'deepseek-chat')
+        self.llm = DeepSeek(system_prompt=self.system_prompt, **kwargs)
+        super().__init__()
+
+    def count_tokens(self, text):
+        try:
+            return len(self.llm._tokenizer.encode(text))
+        except Exception as e:
+            logging.warning(e)
+            return super().count_tokens(text)
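`DeepseekExtractor` registers itself via the `provider='deepseek'` class keyword. `BaseAIExtractor` itself is not shown in this diff, but the mechanism presumably relies on `__init_subclass__`; a minimal sketch of that registration pattern (all names hypothetical except the `provider` keyword):

```python
ALL_AI_EXTRACTORS = {}

class BaseAIExtractor:
    # Subclasses declared with `provider=...` are recorded in the registry,
    # which is what lets callers look extractors up by provider name.
    def __init_subclass__(cls, /, provider=None, **kwargs):
        super().__init_subclass__(**kwargs)
        if provider is not None:
            ALL_AI_EXTRACTORS[provider] = cls

class DeepseekExtractor(BaseAIExtractor, provider="deepseek"):
    pass

print(sorted(ALL_AI_EXTRACTORS))  # ['deepseek']
```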
7 changes: 6 additions & 1 deletion txt2detection/ai_extractor/openai.py
@@ -1,4 +1,5 @@

+import logging
 import os
 from .base import BaseAIExtractor
 from llama_index.llms.openai import OpenAI
@@ -11,5 +12,9 @@ def __init__(self, **kwargs) -> None:
         super().__init__()

     def count_tokens(self, text):
-        return len(self.llm._tokenizer.encode(text))
+        try:
+            return len(self.llm._tokenizer.encode(text))
+        except Exception as e:
+            logging.warning(e)
+            return super().count_tokens(text)

20 changes: 20 additions & 0 deletions txt2detection/ai_extractor/openrouter.py
@@ -0,0 +1,20 @@
+
+import logging
+import os
+from .base import BaseAIExtractor
+from llama_index.llms.openrouter import OpenRouter
+
+
+class OpenRouterExtractor(BaseAIExtractor, provider="openrouter"):
+    def __init__(self, **kwargs) -> None:
+        kwargs.setdefault('temperature', float(os.environ.get('TEMPERATURE', 0.0)))
+        self.llm = OpenRouter(system_prompt=self.system_prompt, **kwargs)
+        super().__init__()
+
+    def count_tokens(self, text):
+        try:
+            return len(self.llm._tokenizer.encode(text))
+        except Exception as e:
+            logging.warning(e)
+            return super().count_tokens(text)
+
13 changes: 8 additions & 5 deletions txt2detection/utils.py
@@ -1,7 +1,7 @@
 from pathlib import Path
 from types import SimpleNamespace
 import yaml
-from .ai_extractor import ALL_AI_EXTRACTORS, BaseAIExtractor
+from .ai_extractor import ALL_AI_EXTRACTORS, BaseAIExtractor, ModelError
 from importlib import resources
 import txt2detection
 import logging
@@ -22,11 +22,14 @@ def parse_model(value: str):
     splits = value.split(':', 1)
     provider = splits[0]
     if provider not in ALL_AI_EXTRACTORS:
-        raise NotImplementedError(f"invalid AI provider in `{value}`, must be one of [{list(ALL_AI_EXTRACTORS)}]")
+        raise NotImplementedError(f"invalid AI provider in `{value}`, must be one of {list(ALL_AI_EXTRACTORS)}")
     provider = ALL_AI_EXTRACTORS[provider]
-    if len(splits) == 2:
-        return provider(model=splits[1])
-    return provider()
+    try:
+        if len(splits) == 2:
+            return provider(model=splits[1])
+        return provider()
+    except Exception as e:
+        raise ModelError(f"Unable to initialize model `{value}`") from e
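The `value.split(':', 1)` call in `parse_model` means only the first colon separates provider from model, so model names that themselves contain `/` or further path segments (e.g. `gemini:models/gemini-1.5-pro-latest`) survive intact. That parsing step in isolation (a sketch of the logic, not the full function):

```python
def split_provider(value: str):
    # Mirrors parse_model's parsing: provider key before the first ':',
    # optional model name (which may contain '/' etc.) after it.
    splits = value.split(":", 1)
    provider = splits[0]
    model = splits[1] if len(splits) == 2 else None
    return provider, model

print(split_provider("openrouter:openai/gpt-4o"))  # ('openrouter', 'openai/gpt-4o')
print(split_provider("gemini:models/gemini-1.5-pro-latest"))  # ('gemini', 'models/gemini-1.5-pro-latest')
print(split_provider("openai"))  # ('openai', None)
```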



