add deepseek and openrouter (#24)
* add deepseek and openrouter #22

* updating docs

* fix dependency version conflict

* Update requirements.txt

---------

Co-authored-by: David G <[email protected]>
fqrious and himynamesdave authored Feb 5, 2025
1 parent 3e7a1e2 commit 78c57a8
Showing 11 changed files with 112 additions and 22 deletions.
2 changes: 2 additions & 0 deletions .env.example
@@ -3,6 +3,8 @@ INPUT_TOKEN_LIMIT=
 OPENAI_API_KEY=
 ANTHROPIC_API_KEY=
 GOOGLE_API_KEY=
+OPENROUTER_API_KEY=
+DEEPSEEK_API_KEY=
 TEMPERATURE=
 ## CTIBUTLER
 CTIBUTLER_BASE_URL=
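The two keys added above are read from the process environment at runtime. As a rough sketch of how a pre-flight check for the chosen provider's key could look (this helper and the mapping are hypothetical, not part of the commit; note `GOOGLE_API_KEY` backs the gemini provider):

```python
import os

# Hypothetical mapping of provider name -> env var, mirroring .env.example.
PROVIDER_ENV = {
    "openrouter": "OPENROUTER_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}

def has_api_key(provider: str) -> bool:
    # True when the env var for this provider is set and non-blank.
    return bool(os.environ.get(PROVIDER_ENV[provider], "").strip())

os.environ["DEEPSEEK_API_KEY"] = "sk-example"
print(has_api_key("deepseek"))  # True
```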
15 changes: 12 additions & 3 deletions .env.markdown
@@ -10,12 +10,21 @@ However, if you just want to experiment, set the following values
 * (REQUIRED IF USING AI MODES) Ensure the input/output token count meets requirements and is supported by the model selected. Files containing more tokens than the specified limit will not be processed.
 * `TEMPERATURE`: `0.0`
   * The temperature value ranges from 0 to 2, with lower values indicating greater determinism and higher values indicating more randomness in responses.
+
+**A small note on selecting a provider**
+
+Below are the models you can use. We highly recommend using [OpenRouter](https://openrouter.ai/) (`OPENROUTER_API_KEY`), which gives you access to a wide range of models and providers. Functionally, there is no benefit to using one provider over another (the txt2detection logic is the same for all). We provide the option to use provider-supplied API keys (e.g. `OPENAI_API_KEY`) for those who cannot or do not want to use OpenRouter.
+
+* `OPENROUTER_API_KEY`=
+  * (REQUIRED IF USING MODELS PROVIDED BY OPENROUTER IN AI MODES) get it from: https://openrouter.ai/settings/keys
+* `DEEPSEEK_API_KEY`=
+  * (REQUIRED IF USING DEEPSEEK MODELS DIRECTLY IN AI MODES) get it from: https://platform.deepseek.com/api-key
 * `OPENAI_API_KEY`: YOUR_API_KEY
-  * (REQUIRED IF USING OPENAI MODELS IN AI MODES) get it from https://platform.openai.com/api-keys
+  * (REQUIRED IF USING OPENAI MODELS DIRECTLY IN AI MODES) get it from: https://platform.openai.com/api-keys
 * `ANTHROPIC_API_KEY`: YOUR_API_KEY
-  * (REQUIRED IF USING ANTHROPIC MODELS IN AI MODES) get it from https://console.anthropic.com/settings/keys
+  * (REQUIRED IF USING ANTHROPIC MODELS DIRECTLY IN AI MODES) get it from: https://console.anthropic.com/settings/keys
 * `GOOGLE_API_KEY`:
-  * (REQUIRED IF USING GOOGLE GEMINI MODELS IN AI MODES) get it from the Google Cloud Platform (making sure the Gemini API is enabled for the project)
+  * (REQUIRED IF USING GOOGLE GEMINI MODELS DIRECTLY IN AI MODES) get it from the Google Cloud Platform (making sure the Gemini API is enabled for the project)

 ## CTIBUTLER

24 changes: 19 additions & 5 deletions README.md
@@ -82,25 +82,39 @@ python3 txt2detection.py \
 * `--external_refs` (optional): txt2detection will automatically populate the `external_references` of the report object it creates for the input. You can use this value to add additional objects to `external_references`. Note, you can only add `source_name` and `external_id` values currently. Pass as `source_name=external_id`. e.g. `--external_refs txt2stix=demo1 source=id` would create the following objects under the `external_references` property: `{"source_name":"txt2stix","external_id":"demo1"},{"source_name":"source","external_id":"id"}`
 * `--detection_language_key` (required): the detection rule language you want the output to be in. You can find a list of detection language keys in `config/detection_languages.yaml`
 * `ai_provider` (required): defines the `provider:model` to be used. Select one option. Currently supports:
-  * Provider: `openai:`, models e.g.: `gpt-4o`, `gpt-4o-mini`, `gpt-4-turbo`, `gpt-4` ([More here](https://platform.openai.com/docs/models))
-  * Provider: `anthropic:`, models e.g.: `claude-3-5-sonnet-latest`, `claude-3-5-haiku-latest`, `claude-3-opus-latest` ([More here](https://docs.anthropic.com/en/docs/about-claude/models))
-  * Provider: `gemini:models/`, models: `gemini-1.5-pro-latest`, `gemini-1.5-flash-latest` ([More here](https://ai.google.dev/gemini-api/docs/models/gemini))
+  * Provider (env var required `OPENROUTER_API_KEY`): `openrouter:`, providers/models `openai/gpt-4o`, `deepseek/deepseek-chat` ([More here](https://openrouter.ai/models))
+  * Provider (env var required `OPENAI_API_KEY`): `openai:`, models e.g.: `gpt-4o`, `gpt-4o-mini`, `gpt-4-turbo`, `gpt-4` ([More here](https://platform.openai.com/docs/models))
+  * Provider (env var required `ANTHROPIC_API_KEY`): `anthropic:`, models e.g.: `claude-3-5-sonnet-latest`, `claude-3-5-haiku-latest`, `claude-3-opus-latest` ([More here](https://docs.anthropic.com/en/docs/about-claude/models))
+  * Provider (env var required `GOOGLE_API_KEY`): `gemini:models/`, models: `gemini-1.5-pro-latest`, `gemini-1.5-flash-latest` ([More here](https://ai.google.dev/gemini-api/docs/models/gemini))
+  * Provider (env var required `DEEPSEEK_API_KEY`): `deepseek:`, models `deepseek-chat` ([More here](https://api-docs.deepseek.com/quick_start/pricing))

 e.g.

 ```shell
 python3 txt2detection.py \
 --input_file tests/files/CVE-2024-56520.txt \
---name "lynx ransomware" \
+--name "CVE-2024-56520" \
 --tlp_level green \
 --labels label1,label2 \
 --external_refs txt2stix=demo1 source=id \
 --detection_language spl \
---ai_provider openai:gpt-4o \
+--ai_provider openrouter:openai/gpt-4o \
 --report_id a70c4ca8-77d5-4c6f-96fb-9726ec89d242 \
 --use_identity '{"type":"identity","spec_version":"2.1","id":"identity--8ef05850-cb0d-51f7-80be-50e4376dbe63","created_by_ref":"identity--9779a2db-f98c-5f4b-8d08-8ee04e02dbb5","created":"2020-01-01T00:00:00.000Z","modified":"2020-01-01T00:00:00.000Z","name":"siemrules","description":"https://github.com/muchdogesec/siemrules","identity_class":"system","sectors":["technology"],"contact_information":"https://www.dogesec.com/contact/","object_marking_refs":["marking-definition--94868c89-83c2-464b-929b-a1a8aa3c8487","marking-definition--97ba4e8b-04f6-57e8-8f6e-3a0f0a7dc0fb"]}'
 ```
+
+e.g.
+
+```shell
+python3 txt2detection.py \
+--input_file tests/files/CVE-2024-56520.txt \
+--name "CVE-2024-56520" \
+--tlp_level green \
+--detection_language sigma \
+--ai_provider openrouter:openai/gpt-4o \
+--report_id b02df393-995d-421e-b66c-721000e058d2
+```
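The `--external_refs` format described above (`source_name=external_id` pairs) maps onto STIX `external_references` objects. A minimal sketch of that mapping (hypothetical helper; the real CLI parsing is not shown in this diff):

```python
def parse_external_refs(pairs):
    # Each pair arrives as "source_name=external_id",
    # e.g. --external_refs txt2stix=demo1 source=id
    refs = []
    for pair in pairs:
        source_name, _, external_id = pair.partition("=")
        refs.append({"source_name": source_name, "external_id": external_id})
    return refs

print(parse_external_refs(["txt2stix=demo1", "source=id"]))
# [{'source_name': 'txt2stix', 'external_id': 'demo1'}, {'source_name': 'source', 'external_id': 'id'}]
```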

## Adding new detection languages

Adding a new detection language is fairly trivial. However, there is an implicit assumption that the model understands the detection rule structure. Results can therefore be mixed, so it is worth testing in detail.
11 changes: 11 additions & 0 deletions pyproject.toml
@@ -33,6 +33,17 @@ Issues = "https://github.com/muchdogesec/txt2detection/issues"
 [project.scripts]
 txt2detection = "txt2detection.__main__:main"

+[project.optional-dependencies]
+llms = [
+    "llama-index-core==0.12.7",
+    "llama-index-llms-anthropic==0.6.3",
+    "llama-index-llms-gemini==0.4.2",
+    "llama-index-llms-openai==0.3.11",
+    "llama-index-llms-openai-like==0.3.3",
+    "llama-index-llms-deepseek==0.1.1",
+    "llama-index-llms-openrouter==0.3.1",
+]
+

 [tool.hatch.build.targets.wheel.force-include]
 "config" = "txt2detection/config"
15 changes: 9 additions & 6 deletions requirements.txt
@@ -3,7 +3,7 @@ aiohappyeyeballs==2.4.3; python_version >= '3.8'
 aiohttp==3.11.0; python_version >= '3.9'
 aiosignal==1.3.1; python_version >= '3.7'
 annotated-types==0.7.0; python_version >= '3.8'
-anthropic[bedrock,vertex]==0.39.0; python_version >= '3.8'
+anthropic[bedrock,vertex]>=0.39.0; python_version >= '3.8'
 antlr4-python3-runtime==4.9.3
 anyio==4.6.2.post1; python_version >= '3.9'
 attrs==24.2.0; python_version >= '3.7'
@@ -38,18 +38,21 @@ idna==3.10; python_version >= '3.6'
 jiter==0.7.1; python_version >= '3.8'
 jmespath==1.0.1; python_version >= '3.7'
 joblib==1.4.2; python_version >= '3.8'
-llama-index-core==0.11.23; python_full_version >= '3.8.1' and python_version < '4.0'
-llama-index-llms-anthropic==0.4.1; python_full_version >= '3.8.1' and python_version < '4.0'
-llama-index-llms-gemini==0.3.7; python_version >= '3.9' and python_version < '4.0'
-llama-index-llms-openai==0.2.16; python_full_version >= '3.8.1' and python_version < '4.0'
+llama-index-core==0.12.7; python_full_version >= '3.8.1' and python_version < '4.0'
+llama-index-llms-anthropic==0.6.3; python_full_version >= '3.8.1' and python_version < '4.0'
+llama-index-llms-deepseek==0.1.1; python_version >= '3.9' and python_version < '4.0'
+llama-index-llms-gemini==0.4.2; python_full_version >= '3.8.1' and python_version < '4.0'
+llama-index-llms-openai==0.3.11; python_full_version >= '3.8.1' and python_version < '4.0'
+llama-index-llms-openai-like==0.3.3; python_full_version >= '3.8.1' and python_version < '4.0'
+llama-index-llms-openrouter==0.3.1; python_full_version >= '3.8.1' and python_version < '4.0'
 marshmallow==3.23.1; python_version >= '3.9'
 multidict==6.1.0; python_version >= '3.8'
 mypy-extensions==1.0.0; python_version >= '3.5'
 nest-asyncio==1.6.0; python_version >= '3.5'
 networkx==3.4.2; python_version >= '3.10'
 nltk==3.9.1; python_version >= '3.8'
 numpy==1.26.4; python_version >= '3.9'
-openai==1.54.4; python_version >= '3.8'
+openai==1.58.1; python_version >= '3.8'
 packaging==24.2; python_version >= '3.8'
 pillow==10.4.0; python_version >= '3.8'
 propcache==0.2.0; python_version >= '3.8'
1 change: 1 addition & 0 deletions tests/files/EC2-exfil.txt
@@ -0,0 +1 @@
+Detect a user attempting to exfiltrate an Amazon EC2 AMI Snapshot. This rule lets you monitor the ModifyImageAttribute CloudTrail API calls to detect when an Amazon EC2 AMI snapshot is made public or shared with an AWS account. This rule also inspects: @requestParameters.launchPermission.add.items.group array to determine if the string all is contained. This is the indicator which means the RDS snapshot is made public. @requestParameters.launchPermission.add.items.userId array to determine if the string * is contained. This is the indicator which means the RDS snapshot was shared with a new or unknown AWS account.
7 changes: 5 additions & 2 deletions txt2detection/ai_extractor/__init__.py
@@ -6,8 +6,11 @@

 from .base import BaseAIExtractor

-for path in ["openai", "anthropic", "gemini"]:
+class ModelError(Exception):
+    pass
+
+for path in ["openai", "anthropic", "gemini", "deepseek", "openrouter"]:
     try:
         __import__(__package__ + "." + path)
     except Exception as e:
-        logging.warning("%s not installed", path, exc_info=True)
+        logging.warning("%s not supported, please install missing modules", path, exc_info=True)
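The loop above imports each provider module defensively, so a missing optional dependency only produces a warning instead of breaking the whole package. The same pattern in isolation (the module names here are stand-ins, not the real provider modules):

```python
import importlib
import logging

available = []
# "json" stands in for an installed provider module; the second name
# stands in for an optional dependency that is not installed.
for path in ["json", "definitely_not_installed_xyz"]:
    try:
        importlib.import_module(path)
        available.append(path)
    except Exception:
        logging.warning("%s not supported, please install missing modules", path, exc_info=True)

print(available)  # ['json']
```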
19 changes: 19 additions & 0 deletions txt2detection/ai_extractor/deepseek.py
@@ -0,0 +1,19 @@
+import logging
+import os
+
+from .base import BaseAIExtractor
+from llama_index.llms.deepseek import DeepSeek
+
+class DeepseekExtractor(BaseAIExtractor, provider='deepseek'):
+    def __init__(self, **kwargs) -> None:
+        kwargs.setdefault('temperature', float(os.environ.get('TEMPERATURE', 0.0)))
+        kwargs.setdefault('model', 'deepseek-chat')
+        self.llm = DeepSeek(system_prompt=self.system_prompt, **kwargs)
+        super().__init__()
+
+    def count_tokens(self, text):
+        try:
+            return len(self.llm._tokenizer.encode(text))
+        except Exception as e:
+            logging.warning(e)
+            return super().count_tokens(text)
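`DeepseekExtractor` registers itself via the `provider='deepseek'` class keyword. `BaseAIExtractor` itself is not shown in this diff, but the mechanism presumably relies on `__init_subclass__`; a minimal sketch of that registration pattern (all names hypothetical except the `provider` keyword):

```python
ALL_AI_EXTRACTORS = {}

class BaseAIExtractor:
    # Subclasses declared with `provider=...` are recorded in the registry,
    # which is what lets callers look extractors up by provider name.
    def __init_subclass__(cls, /, provider=None, **kwargs):
        super().__init_subclass__(**kwargs)
        if provider is not None:
            ALL_AI_EXTRACTORS[provider] = cls

class DeepseekExtractor(BaseAIExtractor, provider="deepseek"):
    pass

print(sorted(ALL_AI_EXTRACTORS))  # ['deepseek']
```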
7 changes: 6 additions & 1 deletion txt2detection/ai_extractor/openai.py
@@ -1,4 +1,5 @@

+import logging
 import os
 from .base import BaseAIExtractor
 from llama_index.llms.openai import OpenAI
@@ -11,5 +12,9 @@ def __init__(self, **kwargs) -> None:
         super().__init__()

     def count_tokens(self, text):
-        return len(self.llm._tokenizer.encode(text))
+        try:
+            return len(self.llm._tokenizer.encode(text))
+        except Exception as e:
+            logging.warning(e)
+            return super().count_tokens(text)

20 changes: 20 additions & 0 deletions txt2detection/ai_extractor/openrouter.py
@@ -0,0 +1,20 @@
+
+import logging
+import os
+from .base import BaseAIExtractor
+from llama_index.llms.openrouter import OpenRouter
+
+
+class OpenRouterExtractor(BaseAIExtractor, provider="openrouter"):
+    def __init__(self, **kwargs) -> None:
+        kwargs.setdefault('temperature', float(os.environ.get('TEMPERATURE', 0.0)))
+        self.llm = OpenRouter(system_prompt=self.system_prompt, **kwargs)
+        super().__init__()
+
+    def count_tokens(self, text):
+        try:
+            return len(self.llm._tokenizer.encode(text))
+        except Exception as e:
+            logging.warning(e)
+            return super().count_tokens(text)
+
13 changes: 8 additions & 5 deletions txt2detection/utils.py
@@ -1,7 +1,7 @@
 from pathlib import Path
 from types import SimpleNamespace
 import yaml
-from .ai_extractor import ALL_AI_EXTRACTORS, BaseAIExtractor
+from .ai_extractor import ALL_AI_EXTRACTORS, BaseAIExtractor, ModelError
 from importlib import resources
 import txt2detection
 import logging
@@ -22,11 +22,14 @@ def parse_model(value: str):
     splits = value.split(':', 1)
     provider = splits[0]
     if provider not in ALL_AI_EXTRACTORS:
-        raise NotImplementedError(f"invalid AI provider in `{value}`, must be one of [{list(ALL_AI_EXTRACTORS)}]")
+        raise NotImplementedError(f"invalid AI provider in `{value}`, must be one of {list(ALL_AI_EXTRACTORS)}")
     provider = ALL_AI_EXTRACTORS[provider]
-    if len(splits) == 2:
-        return provider(model=splits[1])
-    return provider()
+    try:
+        if len(splits) == 2:
+            return provider(model=splits[1])
+        return provider()
+    except Exception as e:
+        raise ModelError(f"Unable to initialize model `{value}`") from e
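The `value.split(':', 1)` call in `parse_model` means only the first colon separates provider from model, so model names that themselves contain `/` or further path segments (e.g. `gemini:models/gemini-1.5-pro-latest`) survive intact. That parsing step in isolation (a sketch of the logic, not the full function):

```python
def split_provider(value: str):
    # Mirrors parse_model's parsing: provider key before the first ':',
    # optional model name (which may contain '/' etc.) after it.
    splits = value.split(":", 1)
    provider = splits[0]
    model = splits[1] if len(splits) == 2 else None
    return provider, model

print(split_provider("openrouter:openai/gpt-4o"))  # ('openrouter', 'openai/gpt-4o')
print(split_provider("gemini:models/gemini-1.5-pro-latest"))  # ('gemini', 'models/gemini-1.5-pro-latest')
print(split_provider("openai"))  # ('openai', None)
```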



