-
Notifications
You must be signed in to change notification settings - Fork 1
USEC-3 [FIX] Support url safety checking for adapters #200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Summary by CodeRabbit
WalkthroughAdds a URLValidator module (DNS + private-range blocking with ENV whitelist), extends Adapter base with get_configured_urls() and _validate_urls(), updates many adapters to expose configured URLs and optionally validate them during init, adds URL checks in x2text helper, adds tests, and bumps Milvus schema default and SDK version. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Client
participant Adapter as Adapter Subclass
participant Base as Adapter Base
participant Validator as URLValidator
Note over Adapter: __init__(config, validate_urls=False)
Client->>Adapter: instantiate(config, validate_urls=True)
Adapter->>Adapter: set self.config
Adapter->>Base: _validate_urls()
Base->>Adapter: get_configured_urls()
loop For each non-empty URL
Base->>Validator: validate_url(url)
alt Invalid URL
Validator-->>Base: (False, error_message)
Base->>Base: log error with adapter name
Base-->>Client: raise AdapterError
else Valid URL
Validator-->>Base: (True, "")
end
end
Adapter-->>Client: instance ready
Note right of Validator: DNS → IP check → private-range block → optional ENV whitelist allow
sequenceDiagram
autonumber
actor Client
participant Adapter as Adapter Subclass
participant Base as Adapter Base
participant Validator as URLValidator
Client->>Adapter: test_connection()
Adapter->>Base: _validate_urls() (before network ops)
Base->>Validator: validate_url(url)
alt Any invalid
Validator-->>Base: (False, msg)
Base-->>Client: raise AdapterError
else All valid
Adapter->>Adapter: proceed with adapter-specific checks
Adapter-->>Client: connection result
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Pre-merge checks and finishing touches❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/unstract/sdk/adapters/llm/any_scale/src/anyscale.py (1)
56-73
: Avoid passing "None" strings for optional params.str(None) becomes "None" and may break the client or logs; pass None instead for optional fields.
Apply this minimal fix:
llm: LLM = Anyscale( model=str(self.config.get(Constants.MODEL)), api_key=str(self.config.get(Constants.API_KEY)), - api_base=str(self.config.get(Constants.API_BASE)), - additional_kwargs=self.config.get(Constants.ADDITIONAL_KWARGS), + api_base=self.config.get(Constants.API_BASE) or None, + additional_kwargs=self.config.get(Constants.ADDITIONAL_KWARGS) or None, max_retries=int( self.config.get(Constants.MAX_RETRIES, LLMKeys.DEFAULT_MAX_RETRIES) ),
🧹 Nitpick comments (28)
src/unstract/sdk/adapters/vectordb/milvus/src/static/json_schema.json (1)
19-21
: Default points to a blocked endpoint; add UX hint in description.With SSRF protections,
http://localhost:19530
will be blocked unless whitelisted. Consider clarifying this in the field description to reduce setup friction.- "description": "Provide the URI of the Milvus server. Example: `https://<instance-id>.api.gcp-us-west1.zillizcloud.com`" + "description": "Provide the URI of the Milvus server. Example: `https://<instance-id>.api.gcp-us-west1.zillizcloud.com`. Note: localhost/private IPs are blocked by default for security. For local development, ask your admin to allowlist the endpoint via ALLOWED_ADAPTER_PRIVATE_ENDPOINTS."src/unstract/sdk/adapters/url_validator.py (5)
93-103
: Precompute blocked networks and annotate as ClassVars.Avoid reparsing CIDRs on every call and address Ruff RUF012.
- BLOCKED_PRIVATE_RANGES = [ + BLOCKED_PRIVATE_RANGES: ClassVar[list[str]] = [ @@ ] + + # Precompute networks once + BLOCKED_PRIVATE_NETWORKS: ClassVar[tuple[ipaddress._BaseNetwork, ...]] = tuple( + ipaddress.ip_network(cidr) for cidr in BLOCKED_PRIVATE_RANGES + ) @@ def _is_private_ip(cls, ip: ipaddress.IPv4Address | ipaddress.IPv6Address) -> bool: """Check if IP address is in private ranges.""" - for range_str in cls.BLOCKED_PRIVATE_RANGES: - try: - network = ipaddress.ip_network(range_str) - if ip in network: - return True - except ValueError: - continue + for network in cls.BLOCKED_PRIVATE_NETWORKS: + if ip in network: + return True return False
88-91
: Tighten exception handling and use logging.exception.Meet TRY400/BLE001/RUF010 hints; keep fail-closed behavior.
- except Exception as e: - logger.error(f"URL validation error for '{url}': {str(e)}") - return False, f"{str(e)}" + except socket.gaierror as e: + logger.exception("URL validation DNS error for '%s': %s", url, e) + return False, f"{e!s}" + except Exception as e: # pragma: no cover + logger.exception("URL validation error for '%s': %s", url, e) + return False, f"{e!s}"
120-139
: Whitelist parsing: avoid blanket Exception and improve messages.Narrow the except and keep logs actionable.
- for item in config.split(","): + for item in config.split(","): @@ - try: - entry = cls._parse_whitelist_entry(item) - if entry: - entries.append(entry) - except Exception as e: - logger.warning(f"Invalid whitelist entry '{item}': {str(e)}") + entry = cls._parse_whitelist_entry(item) + if entry: + entries.append(entry)
140-172
: IPv6 allowlist entries: document bracketless format for ports.
rsplit(":", 1)
supportsfe80::1:8080
but[fe80::1]:8080
will be rejected. Consider documenting accepted format or adding support for brackets.Would you like me to add bracket support in
_parse_whitelist_entry
?
29-42
: Consider extending the blocked ranges set.Optional: add CGNAT and benchmarking ranges often blocked for SSRF.
"169.254.0.0/16", # Link-local + "100.64.0.0/10", # Carrier-Grade NAT + "198.18.0.0/15", # Benchmarkingtests/test_url_validator.py (3)
156-163
: Prefer socket.gaierror to exercise the intended error path.This ensures the DNS failure branch (not the generic Exception) is tested.
- mock_gethostbyname.side_effect = Exception("DNS resolution failed") + import socket as _socket + mock_gethostbyname.side_effect = _socket.gaierror("DNS resolution failed")
164-177
: Add tests for non-HTTP schemes and default-port allowlisting.Covers scheme enforcement and implicit 80/443 behavior.
+ @patch("socket.gethostbyname", return_value="93.184.216.34") + def test_non_http_schemes_blocked(self, _): + is_valid, err = URLValidator.validate_url("ftp://example.com/resource") + self.assertFalse(is_valid) + self.assertIn("Only http and https are allowed", err) + + @patch("socket.gethostbyname", return_value="10.0.0.5") + def test_default_port_allowlist_match(self, _): + os.environ[URLValidator.ENV_VAR] = "10.0.0.5:443" + is_valid, _ = URLValidator.validate_url("https://10.0.0.5") # no explicit port + self.assertTrue(is_valid)
95-111
: Consider adding IPv6 coverage.Validate IPv6 resolution and allowlisting once the validator supports
getaddrinfo
.Would you like me to add IPv6 tests using
@patch("socket.getaddrinfo")
?src/unstract/sdk/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py (1)
48-52
: Minor: trim and guard configured URL.Strip whitespace and ensure empty strings don’t pass through.
def get_configured_urls(self) -> list[str]: """Return all URLs this adapter will connect to.""" - url = self.config.get("url") - return [url] if url else [] + url = self.config.get("url") + if isinstance(url, str): + url = url.strip() + return [url] if url else []src/unstract/sdk/adapters/llm/open_ai/src/open_ai.py (1)
57-61
: Use the constant and trim the value.Avoid string literals for config keys; trim to prevent accidental whitespace issues.
def get_configured_urls(self) -> list[str]: """Return all URLs this adapter will connect to.""" - api_base = self.config.get("api_base") - return [api_base] if api_base else [] + api_base = self.config.get(Constants.API_BASE) + if isinstance(api_base, str): + api_base = api_base.strip() + return [api_base] if api_base else []src/unstract/sdk/adapters/x2text/unstructured_community/src/unstructured_community.py (1)
39-43
: Minor: trim configured URL.Handles trailing spaces from env/config.
def get_configured_urls(self) -> list[str]: """Return all URLs this adapter will connect to.""" - url = self.config.get("url") - return [url] if url else [] + url = self.config.get("url") + if isinstance(url, str): + url = url.strip() + return [url] if url else []src/unstract/sdk/adapters/llm/ollama/src/ollama.py (1)
56-60
: Minor: trim configured URL.Avoids surprises from whitespace in BASE_URL.
def get_configured_urls(self) -> list[str]: """Return all URLs this adapter will connect to.""" - base_url = self.config.get(Constants.BASE_URL) - return [base_url] if base_url else [] + base_url = self.config.get(Constants.BASE_URL) + if isinstance(base_url, str): + base_url = base_url.strip() + return [base_url] if base_url else []src/unstract/sdk/adapters/vectordb/weaviate/src/weaviate.py (1)
53-57
: Minor: trim configured URL.Guard against whitespace-only values.
def get_configured_urls(self) -> list[str]: """Return all URLs this adapter will connect to.""" - url = self._config.get(Constants.URL) - return [url] if url else [] + url = self._config.get(Constants.URL) + if isinstance(url, str): + url = url.strip() + return [url] if url else []src/unstract/sdk/adapters/x2text/helper.py (1)
105-111
: Tighten error surfacing; align with TRY003.Log the validator’s message once and raise
AdapterError
with that message directly. This reduces noisy prefixes and satisfies Ruff TRY003.- if unstructured_url: - is_valid, error_message = URLValidator.validate_url(unstructured_url) - if not is_valid: - logger.error(f"Unstructured URL validation failed: {error_message}") - raise AdapterError(f"URL validation failed: {error_message}") + if unstructured_url: + is_valid, error_message = URLValidator.validate_url(unstructured_url) + if not is_valid: + logger.error(f"URL validation failed: {error_message}") + raise AdapterError(error_message)src/unstract/sdk/adapters/vectordb/qdrant/src/qdrant.py (1)
52-56
: Minor: trim configured URL.Consistent with other adapters.
def get_configured_urls(self) -> list[str]: """Return all URLs this adapter will connect to.""" - url = self._config.get(Constants.URL) - return [url] if url else [] + url = self._config.get(Constants.URL) + if isinstance(url, str): + url = url.strip() + return [url] if url else []src/unstract/sdk/adapters/llm/any_scale/src/anyscale.py (1)
51-55
: Validate implicit defaults too (if the SDK provides one).If the Anyscale client uses a built-in api_base when not configured, include that same default here so validation can’t be bypassed by omission.
Would you confirm whether Anyscale has an implicit api_base? If yes, mirror it:
def get_configured_urls(self) -> list[str]: - api_base = self.config.get(Constants.API_BASE) - return [api_base] if api_base else [] + api_base = self.config.get(Constants.API_BASE) + # If the client falls back internally, include that fallback here as well. + return [api_base] if api_base else [Constants.API_BASE_DEFAULT] # if you introduce such a constantsrc/unstract/sdk/adapters/embedding/azure_open_ai/src/azure_open_ai.py (1)
54-58
: Use the defined constant for the key.Minor consistency/readability tweak.
- endpoint = self.config.get("azure_endpoint") + endpoint = self.config.get(Constants.AZURE_ENDPOINT)src/unstract/sdk/adapters/embedding/open_ai/src/open_ai.py (2)
53-57
: Also validate the default base URL when api_base is not set.Prevents skipping validation via omission.
- api_base = self.config.get("api_base") + api_base = self.config.get(Constants.API_BASE_KEY, Constants.API_BASE_VALUE) return [api_base] if api_base else []
58-63
: Align httpx timeout with the adapter timeout setting.Currently the httpx timeout is hardcoded to 10s while the adapter timeout uses config. Aligning avoids surprises.
- httpx_timeout = httpx.Timeout(10.0, connect=60.0) + httpx_timeout = httpx.Timeout(timeout, connect=60.0)src/unstract/sdk/adapters/x2text/unstructured_enterprise/src/unstructured_enterprise.py (2)
14-17
: Consider validating in init for consistency with other adapters.Process/test helpers may validate, but failing fast in init keeps behavior uniform across adapters.
class UnstructuredEnterprise(X2TextAdapter): def __init__(self, settings: dict[str, Any]): super().__init__("UnstructuredIOEnterprise") self.config = settings + # Validate URLs BEFORE any network operations + self._validate_urls()
54-56
: Run centralized validation in test_connection too.Keeps the contract consistent with other adapters’ test_connection paths.
def test_connection(self) -> bool: - result: bool = UnstructuredHelper.test_server_connection(self.config) + super().test_connection() + result: bool = UnstructuredHelper.test_server_connection(self.config) return resultsrc/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py (1)
208-210
: Validate before probe: good; add request timeouts for reliability.Super call ensures validation; add explicit timeouts to outbound requests to avoid hangs.
You can set a single default timeout (e.g., 30s) on both GET and POST:
# In _make_request(), add a timeout to requests response = requests.get(url=llm_whisperer_svc_url, headers=headers, params=params, timeout=30) # ... response = requests.post(url=llm_whisperer_svc_url, headers=headers, params=params, data=data, timeout=30)src/unstract/sdk/adapters/embedding/ollama/src/ollama.py (1)
47-51
: Use the constant for the key.Keeps things DRY and reduces typo risk.
- base_url = self.config.get("base_url") + base_url = self.config.get(Constants.BASE_URL)src/unstract/sdk/adapters/llm/azure_open_ai/src/azure_open_ai.py (2)
57-61
: Unify key usage and fix constant name typo (backward-compatible).
- Use the Constants key instead of a hardcoded string.
- The constant is misspelled (AZURE_ENDPONT). Add a correctly named alias and switch call sites.
Apply within the selected ranges:
- endpoint = self.config.get("azure_endpoint") + endpoint = self.config.get(Constants.AZURE_ENDPOINT)- "azure_endpoint": str(self.config.get(Constants.AZURE_ENDPONT)), + "azure_endpoint": str(self.config.get(Constants.AZURE_ENDPOINT)),Add this alias near the Constants block (outside the selected range):
# Prefer this going forward AZURE_ENDPOINT = "azure_endpoint" # Back-compat alias; mark for removal in a future major/minor bump AZURE_ENDPONT = AZURE_ENDPOINTAlso applies to: 76-76
84-84
: Gatereasoning_effort
to O-series models only.Avoid passing
reasoning_*
kwargs unless the selected model supports them.- if enable_reasoning: - llm_kwargs["reasoning_effort"] = self.config.get(Constants.REASONING_EFFORT) + if enable_reasoning and model in O1_MODELS: + llm_kwargs["reasoning_effort"] = self.config.get(Constants.REASONING_EFFORT)src/unstract/sdk/adapters/base.py (2)
48-59
: Clear contract for adapter URLs.Solid default and docstring. Consider reminding implementers to include all outbound URLs (including health/test endpoints) to ensure full coverage.
60-76
: Tighten exception message (ruff TRY003) and keep context in logs.Ruff flags the constructed message. Since you already log adapter/url context, raise with the validator’s message directly.
- raise AdapterError(f"URL validation failed: {error_message}") + raise AdapterError(error_message)Optional: add structured logging context.
- logger.error( - f"URL validation failed for adapter '{adapter_name}': {error_message}" - ) + logger.error( + "URL validation failed", + extra={"adapter": adapter_name, "url": url, "reason": error_message}, + )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge Base: Disabled due to Reviews > Disable Knowledge Base setting
📒 Files selected for processing (21)
src/unstract/sdk/adapters/base.py
(3 hunks)src/unstract/sdk/adapters/embedding/azure_open_ai/src/azure_open_ai.py
(2 hunks)src/unstract/sdk/adapters/embedding/ollama/src/ollama.py
(2 hunks)src/unstract/sdk/adapters/embedding/open_ai/src/open_ai.py
(2 hunks)src/unstract/sdk/adapters/llm/any_scale/src/anyscale.py
(2 hunks)src/unstract/sdk/adapters/llm/azure_open_ai/src/azure_open_ai.py
(3 hunks)src/unstract/sdk/adapters/llm/ollama/src/ollama.py
(3 hunks)src/unstract/sdk/adapters/llm/open_ai/src/open_ai.py
(2 hunks)src/unstract/sdk/adapters/url_validator.py
(1 hunks)src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py
(3 hunks)src/unstract/sdk/adapters/vectordb/milvus/src/static/json_schema.json
(1 hunks)src/unstract/sdk/adapters/vectordb/postgres/src/postgres.py
(3 hunks)src/unstract/sdk/adapters/vectordb/qdrant/src/qdrant.py
(3 hunks)src/unstract/sdk/adapters/vectordb/weaviate/src/weaviate.py
(2 hunks)src/unstract/sdk/adapters/x2text/helper.py
(2 hunks)src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py
(2 hunks)src/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py
(2 hunks)src/unstract/sdk/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py
(2 hunks)src/unstract/sdk/adapters/x2text/unstructured_community/src/unstructured_community.py
(2 hunks)src/unstract/sdk/adapters/x2text/unstructured_enterprise/src/unstructured_enterprise.py
(1 hunks)tests/test_url_validator.py
(1 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
src/unstract/sdk/adapters/x2text/helper.py
110-110: Avoid specifying long messages outside the exception class
(TRY003)
src/unstract/sdk/adapters/base.py
75-75: Avoid specifying long messages outside the exception class
(TRY003)
src/unstract/sdk/adapters/url_validator.py
29-42: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
66-66: Use explicit conversion flag
Replace with conversion flag
(RUF010)
86-86: Consider moving this statement to an else
block
(TRY300)
88-88: Do not catch blind exception: Exception
(BLE001)
89-89: Use logging.exception
instead of logging.error
Replace with exception
(TRY400)
89-89: Use explicit conversion flag
Replace with conversion flag
(RUF010)
90-90: Use explicit conversion flag
Replace with conversion flag
(RUF010)
136-136: Do not catch blind exception: Exception
(BLE001)
137-137: Use explicit conversion flag
Replace with conversion flag
(RUF010)
171-171: Use explicit conversion flag
Replace with conversion flag
(RUF010)
🔇 Additional comments (24)
src/unstract/sdk/adapters/url_validator.py (1)
111-117
: Allowlist semantics: portless entries vs explicit default ports.With the effective-port change, allowlisting
10.0.0.1:443
will now matchhttps://10.0.0.1
. Confirm this aligns with your intended policy.src/unstract/sdk/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py (1)
27-29
: Good: pre-flight URL validation added.Validating before any network calls reduces SSRF risk and fail-fast behavior is correct.
src/unstract/sdk/adapters/llm/open_ai/src/open_ai.py (1)
32-34
: Good: early URL validation.This aligns with the repo-wide safety-net.
src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py (1)
22-24
: Good: pre-flight URL validation.Timing is correct — before any network interaction.
src/unstract/sdk/adapters/x2text/unstructured_community/src/unstructured_community.py (1)
18-20
: Good: validate URLs before any IO.Consistent with the new security posture.
src/unstract/sdk/adapters/llm/ollama/src/ollama.py (2)
31-33
: Good: early URL validation.Prevents accidental localhost/private hits unless whitelisted.
87-89
: Good: call base test_connection first.Ensures shared validation runs before live calls.
src/unstract/sdk/adapters/vectordb/weaviate/src/weaviate.py (1)
29-31
: Reordersuper().__init__()
before calling_validate_urls()
in WeaviateAdapter
_validate_urls()
may reference base-initialized fields (e.g.self.name
) that aren’t set untilsuper().__init__()
runs. Move the validation call aftersuper().__init__()
or explicitly assign any needed attributes beforehand.src/unstract/sdk/adapters/x2text/helper.py (1)
8-8
: Import looks good.Used below in URL pre-validation.
src/unstract/sdk/adapters/vectordb/qdrant/src/qdrant.py (2)
83-85
: Good: base validation in test_connection.Ensures policy enforcement before hitting the service.
28-30
: Ignore init-order concern for _validate_urls
validate_urls only calls get_configured_urls (which uses the already-set self._config) and falls back to the class name when self.name is unset, so it doesn’t depend on Adapter.init; no reordering is required.Likely an incorrect or invalid review comment.
src/unstract/sdk/adapters/llm/any_scale/src/anyscale.py (1)
26-28
: Pre-flight URL validation: good placement.Validating in init before any network work matches the SSRF goals.
src/unstract/sdk/adapters/embedding/azure_open_ai/src/azure_open_ai.py (1)
29-31
: Pre-flight URL validation: LGTM.Done before constructing httpx client; aligns with the protection goal.
src/unstract/sdk/adapters/embedding/open_ai/src/open_ai.py (1)
28-30
: Pre-flight URL validation: LGTM.Happens before creating the http client.
src/unstract/sdk/adapters/x2text/unstructured_enterprise/src/unstructured_enterprise.py (1)
36-40
: Expose URL via get_configured_urls: LGTM.This enables centralized validation.
src/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py (1)
58-62
: Expose URL via get_configured_urls: LGTM.Matches the new validation flow.
src/unstract/sdk/adapters/embedding/ollama/src/ollama.py (1)
22-24
: Pre-flight URL validation: LGTM.Early, before any client creation.
src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py (2)
25-27
: Pre-flight URL validation: LGTM.Done before creating the client/store instance.
49-53
: Expose URI for validation: LGTM.Simple and correct.
src/unstract/sdk/adapters/vectordb/postgres/src/postgres.py (3)
32-34
: Pre-flight URL validation: LGTM.Placed before building connections.
102-104
: Validation on test_connection path: LGTM.Keeps runtime checks consistent.
118-131
: No action needed: URLValidator doesn’t enforce a schemes allowlist, so postgresql:// URLs pass through unconditionally.src/unstract/sdk/adapters/llm/azure_open_ai/src/azure_open_ai.py (1)
32-34
: Good call validating early.Validating in init before any network activity is the right place.
src/unstract/sdk/adapters/base.py (1)
37-41
: Schema path lookup LGTM.Using getattr with an explicit error for missing SCHEMA_PATH is clear and safe.
# Validate URLs first | ||
super().test_connection() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Verify validator behavior for local/default URIs and scheme support.
Default schema was changed to http://localhost:19530; this will be blocked unless whitelisted. Confirm tests/documentation cover setting ALLOWED_ADAPTER_PRIVATE_ENDPOINTS for local/dev and that only intended schemes are allowed for Milvus.
Run this to inspect the allowed/prohibited schemes/ranges in the validator:
🏁 Script executed:
#!/bin/bash
rg -nC2 -e 'ALLOWED|SCHEME|SCHEMES|PROTOCOL|BLOCKED|PRIVATE|localhost' src/unstract/sdk/adapters/url_validator.py
Length of output: 882
🏁 Script executed:
#!/bin/bash
rg -nC3 -e 'ALLOWED_SCHEMES' src/unstract/sdk/adapters/url_validator.py || true
rg -nC3 -e 'scheme' src/unstract/sdk/adapters/url_validator.py
rg -nC3 -e 'protocol' src/unstract/sdk/adapters/url_validator.py
rg -nC3 -e 'http' src/unstract/sdk/adapters/url_validator.py
Length of output: 0
🏁 Script executed:
#!/bin/bash
sed -n '1,200p' src/unstract/sdk/adapters/url_validator.py
Length of output: 5661
🏁 Script executed:
#!/bin/bash
sed -n '1,200p' src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py
Length of output: 3122
🏁 Script executed:
#!/bin/bash
sed -n '1,200p' src/unstract/sdk/adapters/vectordb/vectordb_adapter.py
Length of output: 2688
Document/test private whitelist and enforce URL schemes
- URLValidator blocks
127.0.0.0/8
(localhost) by default, sohttp://localhost:19530
will fail without whitelisting . - Add documentation and tests showing how to set
ALLOWED_ADAPTER_PRIVATE_ENDPOINTS
(e.g.127.0.0.1:19530
) for local/dev. - Extend URLValidator to restrict to only supported schemes (e.g.
http
,https
).
🤖 Prompt for AI Agents
In src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py around lines 80 to
82, the call to super().test_connection() uses the global URLValidator which
blocks private IPs by default and does not enforce allowed schemes; update the
URL validation logic to (1) allow explicitly whitelisted private endpoints by
reading ALLOWED_ADAPTER_PRIVATE_ENDPOINTS (e.g. "127.0.0.1:19530") and passing
them to the URLValidator/validator instance so local/dev URLs like
http://localhost:19530 succeed, (2) restrict accepted URL schemes to only
supported ones (http and https) by configuring the URLValidator to validate
scheme membership, and (3) add/update documentation and unit tests that
demonstrate how to set ALLOWED_ADAPTER_PRIVATE_ENDPOINTS and assert that
localhost/private IPs are accepted when whitelisted and rejected otherwise.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Gayathri <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py (1)
102-103
: Guard against empty results to avoid IndexError.LlamaParse may return an empty list; indexing [0] will crash.
- response_text = documents[0].text + if not documents: + raise AdapterError("LlamaParse returned no documents for the input.") + response_text = documents[0].text
♻️ Duplicate comments (1)
src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py (1)
43-49
: Fix SyntaxErrors in get_configured_urls (indent + stray diff marker).The method is indented with 5 spaces inside the class (should be 4), and there’s a literal “- url = …” line left from a diff. Both trigger SyntaxError (confirmed by tooling).
Apply:
- def get_configured_urls(self) -> list[str]: - """Return all URLs this adapter will connect to.""" -- url = self.config.get("url") - base_url = self.config.get(LlamaParseConfig.BASE_URL) - if isinstance(base_url, str): - base_url = base_url.strip() - return [base_url] if base_url else [] + def get_configured_urls(self) -> list[str]: + """Return all URLs this adapter will connect to.""" + base_url = self.config.get(LlamaParseConfig.BASE_URL) + if isinstance(base_url, str): + base_url = base_url.strip() + return [base_url] if base_url else []Also: switching to LlamaParseConfig.BASE_URL correctly aligns validation with the actual client URL.
🧹 Nitpick comments (3)
src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py (3)
91-100
: Tighten error logging and message quality.
- Missing f-string leaves “{exe}” unformatted.
- Prefer structured logging and include exc_info.
- Minor grammar: “llama-parse`s”.
- except ConnectError as connec_err: - logger.error(f"Invalid Base URL given. : {connec_err}") - raise AdapterError( - "Unable to connect to llama-parse`s service, " "please check the Base URL" - ) + except ConnectError as conn_err: + logger.error("Unable to connect to LlamaParse service: %s", conn_err, exc_info=True) + raise AdapterError("Unable to connect to LlamaParse service. Please check the Base URL.") - except Exception as exe: - logger.error( - "Seems like an invalid API Key or possible internal errors: {exe}" - ) - raise AdapterError(str(exe)) + except Exception as exc: + logger.error("Invalid API key or internal error: %s", exc, exc_info=True) + raise AdapterError(str(exc))
73-81
: Avoid mixing encoding with binary modes; keep I/O modes consistent.Passing encoding with "rb"/"wb" is inconsistent and may be ignored or error depending on the FileStorage backend.
- text_content = fs.read( - path=input_file_path_copy, mode="rb", encoding="utf-8" - ) + text_content = fs.read( + path=input_file_path_copy, mode="rb" + ) ... - fs.write( + fs.write( path=input_file_path, - data=text_content, - mode="wb", - encoding="utf-8", + data=text_content, + mode="wb", )
53-54
: Prefer None defaults over instantiated FileStorage in function signatures.Create FileStorage inside the function when fs is None to avoid constructing objects at import time and to ease testing.
Example:
-def _call_parser( - self, - input_file_path: str, - fs: FileStorage = FileStorage(provider=FileStorageProvider.LOCAL), -) -> str: +def _call_parser( + self, + input_file_path: str, + fs: FileStorage | None = None, +) -> str: + fs = fs or FileStorage(provider=FileStorageProvider.LOCAL) ... -def process( +def process( self, input_file_path: str, - output_file_path: str | None = None, - fs: FileStorage = FileStorage(provider=FileStorageProvider.LOCAL), + output_file_path: str | None = None, + fs: FileStorage | None = None, **kwargs: dict[Any, Any], ) -> TextExtractionResult: + fs = fs or FileStorage(provider=FileStorageProvider.LOCAL)Also applies to: 109-110
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge Base: Disabled due to Reviews > Disable Knowledge Base setting
📒 Files selected for processing (1)
src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py
(2 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py
43-43: SyntaxError: unindent does not match any outer indentation level
45-45: SyntaxError: Invalid assignment target
46-46: SyntaxError: Unexpected indentation
🔇 Additional comments (1)
src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py (1)
22-24
: Validating URLs early is the right call.Good placement; this ensures SSRF checks run before any network I/O.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/unstract/sdk/adapters/vectordb/postgres/src/postgres.py (1)
111-116
: SQL injection risk on identifier interpolation; use psycopg2.sql.Identifier and close cursorDirect f-string interpolation of schema/table can drop arbitrary objects if those values are user-controlled. Use safe identifier quoting and a context manager.
Apply:
- if self._client is not None: - self._client.cursor().execute( - f"DROP TABLE IF EXISTS " - f"{self._schema_name}.data_{self._collection_name} CASCADE" - ) - self._client.commit() + if self._client is not None: + from psycopg2 import sql + with self._client.cursor() as cur: + cur.execute( + sql.SQL("DROP TABLE IF EXISTS {}.{} CASCADE").format( + sql.Identifier(self._schema_name), + sql.Identifier(f"data_{self._collection_name}"), + ) + ) + self._client.commit()
♻️ Duplicate comments (2)
src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py (2)
25-27
: Good: Early URL validation to prevent pre-connect SSRF. Verify no base-init dependency.Calling
self._validate_urls()
before any client creation is the right sequence. Please confirm thatVectorDBAdapter._validate_urls
does not rely on state initialized inVectorDBAdapter.__init__
; if it does, move this call aftersuper().__init__
or decouple the dependency.
49-53
: Enforce http/https schemes at the adapter level (block non-web protocols).Per the PR objective, restrict Milvus URIs to
http
/https
. Add a light-weight scheme check here so invalid schemes are rejected even before the shared validator runs.Apply:
--- a/src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py +++ b/src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py @@ -from typing import Any +from typing import Any +from urllib.parse import urlparse @@ - def get_configured_urls(self) -> list[str]: - """Return all URLs this adapter will connect to.""" - uri = self._config.get(Constants.URI) - return [uri] if uri else [] + def get_configured_urls(self) -> list[str]: + """Return all URLs this adapter will connect to. Enforces http/https.""" + uri = str(self._config.get(Constants.URI, "")).strip() + if not uri: + return [] + scheme = urlparse(uri).scheme.lower() + if scheme not in {"http", "https"}: + raise AdapterError(f"Milvus URI scheme '{scheme}' is not allowed; use http or https.") + return [uri]
🧹 Nitpick comments (7)
src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py (1)
79-88
: Always clean up the test collection (use try/finally).If
test_vector_db_instance
raises, the collection isn’t dropped. Wrap intry/finally
and swallow best-effort cleanup errors.- def test_connection(self) -> bool: - - vector_db = self.get_vector_db_instance() - test_result: bool = VectorDBHelper.test_vector_db_instance( - vector_store=vector_db - ) - # Delete the collection that was created for testing - if self._client is not None: - self._client.drop_collection(self._collection_name) - return test_result + def test_connection(self) -> bool: + vector_db = self.get_vector_db_instance() + test_result: bool = False + try: + test_result = VectorDBHelper.test_vector_db_instance( + vector_store=vector_db + ) + return test_result + finally: + if self._client is not None: + try: + self._client.drop_collection(self._collection_name) + except Exception: + passsrc/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py (3)
71-73
: Avoid sending None-valued auth headerIf UNSTRACT_KEY is absent, requests may send a header with a "None" value. Only include it when present.
- return { - "accept": MimeType.JSON, - WhispererHeader.UNSTRACT_KEY: self.config.get( - WhispererConfig.UNSTRACT_KEY - ), - } + headers = {"accept": MimeType.JSON} + unstract_key = self.config.get(WhispererConfig.UNSTRACT_KEY) + if unstract_key: + headers[WhispererHeader.UNSTRACT_KEY] = unstract_key + return headers
407-409
: Metadata pretty-printing: optionalPretty JSON is nice for debugging but increases file size. Consider indent=None (compact) in production paths or making this configurable.
419-421
: Log exceptions with tracebackUse logging.exception inside except blocks to retain stack traces (aligns with Ruff TRY400).
- logger.error( - f"Error while writing metadata to {metadata_file_path}: {e}" - ) + logger.exception( + f"Error while writing metadata to {metadata_file_path}" + )src/unstract/sdk/adapters/vectordb/qdrant/src/qdrant.py (3)
52-56
: Normalize and type-guard URLs returned for validation.Trim whitespace and guard non-string config to avoid false positives/negatives during validation.
def get_configured_urls(self) -> list[str]: """Return all URLs this adapter will connect to.""" - url = self._config.get(Constants.URL) - return [url] if url else [] + url = self._config.get(Constants.URL) + if isinstance(url, str): + u = url.strip() + return [u] if u else [] + return []
83-83
: Stray blank line.Drop the extra newline to keep method bodies tight; avoids churn in diffs if the repo enforces formatters.
- def test_connection(self) -> bool: - + def test_connection(self) -> bool:
119-121
: Propagate HTTP status code when available.Carry through
e.status_code
if present to improve error handling upstream.- return VectorDBError( - message=str(e), actual_err=e, status_code=status_code - ) + return VectorDBError( + message=str(e), + actual_err=e, + status_code=( + status_code + if status_code is not None + else getattr(e, "status_code", None) + ), + )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge Base: Disabled due to Reviews > Disable Knowledge Base setting
📒 Files selected for processing (5)
src/unstract/sdk/adapters/llm/ollama/src/ollama.py
(4 hunks)src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py
(3 hunks)src/unstract/sdk/adapters/vectordb/postgres/src/postgres.py
(4 hunks)src/unstract/sdk/adapters/vectordb/qdrant/src/qdrant.py
(4 hunks)src/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py
(10 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/unstract/sdk/adapters/llm/ollama/src/ollama.py
🧰 Additional context used
🪛 Ruff (0.12.2)
src/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py
119-121: Avoid specifying long messages outside the exception class
(TRY003)
419-421: Use logging.exception
instead of logging.error
Replace with exception
(TRY400)
🔇 Additional comments (8)
src/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py (6)
141-144
: Param helper signature: LGTMClear typing and defaults; no issues spotted.
209-212
: Conditional highlight param: LGTMOnly adds the flag when enabled; avoids silently changing behavior.
261-263
: Status parsing default: LGTMSafe default to UNKNOWN; loop correctly handles non-terminal states.
275-277
: Max-poll error message: LGTMMessage is clear and includes attempt count.
360-363
: Pass-through to writer: LGTMArguments are explicit and typed; good separation of concerns.
403-406
: Exclude text from metadata: LGTMPrevents duplicating large payloads in metadata.
src/unstract/sdk/adapters/vectordb/postgres/src/postgres.py (2)
106-108
: LGTM — purely cosmetic reflowNo functional change here.
31-34
: OK to validate early — _validate_urls() is safe before super().initPostgres.init sets self._config (used by get_configured_urls()) before calling self._validate_urls(); Adapter._validate_urls() only calls get_configured_urls(), uses getattr(self, "name", self.class.name) as a fallback, and calls the module-level URLValidator, so it does not depend on VectorDBAdapter/Adapter initializer state.
encoded_password = quote_plus( | ||
str(self._config.get(Constants.PASSWORD)) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
❓ Verification inconclusive
Avoid "None" strings for password/port; add timeout; preserve exception context
quote_plus(str(self._config.get(Constants.PASSWORD)))
turns a missing password into the literal "None".str(self._config.get(Constants.PORT))
yields "None" when unset.- Add a connection timeout to prevent hangs.
- Use exception chaining for debuggability.
Apply:
@@
- encoded_password = quote_plus(
- str(self._config.get(Constants.PASSWORD))
- )
+ raw_password = self._config.get(Constants.PASSWORD)
+ encoded_password = (
+ quote_plus(raw_password) if raw_password is not None else None
+ )
+ port = self._config.get(Constants.PORT)
@@
- vector_db: BasePydanticVectorStore = PGVectorStore.from_params(
+ vector_db: BasePydanticVectorStore = PGVectorStore.from_params(
database=self._config.get(Constants.DATABASE),
schema_name=self._schema_name,
host=self._config.get(Constants.HOST),
password=encoded_password,
- port=str(self._config.get(Constants.PORT)),
+ port=port,
user=self._config.get(Constants.USER),
table_name=self._collection_name,
embed_dim=dimension,
)
@@
- self._client = psycopg2.connect(
+ self._client = psycopg2.connect(
database=self._config.get(Constants.DATABASE),
host=self._config.get(Constants.HOST),
user=self._config.get(Constants.USER),
- password=self._config.get(Constants.PASSWORD),
- port=str(self._config.get(Constants.PORT)),
+ password=raw_password,
+ port=port,
sslmode=ssl_mode,
+ connect_timeout=10,
)
@@
- except Exception as e:
- raise AdapterError(str(e))
+ except Exception as e:
+ raise AdapterError(str(e)) from e
If PGVectorStore.from_params
supports SSL or connect args, consider passing sslmode
(and connect_timeout
) there too; please confirm in its docs.
Also applies to: 77-85, 90-97
Avoid "None" strings for password/port; add timeout; preserve exception context
- Stop doing quote_plus(str(...)) — use raw_password = self._config.get(Constants.PASSWORD) and encoded_password = quote_plus(raw_password) if raw_password is not None else None; read port = self._config.get(Constants.PORT).
- Pass raw_password to psycopg2.connect(..., password=raw_password, port=port, sslmode=ssl_mode, connect_timeout=10).
- Preserve exception context: raise AdapterError(str(e)) from e.
- PGVectorStore.from_params accepts create_engine_kwargs — prefer passing SSL/connect options there (e.g. create_engine_kwargs={'connect_args': {'sslmode': ssl_mode, 'connect_timeout': 10}}). (docs.llamaindex.ai)
File: src/unstract/sdk/adapters/vectordb/postgres/src/postgres.py (around lines 61-97).
🤖 Prompt for AI Agents
In src/unstract/sdk/adapters/vectordb/postgres/src/postgres.py around lines 61
to 97, stop converting the password to a "None" string via quote_plus(str(...)):
fetch raw_password = self._config.get(Constants.PASSWORD) and set
encoded_password = quote_plus(raw_password) only if raw_password is not None;
likewise read port = self._config.get(Constants.PORT) instead of stringifying
it. Use raw_password (not the quoted string) when calling psycopg2.connect and
pass port, sslmode and a connect_timeout of 10 (e.g. psycopg2.connect(...,
password=raw_password, port=port, sslmode=ssl_mode, connect_timeout=10)).
Preserve exception context when rethrowing by using raise AdapterError(str(e))
from e. When constructing PGVectorStore.from_params, prefer supplying
create_engine_kwargs to pass SSL and connection options (e.g.
create_engine_kwargs={'connect_args': {'sslmode': ssl_mode, 'connect_timeout':
10}}) rather than embedding them into a connection string.
|
||
# Validate URLs BEFORE any network operations | ||
self._validate_urls() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Good: validate endpoints before client init; confirm base-call safety.
Calling _validate_urls()
before constructing QdrantClient
is the right SSRF posture. Please confirm _validate_urls
doesn’t rely on any state set in VectorDBAdapter.__init__
to avoid surprises from invoking it pre-super().__init__
.
Run to inspect the base implementation usage:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate and review the base adapter for _validate_urls dependencies
rg -nP --type=py '(class\s+VectorDBAdapter\b|def\s+_validate_urls\b|def\s+get_configured_urls\b)' -C3
# Inspect URL validator coverage (RFC1918, loopback, link-local, IPv6 ULA/LLA, redirects)
rg -nP --type=py '(class\s+URLValidator\b|RFC1918|127\.0\.0\.0/8|169\.254\.0\.0/16|::1|fc00::/7|fe80::/10|ALLOWED_ADAPTER_PRIVATE_ENDPOINTS|redirect|dns|resolve)' -C2
Length of output: 16772
Fix: don't call _validate_urls() before base init
base._validate_urls() calls self.get_configured_urls(); qdrant.get_configured_urls() uses self._config — calling _validate_urls() at src/unstract/sdk/adapters/vectordb/qdrant/src/qdrant.py (≈lines 27–30) before the adapter/base initializer runs can raise AttributeError. Move the _validate_urls() call after super().init/after self._config is set, or make get_configured_urls() defensive (e.g. getattr(self, "_config", {}) or return [] if unset). Locations: src/unstract/sdk/adapters/vectordb/qdrant/src/qdrant.py (call + get_configured_urls), src/unstract/sdk/adapters/base.py (_validate_urls).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/unstract/sdk/adapters/base.py (1)
48-49
: Good call making get_configured_urls abstract.This aligns with earlier feedback and enforces adapter implementations to declare their outbound endpoints.
🧹 Nitpick comments (5)
src/unstract/sdk/adapters/base.py (5)
59-59
: Avoid a concrete default in an abstract method.Returning [] from an abstract method can hide mistakes when a subclass accidentally calls super(); prefer an explicit “not implemented”.
- return [] + raise NotImplementedError
43-47
: Type-safety: abstract get_adapter_type() should not return a string.The return stub "" conflicts with the declared AdapterTypes. Use an explicit “not implemented” to keep type-checkers happy.
@staticmethod @abstractmethod def get_adapter_type() -> AdapterTypes: - return "" + raise NotImplementedError
37-41
: Specify file encoding when reading schema.Prevents platform-dependent decoding issues.
- with open(schema_path) as f: + with open(schema_path, encoding="utf-8") as f: return f.read()
80-84
: Tiny grammar nit.“a adapter” → “an adapter”.
- """Override to test connection for a adapter. + """Override to test connection for an adapter.
61-77
: Tighten validation loop: sanitize, de-duplicate, and raise with context (avoid logging before raising)Strip whitespace, drop empties, preserve order when de-duplicating, precompute adapter_name, and raise AdapterError that includes the adapter name and offending URL; use debug logging only if needed.
Verified _validate_urls() is invoked in init of concrete adapters (examples: src/unstract/sdk/adapters/x2text/unstructured_community/src/unstructured_community.py:19, src/unstract/sdk/adapters/embedding/open_ai/src/open_ai.py:29, src/unstract/sdk/adapters/vectordb/weaviate/src/weaviate.py:30).
def _validate_urls(self) -> None: - """Validate all configured URLs against security rules.""" - urls = self.get_configured_urls() - - for url in urls: - if not url: # Skip empty/None URLs - continue - - is_valid, error_message = URLValidator.validate_url(url) - if not is_valid: - # Use class name as fallback when self.name isn't set yet - adapter_name = getattr(self, "name", self.__class__.__name__) - logger.error( - f"URL validation failed for adapter '{adapter_name}': {error_message}" - ) - raise AdapterError(f"URL validation failed: {error_message}") + """Validate all configured URLs against security rules.""" + raw_urls = self.get_configured_urls() + # Normalize and drop empties + urls = [u.strip() for u in raw_urls if u is not None] + # De-duplicate while preserving order + urls = list(dict.fromkeys(u for u in urls if u)) + adapter_name = getattr(self, "name", self.__class__.__name__) + + for url in urls: + is_valid, error_message = URLValidator.validate_url(url) + if not is_valid: + # Consider adding a debug log if desired: + # logger.debug("Blocked URL for %s: %s (%s)", adapter_name, url, error_message) + raise AdapterError(f"{adapter_name}: blocked URL {url}: {error_message}") # noqa: TRY003Optional: if you prefer better UX, collect all invalid URLs and raise one aggregated error.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base
setting
📒 Files selected for processing (1)
src/unstract/sdk/adapters/base.py
(3 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
src/unstract/sdk/adapters/base.py
76-76: Avoid specifying long messages outside the exception class
(TRY003)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/unstract/sdk/adapters/vectordb/weaviate/src/weaviate.py (2)
71-74
: Guard against None → "None" leakage; validate required config before connecting.
str(self._config.get(...))
will pass "None" if the key is missing, leading to confusing failures and potentially bypassing your validator’s intent.- self._client = weaviate.connect_to_weaviate_cloud( - cluster_url=str(self._config.get(Constants.URL)), - auth_credentials=Auth.api_key(str(self._config.get(Constants.API_KEY))), - ) + url = self._config.get(Constants.URL) + api_key = self._config.get(Constants.API_KEY) + if not isinstance(url, str) or not url.strip(): + raise AdapterError("Missing or empty 'url' in settings.") + if not isinstance(api_key, str) or not api_key.strip(): + raise AdapterError("Missing or empty 'api_key' in settings.") + self._client = weaviate.connect_to_weaviate_cloud( + cluster_url=url.strip(), + auth_credentials=Auth.api_key(api_key.strip()), + )
86-90
: Use an explicit status-code check (HTTP 422) instead of parsing the exception message.Weaviate returns HTTP 422 for "class already exists" — replace the brittle
if "already exists" in e.message
with a status-code check, e.g.if getattr(e, "status_code", None) == 422: logger.info("Collection already exists; continuing.") else: raise
.
File: src/unstract/sdk/adapters/vectordb/weaviate/src/weaviate.py (lines 86-90)src/unstract/sdk/adapters/embedding/azure_open_ai/src/azure_open_ai.py (1)
69-74
: Avoid passing "None" as a string for azure_endpoint. Validate/clean before wiring.- azure_endpoint=str(self.config.get(Constants.AZURE_ENDPOINT)), + azure_endpoint=str(self.config.get(Constants.AZURE_ENDPOINT) or "").strip(),Optionally, fail fast if missing to provide a clearer error:
+ if not self.config.get(Constants.AZURE_ENDPOINT) or not str(self.config.get(Constants.AZURE_ENDPOINT)).strip(): + raise AdapterError("Missing or empty 'azure_endpoint' in settings.")
♻️ Duplicate comments (5)
src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py (1)
20-27
: Heads-up: localhost default requires allowlistWith schema default
http://localhost:19530
, validation will fail unless the private endpoint is allowlisted. Ensure docs/tests show setting the env var for local/dev.Would you like a quick doc snippet added to Milvus README showing
ALLOWED_ADAPTER_PRIVATE_ENDPOINTS=127.0.0.1:19530
usage?src/unstract/sdk/adapters/url_validator.py (2)
60-87
: Resolve A/AAAA and derive effective port; block if any resolved IP is privateIPv4-only
gethostbyname
and missing default ports cause false negatives/positives. Usegetaddrinfo
, compute effective port.- # Resolve hostname to IP address - try: - host_ip = socket.gethostbyname(parsed.hostname) - except socket.gaierror as e: - return ( - False, - f"DNS resolution failed for '{parsed.hostname}': {str(e)}", - ) - - # Check if IP is private - ip_obj = ipaddress.ip_address(host_ip) - if cls._is_private_ip(ip_obj): - # Private IP - check whitelist - port = parsed.port - if cls._is_whitelisted(ip_obj, port): - logger.info(f"Private IP {host_ip}:{port} allowed by whitelist") - return True, "" - else: - error_msg = ( - f"URL blocked: Private IP {host_ip}" - f"{':' + str(port) if port else ''} not in whitelist. " - f"Contact platform admin for assistance." - ) - return False, error_msg - - # Public IP - allowed by default - return True, "" + # Resolve hostname to all IPs (A/AAAA) + try: + addrinfos = socket.getaddrinfo(parsed.hostname, None, proto=socket.IPPROTO_TCP) + except socket.gaierror as e: + return False, f"DNS resolution failed for '{parsed.hostname}': {e!s}" + + resolved_ips = {ai[4][0] for ai in addrinfos} + # Effective port (defaults for HTTP/S if not provided) + effective_port = parsed.port + if effective_port is None: + if scheme == "https": + effective_port = 443 + elif scheme == "http": + effective_port = 80 + + for host_ip in resolved_ips: + ip_obj = ipaddress.ip_address(host_ip) + if cls._is_private_ip(ip_obj): + if cls._is_whitelisted(ip_obj, effective_port): + logger.info("Private IP %s:%s allowed by whitelist", host_ip, effective_port) + continue + return False, ( + f"URL blocked: Private IP {host_ip}" + f"{':' + str(effective_port) if effective_port else ''} not in whitelist. " + "Contact platform admin for assistance." + ) + + # All resolved IPs are public or allowlisted + return True, ""
26-27
: Make ENV var backward-compatible and add ClassVar annotationsCode and docs still reference WHITELISTED_ENDPOINTS — add ALLOWED_ADAPTER_PRIVATE_ENDPOINTS while keeping the legacy name and annotate both with ClassVar (import typing.ClassVar).
File: src/unstract/sdk/adapters/url_validator.py (ENV_VAR definition)
-from dataclasses import dataclass +from dataclasses import dataclass +from typing import ClassVar @@ - ENV_VAR = "WHITELISTED_ENDPOINTS" + # New name per PR objective; keep legacy for compatibility + ENV_VAR: ClassVar[str] = "ALLOWED_ADAPTER_PRIVATE_ENDPOINTS" + LEGACY_ENV_VAR: ClassVar[str] = "WHITELISTED_ENDPOINTS"src/unstract/sdk/adapters/vectordb/postgres/src/postgres.py (2)
61-95
: Fix “None” coercions; pass proper types; add timeout; chain exceptionsCurrent code turns missing password/port into literal strings and lacks a connect timeout.
- encoded_password = quote_plus(str(self._config.get(Constants.PASSWORD))) + raw_password = self._config.get(Constants.PASSWORD) + encoded_password = quote_plus(raw_password) if raw_password is not None else None @@ - vector_db: BasePydanticVectorStore = PGVectorStore.from_params( + vector_db: BasePydanticVectorStore = PGVectorStore.from_params( database=self._config.get(Constants.DATABASE), schema_name=self._schema_name, host=self._config.get(Constants.HOST), password=encoded_password, - port=str(self._config.get(Constants.PORT)), + port=self._config.get(Constants.PORT), user=self._config.get(Constants.USER), table_name=self._collection_name, embed_dim=dimension, ) @@ - self._client = psycopg2.connect( + self._client = psycopg2.connect( database=self._config.get(Constants.DATABASE), host=self._config.get(Constants.HOST), user=self._config.get(Constants.USER), - password=self._config.get(Constants.PASSWORD), - port=str(self._config.get(Constants.PORT)), + password=raw_password, + port=self._config.get(Constants.PORT), sslmode=ssl_mode, + connect_timeout=10, ) @@ - except Exception as e: - raise AdapterError(str(e)) + except Exception as e: + raise AdapterError(str(e)) from eIf supported, consider supplying SSL/connect args to
PGVectorStore.from_params
viacreate_engine_kwargs
.Also applies to: 98-99
115-127
: Build parseable URL; bracket IPv6Return a single normalized URL for validation.
- host = self._config.get(Constants.HOST) - port = self._config.get(Constants.PORT) - - if host: - # Construct the database URL for validation - if port: - url = f"postgresql://{host}:{port}" - else: - url = f"postgresql://{host}" - return [url] - return [] + host = self._config.get(Constants.HOST) + port = self._config.get(Constants.PORT) + if not host: + return [] + host_s = str(host).strip() + # Bracket IPv6 literals + try: + import ipaddress # stdlib + ip = ipaddress.ip_address(host_s) + if ip.version == 6 and not (host_s.startswith("[") and host_s.endswith("]")): + host_s = f"[{host_s}]" + except Exception: + pass + url = f"postgresql://{host_s}" + (f":{port}" if port else "") + return [url]Add at file top (outside this hunk) if not already present:
import ipaddress # for IPv6 URL formatting
🧹 Nitpick comments (11)
src/unstract/sdk/adapters/x2text/unstructured_community/src/unstructured_community.py (1)
40-44
: Hardenget_configured_urls()
against non-str/whitespace values.Minor resilience: normalize and type-check the config value before returning.
Apply:
- url = self.config.get("url") - return [url] if url else [] + url = self.config.get("url") + if isinstance(url, str): + url = url.strip() + return [url] if url else [] + return []src/unstract/sdk/adapters/embedding/ollama/src/ollama.py (1)
48-52
: Use Constants and normalize valuePrefer the defined constant and trim/guard to avoid empty strings.
- base_url = self.config.get("base_url") - return [base_url] if base_url else [] + base_url = self.config.get(Constants.BASE_URL) + if isinstance(base_url, str) and base_url.strip(): + return [base_url.strip()] + return []src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py (1)
49-53
: Normalize URI for validationGuard against blanks and missing scheme; align with schema default.
- uri = self._config.get(Constants.URI) - return [uri] if uri else [] + uri = self._config.get(Constants.URI) + if not isinstance(uri, str) or not uri.strip(): + return [] + uri_s = uri.strip() + if "://" not in uri_s: + uri_s = f"http://{uri_s}" + return [uri_s]src/unstract/sdk/adapters/url_validator.py (5)
28-42
: Annotate mutable class attributeRuff RUF012: mark as
ClassVar[list[str]]
.- BLOCKED_PRIVATE_RANGES = [ + BLOCKED_PRIVATE_RANGES: ClassVar[list[str]] = [
45-59
: Allow optional scheme enforcement (keep DB adapters working)Add an optional
allowed_schemes
parameter; web adapters can pass{"http","https"}
while DB adapters pass None.- def validate_url(cls, url: str) -> tuple[bool, str]: + def validate_url( + cls, url: str, allowed_schemes: set[str] | None = None + ) -> tuple[bool, str]: @@ - parsed = urlparse(url) + parsed = urlparse(url) + scheme = (parsed.scheme or "").lower() + if allowed_schemes is not None and scheme not in allowed_schemes: + return False, ( + f"Invalid or unsupported URL scheme '{parsed.scheme}'. " + f"Only {', '.join(sorted(allowed_schemes))} are allowed." + )
88-91
: Use logging.exception; avoid bare catchPreserve traceback and return a concise message.
- except Exception as e: - logger.error(f"URL validation error for '{url}': {str(e)}") - return False, f"{str(e)}" + except Exception as e: + logger.exception("URL validation error for '%s'", url) + return False, str(e)
120-139
: Legacy env support in parserAlso read legacy env var here so callers benefit without touching call sites.
- config = os.getenv(cls.ENV_VAR, "").strip() + config = (os.getenv(cls.ENV_VAR) or os.getenv(cls.LEGACY_ENV_VAR, "")).strip()
142-172
: Support bracketed IPv6 in allowlist entriesStrip brackets so
[2001:db8::1]:443
parses.- ip_part = entry + ip_part = entry @@ - if ":" in entry: + if ":" in entry: parts = entry.rsplit(":", 1) if len(parts) == 2 and parts[1].isdigit(): ip_part = parts[0] port = int(parts[1]) @@ - if "/" in ip_part: + # Remove IPv6 brackets, if present + if ip_part.startswith("[") and ip_part.endswith("]"): + ip_part = ip_part[1:-1] + if "/" in ip_part: # CIDR notationsrc/unstract/sdk/adapters/llm/open_ai/src/open_ai.py (1)
57-61
: Use constant and normalizeUse the defined constant and trim.
- api_base = self.config.get("api_base") - return [api_base] if api_base else [] + api_base = self.config.get(Constants.API_BASE) + if isinstance(api_base, str) and api_base.strip(): + return [api_base.strip()] + return []src/unstract/sdk/adapters/vectordb/weaviate/src/weaviate.py (1)
53-57
: Trim and lightly normalize URL to avoid whitespace/empty-string edge cases.def get_configured_urls(self) -> list[str]: """Return all URLs this adapter will connect to.""" - url = self._config.get(Constants.URL) - return [url] if url else [] + url = self._config.get(Constants.URL) + if isinstance(url, str): + url = url.strip() + return [url] if url else []src/unstract/sdk/adapters/embedding/azure_open_ai/src/azure_open_ai.py (1)
55-59
: Use the defined constant and trim input to avoid whitespace/typos.def get_configured_urls(self) -> list[str]: """Return all URLs this adapter will connect to.""" - endpoint = self.config.get("azure_endpoint") - return [endpoint] if endpoint else [] + endpoint = self.config.get(Constants.AZURE_ENDPOINT) + if isinstance(endpoint, str): + endpoint = endpoint.strip() + return [endpoint] if endpoint else []
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base
setting
📒 Files selected for processing (15)
src/unstract/sdk/adapters/embedding/azure_open_ai/src/azure_open_ai.py
(2 hunks)src/unstract/sdk/adapters/embedding/ollama/src/ollama.py
(2 hunks)src/unstract/sdk/adapters/embedding/open_ai/src/open_ai.py
(2 hunks)src/unstract/sdk/adapters/llm/any_scale/src/anyscale.py
(2 hunks)src/unstract/sdk/adapters/llm/azure_open_ai/src/azure_open_ai.py
(3 hunks)src/unstract/sdk/adapters/llm/ollama/src/ollama.py
(2 hunks)src/unstract/sdk/adapters/llm/open_ai/src/open_ai.py
(2 hunks)src/unstract/sdk/adapters/url_validator.py
(1 hunks)src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py
(2 hunks)src/unstract/sdk/adapters/vectordb/postgres/src/postgres.py
(2 hunks)src/unstract/sdk/adapters/vectordb/qdrant/src/qdrant.py
(2 hunks)src/unstract/sdk/adapters/vectordb/weaviate/src/weaviate.py
(2 hunks)src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py
(2 hunks)src/unstract/sdk/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py
(2 hunks)src/unstract/sdk/adapters/x2text/unstructured_community/src/unstructured_community.py
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- src/unstract/sdk/adapters/vectordb/qdrant/src/qdrant.py
- src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py
- src/unstract/sdk/adapters/llm/ollama/src/ollama.py
- src/unstract/sdk/adapters/llm/any_scale/src/anyscale.py
- src/unstract/sdk/adapters/llm/azure_open_ai/src/azure_open_ai.py
- src/unstract/sdk/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py
- src/unstract/sdk/adapters/embedding/open_ai/src/open_ai.py
🧰 Additional context used
🪛 Ruff (0.12.2)
src/unstract/sdk/adapters/url_validator.py
29-42: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
66-66: Use explicit conversion flag
Replace with conversion flag
(RUF010)
86-86: Consider moving this statement to an else
block
(TRY300)
88-88: Do not catch blind exception: Exception
(BLE001)
89-89: Use logging.exception
instead of logging.error
Replace with exception
(TRY400)
89-89: Use explicit conversion flag
Replace with conversion flag
(RUF010)
90-90: Use explicit conversion flag
Replace with conversion flag
(RUF010)
136-136: Do not catch blind exception: Exception
(BLE001)
137-137: Use explicit conversion flag
Replace with conversion flag
(RUF010)
171-171: Use explicit conversion flag
Replace with conversion flag
(RUF010)
🔇 Additional comments (7)
src/unstract/sdk/adapters/x2text/unstructured_community/src/unstructured_community.py (2)
18-21
: Good: validation happens before any network calls.This meets the “validate before use” requirement. No changes needed here.
Please confirm
_validate_urls()
re-validates resolved IPs (post-DNS) and handles redirects to private ranges to mitigate DNS-rebinding/redirect-based SSRF.
14-14
: Enable URL validation by default (validate_urls=True) or prove all callers opt inRepo scan found only the class definition at src/unstract/sdk/adapters/x2text/unstructured_community/src/unstructured_community.py:14 and no instantiations. Change the default to True or provide evidence every instantiation passes validate_urls=True.
Apply this change (preferred):
- def __init__(self, settings: dict[str, Any], validate_urls: bool = False): + def __init__(self, settings: dict[str, Any], validate_urls: bool = True):src/unstract/sdk/adapters/embedding/ollama/src/ollama.py (1)
18-25
: Good: validate before any network workConstructor gating with
validate_urls
and calling_validate_urls()
early looks correct.src/unstract/sdk/adapters/llm/open_ai/src/open_ai.py (1)
28-34
: Good: early validation togglePattern matches other adapters; no network work before
_validate_urls()
.src/unstract/sdk/adapters/vectordb/postgres/src/postgres.py (1)
26-34
: Good: pre-init validationPattern is consistent with other adapters. Ensure the validator does not hard-restrict schemes to HTTP(S), or pass allowed schemes if that change lands.
src/unstract/sdk/adapters/embedding/azure_open_ai/src/azure_open_ai.py (1)
25-32
: LGTM: URL preflight before any network ops.Constructor sets config then validates URLs; no network side effects occur pre‑validation. This aligns with the SSRF mitigation goal.
src/unstract/sdk/adapters/vectordb/weaviate/src/weaviate.py (1)
24-31
: Pre‑instantiation URL validation is safe for Weaviate.init.Adapter._validate_urls() only calls get_configured_urls() and uses getattr(self, "name", self.class.name) as a fallback; Weaviate sets self._config before invoking _validate_urls(), so it does not rely on base init state — no change required.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Gayathri <[email protected]>
…ract-sdk into fix/url_safety_net
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/unstract/sdk/prompt.py (1)
220-224
: Add request timeouts to prevent indefinite hangs.External calls without timeouts can stall request threads and harm reliability.
+ TIMEOUT_SEC = 30 if method.upper() == "POST": response = requests.post( - url=url, json=payload, params=params, headers=req_headers + url=url, json=payload, params=params, headers=req_headers, timeout=TIMEOUT_SEC ) elif method.upper() == "GET": - response = requests.get(url=url, params=params, headers=req_headers) + response = requests.get(url=url, params=params, headers=req_headers, timeout=TIMEOUT_SEC)
🧹 Nitpick comments (9)
src/unstract/sdk/tool/stream.py (2)
118-131
: Propagate status_code in the streamed error payload + clarify docstring optionalsLog consumers won’t see the HTTP code unless it’s included in the LOG record. Also tighten docstring types to reflect optionals.
- def stream_error_and_exit(self, message: str, err: Exception | None = None, status_code: int | None = None) -> None: + def stream_error_and_exit(self, message: str, err: Exception | None = None, status_code: int | None = None) -> None: @@ - err (Exception): Actual exception that occurred - status_code (int): HTTP status code to preserve + err (Exception | None): Optional underlying exception + status_code (int | None): Optional HTTP/status code to preserve @@ - self.stream_log(message, level=LogLevel.ERROR) + self.stream_log(message, level=LogLevel.ERROR, status_code=status_code)
126-129
: Use the public property consistentlyPrefer
is_exec_by_tool
over touching the private field to avoid bypassing future overrides or invariants.- if self._exec_by_tool: + if self.is_exec_by_tool: exit(1)src/unstract/sdk/prompt.py (4)
39-41
: Use logger.exception for better diagnostics.Switch to logger.exception to include stack trace automatically.
- logger.error(f"{msg}\n{e}") + logger.exception(msg)
175-177
: Don’t send None as a header value.REQUEST_ID may be None; Requests expects str values. Add only when present.
- request_headers = {RequestHeader.REQUEST_ID: self.request_id} + request_headers: dict[str, str] = {} + if self.request_id is not None: + request_headers[RequestHeader.REQUEST_ID] = self.request_id
216-216
: Normalize URL join to avoid accidental double slashes.Safer when base_url/url_path contain stray slashes.
- url: str = f"{self.base_url}/{url_path}" + url: str = f"{self.base_url.rstrip('/')}/{url_path.lstrip('/')}"
41-41
: Signature OK — callers with a third arg are valid; update docsstream_error_and_exit is defined as (message, err: Exception | None = None, status_code: int | None = None) at src/unstract/sdk/tool/stream.py:118, so calls like args[0].tool.stream_error_and_exit(msg, e, None) are safe — no runtime fixes needed. Update docs at docs/site/docs/API Reference/tool.stream.md (shows only stream_error_and_exit(message: str) → None) to include the err and status_code parameters.
src/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py (3)
120-122
: Simplify exception message per static analysis guidanceThe exception message is embedded directly in the raise statement. Consider extracting it to a constant for better maintainability and to address the static analysis hint.
Apply this diff to extract the exception message:
+ UNSUPPORTED_METHOD_MSG = "Unsupported request method" + # ... in _make_request method ... - raise ExtractorError( - f"Unsupported request method: {request_method}" - ) + raise ExtractorError(f"{UNSUPPORTED_METHOD_MSG}: {request_method}")
276-278
: Fix string formatting in error messageThe error message has a formatting issue where the string is split across lines without proper concatenation.
Apply this diff to fix the string formatting:
- raise ExtractorError( - "Unable to extract text after attempting" - f" {request_count} times" - ) + raise ExtractorError( + f"Unable to extract text after attempting {request_count} times" + )
420-422
: Use logging.exception for better error trackingWhen logging within an exception handler,
logging.exception
is preferred as it automatically includes the traceback information, which is helpful for debugging.Apply this diff to improve error logging:
- logger.error( - f"Error while writing metadata to {metadata_file_path}: {e}" - ) + logger.exception( + f"Error while writing metadata to {metadata_file_path}: {e}" + )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base
setting
📒 Files selected for processing (3)
src/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py
(11 hunks)src/unstract/sdk/prompt.py
(6 hunks)src/unstract/sdk/tool/stream.py
(1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.1)
src/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py
120-122: Avoid specifying long messages outside the exception class
(TRY003)
420-422: Use logging.exception
instead of logging.error
Replace with exception
(TRY400)
🔇 Additional comments (6)
src/unstract/sdk/tool/stream.py (1)
130-130
: SdkError accepts status_code and actual_err kwargs Confirmed — SdkError.init declaresstatus_code: int | None = None
andactual_err: Exception | None = None
in src/unstract/sdk/exceptions.py.src/unstract/sdk/prompt.py (4)
10-10
: Import changes look fine.Constants are used below; no concerns.
22-24
: Decorator signature formatting is OK.Type vars preserved; no behavior change.
84-86
: Base URL formatting change only.No behavioral impact; OK.
232-234
: Decorator formatting LGTM.No functional change; consistent with style.
src/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py (1)
59-62
: URL normalization implementation is correct but should add _validate_urls() call in initThe
get_configured_urls()
method correctly implements the previously suggested URL normalization with proper string conversion, whitespace stripping, and trailing slash removal. This ensures consistent whitelist matching.However, the URL validation call in
__init__()
is still missing. The base class provides_validate_urls()
which should be invoked to enforce SSRF protections.Apply this diff to add URL validation in
__init__()
:def __init__(self, settings: dict[str, Any]): super().__init__("LLMWhisperer") self.config = settings + self._validate_urls()
This ensures that URL validation runs before any network operations, providing the security layer this PR is designed to implement. As the Python documentation notes, "We recommend that users of these APIs where the values may be used anywhere with security implications code defensively. Do some verification within your code before trusting a returned component part."
error_message = str(e) | ||
status_code = None | ||
response = getattr(e, "response", None) | ||
if response is not None: | ||
status_code = response.status_code | ||
if ( | ||
MimeType.JSON in response.headers.get("Content-Type", "").lower() | ||
MimeType.JSON | ||
in response.headers.get("Content-Type", "").lower() | ||
and "error" in response.json() | ||
): | ||
error_message = response.json()["error"] | ||
elif response.text: | ||
error_message = response.text | ||
msg = f"Error while {context}. {error_message}" | ||
args[0].tool.stream_error_and_exit(msg, e) | ||
args[0].tool.stream_error_and_exit(msg, e, status_code) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Harden RequestException parsing (avoid double JSON parse and JSONDecodeError).
Current code calls response.json() multiple times and assumes a dict with “error”. Guard for non‑JSON/invalid JSON and parse once.
- error_message = str(e)
- status_code = None
- response = getattr(e, "response", None)
- if response is not None:
- status_code = response.status_code
- if (
- MimeType.JSON
- in response.headers.get("Content-Type", "").lower()
- and "error" in response.json()
- ):
- error_message = response.json()["error"]
- elif response.text:
- error_message = response.text
+ error_message = str(e)
+ status_code = None
+ response = getattr(e, "response", None)
+ if response is not None:
+ status_code = response.status_code
+ content_type = response.headers.get("Content-Type", "")
+ if "json" in content_type.lower():
+ try:
+ data = response.json()
+ except ValueError:
+ data = None
+ if isinstance(data, dict) and "error" in data:
+ error_message = str(data["error"])
+ elif response.text:
+ error_message = response.text
+ elif response.text:
+ error_message = response.text
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
error_message = str(e) | |
status_code = None | |
response = getattr(e, "response", None) | |
if response is not None: | |
status_code = response.status_code | |
if ( | |
MimeType.JSON in response.headers.get("Content-Type", "").lower() | |
MimeType.JSON | |
in response.headers.get("Content-Type", "").lower() | |
and "error" in response.json() | |
): | |
error_message = response.json()["error"] | |
elif response.text: | |
error_message = response.text | |
msg = f"Error while {context}. {error_message}" | |
args[0].tool.stream_error_and_exit(msg, e) | |
args[0].tool.stream_error_and_exit(msg, e, status_code) | |
error_message = str(e) | |
status_code = None | |
response = getattr(e, "response", None) | |
if response is not None: | |
status_code = response.status_code | |
content_type = response.headers.get("Content-Type", "") | |
if "json" in content_type.lower(): | |
try: | |
data = response.json() | |
except ValueError: | |
data = None | |
if isinstance(data, dict) and "error" in data: | |
error_message = str(data["error"]) | |
elif response.text: | |
error_message = response.text | |
elif response.text: | |
error_message = response.text | |
msg = f"Error while {context}. {error_message}" | |
args[0].tool.stream_error_and_exit(msg, e, status_code) |
print(json.dumps(record)) | ||
|
||
def stream_error_and_exit(self, message: str, err: Exception | None = None) -> None: | ||
def stream_error_and_exit(self, message: str, err: Exception | None = None, status_code: int | None = None) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
Confirm Python version compatibility of PEP 604 unions
Exception | None
needs Python 3.10+. If 3.9 is still supported, switch to Optional[Exception]
(and from typing import Optional
).
🏁 Script executed:
#!/bin/bash
# Detect declared Python version(s)
fd -a 'pyproject.toml|setup.cfg|setup.py|tox.ini' | sed 's#^#--- #'
rg -n 'python_requires|Programming Language :: Python ::' -C2 -- pyproject.toml setup.cfg setup.py tox.ini 2>/dev/null || true
Length of output: 3136
Use Optional[...] for Python 3.9 compatibility or require Python ≥3.10
pyproject.toml declares "Programming Language :: Python :: 3.9", so change the signature in src/unstract/sdk/tool/stream.py (line ~118) from Exception | None
and int | None
to Optional[Exception]
and Optional[int]
and add from typing import Optional
— or update project metadata to require Python >=3.10.
🤖 Prompt for AI Agents
In src/unstract/sdk/tool/stream.py around line 118, the method signature uses
Python 3.10 union types (Exception | None, int | None) but pyproject declares
Python 3.9; change the signature to use Optional[Exception] and Optional[int],
add "from typing import Optional" at the top of the file (or alternatively
update project metadata to require Python ≥3.10), and ensure any other
occurrences of "X | None" in this file are similarly converted to Optional[X].
What
Why
This change prevents adapters from being used to:
How
__init__
methods to prevent Server-Side Request Forgery (SSRF) attacksALLOWED_ADAPTER_PRIVATE_ENDPOINTS
environment variable to allow specific private endpoints when neededRelevant Docs
Related Issues or PRs
Dependencies Versions / Env Variables
Notes on Testing
Screenshots
Test connection working with Whiteliest

Test connection failing

Create adapter failing

Update adapter failing

Checklist
I have read and understood the Contribution Guidelines.