Skip to content

Conversation

gaya3-zipstack
Copy link
Contributor

@gaya3-zipstack gaya3-zipstack commented Sep 10, 2025

What

  • Add URL validation and SSRF protection for all adapter connections
  • Implement secure URL checking to prevent connections to private networks and localhost
  • Block potentially dangerous protocols and restricted IP ranges
  • Support configurable allowed private endpoints via whitelist

Why

This change prevents adapters from being used to:

  • Access internal services on private networks (10.x.x.x, 192.168.x.x, 172.16-31.x.x)
  • Connect to localhost/loopback addresses (127.x.x.x, ::1)
  • Probe internal infrastructure through SSRF attacks
  • Access link-local addresses (169.254.x.x)

How

  • Security Enhancement: Added comprehensive URL validation in adapter __init__ methods to prevent Server-Side Request Forgery (SSRF) attacks
  • Network Protection: Block connections to private IP ranges (RFC 1918), localhost, and link-local addresses
  • Protocol Validation: Only allow HTTP/HTTPS protocols for web-based adapters
  • Configurable Whitelist: Support for ALLOWED_ADAPTER_PRIVATE_ENDPOINTS environment variable to allow specific private endpoints when needed
  • Error Handling: Clear error messages for blocked URLs with security context

Relevant Docs

Related Issues or PRs

Dependencies Versions / Env Variables

  • ALLOWED_ADAPTER_PRIVATE_ENDPOINTS

Notes on Testing

  • Tested test connection by whitelisting endpoints within docker network (172.16.0.0/12)
  • Tested test connection without whitelisting docker network
  • Tested adapter creation using POST API
  • Tested adapter edit using PUT API

Screenshots

Test connection working with Whiteliest
image

Test connection failing
image

Create adapter failing
image

Update adapter failing
image

Checklist

I have read and understood the Contribution Guidelines.

Copy link

coderabbitai bot commented Sep 10, 2025

Summary by CodeRabbit

  • New Features
    • Optional URL validation across LLM, embedding, vector DB, and x2text adapters with a per-adapter toggle.
    • Configurable whitelist for private endpoints via environment variable.
    • New public method to list configured connection URLs in adapters.
    • Pre-validation of service URLs in x2text requests.
  • Improvements
    • Error handling now includes HTTP status codes for clearer failures.
    • Updated Milvus schema default URI to include http://.
  • Tests
    • Added comprehensive URL validation test suite.
  • Chores
    • Bumped SDK version to v0.78.0.

Walkthrough

Adds a URLValidator module (DNS + private-range blocking with ENV whitelist), extends Adapter base with get_configured_urls() and _validate_urls(), updates many adapters to expose configured URLs and optionally validate them during init, adds URL checks in x2text helper, adds tests, and bumps Milvus schema default and SDK version.

Changes

Cohort / File(s) Summary
Core validation infrastructure
src/unstract/sdk/adapters/base.py, src/unstract/sdk/adapters/url_validator.py
Add URLValidator and WhitelistEntry (DNS resolution, private-range blocking, ENV whitelist parsing). Extend Adapter with get_configured_urls() (default []) and _validate_urls() which calls URLValidator.validate_url() and raises AdapterError on failure; minor quoting tweak.
Embedding adapters
src/unstract/sdk/adapters/embedding/azure_open_ai/src/azure_open_ai.py, src/unstract/sdk/adapters/embedding/ollama/src/ollama.py, src/unstract/sdk/adapters/embedding/open_ai/src/open_ai.py
Constructors gain optional validate_urls: bool = False and call _validate_urls() when set; add get_configured_urls() to expose configured endpoint(s).
LLM adapters
src/unstract/sdk/adapters/llm/any_scale/src/anyscale.py, src/unstract/sdk/adapters/llm/azure_open_ai/src/azure_open_ai.py, src/unstract/sdk/adapters/llm/ollama/src/ollama.py, src/unstract/sdk/adapters/llm/open_ai/src/open_ai.py
Constructors accept validate_urls and invoke _validate_urls() when requested; add get_configured_urls() implementations; minor formatting/simplifications in some files.
Vector DB adapters & schema
src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py, src/unstract/sdk/adapters/vectordb/milvus/src/static/json_schema.json, src/unstract/sdk/adapters/vectordb/postgres/src/postgres.py, src/unstract/sdk/adapters/vectordb/qdrant/src/qdrant.py, src/unstract/sdk/adapters/vectordb/weaviate/src/weaviate.py
Constructors accept validate_urls and may call _validate_urls() early; add get_configured_urls() to each adapter; update Milvus JSON schema uri default to "http://localhost:19530".
x2text adapters & helper
src/unstract/sdk/adapters/x2text/helper.py, src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py, src/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py, src/unstract/sdk/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py, src/unstract/sdk/adapters/x2text/unstructured_community/src/unstructured_community.py, src/unstract/sdk/adapters/x2text/unstructured_enterprise/src/unstructured_enterprise.py
helper.py now validates unstructured_url via URLValidator and raises AdapterError on failure; x2text adapters add get_configured_urls() and some accept validate_urls to call _validate_urls() during init.
Tests
tests/test_url_validator.py
Add comprehensive tests for URLValidator covering public/private IP handling, DNS errors, malformed URLs, ENV whitelist parsing (IP, CIDR, port), localhost and cloud metadata addresses; uses mocked DNS and env setup/teardown.
Packaging / Version
src/unstract/sdk/__init__.py
Bump SDK version from v0.77.1 to v0.78.0.
Tooling / errors
src/unstract/sdk/tool/stream.py, src/unstract/sdk/prompt.py
StreamMixin.stream_error_and_exit updated to accept optional status_code and pass it to SdkError; prompt.py exception handling updated to capture and forward HTTP status codes and adjust formatting.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant Adapter as Adapter Subclass
  participant Base as Adapter Base
  participant Validator as URLValidator
  Note over Adapter: __init__(config, validate_urls=False)
  Client->>Adapter: instantiate(config, validate_urls=True)
  Adapter->>Adapter: set self.config
  Adapter->>Base: _validate_urls()
  Base->>Adapter: get_configured_urls()
  loop For each non-empty URL
    Base->>Validator: validate_url(url)
    alt Invalid URL
      Validator-->>Base: (False, error_message)
      Base->>Base: log error with adapter name
      Base-->>Client: raise AdapterError
    else Valid URL
      Validator-->>Base: (True, "")
    end
  end
  Adapter-->>Client: instance ready
  Note right of Validator: DNS → IP check → private-range block → optional ENV whitelist allow
Loading
sequenceDiagram
  autonumber
  actor Client
  participant Adapter as Adapter Subclass
  participant Base as Adapter Base
  participant Validator as URLValidator
  Client->>Adapter: test_connection()
  Adapter->>Base: _validate_urls() (before network ops)
  Base->>Validator: validate_url(url)
  alt Any invalid
    Validator-->>Base: (False, msg)
    Base-->>Client: raise AdapterError
  else All valid
    Adapter->>Adapter: proceed with adapter-specific checks
    Adapter-->>Client: connection result
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 70.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Description Check ❓ Inconclusive The description follows the repository template and includes What/Why/How, Related Issues, testing notes, screenshots, and a checklist, so its structure is mostly complete and informative. However, it contains a critical inconsistency: the PR text refers to an environment variable named "ALLOWED_ADAPTER_PRIVATE_ENDPOINTS" while the implementation and tests set URLValidator.ENV_VAR = "WHITELISTED_ENDPOINTS", which could mislead operators and reviewers about the actual configuration name and format. The "Relevant Docs" section is empty and the description does not provide concrete example whitelist values or parsing rules, nor does it explicitly mention the new validate_urls constructor flags and the exact protocol restrictions, which are important operational details. Because of these gaps and the env var mismatch I cannot conclusively verify that the description matches the implementation. Please update the PR description to use the exact environment variable name from the code (e.g., WHITELISTED_ENDPOINTS) and include concrete example values and the expected parsing format, populate the "Relevant Docs" section with links to operational/security docs, and clearly state protocol restrictions (HTTP/HTTPS only) and behavior on DNS failures; also call out the new validate_urls constructor flag and reference the unit tests (tests/test_url_validator.py) in "Notes on Testing" so reviewers and operators can reproduce and validate the behavior.
✅ Passed checks (1 passed)
Check name Status Explanation
Title Check ✅ Passed The title "USEC-3 [FIX] Support url safety checking for adapters" succinctly and accurately summarizes the primary change (adding URL safety/SSRF protections for adapters) and includes the issue key and fix tag for traceability, making it clear for a teammate scanning PR history.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/url_safety_net

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/unstract/sdk/adapters/llm/any_scale/src/anyscale.py (1)

56-73: Avoid passing "None" strings for optional params.

str(None) becomes "None" and may break the client or logs; pass None instead for optional fields.

Apply this minimal fix:

             llm: LLM = Anyscale(
                 model=str(self.config.get(Constants.MODEL)),
                 api_key=str(self.config.get(Constants.API_KEY)),
-                api_base=str(self.config.get(Constants.API_BASE)),
-                additional_kwargs=self.config.get(Constants.ADDITIONAL_KWARGS),
+                api_base=self.config.get(Constants.API_BASE) or None,
+                additional_kwargs=self.config.get(Constants.ADDITIONAL_KWARGS) or None,
                 max_retries=int(
                     self.config.get(Constants.MAX_RETRIES, LLMKeys.DEFAULT_MAX_RETRIES)
                 ),
🧹 Nitpick comments (28)
src/unstract/sdk/adapters/vectordb/milvus/src/static/json_schema.json (1)

19-21: Default points to a blocked endpoint; add UX hint in description.

With SSRF protections, http://localhost:19530 will be blocked unless whitelisted. Consider clarifying this in the field description to reduce setup friction.

-      "description": "Provide the URI of the Milvus server. Example: `https://<instance-id>.api.gcp-us-west1.zillizcloud.com`"
+      "description": "Provide the URI of the Milvus server. Example: `https://<instance-id>.api.gcp-us-west1.zillizcloud.com`. Note: localhost/private IPs are blocked by default for security. For local development, ask your admin to allowlist the endpoint via ALLOWED_ADAPTER_PRIVATE_ENDPOINTS."
src/unstract/sdk/adapters/url_validator.py (5)

93-103: Precompute blocked networks and annotate as ClassVars.

Avoid reparsing CIDRs on every call and address Ruff RUF012.

-    BLOCKED_PRIVATE_RANGES = [
+    BLOCKED_PRIVATE_RANGES: ClassVar[list[str]] = [
@@
     ]
+
+    # Precompute networks once
+    BLOCKED_PRIVATE_NETWORKS: ClassVar[tuple[ipaddress._BaseNetwork, ...]] = tuple(
+        ipaddress.ip_network(cidr) for cidr in BLOCKED_PRIVATE_RANGES
+    )
@@
     def _is_private_ip(cls, ip: ipaddress.IPv4Address | ipaddress.IPv6Address) -> bool:
         """Check if IP address is in private ranges."""
-        for range_str in cls.BLOCKED_PRIVATE_RANGES:
-            try:
-                network = ipaddress.ip_network(range_str)
-                if ip in network:
-                    return True
-            except ValueError:
-                continue
+        for network in cls.BLOCKED_PRIVATE_NETWORKS:
+            if ip in network:
+                return True
         return False

88-91: Tighten exception handling and use logging.exception.

Meet TRY400/BLE001/RUF010 hints; keep fail-closed behavior.

-        except Exception as e:
-            logger.error(f"URL validation error for '{url}': {str(e)}")
-            return False, f"{str(e)}"
+        except socket.gaierror as e:
+            logger.exception("URL validation DNS error for '%s': %s", url, e)
+            return False, f"{e!s}"
+        except Exception as e:  # pragma: no cover
+            logger.exception("URL validation error for '%s': %s", url, e)
+            return False, f"{e!s}"

120-139: Whitelist parsing: avoid blanket Exception and improve messages.

Narrow the except and keep logs actionable.

-        for item in config.split(","):
+        for item in config.split(","):
@@
-            try:
-                entry = cls._parse_whitelist_entry(item)
-                if entry:
-                    entries.append(entry)
-            except Exception as e:
-                logger.warning(f"Invalid whitelist entry '{item}': {str(e)}")
+            entry = cls._parse_whitelist_entry(item)
+            if entry:
+                entries.append(entry)

140-172: IPv6 allowlist entries: document bracketless format for ports.

rsplit(":", 1) supports fe80::1:8080 but [fe80::1]:8080 will be rejected. Consider documenting accepted format or adding support for brackets.

Would you like me to add bracket support in _parse_whitelist_entry?


29-42: Consider extending the blocked ranges set.

Optional: add CGNAT and benchmarking ranges often blocked for SSRF.

         "169.254.0.0/16",  # Link-local
+        "100.64.0.0/10",   # Carrier-Grade NAT
+        "198.18.0.0/15",   # Benchmarking
tests/test_url_validator.py (3)

156-163: Prefer socket.gaierror to exercise the intended error path.

This ensures the DNS failure branch (not the generic Exception) is tested.

-        mock_gethostbyname.side_effect = Exception("DNS resolution failed")
+        import socket as _socket
+        mock_gethostbyname.side_effect = _socket.gaierror("DNS resolution failed")

164-177: Add tests for non-HTTP schemes and default-port allowlisting.

Covers scheme enforcement and implicit 80/443 behavior.

+    @patch("socket.gethostbyname", return_value="93.184.216.34")
+    def test_non_http_schemes_blocked(self, _):
+        is_valid, err = URLValidator.validate_url("ftp://example.com/resource")
+        self.assertFalse(is_valid)
+        self.assertIn("Only http and https are allowed", err)
+
+    @patch("socket.gethostbyname", return_value="10.0.0.5")
+    def test_default_port_allowlist_match(self, _):
+        os.environ[URLValidator.ENV_VAR] = "10.0.0.5:443"
+        is_valid, _ = URLValidator.validate_url("https://10.0.0.5")  # no explicit port
+        self.assertTrue(is_valid)

95-111: Consider adding IPv6 coverage.

Validate IPv6 resolution and allowlisting once the validator supports getaddrinfo.

Would you like me to add IPv6 tests using @patch("socket.getaddrinfo")?

src/unstract/sdk/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py (1)

48-52: Minor: trim and guard configured URL.

Strip whitespace and ensure empty strings don’t pass through.

 def get_configured_urls(self) -> list[str]:
     """Return all URLs this adapter will connect to."""
-    url = self.config.get("url")
-    return [url] if url else []
+    url = self.config.get("url")
+    if isinstance(url, str):
+        url = url.strip()
+    return [url] if url else []
src/unstract/sdk/adapters/llm/open_ai/src/open_ai.py (1)

57-61: Use the constant and trim the value.

Avoid string literals for config keys; trim to prevent accidental whitespace issues.

 def get_configured_urls(self) -> list[str]:
     """Return all URLs this adapter will connect to."""
-    api_base = self.config.get("api_base")
-    return [api_base] if api_base else []
+    api_base = self.config.get(Constants.API_BASE)
+    if isinstance(api_base, str):
+        api_base = api_base.strip()
+    return [api_base] if api_base else []
src/unstract/sdk/adapters/x2text/unstructured_community/src/unstructured_community.py (1)

39-43: Minor: trim configured URL.

Handles trailing spaces from env/config.

 def get_configured_urls(self) -> list[str]:
     """Return all URLs this adapter will connect to."""
-    url = self.config.get("url")
-    return [url] if url else []
+    url = self.config.get("url")
+    if isinstance(url, str):
+        url = url.strip()
+    return [url] if url else []
src/unstract/sdk/adapters/llm/ollama/src/ollama.py (1)

56-60: Minor: trim configured URL.

Avoids surprises from whitespace in BASE_URL.

 def get_configured_urls(self) -> list[str]:
     """Return all URLs this adapter will connect to."""
-    base_url = self.config.get(Constants.BASE_URL)
-    return [base_url] if base_url else []
+    base_url = self.config.get(Constants.BASE_URL)
+    if isinstance(base_url, str):
+        base_url = base_url.strip()
+    return [base_url] if base_url else []
src/unstract/sdk/adapters/vectordb/weaviate/src/weaviate.py (1)

53-57: Minor: trim configured URL.

Guard against whitespace-only values.

 def get_configured_urls(self) -> list[str]:
     """Return all URLs this adapter will connect to."""
-    url = self._config.get(Constants.URL)
-    return [url] if url else []
+    url = self._config.get(Constants.URL)
+    if isinstance(url, str):
+        url = url.strip()
+    return [url] if url else []
src/unstract/sdk/adapters/x2text/helper.py (1)

105-111: Tighten error surfacing; align with TRY003.

Log the validator’s message once and raise AdapterError with that message directly. This reduces noisy prefixes and satisfies Ruff TRY003.

-        if unstructured_url:
-            is_valid, error_message = URLValidator.validate_url(unstructured_url)
-            if not is_valid:
-                logger.error(f"Unstructured URL validation failed: {error_message}")
-                raise AdapterError(f"URL validation failed: {error_message}")
+        if unstructured_url:
+            is_valid, error_message = URLValidator.validate_url(unstructured_url)
+            if not is_valid:
+                logger.error(f"URL validation failed: {error_message}")
+                raise AdapterError(error_message)
src/unstract/sdk/adapters/vectordb/qdrant/src/qdrant.py (1)

52-56: Minor: trim configured URL.

Consistent with other adapters.

 def get_configured_urls(self) -> list[str]:
     """Return all URLs this adapter will connect to."""
-    url = self._config.get(Constants.URL)
-    return [url] if url else []
+    url = self._config.get(Constants.URL)
+    if isinstance(url, str):
+        url = url.strip()
+    return [url] if url else []
src/unstract/sdk/adapters/llm/any_scale/src/anyscale.py (1)

51-55: Validate implicit defaults too (if the SDK provides one).

If the Anyscale client uses a built-in api_base when not configured, include that same default here so validation can’t be bypassed by omission.

Would you confirm whether Anyscale has an implicit api_base? If yes, mirror it:

 def get_configured_urls(self) -> list[str]:
-    api_base = self.config.get(Constants.API_BASE)
-    return [api_base] if api_base else []
+    api_base = self.config.get(Constants.API_BASE)
+    # If the client falls back internally, include that fallback here as well.
+    return [api_base] if api_base else [Constants.API_BASE_DEFAULT]  # if you introduce such a constant
src/unstract/sdk/adapters/embedding/azure_open_ai/src/azure_open_ai.py (1)

54-58: Use the defined constant for the key.

Minor consistency/readability tweak.

-        endpoint = self.config.get("azure_endpoint")
+        endpoint = self.config.get(Constants.AZURE_ENDPOINT)
src/unstract/sdk/adapters/embedding/open_ai/src/open_ai.py (2)

53-57: Also validate the default base URL when api_base is not set.

Prevents skipping validation via omission.

-        api_base = self.config.get("api_base")
+        api_base = self.config.get(Constants.API_BASE_KEY, Constants.API_BASE_VALUE)
         return [api_base] if api_base else []

58-63: Align httpx timeout with the adapter timeout setting.

Currently the httpx timeout is hardcoded to 10s while the adapter timeout uses config. Aligning avoids surprises.

-            httpx_timeout = httpx.Timeout(10.0, connect=60.0)
+            httpx_timeout = httpx.Timeout(timeout, connect=60.0)
src/unstract/sdk/adapters/x2text/unstructured_enterprise/src/unstructured_enterprise.py (2)

14-17: Consider validating in init for consistency with other adapters.

Process/test helpers may validate, but failing fast in init keeps behavior uniform across adapters.

 class UnstructuredEnterprise(X2TextAdapter):
     def __init__(self, settings: dict[str, Any]):
         super().__init__("UnstructuredIOEnterprise")
         self.config = settings
+        # Validate URLs BEFORE any network operations
+        self._validate_urls()

54-56: Run centralized validation in test_connection too.

Keeps the contract consistent with other adapters’ test_connection paths.

 def test_connection(self) -> bool:
-    result: bool = UnstructuredHelper.test_server_connection(self.config)
+    super().test_connection()
+    result: bool = UnstructuredHelper.test_server_connection(self.config)
     return result
src/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py (1)

208-210: Validate before probe: good; add request timeouts for reliability.

Super call ensures validation; add explicit timeouts to outbound requests to avoid hangs.

You can set a single default timeout (e.g., 30s) on both GET and POST:

# In _make_request(), add a timeout to requests
response = requests.get(url=llm_whisperer_svc_url, headers=headers, params=params, timeout=30)
# ...
response = requests.post(url=llm_whisperer_svc_url, headers=headers, params=params, data=data, timeout=30)
src/unstract/sdk/adapters/embedding/ollama/src/ollama.py (1)

47-51: Use the constant for the key.

Keeps things DRY and reduces typo risk.

-        base_url = self.config.get("base_url")
+        base_url = self.config.get(Constants.BASE_URL)
src/unstract/sdk/adapters/llm/azure_open_ai/src/azure_open_ai.py (2)

57-61: Unify key usage and fix constant name typo (backward-compatible).

  • Use the Constants key instead of a hardcoded string.
  • The constant is misspelled (AZURE_ENDPONT). Add a correctly named alias and switch call sites.

Apply within the selected ranges:

-        endpoint = self.config.get("azure_endpoint")
+        endpoint = self.config.get(Constants.AZURE_ENDPOINT)
-            "azure_endpoint": str(self.config.get(Constants.AZURE_ENDPONT)),
+            "azure_endpoint": str(self.config.get(Constants.AZURE_ENDPOINT)),

Add this alias near the Constants block (outside the selected range):

# Prefer this going forward
AZURE_ENDPOINT = "azure_endpoint"
# Back-compat alias; mark for removal in a future major/minor bump
AZURE_ENDPONT = AZURE_ENDPOINT

Also applies to: 76-76


84-84: Gate reasoning_effort to O-series models only.

Avoid passing reasoning_* kwargs unless the selected model supports them.

-        if enable_reasoning:
-            llm_kwargs["reasoning_effort"] = self.config.get(Constants.REASONING_EFFORT)
+        if enable_reasoning and model in O1_MODELS:
+            llm_kwargs["reasoning_effort"] = self.config.get(Constants.REASONING_EFFORT)
src/unstract/sdk/adapters/base.py (2)

48-59: Clear contract for adapter URLs.

Solid default and docstring. Consider reminding implementers to include all outbound URLs (including health/test endpoints) to ensure full coverage.


60-76: Tighten exception message (ruff TRY003) and keep context in logs.

Ruff flags the constructed message. Since you already log adapter/url context, raise with the validator’s message directly.

-                raise AdapterError(f"URL validation failed: {error_message}")
+                raise AdapterError(error_message)

Optional: add structured logging context.

-                logger.error(
-                    f"URL validation failed for adapter '{adapter_name}': {error_message}"
-                )
+                logger.error(
+                    "URL validation failed",
+                    extra={"adapter": adapter_name, "url": url, "reason": error_message},
+                )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to Reviews > Disable Cache setting

Knowledge Base: Disabled due to Reviews > Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 4afc1f2 and c4c64b7.

📒 Files selected for processing (21)
  • src/unstract/sdk/adapters/base.py (3 hunks)
  • src/unstract/sdk/adapters/embedding/azure_open_ai/src/azure_open_ai.py (2 hunks)
  • src/unstract/sdk/adapters/embedding/ollama/src/ollama.py (2 hunks)
  • src/unstract/sdk/adapters/embedding/open_ai/src/open_ai.py (2 hunks)
  • src/unstract/sdk/adapters/llm/any_scale/src/anyscale.py (2 hunks)
  • src/unstract/sdk/adapters/llm/azure_open_ai/src/azure_open_ai.py (3 hunks)
  • src/unstract/sdk/adapters/llm/ollama/src/ollama.py (3 hunks)
  • src/unstract/sdk/adapters/llm/open_ai/src/open_ai.py (2 hunks)
  • src/unstract/sdk/adapters/url_validator.py (1 hunks)
  • src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py (3 hunks)
  • src/unstract/sdk/adapters/vectordb/milvus/src/static/json_schema.json (1 hunks)
  • src/unstract/sdk/adapters/vectordb/postgres/src/postgres.py (3 hunks)
  • src/unstract/sdk/adapters/vectordb/qdrant/src/qdrant.py (3 hunks)
  • src/unstract/sdk/adapters/vectordb/weaviate/src/weaviate.py (2 hunks)
  • src/unstract/sdk/adapters/x2text/helper.py (2 hunks)
  • src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py (2 hunks)
  • src/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py (2 hunks)
  • src/unstract/sdk/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py (2 hunks)
  • src/unstract/sdk/adapters/x2text/unstructured_community/src/unstructured_community.py (2 hunks)
  • src/unstract/sdk/adapters/x2text/unstructured_enterprise/src/unstructured_enterprise.py (1 hunks)
  • tests/test_url_validator.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
src/unstract/sdk/adapters/x2text/helper.py

110-110: Avoid specifying long messages outside the exception class

(TRY003)

src/unstract/sdk/adapters/base.py

75-75: Avoid specifying long messages outside the exception class

(TRY003)

src/unstract/sdk/adapters/url_validator.py

29-42: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


66-66: Use explicit conversion flag

Replace with conversion flag

(RUF010)


86-86: Consider moving this statement to an else block

(TRY300)


88-88: Do not catch blind exception: Exception

(BLE001)


89-89: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


89-89: Use explicit conversion flag

Replace with conversion flag

(RUF010)


90-90: Use explicit conversion flag

Replace with conversion flag

(RUF010)


136-136: Do not catch blind exception: Exception

(BLE001)


137-137: Use explicit conversion flag

Replace with conversion flag

(RUF010)


171-171: Use explicit conversion flag

Replace with conversion flag

(RUF010)

🔇 Additional comments (24)
src/unstract/sdk/adapters/url_validator.py (1)

111-117: Allowlist semantics: portless entries vs explicit default ports.

With the effective-port change, allowlisting 10.0.0.1:443 will now match https://10.0.0.1. Confirm this aligns with your intended policy.

src/unstract/sdk/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py (1)

27-29: Good: pre-flight URL validation added.

Validating before any network calls reduces SSRF risk and fail-fast behavior is correct.

src/unstract/sdk/adapters/llm/open_ai/src/open_ai.py (1)

32-34: Good: early URL validation.

This aligns with the repo-wide safety-net.

src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py (1)

22-24: Good: pre-flight URL validation.

Timing is correct — before any network interaction.

src/unstract/sdk/adapters/x2text/unstructured_community/src/unstructured_community.py (1)

18-20: Good: validate URLs before any IO.

Consistent with the new security posture.

src/unstract/sdk/adapters/llm/ollama/src/ollama.py (2)

31-33: Good: early URL validation.

Prevents accidental localhost/private hits unless whitelisted.


87-89: Good: call base test_connection first.

Ensures shared validation runs before live calls.

src/unstract/sdk/adapters/vectordb/weaviate/src/weaviate.py (1)

29-31: Reorder super().__init__() before calling _validate_urls() in WeaviateAdapter

  • _validate_urls() may reference base-initialized fields (e.g. self.name) that aren’t set until super().__init__() runs. Move the validation call after super().__init__() or explicitly assign any needed attributes beforehand.
src/unstract/sdk/adapters/x2text/helper.py (1)

8-8: Import looks good.

Used below in URL pre-validation.

src/unstract/sdk/adapters/vectordb/qdrant/src/qdrant.py (2)

83-85: Good: base validation in test_connection.

Ensures policy enforcement before hitting the service.


28-30: Ignore init-order concern for _validate_urls
validate_urls only calls get_configured_urls (which uses the already-set self._config) and falls back to the class name when self.name is unset, so it doesn’t depend on Adapter.init; no reordering is required.

Likely an incorrect or invalid review comment.

src/unstract/sdk/adapters/llm/any_scale/src/anyscale.py (1)

26-28: Pre-flight URL validation: good placement.

Validating in init before any network work matches the SSRF goals.

src/unstract/sdk/adapters/embedding/azure_open_ai/src/azure_open_ai.py (1)

29-31: Pre-flight URL validation: LGTM.

Done before constructing httpx client; aligns with the protection goal.

src/unstract/sdk/adapters/embedding/open_ai/src/open_ai.py (1)

28-30: Pre-flight URL validation: LGTM.

Happens before creating the http client.

src/unstract/sdk/adapters/x2text/unstructured_enterprise/src/unstructured_enterprise.py (1)

36-40: Expose URL via get_configured_urls: LGTM.

This enables centralized validation.

src/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py (1)

58-62: Expose URL via get_configured_urls: LGTM.

Matches the new validation flow.

src/unstract/sdk/adapters/embedding/ollama/src/ollama.py (1)

22-24: Pre-flight URL validation: LGTM.

Early, before any client creation.

src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py (2)

25-27: Pre-flight URL validation: LGTM.

Done before creating the client/store instance.


49-53: Expose URI for validation: LGTM.

Simple and correct.

src/unstract/sdk/adapters/vectordb/postgres/src/postgres.py (3)

32-34: Pre-flight URL validation: LGTM.

Placed before building connections.


102-104: Validation on test_connection path: LGTM.

Keeps runtime checks consistent.


118-131: No action needed: URLValidator doesn’t enforce a schemes allowlist, so postgresql:// URLs pass through unconditionally.

src/unstract/sdk/adapters/llm/azure_open_ai/src/azure_open_ai.py (1)

32-34: Good call validating early.

Validating in init before any network activity is the right place.

src/unstract/sdk/adapters/base.py (1)

37-41: Schema path lookup LGTM.

Using getattr with an explicit error for missing SCHEMA_PATH is clear and safe.

Comment on lines 80 to 82
# Validate URLs first
super().test_connection()

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Verify validator behavior for local/default URIs and scheme support.

Default schema was changed to http://localhost:19530; this will be blocked unless whitelisted. Confirm tests/documentation cover setting ALLOWED_ADAPTER_PRIVATE_ENDPOINTS for local/dev and that only intended schemes are allowed for Milvus.

Run this to inspect the allowed/prohibited schemes/ranges in the validator:


🏁 Script executed:

#!/bin/bash
rg -nC2 -e 'ALLOWED|SCHEME|SCHEMES|PROTOCOL|BLOCKED|PRIVATE|localhost' src/unstract/sdk/adapters/url_validator.py

Length of output: 882


🏁 Script executed:

#!/bin/bash
rg -nC3 -e 'ALLOWED_SCHEMES' src/unstract/sdk/adapters/url_validator.py || true
rg -nC3 -e 'scheme' src/unstract/sdk/adapters/url_validator.py
rg -nC3 -e 'protocol' src/unstract/sdk/adapters/url_validator.py
rg -nC3 -e 'http' src/unstract/sdk/adapters/url_validator.py

Length of output: 0


🏁 Script executed:

#!/bin/bash
sed -n '1,200p' src/unstract/sdk/adapters/url_validator.py

Length of output: 5661


🏁 Script executed:

#!/bin/bash
sed -n '1,200p' src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py

Length of output: 3122


🏁 Script executed:

#!/bin/bash
sed -n '1,200p' src/unstract/sdk/adapters/vectordb/vectordb_adapter.py

Length of output: 2688


Document/test private whitelist and enforce URL schemes

  • URLValidator blocks 127.0.0.0/8 (localhost) by default, so http://localhost:19530 will fail without whitelisting .
  • Add documentation and tests showing how to set ALLOWED_ADAPTER_PRIVATE_ENDPOINTS (e.g. 127.0.0.1:19530) for local/dev.
  • Extend URLValidator to restrict to only supported schemes (e.g. http, https).
🤖 Prompt for AI Agents
In src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py around lines 80 to
82, the call to super().test_connection() uses the global URLValidator which
blocks private IPs by default and does not enforce allowed schemes; update the
URL validation logic to (1) allow explicitly whitelisted private endpoints by
reading ALLOWED_ADAPTER_PRIVATE_ENDPOINTS (e.g. "127.0.0.1:19530") and passing
them to the URLValidator/validator instance so local/dev URLs like
http://localhost:19530 succeed, (2) restrict accepted URL schemes to only
supported ones (http and https) by configuring the URLValidator to validate
scheme membership, and (3) add/update documentation and unit tests that
demonstrate how to set ALLOWED_ADAPTER_PRIVATE_ENDPOINTS and assert that
localhost/private IPs are accepted when whitelisted and rejected otherwise.

@gaya3-zipstack gaya3-zipstack changed the title Support url safety checking for adapters USEC-3 [FIX] Support url safety checking for adapters Sep 10, 2025
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Gayathri <[email protected]>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py (1)

102-103: Guard against empty results to avoid IndexError.

LlamaParse may return an empty list; indexing [0] will crash.

-        response_text = documents[0].text
+        if not documents:
+            raise AdapterError("LlamaParse returned no documents for the input.")
+        response_text = documents[0].text
♻️ Duplicate comments (1)
src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py (1)

43-49: Fix SyntaxErrors in get_configured_urls (indent + stray diff marker).

The method is indented with 5 spaces inside the class (should be 4), and there’s a literal “- url = …” line left from a diff. Both trigger SyntaxError (confirmed by tooling).

Apply:

-     def get_configured_urls(self) -> list[str]:
-         """Return all URLs this adapter will connect to."""
--        url = self.config.get("url")
-        base_url = self.config.get(LlamaParseConfig.BASE_URL)
-        if isinstance(base_url, str):
-            base_url = base_url.strip()
-        return [base_url] if base_url else []
+    def get_configured_urls(self) -> list[str]:
+        """Return all URLs this adapter will connect to."""
+        base_url = self.config.get(LlamaParseConfig.BASE_URL)
+        if isinstance(base_url, str):
+            base_url = base_url.strip()
+        return [base_url] if base_url else []

Also: switching to LlamaParseConfig.BASE_URL correctly aligns validation with the actual client URL.

🧹 Nitpick comments (3)
src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py (3)

91-100: Tighten error logging and message quality.

  • Missing f-string leaves “{exe}” unformatted.
  • Prefer structured logging and include exc_info.
  • Minor grammar: “llama-parse`s”.
-        except ConnectError as connec_err:
-            logger.error(f"Invalid Base URL given. : {connec_err}")
-            raise AdapterError(
-                "Unable to connect to llama-parse`s service, " "please check the Base URL"
-            )
+        except ConnectError as conn_err:
+            logger.error("Unable to connect to LlamaParse service: %s", conn_err, exc_info=True)
+            raise AdapterError("Unable to connect to LlamaParse service. Please check the Base URL.")
-        except Exception as exe:
-            logger.error(
-                "Seems like an invalid API Key or possible internal errors: {exe}"
-            )
-            raise AdapterError(str(exe))
+        except Exception as exc:
+            logger.error("Invalid API key or internal error: %s", exc, exc_info=True)
+            raise AdapterError(str(exc))

73-81: Avoid mixing encoding with binary modes; keep I/O modes consistent.

Passing encoding with "rb"/"wb" is inconsistent and may be ignored or error depending on the FileStorage backend.

-                    text_content = fs.read(
-                        path=input_file_path_copy, mode="rb", encoding="utf-8"
-                    )
+                    text_content = fs.read(
+                        path=input_file_path_copy, mode="rb"
+                    )
...
-                    fs.write(
+                    fs.write(
                         path=input_file_path,
-                        data=text_content,
-                        mode="wb",
-                        encoding="utf-8",
+                        data=text_content,
+                        mode="wb",
                     )

53-54: Prefer None defaults over instantiated FileStorage in function signatures.

Create FileStorage inside the function when fs is None to avoid constructing objects at import time and to ease testing.

Example:

-def _call_parser(
-    self,
-    input_file_path: str,
-    fs: FileStorage = FileStorage(provider=FileStorageProvider.LOCAL),
-) -> str:
+def _call_parser(
+    self,
+    input_file_path: str,
+    fs: FileStorage | None = None,
+) -> str:
+    fs = fs or FileStorage(provider=FileStorageProvider.LOCAL)

...
-def process(
+def process(
     self,
     input_file_path: str,
-    output_file_path: str | None = None,
-    fs: FileStorage = FileStorage(provider=FileStorageProvider.LOCAL),
+    output_file_path: str | None = None,
+    fs: FileStorage | None = None,
     **kwargs: dict[Any, Any],
 ) -> TextExtractionResult:
+    fs = fs or FileStorage(provider=FileStorageProvider.LOCAL)

Also applies to: 109-110

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to Reviews > Disable Cache setting

Knowledge Base: Disabled due to Reviews > Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between c4c64b7 and 9eb31da.

📒 Files selected for processing (1)
  • src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py

43-43: SyntaxError: unindent does not match any outer indentation level


45-45: SyntaxError: Invalid assignment target


46-46: SyntaxError: Unexpected indentation

🔇 Additional comments (1)
src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py (1)

22-24: Validating URLs early is the right call.

Good placement; this ensures SSRF checks run before any network I/O.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/unstract/sdk/adapters/vectordb/postgres/src/postgres.py (1)

111-116: SQL injection risk on identifier interpolation; use psycopg2.sql.Identifier and close cursor

Direct f-string interpolation of schema/table can drop arbitrary objects if those values are user-controlled. Use safe identifier quoting and a context manager.

Apply:

-        if self._client is not None:
-            self._client.cursor().execute(
-                f"DROP TABLE IF EXISTS "
-                f"{self._schema_name}.data_{self._collection_name} CASCADE"
-            )
-            self._client.commit()
+        if self._client is not None:
+            from psycopg2 import sql
+            with self._client.cursor() as cur:
+                cur.execute(
+                    sql.SQL("DROP TABLE IF EXISTS {}.{} CASCADE").format(
+                        sql.Identifier(self._schema_name),
+                        sql.Identifier(f"data_{self._collection_name}"),
+                    )
+                )
+            self._client.commit()
♻️ Duplicate comments (2)
src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py (2)

25-27: Good: Early URL validation to prevent pre-connect SSRF. Verify no base-init dependency.

Calling self._validate_urls() before any client creation is the right sequence. Please confirm that VectorDBAdapter._validate_urls does not rely on state initialized in VectorDBAdapter.__init__; if it does, move this call after super().__init__ or decouple the dependency.


49-53: Enforce http/https schemes at the adapter level (block non-web protocols).

Per the PR objective, restrict Milvus URIs to http/https. Add a light-weight scheme check here so invalid schemes are rejected even before the shared validator runs.

Apply:

--- a/src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py
+++ b/src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py
@@
-from typing import Any
+from typing import Any
+from urllib.parse import urlparse
@@
-    def get_configured_urls(self) -> list[str]:
-        """Return all URLs this adapter will connect to."""
-        uri = self._config.get(Constants.URI)
-        return [uri] if uri else []
+    def get_configured_urls(self) -> list[str]:
+        """Return all URLs this adapter will connect to. Enforces http/https."""
+        uri = str(self._config.get(Constants.URI, "")).strip()
+        if not uri:
+            return []
+        scheme = urlparse(uri).scheme.lower()
+        if scheme not in {"http", "https"}:
+            raise AdapterError(f"Milvus URI scheme '{scheme}' is not allowed; use http or https.")
+        return [uri]
🧹 Nitpick comments (7)
src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py (1)

79-88: Always clean up the test collection (use try/finally).

If test_vector_db_instance raises, the collection isn’t dropped. Wrap in try/finally and swallow best-effort cleanup errors.

-    def test_connection(self) -> bool:
-
-        vector_db = self.get_vector_db_instance()
-        test_result: bool = VectorDBHelper.test_vector_db_instance(
-            vector_store=vector_db
-        )
-        # Delete the collection that was created for testing
-        if self._client is not None:
-            self._client.drop_collection(self._collection_name)
-        return test_result
+    def test_connection(self) -> bool:
+        vector_db = self.get_vector_db_instance()
+        test_result: bool = False
+        try:
+            test_result = VectorDBHelper.test_vector_db_instance(
+                vector_store=vector_db
+            )
+            return test_result
+        finally:
+            if self._client is not None:
+                try:
+                    self._client.drop_collection(self._collection_name)
+                except Exception:
+                    pass
src/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py (3)

71-73: Avoid sending None-valued auth header

If UNSTRACT_KEY is absent, requests may send a header with a "None" value. Only include it when present.

-        return {
-            "accept": MimeType.JSON,
-            WhispererHeader.UNSTRACT_KEY: self.config.get(
-                WhispererConfig.UNSTRACT_KEY
-            ),
-        }
+        headers = {"accept": MimeType.JSON}
+        unstract_key = self.config.get(WhispererConfig.UNSTRACT_KEY)
+        if unstract_key:
+            headers[WhispererHeader.UNSTRACT_KEY] = unstract_key
+        return headers

407-409: Metadata pretty-printing: optional

Pretty JSON is nice for debugging but increases file size. Consider indent=None (compact) in production paths or making this configurable.


419-421: Log exceptions with traceback

Use logging.exception inside except blocks to retain stack traces (aligns with Ruff TRY400).

-                logger.error(
-                    f"Error while writing metadata to {metadata_file_path}: {e}"
-                )
+                logger.exception(
+                    f"Error while writing metadata to {metadata_file_path}"
+                )
src/unstract/sdk/adapters/vectordb/qdrant/src/qdrant.py (3)

52-56: Normalize and type-guard URLs returned for validation.

Trim whitespace and guard non-string config to avoid false positives/negatives during validation.

 def get_configured_urls(self) -> list[str]:
     """Return all URLs this adapter will connect to."""
-    url = self._config.get(Constants.URL)
-    return [url] if url else []
+    url = self._config.get(Constants.URL)
+    if isinstance(url, str):
+        u = url.strip()
+        return [u] if u else []
+    return []

83-83: Stray blank line.

Drop the extra newline to keep method bodies tight; avoids churn in diffs if the repo enforces formatters.

-    def test_connection(self) -> bool:
-
+    def test_connection(self) -> bool:

119-121: Propagate HTTP status code when available.

Carry through e.status_code if present to improve error handling upstream.

-            return VectorDBError(
-                message=str(e), actual_err=e, status_code=status_code
-            )
+            return VectorDBError(
+                message=str(e),
+                actual_err=e,
+                status_code=(
+                    status_code
+                    if status_code is not None
+                    else getattr(e, "status_code", None)
+                ),
+            )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to Reviews > Disable Cache setting

Knowledge Base: Disabled due to Reviews > Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 9eb31da and 6c53c7c.

📒 Files selected for processing (5)
  • src/unstract/sdk/adapters/llm/ollama/src/ollama.py (4 hunks)
  • src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py (3 hunks)
  • src/unstract/sdk/adapters/vectordb/postgres/src/postgres.py (4 hunks)
  • src/unstract/sdk/adapters/vectordb/qdrant/src/qdrant.py (4 hunks)
  • src/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py (10 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/unstract/sdk/adapters/llm/ollama/src/ollama.py
🧰 Additional context used
🪛 Ruff (0.12.2)
src/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py

119-121: Avoid specifying long messages outside the exception class

(TRY003)


419-421: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🔇 Additional comments (8)
src/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py (6)

141-144: Param helper signature: LGTM

Clear typing and defaults; no issues spotted.


209-212: Conditional highlight param: LGTM

Only adds the flag when enabled; avoids silently changing behavior.


261-263: Status parsing default: LGTM

Safe default to UNKNOWN; loop correctly handles non-terminal states.


275-277: Max-poll error message: LGTM

Message is clear and includes attempt count.


360-363: Pass-through to writer: LGTM

Arguments are explicit and typed; good separation of concerns.


403-406: Exclude text from metadata: LGTM

Prevents duplicating large payloads in metadata.

src/unstract/sdk/adapters/vectordb/postgres/src/postgres.py (2)

106-108: LGTM — purely cosmetic reflow

No functional change here.


31-34: OK to validate early — _validate_urls() is safe before super().init

Postgres.init sets self._config (used by get_configured_urls()) before calling self._validate_urls(); Adapter._validate_urls() only calls get_configured_urls(), uses getattr(self, "name", self.class.name) as a fallback, and calls the module-level URLValidator, so it does not depend on VectorDBAdapter/Adapter initializer state.

Comment on lines 61 to 63
encoded_password = quote_plus(
str(self._config.get(Constants.PASSWORD))
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

❓ Verification inconclusive

Avoid "None" strings for password/port; add timeout; preserve exception context

  • quote_plus(str(self._config.get(Constants.PASSWORD))) turns a missing password into the literal "None".
  • str(self._config.get(Constants.PORT)) yields "None" when unset.
  • Add a connection timeout to prevent hangs.
  • Use exception chaining for debuggability.

Apply:

@@
-            encoded_password = quote_plus(
-                str(self._config.get(Constants.PASSWORD))
-            )
+            raw_password = self._config.get(Constants.PASSWORD)
+            encoded_password = (
+                quote_plus(raw_password) if raw_password is not None else None
+            )
+            port = self._config.get(Constants.PORT)
@@
-            vector_db: BasePydanticVectorStore = PGVectorStore.from_params(
+            vector_db: BasePydanticVectorStore = PGVectorStore.from_params(
                 database=self._config.get(Constants.DATABASE),
                 schema_name=self._schema_name,
                 host=self._config.get(Constants.HOST),
                 password=encoded_password,
-                port=str(self._config.get(Constants.PORT)),
+                port=port,
                 user=self._config.get(Constants.USER),
                 table_name=self._collection_name,
                 embed_dim=dimension,
             )
@@
-            self._client = psycopg2.connect(
+            self._client = psycopg2.connect(
                 database=self._config.get(Constants.DATABASE),
                 host=self._config.get(Constants.HOST),
                 user=self._config.get(Constants.USER),
-                password=self._config.get(Constants.PASSWORD),
-                port=str(self._config.get(Constants.PORT)),
+                password=raw_password,
+                port=port,
                 sslmode=ssl_mode,
+                connect_timeout=10,
             )
@@
-        except Exception as e:
-            raise AdapterError(str(e))
+        except Exception as e:
+            raise AdapterError(str(e)) from e

If PGVectorStore.from_params supports SSL or connect args, consider passing sslmode (and connect_timeout) there too; please confirm in its docs.

Also applies to: 77-85, 90-97


Avoid "None" strings for password/port; add timeout; preserve exception context

  • Stop doing quote_plus(str(...)) — use raw_password = self._config.get(Constants.PASSWORD) and encoded_password = quote_plus(raw_password) if raw_password is not None else None; read port = self._config.get(Constants.PORT).
  • Pass raw_password to psycopg2.connect(..., password=raw_password, port=port, sslmode=ssl_mode, connect_timeout=10).
  • Preserve exception context: raise AdapterError(str(e)) from e.
  • PGVectorStore.from_params accepts create_engine_kwargs — prefer passing SSL/connect options there (e.g. create_engine_kwargs={'connect_args': {'sslmode': ssl_mode, 'connect_timeout': 10}}). (docs.llamaindex.ai)

File: src/unstract/sdk/adapters/vectordb/postgres/src/postgres.py (around lines 61-97).

🤖 Prompt for AI Agents
In src/unstract/sdk/adapters/vectordb/postgres/src/postgres.py around lines 61
to 97, stop converting the password to a "None" string via quote_plus(str(...)):
fetch raw_password = self._config.get(Constants.PASSWORD) and set
encoded_password = quote_plus(raw_password) only if raw_password is not None;
likewise read port = self._config.get(Constants.PORT) instead of stringifying
it. Use raw_password (not the quoted string) when calling psycopg2.connect and
pass port, sslmode and a connect_timeout of 10 (e.g. psycopg2.connect(...,
password=raw_password, port=port, sslmode=ssl_mode, connect_timeout=10)).
Preserve exception context when rethrowing by using raise AdapterError(str(e))
from e. When constructing PGVectorStore.from_params, prefer supplying
create_engine_kwargs to pass SSL and connection options (e.g.
create_engine_kwargs={'connect_args': {'sslmode': ssl_mode, 'connect_timeout':
10}}) rather than embedding them into a connection string.

Comment on lines 27 to 30

# Validate URLs BEFORE any network operations
self._validate_urls()

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Good: validate endpoints before client init; confirm base-call safety.

Calling _validate_urls() before constructing QdrantClient is the right SSRF posture. Please confirm _validate_urls doesn’t rely on any state set in VectorDBAdapter.__init__ to avoid surprises from invoking it pre-super().__init__.

Run to inspect the base implementation usage:


🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Locate and review the base adapter for _validate_urls dependencies
rg -nP --type=py '(class\s+VectorDBAdapter\b|def\s+_validate_urls\b|def\s+get_configured_urls\b)' -C3
# Inspect URL validator coverage (RFC1918, loopback, link-local, IPv6 ULA/LLA, redirects)
rg -nP --type=py '(class\s+URLValidator\b|RFC1918|127\.0\.0\.0/8|169\.254\.0\.0/16|::1|fc00::/7|fe80::/10|ALLOWED_ADAPTER_PRIVATE_ENDPOINTS|redirect|dns|resolve)' -C2

Length of output: 16772


Fix: don't call _validate_urls() before base init

base._validate_urls() calls self.get_configured_urls(); qdrant.get_configured_urls() uses self._config — calling _validate_urls() at src/unstract/sdk/adapters/vectordb/qdrant/src/qdrant.py (≈lines 27–30) before the adapter/base initializer runs can raise AttributeError. Move the _validate_urls() call after super().init/after self._config is set, or make get_configured_urls() defensive (e.g. getattr(self, "_config", {}) or return [] if unset). Locations: src/unstract/sdk/adapters/vectordb/qdrant/src/qdrant.py (call + get_configured_urls), src/unstract/sdk/adapters/base.py (_validate_urls).

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/unstract/sdk/adapters/base.py (1)

48-49: Good call making get_configured_urls abstract.

This aligns with earlier feedback and enforces adapter implementations to declare their outbound endpoints.

🧹 Nitpick comments (5)
src/unstract/sdk/adapters/base.py (5)

59-59: Avoid a concrete default in an abstract method.

Returning [] from an abstract method can hide mistakes when a subclass accidentally calls super(); prefer an explicit “not implemented”.

-        return []
+        raise NotImplementedError

43-47: Type-safety: abstract get_adapter_type() should not return a string.

The return stub "" conflicts with the declared AdapterTypes. Use an explicit “not implemented” to keep type-checkers happy.

     @staticmethod
     @abstractmethod
     def get_adapter_type() -> AdapterTypes:
-        return ""
+        raise NotImplementedError

37-41: Specify file encoding when reading schema.

Prevents platform-dependent decoding issues.

-        with open(schema_path) as f:
+        with open(schema_path, encoding="utf-8") as f:
             return f.read()

80-84: Tiny grammar nit.

“a adapter” → “an adapter”.

-        """Override to test connection for a adapter.
+        """Override to test connection for an adapter.

61-77: Tighten validation loop: sanitize, de-duplicate, and raise with context (avoid logging before raising)

Strip whitespace, drop empties, preserve order when de-duplicating, precompute adapter_name, and raise AdapterError that includes the adapter name and offending URL; use debug logging only if needed.

Verified _validate_urls() is invoked in init of concrete adapters (examples: src/unstract/sdk/adapters/x2text/unstructured_community/src/unstructured_community.py:19, src/unstract/sdk/adapters/embedding/open_ai/src/open_ai.py:29, src/unstract/sdk/adapters/vectordb/weaviate/src/weaviate.py:30).

 def _validate_urls(self) -> None:
-        """Validate all configured URLs against security rules."""
-        urls = self.get_configured_urls()
-
-        for url in urls:
-            if not url:  # Skip empty/None URLs
-                continue
-
-            is_valid, error_message = URLValidator.validate_url(url)
-            if not is_valid:
-                # Use class name as fallback when self.name isn't set yet
-                adapter_name = getattr(self, "name", self.__class__.__name__)
-                logger.error(
-                    f"URL validation failed for adapter '{adapter_name}': {error_message}"
-                )
-                raise AdapterError(f"URL validation failed: {error_message}")
+        """Validate all configured URLs against security rules."""
+        raw_urls = self.get_configured_urls()
+        # Normalize and drop empties
+        urls = [u.strip() for u in raw_urls if u is not None]
+        # De-duplicate while preserving order
+        urls = list(dict.fromkeys(u for u in urls if u))
+        adapter_name = getattr(self, "name", self.__class__.__name__)
+
+        for url in urls:
+            is_valid, error_message = URLValidator.validate_url(url)
+            if not is_valid:
+                # Consider adding a debug log if desired:
+                # logger.debug("Blocked URL for %s: %s (%s)", adapter_name, url, error_message)
+                raise AdapterError(f"{adapter_name}: blocked URL {url}: {error_message}")  # noqa: TRY003

Optional: if you prefer better UX, collect all invalid URLs and raise one aggregated error.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to Reviews > Disable Cache setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 81c5f9b and 6c3a630.

📒 Files selected for processing (1)
  • src/unstract/sdk/adapters/base.py (3 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
src/unstract/sdk/adapters/base.py

76-76: Avoid specifying long messages outside the exception class

(TRY003)

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/unstract/sdk/adapters/vectordb/weaviate/src/weaviate.py (2)

71-74: Guard against None → "None" leakage; validate required config before connecting.

str(self._config.get(...)) will pass "None" if the key is missing, leading to confusing failures and potentially bypassing your validator’s intent.

-            self._client = weaviate.connect_to_weaviate_cloud(
-                cluster_url=str(self._config.get(Constants.URL)),
-                auth_credentials=Auth.api_key(str(self._config.get(Constants.API_KEY))),
-            )
+            url = self._config.get(Constants.URL)
+            api_key = self._config.get(Constants.API_KEY)
+            if not isinstance(url, str) or not url.strip():
+                raise AdapterError("Missing or empty 'url' in settings.")
+            if not isinstance(api_key, str) or not api_key.strip():
+                raise AdapterError("Missing or empty 'api_key' in settings.")
+            self._client = weaviate.connect_to_weaviate_cloud(
+                cluster_url=url.strip(),
+                auth_credentials=Auth.api_key(api_key.strip()),
+            )

86-90: Use an explicit status-code check (HTTP 422) instead of parsing the exception message.

Weaviate returns HTTP 422 for "class already exists" — replace the brittle if "already exists" in e.message with a status-code check, e.g. if getattr(e, "status_code", None) == 422: logger.info("Collection already exists; continuing.") else: raise.
File: src/unstract/sdk/adapters/vectordb/weaviate/src/weaviate.py (lines 86-90)

src/unstract/sdk/adapters/embedding/azure_open_ai/src/azure_open_ai.py (1)

69-74: Avoid passing "None" as a string for azure_endpoint. Validate/clean before wiring.

-                azure_endpoint=str(self.config.get(Constants.AZURE_ENDPOINT)),
+                azure_endpoint=str(self.config.get(Constants.AZURE_ENDPOINT) or "").strip(),

Optionally, fail fast if missing to provide a clearer error:

+            if not self.config.get(Constants.AZURE_ENDPOINT) or not str(self.config.get(Constants.AZURE_ENDPOINT)).strip():
+                raise AdapterError("Missing or empty 'azure_endpoint' in settings.")
♻️ Duplicate comments (5)
src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py (1)

20-27: Heads-up: localhost default requires allowlist

With schema default http://localhost:19530, validation will fail unless the private endpoint is allowlisted. Ensure docs/tests show setting the env var for local/dev.

Would you like a quick doc snippet added to Milvus README showing ALLOWED_ADAPTER_PRIVATE_ENDPOINTS=127.0.0.1:19530 usage?

src/unstract/sdk/adapters/url_validator.py (2)

60-87: Resolve A/AAAA and derive effective port; block if any resolved IP is private

IPv4-only gethostbyname and missing default ports cause false negatives/positives. Use getaddrinfo, compute effective port.

-            # Resolve hostname to IP address
-            try:
-                host_ip = socket.gethostbyname(parsed.hostname)
-            except socket.gaierror as e:
-                return (
-                    False,
-                    f"DNS resolution failed for '{parsed.hostname}': {str(e)}",
-                )
-
-            # Check if IP is private
-            ip_obj = ipaddress.ip_address(host_ip)
-            if cls._is_private_ip(ip_obj):
-                # Private IP - check whitelist
-                port = parsed.port
-                if cls._is_whitelisted(ip_obj, port):
-                    logger.info(f"Private IP {host_ip}:{port} allowed by whitelist")
-                    return True, ""
-                else:
-                    error_msg = (
-                        f"URL blocked: Private IP {host_ip}"
-                        f"{':' + str(port) if port else ''} not in whitelist. "
-                        f"Contact platform admin for assistance."
-                    )
-                    return False, error_msg
-
-            # Public IP - allowed by default
-            return True, ""
+            # Resolve hostname to all IPs (A/AAAA)
+            try:
+                addrinfos = socket.getaddrinfo(parsed.hostname, None, proto=socket.IPPROTO_TCP)
+            except socket.gaierror as e:
+                return False, f"DNS resolution failed for '{parsed.hostname}': {e!s}"
+
+            resolved_ips = {ai[4][0] for ai in addrinfos}
+            # Effective port (defaults for HTTP/S if not provided)
+            effective_port = parsed.port
+            if effective_port is None:
+                if scheme == "https":
+                    effective_port = 443
+                elif scheme == "http":
+                    effective_port = 80
+
+            for host_ip in resolved_ips:
+                ip_obj = ipaddress.ip_address(host_ip)
+                if cls._is_private_ip(ip_obj):
+                    if cls._is_whitelisted(ip_obj, effective_port):
+                        logger.info("Private IP %s:%s allowed by whitelist", host_ip, effective_port)
+                        continue
+                    return False, (
+                        f"URL blocked: Private IP {host_ip}"
+                        f"{':' + str(effective_port) if effective_port else ''} not in whitelist. "
+                        "Contact platform admin for assistance."
+                    )
+
+            # All resolved IPs are public or allowlisted
+            return True, ""

26-27: Make ENV var backward-compatible and add ClassVar annotations

Code and docs still reference WHITELISTED_ENDPOINTS — add ALLOWED_ADAPTER_PRIVATE_ENDPOINTS while keeping the legacy name and annotate both with ClassVar (import typing.ClassVar).

File: src/unstract/sdk/adapters/url_validator.py (ENV_VAR definition)

-from dataclasses import dataclass
+from dataclasses import dataclass
+from typing import ClassVar
@@
-    ENV_VAR = "WHITELISTED_ENDPOINTS"
+    # New name per PR objective; keep legacy for compatibility
+    ENV_VAR: ClassVar[str] = "ALLOWED_ADAPTER_PRIVATE_ENDPOINTS"
+    LEGACY_ENV_VAR: ClassVar[str] = "WHITELISTED_ENDPOINTS"
src/unstract/sdk/adapters/vectordb/postgres/src/postgres.py (2)

61-95: Fix “None” coercions; pass proper types; add timeout; chain exceptions

Current code turns missing password/port into literal strings and lacks a connect timeout.

-            encoded_password = quote_plus(str(self._config.get(Constants.PASSWORD)))
+            raw_password = self._config.get(Constants.PASSWORD)
+            encoded_password = quote_plus(raw_password) if raw_password is not None else None
@@
-            vector_db: BasePydanticVectorStore = PGVectorStore.from_params(
+            vector_db: BasePydanticVectorStore = PGVectorStore.from_params(
                 database=self._config.get(Constants.DATABASE),
                 schema_name=self._schema_name,
                 host=self._config.get(Constants.HOST),
                 password=encoded_password,
-                port=str(self._config.get(Constants.PORT)),
+                port=self._config.get(Constants.PORT),
                 user=self._config.get(Constants.USER),
                 table_name=self._collection_name,
                 embed_dim=dimension,
             )
@@
-            self._client = psycopg2.connect(
+            self._client = psycopg2.connect(
                 database=self._config.get(Constants.DATABASE),
                 host=self._config.get(Constants.HOST),
                 user=self._config.get(Constants.USER),
-                password=self._config.get(Constants.PASSWORD),
-                port=str(self._config.get(Constants.PORT)),
+                password=raw_password,
+                port=self._config.get(Constants.PORT),
                 sslmode=ssl_mode,
+                connect_timeout=10,
             )
@@
-        except Exception as e:
-            raise AdapterError(str(e))
+        except Exception as e:
+            raise AdapterError(str(e)) from e

If supported, consider supplying SSL/connect args to PGVectorStore.from_params via create_engine_kwargs.

Also applies to: 98-99


115-127: Build parseable URL; bracket IPv6

Return a single normalized URL for validation.

-        host = self._config.get(Constants.HOST)
-        port = self._config.get(Constants.PORT)
-
-        if host:
-            # Construct the database URL for validation
-            if port:
-                url = f"postgresql://{host}:{port}"
-            else:
-                url = f"postgresql://{host}"
-            return [url]
-        return []
+        host = self._config.get(Constants.HOST)
+        port = self._config.get(Constants.PORT)
+        if not host:
+            return []
+        host_s = str(host).strip()
+        # Bracket IPv6 literals
+        try:
+            import ipaddress  # stdlib
+            ip = ipaddress.ip_address(host_s)
+            if ip.version == 6 and not (host_s.startswith("[") and host_s.endswith("]")):
+                host_s = f"[{host_s}]"
+        except Exception:
+            pass
+        url = f"postgresql://{host_s}" + (f":{port}" if port else "")
+        return [url]

Add at file top (outside this hunk) if not already present:

import ipaddress  # for IPv6 URL formatting
🧹 Nitpick comments (11)
src/unstract/sdk/adapters/x2text/unstructured_community/src/unstructured_community.py (1)

40-44: Harden get_configured_urls() against non-str/whitespace values.

Minor resilience: normalize and type-check the config value before returning.

Apply:

-        url = self.config.get("url")
-        return [url] if url else []
+        url = self.config.get("url")
+        if isinstance(url, str):
+            url = url.strip()
+            return [url] if url else []
+        return []
src/unstract/sdk/adapters/embedding/ollama/src/ollama.py (1)

48-52: Use Constants and normalize value

Prefer the defined constant and trim/guard to avoid empty strings.

-        base_url = self.config.get("base_url")
-        return [base_url] if base_url else []
+        base_url = self.config.get(Constants.BASE_URL)
+        if isinstance(base_url, str) and base_url.strip():
+            return [base_url.strip()]
+        return []
src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py (1)

49-53: Normalize URI for validation

Guard against blanks and missing scheme; align with schema default.

-        uri = self._config.get(Constants.URI)
-        return [uri] if uri else []
+        uri = self._config.get(Constants.URI)
+        if not isinstance(uri, str) or not uri.strip():
+            return []
+        uri_s = uri.strip()
+        if "://" not in uri_s:
+            uri_s = f"http://{uri_s}"
+        return [uri_s]
src/unstract/sdk/adapters/url_validator.py (5)

28-42: Annotate mutable class attribute

Ruff RUF012: mark as ClassVar[list[str]].

-    BLOCKED_PRIVATE_RANGES = [
+    BLOCKED_PRIVATE_RANGES: ClassVar[list[str]] = [

45-59: Allow optional scheme enforcement (keep DB adapters working)

Add an optional allowed_schemes parameter; web adapters can pass {"http","https"} while DB adapters pass None.

-    def validate_url(cls, url: str) -> tuple[bool, str]:
+    def validate_url(
+        cls, url: str, allowed_schemes: set[str] | None = None
+    ) -> tuple[bool, str]:
@@
-            parsed = urlparse(url)
+            parsed = urlparse(url)
+            scheme = (parsed.scheme or "").lower()
+            if allowed_schemes is not None and scheme not in allowed_schemes:
+                return False, (
+                    f"Invalid or unsupported URL scheme '{parsed.scheme}'. "
+                    f"Only {', '.join(sorted(allowed_schemes))} are allowed."
+                )

88-91: Use logging.exception; avoid bare catch

Preserve traceback and return a concise message.

-        except Exception as e:
-            logger.error(f"URL validation error for '{url}': {str(e)}")
-            return False, f"{str(e)}"
+        except Exception as e:
+            logger.exception("URL validation error for '%s'", url)
+            return False, str(e)

120-139: Legacy env support in parser

Also read legacy env var here so callers benefit without touching call sites.

-        config = os.getenv(cls.ENV_VAR, "").strip()
+        config = (os.getenv(cls.ENV_VAR) or os.getenv(cls.LEGACY_ENV_VAR, "")).strip()

142-172: Support bracketed IPv6 in allowlist entries

Strip brackets so [2001:db8::1]:443 parses.

-        ip_part = entry
+        ip_part = entry
@@
-        if ":" in entry:
+        if ":" in entry:
             parts = entry.rsplit(":", 1)
             if len(parts) == 2 and parts[1].isdigit():
                 ip_part = parts[0]
                 port = int(parts[1])
@@
-            if "/" in ip_part:
+            # Remove IPv6 brackets, if present
+            if ip_part.startswith("[") and ip_part.endswith("]"):
+                ip_part = ip_part[1:-1]
+            if "/" in ip_part:
                 # CIDR notation
src/unstract/sdk/adapters/llm/open_ai/src/open_ai.py (1)

57-61: Use constant and normalize

Use the defined constant and trim.

-        api_base = self.config.get("api_base")
-        return [api_base] if api_base else []
+        api_base = self.config.get(Constants.API_BASE)
+        if isinstance(api_base, str) and api_base.strip():
+            return [api_base.strip()]
+        return []
src/unstract/sdk/adapters/vectordb/weaviate/src/weaviate.py (1)

53-57: Trim and lightly normalize URL to avoid whitespace/empty-string edge cases.

 def get_configured_urls(self) -> list[str]:
   """Return all URLs this adapter will connect to."""
-  url = self._config.get(Constants.URL)
-  return [url] if url else []
+  url = self._config.get(Constants.URL)
+  if isinstance(url, str):
+    url = url.strip()
+  return [url] if url else []
src/unstract/sdk/adapters/embedding/azure_open_ai/src/azure_open_ai.py (1)

55-59: Use the defined constant and trim input to avoid whitespace/typos.

 def get_configured_urls(self) -> list[str]:
   """Return all URLs this adapter will connect to."""
-  endpoint = self.config.get("azure_endpoint")
-  return [endpoint] if endpoint else []
+  endpoint = self.config.get(Constants.AZURE_ENDPOINT)
+  if isinstance(endpoint, str):
+    endpoint = endpoint.strip()
+  return [endpoint] if endpoint else []
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to Reviews > Disable Cache setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 6c3a630 and afb0ab5.

📒 Files selected for processing (15)
  • src/unstract/sdk/adapters/embedding/azure_open_ai/src/azure_open_ai.py (2 hunks)
  • src/unstract/sdk/adapters/embedding/ollama/src/ollama.py (2 hunks)
  • src/unstract/sdk/adapters/embedding/open_ai/src/open_ai.py (2 hunks)
  • src/unstract/sdk/adapters/llm/any_scale/src/anyscale.py (2 hunks)
  • src/unstract/sdk/adapters/llm/azure_open_ai/src/azure_open_ai.py (3 hunks)
  • src/unstract/sdk/adapters/llm/ollama/src/ollama.py (2 hunks)
  • src/unstract/sdk/adapters/llm/open_ai/src/open_ai.py (2 hunks)
  • src/unstract/sdk/adapters/url_validator.py (1 hunks)
  • src/unstract/sdk/adapters/vectordb/milvus/src/milvus.py (2 hunks)
  • src/unstract/sdk/adapters/vectordb/postgres/src/postgres.py (2 hunks)
  • src/unstract/sdk/adapters/vectordb/qdrant/src/qdrant.py (2 hunks)
  • src/unstract/sdk/adapters/vectordb/weaviate/src/weaviate.py (2 hunks)
  • src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py (2 hunks)
  • src/unstract/sdk/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py (2 hunks)
  • src/unstract/sdk/adapters/x2text/unstructured_community/src/unstructured_community.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • src/unstract/sdk/adapters/vectordb/qdrant/src/qdrant.py
  • src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py
  • src/unstract/sdk/adapters/llm/ollama/src/ollama.py
  • src/unstract/sdk/adapters/llm/any_scale/src/anyscale.py
  • src/unstract/sdk/adapters/llm/azure_open_ai/src/azure_open_ai.py
  • src/unstract/sdk/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py
  • src/unstract/sdk/adapters/embedding/open_ai/src/open_ai.py
🧰 Additional context used
🪛 Ruff (0.12.2)
src/unstract/sdk/adapters/url_validator.py

29-42: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


66-66: Use explicit conversion flag

Replace with conversion flag

(RUF010)


86-86: Consider moving this statement to an else block

(TRY300)


88-88: Do not catch blind exception: Exception

(BLE001)


89-89: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


89-89: Use explicit conversion flag

Replace with conversion flag

(RUF010)


90-90: Use explicit conversion flag

Replace with conversion flag

(RUF010)


136-136: Do not catch blind exception: Exception

(BLE001)


137-137: Use explicit conversion flag

Replace with conversion flag

(RUF010)


171-171: Use explicit conversion flag

Replace with conversion flag

(RUF010)

🔇 Additional comments (7)
src/unstract/sdk/adapters/x2text/unstructured_community/src/unstructured_community.py (2)

18-21: Good: validation happens before any network calls.

This meets the “validate before use” requirement. No changes needed here.

Please confirm _validate_urls() re-validates resolved IPs (post-DNS) and handles redirects to private ranges to mitigate DNS-rebinding/redirect-based SSRF.


14-14: Enable URL validation by default (validate_urls=True) or prove all callers opt in

Repo scan found only the class definition at src/unstract/sdk/adapters/x2text/unstructured_community/src/unstructured_community.py:14 and no instantiations. Change the default to True or provide evidence every instantiation passes validate_urls=True.

Apply this change (preferred):

-    def __init__(self, settings: dict[str, Any], validate_urls: bool = False):
+    def __init__(self, settings: dict[str, Any], validate_urls: bool = True):
src/unstract/sdk/adapters/embedding/ollama/src/ollama.py (1)

18-25: Good: validate before any network work

Constructor gating with validate_urls and calling _validate_urls() early looks correct.

src/unstract/sdk/adapters/llm/open_ai/src/open_ai.py (1)

28-34: Good: early validation toggle

Pattern matches other adapters; no network work before _validate_urls().

src/unstract/sdk/adapters/vectordb/postgres/src/postgres.py (1)

26-34: Good: pre-init validation

Pattern is consistent with other adapters. Ensure the validator does not hard-restrict schemes to HTTP(S), or pass allowed schemes if that change lands.

src/unstract/sdk/adapters/embedding/azure_open_ai/src/azure_open_ai.py (1)

25-32: LGTM: URL preflight before any network ops.

Constructor sets config then validates URLs; no network side effects occur pre‑validation. This aligns with the SSRF mitigation goal.

src/unstract/sdk/adapters/vectordb/weaviate/src/weaviate.py (1)

24-31: Pre‑instantiation URL validation is safe for Weaviate.init.

Adapter._validate_urls() only calls get_configured_urls() and uses getattr(self, "name", self.class.name) as a fallback; Weaviate sets self._config before invoking _validate_urls(), so it does not rely on base init state — no change required.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/unstract/sdk/prompt.py (1)

220-224: Add request timeouts to prevent indefinite hangs.

External calls without timeouts can stall request threads and harm reliability.

+        TIMEOUT_SEC = 30
         if method.upper() == "POST":
             response = requests.post(
-                url=url, json=payload, params=params, headers=req_headers
+                url=url, json=payload, params=params, headers=req_headers, timeout=TIMEOUT_SEC
             )
         elif method.upper() == "GET":
-            response = requests.get(url=url, params=params, headers=req_headers)
+            response = requests.get(url=url, params=params, headers=req_headers, timeout=TIMEOUT_SEC)
🧹 Nitpick comments (9)
src/unstract/sdk/tool/stream.py (2)

118-131: Propagate status_code in the streamed error payload + clarify docstring optionals

Log consumers won’t see the HTTP code unless it’s included in the LOG record. Also tighten docstring types to reflect optionals.

-    def stream_error_and_exit(self, message: str, err: Exception | None = None, status_code: int | None = None) -> None:
+    def stream_error_and_exit(self, message: str, err: Exception | None = None, status_code: int | None = None) -> None:
@@
-            err (Exception): Actual exception that occurred
-            status_code (int): HTTP status code to preserve
+            err (Exception | None): Optional underlying exception
+            status_code (int | None): Optional HTTP/status code to preserve
@@
-        self.stream_log(message, level=LogLevel.ERROR)
+        self.stream_log(message, level=LogLevel.ERROR, status_code=status_code)

126-129: Use the public property consistently

Prefer is_exec_by_tool over touching the private field to avoid bypassing future overrides or invariants.

-        if self._exec_by_tool:
+        if self.is_exec_by_tool:
             exit(1)
src/unstract/sdk/prompt.py (4)

39-41: Use logger.exception for better diagnostics.

Switch to logger.exception to include stack trace automatically.

-                logger.error(f"{msg}\n{e}")
+                logger.exception(msg)

175-177: Don’t send None as a header value.

REQUEST_ID may be None; Requests expects str values. Add only when present.

-        request_headers = {RequestHeader.REQUEST_ID: self.request_id}
+        request_headers: dict[str, str] = {}
+        if self.request_id is not None:
+            request_headers[RequestHeader.REQUEST_ID] = self.request_id

216-216: Normalize URL join to avoid accidental double slashes.

Safer when base_url/url_path contain stray slashes.

-        url: str = f"{self.base_url}/{url_path}"
+        url: str = f"{self.base_url.rstrip('/')}/{url_path.lstrip('/')}"

41-41: Signature OK — callers with a third arg are valid; update docs

stream_error_and_exit is defined as (message, err: Exception | None = None, status_code: int | None = None) at src/unstract/sdk/tool/stream.py:118, so calls like args[0].tool.stream_error_and_exit(msg, e, None) are safe — no runtime fixes needed. Update docs at docs/site/docs/API Reference/tool.stream.md (shows only stream_error_and_exit(message: str) → None) to include the err and status_code parameters.

src/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py (3)

120-122: Simplify exception message per static analysis guidance

The exception message is embedded directly in the raise statement. Consider extracting it to a constant for better maintainability and to address the static analysis hint.

Apply this diff to extract the exception message:

+        UNSUPPORTED_METHOD_MSG = "Unsupported request method"
+
         # ... in _make_request method ...
-        raise ExtractorError(
-            f"Unsupported request method: {request_method}"
-        )
+        raise ExtractorError(f"{UNSUPPORTED_METHOD_MSG}: {request_method}")

276-278: Fix string formatting in error message

The error message has a formatting issue where the string is split across lines without proper concatenation.

Apply this diff to fix the string formatting:

-        raise ExtractorError(
-            "Unable to extract text after attempting"
-            f" {request_count} times"
-        )
+        raise ExtractorError(
+            f"Unable to extract text after attempting {request_count} times"
+        )

420-422: Use logging.exception for better error tracking

When logging within an exception handler, logging.exception is preferred as it automatically includes the traceback information, which is helpful for debugging.

Apply this diff to improve error logging:

-        logger.error(
-            f"Error while writing metadata to {metadata_file_path}: {e}"
-        )
+        logger.exception(
+            f"Error while writing metadata to {metadata_file_path}: {e}"
+        )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to Reviews > Disable Cache setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 440d847 and 77398b2.

📒 Files selected for processing (3)
  • src/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py (11 hunks)
  • src/unstract/sdk/prompt.py (6 hunks)
  • src/unstract/sdk/tool/stream.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.1)
src/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py

120-122: Avoid specifying long messages outside the exception class

(TRY003)


420-422: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🔇 Additional comments (6)
src/unstract/sdk/tool/stream.py (1)

130-130: SdkError accepts status_code and actual_err kwargs Confirmed — SdkError.init declares status_code: int | None = None and actual_err: Exception | None = None in src/unstract/sdk/exceptions.py.

src/unstract/sdk/prompt.py (4)

10-10: Import changes look fine.

Constants are used below; no concerns.


22-24: Decorator signature formatting is OK.

Type vars preserved; no behavior change.


84-86: Base URL formatting change only.

No behavioral impact; OK.


232-234: Decorator formatting LGTM.

No functional change; consistent with style.

src/unstract/sdk/adapters/x2text/llm_whisperer/src/llm_whisperer.py (1)

59-62: URL normalization implementation is correct but should add _validate_urls() call in init

The get_configured_urls() method correctly implements the previously suggested URL normalization with proper string conversion, whitespace stripping, and trailing slash removal. This ensures consistent whitelist matching.

However, the URL validation call in __init__() is still missing. The base class provides _validate_urls() which should be invoked to enforce SSRF protections.

Apply this diff to add URL validation in __init__():

 def __init__(self, settings: dict[str, Any]):
     super().__init__("LLMWhisperer")
     self.config = settings
+    self._validate_urls()

This ensures that URL validation runs before any network operations, providing the security layer this PR is designed to implement. As the Python documentation notes, "We recommend that users of these APIs where the values may be used anywhere with security implications code defensively. Do some verification within your code before trusting a returned component part."

Comment on lines 43 to 58
error_message = str(e)
status_code = None
response = getattr(e, "response", None)
if response is not None:
status_code = response.status_code
if (
MimeType.JSON in response.headers.get("Content-Type", "").lower()
MimeType.JSON
in response.headers.get("Content-Type", "").lower()
and "error" in response.json()
):
error_message = response.json()["error"]
elif response.text:
error_message = response.text
msg = f"Error while {context}. {error_message}"
args[0].tool.stream_error_and_exit(msg, e)
args[0].tool.stream_error_and_exit(msg, e, status_code)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Harden RequestException parsing (avoid double JSON parse and JSONDecodeError).

Current code calls response.json() multiple times and assumes a dict with “error”. Guard for non‑JSON/invalid JSON and parse once.

-                error_message = str(e)
-                status_code = None
-                response = getattr(e, "response", None)
-                if response is not None:
-                    status_code = response.status_code
-                    if (
-                        MimeType.JSON
-                        in response.headers.get("Content-Type", "").lower()
-                        and "error" in response.json()
-                    ):
-                        error_message = response.json()["error"]
-                    elif response.text:
-                        error_message = response.text
+                error_message = str(e)
+                status_code = None
+                response = getattr(e, "response", None)
+                if response is not None:
+                    status_code = response.status_code
+                    content_type = response.headers.get("Content-Type", "")
+                    if "json" in content_type.lower():
+                        try:
+                            data = response.json()
+                        except ValueError:
+                            data = None
+                        if isinstance(data, dict) and "error" in data:
+                            error_message = str(data["error"])
+                        elif response.text:
+                            error_message = response.text
+                    elif response.text:
+                        error_message = response.text
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
error_message = str(e)
status_code = None
response = getattr(e, "response", None)
if response is not None:
status_code = response.status_code
if (
MimeType.JSON in response.headers.get("Content-Type", "").lower()
MimeType.JSON
in response.headers.get("Content-Type", "").lower()
and "error" in response.json()
):
error_message = response.json()["error"]
elif response.text:
error_message = response.text
msg = f"Error while {context}. {error_message}"
args[0].tool.stream_error_and_exit(msg, e)
args[0].tool.stream_error_and_exit(msg, e, status_code)
error_message = str(e)
status_code = None
response = getattr(e, "response", None)
if response is not None:
status_code = response.status_code
content_type = response.headers.get("Content-Type", "")
if "json" in content_type.lower():
try:
data = response.json()
except ValueError:
data = None
if isinstance(data, dict) and "error" in data:
error_message = str(data["error"])
elif response.text:
error_message = response.text
elif response.text:
error_message = response.text
msg = f"Error while {context}. {error_message}"
args[0].tool.stream_error_and_exit(msg, e, status_code)

print(json.dumps(record))

def stream_error_and_exit(self, message: str, err: Exception | None = None) -> None:
def stream_error_and_exit(self, message: str, err: Exception | None = None, status_code: int | None = None) -> None:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

🧩 Analysis chain

Confirm Python version compatibility of PEP 604 unions

Exception | None needs Python 3.10+. If 3.9 is still supported, switch to Optional[Exception] (and from typing import Optional).


🏁 Script executed:

#!/bin/bash
# Detect declared Python version(s)
fd -a 'pyproject.toml|setup.cfg|setup.py|tox.ini' | sed 's#^#--- #'
rg -n 'python_requires|Programming Language :: Python ::' -C2 -- pyproject.toml setup.cfg setup.py tox.ini 2>/dev/null || true

Length of output: 3136


Use Optional[...] for Python 3.9 compatibility or require Python ≥3.10

pyproject.toml declares "Programming Language :: Python :: 3.9", so change the signature in src/unstract/sdk/tool/stream.py (line ~118) from Exception | None and int | None to Optional[Exception] and Optional[int] and add from typing import Optional — or update project metadata to require Python >=3.10.

🤖 Prompt for AI Agents
In src/unstract/sdk/tool/stream.py around line 118, the method signature uses
Python 3.10 union types (Exception | None, int | None) but pyproject declares
Python 3.9; change the signature to use Optional[Exception] and Optional[int],
add "from typing import Optional" at the top of the file (or alternatively
update project metadata to require Python ≥3.10), and ensure any other
occurrences of "X | None" in this file are similarly converted to Optional[X].

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants