Skip to content

Conversation

CarlosFerLo
Copy link
Contributor

Related Issues

Proposed Changes:

Add a simple new component EmptyDocumentRemover as proposed on #8061 to remove empty documents from a list of documents, then return the new list.

How did you test it?

Added a simple unit test, as its implementation is rather simple.

Notes for the reviewer

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes ✅
  • I added unit tests and updated the docstrings ✅
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:. ✅
  • I documented my code ✅
  • I ran pre-commit hooks and fixed any issue ✅

@CarlosFerLo CarlosFerLo requested a review from a team as a code owner July 28, 2024 12:31
@CarlosFerLo CarlosFerLo requested review from julian-risch and removed request for a team July 28, 2024 12:31
@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Jul 28, 2024
@CarlosFerLo CarlosFerLo requested a review from a team as a code owner July 28, 2024 12:33
@CarlosFerLo CarlosFerLo requested review from dfokina and removed request for a team July 28, 2024 12:33
@coveralls
Copy link
Collaborator

coveralls commented Jul 28, 2024

Pull Request Test Coverage Report for Build 10151473812

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.007%) to 90.066%

Totals Coverage Status
Change from base Build 10144976547: 0.007%
Covered Lines: 6827
Relevant Lines: 7580

💛 - Coveralls

@julian-risch
Copy link
Member

Hi @CarlosFerLo I thought about your suggestion of a higher abstraction by making a DocumentFilter component that accepts a predicate during initialization. Isn't that very similar to a ConditionalRouter? Or an OutputAdapter? What do you think about the following example with an OutputAdapter using a custom filter?

from typing import List
from haystack import Pipeline, component, Document
from haystack.components.converters import OutputAdapter

def filter_empty_docs(docs):
    return [doc for doc in docs if doc.content is not None]

@component
class DocumentProducer:
    @component.output_types(documents=dict)
    def run(self):
        return {"documents": [Document(content="haystack1"), Document(content=None), Document(content=""), Document(content="haystack2")]}

pipe = Pipeline()
pipe.add_component(
    name="output_adapter",
    instance=OutputAdapter(template="{{ documents | filter_empty_docs}}",
                           output_type=List[Document],
                           custom_filters={"filter_empty_docs": filter_empty_docs}))

pipe.add_component(name="document_producer", instance=DocumentProducer())
pipe.connect("document_producer", "output_adapter")
result = pipe.run(data={})

print(result["output_adapter"]["output"])

@wochinge
Copy link
Contributor

wochinge commented Aug 2, 2024

I like re-using existing components 👍🏻 I think deserialization is a problem with your version @julian-risch . However, looking at this stackoverflow post everything would actually be possible via a jinja template, wouldn't it?

@julian-risch
Copy link
Member

julian-risch commented Aug 2, 2024

That's correct. We don't need to define filter_empty_docs. The following works and then there is no problem with deserialization:

pipe.add_component(
    name="output_adapter",
    instance=OutputAdapter(template="{{ documents | rejectattr('content', 'none') | list}}",
                           output_type=List[Document]))

@julian-risch
Copy link
Member

@CarlosFerLo Thanks for opening this PR and thereby brining the discussion forward! We found that the requested behavior can be easily achieved with an OutputAdapter so I will close this PR and the issue referring to the code snippet above. I'll check with my colleagues how we can better document that the OutputAdapter can do this.

@anakin87
Copy link
Member

anakin87 commented Aug 2, 2024

This fails at runtime. Related to #8095

from typing import List
from haystack import Pipeline, component, Document
from haystack.components.converters import OutputAdapter
from haystack.components.joiners.document_joiner import DocumentJoiner


def filter_empty_docs(docs):
    return [doc for doc in docs if doc.content is not None]

@component
class DocumentProducer:
    @component.output_types(documents=dict)
    def run(self):
        return {"documents": [Document(content="haystack1"), Document(content=None), Document(content=""), Document(content="haystack2")]}

pipe = Pipeline()
pipe.add_component(name="document_producer", instance=DocumentProducer())
pipe.add_component(
    name="output_adapter",
    instance=OutputAdapter(template="{{ documents | filter_empty_docs}}",
                           output_type=List[Document],
                           custom_filters={"filter_empty_docs": filter_empty_docs}))
pipe.add_component(instance=DocumentJoiner(), name="joiner")

pipe.connect("document_producer", "output_adapter")
pipe.connect("output_adapter", "joiner")
result = pipe.run(data={})

print(result)

/usr/local/lib/python3.10/dist-packages/haystack/components/joiners/document_joiner.py in _concatenate(self, document_lists)
126 docs_per_id = defaultdict(list)
127 for doc in itertools.chain.from_iterable(document_lists):
--> 128 docs_per_id[doc.id].append(doc)
129 for docs in docs_per_id.values():
130 doc_with_best_score = max(docs, key=lambda doc: doc.score if doc.score else -inf)

AttributeError: 'str' object has no attribute 'id'

@julian-risch
Copy link
Member

Thanks @anakin87 . I missed that recent change in my local tests and indeed the solution I proposed does not work from 2.3.1 onward. I suggest using a custom component instead: #8061 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add EmptyDocumentRemover component
5 participants