
Batch job submission error "Unable to process all documents", but the URIs seem correct?

I've been trying to get Document AI batch submissions to work, but I'm having some difficulty. I started with a RawDocument for single-file submission, assuming I could iterate over my dataset (27k images), but opted for batch processing as it seemed the more appropriate technique.

When I run the code I see the error: "Unable to process all documents". The first few lines of debug information are:

O:17:"Google\Rpc\Status":5:{ s:7:"*Code";i:3;s:10:"*Message";s:32:"Unable to process all documents."; s:26:"Google\Rpc\Statusdetails"; O:38:"Google\Protobuf\Internal\RepeatedField":4:{ s:49:"Google\Protobuf\Internal\RepeatedFieldcontainer";a:0:{}s:44:"Google\Protobuf\Internal\RepeatedFieldtype";i:11;s:45:"Google\Protobuf\Internal\RepeatedFieldklass ";s:19:"Google\Protobuf\Any";s:52:"Google\Protobuf\Internal\RepeatedFieldlegacy_klass";s:19:"Google\Protobuf\Any";}s:38:"Google\Protobuf\ Internal\Messagedesc";O:35:"Google\Protobuf\Internal\Descriptor":13:{s:46:"Google\Protobuf\Internal\Descriptorfull_name";s:17:"google.rpc.Status";s: 42:"Google\Protobuf\Internal\Descriptorfield";a:3:{i:1;O:40:"Google\Protobuf\Internal\FieldDescriptor":14:{s:46:"Google\Protobuf\Internal\FieldDescriptorname ";s:4:"code";```

The support page for this error states that the cause is:

gcsUriPrefix and gcsOutputConfig.gcsUri parameters need to start with gs:// and end with a trailing slash character (/). Check the configuration of the bucket URI.

I'm not using gcsUriPrefix (should I? My bucket holds more files than the max batch limit), but my gcsOutputConfig.gcsUri follows those rules. The file list I provide gives the file names (pointing to the right bucket), so there should be no trailing slash on those.

Any advice is welcome; the code follows:

use Google\Cloud\Storage\StorageClient;
use Google\Cloud\DocumentAI\V1\GcsDocument;
use Google\Cloud\DocumentAI\V1\GcsDocuments;
use Google\Cloud\DocumentAI\V1\BatchDocumentsInputConfig;
use Google\Cloud\DocumentAI\V1\DocumentOutputConfig;
use Google\Cloud\DocumentAI\V1\GcsOutputConfig;
use Google\Cloud\DocumentAI\V1\DocumentProcessorServiceClient;

function filesFromBucket( $directoryPrefix ) {
    // NOT recursive: lists only objects under the prefix, does not walk the tree
    $gcsDocumentList = [];

    // see https://cloud.google.com/storage/docs/samples/storage-list-files-with-prefix
    $bucketName = 'my-input-bucket';
    $storage = new StorageClient();
    $bucket = $storage->bucket($bucketName);
    $options = ['prefix' => $directoryPrefix];
    foreach ($bucket->objects($options) as $object) {
        $doc = new GcsDocument();
        $doc->setGcsUri('gs://'.$object->name());
        $doc->setMimeType($object->info()['contentType']);
        array_push( $gcsDocumentList, $doc );
    }

    $gcsDocuments = new GcsDocuments();
    $gcsDocuments->setDocuments($gcsDocumentList);
    return $gcsDocuments;
}

function batchJob ( ) {
    $inputConfig = new BatchDocumentsInputConfig( ['gcs_documents'=>filesFromBucket('the-bucket-path/')] );

    // see https://cloud.google.com/php/docs/reference/cloud-document-ai/latest/V1.DocumentOutputConfig
    // nb: all uri paths must end with / or an error will be generated.
    $outputConfig = new DocumentOutputConfig(
        [ 'gcs_output_config' =>
               new GcsOutputConfig( ['gcs_uri'=>'gs://my-output-bucket/'] ) ]
    );

    // see https://cloud.google.com/php/docs/reference/cloud-document-ai/latest/V1.DocumentProcessorServiceClient
    $documentProcessorServiceClient = new DocumentProcessorServiceClient();
    try {
        // derived from the prediction endpoint
        $name = 'projects/######/locations/us/processors/#######';
        $operationResponse = $documentProcessorServiceClient->batchProcessDocuments($name, ['inputDocuments'=>$inputConfig, 'documentOutputConfig'=>$outputConfig]);
        $operationResponse->pollUntilComplete();
        if ($operationResponse->operationSucceeded()) {
            $result = $operationResponse->getResult();
            printf('<br>result: %s<br>', serialize($result));
            // doSomethingWith($result)
        } else {
            $error = $operationResponse->getError();
            printf('<br>error: %s<br>', serialize($error));
            // handleError($error)
        }
    } finally {
        $documentProcessorServiceClient->close();
    }
}

P粉696891871 · 175 days ago

Replies (2)

  • P粉103739566 · 2024-04-01 09:46:00

    Typically, the error "Unable to process all documents" is caused by incorrect syntax in the input files or output bucket, since a malformed path may still be a "valid" Cloud Storage path, just not one that points to the file you expected. (Thank you for checking the error message page first!)
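
    For example (hypothetical names), a URI that drops the bucket segment is still well-formed, so it passes validation but does not name the file you expected:

        // Both are syntactically valid gs:// URIs;
        // only the second points at the intended object.
        $bad  = 'gs://some-dir/file-0001.png';                 // "some-dir" is parsed as the bucket
        $good = 'gs://my-input-bucket/some-dir/file-0001.png'; // bucket + object path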

    If you are providing a specific list of documents to process, you do not have to use gcsUriPrefix. However, since your code adds every file in a GCS directory to the BatchDocumentsInputConfig.gcs_documents field, it would make more sense to send the prefix in BatchDocumentsInputConfig.gcs_uri_prefix instead of a list of individual files, as sketched below.
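
    A minimal sketch of that alternative, reusing the bucket and prefix names from your question (gcs_prefix / GcsPrefix are the corresponding field and class in the PHP client's V1 namespace):

        use Google\Cloud\DocumentAI\V1\BatchDocumentsInputConfig;
        use Google\Cloud\DocumentAI\V1\GcsPrefix;

        // Point the batch request at every file under the prefix
        // instead of enumerating individual documents.
        $inputConfig = new BatchDocumentsInputConfig([
            'gcs_prefix' => new GcsPrefix([
                'gcs_uri_prefix' => 'gs://my-input-bucket/the-bucket-path/'
            ])
        ]);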

    Note: there is a maximum number of files that can be sent in a single batch request (1,000), and specific processors have their own page limits.

    https://cloud.google.com/document-ai/quotas#content_limits

    You can try splitting the files into multiple batch requests to avoid hitting this limit. The Document AI Toolbox Python SDK has a built-in function for this purpose, which you could reimplement in PHP for your use case: https://github.com/googleapis/python-documentai-toolbox/blob/ba354d8af85cbea0ad0cd2501e041f21e9e5d765/google/cloud/documentai_toolbox/utilities/gcs_utilities.py#L213
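
    A rough PHP approximation of that helper, assuming the variables from your filesFromBucket() and the 1,000-file limit above (array_chunk does the splitting):

        // Collect all documents first, then submit one batch request per
        // chunk so no single request exceeds the 1,000-file limit.
        $allDocs = [];
        foreach ($bucket->objects($options) as $object) {
            $doc = new GcsDocument();
            $doc->setGcsUri('gs://'.$bucketName.'/'.$object->name());
            $doc->setMimeType($object->info()['contentType']);
            $allDocs[] = $doc;
        }

        foreach (array_chunk($allDocs, 1000) as $chunk) {
            $gcsDocuments = new GcsDocuments();
            $gcsDocuments->setDocuments($chunk);
            $inputConfig = new BatchDocumentsInputConfig(['gcs_documents' => $gcsDocuments]);
            // ...build $outputConfig and call batchProcessDocuments() as in the question...
        }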

  • P粉195402292 · 2024-04-01 00:22:20

    This turns out to be an ID-10-T bug with clear PEBKAC overtones.

    $object->name() does not return the bucket name as part of the path.

    Changing $doc->setGcsUri('gs://'.$object->name()); to $doc->setGcsUri('gs://'.$bucketName.'/'.$object->name()); solves the problem.
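
    In context, the corrected loop body reads:

        foreach ($bucket->objects($options) as $object) {
            $doc = new GcsDocument();
            // $object->name() is only the object path, so the bucket must be prepended
            $doc->setGcsUri('gs://'.$bucketName.'/'.$object->name());
            $doc->setMimeType($object->info()['contentType']);
            array_push( $gcsDocumentList, $doc );
        }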
