Document Translation Service using Streamlit & AWS Translator


Barbara Streisand

Jan 01, 2025, 02:35 AM

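Introduction:

DocuTranslator is a document translation system built on AWS and developed with the Streamlit application framework. It lets end users upload a document and have it translated into the language of their choice. Because it can translate into many languages, users can read content in whichever language they are most comfortable with.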

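Background:

The goal of this project is to provide a simple, user-friendly application interface for the translation process. With this system, nobody needs to open the AWS Translate service to translate a document; end users go directly to the application endpoint and get what they need.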

High-level architecture diagram:

[Figure: high-level architecture diagram]

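How it works:

The end user reaches the application through the Application Load Balancer.
Once the application interface opens, the user uploads the file to be translated and chooses the target language.
After these details are submitted, the file is uploaded to the selected source S3 bucket, which triggers a Lambda function that calls the AWS Translator service.
When the translated document is ready, it is uploaded to the destination S3 bucket.
The end user can then download the translated document from the Streamlit application portal.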
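
The trigger described above is typically an S3 event notification on the source bucket. The article does not show how its trigger was configured, so the following is only a minimal sketch of such notification settings, assuming the trigger fires for any newly created `.pdf` object. The function ARN is a placeholder and `build_notification_config` is a hypothetical helper name.

```python
def build_notification_config(lambda_arn, suffix=".pdf"):
    """Return settings that invoke a Lambda function for new objects ending in `suffix`."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": suffix}]}
                },
            }
        ]
    }


# Placeholder ARN; TriggerFunctionFromS3 is the function name used by app.py.
config = build_notification_config(
    "arn:aws:lambda:us-east-1:<account-id>:function:TriggerFunctionFromS3"
)
print(config["LambdaFunctionConfigurations"][0]["Events"])
```

The dictionary can be passed as `NotificationConfiguration` to boto3's `s3.put_bucket_notification_configuration(Bucket=...)`. S3 must also be granted permission to invoke the function.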

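Technical architecture:

[Figure: technical architecture diagram]

The architecture above shows the following points:

The application code is containerized and stored in an ECR repository.
Following the design above, an ECS cluster is set up that runs two tasks, each pulling the application image from the ECR repository.
Both tasks use the EC2 launch type. The two EC2 instances run in private subnets in the us-east-1a and us-east-1b Availability Zones.
An EFS file system is created to share the application code between the two underlying EC2 instances, with one mount target in each of the two Availability Zones (us-east-1a and us-east-1b).
Two public subnets are configured in front of the private subnets, and a NAT gateway is set up in the public subnet in us-east-1a.
An Application Load Balancer is placed in front of the private subnets. It is deployed across the two public subnets, and its security group (ALB SG) accepts traffic on port 80.
The two EC2 instances are registered in two different target groups and share the same EC2 security group (Streamlit_SG), which accepts traffic on port 16347 from the Application Load Balancer.
Port 16347 on the EC2 instances is mapped to port 8501 in the ECS container. Traffic that reaches port 16347 through the EC2 security group is forwarded to port 8501 at the container level.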
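
The last two points reduce to one security group rule and one port mapping. As a minimal sketch, assuming boto3's EC2 client is used, the code below builds the arguments for `authorize_security_group_ingress` so that Streamlit_SG accepts TCP 16347 only from the ALB security group. The security group IDs are hypothetical, since the article gives only the group names.

```python
HOST_PORT = 16347       # port opened on the EC2 instances
CONTAINER_PORT = 8501   # port Streamlit listens on inside the container


def build_ingress_rule(instance_sg_id, alb_sg_id, port=HOST_PORT):
    """Arguments allowing TCP `port` into instance_sg_id from members of alb_sg_id."""
    return {
        "GroupId": instance_sg_id,
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": port,
                "ToPort": port,
                "UserIdGroupPairs": [{"GroupId": alb_sg_id}],
            }
        ],
    }


# Hypothetical IDs, for illustration only.
rule = build_ingress_rule("sg-streamlit-example", "sg-alb-example")
print(rule["IpPermissions"][0]["FromPort"], "->", CONTAINER_PORT)
```

With real IDs, the rule could be applied with `boto3.client("ec2").authorize_security_group_ingress(**rule)`. Referencing the ALB security group rather than a CIDR range keeps the instances unreachable from anything other than the load balancer.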

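How is data stored?

Here, an EFS share is used so that both underlying EC2 instances see the same application files. A mount point, /streamlit_appfiles, is created on each EC2 instance and mounted from the EFS share, so the two servers share the same content. The next goal is to make that same application content available in the container's working directory, /streamlit. For this a bind mount is used, so any change made to the application code at the EC2 level is also reflected in the container. Replication must not work in both directions: if someone changes the code from inside the container by mistake, the change should not propagate back to the EC2 host. The container's working directory is therefore mounted as a read-only file system.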
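
One way to confirm the read-only behaviour from inside a running container is to ask the operating system whether the working directory is writable. This is a small sketch, assuming the container working directory `/streamlit`; `is_writable` is a hypothetical helper name.

```python
import os


def is_writable(path):
    """Return True if the current process may write to `path`."""
    return os.access(path, os.W_OK)


if __name__ == "__main__":
    # Expected to print False inside the container when the bind mount is read-only.
    print(is_writable("/streamlit"))
```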


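ECS container configuration and capacity:

Underlying EC2 configuration:
Instance type: t2.medium
Network type: private subnet

Container configuration:
Image:
Network mode: default
Host port: 16347
Container port: 8501
Task CPU: 2 vCPU (2048 units)
Task memory: 2.5 GB (2560 MiB)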


Volume configuration:
Volume name: streamlit-volume
Source path: /streamlit_appfiles
Container path: /streamlit
Read-only file system: yes


Task definition reference:

{
    "taskDefinitionArn": "arn:aws:ecs:us-east-1:<account-id>:task-definition/Streamlit_TDF-1:5",
    "containerDefinitions": [
        {
            "name": "streamlit",
            "image": "<account-id>.dkr.ecr.us-east-1.amazonaws.com/anirban:latest",
            "cpu": 0,
            "portMappings": [
                {
                    "name": "streamlit-8501-tcp",
                    "containerPort": 8501,
                    "hostPort": 16347,
                    "protocol": "tcp",
                    "appProtocol": "http"
                }
            ],
            "essential": true,
            "environment": [],
            "environmentFiles": [],
            "mountPoints": [
                {
                    "sourceVolume": "streamlit-volume",
                    "containerPath": "/streamlit",
                    "readOnly": true
                }
            ],
            "volumesFrom": [],
            "ulimits": [],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/Streamlit_TDF-1",
                    "mode": "non-blocking",
                    "awslogs-create-group": "true",
                    "max-buffer-size": "25m",
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "ecs"
                },
                "secretOptions": []
            },
            "systemControls": []
        }
    ],
    "family": "Streamlit_TDF-1",
    "taskRoleArn": "arn:aws:iam::<account-id>:role/ecsTaskExecutionRole",
    "executionRoleArn": "arn:aws:iam::<account-id>:role/ecsTaskExecutionRole",
    "revision": 5,
    "volumes": [
        {
            "name": "streamlit-volume",
            "host": {
                "sourcePath": "/streamlit_appfiles"
            }
        }
    ],
    "status": "ACTIVE",
    "requiresAttributes": [
        {
            "name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
        },
        {
            "name": "ecs.capability.execution-role-awslogs"
        },
        {
            "name": "com.amazonaws.ecs.capability.ecr-auth"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.28"
        },
        {
            "name": "com.amazonaws.ecs.capability.task-iam-role"
        },
        {
            "name": "ecs.capability.execution-role-ecr-pull"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.29"
        }
    ],
    "placementConstraints": [],
    "compatibilities": [
        "EC2"
    ],
    "requiresCompatibilities": [
        "EC2"
    ],
    "cpu": "2048",
    "memory": "2560",
    "runtimePlatform": {
        "cpuArchitecture": "X86_64",
        "operatingSystemFamily": "LINUX"
    },
    "registeredAt": "2024-11-09T05:59:47.534Z",
    "registeredBy": "arn:aws:iam::<account-id>:root",
    "tags": []
}
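
The JSON above has the shape returned by `describe_task_definition`, and it cannot be registered again unchanged because several keys are filled in by ECS. As a minimal sketch, assuming boto3 and a hypothetical helper name `to_register_args`, the code below strips those keys so the remainder can be passed to `register_task_definition`.

```python
# Keys that ECS sets itself; register_task_definition does not accept them.
READ_ONLY_KEYS = {
    "taskDefinitionArn", "revision", "status", "requiresAttributes",
    "compatibilities", "registeredAt", "registeredBy", "deregisteredAt",
}


def to_register_args(task_definition):
    """Copy a described task definition, dropping service-managed keys."""
    args = {k: v for k, v in task_definition.items() if k not in READ_ONLY_KEYS}
    # An empty tag list carries no information, so leave it out.
    if not args.get("tags"):
        args.pop("tags", None)
    return args


# A shortened stand-in for the full definition shown above.
described = {
    "family": "Streamlit_TDF-1",
    "revision": 5,
    "status": "ACTIVE",
    "cpu": "2048",
    "memory": "2560",
    "requiresCompatibilities": ["EC2"],
    "tags": [],
}
print(to_register_args(described))
```

With the full definition loaded from a file, the result could be registered with `boto3.client("ecs").register_task_definition(**to_register_args(described))`.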


Developing the application code and creating the Docker image:

app.py

import time

import boto3
import streamlit as st
from boto3.exceptions import S3UploadFailedError
from botocore.exceptions import ClientError

s3 = boto3.client('s3', region_name='us-east-1')
tran = boto3.client('translate', region_name='us-east-1')
lam = boto3.client('lambda', region_name='us-east-1')


# List the names of all S3 buckets in the account
def listbuckets():
    list_bucket = s3.list_buckets()
    return tuple(it["Name"] for it in list_bucket["Buckets"])


# Upload a local file to the selected S3 bucket
def upload_to_s3bucket(file_path, selected_bucket, file_name):
    s3.upload_file(file_path, selected_bucket, file_name)


# List the language names supported by the translation service
def list_language():
    response = tran.list_languages()
    return [i["LanguageName"] for i in response["Languages"]]


# Poll the destination bucket until the translated file appears
def wait_for_s3obj(dest_selected_bucket, file_name):
    key = f'Translated-{file_name}.txt'
    while True:
        try:
            s3.head_object(Bucket=dest_selected_bucket, Key=key)
            return True
        except ClientError as e:
            # head_object reports a missing key as "404"; get_object uses "NoSuchKey"
            if e.response['Error']['Code'] not in ("404", "NoSuchKey"):
                raise
            print(f"File '{key}' not found. Checking again in 3 seconds...")
            time.sleep(3)


# Fetch the translated file and offer it through a download button
def download(dest_selected_bucket, file_name, file_path):
    key = f'Translated-{file_name}.txt'
    local_path = f'{file_path}/download/{key}'
    s3.download_file(dest_selected_bucket, key, local_path)
    with open(local_path, "rb") as file:
        st.download_button(
            label="Download",
            data=file.read(),
            file_name=f"{file_name}.txt"
        )


def streamlit_application():
    # Give a header
    st.header("Document Translator", divider=True)
    # Widget to upload a file
    uploaded_files = st.file_uploader("Choose a PDF file", accept_multiple_files=True, type="pdf")
    # Only the first uploaded file is processed
    file_name = uploaded_files[0].name.replace(' ', '_') if uploaded_files else None
    # Folder path
    file_path = '/tmp'
    # Select the buckets and the language from drop-downs
    selected_bucket = st.selectbox("Choose the S3 Bucket to upload file :", listbuckets())
    dest_selected_bucket = st.selectbox("Choose the S3 Bucket to download file :", listbuckets())
    selected_language = st.selectbox("Choose the Language :", list_language())
    # Create a button
    click = st.button("Upload", type="primary")
    if click:
        if not file_name:
            st.error("Please choose a PDF file first")
            return
        with open(f'{file_path}/{file_name}', mode='wb') as w:
            w.write(uploaded_files[0].getvalue())
        # Pass the user's choices to the Lambda function through its environment variables
        lam.update_function_configuration(FunctionName='TriggerFunctionFromS3', Environment={'Variables': {'UserInputLanguage': selected_language, 'DestinationBucket': dest_selected_bucket, 'TranslatedFileName': file_name}})
        # Upload the file to the S3 bucket and confirm that it arrived
        try:
            upload_to_s3bucket(f'{file_path}/{file_name}', selected_bucket, file_name)
            s3.head_object(Bucket=selected_bucket, Key=file_name)
        except (ClientError, S3UploadFailedError):
            st.error("File upload failed")
            return
        st.success("File uploaded successfully", icon="✅")
        if wait_for_s3obj(dest_selected_bucket, file_name):
            download(dest_selected_bucket, file_name, file_path)


streamlit_application()
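
`wait_for_s3obj` keeps polling for as long as the translated file is missing, so a failed translation would leave the page waiting indefinitely. Below is a sketch of a bounded variant, assuming a 3-second interval and a hypothetical 120-second limit. `poll_until` and its parameters are illustrative names, and the existence check is passed in as a function so the helper does not depend on boto3.

```python
import time


def poll_until(check, interval=3.0, timeout=120.0, sleep=time.sleep, clock=time.monotonic):
    """Call check() every `interval` seconds until it returns True or `timeout` elapses."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Example with a stand-in check that succeeds on the third attempt.
attempts = {"count": 0}


def fake_check():
    attempts["count"] += 1
    return attempts["count"] >= 3


print(poll_until(fake_check, interval=0.01, timeout=1.0))
```

In the application, `check` could wrap `s3.head_object` and return False when the call fails with a 404 error code, and the page could show an error message when `poll_until` returns False.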

about.py

import streamlit as st

## Write the description of application
st.header("About")
about = '''
Welcome to the File Uploader Application!

This application is designed to make uploading PDF documents simple and efficient. With just a few clicks, users can upload their documents securely to an Amazon S3 bucket for storage. Here’s a quick overview
of what this app does:

**Key Features:**
- **Easy Upload:** Users can quickly upload PDF documents by selecting the file and clicking the 'Upload' button.
- **Seamless Integration with AWS S3:** Once the document is uploaded, it is stored securely in a designated S3 bucket, ensuring reliable and scalable cloud storage.
- **User-Friendly Interface:** Built using Streamlit, the interface is clean, intuitive, and accessible to all users, making the uploading process straightforward.

**How it Works:**
1. **Select a PDF Document:** Users can browse and select any PDF document from their local system.
2. **Upload the Document:** Clicking the ‘Upload’ button triggers the process of securely uploading the selected document to an AWS S3 bucket.
3. **Success Notification:** After a successful upload, users will receive a confirmation message that their document has been stored in the cloud.
This application offers a streamlined way to store documents on the cloud, reducing the hassle of manual file management. Whether you're an individual or a business, this tool helps you organize and store your
files with ease and security.
You can further customize this page by adding technical details, usage guidelines, or security measures as per your application's specifications.'''

st.markdown(about)

navigation.py

import streamlit as st

pg = st.navigation([
    st.Page("app.py", title="DocuTranslator"),
    st.Page("about.py", title="About")
], position="sidebar")

pg.run()

Dockerfile:

FROM python:3.9-slim
WORKDIR /streamlit
COPY requirements.txt /streamlit/requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
RUN mkdir /tmp/download
COPY . /streamlit
EXPOSE 8501
CMD ["streamlit", "run", "navigation.py", "--server.port=8501", "--server.headless=true"]

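The Dockerfile builds an image that packages all of the application files above; the image is then pushed to the ECR repository. Docker Hub could also be used to store the image.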
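
The build-and-push step can be sketched as three Docker CLI invocations. The example below only assembles the commands, assuming the repository name `anirban` and the `latest` tag from the task definition. The account ID is a placeholder, the helper names are hypothetical, and `docker login` against the registry is assumed to have been done already.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Compose the image URI in the form used by the task definition."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def build_and_push_commands(account_id, region, repository, tag="latest"):
    """Return the docker commands that build, tag, and push the image."""
    local = f"{repository}:{tag}"
    remote = ecr_image_uri(account_id, region, repository, tag)
    return [
        ["docker", "build", "-t", local, "."],
        ["docker", "tag", local, remote],
        ["docker", "push", remote],
    ]


for command in build_and_push_commands("<account-id>", "us-east-1", "anirban"):
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` from the directory that contains the Dockerfile.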

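Load balancing

In this architecture the application instances are created in private subnets, and a load balancer is created to spread incoming traffic across the private EC2 instances.
Because two underlying EC2 hosts are available to run the containers, load balancing is configured between them to distribute incoming traffic. Two separate target groups are created for the two EC2 instances, each weighted at 50%.

The load balancer accepts incoming traffic on port 80, forwards it to the backend EC2 instances on port 16347, and from there it reaches the corresponding ECS container.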
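
A 50/50 split across two target groups corresponds to a weighted forward action on the ALB listener. This is a minimal sketch of that action, assuming boto3's `elbv2` client; the target group ARNs are placeholders and `weighted_forward_action` is a hypothetical helper name.

```python
def weighted_forward_action(target_group_arns, weights=None):
    """Build a listener action that forwards to several target groups by weight."""
    if weights is None:
        # Equal weights give every target group the same share of requests.
        weights = [1] * len(target_group_arns)
    if len(weights) != len(target_group_arns):
        raise ValueError("one weight is required per target group")
    return {
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": arn, "Weight": weight}
                for arn, weight in zip(target_group_arns, weights)
            ]
        },
    }


# Placeholder ARNs, for illustration only.
action = weighted_forward_action(["<target-group-arn-1>", "<target-group-arn-2>"], [50, 50])
print([tg["Weight"] for tg in action["ForwardConfig"]["TargetGroups"]])
```

The action could be supplied as `DefaultActions=[action]` when calling `elbv2.create_listener` with `Protocol="HTTP"` and `Port=80`.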


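Lambda function:

A Lambda function takes the source bucket as input, downloads the PDF file from it and extracts the text, translates the text from its current language into the target language chosen by the user, and writes the result to a text file that is uploaded to the destination S3 bucket.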
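
The function's code is not reproduced in this article. What follows is a minimal sketch of the behaviour described above, not the author's implementation. It assumes the `pypdf` package is bundled with the function for text extraction, that the environment variable names match those set by app.py, and that the output key follows the `Translated-<name>.txt` pattern app.py polls for. `chunk_text` and `language_code` are hypothetical helpers.

```python
import os
import urllib.parse

MAX_BYTES = 9000  # assumption: stay under the per-request size limit of translate_text


def chunk_text(text, max_bytes=MAX_BYTES):
    """Split text on whitespace into pieces of at most max_bytes UTF-8 bytes.

    A single word longer than max_bytes is kept whole, so this suits
    space-delimited languages only.
    """
    chunks, current, size = [], [], 0
    for word in text.split():
        word_size = len(word.encode("utf-8")) + 1  # +1 for the joining space
        if current and size + word_size > max_bytes:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(word)
        size += word_size
    if current:
        chunks.append(" ".join(current))
    return chunks


def language_code(language_name, languages):
    """Look up the language code for a display name in list_languages() output."""
    for entry in languages:
        if entry["LanguageName"] == language_name:
            return entry["LanguageCode"]
    raise ValueError(f"Unsupported language: {language_name}")


def lambda_handler(event, context):
    # Imported here so the helpers above can be used without these packages.
    import boto3
    from pypdf import PdfReader  # assumption: pypdf is packaged with the function

    s3 = boto3.client("s3")
    translate = boto3.client("translate")

    # Locate the uploaded PDF from the S3 event that triggered the function.
    record = event["Records"][0]["s3"]
    source_bucket = record["bucket"]["name"]
    source_key = urllib.parse.unquote_plus(record["object"]["key"])

    # Values written by app.py through update_function_configuration.
    target_name = os.environ["UserInputLanguage"]
    dest_bucket = os.environ["DestinationBucket"]
    file_name = os.environ["TranslatedFileName"]

    # Download the PDF and extract its text page by page.
    local_pdf = f"/tmp/{os.path.basename(source_key)}"
    s3.download_file(source_bucket, source_key, local_pdf)
    text = "\n".join(page.extract_text() or "" for page in PdfReader(local_pdf).pages)

    # app.py sends a language name, while translate_text expects a language code.
    target_code = language_code(target_name, translate.list_languages()["Languages"])
    translated = [
        translate.translate_text(
            Text=chunk, SourceLanguageCode="auto", TargetLanguageCode=target_code
        )["TranslatedText"]
        for chunk in chunk_text(text)
    ]

    # Write the translation where app.py expects to find it.
    s3.put_object(
        Bucket=dest_bucket,
        Key=f"Translated-{file_name}.txt",
        Body="\n".join(translated).encode("utf-8"),
    )
    return {"statusCode": 200}
```

Splitting the text before translation keeps each request small, at the cost of translating chunks without the surrounding context.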


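Application testing:

Open the Application Load Balancer URL "ALB-747339710.us-east-1.elb.amazonaws.com" to load the web application. Browse to any PDF file and leave the source bucket "fileuploadbucket-hwirio984092jjs" and the destination bucket "translatedfileuploadbucket-kh939809kjkfjsekfl" unchanged, because the Lambda code hardcodes the destination bucket named above. Choose the language you want the document translated into and click Upload. The application then starts polling the destination S3 bucket for the translated file. Once that exact file is found, a new "Download" option appears for downloading it from the destination S3 bucket.

Application link: http://alb-747339710.us-east-1.elb.amazonaws.com/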


[Sample: actual content of the uploaded document]

[Sample: translated content (Canadian French)]

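Conclusion:

This article shows that document translation can be as simple as we would like it to be: the end user clicks a few options to supply the required information and receives the output within seconds, without having to think about configuration. For now the application has a single feature, translating PDF documents, but we plan to explore further so that a single application can offer several features, including some more interesting ones.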
