


Let's talk about how to parse Apache Avro data (explanation with examples)
How to parse Apache Avro data? This article will introduce you to the methods of serializing to generate Avro data, deserializing to parse Avro data, and using FlinkSQL to parse Avro data. I hope it will be helpful to you!
With the rapid growth of the Internet, technologies such as cloud computing, big data, artificial intelligence, and the Internet of Things have become mainstream. Applications such as e-commerce websites, face recognition, driverless vehicles, smart homes, and smart cities make daily life more convenient, and behind them a large amount of data is constantly being collected, cleaned, and analyzed by various platforms. Low latency, high throughput, and data security are therefore particularly important. Apache Avro serializes data against a schema into a binary format for transmission. On the one hand this enables fast data transmission; on the other, it helps with data security. Avro is used more and more widely across industries, so knowing how to process and parse Avro data matters. This article demonstrates how to generate Avro data through serialization and how to parse it with FlinkSQL.
This article is a demo of Avro parsing. Currently, FlinkSQL is suitable only for parsing simple Avro data; complex nested Avro data is not supported for the time being.
Scenario introduction
This article covers the following three topics:
How to serialize and generate Avro data
How to deserialize and parse Avro data
How to use FlinkSQL to parse Avro data
Prerequisites
Know what Avro is; see the quick start guide on the official Apache Avro website
Understand the application scenarios of Avro
Operation steps
1. Create a new avro maven project and configure the pom dependency
The content of the pom file is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.huawei.bigdata</groupId>
    <artifactId>avrodemo</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro</artifactId>
            <version>1.8.1</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.avro</groupId>
                <artifactId>avro-maven-plugin</artifactId>
                <version>1.8.1</version>
                <executions>
                    <execution>
                        <phase>generate-sources</phase>
                        <goals>
                            <goal>schema</goal>
                        </goals>
                        <configuration>
                            <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
                            <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.6</source>
                    <target>1.6</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
Note: The pom file above configures where the classes are generated automatically. The plug-in reads schemas from ${project.basedir}/src/main/avro/ and writes to ${project.basedir}/src/main/java/. With this configuration, running the mvn command makes the plug-in generate class files from the avsc schemas in the former directory and place them in the latter. If the avro directory does not exist, create it manually.
2. Define schema
Avro schemas are defined in JSON. A schema consists of primitive types (null, boolean, int, long, float, double, bytes, and string) and complex types (record, enum, array, map, union, and fixed). For example, to define a user schema, create an avro directory under the main directory, and then create a new file user.avsc in the avro directory:
{
  "namespace": "lancoo.ecbdc.pre",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}
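As a rough illustration of what this schema means once data is serialized, the sketch below hand-encodes the body of one record following the Avro binary encoding rules: zig-zag variable-length integers for int and long, a length prefix followed by UTF-8 bytes for string, and a branch index written before each union value. The class AvroEncodingSketch and its methods are hypothetical names used only for illustration, the sample values are arbitrary, and it needs Java 7 or later. Real projects should rely on the Avro library's own encoders rather than on code like this.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: hand-rolled Avro binary encoding for one User record body.
class AvroEncodingSketch {

    // Avro writes int and long values as zig-zag varints, 7 bits per byte.
    static void writeLong(ByteArrayOutputStream out, long value) {
        long zigzag = (value << 1) ^ (value >> 63); // small magnitudes map to small unsigned numbers
        while ((zigzag & ~0x7FL) != 0) {
            out.write((int) ((zigzag & 0x7F) | 0x80)); // low 7 bits plus continuation bit
            zigzag >>>= 7;
        }
        out.write((int) zigzag);
    }

    // A string is its UTF-8 byte length (encoded as a long) followed by the bytes.
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Fields are written in schema order. Both unions list the non-null branch
    // first, so branch index 0 means "value follows" and index 1 means null.
    static byte[] encodeUser(String name, Integer favoriteNumber, String favoriteColor) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, name);
        if (favoriteNumber != null) {
            writeLong(out, 0);
            writeLong(out, favoriteNumber);
        } else {
            writeLong(out, 1); // the null branch carries no payload
        }
        if (favoriteColor != null) {
            writeLong(out, 0);
            writeString(out, favoriteColor);
        } else {
            writeLong(out, 1);
        }
        return out.toByteArray();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encodeUser("Ben", 7, "red")));
    }
}
```

For the sample record ("Ben", 7, "red") this prints 06 42 65 6e 00 0e 00 06 72 65 64: a length-prefixed "Ben", branch 0 plus the zig-zag form of 7, then branch 0 plus a length-prefixed "red". The field names never appear in the record body, which is why a reader needs the schema to interpret it.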
3. Compile schema
In the Maven Projects panel, click compile for the project. The namespace path and the User class code are created automatically.
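The generated class lands in a directory derived from the schema's namespace, because Java packages map to directories. A minimal sketch of that mapping, assuming the outputDirectory configured in the pom of step 1 (GeneratedSourcePath is a hypothetical helper written for illustration, not part of the Avro plug-in):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: shows where a generated record class is expected to appear.
class GeneratedSourcePath {

    // Each dot-separated part of the namespace becomes one directory level.
    static Path generatedFile(String outputDirectory, String namespace, String recordName) {
        return Paths.get(outputDirectory, namespace.split("\\.")).resolve(recordName + ".java");
    }

    public static void main(String[] args) {
        System.out.println(generatedFile("src/main/java", "lancoo.ecbdc.pre", "User"));
    }
}
```

With the schema from step 2, the place to look after compiling is therefore src/main/java/lancoo/ecbdc/pre/User.java.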
4. Serialization
Create a TestUser class that serializes the records and generates the data
import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;

import lancoo.ecbdc.pre.User;

public class TestUser {
    public static void main(String[] args) throws IOException {
        User user1 = new User();
        user1.setName("Alyssa");
        user1.setFavoriteNumber(256);
        // Leave favorite color null

        // Alternate constructor
        User user2 = new User("Ben", 7, "red");

        // Construct via builder
        User user3 = User.newBuilder()
                .setName("Charlie")
                .setFavoriteColor("blue")
                .setFavoriteNumber(null)
                .build();

        // Serialize user1, user2 and user3 to disk
        DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
        DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
        dataFileWriter.create(user1.getSchema(), new File("user_generic.avro"));
        dataFileWriter.append(user1);
        dataFileWriter.append(user2);
        dataFileWriter.append(user3);
        dataFileWriter.close();
    }
}
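After the serialization program runs, the Avro data file is generated in the same directory as the project
The content of user_generic.avro is as follows: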
Objavro.schema�{"type":"record","name":"User","namespace":"lancoo.ecbdc.pre","fields":[{"name":"name","type":"string"},{"name":"favorite_number","type":["int","null"]},{"name":"favorite_color","type":["string","null"]}]}
At this point, the Avro data has been generated.
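The Obj at the very start of the file is not accidental: an Avro object container file begins with the four bytes 'O', 'b', 'j', 1, followed by metadata that includes the schema. A quick sanity check along those lines could look like the sketch below. AvroMagicCheck is a hypothetical helper written with the JDK only, it checks nothing beyond those first four bytes, and it needs Java 7 or later (newer than the 1.6 target in the pom above).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: checks only the 4-byte magic of an Avro object container file.
class AvroMagicCheck {

    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    // True if the given bytes start with the Avro container magic.
    static boolean hasAvroMagic(byte[] firstBytes) {
        return firstBytes.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOf(firstBytes, MAGIC.length), MAGIC);
    }

    // Reads just enough of the file to compare against the magic.
    static boolean looksLikeAvroFile(String path) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            byte[] header = new byte[MAGIC.length];
            int off = 0;
            while (off < header.length) {
                int n = in.read(header, off, header.length - off);
                if (n < 0) {
                    return false; // file is shorter than the magic
                }
                off += n;
            }
            return hasAvroMagic(header);
        }
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "user_generic.avro";
        System.out.println(path + " looks like an Avro container file: " + looksLikeAvroFile(path));
    }
}
```

A check like this can catch an empty or truncated file before it is uploaded anywhere, but it does not validate the schema or the data blocks.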
5. Deserialization
Parse the Avro data with the deserialization code
// Additional imports needed:
// import org.apache.avro.file.DataFileReader;
// import org.apache.avro.io.DatumReader;
// import org.apache.avro.specific.SpecificDatumReader;

// Deserialize Users from disk
DatumReader<User> userDatumReader = new SpecificDatumReader<User>(User.class);
DataFileReader<User> dataFileReader = new DataFileReader<User>(new File("user_generic.avro"), userDatumReader);
User user = null;
while (dataFileReader.hasNext()) {
    // Reuse user object by passing it to next(). This saves us from
    // allocating and garbage collecting many objects for files with
    // many items.
    user = dataFileReader.next(user);
    System.out.println(user);
}
// Release the underlying file handle
dataFileReader.close();
Run the deserialization code to parse user_generic.avro
The Avro data is parsed successfully.
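To make the role of the schema during parsing more concrete, here is a hand-written sketch of what decoding one record body involves for the User schema: read a length-prefixed string, then a union branch index followed by an int, then another branch index followed by a string. AvroDecodingSketch is a hypothetical class using only the JDK (Java 7 or later), and the input bytes are a hand-built sample rather than bytes copied from user_generic.avro. It ignores the container file's header, block framing, and sync markers, all of which DataFileReader handles in real code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes the body of one User record by hand.
class AvroDecodingSketch {

    // Reads a zig-zag varint, the encoding Avro uses for int and long.
    static long readLong(ByteBuffer in) {
        long raw = 0;
        int shift = 0;
        int b;
        do {
            b = in.get() & 0xFF;
            raw |= (long) (b & 0x7F) << shift; // 7 payload bits per byte
            shift += 7;
        } while ((b & 0x80) != 0);             // high bit set means more bytes follow
        return (raw >>> 1) ^ -(raw & 1);       // undo the zig-zag mapping
    }

    // Reads a length-prefixed UTF-8 string.
    static String readString(ByteBuffer in) {
        byte[] utf8 = new byte[(int) readLong(in)];
        in.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Field order and union branch order come from the schema, not from the bytes.
    static String decodeUser(byte[] body) {
        ByteBuffer in = ByteBuffer.wrap(body);
        String name = readString(in);
        Integer number = readLong(in) == 0 ? Integer.valueOf((int) readLong(in)) : null;
        String color = readLong(in) == 0 ? readString(in) : null;
        return "{name: " + name + ", favorite_number: " + number + ", favorite_color: " + color + "}";
    }

    public static void main(String[] args) {
        // Sample body for ("Ben", 7, "red"), built by hand for this sketch.
        byte[] body = {0x06, 'B', 'e', 'n', 0x00, 0x0e, 0x00, 0x06, 'r', 'e', 'd'};
        System.out.println(decodeUser(body));
    }
}
```

Because the bytes carry no field names or type tags, a reader with a different field order or union order would misinterpret them, which is one reason the container file stores the writer's schema in its header.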
6. Upload user_generic.avro to an HDFS path
hdfs dfs -mkdir -p /tmp/lztest/
hdfs dfs -put user_generic.avro /tmp/lztest/
7. Configure FlinkServer
- Prepare the Avro jar packages
Put flink-sql-avro-*.jar and flink-sql-avro-confluent-registry-*.jar into the FlinkServer lib directory by running the following commands on all FlinkServer nodes
cp /opt/huawei/Bigdata/FusionInsight_Flink_8.1.2/install/FusionInsight-Flink-1.12.2/flink/opt/flink-sql-avro*.jar /opt/huawei/Bigdata/FusionInsight_Flink_8.1.3/install/FusionInsight-Flink-1.12.2/flink/lib
chmod 500 flink-sql-avro*.jar
chown omm:wheel flink-sql-avro*.jar
- Restart the FlinkServer instance, and after the restart completes, check whether the Avro packages have been uploaded
hdfs dfs -ls /FusionInsight_FlinkServer/8.1.2-312005/lib
8. Write the FlinkSQL
CREATE TABLE testHdfs(
  name String,
  favorite_number int,
  favorite_color String
) WITH(
  'connector' = 'filesystem',
  'path' = 'hdfs:///tmp/lztest/user_generic.avro',
  'format' = 'avro'
);

CREATE TABLE KafkaTable (
  name String,
  favorite_number int,
  favorite_color String
) WITH (
  'connector' = 'kafka',
  'topic' = 'testavro',
  'properties.bootstrap.servers' = '96.10.2.1:21005',
  'properties.group.id' = 'testGroup',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'avro'
);

insert into KafkaTable select * from testHdfs;
Save and submit the job
9. Check whether the corresponding topic contains data
FlinkSQL parsed the Avro data successfully.