
Understanding SPL, an open source Java library for structured data processing

WBOY
2022-05-24 13:34:59

This article introduces SPL, an open source library for structured data processing in Java, and looks at what an ideal structured data processing class library for Java should provide.


Modern Java application architecture increasingly emphasizes the separation of data storage and processing in order to obtain better maintainability, scalability and portability; the popular microservices are an example. This architecture usually requires business logic to be implemented in Java programs rather than placed in the database, as it was in traditional application architecture.

Most of the business logic in applications involves structured data processing. The database (SQL) has rich support for such tasks, and business logic can be implemented relatively easily. However, Java has always lacked such basic support, making it very cumbersome and inefficient to implement business logic in Java. As a result, despite various architectural advantages, development efficiency has dropped significantly.

If Java had a complete class library for structured data processing and calculation, this problem would be solved: we could enjoy the advantages of the architecture without sacrificing development efficiency.

What kind of abilities are needed?

What characteristics should an ideal structured data processing class library under Java have? We can summarize it from SQL:

1 Set computing capability

Structured data usually appears in batches (in the form of sets). To calculate on this kind of data conveniently, sufficient set computing capabilities are required.

Without a set operation library, there is only the basic array type (roughly equivalent to a set). Even a simple sum of set members takes four or five lines of loop statements, and operations such as filtering and grouped aggregation can take hundreds of lines of code.

SQL provides a rich set of set operations: aggregations such as SUM and COUNT, WHERE for filtering, GROUP for grouping, and basic set operations such as intersection, union and difference. Code written this way is much shorter.
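
To make the contrast concrete, here is a minimal Java sketch of a sum and a grouped sum written with only arrays, loops and a map. The class name, method names and sample data are hypothetical and chosen for illustration; in SQL each of these is a single SUM or GROUP BY clause.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LoopAggregationSketch {
    // Sum of array members: even the simplest aggregate needs an explicit loop.
    static double sum(double[] values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    // Hand-written counterpart of "SELECT key, SUM(amount) ... GROUP BY key".
    static Map<String, Double> sumByKey(String[] keys, double[] amounts) {
        Map<String, Double> totals = new LinkedHashMap<>();
        for (int i = 0; i < keys.length; i++) {
            // Add the amount to the running total of its key.
            totals.merge(keys[i], amounts[i], Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] clients = {"A", "B", "A"};
        double[] amounts = {10.0, 20.0, 5.0};
        System.out.println(sum(amounts));
        System.out.println(sumByKey(clients, amounts));
    }
}
```

Filtering, sorting and joins each need a similar hand-written loop, which is how the line count grows.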

2 Lambda syntax

Is it enough to have set operation capabilities? If we develop a batch of set operation libraries for Java, can we achieve the effect of SQL?

It’s not that simple!

Take filtering as an example. Filtering needs a condition, and keeps the set members that satisfy it. In SQL this condition takes the form of an expression: writing WHERE x>0 keeps the members for which x>0 evaluates to true. The expression x>0 is not evaluated before the statement is executed; it is evaluated for each set member during iteration. In essence, the expression is a function that takes the current set member as its parameter, and the WHERE operation receives this expression-defined function as its own parameter.

This way of writing is known as Lambda syntax, or functional style.

Without Lambda syntax, we would often have to temporarily define functions, which would make the code very cumbersome and prone to name conflicts.

Lambda syntax is used widely in SQL: not only in operations that require it, such as filtering and grouping, but also where it is optional, such as calculated columns, which greatly simplifies the code.
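
The idea that the condition is really a function passed to the filtering operation can be sketched in Java as follows. The `where` method and its class are hypothetical names for this illustration, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class LambdaFilterSketch {
    // A WHERE-like operation: the condition arrives as a function and is
    // evaluated once per member during the iteration, not before the call.
    static List<Integer> where(List<Integer> members, Predicate<Integer> condition) {
        List<Integer> kept = new ArrayList<>();
        for (Integer member : members) {
            if (condition.test(member)) {
                kept.add(member);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer> data = List.of(-2, 0, 3, 7);
        // The lambda plays the role of the expression in "WHERE x > 0".
        System.out.println(where(data, x -> x > 0));
    }
}
```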

3 Directly refer to fields in Lambda syntax

Structured data is not a simple single value, but a record with fields.

When referencing record fields in SQL expression parameters, in most cases the field name can be used directly without specifying the record it belongs to. Only when several fields share the same name is the table name (or alias) needed to distinguish them.

Although newer versions of Java also support Lambda syntax, they can only pass the current record as a parameter into the function defined with the lambda, and formulas must then always carry that record. For example, when calculating the amount from unit price and quantity, if the parameter representing the current member is named x, the formula must be written in the long-winded form "x.unitPrice * x.quantity". In SQL it can be written more intuitively as "unitPrice * quantity".
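
A short Java sketch of the unit price and quantity example shows the record parameter being repeated for every field access. The `OrderLine` record and its field names are assumptions made for this illustration.

```java
import java.util.List;

class FieldReferenceSketch {
    // Hypothetical record holding the two fields from the example in the text.
    record OrderLine(double unitPrice, int quantity) {}

    // The lambda parameter x must prefix every field access.
    static double totalAmount(List<OrderLine> lines) {
        return lines.stream()
                .mapToDouble(x -> x.unitPrice() * x.quantity())
                .sum();
    }

    public static void main(String[] args) {
        List<OrderLine> lines = List.of(new OrderLine(2.5, 4), new OrderLine(10.0, 1));
        System.out.println(totalAmount(lines));
    }
}
```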

4 Dynamic data structure

SQL can also support dynamic data structures very well.

In structured data calculations, the return value is often structured data too, and the structure of the result depends on the operation, so it cannot be prepared before the code is written. Support for dynamic data structures is therefore necessary.

Any SELECT statement in SQL generates a new data structure. Fields can be added and removed freely in the code without defining the structure (class) in advance. This is not possible in languages like Java, where every structure (class) used must be defined at compile time and, in principle, new structures cannot be generated dynamically during execution.
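
As a rough illustration, the Java sketch below projects two fields out of an order record in the two ways usually available: a result type declared in advance, or an untyped map. All type and field names here are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class ResultStructureSketch {
    record Order(int orderId, double amount, int year) {}

    // Every new result shape needs its own type, declared before compilation.
    record OrderAmount(int orderId, double amount) {}

    static List<OrderAmount> project(List<Order> orders) {
        return orders.stream()
                .map(o -> new OrderAmount(o.orderId(), o.amount()))
                .collect(Collectors.toList());
    }

    // The untyped alternative: field names become strings the compiler cannot check.
    static List<Map<String, Object>> projectAsMaps(List<Order> orders) {
        return orders.stream()
                .map(o -> {
                    Map<String, Object> row = new LinkedHashMap<>();
                    row.put("orderId", o.orderId());
                    row.put("amount", o.amount());
                    return row;
                })
                .collect(Collectors.toList());
    }
}
```

Neither option matches a SELECT list, where the result structure simply follows from the expressions written.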

5 Interpreted Language

From the analysis in the previous sections, we can already conclude that Java itself is not well suited to structured data processing. Its Lambda mechanism does not support feature 3, and as a compiled language it cannot implement feature 4.

In fact, the Lambda syntax described earlier is not well suited to compiled languages. The compiler cannot tell whether an expression written in a parameter position should be evaluated on the spot and its value passed on, or compiled whole into a function and passed, so extra syntax symbols have to be designed to distinguish the two. Interpreted languages do not have this problem: the function itself can decide whether a parameter expression is evaluated first or evaluated for each set member during traversal.
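
In Java the distinguishing symbol is the arrow. The following sketch, with hypothetical names, places the two readings of the same expression side by side.

```java
import java.util.function.IntPredicate;

class EvaluationTimingSketch {
    // Counts the members for which the condition, evaluated per member, holds.
    static int count(int[] members, IntPredicate condition) {
        int n = 0;
        for (int m : members) {
            if (condition.test(m)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        int x = 5;
        // Written bare, the expression is evaluated on the spot; only a boolean is kept.
        boolean evaluatedNow = x > 0;
        // The arrow is the extra syntax that tells the compiler to build a function instead.
        IntPredicate evaluatedLater = v -> v > 0;
        System.out.println(evaluatedNow);
        System.out.println(count(new int[]{-1, 2, 3}, evaluatedLater));
    }
}
```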

SQL is indeed an interpreted language.

Introducing SPL

Stream is the structured data processing library officially introduced in Java 8, but it does not meet the requirements above. It has no professional structured data type, lacks many important structured data calculation functions, is not an interpreted language, does not support dynamic data types, and has a complex Lambda interface.
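
A small sketch of a grouped sum with Stream illustrates the point about data types: the result comes back as a Map, so any further step has to work through Map.Entry rather than through records. The `Order` record and the method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class StreamGroupingSketch {
    record Order(String client, int year, double amount) {}

    // Group-and-sum with Stream: the result is a Map, not another record collection.
    static Map<Integer, Double> totalByYear(List<Order> orders) {
        return orders.stream()
                .collect(Collectors.groupingBy(Order::year, TreeMap::new,
                        Collectors.summingDouble(Order::amount)));
    }

    // Processing the grouped result further means switching to Map.Entry accessors.
    static List<Integer> yearsAbove(Map<Integer, Double> totals, double threshold) {
        return totals.entrySet().stream()
                .filter(e -> e.getValue() > threshold)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```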

Kotlin is part of the Java ecosystem. It makes slight improvements on Stream and also provides structured data calculation types. However, it has too few structured data calculation functions, is not an interpreted language, does not support dynamic data types, and has a complex Lambda interface, so it is still not an ideal structured data calculation class library.

Scala provides a rich set of structured data calculation functions, but as a compiled language it too cannot become an ideal structured data calculation class library.

So, what else can be used in the Java ecosystem?

esProc SPL.

SPL is a programming language interpreted and executed by Java. It has a rich structured data calculation class library, simple Lambda syntax, and convenient, easy-to-use dynamic data structures. It is an ideal structured data processing class library for Java.

Rich set operation functions

SPL provides a professional structured data type, the sequence table. Like a SQL data table, a sequence table is a collection of records and has the general functions of a structured data type. Some examples follow.

Parse the source data and generate the sequence table:

Orders=T("d:/Orders.csv")

Generate a new sequence table from the original sequence table by column name:

Orders.new(OrderID, Amount, OrderDate)

Calculated column:

Orders.new(OrderID, Amount, year(OrderDate))

Field rename:

Orders.new(OrderID:ID, SellerId, year(OrderDate):y)

Use fields by serial number:

Orders.groups(year(_5),_2; sum(_4))

Sequence table alias (left join):

join@1(Orders:o,SellerId ; Employees:e,EId).groups(e.Dept; sum(o.Amount))

A sequence table supports all structured calculation functions, and the calculation result is also a sequence table rather than a data type such as Map. For example, to continue processing the result of a grouped summary as structured data:

Orders.groups(year(OrderDate):y; sum(Amount):m).new(y:OrderYear, m*0.2:discount)

Based on the sequence table, SPL provides a wealth of structured data calculation functions, such as filtering, sorting, grouping, deduplication, renaming, calculated columns, joins, subqueries, set calculations and ordered calculations. These functions are powerful enough to complete calculations on their own, without hard-coded assistance:

Combined query:

Orders.select(Amount>1000 && Amount<=3000 && like(Client,"*bro*"))

Sort:

Orders.sort(-Client,Amount)

Group summary:

Orders.groups(year(OrderDate),Client; sum(Amount))

Inner join:

join(Orders:o,SellerId ; Employees:e,EId).groups(e.Dept; sum(o.Amount))

Simple Lambda syntax

SPL supports simple Lambda syntax. There is no need to define a function name and function body; an expression can be used directly as a function parameter, as in filtering:

Orders.select(Amount>1000)

When the business logic changes, there is no need to refactor a function; simply modify the expression:

Orders.select(Amount>1000 && Amount<2000)

Because SPL is an interpreted language, parameter expressions do not need explicitly declared parameter types, which keeps the Lambda interface simple. For example, to compute a sum of squares, squaring each value as part of the summation, you can write it intuitively:

Orders.sum(Amount*Amount)

Similar to SQL, SPL syntax also supports the direct use of field names in single-table calculations:

Orders.sort(-Client, Amount)

Dynamic Data Structure

SPL is an interpreted language that naturally supports dynamic data structures and can dynamically generate new sequence tables based on the structure of the calculation result. It is particularly suitable for calculated columns, group summaries and joins. For example, to run a further calculation directly on a grouped summary:

Orders.groups(Client;sum(Amount):amt).select(amt>1000 && like(Client,"*S*"))

or directly on the result of a join:

join(Orders:o,SellerId ; Employees:e,Eid).groups(e.Dept; sum(o.Amount))

More complex calculations are usually divided into multiple steps, and almost every intermediate result has a different data structure. SPL supports dynamic data structures, so these intermediate structures need not be defined first. For example, based on one year's customer payment records, find the customers who rank in the top 10 by payment amount in every month:

Sales2021.group(month(sellDate)).(~.groups(Client;sum(Amount):sumValue)).(~.sort(-sumValue)).(~.select(#<=10)).(~.(Client)).isect()
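
For comparison, here is one way the same calculation might look in plain Java with Stream. This is a sketch under stated assumptions: the `Payment` record, its fields and the method name are hypothetical, and ties in the ranking are broken arbitrarily.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

class MonthlyTopClientsSketch {
    record Payment(String client, int month, double amount) {}

    // Clients that rank in the top n by total payment in every month present in the data.
    static Set<String> topInEveryMonth(List<Payment> payments, int n) {
        // Step 1: month -> (client -> total amount).
        Map<Integer, Map<String, Double>> totals = payments.stream()
                .collect(Collectors.groupingBy(Payment::month,
                        Collectors.groupingBy(Payment::client,
                                Collectors.summingDouble(Payment::amount))));

        Set<String> result = null;
        for (Map<String, Double> monthTotals : totals.values()) {
            // Step 2: sort this month's clients by total, descending, and keep the first n.
            Set<String> top = monthTotals.entrySet().stream()
                    .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                    .limit(n)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toSet());
            // Step 3: intersect with the months processed so far.
            if (result == null) {
                result = new HashSet<>(top);
            } else {
                result.retainAll(top);
            }
        }
        return result == null ? new HashSet<>() : result;
    }
}
```

Each step changes the shape of the data (records, nested maps, entry sets, sets of names), which is where the typing overhead described above shows up.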

Directly execute SQL

SPL also implements a SQL interpreter that can execute SQL directly, supporting everything from basic WHERE and GROUP to JOIN and even WITH:

$select * from d:/Orders.csv where (OrderDate=date('2020-12-31') and Amount>100)

$select year(OrderDate), Client, sum(Amount), count(1) from d:/Orders.csv
group by year(OrderDate), Client
having sum(Amount)<=100

$select o.OrderId, o.Client, e.Name, e.Dept from d:/Orders.csv o
join d:/Employees.csv e on o.SellerId=e.Eid

$with t as (select Client, sum(amount) s from d:/Orders.csv group by Client)
select t.Client, t.s, ct.Name, ct.address from t
left join ClientTable ct on t.Client=ct.Client

More language advantages

As a professional structured data processing language, SPL not only covers all the computing capabilities of SQL but also has stronger advantages as a language:

Discreteness and the more thorough set orientation it supports

Set orientation is a basic feature of SQL: data takes part in operations in the form of sets. However, SQL's discreteness is very poor: all set members must take part in an operation as a whole and cannot be separated from the set. High-level languages such as Java, by contrast, support discreteness well, and array members can be operated on independently.

However, more thorough set orientation needs discreteness to support it: set members can exist apart from their set and be freely combined with other data into new sets that take part in operations.

SPL combines the set orientation of SQL with the discreteness of Java, and so achieves a more thorough set orientation.

For example, SPL can easily express a "set of sets", which suits calculations after grouping. For instance, to find the students who rank in the top 10 in every subject:


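     A
1    =T("score.csv").group(subject)
2    =A1.(~.rank(score).pselect@a(~<=10))
3    =A1.(~(A2(#)).(name)).isect()

The fields of an SPL sequence table can store records or sets of records, so association relationships can be expressed intuitively through object references, however many relationships there are. For example, to find the male employees who report to female managers, based on the employee table:

Employees.select(性别:"男",部门.经理.性别:"女")

(Here 性别 is the gender field, 部门 the department field and 经理 the manager field; 男 means male and 女 means female.)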

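Ordered calculation is a typical product of combining discreteness with set orientation. The order of members only has meaning within a set, which requires set orientation; ordered calculation must also distinguish each member from its neighbors, which emphasizes discreteness. SPL has both, so it supports ordered calculation naturally.

Specifically, SPL can reference members by absolute position. For example, the third order can be written as Orders(3), and the first, third and fifth records as Orders([1,3,5]).

SPL can also reference members by relative position. For example, to calculate each record's amount growth rate over the previous record: Orders.derive(amount/amount[-1]-1)

SPL can also use # to represent the sequence number of the current record. For example, to split employees into two groups by sequence number, odd numbers in one group and even numbers in the other: Employees.group(#%2==1)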
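
For comparison, a rough Java sketch of the last two calculations needs explicit index arithmetic. The names are hypothetical, and leaving the first growth rate as null is an assumption made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class OrderedAccessSketch {
    // Growth rate of each amount over the previous one. The first member has no
    // predecessor, so its slot is left as null (an assumption of this sketch).
    static List<Double> growthRates(double[] amounts) {
        List<Double> rates = new ArrayList<>();
        for (int i = 0; i < amounts.length; i++) {
            if (i == 0) {
                rates.add(null);
            } else {
                rates.add(amounts[i] / amounts[i - 1] - 1);
            }
        }
        return rates;
    }

    // Splits members by 1-based position: odd positions first, even positions second.
    static <T> List<List<T>> splitByParity(List<T> members) {
        List<T> odd = new ArrayList<>();
        List<T> even = new ArrayList<>();
        for (int i = 0; i < members.size(); i++) {
            if ((i + 1) % 2 == 1) {
                odd.add(members.get(i));
            } else {
                even.add(members.get(i));
            }
        }
        return List.of(odd, even);
    }
}
```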

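More convenient function syntax

Having a large number of powerful structured data calculation functions should be a good thing, but it makes functions with similar features hard to tell apart, which quietly raises the learning difficulty.

SPL provides a distinctive function option syntax: functions with similar features can share one function name and use only function options to mark the differences. For example, the basic function of select is filtering; to filter out only the first record that meets the condition, just use the option @1:

Orders.select@1(Amount>1000)

With larger data volumes, parallel computing improves performance; just change the option to @m:

Orders.select@m(Amount>1000)

For sorted data, binary search can be used for fast filtering with @b:

Orders.select@b(Amount>1000)

Function options can also be combined, for example:

Orders.select@1b(Amount>1000)

The parameters of structured calculation functions are often complex. SQL, for example, needs various keywords to separate the parameters of one statement into groups, but this uses up many keywords and makes statement structure inconsistent.

SPL supports hierarchical parameters: semicolons, commas and colons divide parameters into three levels from high to low, simplifying the expression of complex parameters in a general way:

join(Orders:o,SellerId ; Employees:e,EId)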
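
To show how the three levels nest, here is a deliberately naive Java sketch that splits a flat parameter string on the three separators. It only illustrates the layering; it is not how SPL parses parameters, and it ignores quoting and nested parentheses.

```java
import java.util.ArrayList;
import java.util.List;

class LayeredParamsSketch {
    // Splits a parameter string into three levels: ';' first, then ',', then ':'.
    // Naive on purpose: nesting, quoting and parentheses are not handled.
    static List<List<List<String>>> parse(String params) {
        List<List<List<String>>> groups = new ArrayList<>();
        for (String group : params.split(";")) {
            List<List<String>> items = new ArrayList<>();
            for (String item : group.split(",")) {
                List<String> parts = new ArrayList<>();
                for (String part : item.split(":")) {
                    parts.add(part.trim());
                }
                items.add(parts);
            }
            groups.add(items);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Two top-level groups, each holding a table with its alias and a join field.
        System.out.println(parse("Orders:o,SellerId ; Employees:e,EId"));
    }
}
```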

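Extended Lambda syntax

Ordinary Lambda syntax must not only specify the expression (that is, the parameter in function form) but also fully define the expression's own parameters; otherwise it is not mathematically rigorous. This makes Lambda syntax cumbersome. For example, to filter set A with the loop function select and keep only the members whose values are even, the general form is:

A.select(f(x):{x%2==0} )

Here the expression is x%2==0, and the expression's parameter is the x in f(x), which represents a member of set A, that is, the loop variable.

SPL uses the fixed symbol ~ to represent the loop variable, so when the parameter is the loop variable there is no need to define it. In SPL, the Lambda syntax above can be abbreviated as: A.select(~ %2==0)

Ordinary Lambda syntax must define every parameter the expression uses. Besides the loop variable, another commonly used parameter is the loop count; if the loop count is also defined in the Lambda, the code becomes even more cumbersome.

SPL uses the fixed symbol # to represent the loop count variable. For example, to filter set A with select and keep only the members at even positions, SPL can write: A.select(# %2==0)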
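
The Java counterparts of the two filters might look like the sketch below (hypothetical names). The value filter has to name its loop variable, and since Stream exposes no loop counter, the position filter generates the positions explicitly.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class LoopVariableSketch {
    // Keep members whose value is even: the loop variable must be named (x).
    static List<Integer> evenValues(List<Integer> a) {
        return a.stream()
                .filter(x -> x % 2 == 0)
                .collect(Collectors.toList());
    }

    // Keep members whose 1-based position is even: the positions are generated
    // first and then mapped back to the members.
    static List<Integer> evenPositions(List<Integer> a) {
        return IntStream.rangeClosed(1, a.size())
                .filter(i -> i % 2 == 0)
                .mapToObj(i -> a.get(i - 1))
                .collect(Collectors.toList());
    }
}
```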

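Relative positions often appear in more difficult calculations, and relative positions are themselves hard to compute, so when they are needed the parameters become very cumbersome to write.

SPL uses the fixed form [sequence number] to represent a relative position: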


     A                                                           B
1    =T("Orders.txt")                                            /Orders sequence table
2    =A1.groups(year(Date):y,month(Date):m; sum(Amount):amt)     /Group and summarize by year and month
3    =A2.derive(amt/amt[-1]:lrr, amt[-1:1].avg():ma)             /Ratio to the previous period and moving average
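
As a point of comparison, a moving average over the previous, current and next members could be sketched in Java as follows. The method name is hypothetical, and clipping the window at both ends is an assumption about how a range such as amt[-1:1] treats the boundaries.

```java
class MovingWindowSketch {
    // Average over the previous, current and next members, with the window
    // clipped at both ends of the array (an assumption of this sketch).
    static double[] movingAverage(double[] amt) {
        double[] ma = new double[amt.length];
        for (int i = 0; i < amt.length; i++) {
            int from = Math.max(0, i - 1);
            int to = Math.min(amt.length - 1, i + 1);
            double sum = 0;
            for (int j = from; j <= to; j++) {
                sum += amt[j];
            }
            ma[i] = sum / (to - from + 1);
        }
        return ma;
    }
}
```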

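Seamless integration, low coupling, hot swapping

As a scripting language interpreted by Java, SPL provides a JDBC driver and can be seamlessly integrated into Java applications.

Simple statements can be executed directly, just like SQL: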

…
// Load the esProc JDBC driver and open a local connection
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn = DriverManager.getConnection("jdbc:esproc:local://");
// An SPL expression is passed where a SQL statement would normally go
PreparedStatement st = conn.prepareStatement("=T(\"D:/Orders.txt\").select(Amount>1000 && Amount<=3000 && like(Client,\"*S*\"))");
ResultSet result = st.executeQuery();
...

Complex calculations can be saved as script files and called in the same way as stored procedures:

…
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn = DriverManager.getConnection("jdbc:esproc:local://");
// Call the script file splscript1 with two parameters, as with a stored procedure
CallableStatement st = conn.prepareCall("{call splscript1(?, ?)}");
st.setObject(1, 3000);
st.setObject(2, 5000);
ResultSet result = st.executeQuery();
...

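Placing the script outside the Java program reduces code coupling on the one hand; on the other, because scripts are interpreted, it also supports hot swapping: when the business logic changes, modifying the script takes effect immediately, unlike Java, where the whole application often has to be restarted. This mechanism is particularly suitable for writing business logic in a microservice architecture.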


This article is reproduced from csdn.net.