How to print error logs in Java projects
1. Illegal parameters introduced by the upper system. For errors introduced by illegal parameters, errors can be intercepted through parameter verification and precondition verification;
2. Errors generated by interaction with the underlying system. There are two types of errors caused by interaction with the lower layer:
a. The lower layer system is successfully processed, but a communication error occurs, which will lead to data inconsistency between subsystems;
For this situation, a timeout compensation mechanism can be used to record the task in advance and correct the data later through scheduled tasks.
If you have any better design solutions, you can also leave a message.
b. The communication was successful, but an error occurred in the lower layer processing.
In this case, it is necessary to communicate with the lower-level developers to coordinate the interaction between subsystems;
Handle the error appropriately, or return a reasonable prompt message, according to the error code and error description returned by the lower layer.
No matter which case, it is necessary to assume that the reliability of the underlying system is average, and make design considerations for errors.
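For case (a), the timeout compensation idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `TimeoutCompensationSketch`, its `State` enum, and the in-memory ledger are hypothetical; a real system would more likely keep the ledger in a database table and run the reconciliation from a scheduled job.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TimeoutCompensationSketch {

    public enum State { PENDING, CONFIRMED }

    // Hypothetical task ledger, written BEFORE the lower layer is called.
    private final Map<String, State> ledger = new LinkedHashMap<>();

    /** Records the task in advance, then (conceptually) calls the lower layer. */
    public void submit(String taskId, boolean replyArrives) {
        ledger.put(taskId, State.PENDING);
        // ... the call to the lower-layer system would happen here ...
        if (replyArrives) {
            ledger.put(taskId, State.CONFIRMED);
        }
        // If the reply is lost or times out, the task simply stays PENDING.
    }

    /** Meant to run periodically: lists tasks whose real outcome must be re-checked. */
    public List<String> findTasksToReconcile() {
        List<String> pending = new ArrayList<>();
        for (Map.Entry<String, State> entry : ledger.entrySet()) {
            if (entry.getValue() == State.PENDING) {
                pending.add(entry.getKey());
            }
        }
        return pending;
    }

    /** Called by the scheduled job once the lower layer confirms the outcome. */
    public void confirm(String taskId) {
        ledger.put(taskId, State.CONFIRMED);
    }
}
```

The point of the design is that the record exists before the remote call, so a lost reply leaves a visible PENDING entry instead of silent inconsistency.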
3. There is an error in the system processing at this layer.
Causes for errors in this layer of systems:
Cause 1: Negligence. Negligence means that the programmer could have avoided the error but did not. Examples: typing & instead of &&, or = instead of ==; boundary errors; errors in compound logical conditions. Negligence comes either from a lack of concentration (being tired, working overtime all night, writing code during meetings) or from rushing to implement a feature without considering the robustness of the program.
Improvement measures: Use code static analysis tools and unit test line coverage to effectively avoid such problems.
Cause 2: Insufficient error and exception handling. Take a simple problem as an example: when adding two numbers, we must consider not only arithmetic overflow but also illegal input. The former can usually be avoided through understanding, trial and error, or experience; the latter must be restricted to a range we can reason about, for example by using regular expressions to filter out illegal input. The regular expressions themselves must be tested. For illegal input, provide prompt information, reasons, and suggestions that are as detailed, understandable, and friendly as possible.
Improvement measures: Consider various error situations and exception handling as comprehensively as possible. After implementing the main process, add a step: carefully consider various possible errors and exceptions, and return reasonable error codes and error descriptions. Each interface or module effectively handles its own errors and exceptions, which can effectively avoid bugs caused by complex scene interactions.
For example, a business use case is completed by the interaction of scenarios A, B, and C. In an actual execution, A and B succeed but C fails. B then needs to roll back based on the code and message returned by C and return a reasonable code and message to A. A in turn rolls back based on B's return value and returns a reasonable code and message to the client. This is a segmented rollback mechanism, and it requires each scenario to consider how to roll back under abnormal circumstances.
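A minimal sketch of such a segmented rollback, assuming three hypothetical scenario methods and a made-up `Result` type (none of these names come from a real framework). Each scenario undoes only its own work and passes a code and message upward:

```java
import java.util.ArrayList;
import java.util.List;

public class SegmentedRollbackSketch {

    /** Hypothetical result type: an error code plus a readable message. */
    public static final class Result {
        public final String code;
        public final String message;
        public Result(String code, String message) {
            this.code = code;
            this.message = message;
        }
        public boolean ok() {
            return "OK".equals(code);
        }
    }

    // Records the order of actions so the flow can be inspected.
    public static final List<String> trace = new ArrayList<>();

    public static Result scenarioC(boolean fail) {
        if (fail) {
            return new Result("C_FAILED", "scenario C failed");
        }
        trace.add("C done");
        return new Result("OK", "C succeeded");
    }

    public static Result scenarioB(boolean failC) {
        trace.add("B done");
        Result c = scenarioC(failC);
        if (!c.ok()) {
            trace.add("B rolled back"); // B undoes only its own work
            return new Result("B_ROLLED_BACK", "B rolled back because " + c.message);
        }
        return new Result("OK", "B succeeded");
    }

    public static Result scenarioA(boolean failC) {
        trace.add("A done");
        Result b = scenarioB(failC);
        if (!b.ok()) {
            trace.add("A rolled back"); // A undoes only its own work
            return new Result("A_ROLLED_BACK", "A rolled back because " + b.message);
        }
        return new Result("OK", "A succeeded");
    }
}
```

When C fails, the message that reaches the client carries the whole chain ("A rolled back because B rolled back because scenario C failed"), which is the kind of clue the rest of this article argues for.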
Cause 3: Tight logical coupling. When business logic is tightly coupled, the logical relationships become increasingly intricate as the product evolves, and it becomes difficult to see the whole picture. The impact of a local modification then spreads globally and causes unpredictable problems.
Improvement measures: Write short functions and methods, preferably no more than 50 lines each. Write stateless functions and methods that treat global state as read-only, so that the same preconditions always produce the same result and behavior does not depend on external state. Define reasonable structures, interfaces, and logical segments so that interactions between interfaces are as orthogonal and loosely coupled as possible. For the service layer, provide simple and orthogonal interfaces wherever possible. Refactor continuously to keep the application modular and loosely coupled, with clear logical dependencies.
Where a large number of business interfaces interact with each other, sort out the logical flow and interdependencies of each interface and optimize them as a whole. For entities with a large number of states, also sort out the relevant business interfaces and organize the transition relationships between states.
Cause 4: Incorrect algorithm.
Improvement measures: First separate the algorithm from the application. If the algorithm has multiple implementations, errors can be found through cross-checking unit tests (sorting, for example); if the algorithm is reversible, errors can be found through round-trip unit tests (encryption and decryption, for example).
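Both kinds of check can be sketched in a few lines. The example below is illustrative only: `insertionSort` is a hypothetical implementation under test, cross-checked against `Arrays.sort`, and Base64 encoding stands in for a reversible operation such as encryption and decryption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;
import java.util.Random;

public class AlgorithmCheckSketch {

    /** Hypothetical implementation under test: a simple insertion sort. */
    public static int[] insertionSort(int[] input) {
        int[] a = input.clone();
        for (int i = 1; i < a.length; i++) {
            int key = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > key) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
        return a;
    }

    /** Cross-check: compare against a second, independent implementation. */
    public static boolean crossCheckSort(int[] input) {
        int[] expected = input.clone();
        Arrays.sort(expected);
        return Arrays.equals(expected, insertionSort(input));
    }

    /** Reversible check: decoding the encoded value must give back the original. */
    public static boolean roundTrip(String text) {
        byte[] original = text.getBytes(StandardCharsets.UTF_8);
        String encoded = Base64.getEncoder().encodeToString(original);
        byte[] decoded = Base64.getDecoder().decode(encoded);
        return Arrays.equals(original, decoded);
    }

    public static void main(String[] args) {
        // Fixed seed so a failing input can be reproduced.
        int[] data = new Random(42).ints(100, -1000, 1000).toArray();
        System.out.println("sort cross-check passed: " + crossCheckSort(data));
        System.out.println("round trip passed: " + roundTrip("error log"));
    }
}
```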
Cause 5: Parameters of the same type passed in the wrong order. For example, the method is declared as modifyFlow(int rx, int tx), but the actual call is modifyFlow(tx, rx).
Improvement measures: Make types as specific as possible. Use floating-point numbers where floating-point numbers are needed, strings where strings are needed, and specific object types where specific objects are needed. Stagger parameters of the same type as much as possible. If none of this is feasible, verify the call through interface tests, and make sure the test values for the parameters are all different.
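One way to make the types more specific is to wrap each value in its own small class, so that a swapped call no longer compiles. The wrapper classes `RxRate` and `TxRate` below are hypothetical names introduced for this sketch; only `modifyFlow` comes from the example above.

```java
public class TypedParametersSketch {

    /** Receive rate; a distinct type so it cannot be confused with TxRate. */
    public static final class RxRate {
        public final int value;
        public RxRate(int value) {
            this.value = value;
        }
    }

    /** Transmit rate. */
    public static final class TxRate {
        public final int value;
        public TxRate(int value) {
            this.value = value;
        }
    }

    /** With distinct parameter types, modifyFlow(tx, rx) is a compile error. */
    public static String modifyFlow(RxRate rx, TxRate tx) {
        return "rx=" + rx.value + ", tx=" + tx.value;
    }

    public static void main(String[] args) {
        // The two values are deliberately different, so a swap would also show up in tests.
        System.out.println(modifyFlow(new RxRate(100), new TxRate(200)));
    }
}
```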
Cause 6: Null pointer exception. Null pointer exceptions usually occur when an object is not initialized correctly, or when it is used without first checking that it is non-null.
Improvement measures: For configuration objects, check whether they are successfully initialized; for ordinary objects, before obtaining the entity object for use, check whether it is non-null.
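A minimal sketch of both checks, using only `java.util.Objects` and `java.util.Optional` from the standard library. The classes `StorageConfig` and the in-memory `VM_OWNERS` map are hypothetical stand-ins for a configuration object and an entity lookup.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

public class NullCheckSketch {

    /** Hypothetical configuration object that must be initialized at startup. */
    public static final class StorageConfig {
        public final String endpoint;
        public StorageConfig(String endpoint) {
            // Fail fast, with a message that says what is missing.
            this.endpoint = Objects.requireNonNull(endpoint,
                    "storage endpoint is not configured");
        }
    }

    // Hypothetical entity store, standing in for a database lookup.
    private static final Map<String, String> VM_OWNERS = new HashMap<>();
    static {
        VM_OWNERS.put("vm-1", "alice");
    }

    /** Ordinary objects: make absence explicit instead of returning null. */
    public static Optional<String> findOwner(String vmName) {
        return Optional.ofNullable(VM_OWNERS.get(vmName));
    }

    /** The caller is forced to decide what happens when the entity is missing. */
    public static String describeOwner(String vmName) {
        return findOwner(vmName)
                .map(owner -> "vm " + vmName + " is owned by " + owner)
                .orElse("vm " + vmName + " does not exist");
    }
}
```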
Cause 7: Network communication error. Network communication errors are usually caused by network delays, congestion, or blockages. They are usually low-probability events, but low-probability events can cause large-scale failures and bugs that are hard to reproduce.
Improvement measures: Create INFO logs at the end point of the previous subsystem and the entry point of the next subsystem respectively. The time difference between the two provides a clue.
Cause 8: Transaction and concurrency errors. The combination of transactions and concurrency can easily produce errors that are very difficult to locate.
Improvement measures: For concurrent operations in the program that involve shared variables and important status modifications, INFO logs must be added.
If there is a more effective way, please leave a message to point it out.
Cause 9: Configuration error.
Improvement measures: When starting the application or starting the corresponding configuration, detect all configuration items, print the corresponding INFO log, and ensure that all configurations are loaded successfully.
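A startup check of this kind might look like the sketch below. It uses `java.util.logging` only to stay self-contained; the required keys in `REQUIRED_KEYS` are invented for illustration, and a real application would mask secret values before logging them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.logging.Logger;

public class ConfigCheckSketch {

    private static final Logger LOG = Logger.getLogger(ConfigCheckSketch.class.getName());

    // Hypothetical list of configuration items the application cannot run without.
    static final String[] REQUIRED_KEYS = {"db.url", "db.user", "zk.address"};

    /** Logs every item that is present and returns the keys that are missing or blank. */
    public static List<String> findMissingKeys(Properties config) {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED_KEYS) {
            String value = config.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            } else {
                LOG.info("[load config] " + key + " = " + value);
            }
        }
        return missing;
    }

    /** Refuses to start when any required item is missing, naming the items. */
    public static void checkAtStartup(Properties config) {
        List<String> missing = findMissingKeys(config);
        if (!missing.isEmpty()) {
            throw new IllegalStateException("[load config] missing configuration items: "
                    + missing + ", please check the configuration file.");
        }
        LOG.info("[load config] all " + REQUIRED_KEYS.length + " configuration items loaded.");
    }
}
```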
Cause 10: Errors caused by unfamiliarity with the business. In medium and large systems, some business logic and business interactions are relatively complex. The full business logic may exist only in the heads of several developers, each of whom understands just part of it. This can easily lead to business coding errors.
Improvement measures: Design correct business use cases through discussion among several people, and implement the business logic based on those use cases. Fully archive the final business logic and use cases. Annotate each business interface with its preconditions, processing logic, post-checks, and caveats, and update these annotations whenever the business changes. Conduct code reviews. Business annotations are important documentation for business interfaces and act as a valuable cache for business understanding.
Cause 11: Errors caused by design issues. For example, the synchronous serial method will have performance and slow response problems, while the concurrent asynchronous method can solve the performance and slow response problems, but it will bring hidden dangers to security and correctness. The asynchronous approach will lead to changes in the programming model, adding new issues such as asynchronous message pushing and receiving. Using cache can improve performance, but there will be problems with cache updates.
Improvement measures: Write and carefully review design documents. The design document must describe the background, the requirements, the business goals to be met, the business performance indicators to be achieved, the overall design idea, and the detailed plan, and it must anticipate the advantages, disadvantages, and possible impacts of the plan. Through testing and acceptance, ensure that the design actually meets the business goals and business performance indicators.
Cause 12: Errors caused by unknown details, such as buffer overflows and SQL injection attacks. From a functional point of view there is no problem, but from the point of view of malicious use there are loopholes. As another example, if you choose the Jackson library for JSON parsing, then by default parsing fails when the JSON contains fields that the target class does not declare, which happens as soon as new fields are added on the sending side. The @JsonIgnoreProperties(ignoreUnknown = true) annotation must be added to the class to cope with such changes. Other JSON libraries may not have this problem.
Improvement measures: On the one hand, it is necessary to accumulate experience, on the other hand, consider security issues and exceptions, and choose mature and rigorously tested libraries.
Cause 13: Bugs that appear over time. It's not uncommon for solutions that seemed great in the past to become unwieldy or even useless in current or future scenarios. For example, encryption and decryption algorithms may have been considered perfect in the past, but should be used with caution after being cracked.
Improvement measures: Pay attention to changes and vulnerability repair news, and promptly correct outdated codes, libraries, and behaviors.
Cause 14: Hardware-related errors. For example, memory leaks, insufficient storage space, OutOfMemoryError, etc.
Improvement measures: Increase the performance monitoring of important indicators such as CPU/memory/network of the application system.
Common errors that occur in the system:
The entity record does not exist in the database: specify which entity it is, or its identifier;
The entity configuration is incorrect: specify which configuration item is wrong and what the correct configuration should be;
The entity resource does not meet the conditions: specify what the current resource is and what the resource requirements are;
The preconditions for the entity operation are not met: specify which preconditions must be met and what the current status is;
The post-check of the entity operation is not satisfied: specify which post-check must be met and what the current status is;
A performance problem causes a timeout: specify what causes the performance problem and how to optimize it in the future;
Communication errors between multiple subsystems lead to inconsistent status or data.
Generally, errors that are difficult to locate will appear in relatively low-level places. Because the bottom layer cannot predict the specific business scenario, the error messages given are relatively general.
This requires providing as many clues as possible at the upper level of the business. The error must be caused by the preconditions not being met on a certain layer of the stack during the interaction of multiple systems or layers. When programming, try to ensure that all necessary preconditions are met in each layer of the stack, avoid passing wrong parameters to the bottom layer as much as possible, and intercept errors at the business layer as much as possible.
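As an illustration of intercepting errors at the business layer, the sketch below checks preconditions before anything is passed further down. The operation `resizeDisk` and its rules are hypothetical; the point is that each failure message names the operation, the offending values, and what is expected.

```java
public class PreconditionSketch {

    /** Hypothetical business-layer entry point for enlarging a disk. */
    public static String resizeDisk(String diskId, int currentSizeGb, int newSizeGb) {
        if (diskId == null || diskId.isEmpty()) {
            throw new IllegalArgumentException(
                    "[resize disk] disk id is empty, please pass a valid disk id.");
        }
        if (newSizeGb <= currentSizeGb) {
            // State the precondition AND the current status, not just "invalid size".
            throw new IllegalArgumentException(String.format(
                    "[resize disk] new size %d GB must be larger than current size %d GB for disk %s.",
                    newSizeGb, currentSizeGb, diskId));
        }
        // Only now would the request be handed to the lower layer.
        return "resize " + diskId + " to " + newSizeGb + " GB";
    }
}
```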
Most errors are caused by a combination of reasons, but every error has its cause. After resolving an error, analyze in depth how it occurred and how to prevent it from happening again. You can succeed with hard work, but only by reflection can you make progress!
How to write an error log that makes it easier to troubleshoot problems
Basic principles of error logging:
As complete as possible. Each error log fully describes: what error occurred in what scenario, what is the reason (or possible reasons), how to solve it (or solution tips);
As specific as possible. For example, if NC resources are insufficient, state exactly which resource is meant and whether the program can name it directly; for general errors such as VM NOT EXIST, specify the scenario in which the error occurs, which may also help later statistical work.
Be as direct as possible. The ideal error log should allow people to know the cause and how to solve it at the first intuition, rather than having to go through several steps to find the real cause.
Integrate existing experience directly into the system. All problems and experiences that have been solved should be integrated into the system in as friendly a way as possible to give new employees better tips, rather than being buried elsewhere.
The layout should be neat and orderly, and the format should be unified and standardized. Dense, essay-like logs are heart-wrenching to look at. They are quite unfriendly and inconvenient for troubleshooting problems.
Use multiple keywords to uniquely identify the request, and highlight them: time, entity identifier (such as vmname), and operation name.
Basic steps for troubleshooting:
Log in to the application server -> Open the log file -> Locate the error log -> Follow the clues in the error log to troubleshoot, identify, and solve the problem.
Among them:
1. From logging in to opening the log file. Since there are multiple application servers, it is inconvenient to log in to them one by one. It is worth writing a tool and placing it on the AG so that all server logs can be viewed directly on the AG, or even so that the required error logs can be filtered out directly.
2. Locate the error log. The current log layout is densely packed, which makes error logs hard to find. Generally, first use the time to get close to the error log, then use the combination of entity keyword and operation name to pin it down. Locating error logs by requestId is the more traditional approach, but it requires finding the requestId first and the requestId is not descriptive. It is best to locate the error log directly by time and content keywords.
3. Analyze the error log. The content of the error log should be direct and clear, should visibly match the characteristics of the problem being investigated, and should give important clues.
Usually, the problem with error logs is that their content can only be understood in the context of the current code. A log looks concise, but it is incompletely written and in half-English phrasing; once you leave the code context, it is hard to know what it is talking about. You have to think hard or read the code to understand what the log means. Isn't that self-inflicted suffering?
For example:
    if ((storageType == StorageType.dfs1 || storageType == StorageType.dfs2)
            && (zone.hasStorageType(StorageType.io3) || zone.hasStorageType(StorageType.io4))) {
        // dfs1 and dfs2 are stored on io3 / io4.
    } else {
        log.info("zone storage type not support, zone: " + zone.getZoneId() + ", storageType: " + storageType.name());
        throw new BizException(DeviceErrorCode.ZONE_STORAGE_TYPE_NOT_SUPPORT);
    }
Which storage types is the zone supposed to support? Don't make me think!
Error logs should be able to clearly describe what happened even if you leave the context of the code.
In addition, if the error log states the cause directly, you save effort when inspecting logs.
In a sense, the error log can also be a very useful document, recording various illegal running use cases.
The content of the current program error log may have the following problems:
1. The error log does not specify the error parameters and content:
    catch (Exception ex) {
        log.error("control ip insert failed", ex);
        return new ResultSet<AddControlIpResponse>(
                ControlIpErrorCode.ERROR_CONTROL_IP_INSERT_FAILURE);
    }
This does not specify which control ip failed to be inserted. If the control ip is added as a keyword, the error is easier to search for and pin down.
Similarly,
log.error("Get some errors when insert subnet and its IPs into database. Add subnet or IP failure.", e);
This does not specify which subnet and which IPs are involved. Note that specifying them requires some extra work, which may slightly affect performance; performance and debuggability have to be weighed against each other here.
Solution: use String.format("Some msg to ErrorObj: %s", errobj) to state the error parameters and content.
This usually requires writing a readable toString method for DO objects.
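A small sketch of this combination, assuming a hypothetical `ControlIpDO` class with two illustrative fields. The readable `toString` is what makes the `%s` placeholder useful in the log message.

```java
public class ControlIpLogSketch {

    /** Hypothetical data object; the field names are illustrative only. */
    public static final class ControlIpDO {
        private final String ip;
        private final String ncId;

        public ControlIpDO(String ip, String ncId) {
            this.ip = ip;
            this.ncId = ncId;
        }

        /** Readable toString so the object can be dropped straight into a log message. */
        @Override
        public String toString() {
            return "ControlIpDO{ip='" + ip + "', ncId='" + ncId + "'}";
        }
    }

    /** Builds the error message with the failing object included. */
    public static String insertFailedMessage(ControlIpDO controlIp) {
        return String.format("[add control ip] insert failed for %s", controlIp);
    }

    public static void main(String[] args) {
        System.out.println(insertFailedMessage(new ControlIpDO("10.0.0.8", "nc-42")));
    }
}
```

Without the `toString` override, the same message would end in something like `ControlIpDO@1b6d3586`, which gives no keyword to search for.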
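2. The error scenario is unclear:
    log.error("nc has exist, nc ip" + request.getIp());
The error is reported when createNc detects that the NC already exists. But the log does not state the error scenario, leaving the reader to guess why an "NC already exists" error was reported.
It can be changed to either of:
    log.error("nc already exists when creating nc, please check nc parameters. Given nc ip: " + request.getIp());
    log.error("[create nc] nc already exists, please check nc parameters. Given nc ip: " + request.getIp());
A similar case:
    log.error("not all vm destroyed, nc id " + request.getNcId());
Change it to:
    log.error(String.format("[delete nc] some vms [%s] in the nc are not destroyed. nc id: %s", vmNames, request.getNcId()));
Solution: add a "when" clause to the error message, or prefix it with [interface name], so that the error scenario is clear directly from the error log.
In general, add [interface name] where the executor is known, and add a "when" clause in services.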
3. The content is unclear or its meaning is obscure:
    if (aliMonitorReporter == null) {
        log.error("aliMonitorReporter is null!");
    } else {
        aliMonitorReporter.attach(new ThreadPoolMonitor(namePrefix, asynTaskThreadPool.getThreadPoolExecutor()));
    }
Change the log to:
    log.error("aliMonitorReporter is null, probably not initialized properly, please check configuration in file xxx.");
A similar case:
    if (diskWbps == null && diskRbps == null && diskWiops == null && diskRiops == null) {
        log.error("none of attribute is specified for modifying");
        throw new BizException(DeviceErrorCode.NO_ATTRIBUTE_FOR_MODIFY);
    }
Change the log to:
    log.error("[modify disk attribute] None of [diskWbps,diskRbps,diskWiops,diskRiops] is specified for disk id:" + diskId);
Solution: describe the error content more clearly and precisely.
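4. The guidance for troubleshooting is unclear:
    log.error("get gw group ip segment failed. zkPath: " + LockResource.getGwGroupIpSegmnetLockPath(request.getGwGroupId()));
zkPath? How do I troubleshoot this problem? Whom should I ask? Where do I look for more specific clues?
Solution: add the relevant background knowledge and troubleshooting guidance.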
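5. The error content is not specific or detailed enough:
    if (!ncResourceService.isNcResourceEnough(ncResourceDO, vmResourceCondition)) {
        log.error("disk space is not enough at vm's nc, nc id:" + vmDO.getNcId());
        throw new BizException(ResourceErrorCode.ERROR_RESOURCE_NOT_ENOUGH);
    }
Exactly which resource is insufficient? How much is left? How much is needed? Note that specifying these requires some extra work, which may slightly affect performance; performance and debuggability have to be weighed against each other here.
Solution: improve the program, or use programming techniques, to reveal the specific difference as far as possible and reduce manual comparison.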
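A sketch of what a more specific message could look like. The helper `describeShortage` and its parameters are hypothetical; in the snippet above, the required and available amounts would have to come from `vmResourceCondition` and `ncResourceDO`, whose fields are not shown in this article.

```java
public class ResourceShortageSketch {

    /** Builds a message that states which resource is short, and by how much. */
    public static String describeShortage(String ncId, String resource,
                                          long requiredGb, long availableGb) {
        return String.format(
                "[create vm] %s is not enough on nc %s: required %d GB, available %d GB, short by %d GB.",
                resource, ncId, requiredGb, availableGb, requiredGb - availableGb);
    }

    public static void main(String[] args) {
        System.out.println(describeShortage("nc-7", "disk space", 500, 120));
    }
}
```

The message now answers the three questions directly, so nobody has to compare numbers by hand.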
6. Half-English phrasing is not clear enough to read; the reader has to think to piece together the full meaning:
    log.warn("cache status conflict, device id "+deviceDO.getId()+" db status "+deviceDO.getStatus() +", nc status "+ status);
Change it to:
    log.warn(String.format("[query cache status] device cache status conflicts between regiondb and nc, status of device '%s' in regiondb is %s, but is %s in nc.", deviceDO.getId(), deviceDO.getStatus(), status));
Solution: rewrite in natural, readable English sentences.
In summary, the error log format can be:
    log.error("[interface or operation name] [Some Error Msg] happens. [params] [Probably Because]. [Probably need to do].");
    log.error(String.format("[interface or operation name] [Some Error Msg] happens. [%s]. [Probably Because]. [Probably need to do].", params));
or
    log.error("[Some Error Msg] happens to <error parameter or content> when [in some condition]. [Probably Because]. [Probably need to do].");
    log.error(String.format("[Some Error Msg] happens to %s when [in some condition]. [Probably Because]. [Probably need to do].", parameters));
[Probably Because] and [Probably need to do] can be omitted in some cases; for important interfaces and scenarios it is best to include them.
Each error log is independent and should state, as completely, specifically, and directly as possible, what error occurred in what scenario, what caused it, and what measures or steps to take.
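To keep messages in a uniform shape without retyping the template each time, the format could be wrapped in a small helper. This is only a sketch: the class `ErrorLogFormatSketch`, the method name, and the exact wording of the joining phrases are assumptions, not part of any logging library.

```java
public class ErrorLogFormatSketch {

    /**
     * Assembles "[operation] error. Params: ... Probably because ... Please ...".
     * The cause and action parts are optional and skipped when null.
     */
    public static String format(String operation, String error, String params,
                                String cause, String action) {
        StringBuilder sb = new StringBuilder();
        sb.append('[').append(operation).append("] ").append(error).append(". ");
        sb.append("Params: ").append(params).append('.');
        if (cause != null) {
            sb.append(" Probably because ").append(cause).append('.');
        }
        if (action != null) {
            sb.append(" Please ").append(action).append('.');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(format("create nc", "nc already exists", "nc ip=10.0.0.8",
                "the nc was registered earlier", "check the nc parameters"));
    }
}
```

The returned string would then be passed to whatever logger the project uses, for example `log.error(ErrorLogFormatSketch.format(...))`.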
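Questions:
1. Will the performance of String.format affect logging?
Generally, error logs should be relatively rare, so String.format is not called very often and will not noticeably affect the application or logging.
2. When development time is very tight, is there time to deliberate over wording?
Establish a standardized content format and fit the content into it; this saves the time spent deliberating over wording.
3. When should info, warn, and error be used?
info prints normal status information that the program is expected to produce, to help tracing and locating;
warn indicates a minor irregularity in the system that does not affect operation or use;
error indicates a system error or exception that prevents the target operation from completing normally.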
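The three levels can be illustrated with a short sketch. It uses `java.util.logging` purely to stay dependency-free, where the levels are named INFO, WARNING, and SEVERE rather than info, warn, and error; the method `syncDeviceStatus` and its rules are hypothetical.

```java
import java.util.logging.Logger;

public class LogLevelSketch {

    private static final Logger LOG = Logger.getLogger(LogLevelSketch.class.getName());

    /** Hypothetical status sync; returns true when the operation could complete. */
    public static boolean syncDeviceStatus(String deviceId, String dbStatus, String ncStatus) {
        // info: a normal, expected step, useful for tracing.
        LOG.info(String.format("[sync device status] start, device id: %s.", deviceId));

        if (ncStatus == null) {
            // error: the target operation cannot be completed.
            LOG.severe(String.format(
                    "[sync device status] nc returned no status for device '%s', sync aborted. "
                    + "Probably the nc is unreachable, please check the nc connection.", deviceId));
            return false;
        }
        if (!ncStatus.equals(dbStatus)) {
            // warn: something is off, but the operation can still proceed.
            LOG.warning(String.format(
                    "[sync device status] status of device '%s' is %s in db, but is %s in nc.",
                    deviceId, dbStatus, ncStatus));
        }
        return true;
    }
}
```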
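Error logs are one of the important means of troubleshooting. When we implement a feature, we usually consider the various errors that may occur and their causes.
To track down the cause, some key descriptions are needed to locate it.
This forms a triple:
error symptom -> key error description -> final cause of the error.
For each kind of error, provide the corresponding key description as far as possible, so that the cause can be located.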