-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8286212: Cgroup v1 initialization causes NPE on some systems #8629
Conversation
👋 Welcome back sgehwolf! A progress list of the required criteria for merging this PR into |
@jerboaa The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command. |
Webrevs
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just started to look at the code so just one comment for now.
@@ -62,6 +62,22 @@ public void setPath(String cgroupPath) { | |||
String cgroupSubstr = cgroupPath.substring(root.length()); | |||
path = mountPoint + cgroupSubstr; | |||
} | |||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like path
is still not set if the condition at line 61 if (cgroupPath.length() > root.length()) {
is false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. There are more cases where the cgroup path might be null. If you know of cases that should be handled and aren't, please do let me know. Incidentally, that's why I've added the catch for NPE in CgroupV1Subsystem.initSubSystem()
, because it's not clear what should be used as path
in those corner cases. It certainly shouldn't throw a NPE :-). It then also more closely matches what hotspot does. Having said that, if cgroupPath.length < root.length()
it's implied that cgroupPath.startsWith(root)
is false. Then the only case where it would still be null is when the paths match in lenght, but aren't the same.
I'm not convinced the logic when then cgroup patch starts with the root path, then it should do what it does today. I.e. given mountpath=/sys/fs/cgroup/memory
, root=/foo/bar
and cgroupPath=/foo/bar/baz
then path=/sys/fs/cgroup/memory/baz
. I've personally not seen such a setup and the code predates me. Considering that the equivalent hotspot code had a bug in this logic since day 1, it doesn't seem very widely used...
The most common cases I know to date are:
root=/
,cgroupPath=/some
,mount=/sys/fs/cgroup/controller
=>path=/sys/fs/cgroup/controller/some
(host systems)root=/foo/bar/baz
,cgroupPath=/foo/bar/baz
,mount=/sys/fs/cgroup/controller
=>path=/sys/fs/cgroup/controller
(most container engines)
Sure, thanks for the review! |
Thanks for taking my suggestion! Much simpler now :-) |
root_p++; cgroup_p++; | ||
} | ||
ss.print_raw(_root, last_matching_slash_pos); | ||
_path = os::strdup(ss.base()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean Find the longest common prefix
? Maybe give an example in the comments? Text parsing in C code is really difficult to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe factor out the search like this
// Return length of common starting substring. E.g. "cat" for ("cattle", "catnip");
static int find_common_starting_substring(const char* s1, const char* s2) {
...
}
That way its clearer and we can find and move this to utilities if we ever need this in other places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iklam yes I meant Find the longest common prefix
. Fixed the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced the extra function makes the code more readable, but here it is. I can revert back if this is too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jerboaa Feel free to go back to your original variant. This was only a proposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
understood.
@@ -59,7 +60,7 @@ void CgroupV1Controller::set_subsystem_path(char *cgroup_path) { | |||
_path = os::strdup(buf); | |||
} else { | |||
char *p = strstr(cgroup_path, _root); | |||
if (p != NULL && p == _root) { | |||
if (p != NULL && p == cgroup_path) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if cgroup_path is "/foo/bar" and _root is "/fo"?
Maybe this block should be combined with the new else
block you're adding? However, the behavior seems opposite between these two blocks of code:
The upper block: _root is a prefix of cgroup_path, we take the tail of cgroup_path
TestCase substring_match = {
"/sys/fs/cgroup/memory", // mount_path
"/user.slice/user-1000.slice", // root_path
"/user.slice/user-1000.slice/user@1001.service", // cgroup_path
"/sys/fs/cgroup/memory/user@1001.service" // expected_path
};
The lower block: The beginning of _root is a prefix of cgroup_path, we take the head of cgroup_path
TestCase prefix_matched_cg = {
"/sys/fs/cgroup/memory", // mount_path
"/user.slice/user-1000.slice/session-50.scope", // root_path
"/user.slice/user-1000.slice/session-3.scope", // cgroup_path
"/sys/fs/cgroup/memory/user.slice/user-1000.slice" // expected_path
};
I find the behavior hard to understand. I think the rules (and reasons) should be added to the comment block above the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if cgroup_path is "/foo/bar" and _root is "/fo"?
With a mount path of /bar
this ends up being /bar/o/bar
. Pretty strange, but then again it's a bit of a contrived example as those paths come from /proc
parsing. Anyway, this is code that got added with JDK-8146115. It's not something I've written and to be honest, I'm not sure this branch is needed, but I didn't want to change the existing behaviour with this patch. I have no more insight than you in terms of why that approach has been taken.
Maybe this block should be combined with the new
else
block you're adding?
Maybe, but I'm not sure if it would break something.
However, the behavior seems opposite between these two blocks of code:
The upper block: _root is a prefix of cgroup_path, we take the tail of cgroup_path
TestCase substring_match = { "/sys/fs/cgroup/memory", // mount_path "/user.slice/user-1000.slice", // root_path "/user.slice/user-1000.slice/user@1001.service", // cgroup_path "/sys/fs/cgroup/memory/user@1001.service" // expected_path };
Yes. Though, I cannot comment on why that has been chosen. It's been there since day one :-/
The lower block: The beginning of _root is a prefix of cgroup_path, we take the head of cgroup_path
TestCase prefix_matched_cg = { "/sys/fs/cgroup/memory", // mount_path "/user.slice/user-1000.slice/session-50.scope", // root_path "/user.slice/user-1000.slice/session-3.scope", // cgroup_path "/sys/fs/cgroup/memory/user.slice/user-1000.slice" // expected_path };
I find the behavior hard to understand. I think the rules (and reasons) should be added to the comment block above the function.
The reason why I've gone down the path of adding the head of cgroup_path is because of this document (in conjunction to what the user was observing on an affected system):
https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/
The user was observing paths as listed in the test:
"/user.slice/user-1000.slice/session-50.scope", // root_path
"/user.slice/user-1000.slice/session-3.scope", // cgroup_path
This very much looks like systemd managed. Given that and knowing that systemd adds processes into scopes
or services
and groups them via slices
and also knowing that cgroups are hierarchical (i.e. limits of /foo/bar
also apply to /foo/bar/baz
), it seems likely that if there are any limits, then it'll be on /user.slice/user-1000.slice
within the mounted controller. Unfortunately, I'm not able to reproduce this myself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if the problem is this:
We have systemd running on the host, and a different copy of systemd that runs inside the container.
- They both set up
/user.slice/user-1000.slice/session-??.scope
within their own file systems - For some reason, when you're looking inside the container,
/proc/self/cgroup
might use a path in the containerized file system whereas/proc/self/mountinfo
uses a path in the host file system. These two paths may look alike but they have absolutely no relation to each other.
I have asked the reporter for more information:
Meanwhile, I think the current method of finding "which directory under /sys/fs/cgroup/memory controls my memory usage" is broken. As mentioned about, the path you get from /proc/self/cgroup
and /proc/self/mountinfo
have no relation to each other, but we use them anyway to get our answer, with many ad-hoc methods that are not documented in the code.
Maybe we should do this instead?
- Read /proc/self/cgroup
- Find the
10:memory:<path>
line - If
/sys/fs/cgroup/memory/<path>/tasks
contains my PID, this is the path - Otherwise, scan all
tasks
files under/sys/fs/cgroup/memory/
. Exactly one of them contains my PID.
For example, here's a test with docker:
INSIDE CONTAINER
# cat /proc/self/cgroup | grep memory
10:memory:/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050
# cat /proc/self/mountinfo | grep memory
801 791 0:42 /docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050 /sys/fs/cgroup/memory ro,nosuid,nodev,noexec,relatime master:23 - cgroup cgroup rw,memory
# cat /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks
cat: /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks: No such file or directory
# cat /sys/fs/cgroup/memory/tasks | grep $$
1
ON HOST
# cat /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks
37494
# cat /proc/37494/status | grep NSpid
NSpid: 37494 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I think the current PR could produce the wrong answer, if systemd is indeed running inside the container, and we have:
"/user.slice/user-1000.slice/session-50.scope", // root_path
"/user.slice/user-1000.slice/session-3.scope", // cgroup_path
The PR gives /sys/fs/cgroup/memory/user.slice/user-1000.slice/, which specifies the overall memory limit for user-1000. However, the correct answer may be /sys/fs/cgroup/memory/user.slice/user-1000.slice/session-3.scope, which may have a smaller memory limit, and the JVM may end up allocating a larger heap than allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, if we can decide which one the right file is. This is largely undocumented territory. The correct fix is a) find the correct path to the namespace hierarchy the process is a part of. b) starting at the leaf node, walk up the hierarchy and find the lowest limits. Doing this would be very expensive!
Aside: Current container detection in the JVM/JDK is notoriously imprecise. It's largely based on common setups (containers like docker). The heuristics assume that memory limits are reported inside the container at the leaf node. If, however, that's not the case, the detected limits will be wrong (it will detect it as unlimited, even though it's - for example - memory constrained at the parent). This can for example be reproduced on a cgroups v2 system with a systemd slice using memory limits. We've worked-around this in OpenJDK for cgroups v1 by https://bugs.openjdk.java.net/browse/JDK-8217338
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should do this instead?
* Read /proc/self/cgroup * Find the `10:memory:<path>` line * If `/sys/fs/cgroup/memory/<path>/tasks` contains my PID, this is the path * Otherwise, scan all `tasks` files under `/sys/fs/cgroup/memory/`. Exactly one of them contains my PID.
Something like that seems most promising, but it would have to be cgroup.procs
not tasks
as tasks
is the task id (i.e. Linux's thread), not the process. We could keep the two common cases as short circuiting. I.e. host and docker cases in the test.
@@ -59,7 +60,7 @@ void CgroupV1Controller::set_subsystem_path(char *cgroup_path) { | |||
_path = os::strdup(buf); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment on the old code: why this cannot be simply
_path = os::strdup(_mount_point);
Also, all the strncat calls in this function seem problematic, and should be converted to stringStream for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I've filed https://bugs.openjdk.java.net/browse/JDK-8287007 to track this. I'd like to limit the changes of this patch to a minimum. It seems orthogonal.
int last_matching_slash_pos = -1; | ||
for (int i = 0; *s1 == *s2 && *s1 != 0; i++) { | ||
if (*s1 == '/') { | ||
last_matching_slash_pos = i; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found the behavior a little hard to understand. Is it intentional?
"/cat/cow", "/cat/dog" -> "/cat/"
"/cat/", "/cat/dog" -> "/cat/"
"/cat", "/cat/dog" -> "/"
The name find_last_slash_pos
doesn't properly describe the behavior. I thought about find_common_path_prefix
, but that doesn't match the current behavior (for the 3rd case, the common path prefix is "/cat"
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's supposed to find the common path prefix as you say. I'll fix it. Open to suggestions to make it easier to understand.
const char* root_p = _root; | ||
const char* cgroup_p = cgroup_path; | ||
int last_slash = find_last_slash_pos(root_p, cgroup_p); | ||
assert(last_slash >= 0, "not an absolute path?"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are root_p and cgroup_p directly read from the /proc/xxx files. If so, do we validate the input to make sure they are absolute paths?
It seems like our code cannot handle trailing '/' in the input. If so, we should clear all trailing '/' from the input string. Then, in functions that process them, we should assert that they don't end with slash. See my comment in find_last_slash_pos().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, those values come from /proc/self/mountinfo
and /proc/self/cgroup
. There is no validation being done. Then again, we only end up in this branch if the root path is not a substring of the cgroup path. In that case trailing slashes don't matter, since there would not be a character by character match earlier.
I'll add handling of trailing slashes and appropriate asserts where it makes sense.
@jerboaa This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration! |
Don't close this yet please, bot. |
@jerboaa This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration! |
It's still on my plate to work on, but priorities keep this pushed to the bottom... |
@jerboaa This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration! |
Don't close bot, please. |
@jerboaa This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration! |
@@ -217,6 +221,7 @@ public class TestCgroupSubsystemFactory { | |||
"2:cpu,cpuacct:/\n" + | |||
"1:name=systemd:/user.slice/user-1000.slice/user@1000.service/apps.slice/apps-org.gnome.Terminal.slice/vte-spawn-3c00b338-5b65-439f-8e97-135e183d135d.scope\n" + | |||
"0::/user.slice/user-1000.slice/user@1000.service/apps.slice/apps-org.gnome.Terminal.slice/vte-spawn-3c00b338-5b65-439f-8e97-135e183d135d.scope\n"; | |||
private String cgroupv1SelfPrefixContent = "9:memory:/user.slice/user-1000.slice/session-3.scope\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove one redundant space
private String cgroupv1SelfPrefixContent = "9:memory:/user.slice/user-1000.slice/session-3.scope\n"; | |
private String cgroupv1SelfPrefixContent = "9:memory:/user.slice/user-1000.slice/session-3.scope\n"; |
@jerboaa This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the |
Please review this change to the cgroup v1 subsystem which makes it more resilient on some of the stranger systems. Unfortunately, I wasn't able to re-create a similar system as the reporter. The idea of using the longest substring match for the cgroupv1 file paths was based on the fact that on systemd systems processes run in separate scopes and the maven forked test runner might exhibit this property. For that it makes sense to use the common ancestor path. Nothing changes in the common cases where the
cgroup_path
matches_root
and when the_root
is/
(container the former, host system the latter).In addition, the code paths which are susceptible to throw NPE have been hardened to catch those situations. Should it happen in the future it makes more sense (to me) to not have accurate container detection in favor of continuing to keep running.
Finally, with the added unit-tests a bug was uncovered on the "substring" match case of cgroup paths in hotspot.
p
returned fromstrstr
can never point to_root
as it's used as the "needle" to find in "haystack"cgroup_path
(not the other way round).Testing:
Progress
Issue
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/8629/head:pull/8629
$ git checkout pull/8629
Update a local copy of the PR:
$ git checkout pull/8629
$ git pull https://git.openjdk.java.net/jdk pull/8629/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 8629
View PR using the GUI difftool:
$ git pr show -t 8629
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/8629.diff