Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8286212: Cgroup v1 initialization causes NPE on some systems #8629

Closed
wants to merge 4 commits into from

Conversation

jerboaa
Copy link
Contributor

@jerboaa jerboaa commented May 10, 2022

Please review this change to the cgroup v1 subsystem which makes it more resilient on some of the stranger systems. Unfortunately, I wasn't able to re-create a similar system as the reporter. The idea of using the longest substring match for the cgroupv1 file paths was based on the fact that on systemd systems processes run in separate scopes and the maven forked test runner might exhibit this property. For that it makes sense to use the common ancestor path. Nothing changes in the common cases where the cgroup_path matches _root and when the _root is / (container the former, host system the latter).

In addition, the code paths which are susceptible to throw NPE have been hardened to catch those situations. Should it happen in the future it makes more sense (to me) to not have accurate container detection in favor of continuing to keep running.

Finally, with the added unit-tests a bug was uncovered on the "substring" match case of cgroup paths in hotspot. p returned from strstr can never point to _root as it's used as the "needle" to find in "haystack" cgroup_path (not the other way round).

Testing:

  • Added unit tests
  • GHA
  • Container tests on cgroups v1 Linux. Continue to pass

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (1 review required, with at least 1 reviewer)

Issue

  • JDK-8286212: Cgroup v1 initialization causes NPE on some systems

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/8629/head:pull/8629
$ git checkout pull/8629

Update a local copy of the PR:
$ git checkout pull/8629
$ git pull https://git.openjdk.java.net/jdk pull/8629/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 8629

View PR using the GUI difftool:
$ git pr show -t 8629

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/8629.diff

Sorry, something went wrong.

@bridgekeeper
Copy link

bridgekeeper bot commented May 10, 2022

👋 Welcome back sgehwolf! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label May 10, 2022
@openjdk
Copy link

openjdk bot commented May 10, 2022

@jerboaa The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot-runtime
  • serviceability

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added serviceability serviceability-dev@openjdk.org hotspot-runtime hotspot-runtime-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels May 10, 2022
@mlbridge
Copy link

mlbridge bot commented May 10, 2022

Webrevs

Copy link
Member

@iklam iklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just started to look at the code so just one comment for now.

@@ -62,6 +62,22 @@ public void setPath(String cgroupPath) {
String cgroupSubstr = cgroupPath.substring(root.length());
path = mountPoint + cgroupSubstr;
}
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like path is still not set if the condition at line 61 if (cgroupPath.length() > root.length()) { is false.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. There are more cases where the cgroup path might be null. If you know of cases that should be handled and aren't, please do let me know. Incidentally, that's why I've added the catch for NPE in CgroupV1Subsystem.initSubSystem(), because it's not clear what should be used as path in those corner cases. It certainly shouldn't throw a NPE :-). It then also more closely matches what hotspot does. Having said that, if cgroupPath.length < root.length() it's implied that cgroupPath.startsWith(root) is false. Then the only case where it would still be null is when the paths match in lenght, but aren't the same.

I'm not convinced the logic when then cgroup patch starts with the root path, then it should do what it does today. I.e. given mountpath=/sys/fs/cgroup/memory, root=/foo/bar and cgroupPath=/foo/bar/baz then path=/sys/fs/cgroup/memory/baz. I've personally not seen such a setup and the code predates me. Considering that the equivalent hotspot code had a bug in this logic since day 1, it doesn't seem very widely used...

The most common cases I know to date are:

  1. root=/, cgroupPath=/some, mount=/sys/fs/cgroup/controller => path=/sys/fs/cgroup/controller/some (host systems)
  2. root=/foo/bar/baz, cgroupPath=/foo/bar/baz, mount=/sys/fs/cgroup/controller => path=/sys/fs/cgroup/controller (most container engines)

@jerboaa
Copy link
Contributor Author

jerboaa commented May 12, 2022

I just started to look at the code so just one comment for now.

Sure, thanks for the review!

@tstuefe
Copy link
Member

tstuefe commented May 13, 2022

Thanks for taking my suggestion! Much simpler now :-)

root_p++; cgroup_p++;
}
ss.print_raw(_root, last_matching_slash_pos);
_path = os::strdup(ss.base());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean Find the longest common prefix? Maybe give an example in the comments? Text parsing in C code is really difficult to understand.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe factor out the search like this

// Return length of common starting substring. E.g. "cat" for ("cattle", "catnip");
static int find_common_starting_substring(const char* s1, const char* s2) {
...
}

That way its clearer and we can find and move this to utilities if we ever need this in other places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iklam yes I meant Find the longest common prefix. Fixed the comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced the extra function makes the code more readable, but here it is. I can revert back if this is too much.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jerboaa Feel free to go back to your original variant. This was only a proposal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

understood.

@@ -59,7 +60,7 @@ void CgroupV1Controller::set_subsystem_path(char *cgroup_path) {
_path = os::strdup(buf);
} else {
char *p = strstr(cgroup_path, _root);
if (p != NULL && p == _root) {
if (p != NULL && p == cgroup_path) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if cgroup_path is "/foo/bar" and _root is "/fo"?

Maybe this block should be combined with the new else block you're adding? However, the behavior seems opposite between these two blocks of code:

The upper block: _root is a prefix of cgroup_path, we take the tail of cgroup_path

  TestCase substring_match = {
    "/sys/fs/cgroup/memory",                           // mount_path
    "/user.slice/user-1000.slice",                     // root_path
    "/user.slice/user-1000.slice/user@1001.service",   // cgroup_path
    "/sys/fs/cgroup/memory/user@1001.service"          // expected_path
  };

The lower block: The beginning of _root is a prefix of cgroup_path, we take the head of cgroup_path

 TestCase prefix_matched_cg = {
    "/sys/fs/cgroup/memory",                           // mount_path
    "/user.slice/user-1000.slice/session-50.scope",    // root_path
    "/user.slice/user-1000.slice/session-3.scope",     // cgroup_path
    "/sys/fs/cgroup/memory/user.slice/user-1000.slice" // expected_path
  };

I find the behavior hard to understand. I think the rules (and reasons) should be added to the comment block above the function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if cgroup_path is "/foo/bar" and _root is "/fo"?

With a mount path of /bar this ends up being /bar/o/bar. Pretty strange, but then again it's a bit of a contrived example as those paths come from /proc parsing. Anyway, this is code that got added with JDK-8146115. It's not something I've written and to be honest, I'm not sure this branch is needed, but I didn't want to change the existing behaviour with this patch. I have no more insight than you in terms of why that approach has been taken.

Maybe this block should be combined with the new else block you're adding?

Maybe, but I'm not sure if it would break something.

However, the behavior seems opposite between these two blocks of code:

The upper block: _root is a prefix of cgroup_path, we take the tail of cgroup_path

  TestCase substring_match = {
    "/sys/fs/cgroup/memory",                           // mount_path
    "/user.slice/user-1000.slice",                     // root_path
    "/user.slice/user-1000.slice/user@1001.service",   // cgroup_path
    "/sys/fs/cgroup/memory/user@1001.service"          // expected_path
  };

Yes. Though, I cannot comment on why that has been chosen. It's been there since day one :-/

The lower block: The beginning of _root is a prefix of cgroup_path, we take the head of cgroup_path

 TestCase prefix_matched_cg = {
    "/sys/fs/cgroup/memory",                           // mount_path
    "/user.slice/user-1000.slice/session-50.scope",    // root_path
    "/user.slice/user-1000.slice/session-3.scope",     // cgroup_path
    "/sys/fs/cgroup/memory/user.slice/user-1000.slice" // expected_path
  };

I find the behavior hard to understand. I think the rules (and reasons) should be added to the comment block above the function.

The reason why I've gone down the path of adding the head of cgroup_path is because of this document (in conjunction to what the user was observing on an affected system):
https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/

The user was observing paths as listed in the test:

"/user.slice/user-1000.slice/session-50.scope",    // root_path
"/user.slice/user-1000.slice/session-3.scope",     // cgroup_path

This very much looks like systemd managed. Given that and knowing that systemd adds processes into scopes or services and groups them via slices and also knowing that cgroups are hierarchical (i.e. limits of /foo/bar also apply to /foo/bar/baz), it seems likely that if there are any limits, then it'll be on /user.slice/user-1000.slice within the mounted controller. Unfortunately, I'm not able to reproduce this myself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if the problem is this:

We have systemd running on the host, and a different copy of systemd that runs inside the container.

  • They both set up /user.slice/user-1000.slice/session-??.scope within their own file systems
  • For some reason, when you're looking inside the container, /proc/self/cgroup might use a path in the containerized file system whereas /proc/self/mountinfo uses a path in the host file system. These two paths may look alike but they have absolutely no relation to each other.

I have asked the reporter for more information:

https://gist.github.com/gaol/4d96eace8290e6549635fdc0ea41d0b4?permalink_comment_id=4172593#gistcomment-4172593

Meanwhile, I think the current method of finding "which directory under /sys/fs/cgroup/memory controls my memory usage" is broken. As mentioned about, the path you get from /proc/self/cgroup and /proc/self/mountinfo have no relation to each other, but we use them anyway to get our answer, with many ad-hoc methods that are not documented in the code.

Maybe we should do this instead?

  • Read /proc/self/cgroup
  • Find the 10:memory:<path> line
  • If /sys/fs/cgroup/memory/<path>/tasks contains my PID, this is the path
  • Otherwise, scan all tasks files under /sys/fs/cgroup/memory/. Exactly one of them contains my PID.

For example, here's a test with docker:

INSIDE CONTAINER
# cat /proc/self/cgroup | grep memory
10:memory:/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050
# cat /proc/self/mountinfo | grep memory
801 791 0:42 /docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050 /sys/fs/cgroup/memory ro,nosuid,nodev,noexec,relatime master:23 - cgroup cgroup rw,memory
# cat /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks
cat: /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks: No such file or directory
# cat /sys/fs/cgroup/memory/tasks | grep $$
1

ON HOST
# cat /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks
37494
# cat /proc/37494/status | grep NSpid
NSpid:	37494	1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think the current PR could produce the wrong answer, if systemd is indeed running inside the container, and we have:

"/user.slice/user-1000.slice/session-50.scope",    // root_path
"/user.slice/user-1000.slice/session-3.scope",     // cgroup_path

The PR gives /sys/fs/cgroup/memory/user.slice/user-1000.slice/, which specifies the overall memory limit for user-1000. However, the correct answer may be /sys/fs/cgroup/memory/user.slice/user-1000.slice/session-3.scope, which may have a smaller memory limit, and the JVM may end up allocating a larger heap than allowed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if we can decide which one the right file is. This is largely undocumented territory. The correct fix is a) find the correct path to the namespace hierarchy the process is a part of. b) starting at the leaf node, walk up the hierarchy and find the lowest limits. Doing this would be very expensive!

Aside: Current container detection in the JVM/JDK is notoriously imprecise. It's largely based on common setups (containers like docker). The heuristics assume that memory limits are reported inside the container at the leaf node. If, however, that's not the case, the detected limits will be wrong (it will detect it as unlimited, even though it's - for example - memory constrained at the parent). This can for example be reproduced on a cgroups v2 system with a systemd slice using memory limits. We've worked-around this in OpenJDK for cgroups v1 by https://bugs.openjdk.java.net/browse/JDK-8217338

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should do this instead?

* Read /proc/self/cgroup

* Find the `10:memory:<path>` line

* If `/sys/fs/cgroup/memory/<path>/tasks` contains my PID, this is the path

* Otherwise, scan all `tasks` files under  `/sys/fs/cgroup/memory/`. Exactly one of them contains my PID.

Something like that seems most promising, but it would have to be cgroup.procs not tasks as tasks is the task id (i.e. Linux's thread), not the process. We could keep the two common cases as short circuiting. I.e. host and docker cases in the test.

@@ -59,7 +60,7 @@ void CgroupV1Controller::set_subsystem_path(char *cgroup_path) {
_path = os::strdup(buf);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on the old code: why this cannot be simply

_path = os::strdup(_mount_point);

Also, all the strncat calls in this function seem problematic, and should be converted to stringStream for consistency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I've filed https://bugs.openjdk.java.net/browse/JDK-8287007 to track this. I'd like to limit the changes of this patch to a minimum. It seems orthogonal.

int last_matching_slash_pos = -1;
for (int i = 0; *s1 == *s2 && *s1 != 0; i++) {
if (*s1 == '/') {
last_matching_slash_pos = i;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the behavior a little hard to understand. Is it intentional?

"/cat/cow", "/cat/dog"    -> "/cat/"
"/cat/",    "/cat/dog"    -> "/cat/"
"/cat",     "/cat/dog"    -> "/"

The name find_last_slash_pos doesn't properly describe the behavior. I thought about find_common_path_prefix, but that doesn't match the current behavior (for the 3rd case, the common path prefix is "/cat").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's supposed to find the common path prefix as you say. I'll fix it. Open to suggestions to make it easier to understand.

const char* root_p = _root;
const char* cgroup_p = cgroup_path;
int last_slash = find_last_slash_pos(root_p, cgroup_p);
assert(last_slash >= 0, "not an absolute path?");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are root_p and cgroup_p directly read from the /proc/xxx files. If so, do we validate the input to make sure they are absolute paths?

It seems like our code cannot handle trailing '/' in the input. If so, we should clear all trailing '/' from the input string. Then, in functions that process them, we should assert that they don't end with slash. See my comment in find_last_slash_pos().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, those values come from /proc/self/mountinfo and /proc/self/cgroup. There is no validation being done. Then again, we only end up in this branch if the root path is not a substring of the cgroup path. In that case trailing slashes don't matter, since there would not be a character by character match earlier.

I'll add handling of trailing slashes and appropriate asserts where it makes sense.

@bridgekeeper
Copy link

bridgekeeper bot commented Jun 20, 2022

@jerboaa This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@jerboaa
Copy link
Contributor Author

jerboaa commented Jun 21, 2022

Don't close this yet please, bot.

@bridgekeeper
Copy link

bridgekeeper bot commented Jul 19, 2022

@jerboaa This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@jerboaa
Copy link
Contributor Author

jerboaa commented Jul 19, 2022

It's still on my plate to work on, but priorities keep this pushed to the bottom...

@bridgekeeper
Copy link

bridgekeeper bot commented Aug 16, 2022

@jerboaa This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@jerboaa
Copy link
Contributor Author

jerboaa commented Aug 16, 2022

Don't close bot, please.

@bridgekeeper
Copy link

bridgekeeper bot commented Sep 13, 2022

@jerboaa This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@@ -217,6 +221,7 @@ public class TestCgroupSubsystemFactory {
"2:cpu,cpuacct:/\n" +
"1:name=systemd:/user.slice/user-1000.slice/user@1000.service/apps.slice/apps-org.gnome.Terminal.slice/vte-spawn-3c00b338-5b65-439f-8e97-135e183d135d.scope\n" +
"0::/user.slice/user-1000.slice/user@1000.service/apps.slice/apps-org.gnome.Terminal.slice/vte-spawn-3c00b338-5b65-439f-8e97-135e183d135d.scope\n";
private String cgroupv1SelfPrefixContent = "9:memory:/user.slice/user-1000.slice/session-3.scope\n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove one redundant space

Suggested change
private String cgroupv1SelfPrefixContent = "9:memory:/user.slice/user-1000.slice/session-3.scope\n";
private String cgroupv1SelfPrefixContent = "9:memory:/user.slice/user-1000.slice/session-3.scope\n";

@bridgekeeper
Copy link

bridgekeeper bot commented Nov 7, 2022

@jerboaa This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

@bridgekeeper bridgekeeper bot closed this Nov 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-libs core-libs-dev@openjdk.org hotspot-runtime hotspot-runtime-dev@openjdk.org rfr Pull request is ready for review serviceability serviceability-dev@openjdk.org
Development

Successfully merging this pull request may close these issues.

None yet

4 participants