Apache Knox is a gateway application and the door to the data in a data lake hidden behind a firewall. While using it is fairly simple, the setup, configuration and debugging process can be tedious because of the many different components that Apache Knox ties together. I also published this article on Hortonworks Community Connection.
- First try to access the service directly before you go through Knox. In many cases, there’s nothing wrong with your Knox setup; the problem is either the way you set up and configured the service behind Knox or the way you try to access that service.
- Once you are familiar with how to access your service directly and have verified that it works as intended, try the same call through Knox.
- Example: you want to check whether WebHDFS is reachable, so you first query the service directly and ask for the home directory.
curl --negotiate -u : "http://webhdfs-host.field.hortonworks.com:50070/webhdfs/v1/?op=GETHOMEDIRECTORY"
- If the request above returns a valid 200 response and a meaningful answer, you can go on and check your Knox setup.
curl -k -u myUsername:myPassword "https://knox-host.field.hortonworks.com:8443/gateway/default/webhdfs/v1/?op=GETHOMEDIRECTORY"
- Note: Direct access to WebHDFS and access to WebHDFS through Knox use two different authentication mechanisms. The first one uses SPNEGO, which requires a valid Kerberos TGT in a secure cluster if you don’t want to receive a “401 – Unauthorized” response. The second one uses HTTP basic authentication against an LDAP, which is why you need to provide username and password on the command line.
- Note 2: For the sake of completeness: you send the first request directly to the service host and port, while you send the second request to the Knox host and port and specify the topology and service in the path.
The next section answers the question of what to do if the second command fails. (If the first command fails, go set up your service correctly and return later.)
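The two-step check above can be wrapped into a small script. This is only a sketch: `http_status` and `next_step` are helper names I made up for illustration, and the hosts and credentials are the placeholders from the curl commands above. It relies on curl's `-o /dev/null -w '%{http_code}'` options to print nothing but the status code.

```shell
# Print only the HTTP status code of a request; all arguments go to curl.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' "$@"
}

# Decide where to look next, given the status of the direct call ($1)
# and the status of the call through Knox ($2).
next_step() {
  if [ "$1" != "200" ]; then
    echo "fix the service first"
  elif [ "$2" != "200" ]; then
    echo "check the Knox setup"
  else
    echo "both paths work"
  fi
}

# Usage (placeholders as in the curl commands above):
# direct=$(http_status --negotiate -u : "http://webhdfs-host.field.hortonworks.com:50070/webhdfs/v1/?op=GETHOMEDIRECTORY")
# knox=$(http_status -k -u myUsername:myPassword "https://knox-host.field.hortonworks.com:8443/gateway/default/webhdfs/v1/?op=GETHOMEDIRECTORY")
# next_step "$direct" "$knox"
```

If curl cannot connect at all, it reports the status as 000, which `next_step` treats like any other non-200 result.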
Security Related Issues
So what do the HTTP response codes mean for a Knox application? Where to start?
- Very common is “401 – Unauthorized”. The name can be misleading, since 401 is always tied to authentication, not authorization. That means you probably need to check one of the following items. The Knox log tells you which of these items causes the error (per default
- Is your username/password combination correct (LDAP)?
- Does your username/password combination exist in the LDAP that Knox is configured to use?
- Is your LDAP server running?
- Is your LDAP configuration in the Knox topology correct (hostname, port, binduser, binduser password,…)?
- Is your LDAP server accessible through the firewall (ports 389 or 636 open from the Knox host)?
- Note: Currently (in HDP 2.6), you can specify an alias for the binduser password. Make sure that this alias is all lowercase. Otherwise you will get a 401 response as well.
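Two of the checks above are easy to script. The sketch below assumes the OpenLDAP client tool `ldapsearch` is installed on the Knox host. `alias_is_lowercase` is a hypothetical helper name, and every host name, DN and alias is a placeholder you need to replace with the values from your topology.

```shell
# Succeeds only if the alias name is already all lowercase.
alias_is_lowercase() {
  [ "$1" = "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" ]
}

# Usage:
# alias_is_lowercase "ldap-bind-password" && echo "alias ok"
#
# Query the LDAP server from the Knox host as the bind user. This verifies
# reachability, the bind credentials and the existence of the user in one go.
# -W prompts for the bind password instead of putting it on the command line.
# ldapsearch -x -H ldap://ldap-host.example.com:389 \
#   -D "uid=binduser,ou=people,dc=example,dc=com" -W \
#   -b "dc=example,dc=com" "(uid=myUsername)"
```

If the `ldapsearch` call already fails from the Knox host, the problem is in the network, the LDAP server or the bind credentials, not in Knox itself.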
- If you got past the 401s, a popular response code is “403 – Forbidden”. This one really is about authorization. How you proceed depends on whether you use ACL authorization or Ranger authorization (which is recommended). If you use ACLs, make sure that the user/group is authorized in your topology definition. If you use Ranger, check the Ranger audit log dashboard and you will immediately notice one of two possible error sources:
- Your user/group is not allowed to use Knox.
- Your user/group is not allowed to use the service that you want to access behind Knox.
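As a quick reminder of which response code points where, the two cases above can be summarized in a tiny lookup. `explain_status` is a hypothetical helper and the hints are just shorthand for the checklists in this section.

```shell
# Map an HTTP status code returned by Knox to the first thing to check.
explain_status() {
  case "$1" in
    200) echo "ok" ;;
    401) echo "authentication: check credentials and the LDAP setup" ;;
    403) echo "authorization: check topology ACLs or Ranger policies" ;;
    *)   echo "not covered here: check the Knox log" ;;
  esac
}

# Usage:
# explain_status 403
```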
Well, we have come a long way, and with respect to security we are almost done. One more problem you could run into is impersonation. Knox needs to be allowed to impersonate any user who accesses a service through Knox. This is configured in core-site.xml:
hadoop.proxyuser.knox.groups and hadoop.proxyuser.knox.hosts. Enter a comma-separated list of the groups and hosts that should be able to access a service over Knox, or set a wildcard (*).
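To check the effective value on a cluster node you can ask Hadoop directly with `hdfs getconf -confKey`. The sketch below adds a hypothetical helper, `proxyuser_allows`, that tests whether such a value covers a given host or group. It is a simplification: it only understands the wildcard and exact entries in a comma-separated list.

```shell
# Succeeds if a proxyuser value ("*" or a comma-separated list, $1)
# contains the given host or group ($2).
proxyuser_allows() {
  value=$(printf '%s' "$1" | tr -d ' ')   # tolerate "a, b" style lists
  if [ "$value" = "*" ]; then
    return 0
  fi
  case ",$value," in
    *",$2,"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Usage on a cluster node (the Knox host name is a placeholder):
# hosts=$(hdfs getconf -confKey hadoop.proxyuser.knox.hosts)
# proxyuser_allows "$hosts" "knox-host.field.hortonworks.com" || echo "Knox host not allowed to impersonate"
```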
This is what you get in the Knox log when your Ranger Admin server is not running and policies cannot be refreshed:
2017-07-05 21:11:53,700 ERROR util.PolicyRefresher (PolicyRefresher.java:loadPolicyfromPolicyAdmin(288)) - PolicyRefresher(serviceName=condlahdp_knox): failed to refresh policies. Will continue to use last known version of policies (3) javax.ws.rs.ProcessingException: java.net.ConnectException: Connection refused (Connection refused)
This is also a nice example of Ranger’s design not to interfere with services when it is down: policies will not be refreshed, but the plugin keeps operating as intended with the set of policies it had before Ranger crashed.
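If you suspect this situation, you can search the Knox log for the message shown above. A minimal sketch: `count_policy_refresh_errors` is a made-up helper name, and the log file path is an argument because it depends on your installation.

```shell
# Count the "failed to refresh policies" errors in a Knox log file ($1).
count_policy_refresh_errors() {
  grep -c 'failed to refresh policies' "$1"
}

# Usage (replace the path with the location of your gateway log):
# count_policy_refresh_errors /path/to/gateway.log
```

A count that keeps growing means the Ranger plugin in Knox still cannot reach Ranger Admin and is working with its last known policies.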
Application Specific Issues
Once you are past the authentication and authorization issues, there might be issues with how Knox interacts with its applications. This section might grow with time. If you have more examples of application specific issues, leave a comment or send me an email.
- To make Hive work with Knox, you need to change the transport mode from binary to http. In rare cases it might be necessary to restart not only HiveServer2 after this configuration change, but also the Knox gateway.
- This is what you get when you don’t switch the transport mode from “binary” to “http”. Binary runs on port 10000, http runs on port 10001. While binary transport mode is still active, Knox tries to connect to port 10001, which is not available, and thus fails with “Connection refused”.
2017-07-05 08:24:31,508 WARN hadoop.gateway (DefaultDispatch.java:executeOutboundRequest(146)) - Connection exception dispatching request: http://condla0.field.hortonworks.com:10001/cliservice?doAs=user org.apache.http.conn.HttpHostConnectException: Connect to condla0.field.hortonworks.com:10001 [condla0.field.hortonworks.com/172.26.201.30] failed: Connection refused (Connection refused) org.apache.http.conn.HttpHostConnectException: Connect to condla0.field.hortonworks.com:10001 [condla0.field.hortonworks.com/172.26.201.30] failed: Connection refused (Connection refused) at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151) at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
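To confirm that this is the problem, you can probe the HTTP endpoint that Knox dispatches to. The sketch below encodes the two ports mentioned above in a hypothetical helper, `hive_port_for_mode`. The host name is a placeholder, and I assume the transport mode is controlled by the `hive.server2.transport.mode` property in hive-site.xml with the default ports unchanged.

```shell
# Default HiveServer2 port for a transport mode, as described above.
hive_port_for_mode() {
  case "$1" in
    binary) echo 10000 ;;
    http)   echo 10001 ;;
    *)      echo "unknown transport mode: $1" >&2; return 1 ;;
  esac
}

# Usage: probe the endpoint from the Knox host.
# A status of 000 means curl could not connect at all ("Connection refused").
# curl -s -o /dev/null -w '%{http_code}\n' \
#   "http://hiveserver2-host.example.com:$(hive_port_for_mode http)/cliservice"
```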
- If you have fixed all possible HTTP 401 errors for services other than Hive, but still get one in Hive, you might have forgotten to pass username and password to beeline:
beeline -u "<jdbc-connection-string>" -n "<username>" -p "<password>"
- The correct jdbc-connection-string is built from the following values (the connection line in the error output further down shows a complete string):
$TRUSTSTORE_PATH is the path to the truststore containing the Knox server certificate, on the server with root access you could e.g. use
$KNOX_HOSTNAME is the hostname where the Knox instance is running
$KNOX_PORT is the port exposed by Knox
$TRUSTSTORE_SECRET is the secret you are using for your truststore
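Put together, the connection string can be assembled from these four values. The format below is reconstructed from the connection line in the error output further down, so treat it as a sketch: `knox_hive_jdbc_url` is a hypothetical helper, and the `httpPath` assumes a topology called `default`.

```shell
# Build the beeline JDBC URL for Hive over Knox from the four values above.
knox_hive_jdbc_url() {
  printf 'jdbc:hive2://%s:%s/;ssl=true;sslTrustStore=%s;trustStorePassword=%s;transportMode=http;httpPath=gateway/default/hive\n' \
    "$KNOX_HOSTNAME" "$KNOX_PORT" "$TRUSTSTORE_PATH" "$TRUSTSTORE_SECRET"
}

# Usage:
# KNOX_HOSTNAME=knoxserver.field.hortonworks.com
# KNOX_PORT=8443
# TRUSTSTORE_PATH=truststore.jks
# TRUSTSTORE_SECRET=myPassword
# beeline -u "$(knox_hive_jdbc_url)" -n "<username>" -p "<password>"
```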
- This is what you get when you connect via beeline and address Knox by a different (e.g. internal) hostname than the one configured in the SSL certificate of the server. Use the hostname from the certificate and everything will work fine. While this error is not specific to Hive, you will mostly encounter it in combination with Hive, since most of the other services don’t require you to check your certificates.
Connecting to jdbc:hive2://knoxserver-internal.field.hortonworks.com:8443/;ssl=true;sslTrustStore=truststore.jks;trustStorePassword=myPassword;transportMode=http;httpPath=gateway/default/hive
17/07/06 12:13:37 [main]: ERROR jdbc.HiveConnection: Error opening session org.apache.thrift.transport.TTransportException: javax.net.ssl.SSLPeerUnverifiedException: Host name 'knoxserver-internal.field.hortonworks.com' does not match the certificate subject provided by the peer (CN=knoxserver.field.hortonworks.com, OU=Test, O=Hadoop, L=Test, ST=Test, C=US)
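You can find out which hostname the certificate expects before you even start beeline. The sketch below assumes the `openssl` command line tool is available. `subject_cn` and `hostname_matches_cert` are hypothetical helpers, and the comparison is deliberately simple: it only looks at the CN and ignores subject alternative names and wildcard certificates.

```shell
# Extract the CN from a certificate subject such as
# "CN=knoxserver.field.hortonworks.com, OU=Test, O=Hadoop".
subject_cn() {
  printf '%s\n' "$1" | sed -n 's/.*CN *= *\([^,/]*\).*/\1/p'
}

# Succeeds if the hostname ($1) equals the CN of the subject ($2).
hostname_matches_cert() {
  [ "$1" = "$(subject_cn "$2")" ]
}

# Usage: read the subject of the certificate that Knox presents.
# subject=$(openssl s_client -connect knoxserver.field.hortonworks.com:8443 </dev/null 2>/dev/null \
#   | openssl x509 -noout -subject)
# hostname_matches_cert "knoxserver-internal.field.hortonworks.com" "$subject" || echo "hostname mismatch"
```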
- WEBHBASE is the service in a Knox topology to access HBase via the HBase REST server. Of course, a prerequisite is that the HBase REST server is up and running.
- Even if it is up and running, you may receive an error with HTTP code “503 – Service Unavailable”. This is not related to Knox. You can track the issue down to the HBase REST server: the authenticated user does not have the privileges to e.g. scan the data. Give the user the correct permissions to solve this error.
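If HBase manages its own permissions, one way to give the user read access is a `grant` statement in the HBase shell. This is a sketch under assumptions: the user and table names are placeholders, `hbase_grant_statement` is a hypothetical helper, and if you authorize HBase through Ranger you would add a Ranger policy instead.

```shell
# Build an HBase shell grant statement: user ($1), permissions ($2), table ($3).
hbase_grant_statement() {
  printf "grant '%s', '%s', '%s'\n" "$1" "$2" "$3"
}

# Usage as a user with HBase admin rights (-n runs the shell non-interactively):
# hbase_grant_statement "myUsername" "R" "my_table" | hbase shell -n
```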