Extensions and changes from LANL #5

Open · wants to merge 37 commits into master
Conversation

@jti-lanl commented Dec 3, 2014

[The following is the contents of README_0.5.2 ...]

This is not (yet) an officially-released version of aws4c. 0.5.2
represents unofficial extensions to the 0.5 release, by Los Alamos National
Laboratory.

We attempted to preserve functionality of all interfaces provided by 0.5.
Where new functionality was added, we've typically added new interfaces.

(1) Extensions to IOBuf / IOBufNode.

The 0.5 version simply adds strings into a linked list, via a malloc and
strcpy(). This might be acceptable when all that's needed is to parse
individual response headers, but we wanted to support more-efficient
operations when large buffers are being sent or received. Thus, it is now
possible to add data for writing by installing user buffers directly into
an IOBuf's linked list. It is also possible to add storage for receiving
data in a similar way. In either case, the added storage may be static or
dynamic. Here are some typical scenarios.

    // GET an object from the server into user's <data_ptr>
    //     NOTE: this uses the "extend" functions to add unused storage
    aws_iobuf_reset(io_buf);
    aws_iobuf_extend_static(io_buf, data_ptr, data_length);
    AWS4C_CHECK( s3_get(io_buf, obj_name) );
    AWS4C_CHECK_OK( io_buf );

    // PUT the contents of user's <data_ptr>
    //     NOTE: this uses the "append" functions to add data
    aws_iobuf_reset(io_buf);
    aws_iobuf_append_static(io_buf, data_ptr, data_length);
    AWS4C_CHECK( s3_put(io_buf, obj_name) );
    AWS4C_CHECK_OK( io_buf );

These use cases will typically also want to call aws_iobuf_reset() after
completion, so that the IOBuf doesn't retain a pointer to user storage
that may go out of scope.

(2) Re-using connections

In 0.5, every call to aws4c functions creates a new CURL connection. This
can add overhead to an application that is performing many operations. We
allow the user to specify that connections should be preserved, or to reset
the connection at a specific time.

    aws_reuse_connections(1); // begin reusing connections
    // ...
    aws_reset_connection();   // reset the connection once
    // ...
    aws_reuse_connections(0); // stop reusing connections

(3) GET/PUT to/from file

Companion-functions to the 0.5 head/get/put/post functions allow the user
to specify a file.  They also allow the user to provide one IOBuf to
capture the response, in addition to the one used to send the request.
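
As a sketch only (the companion functions' names and argument order below
are assumptions, not the library's confirmed API; see aws4c.h for the real
prototypes), a file-based PUT with a separate response-capturing IOBuf
might look like this:

    // hypothetical: PUT the contents of a local file, capturing the
    // server's response in a second IOBuf
    aws_iobuf_reset(io_buf);
    aws_iobuf_reset(response);
    AWS4C_CHECK( s3_put2(io_buf, obj_name, "/tmp/src_file", response) );
    AWS4C_CHECK_OK( io_buf );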

(4) binary data

The old getline() function was not very useful for reading arbitrary
streams of binary (or text) data.  We added a get_raw() function that
ignores newlines.
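
For example (the exact name and signature of the raw getter are
assumptions here; the text above only says a get_raw() function was
added):

    // hypothetical signature: copy up to <size> bytes out of the IOBuf,
    // without treating newlines specially
    char buf[4096];
    int  count = aws_iobuf_get_raw(io_buf, buf, sizeof(buf));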

(5) EMC extensions

EMC supports some extensions to "pure" S3, such as using byte-ranges to
write parts of an object in parallel, or to append to an object.  These
normally aren't legal in S3, so you must call a special function to enable
the library's support for them:

    // use byte-range to append to an object
    s3_enable_EMC_extensions(1);
    s3_set_byte_range(-1, -1);   // creates "Range: bytes=-1-"
    s3_put(io_buf, obj_name);

    // another interface to append to an object
    s3_enable_EMC_extensions(1);
    emc_put_append(io_buf, obj_name);

    // instead of multi-part upload ...
    s3_enable_EMC_extensions(1);
    s3_set_byte_range(offset, length);
    s3_put(io_buf, obj_name);

(6) extras

The 0.5.2 makefile builds a library, libaws4c. We also provide some
debugging functions and XML support, which probably should not be part of
the default library. Therefore, these have their own header
(aws4c_extras.h) and are built into a separate library libaws4c_extras.
This allows test-apps to use extra functionality, without requiring the
production library to be as big.

(7) unit-tests

Feel free to add new unit-tests to test_aws.c.  These provide simple tests
of new functions, and a crude regression-test.

jti-lanl and others added 30 commits December 3, 2014 12:44
Manipulate a linked-list of key-value pairs, to create the meta-data list.
Then install it onto an iobuf.  This will be added to the object when it is
written.  During reading, a parser will retrieve these values and install
them onto the iobuf used for the get.  You can build a list of values once,
and reuse it many times.
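
A sketch of that flow, using hypothetical helper names (this commit
doesn't spell out the actual prototypes; see aws4c.h):

    // hypothetical helpers: build a key/value list once ...
    MetaNode* meta = NULL;
    aws_metadata_set(&meta, "author",  "jti");
    aws_metadata_set(&meta, "purpose", "demo");

    // ... install it onto an iobuf, and reuse it across many writes
    aws_iobuf_set_metadata(io_buf, meta);
    AWS4C_CHECK( s3_put(io_buf, obj_name) );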

Also, changed const-ness in the arguments of some functions.
The user can provide a custom readfunc/writefunc.  This allows threaded
interaction with curl, such that a series of writes (or reads) can be
incrementally added to the data portion of a PUT (or GET).

There is example code in test_aws (cases 11 and 12).  These tests rely on
pthreads, which may not be available on all platforms that want to use
libaws4c.  Therefore, these tests are not compiled by default.  If you
want to run them, you have to build with 'make ... PTHREADS=1'.
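
A minimal sketch of a custom readfunc, assuming the callback uses
libcurl's read-callback signature and receives the IOBuf as its last
argument, with aws_iobuf_readfunc() as an assumed name for the installer:

    #include "aws4c.h"

    // custom readfunc: curl calls this to pull the next piece of the PUT body
    size_t my_readfunc(void* ptr, size_t size, size_t nmemb, void* stream)
    {
        IOBuf* b = (IOBuf*)stream;
        // ... copy up to size*nmemb bytes of pending data into <ptr>,
        //     returning the number of bytes copied ...
        return 0;   // returning 0 tells curl the request body is complete
    }

    aws_iobuf_readfunc(io_buf, &my_readfunc);  // hypothetical installer
    AWS4C_CHECK( s3_put(io_buf, obj_name) );   // curl pulls data via my_readfunc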

Also added support for chunked-transfer-encoding.  This allows a PUT to be
sent when the total size of the final object is not known at the time the
PUT is invoked.  This could be combined with a streaming write.
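
A sketch, with s3_chunked_transfer_encoding() as an assumed name for the
enabling call:

    // send a PUT whose total size isn't known up front
    s3_chunked_transfer_encoding(1);          // "Transfer-Encoding: chunked"
    AWS4C_CHECK( s3_put(io_buf, obj_name) );
    s3_chunked_transfer_encoding(0);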

Extended IOBuf so that it keeps track of how much unread data is available.
This can be useful for implementing a streaming readfunc.

Also extended IOBuf to provide a pointer to user data.  This can also be
useful in a threaded readfunc/writefunc.  In libaws4c, these functions
receive callbacks from curl, receiving a pointer to an IOBuf.  If they need
some other context, this context can be placed into IOBuf.user_data.
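
For instance (a sketch only; the StreamCtx struct and the direct field
access are illustrative assumptions):

    #include <semaphore.h>   // sem_t, sem_wait()
    #include <string.h>      // memcpy()
    #include "aws4c.h"       // IOBuf

    // context shared with a producer thread, recovered inside the callback
    typedef struct {
        sem_t* data_ready;   // producer posts when new data is available
        char*  src;
        size_t remain;
    } StreamCtx;

    size_t stream_readfunc(void* ptr, size_t size, size_t nmemb, void* stream)
    {
        IOBuf*     b   = (IOBuf*)stream;
        StreamCtx* ctx = (StreamCtx*)b->user_data;   // recover our context
        sem_wait(ctx->data_ready);                   // wait for producer thread
        size_t n = (ctx->remain < size*nmemb) ? ctx->remain : size*nmemb;
        memcpy(ptr, ctx->src, n);
        ctx->src    += n;
        ctx->remain -= n;
        return n;                                    // 0 would mean end-of-body
    }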

Finally, test_aws adds an error-message if loading of the user's
config-file fails.  Otherwise, this failure causes an obscure segfault at
runtime.
Added support for user readfunc/writefunc/headerfunc.  These are useful to
support streaming transactions, where a client starts a request, then
provides buffers which are filled (emptied) by successive calls to
read (write), controlled via locking.

Exported default read/write/header functions, so custom functions can
hand off to them, if desired.  Renamed the default functions by prepending
"aws_" to their names, in order to avoid conflicts with functions in
standard libraries.

aws_iobuf_reset() doesn't clear read/write/headerfunc, user_data, or
growth_size.  This allows streaming I/O to communicate with these
functions by calling aws_iobuf_reset() to indicate empty, without having
to carefully reinstall these functions.
All global variables have been moved into a new AWSContext struct.  There
is a global instance of this struct, which is used by default everywhere.
However, users can now also create their own contexts and attach them to
individual IOBufs.  This allows multiple threads calling into libaws4c to
avoid stepping on each other's parameters.

The GET/PUT/DELETE requests generated by this library will look for a
context in the IOBuf, falling back to the default context, otherwise.
Thus, old code should continue to work without modification.
All the old interfaces continue to work the same as they did before
(i.e. not thread-safe).  The old functions that manipulated global
variables (e.g. aws_set_id(), or s3_set_host()) now just manipulate the
default context.  However, these functions now also have "_r" variants
(e.g. aws_set_id_r(), or s3_set_host_r()), which take an extra context
argument, changing the settings only in that context.

In other words, if you want/need thread-safety, you can now do something
like this:

   AWSContext* ctx = aws_context_new();
   aws_set_host_r(ctx, myhost);
   IOBuf* b = aws_iobuf_new();
   aws_iobuf_context(b, ctx);
   s3_put(b ...);
   aws_iobuf_reset_hard(b);  // frees context, if present
   aws_iobuf_free(b);
This fixes a problem where aws_iobuf_reset() was wiping out IOBuf.flags
when chunked-transfer-encoding was enabled.  The solution would be either
(a) add IOBuf.flags to the things that are preserved across calls to
aws_iobuf_reset(), or (b) move it to the context, which is already preserved.

TBD: In general, maybe a lot of the stuff that is currently preserved in
calls to aws_iobuf_reset() should just be moved into AWSContext, one way or
another.  Maybe it makes sense to keep IOBuf.user_data in IOBuf, and
preserve it through aws_iobuf_reset(), but many of the other things
(e.g. read_fn, write_fn, etc) should probably just be context-things.
See test_aws.c for an example, and there's discussion in README_lanl.
When using SSL (https), whether insecure or not, we need to provide the
string "https", rather than "http", in the header-generating code.
Separated these as two different settings, to be enabled via s3_https() and
s3_https_insecure(), with appropriate changes to the header-generating
code.
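
So, assuming both take an enable/disable flag like the library's other
toggles (an assumption; the commit only names the two functions):

    s3_https(1);            // generate "https://..." and verify the peer
    // or:
    s3_https_insecure(1);   // "https://...", but skip certificate verification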
…alues.

This behavior caused a subtle bug in MarFS, where the curl callback to a
custom readfunction happened at exactly the moment when the IOBuf.user_data
value was zeroed out, when another thread called aws_iobuf_reset().  The
other thread thought this was safe, because the custom readfunc always
waits on a semaphore before accessing the IOBuf contents. However, the
readfunc gets access to the semaphore via the user_data.  Even though
user_data was "not altered", in the long-run, by aws_iobuf_reset(), it was
actually being temporarily wiped, and then restored.

The upshot is that aws_iobuf_reset() cannot temporarily wipe everything
and then restore selected values, unless we wrap a lot more locking around
things.  The simpler approach is to tweak aws_iobuf_reset() so that it only
wipes those values that it is supposed to wipe, and leaves everything else
alone.
This is a workhorse that moves data to and from curl buffers within the
custom readfuncs used in MarFS.  Instead of iterating through chars, we
move swaths of storage via memcpy().
If we aren't already inside one of the library functions, then resetting
the connection should call curl_easy_cleanup(), instead of waiting for the
next use of this connection.  Otherwise, if the connection is never used
again, we leave file-descriptors sitting in CLOSE_WAIT, forever.
Use set_byte_range[_r] with a negative length to cause an open-ended HTTP
Range header to be used with the GET request.  This is useful in the MarFS
fuse implementation, to allow streams to stay open across calls to read().
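
For example:

    // read from <offset> to end-of-object: "Range: bytes=<offset>-"
    s3_set_byte_range(offset, -1);            // negative length => open-ended
    AWS4C_CHECK( s3_get(io_buf, obj_name) );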
s3_sproxyd() was setting CTE instead of SPROXYD.  This would force everyone
to use chunked-transfer-encoding, and would also compute S3 encrypted
headers even when they were not needed.

Compiling optimized, by default.  This turned up a possible use of an
uninitialized variable in sqs_example.c.

AWS4C_CHECK1() now returns non-zero return-codes directly, allowing
more-nuanced handling of curl errors.

aws_iobuf_extend_internal() returns early if <len> is zero.  This allows
more-efficient ways to send special signals to curl callback functions.
The idea is that you call s3_set_content_length() or
s3_set_content_length_r() before a call to some put/post operation
(e.g. s3_put()).  The content-length field in the AWSContext gets reset
during the put/post, so you must call it again before every put/post.

Testing with command-line curl shows significant bandwidth improvement (to
Scality sproxyd) using known content-length, as opposed to
chunked-transfer-encoding.

However, this attempt to invoke the same functionality through libcurl
apparently doesn't actually work, as of libcurl 7.19.7.  I'm still running
into something very similar to the 8-year-old bug reported here:

  http://curl.haxx.se/mail/archive-2008-05/0032.html

Or maybe it's this:

  http://curl.haxx.se/mail/archive-2011-08/0106.html

Anyhow, I want this support in place, in case we can fix it later.
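
Usage, per the description above (the argument form shown here is an
assumption; remember that the setting is consumed by each put/post):

    s3_set_content_length(total_len);         // must be re-set before every PUT
    AWS4C_CHECK( s3_put(io_buf, obj_name) );
    // NOTE: as described above, this didn't yet work against libcurl 7.19.7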
Added some comments, to warn that the content-length tools of the previous
commit are apparently not working yet, as noted in the previous commit-log
entry.

Conflicts:
	aws4c.c
	aws4c.h
This is useful in the case where a "streaming" write (e.g. from a pipe)
knows the length of data it is ultimately going to write.  This allows
libcurl to do some things more efficiently.  Used in MarFS.
Only if your libcurl is >= 7.38.  I actually haven't been able to test this
yet, because I have an older libcurl.  But some colleagues may be
downloading this soon, to test against a newer libcurl, and we want to try
this feature.
…om the

IOBuf, before pushing new ones to be handed to the curl-interaction thread.
However, the streaming_writeheaderfunc(), which parses and installs
response-header values into the IOBuf, may be invoked well before all the
streaming data has arrived.  In that case, aws_iobuf_reset() will wipe the
parsed results.  Therefore, we provide aws_iobuf_reset_lite(), which leaves
any parsed header-fields untouched.  Thus, the asynchrony between the two
threads is not a problem.
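
In other words (sketch):

    // between streamed chunks: empty the IOBuf's data, but keep any
    // response-header fields already parsed by streaming_writeheaderfunc()
    aws_iobuf_reset_lite(io_buf);
    aws_iobuf_append_static(io_buf, next_ptr, next_len);  // hand over next chunk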
Use s3_http_digest() to enable/disable HTTP digest authentication.  We use
the user/pass parsed from the ~/.awsAuth file in aws_read_config() as the
user/password for libcurl, which is ultimately where the authentication is
performed at runtime.  Subsequent calls to aws_context_clone() will get a
context that can still do this authentication.

This approach allows a process running as root to load the credentials
(from ~/.awsAuth) at initialization time, then de-escalate and continue to
use the context to do authentication, leaving the /root/.awsAuth file
unreadable by other users.
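
A sketch of that initialization order (the aws_read_config() argument form
is an assumption):

    // as root, at startup: load user/pass from /root/.awsAuth ...
    aws_read_config("myuser");
    s3_http_digest(1);   // ... then use HTTP digest auth from here on
    // ... de-escalate privileges; the loaded credentials remain usable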
…passwd file in place of HOME to find password.
…es to NULL.

This appears to address a MarFS problem, where streaming_writeheader() was
accessing illegal memory.
jti-lanl added 7 commits May 11, 2016 11:07
Optional arg allows access to object-store that uses HTTP-digest authentication.
…date.

GetStringToSign() takes a new DateConv struct, instead of a char** to
receive the date that is computed internally.  If the time_t* inside
the DateConv is non-null, we use that, instead of the current time, to
generate the signature.

This allows a custom server to authenticate a signed request by
generating its own signature, using a date supplied in the request,
and a password looked up locally using a user-name supplied in a
request.