How to detect whether a string is Base64 encoded or not?
Base64 strings form a subset of all possible strings. Thus, not every string is a valid Base64 encoded string.
The easiest way to implement this validation in your code is to try to decode the input string and check whether the decoding succeeds. You can learn from the snippets on the corresponding pages about Base64 in specific programming languages.
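As a rough illustration of the decode-and-check approach, here is a minimal Python sketch. It assumes the standard alphabet with padding, and the function name decodes_as_base64 is made up for this example. It relies on base64.b64decode from the standard library with validate=True, so that characters outside the alphabet raise an error instead of being silently skipped.

```python
import base64
import binascii


def decodes_as_base64(text: str) -> bool:
    """Return True if `text` decodes as standard, padded Base64."""
    try:
        # validate=True rejects characters outside the Base64 alphabet
        # instead of silently discarding them.
        base64.b64decode(text, validate=True)
    except (binascii.Error, ValueError):
        # binascii.Error: bad characters or incorrect padding.
        # ValueError: the str contains non-ASCII characters.
        return False
    return True


print(decodes_as_base64("aGVsbG8="))     # True
print(decodes_as_base64("not base64!"))  # False
```

Keep in mind that a successful decode only means the string is well-formed Base64, not that somebody actually produced it by encoding something.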
Base64 alphabet and detailed explanation
First, let’s consider the trivial case: the empty string. The empty string is a valid Base64 string, because it is the encoding of empty input.
All remaining cases are non-trivial, and knowing the internals of Base64 encoding helps us handle them.
Base64 encoding uses a limited set of characters to represent the output. This set is called the Base64 Alphabet, and it is described in RFC 4648 in the table titled “The Base 64 Alphabet”.
This alphabet contains the following characters:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
=
Notice that we list = separately; we will talk about it below.
This implies the first thing we should determine: whether the input string consists only of characters from the Base64 Alphabet.
But this condition is not enough.
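The first condition can be sketched in a few lines of Python. The names BASE64_ALPHABET and uses_only_base64_alphabet are chosen for this illustration only. The last call shows why the character check alone is not sufficient.

```python
# Standard Base64 alphabet plus the padding symbol '='.
BASE64_ALPHABET = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="
)


def uses_only_base64_alphabet(text: str) -> bool:
    """Return True if every character of `text` is in the alphabet."""
    return set(text) <= BASE64_ALPHABET


print(uses_only_base64_alphabet("aGVsbG8="))  # True
print(uses_only_base64_alphabet("hello!"))    # False
print(uses_only_base64_alphabet("a=b=c"))     # True, yet not valid Base64
```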
The separately listed symbol = has a sole purpose in Base64 encoding: padding!
Padding appears when the number of bytes in the source data passed to the Base64 encoder is not a multiple of 3. Base64 turns every 3 input bytes into 4 output characters, so an incomplete final group is “padded” with = characters to keep the output length a multiple of 4 and to make it decodable.
By design, the output either has no padding or ends with = or ==. Three padding characters never occur.
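To see how much padding an encoder actually emits, we can encode inputs of one, two, and three bytes with Python's standard base64 module and count the trailing = characters. The helper name padding_length is made up for this sketch.

```python
import base64


def padding_length(data: bytes) -> int:
    """Count the trailing '=' characters in the Base64 encoding of `data`."""
    encoded = base64.b64encode(data).decode("ascii")
    return len(encoded) - len(encoded.rstrip("="))


for data in (b"a", b"ab", b"abc"):
    encoded = base64.b64encode(data).decode("ascii")
    print(len(data), encoded, padding_length(data))
# 1 YQ== 2
# 2 YWI= 1
# 3 YWJj 0
```

The pattern repeats for longer inputs: the padding length depends only on the input length modulo 3.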
Consequently, the second condition we have to check is that the string does not contain the padding symbol = anywhere before the end of the string. The string is either built only from Base64 Alphabet symbols other than =, or it ends with one or two = characters.
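The padding condition on its own might look like the sketch below. It assumes that standard Base64 never emits more than two = characters, and the function name padding_is_only_at_end is hypothetical.

```python
def padding_is_only_at_end(text: str, max_padding: int = 2) -> bool:
    """Return True if '=' appears only as a short run at the very end."""
    body = text.rstrip("=")          # everything before the trailing '=' run
    padding = len(text) - len(body)  # length of that trailing run
    return "=" not in body and padding <= max_padding


print(padding_is_only_at_end("YQ=="))   # True
print(padding_is_only_at_end("Y=Q="))   # False, '=' in the middle
print(padding_is_only_at_end("YQ==="))  # False, too much padding
```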
Code prototype of detecting whether a string is Base64 encoded or not
We can prototype this algorithm in Python.
def check_string_is_base64_encoded(value):
    # Base64 Alphabet without the padding symbol '='.
    chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
    max_padding_length = 2
    # Second condition: '=' may only appear as one or two trailing characters.
    body = value.rstrip('=')
    if len(value) - len(body) > max_padding_length:
        return False
    # First condition: everything before the padding comes from the alphabet,
    # so a '=' anywhere in the middle is rejected here as well.
    return set(body).issubset(chars)
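A stricter variant is possible if we also assume that padded Base64 output always has a length that is a multiple of 4. The sketch below expresses that as a regular expression: any number of complete 4-character groups, optionally followed by one final group that ends in = or ==. The names STRICT_BASE64 and looks_like_padded_base64 are illustrative only.

```python
import re

# Complete groups of four alphabet characters, then at most one padded group.
STRICT_BASE64 = re.compile(
    r"(?:[A-Za-z0-9+/]{4})*"
    r"(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
)


def looks_like_padded_base64(text: str) -> bool:
    """Return True if `text` has the shape of standard, padded Base64."""
    return STRICT_BASE64.fullmatch(text) is not None


print(looks_like_padded_base64("YWI="))  # True
print(looks_like_padded_base64("A="))    # False, length is not a multiple of 4
```

This rejects strings such as A=, which pass checks based only on the character set and the position of the padding.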