Encodings

Content

Content
Links
Definitions
Different encode types
Encoding tricks
Special characters

Definitions

Symbol encoding (coded character set): defines which number (code point) is assigned to each symbol or glyph.
(e.g. Unicode)
Character encoding: defines how those numbers (each signifying a character) are encoded as bytes for storage or transmission, and how the bytes are decoded back.
(e.g. UTF-8, UTF-16, …)

There are many legacy encodings (e.g. cp1251, …) that blur the two concepts by defining both at once: the symbol encoding and the character encoding.

Different encode types

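URL encode (percent-encoding): a URL may contain only ASCII characters, so any other byte is written as % followed by its two-digit hex value.

Hello World –> Hello%20%57%6f%72%6c%64 (ordinary ASCII characters may optionally be encoded as well)
` ` –> + or %20 (+ is valid only in query strings and form data)
non-ASCII symbols: ü –> %C3%BC (hex of the UTF-8 bytes)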
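
The URL examples above can be reproduced with Python's standard `urllib.parse` module. This is a small sketch; the variable names are illustrative only.

```python
from urllib.parse import quote, quote_plus, unquote

# Ordinary ASCII letters may be percent-encoded too; decoding restores them.
decoded = unquote("Hello%20%57%6f%72%6c%64")

# A space becomes %20 in general, or + in query-string / form encoding.
space_as_percent = quote("Hello World")
space_as_plus = quote_plus("Hello World")

# Non-ASCII characters are encoded as the hex of their UTF-8 bytes.
non_ascii = quote("ü")

print(decoded, space_as_percent, space_as_plus, non_ascii)
```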

HTML entities: any symbol can be encoded as a numeric character reference, in decimal (`&#123;` for {) or in hex (`&#x123;` for ģ).
The browser renders encoded symbols as text and does not interpret them as special (markup) characters.

examples:

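symbol	description	entity name	entity number
' '	non-breaking space	`&nbsp;`	`&#160;`
<	less than	`&lt;`	`&#60;`
>	greater than	`&gt;`	`&#62;`
&	ampersand	`&amp;`	`&#38;`
¢	cent	`&cent;`	`&#162;`
£	pound	`&pound;`	`&#163;`
¥	yen	`&yen;`	`&#165;`
€	euro	`&euro;`	`&#8364;`
©	copyright	`&copy;`	`&#169;`
®	registered trademark	`&reg;`	`&#174;`
	etc.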
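
HTML entity encoding and decoding can be tried with Python's standard `html` module. This is a sketch; the sample markup string is made up for illustration.

```python
import html

# Escaping turns markup characters into entities, so a browser shows them as text.
escaped = html.escape('<a href="x">&</a>')

# Named, decimal and hex references all decode back to the same kind of character.
from_name = html.unescape("&lt;")
from_decimal = html.unescape("&#123;")
from_hex = html.unescape("&#x123;")
nbsp = html.unescape("&nbsp;")

print(escaped, from_name, from_decimal, from_hex, repr(nbsp))
```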

Encoding tricks

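Encodings latin1, gbk and character escaping

In latin1, string=%BF%27=¿'

After the symbol %27=' is escaped with %5C=\, string=%BF%5C%27

In gbk, string=%BF%5C%27=縗' (the bytes %BF%5C form a single two-byte character, so the quote is left unescaped)

If the connection character set was set with mysql SET NAMES gbk;, this trick helps to bypass the PHP function mysql_real_escape_string, which then still escapes the input as if it were in the default single-byte character set.

Similar tricks work with the following encodings: big5, cp932, gb2312, gbk and sjis.

\x90 is the x86 NOP (no-operation) opcode.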
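
The latin1/gbk escaping trick can be traced byte by byte in Python. This is only a sketch of the byte handling: the `replace` call stands in for a charset-unaware escaping routine and is not the actual PHP or MySQL implementation.

```python
# Attacker-supplied bytes: 0xBF followed by a single quote.
raw = b"\xbf\x27"
as_latin1 = raw.decode("latin-1")       # two separate characters in latin1

# A byte-oriented escaper (assumption: it knows nothing about gbk)
# puts a backslash (0x5C) in front of the quote.
escaped = raw.replace(b"'", b"\\'")     # b"\xbf\x5c\x27"

# Read as gbk, 0xBF 0x5C is one two-byte character; the backslash is
# swallowed and the quote that follows is no longer escaped.
as_gbk = escaped.decode("gbk")

print(as_latin1, escaped, as_gbk)
```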

Special characters

Special unicode symbols:

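replacement character: U+FFFD (“\ufffd”), shown where bytes cannot be decoded
RLO (also written RTLO), right-to-left override: U+202E (“\u202e”), reverses the display order of the text that follows it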
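
A short Python sketch of both special symbols. The byte string and the file name are hypothetical examples chosen for illustration.

```python
import unicodedata

REPLACEMENT = "\ufffd"
RLO = "\u202e"

# Undecodable bytes become U+FFFD when decoding with errors="replace".
broken = b"caf\xe9".decode("utf-8", errors="replace")

# Hypothetical file name: a bidi-aware renderer may display the part after
# RLO reversed, so it can look like it ends in "pdf" while the real
# extension stays ".exe".
spoofed = "invoice" + RLO + "fdp.exe"

print(repr(broken), unicodedata.name(RLO), repr(spoofed))
```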

space imitation (hex)

%20	space
/**/	empty comment (e.g. in SQL)
%09	horizontal tab
%0A	new line (line feed)
%0B	vertical tab
%0C	form feed (new page)
%0D	carriage return
%A0	non-breaking space
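
The percent-encoded entries in this list can be checked in Python. This sketch assumes the bytes are read as latin1, where %A0 is a single-byte non-breaking space. The `/**/` comment is left out because it is not a percent-encoded byte.

```python
from urllib.parse import unquote_to_bytes

CANDIDATES = ["%20", "%09", "%0A", "%0B", "%0C", "%0D", "%A0"]

def whitespace_report(candidates):
    """Map each percent-encoded token to whether it decodes to whitespace."""
    report = {}
    for token in candidates:
        char = unquote_to_bytes(token).decode("latin-1")
        report[token] = char.isspace()
    return report

report = whitespace_report(CANDIDATES)
print(report)
```

Whether a given parser or filter treats each of these bytes as a separator depends on the target. Python's `str.isspace` is only one possible definition of whitespace.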

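using utf-8 encoding (encode a one-byte symbol using 2 or 3 bytes):

Some systems do not validate \xC0 and \xE0 as UTF-8 lead bytes (or the bytes that follow them), so the malformed sequences below may be decoded as the whitespace characters 0x09-0x0D and 0x20: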

  \xC0\x49, \xC0\x4A, \xC0\x4B, \xC0\x4C, \xC0\x4D
  \xC0\x89, \xC0\x8A, \xC0\x8B, \xC0\x8C, \xC0\x8D
  \xC0\xC9, \xC0\xCA, \xC0\xCB, \xC0\xCC, \xC0\xCD
  \xC0\xE0, \xC0\x60
  \xE0\x00\x09, \xE0\x40\x09, \xE0\x80\x09
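
To see why these sequences can work, here is a sketch of a deliberately lenient decoder. The function `lenient_decode` is hypothetical: it keeps only the payload bits of each byte and checks nothing else, which is an assumption about how a non-validating system might behave. Python's own strict UTF-8 decoder rejects the same input.

```python
def lenient_decode(seq: bytes) -> int:
    """Decode one 2- or 3-byte sequence without validating it (hypothetical)."""
    lead = seq[0]
    if lead >> 5 == 0b110 and len(seq) == 2:
        value = lead & 0x1F              # payload bits of a 2-byte lead
    elif lead >> 4 == 0b1110 and len(seq) == 3:
        value = lead & 0x0F              # payload bits of a 3-byte lead
    else:
        raise ValueError("unsupported sequence")
    for byte in seq[1:]:
        # Keep the low 6 bits without checking for the 10xxxxxx pattern.
        value = (value << 6) | (byte & 0x3F)
    return value

def strict_rejects(seq: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder refuses the sequence."""
    try:
        seq.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False

samples = [b"\xc0\x49", b"\xc0\x89", b"\xc0\xc9", b"\xc0\xe0", b"\xc0\x60",
           b"\xe0\x00\x09", b"\xe0\x40\x09", b"\xe0\x80\x09"]
for s in samples:
    print(s, hex(lenient_decode(s)), strict_rejects(s))
```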

ascii special characters (dec)

' '	32	+	43	@	64
!	33	,	44	[	91
"	34	-	45	\	92
#	35	.	46	]	93
$	36	/	47	^	94
%	37	:	58	_	95
&	38	;	59	`	96
'	39	<	60	{	123
(	40	=	61	|	124
)	41	>	62	}	125
*	42	?	63	~	126

ascii control characters (dec)

NUL	00	null character	VT	11	vertical tab	SYN	22	synchronize
SOH	01	start of header	FF	12	form feed	ETB	23	end transmission block
STX	02	start of text	CR	13	carriage return	CAN	24	cancel
ETX	03	end of text	SO	14	shift out	EM	25	end of medium
EOT	04	end of transmission	SI	15	shift in	SUB	26	substitute
ENQ	05	enquiry	DLE	16	data link escape	ESC	27	escape
ACK	06	acknowledge	DC1	17	device control 1	FS	28	file separator
BEL	07	bell (ring)	DC2	18	device control 2	GS	29	group separator
BS	08	backspace	DC3	19	device control 3	RS	30	record separator
HT	09	horizontal tab	DC4	20	device control 4	US	31	unit separator
LF	10	line feed	NAK	21	negative acknowledge	DEL	127	delete (rubout)
