Create random test data using real data from your MySql database.

by Ronald Speelman

Mix and match random data with real data from your database.

Recently I showed a function that could create test data based on a mask or pattern.
See: http://moinne.com/blog/ronald/mysql/howto-generate-meaningful-test-data-using-a-mysql-function
This works fine for phone numbers, zip-codes etc. but it is not very useful for things like names of persons, products you sell, countries- and cities, etc. etc.
However it is very simple to mix the output of the str_random() function with data that is already available in your database.

Consider this example, we want to generate a random e-mail address only using the str_random() function:
MYSQL>SELECT str_random('c{3}c(5)[.|_]c{8}c(8)@[google|yahoo|live|mail]".com"') AS email;

This function can give results like this:
edbsotd.afdyittnge@google.com
smhwnzi_pwodzpzbwef@live.com
hpgz_gvkmgtbyitups@yahoo.com

We have generated valid email addresses, but they do not look like real email addresses.
We could do better if we could use real names in the test data. The following example gets this data from the table “persons”.
I ask only for a random female name, a random last name and optionally we add a number after the last name:

MYSQL>SELECT CONCAT((SELECT firstname FROM persons WHERE gender = 'f' AND RAND()<(SELECT ((1/COUNT(*))*10) FROM persons) LIMIT 1)
             ,str_random('[.|-|_]')
             ,(SELECT lastname FROM persons WHERE RAND()<(SELECT ((1/COUNT(*))*10) FROM persons) LIMIT 1)
             ,str_random('d(4)@[google|yahoo|live|mail]".com"')
            ) AS email;

The results are like this:
Emily_Kendall55@google.com
Angela-Lim86@mail.com
Melanie.Kirby@mail.com

This is much better, as we can use real names combined with different formatting and different mail providers.
I have added the "persons" table in the zip file for your convenience.

Fortunately, the "persons" table is small, only a thousand rows. I do not recommend selecting random rows from very large tables, this can be very slow.
If you need to select random rows for your test data from a very large table, check out this link: http://www.warpconduit.net/2011/03/23/selecting-a-random-record-using-mysql-benchmark-results/
This will show different techniques to do this as efficient as possible.

In the next example we create some random user information: firstname, lastname, birthdate, email, city, address and country.
- The firstname and lastname should be reflected in the emailadres.
- The city name should belong in the country.

The country information is from geonames.org. You can download the information here: http://download.geonames.org/export/dump/countryInfo.txt
I have used a simplified version and have added it in the zip file as a table.

I have used the city information from geonames.org also. You can download this text file here: http://download.geonames.org/export/dump/cities1000.zip
I have added a new auto increment column so it is more efficient to select random rows from it.
You can create a table and import the data like this:

CREATE TABLE `cities` (
  `auto_id` int(11) unsigned AUTO_INCREMENT NOT NULL,
  `id` int(11) unsigned NOT NULL,
  `name` varchar(200) NOT NULL DEFAULT '',
  `ascii_name` varchar(200) NOT NULL DEFAULT '',
  `alternate_names` varchar(2000) NOT NULL DEFAULT '',
  `lat` double(11,7) NOT NULL DEFAULT '0.0000000',
  `lon` double(11,7) NOT NULL DEFAULT '0.0000000',
  `feature_class` char(1) DEFAULT NULL,
  `feature_code` varchar(10) DEFAULT NULL,
  `country_code` char(2) DEFAULT NULL,
  `country_code2` varchar(60) DEFAULT NULL,
  `admin1_code` varchar(20) DEFAULT '',
  `admin2_code` varchar(80) DEFAULT '',
  `admin3_code` varchar(20) DEFAULT '',
  `admin4_code` varchar(20) DEFAULT '',
  `population` bigint(11) DEFAULT '0',
  `elevation` int(11) DEFAULT '0',
  `dem` int(11) DEFAULT '0',
  `timezone` varchar(40) DEFAULT NULL,
  `mod_date` date DEFAULT '0000-00-00',
  PRIMARY KEY (`auto_id`),
  KEY `country_code_name_idx` (`country_code`,`name`),
  KEY `fk_cities_countries` (`country_code`),
  CONSTRAINT `fk_cities_countries` FOREIGN KEY (`country_code`) REFERENCES `countries` (`iso_3166_2`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

LOAD DATA LOCAL INFILE '/path_to_file/cities1000.txt' REPLACE INTO TABLE cities FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' (id,name,ascii_name,alternate_names,lat,lon,feature_class,feature_code,country_code,country_code2, admin1_code,admin2_code,admin3_code,admin4_code,population,elevation,dem,timezone,mod_date) ;
ANALYZE TABLE cities;

Let's create a table for the test data:

CREATE TABLE IF NOT EXISTS testdata (
  `firstname`        VARCHAR(200) NOT NULL,
  `lastname`         VARCHAR(200) NOT NULL,
  `email`            VARCHAR(200) NOT NULL,
  `birthday`         DATE NOT NULL,
  `city`             VARCHAR(200) NOT NULL,
  `address`          VARCHAR(200) NOT NULL,
  `country`          VARCHAR(200) NOT NULL
);

And insert a row:

SELECT firstname INTO @firstname FROM persons WHERE RAND()<(SELECT ((1/COUNT(*))*10) FROM persons) ORDER BY RAND() LIMIT 1;
SELECT lastname INTO @lastname FROM persons WHERE RAND()select * from testdata;
SELECT DISTINCT co.iso_3166_2 INTO @countrycode FROM countries co INNER JOIN cities ci ON (co.iso_3166_2 = ci.country_code) ORDER BY RAND() LIMIT 1;
SELECT `name` INTO @countryname FROM countries WHERE iso_3166_2 = @countrycode;
SELECT `ascii_name` INTO @city FROM (SELECT `ascii_name` FROM cities WHERE country_code = @countrycode ORDER BY RAND()) res LIMIT 1;
INSERT INTO testdata(firstname,lastname,email,birthday,city,address,country)
VALUES(@firstname
      ,@lastname
      ,CONCAT(@firstname,str_random('[.|-|_]'),@lastname,str_random('d(4)@[google|yahoo|live|mail]".com"'))
      ,str_random_date('1920-01-01','1999-12-31','%Y-%m-%d')
      ,@city
      ,str_random('Cc{5}[street|lane|road|park] d{1}d(2)')
      ,@countryname
      );

After running this code a couple of times the table is filled like this:

MYSQL>select * from testdata;
+------------+-----------+------------------------------+------------+---------------------+-----------------+-----------------------------+
| firstname  | lastname  | email                        | birthday   | city                | address         | country                     |
+------------+-----------+------------------------------+------------+---------------------+-----------------+-----------------------------+
| Phyllis    | Lin       | Phyllis_Lin@mail.com         | 1940-09-04 | Ad Dakhla           | Qittopstreet 1  | Western Sahara              |
| Carmen     | Robinson  | Carmen-Robinson879@live.com  | 1985-04-14 | Meedhoo             | Vtgybmstreet 41 | Maldives                    |
| Danielle   | Sutton    | Danielle_Sutton9@live.com    | 1989-03-15 | Schellenberg        | Dbvyisroad 0    | Liechtenstein               |
| Jeremy     | Sellers   | Jeremy-Sellers58@live.com    | 1951-07-08 | Lucea               | Geaoswstreet 3  | Jamaica                     |
| Lisa       | MacDonald | Lisa.MacDonald@google.com    | 1929-09-05 | Marigot             | Fqpaippark 10   | Saint Martin                |
| Courtney   | Newton    | Courtney.Newton5@yahoo.com   | 1943-01-17 | La Condamine        | Irmkoplane 6    | Monaco                      |
| Rebecca    | Mathews   | Rebecca_Mathews037@live.com  | 1994-09-01 | Las Acequias        | Ucdjmhpark 0    | Argentina                   |
| Marshall   | Kearney   | Marshall_Kearney3@mail.com   | 1921-01-29 | Kirakira            | Qbmfrsroad 6    | Solomon Islands             |
| Christian  | Pollock   | Christian-Pollock65@live.com | 1953-03-06 | Laayoune / El Aaiun | Hctktmstreet 66 | Western Sahara              |
| Jamie      | Rosenberg | Jamie-Rosenberg8@google.com  | 1952-03-06 | Hadrut              | Nidsbfpark 03   | Azerbaijan                  |
| Jose       | Morton    | Jose-Morton34@yahoo.com      | 1987-06-21 | Ilorin              | Vrvcawstreet 23 | Nigeria                     |
| Marguerite | Melton    | Marguerite.Melton@mail.com   | 1948-04-17 | Port-aux-Francais   | Iqfdbylane 48   | French Southern Territories |
| Debbie     | Langston  | Debbie.Langston414@mail.com  | 1974-02-12 | Ugurchin            | Zyroszpark 9    | Bulgaria                    |
+------------+-----------+------------------------------+------------+---------------------+-----------------+-----------------------------+
13 rows in set (0.00 sec)

The birthday is generated with a small function that generates a random date between a date range. You can format the output with the third parameter.
E.g.:

MYSQL>SELECT str_random_date('1920-03-12','2012-01-31','%Y-%m-%d') AS RandomMysqlDate
     , str_random_date('1920-03-12','2012-01-31','%Y-%m-%d:%T') AS RandomDateWithTime
     , str_random_date('2012-01-01','2012-12-31','%u') AS RandomWeeknumberIn2012
     , str_random_date('2012-02-01','2012-03-01','%e') AS RandomDayInFebruary
\G
*************************** 1. row ***************************
       RandomMysqlDate: 1992-09-17
    RandomDateWithTime: 1943-03-10:14:28:02
RandomWeeknumberIn2012: 46
   RandomDayInFebruary: 20
1 row in set (0.00 sec)

You can download the persons and countries tables in this zip file: countries_persons_tables_2012-11-28.zip
You can download the str_random_date function in this zip file: str_random_date_2012-07-29.zip
You can download the functions for generating random data and see examples of generating useful test data here: http://moinne.com/blog/ronald/mysql/howto-generate-meaningful-test-data-using-a-mysql-function

Have fun!

Mysql: creating a link to your glossary while fetching text for a webpage

by Ronald Speelman

This MySql function will generate a link to your glossary when you are fetching text for a web page.
All code is provided in a zip file.

The requirements:
I needed to add a new functionality for an existing application. I needed to add a link to the glossary page whenever a text contains a word that is in the glossary. The requirements where to avoid a lot of database calls, it needs to be fast and it needed to be implemented in existing application code.

There are three other requirements that are related to adding a link to the glossary:
1: Only link the first occurrence of a word. If a word occurs multiple times in a text, you only add a link to the first occurrence of the word.
2: Keywords are case sensitive. E.g.: the word “August” does not always means a month; “an august person”. Or an apple is different than the company with the same name but always written with an upper case A: “Apple”
3: A keyword can actually be multiple words like: “primary key” or “board of directors”.

Install the code:
First create a glossary table (The script is in the zip file):

mysql>CREATE TABLE glossary(
    glossary_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT
   ,`name` VARCHAR(100) BINARY NOT NULL
   ,description VARCHAR(4000) NOT NULL
   ,PRIMARY KEY (`glossary_id`)
   ,UNIQUE glossary_name_udx (`name`)
   ) ENGINE = 'InnoDB'
   DEFAULT CHARACTER SET = utf8
   COLLATE = utf8_bin
   ;

A glossary table usually does not has a lot of rows, I have to see the first glossary table with more then 1000 rows. A smallint is enough.
The name must be unique.

Set-up test data:
Populate the glossary table with a 1000+ rows so we can test the performance on a full table. The first 1000 are just random data and the last 5 rows holds real definitions used for our test. Find the script to populate the table in the zip file and execute it:

mysql>source populate_glossary.sql;

Analyze the table:

mysql>ANALYZE TABLE glossary;
+------------------+---------+----------+----------+
| Table            | Op      | Msg_type | Msg_text |
+------------------+---------+----------+----------+
| schemas.glossary | analyze | status   | OK       |
+------------------+---------+----------+----------+
1 row in set (0.00 sec)

Add the function:

mysql>source fn_glossary_link_words.sql;

The function we just created has two parameters:
p_text String: The text that contains the keywords that must be replaced
p_format String: A string that replaces the keyword. This string can contain three place-holders: {id} , {name} , {description}
An example of such a format is:’<a href="/glossary.php?id={id}">{name}</a>’

Test the function:
As an example of how to execute the function let’s first set up a test table that contains the text that we want to show on a webpage:

mysql>CREATE table t1(id INT,c1 VARCHAR(400));
mysql>INSERT INTO t1 VALUES (1,'This text has twice the string: "table" and twice the string "column". Only the first occurrence of the string is marked and the following strings not: table column. This text also has a glossary word that consist of two words like "primary key". Also note that the lower-case string "mysql" is not marked because in the glossary the string is using upper and lower-case like this: MySQL.');

Now select the text from column “c1″ and add the links to the glossary

mysql>SELECT str_add_glossary_words(c1,'<a href="/glossary/{id}">{name}</a>') AS textWithGlossaryLinks
        FROM t1
       WHERE id = 1;

This is the result:
This text has twice the string: "<a href="/glossary/1002">table</a>" and twice the string "<a href="/glossary/1005">column</a>". Only the first occurrence of the string is marked and the following strings not: table column. This text also has a glossary word that consist of two words like "<a href="/glossary/1001">primary key</a>". Also note that the lower-case string "mysql" is not marked because in the glossary the string is using upper and lower-case like this: <a href="/glossary/1004">MySQL</a>.

This is how the result looks on a webpage ( the links do not work, because the glossary page does not exist):
This text has twice the string: “table” and twice the string “column“. Only the first occurrence of the string is marked and the following strings not: table column. This text also has a glossary word that consist of two words like “primary key“. Also note that the lower-case string “mysql” is not marked because in the glossary the string is using upper and lower-case like this: MySQL.

Compare it with the original text:
This text has twice the string: “table” and twice the string “column”. Only the first occurrence of the string is marked and the following strings not: table column. This text also has a glossary word that consist of two words like “primary key”. Also note that the lower-case string “mysql” is not marked because in the glossary the string is using upper and lower-case like this: MySQL.

Conclusion:
It is easy to apply this function to an existing application, you only have to add the function to some of your database calls and leave the rest of the application unchanged. There are no extra database calls and replacing the text is as fast in your application code as in the database.
You can download all code in this zip file: glossary_2012-07-07.zip

MySql Lorum Ipsum generator

by Ronald Speelman

A short MySql function to generate a Lorum Ipsum text.
You can download the code in the zip file below.

The function has three parameters:
p_max_words Number: the maximum amount of words, if no min_words are provided this will be the exact amount of words in the result. Default = 50
p_min_words Number: the minimum amount of words in the result, By providing the parameter, you provide a range. Default = 0
p_start_with_lipsum Boolean: if “1″ the string will start with ‘Lorum ipsum dolor sit amet.’, Default = 0

Create exactly 3 words:

mysql>SELECT str_random_lipsum(3,NULL,NULL);
+--------------------------------+
| str_random_lipsum(3,NULL,NULL) |
+--------------------------------+
| Class facilisi dictum.         |
+--------------------------------+
1 row in set (0.00 sec)

Create exactly 20 words and let it start with “Lorum ipsum dolor sit amet.”

mysql>SELECT str_random_lipsum(20,NULL,1);
+----------------------------------------------------------------------------------------------------------------------------------------+
| str_random_lipsum(20,NULL,1)                                                                                                           |
+----------------------------------------------------------------------------------------------------------------------------------------+
| Lorem ipsum dolor sit amet. Penatibus eleifend cum amet sagittis sem per vestibulum sociosqu. Sem etiam vehicula urna scelerisque dui. |
+----------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Create max 30 words, but minimal 10, let it start with “Lorum ipsum dolor sit amet.”

mysql>SELECT str_random_lipsum(30,10,1);
+----------------------------------------------------------------------------------------------------+
| str_random_lipsum(30,10,1)                                                                         |
+----------------------------------------------------------------------------------------------------+
| Lorem ipsum dolor sit amet. Senectus erat sed est urna sollicitudin litora pulvinar felis. Varius. |
+----------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

You can download the function in this zip file: str_random_lipsum_2012-07-07.zip

Howto generate meaningful test data using a MySQL function

by Ronald Speelman

You can use this MySQL function to generate names, (e-mail)addresses, phone numbers, urls, bit values, colors, IP address, etc..
As usual, the code is provided in a zipfile and the code is fully documented.

When creating random test data it is important that the data matches your domain values. Just a random string generated with a MD5 hash does not resemble real data and cannot be used for testing. There are some tools that you can buy to generate meaningful test data, but you can do this with a simple function in your database, free of charge ;-)

It is as simple as this:
MYSQL>SELECT str_random('[http|https]://www.c{4}c(15).[com|co.uk|org|net]') AS url;

How does it work?
This functionality makes use of masks, or a pattern, to generate the random values. During the generation process the masking characters will be replaced with a random character matching the mask values.
A mask is any of these characters: c, C, A, d, D, b, X, *
Besides the mask you can give directives on how to interpret the masks: nn times, random times, fixed text or an option list.

This is the full description of the masks and directives:

MASKS:
c returns lower-case character [a-z]
C returns upper-case character [A-Z]
A returns either upper or lower-case character [a-z A-Z]
d returns a digit [0-9]
D returns a digit without a zero [1-9]
b returns a bit [0-1]
X returns hexadecimal character [0-F]
* returns characters, decimals and special characters [a-z A-Z 0-9 !?-_@$#]

DIRECTIVES
“text” : text is taken literally
{nn} : repeat the last mask nn times
(nn) : repeat random, but max nn times
[item|item] : pick a random item from this list, items are separated by a pipe symbol
All other characters are taken literally

How to install?
In the zip file are two functions:
fn_str_random.sql and fn_str_random_character.sql
Install first fn_str_random_character.sql and then fn_str_random.sql because function str_random() is dependent on function str_random_character().

str_random() loops through the pattern,handles the directives and collects the whole string. We will use only str_random() in the following examples.
str_random_character() is a helper function for str_random(), this function generates a single random character based on the mask.

You can input a mask of 200 characters and the output is limited to 2000 characters. You can change these values if you feel the need for it.

You can call the function in any INSERT INTO statement like:
INSERT INTO t1(c1) VALUES(str_random('add your mask here'));
Adding it in a loop in a stored procedure or execute the SQL statement in a loop in your (PHP) application will generate lot’s of test data for your application.

Examples:
These examples only show the functionality of the function, not inserting the generated values in a table.

The first example is a simple one, we want to generate a Dutch zip code like “1234 AB”.
We can do this is two ways:

MYSQL>SELECT str_random('dddd CC') AS DutchZip1,  str_random('d{4} C{2}') AS DutchZip2;
+-----------+-----------+
| DutchZip1 | DutchZip2 |
+-----------+-----------+
| 8868 GS   | 8911 DH   |
+-----------+-----------+
1 row in set (0.00 sec)

d is a digit[0-9] and C is an upper-case character[A-Z]
In the second we used a directive to repeat the digit exactly four times and the upper-case character exactly two times.
The space in between the masks is taken literally because it is not a known mask or a directive.

Now what if we have the need for not an exact amount of characters, for example we want to generate a password of at least 6 characters?

MYSQL>SELECT str_random('*{6}*(5)') AS pass;
+-----------+
| pass      |
+-----------+
| 0rTM?rY6P |
+-----------+
1 row in set (0.00 sec)

The star * returns all alpha numeric, numeric and some special characters.
We want at least 6 characters so we use *{6} and a random number on top of that.
For a random number we use (nn)

Now what if we have the need of a * in our test data? The * is a known mask so we have to tell the function to take it literally.
We can do that by surrounding it in double quotes like this:

MYSQL>SELECT str_random('"***"d{2}"***"') AS test;
+----------+
| test     |
+----------+
| ***95*** |
+----------+
1 row in set (0.00 sec)

Another useful feature is the possibility to add a list of possible values. This might be useful for example when generating an url.
You do not want the top domain name completely random but a value like:”.com” or “.org”

MYSQL>SELECT str_random('[http|https]://www.c{4}c(15).[com|co.uk|org|net]') AS url;
+----------------------------+
| url                        |
+----------------------------+
| http://www.rdokmknmt.co.uk |
+----------------------------+
1 row in set (0.00 sec)

The list is surrounded by square brackets and the items in the list are separated by the pipe symbol.

Here are some other useful patterns you can use:

MYSQL>SELECT str_random('c{3}c(5)[.|_]c{8}c(8)@[google|yahoo|live|mail]".com"') AS email1,
       str_random('c{3}c(5)[.|_]c{8}c(8)@c{4}c(3).[com|de|co.uk|fr|org|net]') AS email2;
+-----------------------------+--------------------------------+
| email1                      | email2                         |
+-----------------------------+--------------------------------+
| jfvqgj.adngpgnuir@yahoo.com | uvucb_uxcxcwzjhwidsch@fdars.fr |
+-----------------------------+--------------------------------+
1 row in set (0.00 sec)

MYSQL>SELECT str_random('#X{6}') AS htmlcolor, str_random('"("d{3},d{3},d{3}")"') AS RGBcolor ;
+-----------+---------------+
| htmlcolor | RGBcolor      |
+-----------+---------------+
| #7E4A88   | (055,246,985) |
+-----------+---------------+
1 row in set (0.00 sec)

MYSQL>SELECT str_random('Cc{3}c(4)') AS firstname ,str_random('Cc{5}c(6)') AS lastname
      ,str_random('Cc{5}[street|lane|road|park] d{1}d(2)') AS adres
      ,str_random('Cc{5}c(6)') AS city ,str_random('d{5}-d{4}') AS zip
      ,str_random('[0011|+1|+61|007] d{8}') AS phone
      ,str_random('[m|f]') AS sex;
+-----------+-----------+----------------+--------+------------+--------------+------+
| firstname | lastname  | adres          | city   | zip        | phone        | sex  |
+-----------+-----------+----------------+--------+------------+--------------+------+
| Svczwp    | Jazytysax | Lpobqastreet 9 | Kdmbwd | 09572-9971 | +61 23223828 | m    |
+-----------+-----------+----------------+--------+------------+--------------+------+
1 row in set (0.00 sec)

MYSQL>SELECT str_random('b') AS flag
      ,str_random('b{8}') AS Byte
      ,str_random('X{1}X(8)') AS Hex
      ,str_random('d{1}d(10) "kb"') AS filesize
      ,str_random('d{1}d(3).d{1}d(3).d{1}d(3).d{1}d(3)') AS IPaddress;
+------+----------+------+------------+---------------+
| flag | Byte     | Hex  | filesize   | IPaddress     |
+------+----------+------+------------+---------------+
| 1    | 01101111 | F1B8 | 5416832 kb | 20.37.702.821 |
+------+----------+------+------------+---------------+
1 row in set (0.00 sec)

You can download the code in the zip file here: str_random_2012-06-21.zip.
Please let me know if you like it, if you find any bugs or you have a request for added functionality or even if you have added new functionality yourself.

MySql: cleaning user input before storing the data

by Ronald Speelman

This article describes a single MySql function to sanitize strings.
The provided options are:

  • replace multiple spaces to a single space
  • remove Unix and Windows linebreaks and replace them with a space
  • remove tabs and replace them for a single space
  • remove multiple backslashes
  • completely remove htmlentities
  • replace htmlentities with the corresponding character
  • remove all HTML tags

All code is available in a zip file.

Where do you stop verifying and cleaning your data?
The question of where your data verification and cleaning should be done is too application-specific to have one right answer.
Only relying on client-side scripting is definitely the wrong answer.
Verifying the data for SQL injection, valid numbers and dates and formats should definitely be done on the server side before it is send to your database.
Cleaning user input on the server side should ideally also be done before the data is sent to the database.
As a DBA, the question rises if your can really rely on it? And if so, what about cleaning migrated data or cleaning up legacy data from a predecessor?

You can use this MySql function to clean up strings before inserting the string in a table.
Because this is a single function that can be used in different scenarios it has a lot of parameters to customize the output.
Feel free to remove some functionality from this function to limit the number of parameters.
Alternatively you can use a wrapper function to call this function without parameters. e.g.:

DELIMITER //
CREATE FUNCTION mycleanup (s TEXT)
RETURNS TEXT DETERMINISTIC
RETURN str_clean(s , FALSE , TRUE , FALSE , TRUE , TRUE , FALSE , FALSE);
//
DELIMITER ;
SELECT mycleanup(@text);

Install the function in your database:
The functionality to html-UnEncode a string is done by calling another function. The credits for this function goes to Charlie Armor.
You can find it here: http://forums.mysql.com/read.php?98,246527,246527 and I have included it in the zip file because my code relies on it.
Unzip the file and on the mysql prompt:

mysql>source fn_str_html_unencode.sql;
mysql>source fn_str_clean.sql;

This assumes both files are in the same directory where you have started the mysql client.

Using the function:
The function is called str_clean() and has eight parameters:

p_text              i         string : the text that needs to be cleaned
p_remove_multiple_blanks      Boolean: replace multiple spaces to a single space
p_remove_linebreaks           Boolean: remove Unix and Windows linebreaks and replace them with a space
p_remove_tabs                 Boolean: remove tabs and replace them for a single space
p_remove_multiple_backslashes Boolean: remove multiple backslashes
p_remove_htmlentities         Boolean: completely remove htmlentities
p_html_unencode               Boolean: replace htmlentities with the corresponding character
p_remove_html                 Boolean: remove all HTML tags

This is also documented in the code.

Now let’s clean up a string:

mysql&gt;SET @text := ' This  &amp;text \\\\contains double spaces         and tabs as well as <a href="www.google.com">tags</a> and
linebreaks. ';

mysql&gt;SELECT @text as originalText
 , str_clean(@text , FALSE , FALSE , FALSE , FALSE , FALSE , FALSE , FALSE) as onlyTrim
 , str_clean(@text , FALSE , TRUE  , FALSE , FALSE , FALSE , FALSE , FALSE) as removeLineBreaks
 , str_clean(@text , FALSE , TRUE  , TRUE  , FALSE , FALSE , FALSE , FALSE) as removeTabs
 , str_clean(@text , TRUE  , TRUE  , TRUE  , FALSE , FALSE , FALSE , FALSE) as removeMultiBlanks
 , str_clean(@text , TRUE  , TRUE  , TRUE  , TRUE  , FALSE , FALSE , FALSE) as removeMultipleBackSlashes
 , str_clean(@text , TRUE  , TRUE  , TRUE  , TRUE  , TRUE  , FALSE , FALSE) as removeHtmlEntities
 , str_clean(@text , TRUE  , TRUE  , TRUE  , TRUE  , FALSE , TRUE  , FALSE) as htmlUnEncode
 , str_clean(@text , TRUE  , TRUE  , TRUE  , TRUE  , TRUE  , TRUE  , TRUE)  as removeHtml
\G
*************************** 1. row ***************************
             originalText:  This  &amp;text \\contains double spaces         and tabs   as well as <a href="www.google.com">tags</a> and
linebreaks.
                 onlyTrim: This  &amp;text \\contains double spaces         and tabs    as well as <a href="www.google.com">tags</a> and
linebreaks.
         removeLineBreaks: This  &amp;text \\contains double spaces         and tabs    as well as <a href="www.google.com">tags</a> and linebreaks.
               removeTabs: This  &amp;text \\contains double spaces         and tabs as well as <a href="www.google.com">tags</a> and linebreaks.
        removeMultiBlanks: This &amp;text \\contains double spaces and tabs as well as <a href="www.google.com">tags</a> and linebreaks.
removeMultipleBackSlashes: This &amp;text \contains double spaces and tabs as well as <a href="www.google.com">tags</a> and linebreaks.
       removeHtmlEntities: This text \contains double spaces and tabs as well as <a href="www.google.com">tags</a> and linebreaks.
             htmlUnEncode: This &amp;text \contains double spaces and tabs as well as <a href="www.google.com">tags</a> and linebreaks.
               removeHtml: This text \contains double spaces and tabs as well as tags and linebreaks.
1 row in set (0.00 sec)

Note: The html is messing up the ampersand on line with the htmlUnencode.

You can download the functions in this zip file: str_clean_2012-06-02.zip

What kind of data cleanup do you use in your database application?