Life, the Internet, and Everything!

Cool preg_replace Solution

The situation: Read a product description from a product database.

The problem: The block of text for the product description in the comma-delimited file is not formatted, and in fact looks as if it was stripped of whatever formatting it previously had. Chunks of text are squished together, like “changes.Special” (no space after the period) and “Disc BrakeHeadset” (obviously missing a space between Brake and Headset).

I surmised that the missing spaces were previously line breaks (\n or <br>). If a scan/replace function had been used to replace the line break characters with an empty string, it would fit the puzzle perfectly.
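Here is a quick sketch of that hypothesis. It assumes the original text used \n for its line breaks; the sample string is made up, and only the two quoted fragments come from the real data.

```php
<?php
// Hypothetical original text; only the fragments "changes.Special" and
// "Disc BrakeHeadset" come from the actual product description.
$original = "changes.\nSpecial\nDisc Brake\nHeadset";

// Replacing every line break with an empty string glues the lines together.
$stripped = str_replace(array("\r\n", "\n", '<br>'), '', $original);

echo $stripped; // changes.SpecialDisc BrakeHeadset
```

The output shows the same kind of damage as the file: a period jammed against a capital letter, and two words run together.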

My first thought was to break the string up into a large array of individual characters, read through the array, and check the current character against the prior character and the next character to determine if there should be a new space or line break. This is where my mind went based on past experience with similar situations while coding in RPG. It should work great for punctuation, but the other stuff – we’ll see.

I set up the following as a PHP function:

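// Split the description into individual characters.
$array = str_split($string);
$fixed = array();
$punc = array('.', ':', ')');
foreach ($array as $key => $value) {
    // Punctuation wedged between two letters marks a lost line break.
    // isset() guards against reading past the last character.
    if ($key > 0 and isset($array[$key + 1]) and in_array($value, $punc)
        and ctype_alpha($array[$key - 1]) and ctype_alpha($array[$key + 1])) {
        $fixed[] = $value;
        $fixed[] = '<br><br>';
    } else {
        $fixed[] = $value;
    }
}
// Join the characters back into a single string.
$string = implode('', $fixed);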

It worked great – for punctuation. The line breaks made sense, but there was a bunch of text at the bottom of the description that the routine did not fix. I had to figure out a way to break the text when one word ran straight into the next with no punctuation in between.

Enter preg_replace

After googling various terms, I finally found a hint in a forum post. The example did not work, but it led me in the right direction. The final solution looked like this:

$string = preg_replace('/([a-z])([.:\)])([A-Z])/', '$1$2<br><br>$3', $string);
$string = preg_replace('/([a-z])([A-Z])/', '$1<br><br>$2', $string);

Here’s how it works. The first operation looks for punctuation isolated between a lower case character and an upper case character, and inserts the breaks between the punctuation ($2) and the upper case character ($3). The second operation looks for “nothing” in between a lower case and an upper case character, and inserts the breaks in that null space.
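To see both passes in action, here is a small worked example. The sample sentence is invented around the two fragments quoted earlier, so treat it as an illustration rather than the real product data.

```php
<?php
// Made-up sample built around the fragments quoted in this post.
$string = 'Adapts to changes.Special features: Disc BrakeHeadset';

// Pass 1: punctuation squeezed between a lower case and an upper case letter.
$string = preg_replace('/([a-z])([.:\)])([A-Z])/', '$1$2<br><br>$3', $string);

// Pass 2: a lower case letter directly followed by an upper case letter.
$string = preg_replace('/([a-z])([A-Z])/', '$1<br><br>$2', $string);

echo $string;
// Adapts to changes.<br><br>Special features: Disc Brake<br><br>Headset
```

Note that “features: Disc” is left alone because a space already follows the colon. One thing to watch: the second pattern fires on any lower case letter followed by an upper case one, so a name like “McDonald” would be split as well.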

It works like a charm. At this point it handles most of the situations in this particular case. Others may arise, but I’m thinking preg_replace will handle those as well.
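If more cases do turn up, one possible variation (not what I used above, just a sketch) is to fold both rules into a single pattern with a lookbehind and a lookahead, so new punctuation only has to be added in one place. The $sample, $pattern, and $result names are made up for this example.

```php
<?php
// Made-up sample; only the fragments come from the product description.
$sample = 'changes.SpecialDisc BrakeHeadset';

// Zero-width match: the spot just before an upper case letter that follows
// either a lower case letter plus one of . : ) or a lower case letter alone.
$pattern = '/(?<=[a-z][.:)]|[a-z])(?=[A-Z])/';

// Nothing is captured, so the replacement is just the break itself.
$result = preg_replace($pattern, '<br><br>', $sample);

echo $result; // changes.<br><br>Special<br><br>Disc Brake<br><br>Headset
```

Because the match is zero-width, no characters are consumed and no $1/$2 back-references are needed.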

Pretty cool.
