Strip HTML and WORD formatting

**Nick1973** · 15 Apr 2016, 12:32 PM

Does anybody know of a reliable way to strip out redundant html tags, inline styles and WORD formatting from product descriptions?

For example:

MY CONTENT

would become:

MY CONTENT

or

MY CONTENT



would become

MY CONTENT

or

 MY CONTENT

would become

 MY CONTENT

or



<UL style="MARGIN-TOP: 0cm" type=circle><li class=MsoNormal style='mso-list:l3 level1 lfo3;tab-stops:list 36.0pt'>
<o:p>MY CONTENT</o:p></li>

would become

MY CONTENT

I found this but I'm not sure how to implement it. I've tried both and neither appears to work, but then it could be something I am doing wrong. I would like to target the '.mainDescription' content in product descriptions.

This link explains it: http://tim.mackey.ie/CommentView,gui...6602d5718.aspx

and also this code:

function cleanHTML(input) {
  // 1. Replace line breaks and Word "Mso" class attributes with a space
  var stringStripper = /(\n|\r| class=(")?Mso[a-zA-Z]+(")?)/g;
  var output = input.replace(stringStripper, ' ');
  // 2. Strip Word-generated HTML comments
  //    (pattern assumed from the step description; it was missing from the posted code)
  var commentStripper = new RegExp('<!--(.*?)-->', 'g');
  output = output.replace(commentStripper, '');
  // 3. Remove these tags but leave their content, if any
  var tagStripper = new RegExp('<(/)*(meta|link|span|\\?xml:|st1:|o:|font)(.*?)>', 'gi');
  output = output.replace(tagStripper, '');
  // 4. Remove these tags together with everything in between them
  var badTags = ['style', 'script', 'applet', 'embed', 'noframes', 'noscript'];
  for (var i = 0; i < badTags.length; i++) {
    tagStripper = new RegExp('<' + badTags[i] + '.*?' + badTags[i] + '(.*?)>', 'gi');
    output = output.replace(tagStripper, '');
  }
  // 5. Remove attributes such as ' style="..."'
  var badAttributes = ['style', 'start'];
  for (var j = 0; j < badAttributes.length; j++) {
    var attributeStripper = new RegExp(' ' + badAttributes[j] + '="(.*?)"', 'gi');
    output = output.replace(attributeStripper, '');
  }
  return output;
}
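
As a rough illustration of how a cleaner like this might be pointed at '.mainDescription': the function only transforms a string, so something still has to read each element's innerHTML, clean it, and write it back. The sketch below assumes a condensed stand-in cleaner (`stripWordMarkup`) and a helper (`cleanElements`); both names are hypothetical and come from neither Zen Cart nor the linked article. It only changes what the visitor's browser displays; the stored descriptions stay as they are.

```javascript
// Hypothetical condensed cleaner, in the spirit of cleanHTML above.
function stripWordMarkup(html) {
  return html
    // Word's HTML comments (conditional comments, embedded XML)
    .replace(/<!--[\s\S]*?-->/g, '')
    // wrapper tags: remove the tag itself, keep whatever it encloses
    .replace(/<\/?(?:span|font|o:\w+|st1:\w+)\b[^>]*>/gi, '')
    // class attributes that Word adds (MsoNormal, MsoListParagraph, ...)
    .replace(/\s+class=(?:"Mso\w*"|'Mso\w*'|Mso\w*)/gi, '')
    // quoted inline style attributes
    .replace(/\s+style=(?:"[^"]*"|'[^']*')/gi, '')
    .trim();
}

// Apply a cleaner to anything array-like whose items expose innerHTML.
// Returns how many items were actually changed.
function cleanElements(elements, cleaner) {
  let changed = 0;
  for (const el of Array.from(elements)) {
    const cleaned = cleaner(el.innerHTML);
    if (cleaned !== el.innerHTML) {
      el.innerHTML = cleaned;
      changed += 1;
    }
  }
  return changed;
}

// In a browser, run once the page has loaded. Skipped where no DOM exists.
if (typeof document !== 'undefined') {
  document.addEventListener('DOMContentLoaded', function () {
    cleanElements(document.querySelectorAll('.mainDescription'), stripWordMarkup);
  });
}
```

Because the regexes look for attribute-like text anywhere in the string, a description that literally contains something like style="..." in its visible text would be altered too, so this is best treated as a starting point rather than a finished filter.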

**DrByte** · 15 Apr 2016, 03:58 PM

I believe the CKEditor plugin has a button to "Paste from Word" which does this cleanup for you.

**Nick1973** · 15 Apr 2016, 04:10 PM

OK, I realise that. However, this is a database of hundreds, probably a thousand, products, so I would rather skip editing a thousand products if I can, which is why I am looking towards jQuery or JavaScript, or even PHP.

**Nick1973** · 15 Apr 2016, 04:13 PM

Do you think it could be done with this?

http://php.net/manual/en/function.strip-tags.php

added to

<?php echo stripslashes($products_description); ?>

in tpl_product_info_display.php

**DrByte** · 15 Apr 2016, 05:23 PM

No. Running strip_tags would remove ALL your formatting, including intentional formatting.

Sounds like the problem is in your original data, not a Zen Cart issue. So, better to go back to the source of your original data. Clean that up, then re-import.
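
To illustrate the point about strip_tags: removing every tag also removes the paragraphs, lists and bold text that were put there on purpose. PHP's strip_tags does accept a second argument listing tags to keep, but the kept tags retain their attributes, so inline styles and Mso classes would survive. Below is a minimal sketch of an allow-list approach that also drops attributes, written in JavaScript to match the code posted earlier. `stripTagsExcept` is a hypothetical name, and the regex is a simplification that will not cope with a '>' inside an attribute value.

```javascript
// Remove every tag except those named in `allowed`.
// Tags that are kept are re-emitted without any attributes.
function stripTagsExcept(html, allowed) {
  const keep = new Set(allowed.map(function (tag) { return tag.toLowerCase(); }));
  return html.replace(/<(\/?)([a-z][a-z0-9:]*)\b[^>]*>/gi, function (match, slash, name) {
    const lower = name.toLowerCase();
    return keep.has(lower) ? '<' + slash + lower + '>' : '';
  });
}

const sample = '<p class="MsoNormal"><span style="color:red"><b>MY</b> CONTENT</span><o:p></o:p></p>';

// Nothing allowed: all formatting is lost, intentional or not.
console.log(stripTagsExcept(sample, []));
// → MY CONTENT

// Allow-list: intentional structure survives, Word wrappers and attributes go.
console.log(stripTagsExcept(sample, ['p', 'b', 'ul', 'li']));
// → <p><b>MY</b> CONTENT</p>
```

Dropping all attributes would also remove href from links and src from images, so tags like those would need separate handling.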

**soxophoneplayer** · 15 Apr 2016, 05:39 PM

Originally Posted by Nick1973

Does anybody know of a reliable way to strip out redundant html tags, inline styles and WORD formatting from product descriptions?

...

This is a real hack and probably wrong-headed in many ways, but I had a similar issue once and did the following to do some cleanup:

To remove extraneous formatting from all products, I exported a copy of the db, opened it in Notepad++, went down to the bottom of the db where the product descriptions are, did a 'find' on the offending formatting and a 'replace' with nothing. I took a (big) chance and did a 'replace all'. Things like 'find ' and 'replace with ' also worked. When done, I saved that db, dropped the original db in phpMyAdmin (after backing up), then imported the changed db. It worked for me. I found a few products that needed tweaking, but this saved a huge amount of time overall.

I'm not recommending this approach. Only noting it's how I muddled through a similar problem.
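
The same find-and-replace idea can be made a little less of a leap of faith by counting matches before replacing anything. This is a minimal sketch, assuming the descriptions have been loaded into an array of strings; the rules and the function names (`previewReplacements`, `applyReplacements`) are illustrative and not taken from any tool mentioned here.

```javascript
// Each rule pairs a pattern to find with the text to put in its place.
// These two rules are examples only; real ones depend on the data.
const rules = [
  { find: /<\/?o:p>/gi, replaceWith: '' },
  { find: /\s+class="?MsoNormal"?/gi, replaceWith: '' }
];

// Dry run: count how often each rule matches, without changing anything.
function previewReplacements(texts, ruleList) {
  return ruleList.map(function (rule) {
    let hits = 0;
    for (const text of texts) {
      const found = text.match(rule.find);
      hits += found ? found.length : 0;
    }
    return { pattern: String(rule.find), hits: hits };
  });
}

// The "replace all" step: apply every rule, in order, to every text.
function applyReplacements(texts, ruleList) {
  return texts.map(function (text) {
    return ruleList.reduce(function (out, rule) {
      return out.replace(rule.find, rule.replaceWith);
    }, text);
  });
}
```

Checking the hit counts against what you expect to find is a cheap sanity test before committing to the replacement, and the original array is left untouched so the two versions can be compared afterwards.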

**Nick1973** · 15 Apr 2016, 06:01 PM

Ok Dr Byte, however this is also a common issue when clients try to copy and paste data direct from MS Word. Regardless of how much you try to train them not to, they will always do something different. I'm fully aware the problem exists in the original data, but it would take me an age to go through every single product and strip out the code by searching and deleting. It's not a workable option. And neither is copying and pasting the text for each product through the CKEditor plugin button "Paste from Word". I could be at it for weeks, probably very thin from malnutrition, and on the verge of insanity.

The idea is to eventually override any inline styles with CSS too.

Both javascript and jquery do the tasks I am after, however I am not sure how they should be implemented.

**DrByte** · 15 Apr 2016, 06:12 PM

Originally Posted by Nick1973

Ok Dr Byte, however this is also a common issue when clients try to copy and paste data direct from MS Word. Regardless of how much you try to train them not to, they will always do something different.

Not denying that.

Originally Posted by Nick1973

The idea is to eventually override any inline styles with CSS too.

Ideal for sure.

Originally Posted by Nick1973

Both javascript and jquery do the tasks I am after, however I am not sure how they should be implemented.

I'm not sure how javascript and jquery can be used to clean up the data in your database. Are you planning to write a script to loop through every record, from a jquery task running in your browser?

Please expand more on what you envision here ...

**Nick1973** · 15 Apr 2016, 06:25 PM

The scripts already exist. I'm not after a script to rewrite/clean the entire database. I can most likely override spans with CSS, that is fine and I am aware of how that can be done.

The scripts I posted earlier in this conversation are third-party scripts which are supposed to seek out certain HTML/Word elements and disable them while keeping everything in between those elements. However, I could not get these to work with Zen Cart. They do not rewrite the content, just strip away/disable certain tags.

**DrByte** · 15 Apr 2016, 06:51 PM

Okay. So you've got some scripts. The stuff you posted earlier is javascript.

You said you don't want to clean the entire database.

But you said you don't want to edit each product manually.
So that means you *do* want to clean all of them in some automated way.

But javascript code such as you've posted is designed to run in the browser. And the browser has no access to the database.
So you need a script to pull each record from the database individually, put it in your browser, and then perform some sort of cleaning on the data, and then send the data back to your store's admin to save the changes back to the database.

Am I correct that you're asking one of us to provide all of that for you?
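
The middle step of that round trip, the cleaning itself, could look something like this once the records are in hand. This is only a sketch: it assumes each record is an object with `products_id` and `products_description` fields (an assumption about the column names), `removeOfficeTags` is a hypothetical stand-in cleaner, and fetching the rows and saving the updates would still need server-side code that is not shown.

```javascript
// Hypothetical stand-in cleaner: drops Word's <o:...> and <span> tags
// but keeps whatever text they enclose.
function removeOfficeTags(html) {
  return html.replace(/<\/?(?:o:\w+|span)\b[^>]*>/gi, '');
}

// Run the cleaner over every record and collect only the rows that changed,
// so untouched products are never written back.
function planUpdates(records, cleaner) {
  const updates = [];
  for (const record of records) {
    const cleaned = cleaner(record.products_description);
    if (cleaned !== record.products_description) {
      updates.push({
        products_id: record.products_id,
        products_description: cleaned
      });
    }
  }
  return updates;
}

// Example data standing in for rows pulled from the database.
const rows = [
  { products_id: 1, products_description: '<p><span lang="EN-GB">MY CONTENT</span><o:p></o:p></p>' },
  { products_id: 2, products_description: '<p>Already clean</p>' }
];

// Logs a single update, for products_id 1, with description '<p>MY CONTENT</p>'.
console.log(planUpdates(rows, removeOfficeTags));
```

Keeping the list of planned updates separate from the act of saving them makes it possible to review the changes, or test them on a copy of the data, before anything is written.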
