Tags

python json dataframe python-polars

Answers (1)

January 7, 2026 Score: 0 Rep: 13,594 Quality: Medium Completeness: 60%

So, I think the key question is:

How are these data files created, and can the issue with that codepoint be resolved there?

I strongly suspect that an MS Office product is involved at some point, as what you see is the classic MS smart right quote.

Just to demonstrate it is there, let's take your observed hex dump:

hexdump = """5B 0A 20 20 7B 0A 20 20 20 20 22 71 75 65 73 74 69 6F 6E 49 64 22 3A 20 39 32 34 36 37 2C 0A 20 20 20 20 22 71 75 65 73 74 69 6F 6E 22 3A 20 22 49 2019 6D 20 73 6F 72 72 79 20 74 6F 20 68 65 61 72 20 74 68 61 74 20 79 6F 75 20 77 65 72 65 20 6E 6F 74 20 73 61 74 69 73 66 69 65 64 20 77 69 74 68 20 79 6F 75 72 20 5B 5B 45 4E 54 45 52 20 50 52 4F 44 55 43 54 20 4E 41 4D 45 20 48 45 52 45 5D 5D 20 2E 20 20 43 61 6E 20 79 6F 75 20 70 72 6F 76 69 64 65 20 75 73 20 73 6F 6D 65 20 64 65 74 61 69 6C 73 20 61 62 6F 75 74 20 77 68 79 20 79 6F 75 20 77 65 72 65 20 6E 6F 74 20 73 61 74 69 73 66 69 65 64 3F 22 0A 20 20 7D 0A 5D"""
stringresult = ''.join(chr(int(h, 16)) for h in hexdump.split())
print(stringresult)

This should result in:

[
  {
    "questionId": 92467,
    "question": "I’m sorry to hear that you were not satisfied with your [[ENTER PRODUCT NAME HERE]] .  Can you provide us some details about why you were not satisfied?"
  }
]
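To confirm which character is actually in the string rather than trusting how the terminal renders it, something like the following can help. It is only a sketch: `describe` and `sample` are names I made up for illustration, and `sample` reproduces just the first few characters of the question text.

```python
import unicodedata

# The same fragment as in the hex dump: "I", U+2019, "m".
sample = "I\u2019m"

def describe(text):
    """Return (codepoint, Unicode name) pairs for every non-ASCII character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]

print(describe(sample))
# [('U+2019', 'RIGHT SINGLE QUOTATION MARK')]
```

Running `describe()` over the whole file contents should list every character that might need attention.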

If you run that through the json package we can see where your \u2019 might come from:

import json
stringresult2 = json.dumps(json.loads(stringresult), indent=4)
print(stringresult2)

This prints the following. By default json.dumps() escapes every non-ASCII character, which is where the \u2019 comes from:

[
    {
        "questionId": 92467,
        "question": "I\u2019m sorry to hear that you were not satisfied with your [[ENTER PRODUCT NAME HERE]] .  Can you provide us some details about why you were not satisfied?"
    }
]
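The escaped form and the literal character are the same data. As a small illustration (the shortened `data` value is made up for the example), `ensure_ascii=False` tells `json.dumps()` to keep the character as-is:

```python
import json

# Shortened stand-in for the record in the question.
data = [{"questionId": 92467, "question": "I\u2019m sorry"}]

escaped = json.dumps(data)                      # default: ensure_ascii=True
literal = json.dumps(data, ensure_ascii=False)  # keeps U+2019 as a literal character

print(escaped)
print(literal)

# Both forms decode back to the same Python objects.
assert json.loads(escaped) == json.loads(literal) == data
```

So a `\u2019` in serialized output is not corruption by itself; it is just one of two valid spellings of the same character.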

The issue is how to get rid of these "smart" characters. The right way, in my opinion, is to prevent them in the first place, but here we are. So let's handle them by converting them into their "dumb" equivalents. Note that since I'm uncertain how this file is produced, this normalizesmartquotes() function might be overkill or even insufficient.

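def normalizesmartquotes(text):
    # multi-byte wonkiness: UTF-8 smart quotes that were decoded as Windows-1252
    mojibakefixes = {
        'â€™': "'",     # mangled right single (U+2019)
        'â€œ': '"',     # mangled left double (U+201C)
        'â€\x9d': '"',  # mangled right double (U+201D)
        'â€˜': "'",     # mangled left single (U+2018)
    }
    for smart, dumb in mojibakefixes.items():
        text = text.replace(smart, dumb)

    # properly decoded smart quotes
    translation = str.maketrans({
        '\u2018': "'",  # left single
        '\u2019': "'",  # right single (apostrophe)
        '\u201C': '"',  # left double
        '\u201D': '"',  # right double
    })
    return text.translate(translation)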

Now we can use it:

stringresult3 = normalizesmartquotes(stringresult)
print(stringresult3)

giving us:

[
  {
    "questionId": 92467,
    "question": "I'm sorry to hear that you were not satisfied with your [[ENTER PRODUCT NAME HERE]] .  Can you provide us some details about why you were not satisfied?"
  }
]

Finally let's circle back and load that into a polars dataframe:

import io
import polars

with open("data.json", "r") as filein:
    cleaned = normalizesmartquotes(filein.read())

df = polars.read_json(io.StringIO(cleaned))
print(df)

That should (with luck) give you:

┌────────────┬─────────────────────────────────┐
│ questionId ┆ question                        │
│ ---        ┆ ---                             │
│ i64        ┆ str                             │
╞════════════╪═════════════════════════════════╡
│ 92467      ┆ I'm sorry to hear that you wer… │
└────────────┴─────────────────────────────────┘
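One caveat with cleaning the raw text before parsing: if a smart double quote ever appears inside a string value, turning it into a plain `"` produces invalid JSON. A possible alternative, sketched below, is to parse first and normalize the decoded strings afterwards. The names `QUOTES`, `normalize_parsed` and the sample `raw` record are made up for this illustration.

```python
import json

# Same four replacements as the translation table in the function above.
QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201C": '"', "\u201D": '"',
})

def normalize_parsed(value):
    """Recursively normalize smart quotes in string values (dict keys are left alone)."""
    if isinstance(value, str):
        return value.translate(QUOTES)
    if isinstance(value, list):
        return [normalize_parsed(v) for v in value]
    if isinstance(value, dict):
        return {k: normalize_parsed(v) for k, v in value.items()}
    return value

# Hypothetical record with smart double quotes inside the value.
raw = '[{"questionId": 1, "question": "She said \u201Chi\u201D"}]'

# Replacing on the raw text breaks the JSON ...
try:
    json.loads(raw.translate(QUOTES))
    text_level_ok = True
except json.JSONDecodeError:
    text_level_ok = False

# ... while replacing after parsing does not.
rows = normalize_parsed(json.loads(raw))
print(text_level_ok, rows)
```

The resulting list of dicts could then be handed to `polars.DataFrame(rows)` instead of going through `read_json`.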

If in the end you wanted to actually preserve the smart quotes, then converting these mangled multi-byte sequences into their proper Unicode equivalents might be done via:

def fixsmartquotes(text):
    mojibakefixes = {
        'â€™': '\u2019',     # right single / apostrophe
        'â€˜': '\u2018',     # left single
        'â€œ': '\u201C',     # left double
        'â€\x9d': '\u201D',  # right double
    }
    for mangled, restored in mojibakefixes.items():
        text = text.replace(mangled, restored)
    return text
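A lookup table like this only covers the sequences listed in it. If the damage really is UTF-8 text that was decoded as Windows-1252 (an assumption on my part, I can't confirm it from the question), a more general sketch is to reverse that decoding. `repair_mojibake` is a hypothetical helper, not part of any library, and it gives up and returns the input unchanged when the round trip fails.

```python
def repair_mojibake(text):
    """Try to undo 'UTF-8 bytes decoded as Windows-1252'; return text unchanged on failure."""
    raw = bytearray()
    try:
        for ch in text:
            try:
                raw += ch.encode("cp1252")
            except UnicodeEncodeError:
                # Python's cp1252 codec leaves a few bytes (such as 0x9D) undefined;
                # those show up as C1 control characters, which latin-1 can encode.
                raw += ch.encode("latin-1")
        return raw.decode("utf-8")
    except UnicodeError:
        # Not reversible as a whole, e.g. the text was already decoded correctly.
        return text

mangled = "I\u00e2\u20ac\u2122m sorry"  # displays as: Iâ€™m sorry
print(repair_mojibake(mangled))
```

Because it works on the whole string at once, this will not repair text that mixes mangled and correctly decoded characters; for that case the explicit table is the safer choice.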

When I use it like:

with open("data.json", "r") as filein:
    cleaned = fixsmartquotes(filein.read())

df = polars.read_json(io.StringIO(cleaned))
print(df)

I get:

┌────────────┬─────────────────────────────────┐
│ questionId ┆ question                        │
│ ---        ┆ ---                             │
│ i64        ┆ str                             │
╞════════════╪═════════════════════════════════╡
│ 92467      ┆ I’m sorry to hear that you wer… │
└────────────┴─────────────────────────────────┘
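One possible source of the mangled sequences, which I can't verify for your setup, is the open() call itself: without an explicit encoding it uses a platform-dependent default, and on some systems that is Windows-1252 rather than UTF-8. The sketch below writes a throwaway UTF-8 file (the path and contents are made up) and reads it back both ways to show the effect.

```python
import os
import tempfile

text = "I\u2019m sorry"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.json")  # throwaway file for the demo

    # Write the text as UTF-8, as most tools producing JSON would.
    with open(path, "w", encoding="utf-8") as fileout:
        fileout.write(text)

    # Reading with the wrong encoding produces the mangled sequence ...
    with open(path, "r", encoding="cp1252") as filein:
        wrong = filein.read()

    # ... while reading with the matching encoding does not.
    with open(path, "r", encoding="utf-8") as filein:
        right = filein.read()

print(repr(wrong))
print(repr(right))
```

If that turns out to be the cause, passing `encoding="utf-8"` to open() would avoid the need for a repair step at all.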