MM03: fix_csv.py

Catullus RabbleRouser
This week we're going to normalize CSV files by writing a program, fix_csv.py, that turns a pipe-delimited file into a comma-delimited file. I'll explain how it should work by example.

Your original file will look like this:
Reading|Make|Model|Type|Value
Reading 0|Toyota|Previa|distance|19.83942
Reading 1|Dodge|Intrepid|distance|31.28257
You'll then run your script by typing this at the command line:
$ python fix_csv.py cars-original.csv cars.csv
Note: "$" is not typed; it is simply the beginning of the prompt.

Your fixed file should then look like this:
Reading,Make,Model,Type,Value
Reading 0,Toyota,Previa,distance,19.83942
Reading 1,Dodge,Intrepid,distance,31.28257
Note that it's valid for a comma to be in your input data, but any data cell that contains a comma must be surrounded by double quotes when you write your output file.

It's also valid for a quote character to be in your input (you'll need to double up quotes in the output because that's how CSV escaping works).
https://stackoverflow.com/questions/17808511/properly-escape-a-double-quote-in-csv
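
As a rough sketch of the two quoting rules above (not the official solution), Python's csv module applies both of them on its own when writing. The helper name pipe_to_csv, the in-memory io.StringIO buffers and the "\n" line terminator are assumptions made for illustration; a real fix_csv.py would read and write the files named on the command line.

```python
import csv
import io


def pipe_to_csv(text):
    """Convert pipe-delimited text to comma-delimited text (hypothetical helper)."""
    # Read rows split on "|" instead of the default ",".
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="|")
    out = io.StringIO(newline="")
    # The default writer quotes only the cells that need it: cells with a
    # comma get wrapped in double quotes, and embedded quotes are doubled.
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


sample = '22|David Allan Coe|Willie, Waylon, And Me\n1|say "hi"|x\n'
print(pipe_to_csv(sample), end="")
# 22,David Allan Coe,"Willie, Waylon, And Me"
# 1,"say ""hi""",x
```

When working with real files, the csv documentation recommends opening them with newline="" so the module can manage line endings itself.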

See the hints if you need help working with CSV files in Python.

Bonus 1
For the first bonus, I want you to allow the input delimiter and quote character (" by default) to be optionally specified. ✔️

For example, any of these should work (all specify the input delimiter as pipe, and the last two additionally specify the quote character as single quote):
$ python fix_csv.py --in-delimiter="|" cars.csv cars-fixed.csv
$ python fix_csv.py cars.csv cars-fixed.csv --in-delimiter="|"
$ python fix_csv.py --in-delimiter="|" --in-quote="'" cars.csv cars-fixed.csv
$ python fix_csv.py --in-quote="'" --in-delimiter="|" cars.csv cars-fixed.csv
This bonus will require looking into parsing command-line arguments with Python. There are some standard library modules that can help you out with this. There are 3 different solutions in the standard library actually, but only one I'd recommend.

Also note that you're going to need Python's built-in CSV parsing to read your files for this bonus (or else you'll re-implement quite a bit of CSV-parsing code that's baked into Python).
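
Here is a minimal sketch of the option parsing, assuming argparse is the module you pick (it's the one the hints link to). The function name parse_args, the attribute names delimiter and quotechar, and the pipe default are assumptions for illustration, not part of the exercise.

```python
from argparse import ArgumentParser


def parse_args(argv=None):
    """Parse fix_csv.py-style arguments (hypothetical helper)."""
    parser = ArgumentParser(description="Convert a delimited file to CSV.")
    parser.add_argument("old_file")
    parser.add_argument("new_file")
    # Assumed defaults: pipe delimiter (as in the base exercise) and
    # double quote (the default quote character named in Bonus 1).
    parser.add_argument("--in-delimiter", dest="delimiter", default="|")
    parser.add_argument("--in-quote", dest="quotechar", default='"')
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)


args = parse_args(["--in-delimiter=|", "--in-quote='", "cars.csv", "cars-fixed.csv"])
print(args.old_file, args.new_file, args.delimiter, args.quotechar)
```

The parsed values can then be handed to csv.reader as its delimiter and quotechar keyword arguments. With this setup argparse accepts the options before or after the filenames, and it exits with an error if a third filename is given.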

Bonus 2
For the second bonus, try to automatically detect the delimiter if an in-delimiter value isn't supplied (don't assume the input uses pipe and double quote; figure it out). ✔️

This second bonus is a bit trickier and I don't expect it to work correctly for all files. Don't be afraid to check the hints for this one.
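
One way to approach the detection is csv.Sniffer, which guesses a dialect from a sample of the text. This is only a sketch: the helper name convert_autodetect and the in-memory buffers are assumptions, and Sniffer is a heuristic, so it can guess wrong or raise csv.Error on input it can't make sense of.

```python
import csv
import io


def convert_autodetect(text):
    """Guess the input format, then rewrite it as CSV (hypothetical helper)."""
    # sniff() returns a Dialect subclass describing the guessed
    # delimiter and quote character.
    dialect = csv.Sniffer().sniff(text)
    reader = csv.reader(io.StringIO(text, newline=""), dialect)
    out = io.StringIO(newline="")
    csv.writer(out, lineterminator="\n").writerows(reader)
    return out.getvalue()


sample = (
    '"11"\t"Johnny Cash"\t"Folsom Prison Blues"\t"2:51"\n'
    '"21"\t"Hank Williams III"\t"Mississippi Mud"\t"3:32"\n'
    '"22"\t"David Allan Coe"\t"Willie, Waylon, And Me"\t"3:24"\n'
)
print(convert_autodetect(sample), end="")
```

In a full program you would only call sniff() when no --in-delimiter was given, and otherwise use the values from the command line.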

 
I agree with Cat 100%.

Comments

  • Catullus RabbleRouser
    Hints
    How to access command-line arguments
    https://stackoverflow.com/questions/4033723/how-do-i-access-command-line-arguments-in-python/35421024#35421024

    Reading and writing CSV files in Python
    https://pymotw.com/3/csv/index.html

    Short video on working with CSV files in Python

    Restricting the number of command-line arguments
    https://treyhunner.com/2018/03/tuple-unpacking-improves-python-code-readability/#Multiple_assignment_is_very_strict

    Parsing command-line arguments more robustly
    http://zetcode.com/python/argparse/

    Automatically detecting the type of a CSV file
    https://docs.python.org/3/library/csv.html#csv.Sniffer
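
For the "Restricting the number of command-line arguments" hint, here is a small sketch of the tuple-unpacking idea for the base exercise, without any option parsing. The function name parse_paths is made up for illustration; in a real script you'd pass it sys.argv.

```python
def parse_paths(argv):
    """Return (old_path, new_path) from an argv-style list (hypothetical helper)."""
    # Multiple assignment is strict: this raises ValueError unless argv
    # holds exactly the script name plus two file paths.
    _script, old_path, new_path = argv
    return old_path, new_path


print(parse_paths(["fix_csv.py", "cars-original.csv", "cars.csv"]))
# ('cars-original.csv', 'cars.csv')
```

Calling it with a third filename fails with a ValueError, which is the strictness the hint is pointing at.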

  • Catullus RabbleRouser
    Tests
    Automated tests for this week's exercise can be found here. You'll need to write your code in a module named fix_csv.py next to the test file. To run the tests you'll run "python test_fix_csv.py" and check the output for "OK". You'll see that there are some "expected failures" (or "unexpected successes" maybe). If you'd like to do the bonus, you'll want to comment out the noted lines of code in the tests file to test them properly.
    from contextlib import contextmanager, redirect_stderr, redirect_stdout
    from io import StringIO
    from importlib.machinery import SourceFileLoader
    import os
    import sys
    import warnings
    import shlex
    from textwrap import dedent
    from tempfile import NamedTemporaryFile
    import unittest
    
    
    class FixCSVTests(unittest.TestCase):
    
        """Tests for fix_csv.py"""
    
        maxDiff = None
    
        def test_pipe_file_to_csv_file(self):
            old_contents = dedent("""
                2012|Lexus|LFA
                2009|GMC|Yukon XL 1500
                1965|Ford|Mustang
                2005|Hyundai|Sonata
                1995|Mercedes-Benz|C-Class
            """).lstrip()
            expected = dedent("""
                2012,Lexus,LFA
                2009,GMC,Yukon XL 1500
                1965,Ford,Mustang
                2005,Hyundai,Sonata
                1995,Mercedes-Benz,C-Class
            """).lstrip()
            with make_file(old_contents) as old, make_file("") as new:
                output = run_program(f'fix_csv.py {old} {new}')
                with open(new) as new_file:
                    new_contents = new_file.read()
            self.assertEqual(expected, new_contents)
            self.assertEqual("", output)
    
        def test_delimiter_in_output(self):
            old_contents = dedent("""
                02|Waylon Jennings|Honky Tonk Heroes (Like Me)
                04|Kris Kristofferson|To Beat The Devil
                11|Johnny Cash|Folsom Prison Blues
                13|Billy Joe Shaver|Low Down Freedom
                21|Hank Williams III|Mississippi Mud
                22|David Allan Coe|Willie, Waylon, And Me
                24|Bob Dylan|House Of The Risin' Sun
            """).lstrip()
            expected = dedent("""
                02,Waylon Jennings,Honky Tonk Heroes (Like Me)
                04,Kris Kristofferson,To Beat The Devil
                11,Johnny Cash,Folsom Prison Blues
                13,Billy Joe Shaver,Low Down Freedom
                21,Hank Williams III,Mississippi Mud
                22,David Allan Coe,"Willie, Waylon, And Me"
                24,Bob Dylan,House Of The Risin' Sun
            """).lstrip()
            with make_file(old_contents) as old, make_file("") as new:
                output = run_program(f'fix_csv.py {old} {new}')
                with open(new) as new_file:
                    new_contents = new_file.read()
            self.assertEqual(expected, new_contents)
            self.assertEqual("", output)
    
        def test_original_file_is_unchanged(self):
            old_contents = dedent("""
                2012|Lexus|LFA
                2009|GMC|Yukon XL 1500
            """).lstrip()
            with make_file(old_contents) as old, make_file("") as new:
                run_program(f'fix_csv.py {old} {new}')
                with open(old) as old_file:
                    contents = old_file.read()
            self.assertEqual(old_contents, contents)
    
        def test_call_with_too_many_files(self):
            with make_file("") as old, make_file("") as new:
                with self.assertRaises(BaseException):
                    run_program(f'fix_csv.py {old} {new} {old}')
    
        # To test the Bonus part of this exercise, comment out the following line
        @unittest.expectedFailure
        def test_in_delimiter_and_in_quote(self):
            old_contents = dedent("""
                2012 Lexus "LFA"
                2009 GMC 'Yukon XL 1500'
                1995 "Mercedes-Benz" C-Class
            """).lstrip()
            expected1 = dedent("""
                2012,Lexus,LFA
                2009,GMC,'Yukon,XL,1500'
                1995,Mercedes-Benz,C-Class
            """).lstrip()
            expected2 = dedent('''
                2012,Lexus,"""LFA"""
                2009,GMC,Yukon XL 1500
                1995,"""Mercedes-Benz""",C-Class
            ''').lstrip()
            with make_file(old_contents) as old, make_file("") as new:
                run_program(f'fix_csv.py {old} {new} --in-delimiter=" "')
                with open(new) as new_file:
                    self.assertEqual(expected1, new_file.read())
                run_program(
                    f'''fix_csv.py --in-delimiter=" " --in-quote="'" {old} {new}'''
                )
                with open(new) as new_file:
                    self.assertEqual(expected2, new_file.read())
    
        # To test the Bonus part of this exercise, comment out the following line
        @unittest.expectedFailure
        def test_autodetect_input_format(self):
            contents1 = dedent("""
                '2012' 'Lexus' 'LFA'
                '2009' 'GMC' 'Yukon XL 1500'
                '1995' 'Mercedes-Benz' 'C-Class'
            """).lstrip()
            expected1 = dedent("""
                2012,Lexus,LFA
                2009,GMC,Yukon XL 1500
                1995,Mercedes-Benz,C-Class
            """).lstrip()
            with make_file(contents1) as old, make_file("") as new:
                run_program(f'fix_csv.py {old} {new}')
                with open(new) as new_file:
                    self.assertEqual(expected1, new_file.read())
            contents2 = dedent("""
                "02"\t"Waylon Jennings"\t"Honky Tonk Heroes (Like Me)"\t"3:29"
                "04"\t"Kris Kristofferson"\t"To Beat The Devil"\t"4:05"
                "11"\t"Johnny Cash"\t"Folsom Prison Blues"\t"2:51"
                "13"\t"Billy Joe Shaver"\t"Low Down Freedom"\t"2:53"
                "21"\t"Hank Williams III"\t"Mississippi Mud"\t"3:32"
                "22"\t"David Allan Coe"\t"Willie, Waylon, And Me"\t"3:24"
                "24"\t"Bob Dylan"\t"House Of The Risin' Sun"\t"5:20"
            """).lstrip()
            expected2 = dedent("""
                02,Waylon Jennings,Honky Tonk Heroes (Like Me),3:29
                04,Kris Kristofferson,To Beat The Devil,4:05
                11,Johnny Cash,Folsom Prison Blues,2:51
                13,Billy Joe Shaver,Low Down Freedom,2:53
                21,Hank Williams III,Mississippi Mud,3:32
                22,David Allan Coe,"Willie, Waylon, And Me",3:24
                24,Bob Dylan,House Of The Risin' Sun,5:20
            """).lstrip()
            with make_file(contents2) as old, make_file("") as new:
                run_program(f'fix_csv.py {old} {new}')
                with open(new) as new_file:
                    self.assertEqual(expected2, new_file.read())
    
    
    class DummyException(Exception):
        """No code will ever raise this exception."""
    
    
    def run_program(arguments="", raises=DummyException):
        """
        Run program at given path with given arguments.
    
        If raises is specified, ensure the given exception is raised.
        """
        arguments = arguments.replace('\\', '\\\\')
        path, *args = shlex.split(arguments)
        old_args = sys.argv
        warnings.simplefilter("ignore", ResourceWarning)
        try:
            sys.argv = [path] + args
            try:
                if '__main__' in sys.modules:
                    del sys.modules['__main__']
                with redirect_stdout(StringIO()) as output:
                    with redirect_stderr(output):
                        SourceFileLoader('__main__', path).load_module()
            except raises:
                return output.getvalue()
            except SystemExit as e:
                if e.args != (0,):
                    raise SystemExit(output.getvalue()) from e
            finally:
                sys.modules.pop('__main__', None)
            if raises is not DummyException:
                raise AssertionError("{} not raised".format(raises))
            return output.getvalue()
        finally:
            sys.argv = old_args
    
    
    @contextmanager
    def make_file(contents=None):
        """Context manager providing name of a file containing given contents."""
        with NamedTemporaryFile(mode='wt', encoding='utf-8', delete=False) as f:
            if contents:
                f.write(contents)
        try:
            yield f.name
        finally:
            os.remove(f.name)
    
    
    if __name__ == "__main__":
        unittest.main(verbosity=2)
    
