Fee

Fees at RWTH Aachen University

Monthly Expenditure (Minimum Costs per Month)
Accommodation: € 160
Food and general expenses: € 350
Health insurance: approx. € 80
University fees (social security contribution, semester ticket; approx. € 200 per semester): approx. € 38 per month
Books and materials: € 70 (may be significantly higher, depending on the subject studied)
Total: approx. € 700 per month (€ 8,400 per year)

2 PEPs 252 and 253: Type and Class Changes

The largest and most far-reaching changes in Python 2.2 are to Python’s model of objects and classes. The changes should be backward compatible, so it’s likely that your code will continue to run unchanged, but the changes provide some amazing new capabilities. Before beginning this, the longest and most complicated section of this article, I’ll provide an overview of the changes and offer some comments.

A long time ago I wrote a Web page (http://www.amk.ca/python/writing/warts.html) listing flaws in Python’s design. One of the most significant flaws was that it’s impossible to subclass Python types implemented in C. In particular, it’s not possible to subclass built-in types, so you can’t just subclass, say, lists in order to add a single useful method to them. The UserList module provides a class that supports all of the methods of lists and that can be subclassed further, but there’s lots of C code that expects a regular Python list and won’t accept a UserList instance.

Python 2.2 fixes this, and in the process adds some exciting new capabilities. A brief summary:

  • You can subclass built-in types such as lists and even integers, and your subclasses should work in every place that requires the original type (a short sketch follows this list).
  • It’s now possible to define static and class methods, in addition to the instance methods available in previous versions of Python.
  • It’s also possible to automatically call methods on accessing or setting an instance attribute by using a new mechanism called properties. Many uses of __getattr__ can be rewritten to use properties instead, making the resulting code simpler and faster. As a small side benefit, attributes can now have docstrings, too.
  • The list of legal attributes for an instance can be limited to a particular set using slots, making it possible to safeguard against typos and perhaps make more optimizations possible in future versions of Python.
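
For instance, here is a minimal sketch (not part of the original article) of subclassing the built-in list type to add one extra method; the subclass can still be passed to code that expects a plain list:

class Stack(list):
    # list subclass with one extra convenience method
    def top(self):
        return self[-1]

s = Stack([1, 2, 3])
s.append(4)
print s.top()        # prints 4; len(s), slicing, etc. all work as for a list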

Some users have voiced concern about all these changes. Sure, they say, the new features are neat and lend themselves to all sorts of tricks that weren’t possible in previous versions of Python, but they also make the language more complicated. Some people have said that they’ve always recommended Python for its simplicity, and feel that its simplicity is being lost.

Personally, I think there’s no need to worry. Many of the new features are quite esoteric, and you can write a lot of Python code without ever needing to be aware of them. Writing a simple class is no more difficult than it ever was, so you don’t need to bother learning or teaching them unless they’re actually needed. Some very complicated tasks that were previously only possible from C will now be possible in pure Python, and to my mind that’s all for the better.

I’m not going to attempt to cover every single corner case and small change that were required to make the new features work. Instead this section will paint only the broad strokes. See section 2.5, “Related Links”, for further sources of information about Python 2.2’s new object model.

2.1 Old and New Classes

First, you should know that Python 2.2 really has two kinds of classes: classic or old-style classes, and new-style classes. The old-style class model is exactly the same as the class model in earlier versions of Python. All the new features described in this section apply only to new-style classes. This divergence isn’t intended to last forever; eventually old-style classes will be dropped, possibly in Python 3.0.

So how do you define a new-style class? You do it by subclassing an existing new-style class. Most of Python’s built-in types, such as integers, lists, dictionaries, and even files, are new-style classes now. A new-style class named object, the base class for all built-in types, has also been added so if no built-in type is suitable, you can just subclass object:

class C(object):
    def __init__ (self):
        ...
    ...

This means that class statements that don’t have any base classes are always classic classes in Python 2.2. (Actually you can also change this by setting a module-level variable named __metaclass__ — see PEP 253 for the details — but it’s easier to just subclass object.)
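
A minimal sketch (not from the original article) of the __metaclass__ alternative mentioned above; setting the module-level variable to type makes base-class-less class statements new-style:

__metaclass__ = type      # module-level default metaclass

class Plain:              # no explicit base class, yet Plain is new-style
    pass

print type(Plain)         # prints <type 'type'> instead of <type 'classobj'>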

The type objects for the built-in types are available as built-ins, named using a clever trick. Python has always had built-in functions named int(), float(), and str(). In 2.2, they aren’t functions any more, but type objects that behave as factories when called.

>>> int
<type 'int'>
>>> int('123')
123

To make the set of types complete, new type objects such as dict and file have been added. Here’s a more interesting example, adding a lock() method to file objects:

class LockableFile(file):
    def lock (self, operation, length=0, start=0, whence=0):
        import fcntl
        return fcntl.lockf(self.fileno(), operation,
                           length, start, whence)

The now-obsolete posixfile module contained a class that emulated all of a file object’s methods and also added a lock() method, but this class couldn’t be passed to internal functions that expected a built-in file, something which is possible with our new LockableFile.
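
A brief usage sketch (assuming a Unix platform, since fcntl is POSIX-only, and a hypothetical file name):

import fcntl

f = LockableFile('data.txt', 'r+')   # behaves like an ordinary built-in file
f.lock(fcntl.LOCK_EX)                # take an exclusive lock
f.write('updated contents\n')
f.lock(fcntl.LOCK_UN)                # release the lock
f.close()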

2.2 Descriptors

In previous versions of Python, there was no consistent way to discover what attributes and methods were supported by an object. There were some informal conventions, such as defining __members__ and __methods__ attributes that were lists of names, but often the author of an extension type or a class wouldn’t bother to define them. You could fall back on inspecting the __dict__ of an object, but when class inheritance or an arbitrary __getattr__ hook were in use this could still be inaccurate.

The one big idea underlying the new class model is that an API for describing the attributes of an object using descriptors has been formalized. Descriptors specify the value of an attribute, stating whether it’s a method or a field. With the descriptor API, static methods and class methods become possible, as well as more exotic constructs.

Attribute descriptors are objects that live inside class objects, and have a few attributes of their own:

  • __name__ is the attribute’s name.
  • __doc__ is the attribute’s docstring.
  • __get__(object) is a method that retrieves the attribute value from object.
  • __set__(object, value) sets the attribute on object to value.
  • __delete__(object) deletes the attribute of object.

For example, when you write obj.x, the steps that Python actually performs are:

descriptor = obj.__class__.x
descriptor.__get__(obj)

For methods, descriptor.__get__ returns a temporary object that’s callable, and wraps up the instance and the method to be called on it. This is also why static methods and class methods are now possible; they have descriptors that wrap up just the method, or the method and the class. As a brief explanation of these new kinds of methods, static methods aren’t passed the instance, and therefore resemble regular functions. Class methods are passed the class of the object, but not the object itself. Static and class methods are defined like this:

class C(object):
    def f(arg1, arg2):
        ...
    f = staticmethod(f)

    def g(cls, arg1, arg2):
        ...
    g = classmethod(g)

The staticmethod() function takes the function f, and returns it wrapped up in a descriptor so it can be stored in the class object. You might expect there to be special syntax for creating such methods (def static f(), defstatic f(), or something like that) but no such syntax has been defined yet; that’s been left for future versions of Python.
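
Assuming f and g are given real bodies, the calls would look like this (a small sketch, not from the original article); note that no instance is passed to f, and g receives the class itself:

C.f(1, 2)       # static method: called straight off the class, no self
C().f(1, 2)     # calling through an instance works too; still no self
C.g(1, 2)       # class method: Python supplies C as the cls argument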

More new features, such as slots and properties, are also implemented as new kinds of descriptors, and it’s not difficult to write a descriptor class that does something novel. For example, it would be possible to write a descriptor class that made it possible to write Eiffel-style preconditions and postconditions for a method. A class that used this feature might be defined like this:

from eiffel import eiffelmethod

class C(object):
    def f(self, arg1, arg2):
        # The actual function
        ...
    def pre_f(self):
        # Check preconditions
        ...
    def post_f(self):
        # Check postconditions
        ...

    f = eiffelmethod(f, pre_f, post_f)

Note that a person using the new eiffelmethod() doesn’t have to understand anything about descriptors. This is why I think the new features don’t increase the basic complexity of the language. There will be a few wizards who need to know about it in order to write eiffelmethod() or the ZODB or whatever, but most users will just write code on top of the resulting libraries and ignore the implementation details.
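
For the curious, here is a minimal sketch of what such an eiffelmethod descriptor might look like; this is a hypothetical implementation, since the article only shows the usage side:

class eiffelmethod(object):
    # A descriptor wrapping a method with optional pre/postcondition checks.
    def __init__(self, func, pre=None, post=None):
        self.func = func
        self.pre = pre
        self.post = post
    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        def checked(*args, **kwargs):
            if self.pre is not None:
                self.pre(obj)                        # check preconditions
            result = self.func(obj, *args, **kwargs)
            if self.post is not None:
                self.post(obj)                       # check postconditions
            return result
        return checked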

2.3 Multiple Inheritance: The Diamond Rule

Multiple inheritance has also been made more useful through changing the rules under which names are resolved. Consider this set of classes (diagram taken from PEP 253 by Guido van Rossum):

                class A:
                  ^ ^  def save(self): ...
                 /   \
                /     \
               /       \
              /         \
          class B     class C:
              ^         ^  def save(self): ...
               \       /
                \     /
                 \   /
                  \ /
                class D

The lookup rule for classic classes is simple but not very smart; the base classes are searched depth-first, going from left to right. A reference to D.save will search the classes D, B, and then A, where save() would be found and returned. C.save() would never be found at all. This is bad, because if C’s save() method is saving some internal state specific to C, not calling it will result in that state never getting saved.

New-style classes follow a different algorithm that’s a bit more complicated to explain, but does the right thing in this situation. (Note that Python 2.3 changes this algorithm to one that produces the same results in most cases, but produces more useful results for really complicated inheritance graphs.)

  1. List all the base classes, following the classic lookup rule and include a class multiple times if it’s visited repeatedly. In the above example, the list of visited classes is [D, B, A, C, A].
  2. Scan the list for duplicated classes. If any are found, remove all but one occurrence, leaving the last one in the list. In the above example, the list becomes [D, B, C, A] after dropping duplicates.

Following this rule, referring to D.save() will return C.save(), which is the behaviour we’re after. This lookup rule is the same as the one followed by Common Lisp. A new built-in function, super(), provides a way to get at a class’s superclasses without having to reimplement Python’s algorithm. The most commonly used form will be super(class, obj), which returns a bound superclass object (not the actual class object). This form will be used in methods to call a method in the superclass; for example, D’s save() method would look like this:

class D(B, C):
    def save (self):
        # Call superclass .save()
        super(D, self).save()
        # Save D's private information here
        ...

super() can also return unbound superclass objects when called as super(class) or super(class1, class2), but this probably won’t often be useful.
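
To see the new lookup rule in action, here is a small interactive sketch (not from the original article) with the diamond rebuilt from new-style classes:

>>> class A(object):
...     def save(self): print 'A.save'
...
>>> class B(A):
...     pass
...
>>> class C(A):
...     def save(self): print 'C.save'
...
>>> class D(B, C):
...     pass
...
>>> D().save()
C.save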

2.4 Attribute Access

A fair number of sophisticated Python classes define hooks for attribute access using __getattr__; most commonly this is done for convenience, to make code more readable by automatically mapping an attribute access such as obj.parent into a method call such as obj.get_parent(). Python 2.2 adds some new ways of controlling attribute access.

First, __getattr__(attr_name) is still supported by new-style classes, and nothing about it has changed. As before, it will be called when an attempt is made to access obj.foo and no attribute named “foo” is found in the instance’s dictionary.

New-style classes also support a new method, __getattribute__(attr_name). The difference between the two methods is that __getattribute__ is always called whenever any attribute is accessed, while the old __getattr__ is only called if “foo” isn’t found in the instance’s dictionary.
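
A small sketch (with an illustrative class name, not from the article) that makes the difference visible:

class Traced(object):
    def __getattribute__(self, name):
        # runs on every attribute lookup
        print '__getattribute__:', name
        return object.__getattribute__(self, name)
    def __getattr__(self, name):
        # runs only when the normal lookup fails
        print '__getattr__ (missing):', name
        return None

Accessing an existing attribute triggers only __getattribute__; accessing a missing one triggers __getattribute__ first and then __getattr__.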

However, Python 2.2’s support for properties will often be a simpler way to trap attribute references. Writing a __getattr__ method is complicated because to avoid recursion you can’t use regular attribute accesses inside them, and instead have to mess around with the contents of __dict__. __getattr__ methods also end up being called by Python when it checks for other methods such as __repr__ or __coerce__, and so have to be written with this in mind. Finally, calling a function on every attribute access results in a sizable performance loss.

property is a new built-in type that packages up three functions that get, set, or delete an attribute, and a docstring. For example, if you want to define a size attribute that’s computed, but also settable, you could write:

class C(object):
    def get_size (self):
        result = ... computation ...
        return result
    def set_size (self, size):
        ... compute something based on the size
        and set internal state appropriately ...

    # Define a property.  The 'delete this attribute'
    # method is defined as None, so the attribute
    # can't be deleted.
    size = property(get_size, set_size,
                    None,
                    "Storage size of this instance")

That is certainly clearer and easier to write than a pair of __getattr__/__setattr__ methods that check for the size attribute and handle it specially while retrieving all other attributes from the instance’s __dict__. Accesses to size are also the only ones which have to perform the work of calling a function, so references to other attributes run at their usual speed.

Finally, it’s possible to constrain the list of attributes that can be referenced on an object using the new __slots__ class attribute. Python objects are usually very dynamic; at any time it’s possible to define a new attribute on an instance by just doing obj.new_attr=1. A new-style class can define a class attribute named __slots__ to limit the legal attributes to a particular set of names. An example will make this clear:

>>> class C(object):
...     __slots__ = ('template', 'name')
...
>>> obj = C()
>>> print obj.template
None
>>> obj.template = 'Test'
>>> print obj.template
Test
>>> obj.newattr = None
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'C' object has no attribute 'newattr'

Note how you get an AttributeError on the attempt to assign to an attribute not listed in __slots__.


2.5 Related Links

This section has just been a quick overview of the new features, giving enough of an explanation to start you programming, but many details have been simplified or ignored. Where should you go to get a more complete picture?

http://www.python.org/2.2/descrintro.html is a lengthy tutorial introduction to the descriptor features, written by Guido van Rossum. If my description has whetted your appetite, go read this tutorial next, because it goes into much more detail about the new features while still remaining quite easy to read.

Next, there are two relevant PEPs, PEP 252 and PEP 253. PEP 252 is titled “Making Types Look More Like Classes”, and covers the descriptor API. PEP 253 is titled “Subtyping Built-in Types”, and describes the changes to type objects that make it possible to subtype built-in objects. PEP 253 is the more complicated PEP of the two, and at a few points the necessary explanations of types and meta-types may cause your head to explode. Both PEPs were written and implemented by Guido van Rossum, with substantial assistance from the rest of the Zope Corp. team.

Finally, there’s the ultimate authority: the source code. Most of the machinery for the type handling is in Objects/typeobject.c, but you should only resort to it after all other avenues have been exhausted, including posting a question to python-list or python-dev.

excel formatting

Preserve the original input.xls formatting when you open it:

from xlrd import open_workbook

input_wb = open_workbook('input.xls', formatting_info=True)

Create a new workbook based on this template:

from xlutils.copy import copy as copy_workbook

output_wb = copy_workbook(input_wb)

Define some new cell styles:

from xlwt import easyxf

red_background = easyxf("pattern: pattern solid, fore_color red;")
black_with_white_font = easyxf('pattern: pattern solid, fore_color black; font: color-index white, bold on;')

Evaluate and modify your cells:

input_ws = input_wb.sheet_by_name('StackOverflow')
output_ws = output_wb.get_sheet(0)

for rindex in range(0, input_ws.nrows):
    for cindex in range(0, input_ws.ncols):
        input_cell = input_ws.cell(rindex, cindex)
        # the slice keeps the dot, so compare against '.pf'
        if input_cell.value[input_cell.value.rfind('.'):] == '.pf':
            output_ws.write(rindex, cindex, input_cell.value, red_background)
        elif input_cell.value.find('deleted') >= 0:
            output_ws.write(rindex, cindex, input_cell.value, black_with_white_font)
        else:
            pass  # we don't need to modify it

Save your new workbook:

output_wb.save('output.xls')

Using the above example, unmodified cells should have their original formatting intact.

Should you need to alter the cell content AND would like to preserve the original formatting (i.e. NOT use your custom easyxf instance), you may use this snippet:

def changeCell(worksheet, row, col, text):
    """ Changes a worksheet cell text while preserving formatting """
    # Adapted from http://stackoverflow.com/a/7686555/1545769
    previousCell = worksheet._Worksheet__rows.get(row)._Row__cells.get(col)
    worksheet.write(row, col, text)
    newCell = worksheet._Worksheet__rows.get(row)._Row__cells.get(col)
    newCell.xf_idx = previousCell.xf_idx

# ...

changeCell(worksheet_instance, 155, 2, "New Value")

For the comparisons, you can use the string methods find and rfind (the latter searches from the right). They return the index of the position of the substring within the string, or -1 if the substring is not found. That is why the code above tests input_cell.value.find('deleted') >= 0 to determine whether the substring ‘deleted’ exists. For the .pf comparison, I used rfind together with Python slicing.
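
A quick interactive sketch (with made-up strings) showing the pieces used above:

>>> s = 'report.pf'
>>> s.rfind('.')              # index of the last '.', or -1 if there is none
6
>>> s[s.rfind('.'):]          # slice from that index to the end of the string
'.pf'
>>> 'row deleted by admin'.find('deleted')
4
>>> 'untouched row'.find('deleted')
-1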

kaggle airbnb

Data Files

File Name                     Available Formats
countries.csv                 .zip (546 b)
age_gender_bkts.csv           .zip (2.46 kb)
test_users.csv                .zip (1.05 mb)
sessions.csv                  .zip (59.14 mb)
sample_submission_NDF.csv     .zip (478.27 kb)
train_users_2.csv             .zip (4.07 mb)

In this challenge, you are given a list of users along with their demographics, web session records, and some summary statistics. You are asked to predict which country a new user’s first booking destination will be. All the users in this dataset are from the USA.

There are 12 possible outcomes of the destination country: ‘US’, ‘FR’, ‘CA’, ‘GB’, ‘ES’, ‘IT’, ‘PT’, ‘NL’, ‘DE’, ‘AU’, ‘NDF’ (no destination found), and ‘other’. Please note that ‘NDF’ is different from ‘other’: ‘other’ means there was a booking, but to a country not included in the list, while ‘NDF’ means there wasn’t a booking.

The training and test sets are split by dates. In the test set, you will predict all the new users with first activities after 7/1/2014 (note: this is updated on 12/5/15 when the competition restarted). In the sessions dataset, the data only dates back to 1/1/2014, while the users dataset dates back to 2010.

File descriptions

  • train_users.csv – the training set of users
  • test_users.csv – the test set of users
    • id: user id
    • date_account_created: the date of account creation
    • timestamp_first_active: timestamp of the first activity, note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
    • date_first_booking: date of first booking
    • gender
    • age
    • signup_method
    • signup_flow: the page a user came to sign up from
    • language: international language preference
    • affiliate_channel: what kind of paid marketing
    • affiliate_provider: where the marketing comes from, e.g. google, craigslist, other
    • first_affiliate_tracked: what’s the first marketing the user interacted with before signing up
    • signup_app
    • first_device_type
    • first_browser
    • country_destination: this is the target variable you are to predict
  • sessions.csv – web sessions log for users
    • user_id: to be joined with the column ‘id’ in the users table (see the loading sketch after this list)
    • action
    • action_type
    • action_detail
    • device_type
    • secs_elapsed
  • countries.csv – summary statistics of destination countries in this dataset and their locations
  • age_gender_bkts.csv – summary statistics of users’ age group, gender, country of destination
  • sample_submission.csv – correct format for submitting your predictions
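
As a starting point, a minimal sketch (assuming pandas is available and the CSV files listed above have been unzipped into the working directory) for loading the tables and joining the session log onto the users:

import pandas as pd

# Load the main tables (file names from the list above)
train_users = pd.read_csv('train_users_2.csv')
test_users = pd.read_csv('test_users.csv')
sessions = pd.read_csv('sessions.csv')

# Aggregate the session log per user, e.g. total time spent and number of actions
session_stats = sessions.groupby('user_id').agg(
    total_secs=('secs_elapsed', 'sum'),
    n_actions=('action', 'count'),
).reset_index()

# Join onto the users via the 'id' column, as described above
train = train_users.merge(session_stats, left_on='id', right_on='user_id', how='left')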

Rossmann Store Sales, Winner’s Interview: 1st place, Gert Jacobusse

Rossmann operates over 3,000 drug stores in 7 European countries. In their first Kaggle competition, Rossmann Store Sales, this drug store giant challenged Kagglers to forecast 6 weeks of daily sales for 1,115 stores located across Germany. The competition attracted 3,738 data scientists, making it our second most popular competition by participants ever.

Gert Jacobusse, a professional sales forecast consultant, finished in first place using an ensemble of over 20 XGBoost models. Notably, most of the models individually achieve a very competitive (top 3 leaderboard) score. In this blog, Gert shares some of the tricks he’s learned for sales forecasting, as well as wisdom on the why and how of using hold out sets when competing.

The Basics

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

My hobby and daily job is to work on data analysis problems, and I participate in a lot of Kaggle competitions. With my own company Rogatio I deliver tailored sales forecasts for several companies – product specific as well as overall. Therefore I knew how to approach the problem.

Gert's profile on Kaggle

How did you get started competing on Kaggle?

I don’t remember, somehow it has become a part of my life. I enjoy the competitions so much that it is really addictive for me. But in a good way: it is nice exposure for my skills, I learn a lot of new techniques and applications, I get to know other skilled data scientists and if I am lucky I even get paid!

What made you decide to enter this competition?

A sales forecast is a tool that can help almost any company I can think of. Many companies rely on human forecasts that are not of a constant quality. Other companies use a standard tool that is not flexible enough to suit their needs. As an individual researcher I can create a solution that really improves business. And that is exactly what this competition is about. I am very eager to further develop and show my skills – therefore I did not hesitate a moment to enter this competition.

Let’s Get Technical

What preprocessing and supervised learning methods did you use?

The most important preprocessing was the calculation of averages over different time windows. For each day in the sales history, I calculated averages over the last quarter, last half year, last year and last 2 years. Those averages were split out by important features like day of week and promotions. Second, some time indicators were important: not only month and day of year, but also relative indicators like number of days since the summer holidays started. Like most teams, I used extreme gradient boosting (xgboost) as a learning method.
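
A rough sketch (hypothetical feature names, assuming the competition's train.csv layout; not the winner's actual code) of the kind of windowed averages described above, using pandas:

import pandas as pd

# Assumes the competition's train.csv with columns such as
# Store, Date, DayOfWeek, Promo, Sales
df = pd.read_csv('train.csv', parse_dates=['Date'])
df = df.sort_values(['Store', 'DayOfWeek', 'Promo', 'Date'])

# Trailing averages over roughly a quarter, half year, year and two years,
# split out by store, day of week and promotion status
for window in ('91D', '182D', '365D', '730D'):
    grouped = df.groupby(['Store', 'DayOfWeek', 'Promo'], group_keys=False)
    df['AvgSales_' + window] = grouped.apply(
        lambda g: g.rolling(window, on='Date')['Sales'].mean()
    )
# In a real forecasting setup these averages must exclude the period being
# predicted, e.g. by shifting the Sales series before averaging.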

Figure 1 a/b. Illustration of the task: predict sales six weeks ahead, based on historical sales (only last 3 months of train set shown).

What was your most important insight into the data?

The most important insight was that I could reliably predict performance improvements based on a hold out set within the trainset. Because of this insight, I did not overfit the public test set, so my model worked very well on the public test set as well as the unseen private test set that was four weeks further ahead.

Do you always use hold out sets to validate your model in every competition?

Yes, sometimes using cross-validation (with multiple holdout sets) and sometimes with a single holdout set, like I did in this competition. The advantage of a holdout set is that I can use the public test set as a real test set, not a set that gives me feedback to improve my model. As a consequence, I get reliable feedback about how much I overfitted my own holdout set. Therefore, I do not like competitions where the train/test split is non-random while the public/private split is random: in such competitions, you can build a better model by using feedback from the public leaderboard. I do not like that because I am not aware of any real-life problem that would require such an approach. This competition was ideal for me: the train/test split was time based, and so was the public/private split!

Do you have any recommendations for selecting data for a hold out set and using it most effectively?

For selecting a hold out set, I always try to imitate the way that the train and test set were split. So, if it is a time split, I split my holdout sample time based; if it is a geographical split by city, I split my holdout set by city; and if it is a random split, then my holdout split will be random as well. You can effectively use a holdout set to push the limit towards how much you can learn from the data without overfitting. Don’t be afraid to overfit your holdout set, the public leaderboard will tell you if you do so.
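
For example, a minimal sketch (assuming the competition's train.csv with a Date column; hypothetical variable names) of imitating a time-based train/test split when carving out a holdout set:

import pandas as pd

# The real test set lies entirely after the training period, so the holdout
# is the last six weeks of the training data rather than a random sample of rows
train = pd.read_csv('train.csv', parse_dates=['Date'])
cutoff = train['Date'].max() - pd.Timedelta(weeks=6)

fit_part = train[train['Date'] <= cutoff]
holdout_part = train[train['Date'] > cutoff]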

Were you surprised by any of your findings?

Yes, I was surprised that a model without the most recent month of data (that I used to predict sales further ahead) did almost as well as a model that did include recent data. This finding is very specific for the Rossmann data, and it means that short term changes are less important than they often are in forecasting.


Which tools did you use?

For preprocessing I loaded the data into an SQL database. For creating features and applying models, I used Python.

How did you spend your time on this competition?

I spent 50% on feature engineering, 40% on feature selection plus model ensembling, and less than 10% on model selection and tuning.

What was the run time for both training and prediction of your winning solution?

The winning solution consists of over 20 xgboost models that each need about two hours to train when running three models in parallel on my laptop. So I think it could be done within 24 hours. Most of the models individually achieve a very competitive (top 3 leaderboard) score.

Figure 3. A time indicator for the time until store refurbishment (last four days on the right of the plot) reveals how the sales are expected to change during the weeks before a refurbishment.

Words of Wisdom

What have you taken away from this competition?

More experience in sales forecasts and a very solid proof of my skills. Plus a nice extra turnover of $15,000 that I had not forecasted.

Do you have any advice for those just getting started in data science?

  1. make sure that you understand the principles of cross validation, overfitting and leakage
  2. spend your time on feature engineering instead of model tuning
  3. visualize your data every now and then

Just for Fun

If you could run a Kaggle competition, what problem would you want to pose to other Kagglers?

You have proven to be very good at creating competitions, I don’t have an idea to improve on that right now 😉 But I have the opportunity so let me share one idea for improvement: to create good models and anticipate the kind of error that can be expected, I often miss explicit information on how the train/test and public/private sets are being split. A competition is (even) more fun for me when I don’t have to guess at what types of mechanisms impact model performance.

What is your dream job?

Work for a variety of customers – and help them with data challenges that are central to the success of their business. And have enough spare time to participate in Kaggle competitions!

linux command

Copy or move file

sudo cp -r <source> <destination>

Ex:  cp -r home/ayu/xyz/. /tmp

sudo mv <source> <destination>

Ex: mv drupal_commons/ xyz/

Rename file

mv <old file name> <new file name>

 

File Permission

chmod

The chmod command is used to change the permissions of a file or directory. To use it, you specify the desired permission settings and the file or files that you wish to modify. There are two ways to specify the permissions, but I am only going to teach one way.

It is easy to think of the permission settings as a series of bits (which is how the computer thinks about them). Here’s how it works:

rwx rwx rwx = 111 111 111
rw- rw- rw- = 110 110 110
rwx --- --- = 111 000 000

and so on...

rwx = 111 in binary = 7
rw- = 110 in binary = 6
r-x = 101 in binary = 5
r-- = 100 in binary = 4

Ex: sudo chmod 600 some_file

Here is a table of numbers that covers all the common settings. The ones beginning with “7” are used with programs (since they enable execution) and the rest are for other kinds of files.

Value Meaning
777 (rwxrwxrwx) No restrictions on permissions. Anybody may do anything. Generally not a desirable setting.
755 (rwxr-xr-x) The file’s owner may read, write, and execute the file. All others may read and execute the file. This setting is common for programs that are used by all users.
700 (rwx------) The file’s owner may read, write, and execute the file. Nobody else has any rights. This setting is useful for programs that only the owner may use and must be kept private from others.
666 (rw-rw-rw-) All users may read and write the file.
644 (rw-r--r--) The owner may read and write a file, while all others may only read the file. A common setting for data files that everybody may read, but only the owner may change.
600 (rw-------) The owner may read and write a file. All others have no rights. A common setting for data files that the owner wants to keep private.
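
The same octal values can be applied from Python as well (a small sketch; some_file is the example file used above):

import os
import stat

# 0o600 = rw------- : the owner may read and write, nobody else has access
os.chmod('some_file', 0o600)

# the same permission spelled with symbolic constants
os.chmod('some_file', stat.S_IRUSR | stat.S_IWUSR)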